
Navigating the Past: How AI Handles Archival Audio and Voice Recognition Challenges

TL;DR: A study using UNESCO’s mid-20th-century radio recordings investigates how well modern language identification (LID) and speaker recognition (SR) tools perform on old, multilingual, and cross-age audio. It finds that LID systems like Whisper V3 are highly effective on accented and multilingual speech. Speaker recognition embeddings, however, prove fragile, showing significant drops in accuracy when identifying speakers across different ages or languages, which highlights key challenges for archival audio processing and speaker indexing.

Old audio recordings held by organizations like the United Nations and UNESCO are incredibly valuable for history and culture. However, accessing these vast archives is often difficult because their descriptions are incomplete. Modern speech processing tools, designed to help identify languages and speakers, face significant hurdles when dealing with these older, often multilingual, and sometimes accented recordings, as well as voices that change over many years.

A recent study, titled “On Barriers to Archival Audio Processing,” by Peter Sullivan and Muhammad Abdul-Mageed from the University of British Columbia, delves into these challenges. Their research used a unique collection of mid-20th-century radio recordings from UNESCO, spanning 1952 to 1980 and covering 20 different languages. The main goal was to see how well current off-the-shelf language identification (LID) and speaker recognition (SR) systems perform on such diverse and aged audio, especially when speakers are multilingual or their voices have changed over time.

For language identification, the researchers tested popular models like Whisper and Massively Multilingual Speech (MMS). Their findings showed that Whisper, particularly its latest version (V3), is remarkably good at handling speech with accents and multiple languages. It achieved high accuracy rates, demonstrating its potential for helping archives categorize their multilingual audio files. In contrast, the MMS model struggled more with accented speech, highlighting the importance of training these systems on a wide variety of audio.

However, the study revealed a more complex picture for speaker recognition. Tools that identify speakers often rely on ‘speaker embeddings,’ which are like unique digital fingerprints of a person’s voice. The research found that these embeddings are quite fragile when dealing with recordings of the same person made years apart (cross-age) or in different languages (cross-lingual). For cross-age comparisons, the similarity between voice embeddings dropped significantly as the time gap between recordings increased, stabilizing after about 10 years. This suggests that a person’s voice changes enough over a decade to make it harder for the system to recognize them consistently.
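To make the idea of “fragile embeddings” concrete, here is a minimal sketch of how speaker verification systems typically compare two embeddings with cosine similarity and a decision threshold. The vectors, dimensions, and the 0.5 threshold below are all illustrative assumptions, not values from the study; real embeddings are usually a few hundred dimensions and thresholds are tuned on held-out verification data.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_speaker(a: np.ndarray, b: np.ndarray, threshold: float = 0.5) -> bool:
    """Declare 'same speaker' if the similarity clears a tuned threshold.

    The 0.5 threshold is purely illustrative; deployed systems calibrate it
    on labeled verification pairs.
    """
    return cosine_similarity(a, b) >= threshold

# Toy 4-dimensional embeddings (hypothetical; real ones are far larger).
young = np.array([0.9, 0.1, 0.3, 0.2])   # a speaker early in the archive
old   = np.array([0.6, 0.4, 0.5, 0.1])   # the same voice, years later: drifted but close
other = np.array([-0.2, 0.8, 0.1, 0.7])  # a different speaker entirely
```

The cross-age effect the study describes corresponds to `cosine_similarity(young, old)` sliding downward as the recording gap grows; once it drifts below the threshold, the system stops recognizing the speaker even though the identity is unchanged.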

Similarly, when a speaker used different languages, their voice embeddings showed a substantial drop in similarity, with a much wider range of results. This indicates that factors like a speaker’s fluency in a second language or the similarity between the languages themselves might affect how their voice is recognized. These findings suggest that while LID systems are becoming more robust, speaker recognition still faces significant biases related to age, language, and even the recording equipment used.


The study concludes that while tools like Whisper V3 offer promising solutions for identifying languages in archival audio, the challenges in consistently identifying speakers across different ages and languages remain. Overcoming these issues is crucial for archives that aim to use these technologies for speaker indexing, which would greatly improve public access to invaluable historical recordings. The research also underscores the importance of using real-world archival audio to test and improve speech processing technologies, helping to uncover and address biases in these systems.

Ananya Rao
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach her at: [email protected]
