Our team recently attended the 25th Interspeech Conference, held from September 1st to 5th on Kos Island, Greece. This year’s theme, "Speech and Beyond," highlighted new developments in speech technology, focusing on areas like healthcare diagnostics, virtual assistants, and even animal sound recognition. It was a great opportunity for experts worldwide to share their work and discuss the latest trends. Here are some of the key topics and insights we gathered from the event.
A major topic at the conference was the use of Large Language Models (LLMs) to improve Automatic Speech Recognition (ASR) and Spoken Language Understanding (SLU) systems. Unlike traditional ASR models, LLMs are trained on large amounts of text data, which helps them better capture the context of spoken words. This makes them particularly useful for improving recognition in domains where acoustic training data is scarce or missing.
One research team (Jinlong Xue, Yayue Deng, Yicheng Han, Yingming Gao, and Ya Li) showed how LLMs can be used to correct errors in ASR output, while others explored new methods for real-time speech recognition. There is also a growing trend of injecting additional text data into acoustic model training, which improves recognition of less common words and phrases and raises overall ASR performance.
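To make the idea of LLM-based post-editing concrete, here is a minimal sketch of how an LLM can be asked to correct an ASR hypothesis. It assumes the `openai` Python package and an API key; the model name, prompt, and the `correct_asr_hypothesis` helper are illustrative assumptions, not the method from the paper.

```python
# Minimal sketch: post-editing an ASR hypothesis with an LLM.
# Assumes the `openai` package and OPENAI_API_KEY are available;
# the model name and prompt are placeholders, not taken from the paper.
from openai import OpenAI

client = OpenAI()

def correct_asr_hypothesis(hypothesis: str, domain_hint: str = "general") -> str:
    """Ask an LLM to fix likely recognition errors in an ASR transcript."""
    prompt = (
        f"The following is an automatic speech recognition transcript from the "
        f"{domain_hint} domain. Correct likely recognition errors (misheard words, "
        f"missing punctuation) without changing the meaning. Return only the text.\n\n"
        f"{hypothesis}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model; any instruction-tuned LLM works
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # deterministic output suits post-editing
    )
    return response.choices[0].message.content.strip()

print(correct_asr_hypothesis("the patient was prescribed met forming twice daily", "medical"))
```

Because the LLM has seen far more domain text than the acoustic model, it can often recover terms like "metformin" that the recognizer misheard.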
To learn more about how to create your own large language model, check out our step-by-step guide.
The Whisper model and its updated versions were among the key topics at the conference. Research focused on improving Whisper's performance: making text alignment more accurate, reducing errors, and detecting pauses in speech more reliably. One example is CrisperWhisper, a modified version originally designed for medical diagnostics. It offers more precise word-level timing and fewer mistakes, making it a strong alternative to WhisperX for uses such as legal transcription, where high accuracy is essential.
The conference also highlighted several Whisper-like models trained on open datasets, which could make these advanced tools available for a wider range of applications. This growing ecosystem of Whisper-based models shows promise for improving accessibility and adapting to different industries and languages.
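For readers who want to see the kind of word-level timing these models refine, here is a minimal sketch using the open-source `openai-whisper` package. It is not the CrisperWhisper implementation; the checkpoint name and audio file are placeholders.

```python
# Minimal sketch: word-level timestamps with the open-source `openai-whisper` package.
# Illustrates the timing output that CrisperWhisper and WhisperX aim to improve.
import whisper

model = whisper.load_model("base")  # small checkpoint, chosen only for demonstration
result = model.transcribe("interview.wav", word_timestamps=True)

for segment in result["segments"]:
    for word in segment.get("words", []):
        # Each word carries start/end times in seconds
        print(f"{word['start']:6.2f}s - {word['end']:6.2f}s  {word['word'].strip()}")
```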
As synthetic voice technology improves, detecting fake audio, known as deepfakes, is becoming more important. Several papers presented new methods for spotting these manipulations: some researchers use advanced models to identify subtle differences between real and synthetic voices, even when the fake is very convincing.
These developments are crucial in areas like security, media, and entertainment, where it's important to trust that audio is genuine and that voice data stays secure.
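As a rough illustration of what such a detector looks like in practice, the sketch below scores an utterance as genuine or spoofed from its log-mel spectrogram. The tiny CNN architecture, feature choice, and class labels are our own assumptions for demonstration, not a method presented at the conference.

```python
# Illustrative sketch of a spoof-detection classifier over log-mel features.
# The architecture is deliberately tiny; real anti-spoofing systems are far larger.
import torch
import torch.nn as nn
import torchaudio

class SpoofDetector(nn.Module):
    """Tiny CNN that scores an utterance as bonafide (0) or spoofed (1)."""
    def __init__(self, n_mels: int = 80, sample_rate: int = 16000):
        super().__init__()
        self.mel = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate, n_mels=n_mels)
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, 2)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) -> log-mel spectrogram (batch, 1, n_mels, frames)
        log_mel = torch.log(self.mel(waveform) + 1e-6).unsqueeze(1)
        features = self.encoder(log_mel).flatten(1)
        return self.classifier(features)  # logits for [bonafide, spoofed]

# Usage: score one second of dummy audio at 16 kHz
detector = SpoofDetector()
logits = detector(torch.randn(1, 16000))
print(torch.softmax(logits, dim=-1))
```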
In the healthcare field, a major update was the release of a new dataset for dysarthric speech by Mark Hasegawa-Johnson, the creator of the UASpeech corpus. This dataset is designed to help improve speech recognition for people with severe speech impairments and will be used in a research challenge this November to test new models and approaches.
Google continues to develop its project, previously known as Euphonia, which aims to improve speech recognition for individuals with speech disorders. The team presented new techniques that make their models better at understanding and transcribing slurred or irregular speech, benefiting users with conditions such as cerebral palsy or ALS.
These advancements are essential for developing technology that can accurately recognize and respond to the unique speech patterns of people with dysarthria, helping them communicate more effectively.
ASR and speech technology in e-learning were also well covered at the conference. Many of the techniques presented were similar to ones we developed five years ago, such as using phonological features for mispronunciation detection. One paper describes a wav2vec2 model with a modified CTC loss, whereas we used a transformer model for similar tasks.
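For a sense of how a wav2vec2 CTC model can be used for simple pronunciation feedback, here is a heavily simplified sketch. It compares the recognized text against the expected prompt with edit distance, rather than using phonological features or the modified CTC loss from the paper; the checkpoint, file name, and `pronunciation_score` helper are illustrative assumptions.

```python
# Simplified sketch of CTC-based pronunciation scoring with a wav2vec2 model.
# Not the paper's phonological-feature approach; this is a coarse text-level proxy.
import difflib
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

def pronunciation_score(audio_path: str, expected_text: str) -> float:
    """Return a rough 0..1 similarity between what was said and what was expected."""
    waveform, sample_rate = torchaudio.load(audio_path)
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
    audio = waveform.mean(dim=0).numpy()  # mix down to mono
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    hypothesis = processor.batch_decode(predicted_ids)[0]
    # Character-level similarity as a crude proxy for pronunciation quality
    return difflib.SequenceMatcher(None, hypothesis.lower(), expected_text.lower()).ratio()

print(pronunciation_score("learner_attempt.wav", "the quick brown fox"))
```

A production system would score individual phonemes rather than whole utterances, but the same CTC decoding sits at the core of both.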
The conference offered valuable insights into the latest trends and future directions in speech technology. We look forward to applying these ideas in our current projects and working with the research community to explore new possibilities.
We plan to use LLMs to improve ASR accuracy, adopt Whisper model enhancements for specialized transcription, and refine our speech recognition capabilities for individuals with speech disorders. For e-learning, we're exploring real-time feedback solutions for better pronunciation training. These advancements will elevate our projects and drive progress in the field.
Stay tuned for more in-depth reports on the topics discussed at the event. If you have any questions or would like to talk about any of these findings, feel free to reach out!