Interspeech 2017 flashback and 2018 expectations

Last Interspeech at a glance

Interspeech is the world’s largest and most comprehensive conference on the science and technology of spoken language processing.

Interspeech 2017 gathered around 2000 participants in Stockholm, Sweden, exceeding the expected capacity of the conference. There were lots of great people to meet and to listen to: Hideki Kawahara, Simon King, Jan Chorowski, Tara N. Sainath and many, many others. The paper acceptance rate was, as usual, rather high at 51%. ICASSP 2017 had a similar number, while other ML-related conferences keep this metric closer to 20–30%.

Most of the works can be classified into one of the following groups:

  • Speech recognition (in various contexts).
  • TTS (mono- and multilingual).
  • Voice conversion.
  • Speaker verification spoofing.
  • Acoustic models.
  • Language modeling.
  • Speech and medicine.
  • Low-resourced languages.
  • Multimodal systems.
  • Dialogue systems.
  • Emotion recognition.

Unfortunately, poster and oral sessions overlapped, so occasionally you had to choose between two interesting events happening at the same time, or train your quick-walking skills travelling between Aula Magna and the main building. Still, it was pleasant to observe how recent hot trends from the ML community are carefully making their way into the speech processing community: GANs for speech enhancement and voice conversion, WaveNet-like architectures in non-TTS settings, the RNN transducer, and so on. It was also interesting that, for acoustic modeling, going deeper is not always the solution. This is a notable difference from image-related problems, where adding more layers often yields a better model.

Still, some researchers have not abandoned old-school approaches and keep inventing new ways to apply DTW (dynamic time warping) to various tasks.

Some notable papers:

  • Michael McAuliffe. Montreal Forced Aligner: trainable text-speech alignment using Kaldi. Also mentions other aligners (and corpora) of potential interest.
  • Loweimi et al. Robust Source-filter Separation of Speech Signal in the Phase Domain. See also their Interspeech 2011 experiments on windows for phase processing: Phase-only speech reconstruction using very short frames.
  • H. Hadian. Duration modeling for LVCSR using neural networks. The authors project duration into a discrete domain and train with cross-entropy.
  • Oliver Siohan. CTC training of multi-phone acoustic models for speech recognition. Introduces M-phones, variable-length sequences of phones, for acoustic modeling. The model is capable of capturing coarticulation effects.
  • Sibo Tong. An Investigation of Deep Neural Networks for Multilingual Speech Recognition Training and Adaptation. The author compares a shared-layers architecture with a recognizer trained on IPA targets, and also shows how speaker adaptation techniques can be applied to language adaptation.
  • Alexander Gutkin. Uniform Multilingual Multi-Speaker Acoustic Model for Statistical Parametric Speech Synthesis of Low-Resourced Languages. TTS over segments of multiple IPA phones, trained on multiple languages; features and segments are based on relations between languages.
  • Van Hai Do. Multitask learning using mismatched transcription. Shows how to transcribe recordings of Georgian if you only have Chinese transcribers.
  • Jeong-Uk Bang. Improving Speech Recognizers by Refining Broadcast Data with Inaccurate Subtitle Timestamps. Extraction of aligned data from video with subtitles.
  • Sercan Arik. Convolutional Recurrent Neural Networks for Small-Footprint Keyword Spotting. Next iteration of small-footprint keyword spotting model.
  • Chin-Cheng Hsu. Voice conversion from unaligned corpora using VAE WGAN. A GAN-based search for optimal STRAIGHT vocoder parameters.
  • Saurabhchand Bhati. Unsupervised Speech Signal to Symbol Transformation for Zero Resource Speech Applications. Approach to unsupervised speech transcription based on virtual phones and words. State-of-the-art accuracy on spoken term detection task.
  • K.M. Knill. Use of Graphemic Lexicons for Spoken Language Assessment. Speech assessment based on interphonemic distances.
  • Wei Li. Improving Mispronunciation Detection for Non-Native Learners with Multisource Information and LSTM-Based Deep Models. Mispronunciation detection in Mandarin, based on the iCALL corpus.

What to expect from Interspeech 2018?

Apollo dataset

One of the key expectations is the release of 16k hours of transcribed audio from the Apollo missions into the public domain. The vocabulary size probably won’t be huge, yet the amount of recordings is on par with what is used in modern commercial ASR systems. The biggest publicly available corpus so far has been LibriSpeech, with circa 1000 hours of read English. You could already train a decent ASR system with that data, so what would be possible with 16 times more speech?

Forenoon tutorials

This year we will focus on two tutorials. The forenoon one is End-to-end models in ASR by Rohit Prabhavalkar and Tara Sainath. End-to-end models have become a hot topic in recent years.

A common feature of all of these models is that they are composed of a single neural network which, when given input acoustic frames, directly outputs a probability distribution over graphemes or word hypotheses. In fact, as has been demonstrated in recent work, such end-to-end models can surpass the performance of conventional ASR systems.

The tutorial will cover the historical development of end-to-end approaches and describe the similarities and differences between popular models.
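
To make the phrase “a single neural network that maps acoustic frames to a distribution over graphemes” a bit more concrete, here is a minimal CTC-style sketch in PyTorch. It is only an illustration: the feature dimension, grapheme inventory and model sizes are made-up placeholders and none of this comes from the tutorial itself.

```python
import torch
import torch.nn as nn

class TinyEndToEndASR(nn.Module):
    """Toy end-to-end model: acoustic frames in, per-frame grapheme distribution out."""

    def __init__(self, n_feats=80, hidden=256, n_graphemes=29):
        # n_graphemes is hypothetical: e.g. 26 letters + space + apostrophe + CTC blank (index 0)
        super().__init__()
        self.encoder = nn.LSTM(n_feats, hidden, num_layers=2,
                               bidirectional=True, batch_first=True)
        self.to_graphemes = nn.Linear(2 * hidden, n_graphemes)

    def forward(self, frames):                 # frames: (batch, time, n_feats)
        encoded, _ = self.encoder(frames)      # (batch, time, 2 * hidden)
        logits = self.to_graphemes(encoded)    # (batch, time, n_graphemes)
        return logits.log_softmax(dim=-1)      # log-probabilities over graphemes


# Training step with CTC loss: the frame-to-grapheme alignment is learned implicitly.
model = TinyEndToEndASR()
frames = torch.randn(4, 200, 80)                      # 4 dummy utterances, 200 frames each
log_probs = model(frames).transpose(0, 1)             # CTCLoss expects (time, batch, classes)
targets = torch.randint(1, 29, (4, 20))               # dummy grapheme targets (0 = blank)
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           input_lengths=torch.full((4,), 200, dtype=torch.long),
                           target_lengths=torch.full((4,), 20, dtype=torch.long))
loss.backward()
```

The point of the sketch is that there is no separate pronunciation lexicon, alignment stage or language model: the CTC loss lets the network learn the frame-to-grapheme alignment on its own, which is the property that distinguishes end-to-end models from conventional hybrid systems.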

Another notable tutorial that unfortunately is going to be at the same time is Spoken dialog technology for education domain applications by Vikram Ramanarayanan, Keelan Evanini and David Suendermann-Oeft.

… tutorial will cover the state of the art in dialog technologies for educational domain applications, with a particular focus on language learning and assessment. This will include an introduction to the various components of spoken dialog systems and how they can be applied to develop conversational applications in the educational domain, as well as some advanced topics such as methods for speech scoring.

The practical part of the tutorial will be built around the HALEF platform and OpenVXML. A brief search suggests that the spoken part is based on Kaldi. Conventional ASR systems often struggle when used for non-native speech assessment, so it will be interesting to see whether the authors use some custom model or apply a standard recognizer.

Afternoon tutorials

In the afternoon session, we are planning to attend the Information theory of deep learning tutorial by Naftali Tishby. It should be a really exciting talk from a theoretical point of view, as deep learning theory is still underdeveloped. Works on this topic usually rely on rather severe simplifications, such as removing the non-linearities between layers.

The tutorial on Articulatory representations by Carol Espy-Wilson and Mark Tiede should also be of interest, especially for e-learning applications. Manner and place of articulation are essential properties describing how phonemes are pronounced. Accurate estimation of these features should make ASR output less of a black box compared to bare probability distributions over phones.

Special sessions

Special sessions this year will cover various topics, from paralinguistics to speech recognition for Indian languages.

Depending on the schedule, we’ll try to get to the following three:

  • Deep neural networks: How can we interpret what they learned? It should be a nice session to check out after the information theory tutorial.
  • Low resource speech recognition challenge for Indian languages. Being low on data is common for anyone working with languages outside of the mainstream, so any tips and tricks would be really valuable.
  • Spoken CALL shared task, second edition. The core event for sampling approaches to language learning.

Technical program

There will be hundreds of papers presented, and it is impossible to cover all of them; there is also a lot of overlap between sections, especially on day 2. We will try to focus on the following ones:

  • Show and tell.
  • End-to-end speech recognition.
  • Voice conversion.
  • Models of speech perception.
  • Acoustic model adaptation.
  • Novel NN architectures for acoustic modeling.
  • Measuring pitch and articulation.
  • Spoken corpora annotation.
  • Articulatory information, modeling and inversion.
  • Source separation from monaural input.
  • Speech enhancement.
  • Language identification.
  • Speaker diarization.
  • Computer-assisted language learning (CALL).

We look forward to seeing you soon at Interspeech 2018 in Hyderabad!