
Our Expectations from INTERSPEECH 2019

Published: August 29, 2019
# Speech Processing
# AI / ML

In less than a month, from September 15 to 19, 2019, Graz, Austria will become home to INTERSPEECH, the world's most prominent conference on spoken language processing. The conference unites science and technology under one roof and serves as a platform for over 2,000 participants who will share their insights, listen to eminent speakers, and attend tutorials, challenges, exhibitions, and satellite events.

What do we, as participants and presenters, expect from it?

Keynotes

Tanja Schultz, spokesperson of the “Minds, Media, Machines” research area at the University of Bremen, will talk about biosignal processing for human-machine interaction. Since human interaction involves a wide range of biosignals, from speech and gestures to motion and brain activity, it is crucial to interpret all of them correctly to ensure truly effective human-machine interaction. We look forward to Tanja Schultz describing her work on Silent Speech Interfaces, which rely on articulatory muscle movement to recognize and synthesize silently produced speech, and Brain Computer Interfaces, which use brain activity to recognize speech and convert electrocortical signals into audible speech. Let's move to the new era of brain-to-text and brain-to-speech technology!

Manfred Kaltenbacher of Vienna University of Technology will discuss the physiology and physics of voice production, a talk with a more medical slant than usual. It will also cover current computer simulations for pre-surgical prediction of voice quality and the development of examination and training tools for voice professionals, an interesting departure from the usual technology-oriented talks.

Mirella Lapata, Professor of Natural Language Processing in the School of Informatics at the University of Edinburgh, will talk about learning natural language interfaces with neural models. Back in the realm of technology and AI, the talk will address the structured prediction problem of mapping natural language onto machine-interpretable representations. We believe it will be useful for any NLP specialist to learn more about a neural network-based general modeling framework, the most promising approach of recent years.

Tutorials

There are eight of them and we love them all! The tutorials tackle diverse topics, but they all discuss the most interesting recent developments and breakthroughs.

Two tutorials concern generative adversarial networks (GANs), showing once again the power of this approach. The one we plan to attend is offered by National Taiwan University and Academia Sinica. It is dedicated to GANs for speech signal processing, including speech enhancement, voice conversion, and speech synthesis, as well as sentence generation. Moreover, we can expect real-life GAN algorithms for text style transformation, machine translation, and abstractive summarization without paired data.
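To make the idea concrete, here is a minimal, self-contained sketch of GAN training in PyTorch, with a generator producing spectrogram-like frames and a discriminator telling them from real ones. The network sizes and the stand-in data are our own placeholders, not anything from the tutorial:

```python
# Minimal GAN sketch (not from the tutorial): a generator learns to produce
# spectrogram-like frames that a discriminator cannot tell from real ones.
import torch
import torch.nn as nn

FRAME_DIM, NOISE_DIM = 80, 16  # e.g. 80 mel bins; both sizes are arbitrary here

G = nn.Sequential(nn.Linear(NOISE_DIM, 128), nn.ReLU(), nn.Linear(128, FRAME_DIM))
D = nn.Sequential(nn.Linear(FRAME_DIM, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def real_batch(n=64):
    # Stand-in for real speech features; replace with actual mel frames.
    return torch.randn(n, FRAME_DIM) * 0.5 + 1.0

for step in range(1000):
    real = real_batch()
    fake = G(torch.randn(real.size(0), NOISE_DIM))

    # Discriminator: push real toward 1, fake toward 0.
    d_loss = bce(D(real), torch.ones(real.size(0), 1)) + \
             bce(D(fake.detach()), torch.zeros(real.size(0), 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: make the discriminator label fakes as real.
    g_loss = bce(D(fake), torch.ones(real.size(0), 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

In a real speech enhancement setup, the generator would be conditioned on a noisy input rather than pure noise.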

The second tutorial, by Carnegie Mellon University and Bar-Ilan University, looks at the adversarial side of speech and speaker recognition and other systems. It will discuss whether such systems can be fooled with carefully crafted inputs and how attacks with these “adversarial” inputs can be identified and avoided. Finally, it will cover recent work on introducing “backdoors” into systems through poisoned training examples, such that a system can be triggered into false behavior by specific types of inputs, but not otherwise.
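A classic way to craft such inputs is the fast gradient sign method (FGSM), which nudges the input along the gradient of the loss. Here is a sketch against a placeholder classifier; the model, dimensions, and perturbation budget are illustrative, not taken from the tutorial:

```python
# FGSM sketch: perturb an input in the direction that increases the model's
# loss, so a (placeholder) speech classifier mislabels a near-identical signal.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, 10))  # stand-in
model.eval()

x = torch.randn(1, 100)   # stand-in for an audio feature vector
y = torch.tensor([3])     # its true label
x.requires_grad_(True)

loss = nn.functional.cross_entropy(model(x), y)
loss.backward()

eps = 0.05                # perturbation budget: small enough to be imperceptible
x_adv = (x + eps * x.grad.sign()).detach()

with torch.no_grad():
    print("clean:", model(x).argmax(1).item(),
          "adversarial:", model(x_adv).argmax(1).item())
```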

We are also looking forward to the tutorial on another popular technique in speech processing, the end-to-end approach. From the tutorial by Mitsubishi, Nagoya University, NTT, and Johns Hopkins University, we expect interesting insights into advanced methods for neural end-to-end speech processing, namely unification, integration, and implementation. The tutorial will explore ESPnet, a new open-source end-to-end speech processing toolkit that implements this unified framework and integrates the systems discussed.
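At the heart of that unification is hybrid CTC/attention training: one encoder feeds both a CTC head and an attention decoder, and their losses are interpolated. The sketch below is schematic PyTorch with placeholder tensors, not ESPnet's actual API:

```python
# Hybrid CTC/attention objective (schematic): one encoder feeds both a CTC
# head and an attention decoder; the two losses are mixed with weight lambda.
import torch
import torch.nn as nn

T, B, V, L = 120, 8, 30, 20   # frames, batch, vocab (incl. blank=0), label length
lam = 0.3                     # CTC weight; values around 0.2-0.5 are typical

log_probs = torch.randn(T, B, V).log_softmax(-1)   # CTC head output (placeholder)
dec_logits = torch.randn(B, L, V)                  # attention decoder output (placeholder)
targets = torch.randint(1, V, (B, L))

ctc = nn.CTCLoss(blank=0)(
    log_probs, targets,
    input_lengths=torch.full((B,), T, dtype=torch.long),
    target_lengths=torch.full((B,), L, dtype=torch.long),
)
att = nn.functional.cross_entropy(dec_logits.reshape(-1, V), targets.reshape(-1))

loss = lam * ctc + (1 - lam) * att   # the unified training objective
```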

Nagoya University specialists will offer a tutorial on statistical voice conversion (VC) with direct waveform modeling. The tutorial will give an overview of this approach and introduce freely available software: “Sprocket” as a statistical VC toolkit and “PytorchWaveNetVocoder” as a neural vocoder toolkit. It looks like a good chance to try your hand at voice conversion.
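The vocoder-style analysis/resynthesis such systems build on is easy to try with the pyworld package. The sketch below does a toy pitch-shifting “conversion”; the input file name is hypothetical, and real statistical VC replaces the crude F0 scaling with learned feature mappings, as in Sprocket:

```python
# Toy vocoder-style "conversion" with the WORLD vocoder (via pyworld):
# analyze a waveform into F0, spectral envelope, and aperiodicity,
# shift the pitch, and resynthesize.
import numpy as np
import pyworld as pw
import soundfile as sf

x, fs = sf.read("source.wav")        # hypothetical mono input file
x = x.astype(np.float64)

f0, t = pw.harvest(x, fs)            # F0 contour
sp = pw.cheaptrick(x, f0, t, fs)     # spectral envelope
ap = pw.d4c(x, f0, t, fs)            # aperiodicity

y = pw.synthesize(f0 * 1.3, sp, ap, fs)   # raise pitch by 30% and resynthesize
sf.write("converted.wav", y, fs)
```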

Google AI is preparing a tutorial on another hot topic, neural machine translation. However, the tutorial looks more like an overview of the history, mainstream techniques, and recent advancements.

Expanding on the keynote topic, specialists from Université Grenoble Alpes and Maastricht University will present biosignal-based speech processing, from silent speech interfaces to brain-computer interfaces, with real data and code.

Uber AI will present its approach to modeling and deploying dialog systems from scratch with open-source tools.

Finally, Paderborn University and NTT will present their insights into microphone array signal processing and deep learning for speech enhancement, with hybrid techniques uniting classical signal processing and neural networks.
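A typical hybrid of this kind lets a neural network estimate time-frequency masks that then drive a classical beamformer such as MVDR. In the NumPy sketch below, a crude energy-based mask stands in for the neural estimate, and the array geometry and signal are placeholders:

```python
# Neural-mask-driven MVDR beamforming (sketch): a network would estimate
# speech/noise masks per time-frequency bin; classical signal processing
# then turns those masks into a spatial filter.
import numpy as np
from scipy.signal import stft, istft

fs, C = 16000, 4                      # sample rate, number of microphones
x = np.random.randn(C, fs * 2)        # stand-in for a 4-channel recording

f, t, X = stft(x, fs=fs, nperseg=512) # X: (channels, freqs, frames)

# Placeholder "neural" mask: high where the reference channel is energetic.
power = np.abs(X[0]) ** 2
speech_mask = power / (power + np.median(power))   # in [0, 1], shape (F, T)
noise_mask = 1.0 - speech_mask

Y = np.zeros_like(X[0])
for k in range(X.shape[1]):
    Xk = X[:, k, :]                   # (C, T) for this frequency bin
    # Mask-weighted spatial covariance matrices.
    phi_s = (speech_mask[k] * Xk) @ Xk.conj().T / speech_mask[k].sum()
    phi_n = (noise_mask[k] * Xk) @ Xk.conj().T / noise_mask[k].sum()
    phi_n += 1e-6 * np.eye(C)         # regularize
    # Steering vector: principal eigenvector of the speech covariance.
    v = np.linalg.eigh(phi_s)[1][:, -1]
    # MVDR weights: w = Phi_n^{-1} v / (v^H Phi_n^{-1} v)
    w = np.linalg.solve(phi_n, v)
    w /= (v.conj() @ w)
    Y[k] = w.conj() @ Xk              # beamformed output for this bin

_, y = istft(Y, fs=fs, nperseg=512)   # enhanced single-channel signal
```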

Special events and challenges

Special sessions and challenges constitute a separate part of the conference and focus on relevant ‘special’ topics, ranging from computational paralinguistics, distant speech recognition, and zero-resource speech processing to children's speech, emotional speech, and code-switching. All papers have already been submitted, but it will be very interesting to see the finals and discuss the winners' approaches.

Apart from the challenges, there will be many more special events and satellite conferences to meet the needs of all specialists working in the field of speech processing: from a workshop for young female researchers to a special event for high school teachers. Participants will be able to join the first-ever INTERSPEECH Hackathon or choose among nine specialized conferences and satellite workshops.

SLaTE

The special event most important to us is the workshop on Speech and Language Technology in Education (SLaTE), held by the eponymous Special Interest Group (SIG) of the International Speech Communication Association (ISCA). The event brings together practitioners and researchers working on the use of speech and natural language processing for education. This year's workshop will not only have an extensive general session with 19 papers but will also feature a special session on the Spoken CALL Shared Task (version 3) with 4 papers, plus 4 demo papers.

Our poster

Our biggest expectation is, of course, our own participation! The paper by our stellar specialists Ievgen Karaulov and Dmytro Tkanov, “Attention model for articulatory features detection”, was accepted for a poster session. Our approach is a variation of end-to-end speech processing. The paper shows that using binary phonological features in the Listen, Attend and Spell (LAS) architecture can yield good results for phone recognition even on a small training set like TIMIT. More specifically, the attention model is used to train manner and place of articulation detectors end-to-end and to explore joint phone recognition and articulatory feature detection in a multitask learning setting.
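As a rough illustration of the multitask idea (not the actual model from the paper), a shared decoder state can feed both a phone classifier and a bank of binary articulatory-feature detectors, with the two losses combined. The sizes and task weight below are illustrative:

```python
# Schematic multitask setup: a shared decoder state feeds a phone classifier
# and a bank of binary articulatory-feature (AF) detectors, trained jointly.
import torch
import torch.nn as nn

HID, PHONES, AFS = 256, 61, 24   # hidden size, TIMIT phone set, AF count (illustrative)

phone_head = nn.Linear(HID, PHONES)   # softmax over phones
af_head = nn.Linear(HID, AFS)         # independent sigmoids over binary AFs

dec_state = torch.randn(8, 40, HID)   # placeholder decoder states: (batch, steps, hid)
phone_targets = torch.randint(0, PHONES, (8, 40))
af_targets = torch.randint(0, 2, (8, 40, AFS)).float()

phone_loss = nn.functional.cross_entropy(
    phone_head(dec_state).reshape(-1, PHONES), phone_targets.reshape(-1))
af_loss = nn.functional.binary_cross_entropy_with_logits(
    af_head(dec_state), af_targets)

alpha = 0.5                           # task weight (a tunable choice)
loss = alpha * phone_loss + (1 - alpha) * af_loss
```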

Our SLaTE paper

Yes, we are presenting more than one paper! Since our solution showed the best result on the text subset of the CALL v3 shared task, we wrote a paper exploring our approach, and now we are going to present it at SLaTE. The paper, “Embedding-based system for the text part of CALL v3 shared task”, by four of our team members (Volodymyr Sokhatskyi, Olga Zvyeryeva, Ievgen Karaulov, and Dmytro Tkanov), focuses on NNLM and BERT text embeddings and their use in a scoring system that measures the grammatical and semantic correctness of students' phrases. Our approach does not rely on the reference grammar file for scoring, proving that top results can be achieved without a predefined set of correct answers.
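Schematically, and not as the exact system from the paper, embedding-based scoring can look like this: embed the prompt and the student's response with BERT and let a small classifier, trained on accepted and rejected examples, judge correctness, with no reference grammar consulted at scoring time. The classifier here is untrained and purely illustrative:

```python
# Embedding-based scoring sketch: BERT sentence vectors for the prompt and
# the response feed a small accept/reject classifier; no reference grammar
# of correct answers is consulted at scoring time.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def embed(text: str) -> torch.Tensor:
    # Mean-pooled BERT token states as a fixed-size sentence vector.
    with torch.no_grad():
        out = bert(**tok(text, return_tensors="pt")).last_hidden_state
    return out.mean(dim=1).squeeze(0)            # (768,)

scorer = nn.Sequential(                          # would be trained on labeled task data
    nn.Linear(768 * 2, 128), nn.ReLU(), nn.Linear(128, 1))

def accept(prompt: str, response: str, threshold: float = 0.5) -> bool:
    x = torch.cat([embed(prompt), embed(response)])
    return torch.sigmoid(scorer(x)).item() >= threshold

print(accept("Ask for a room with a view.", "I would like a room with a view."))
```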

We already anticipate this INTERSPEECH to be a gala event that will give us new knowledge, ideas, and inspiration, and a great adventure for our teammates.
