SciForce Blog



Automatic Speech Recognition (ASR) Systems Compared

Automatic speech recognition (ASR) systems are becoming an increasingly important part of human-machine interaction. At the same time, they are still too expensive to develop from scratch, so companies need to choose between using a cloud API for an ASR system developed by a tech giant or experimenting with open-source solutions. In this post, we compare eight of the most popular ASR systems to help you match them to your project needs and your team's skills. We have also run our own tests to measure the word error rate (WER) of some of the listed systems, and we promise to update this post and add new information when possible. Let's dive right in.

Automatic speech recognition (ASR) is a technology that identifies and processes human voice using computer hardware and software-based techniques. You can use it to determine the words spoken or to authenticate a person's identity. In recent years, ASR has become popular in customer service departments across industries. Basic ASR systems recognize isolated-word entries such as yes-or-no responses and spoken numerals. More sophisticated ASR systems support continuous speech and allow entering direct queries or replies, such as a request for driving directions or the telephone number of a specific contact. State-of-the-art ASR systems recognize wholly spontaneous speech that is natural, unrehearsed, and contains minor errors or hesitation markers.

However, commercial systems offer little access to detailed model outputs, such as attention matrices, probabilities of individual words or symbols, or intermediate layer outputs, and they are difficult to integrate into other software. Hence, ASR systems like AT&T Watson, Microsoft Azure Speech Service, Google Speech API, and Nuance Recognizer (bought by Microsoft in April 2021) are not very flexible. In response to these limitations, more and more open-source ASR systems and frameworks enter the picture. However, the growing number of such systems makes it challenging to understand which of them suits a project's needs best, which offers complete control over the process, and which can be used without too much effort and deep knowledge of machine and deep learning. So, let's reveal all the nuts and bolts.

Of course, commercial ASR systems developed by tech giants such as Google or Microsoft offer the best accuracy in speech recognition. On the downside, they seldom give developers much control over the system, usually allowing them to expand the vocabulary or pronunciation but leaving the algorithms untouched.

Google Cloud Speech-to-Text is a service powered by deep neural networks and designed for voice search and speech transcription applications. Currently, it is the clear leader among ASR services in terms of accuracy and the languages covered.

Language support. The system currently recognizes 137 languages and variants, with an extensive vocabulary available in the default and command-and-search recognition models. You can use the default model to transcribe any audio type, while the command-and-search model is meant for short audio clips only.

Input. Sound files shorter than one minute can be streamed directly to perform so-called synchronous speech recognition: you talk to your phone and get text back. For longer files, the proper way is to upload them to Google Storage and use the asynchronous API. Supported audio encodings are MP3, FLAC, LINEAR16, MULAW, AMR, AMR_WB, OGG_OPUS, SPEEX_WITH_HEADER_BYTE, and WEBM_OPUS.
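To make the synchronous/asynchronous split concrete, here is a minimal sketch using Google's official google-cloud-speech Python client; the file names and bucket URI are hypothetical placeholders, not the exact setup we used in our tests.

```python
from google.cloud import speech

client = speech.SpeechClient()
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
    sample_rate_hertz=16000,
    language_code="en-US",
)

# Synchronous recognition: audio shorter than one minute, sent inline.
with open("short_clip.flac", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())
response = client.recognize(config=config, audio=audio)

# Asynchronous recognition: longer audio is uploaded to Google Cloud Storage first.
long_audio = speech.RecognitionAudio(uri="gs://my-bucket/long_recording.flac")
operation = client.long_running_recognize(config=config, audio=long_audio)
response = operation.result(timeout=600)

for result in response.results:
    print(result.alternatives[0].transcript)
```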
Pricing. Google provides one free hour of audio processing and charges $1.44 per hour of audio after that.

Pricing for Google Cloud Speech-to-Text (image credit: Google).

Models. Google offers four pre-built models: default, voice commands and search, phone calls, and video transcription. The default model is best suited for general use, like single-speaker long-form audio, while the video model is better at transcribing multiple speakers (and videos). In practice, the newer and more expensive video model performs better in all settings.

Selecting models for Google Cloud Speech-to-Text (image credit: Google).

Customization. A user can customize the number of hypotheses returned by the ASR, specify the language of the audio file, and enable a filter to remove profanities from the output text. Moreover, speech recognition can be adapted to a specific context by adding so-called hints, a set of words and phrases that are likely to be spoken (such as custom words and names), to the vocabulary; this is useful in voice-control use cases.

Accuracy. Franck Dernoncourt reported Google's WER as 12.1% on the LibriSpeech clean dataset. We also tested Google Cloud Speech on a LibriSpeech sample and got WERs of 17.8% and 18.8% for clear male and female voices respectively, and 32.5% and 25.3% in noisy environments.

The cloud-based Microsoft Azure Speech Services API helps create speech-enabled features in applications, such as voice command control, user dialogs using natural speech conversation, and speech transcription and dictation. The Speech API is part of Cognitive Services (previously Project Oxford). In its basic REST API model, it does not support intermediate results during recognition.

Language support. Microsoft's speech-to-text service supports 95 languages and regional variations, while the text-to-speech service supports 137. Speech-to-speech and speech-to-text translation services support 71 languages. Speaker recognition, a service that verifies and identifies a speaker by their voice characteristics, is available in 13 languages.

Input. The REST API supports audio streams up to 60 seconds, and you can use it for online transcription as a replacement for the Speech SDK. For longer audio files, you should use the Speech SDK or the Speech-to-Text REST API v3.0. When using the Speech SDK, keep in mind that the default audio streaming format is WAV (16 kHz or 8 kHz, 16-bit); other formats are supported via GStreamer: MP3, OPUS/OGG, FLAC, ALAW in a WAV container, MULAW in a WAV container, and ANY (used when the media format is unknown).

Pricing. Speech to Text costs $1 per hour, and Speech to Text with a Custom Speech model costs $1.40 per hour. There is a free tier for one concurrent request with a threshold of 5 hours per month. For more detailed plans, check out the pricing page.

Interfaces. Microsoft provides two ways for developers to add the Speech Services API to their apps: the Speech SDK and the REST API.

Customization. The Speech service allows users to adapt baseline models to their own acoustic and language data, customizing their vocabulary, acoustic models, and pronunciation.

The diagram presents the features of Custom Speech by Azure (source).

Accuracy. Franck Dernoncourt reported Azure's WER as 18.8% on the LibriSpeech clean dataset. We also tested Microsoft's speech-to-text service on a LibriSpeech sample and got WERs of 11.7% and 13.5% for clear male and female voices respectively, and 26% and 21.3% in noisy environments.
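Since WER is the metric we quote throughout this comparison, here is a minimal sketch of how it can be computed as word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. It is an illustration of the metric, not the exact script we used for our tests.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table of edit distances between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words = 0.33
```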
Amazon Transcribe is an automatic speech recognition service that, above all, adds punctuation and formatting by default, so the output is more intelligible and can be used without further editing.

Language support. At present, Amazon Transcribe supports 31 languages, including regional variations of English and French.

Input. Amazon Transcribe supports both 16 kHz and 8 kHz audio streams and multiple audio encodings, including WAV, MP3, MP4, and FLAC, with time stamps for every word, so that it is possible, as Amazon claims on its website, to "easily locate the audio in the original source by searching for the text." Service calls are limited to two hours of audio per API call.

Pricing. Amazon's pricing is pay-as-you-go, based on the amount of audio transcribed per month, starting at $0.0004 per second for the first 250,000 minutes. A free tier is available for 12 months with a limit of 60 minutes per month. All the pricing details are available on Amazon's page.

Customization. Amazon Transcribe allows custom vocabularies written in the accepted format and using characters from the allowed character set for each supported language.

Accuracy. With a limited collection of languages and only one baseline model available, Amazon Transcribe shows a WER of 22%.

The IBM Watson Speech to Text service is an ASR system that provides automatic transcription services. The system uses machine intelligence to combine information about grammar and language structure with knowledge about the composition of the audio signal to transcribe the human voice accurately. As more speech is heard, the system retroactively updates the transcription.

Language support. The IBM Watson Speech to Text service supports 19 languages and variations.

Input. The system supports 16 kHz and 8 kHz audio streams in MP3, MPEG, WAV, FLAC, Opus, and other formats.

Pricing. IBM provides a free plan of up to 500 minutes per month (no customization available). Within the Plus plan, users can run up to 100 concurrent transcriptions for $0.02 per minute for 1–999,999 minutes per month. The details of other plans are available on request.

Interfaces. The Watson Speech to Text service offers three interfaces: a WebSocket interface, a synchronous HTTP interface, and an asynchronous HTTP interface.

Customization. For a limited selection of languages, the IBM Watson Speech to Text service offers a customization interface that allows developers to augment its speech recognition capabilities. You can improve the accuracy of speech recognition requests by customizing a base model for domains such as medicine, law, information technology, and others. The system allows customization of both the language and the acoustic models.

Accuracy. Franck Dernoncourt reported IBM Watson's WER as 9.8% on the LibriSpeech clean dataset. We also tested the Watson speech-to-text service on a LibriSpeech sample and got WERs of 17.4% and 19.6% for clear male and female voices respectively, and 37.5% and 27.4% in noisy environments.

SpeechMatics, a UK-headquartered service available both in the cloud and on-premise, uses recurrent neural networks and statistical language modeling. This enterprise-targeted service offers free and premium features such as real-time transcription and audio-file upload.

Language support. SpeechMatics covers 31 languages and promises to cope with challenges like noisy environments, different accents, and dialects.

Input. The audio and video formats that this ASR system supports are WAV, MP3, AAC, OGG, FLAC, WMA, MPEG, AMR, CAF, MP4, MOV, WMV, M4V, FLV, and MKV.
The company claims that other formats can also be supported after an additional user acceptance test.

Pricing. Like most enterprise-targeted companies, SpeechMatics provides plans and pricing details on request. The company uses a volume-based strategy and provides a 14-day free trial.

Customization. This system is highly configurable: you can create your own user interface tailored to your needs. There is no default UI, but you can get one through a partner when needed. You can add your own words to the dictionary and teach the engine how to recognize them. You can also tune the system to exclude sensitive information or profanities. SpeechMatics also supports real-time subtitling.

Accuracy. Franck Dernoncourt tested SpeechMatics on the LibriSpeech clean test set (English) and got a WER of 7.3%, which is quite good compared to other commercial ASR systems.

The variety of open-source ASR systems makes it challenging to find those that combine flexibility with an acceptable word error rate. In this post, we have selected Kaldi and HTK as two popular ones across community platforms.

Kaldi was initially made for researchers, but it quickly made a name for itself. Kaldi is a Johns Hopkins University toolkit for speech recognition written in C++ and licensed under the Apache License v2.0. Famous for results that can actually compete with and even beat Google's, Kaldi is, however, challenging to master and set up correctly, requiring extensive customization and training on your own corpus.

Language support. Kaldi does not provide the linguistic resources required to build a recognizer in a given language, but it does have recipes for creating one on your own. Given enough data, you can train your own model with Kaldi. You can also use the pre-built models on Kaldi's page.

Input. Audio files are accepted in the WAV format.

Customization. Kaldi is a toolkit for building language and acoustic models so that you can create ASR systems on your own. It supports linear transforms, feature-space discriminative training, deep neural networks, and MMI, boosted MMI, and MCE discriminative training.

Accuracy. We tested Kaldi on a LibriSpeech sample and got WERs of about 28% and 28.1% for clear male and female voices respectively, and about 46.7% and 40.2% in noisy environments.

HTK, or the Hidden Markov Model Toolkit, written in the C programming language, was developed at the Cambridge University Engineering Department to handle HMMs. HTK focuses on speech recognition, but you can also use it for text-to-speech tasks and even DNA sequencing. Microsoft now holds the copyright to the original HTK code but still encourages changes to the source code. Interestingly, despite being one of the oldest projects, it is still used extensively and new versions are still released. Moreover, HTK has an insightful and detailed book, "The HTKBook," which describes both the mathematical basis of speech recognition and how to perform specific actions in HTK.

Language support. Like Kaldi, HTK is language-independent, with the possibility to build a model for any language.

Input. By default, the speech file format is HTK, but the toolkit also supports various other formats; you can set the configuration parameter SOURCEFORMAT to use them.

Customization. Fully customizable, HTK offers training tools to estimate the parameters of a set of HMMs from training utterances and their associated transcriptions, and recognition tools to transcribe unknown utterances.
Accuracy. The toolkit was evaluated on the well-known WSJ Nov '92 database. The result was an impressive 3.2% WER using a trigram language model on a 5,000-word vocabulary. However, in real life the WER reaches 25–30%.

Beyond these toolkits, an optimal way to quickly develop an ASR recognizer is to use the open-source code released with papers that show the highest results on well-known corpora (the Facebook AI Research Automatic Speech Recognition Toolkit, a TensorFlow implementation of the LAS model, or a TensorFlow library of deep learning models and datasets, to name a few).

The technology of automatic speech recognition has been around for some time. Though the systems are improving, problems still exist. For example, an ASR system cannot always correctly recognize the input from a person who speaks with a heavy accent or dialect or who mixes several languages. There are various tools, both commercial and open-source, for integrating ASR into a company's applications. When choosing between them, the crucial point is to find the right balance between the usually higher quality of proprietary systems and the flexibility of open-source toolkits. Companies need to understand their resources as well as their business needs. If ASR is used in conventional, well-researched settings and does not require much additional information, a ready-to-use system is the optimal solution. If, on the contrary, ASR is the project's core, a more flexible open-source toolkit becomes the better option.

Text-to-Speech Synthesis: an Overview

In my childhood, one of the funniest interactions with a computer was to make it read a fairy tale. You could copy a text into a window and soon listen to a colorless metallic voice stumble through commas and stops while weaving a weirdly accented story. At the time, it was a miracle. Nowadays the goal of TTS (Text-to-Speech conversion technology) is not simply to make machines talk, but to make them sound like humans of different ages and genders. Eventually, we will be able to listen to machine-voiced audiobooks and news on TV, or to communicate with assistants, without noticing the difference. Read on to learn how this can be achieved and who the main competitors in the field are.

As a rule, the quality of TTS synthesizers is evaluated from different angles, including intelligibility, naturalness, and preference of the synthetic speech [4], as well as human perception factors such as comprehensibility [3].

Intelligibility: the quality of the audio generated, or the degree to which each word in a sentence is produced clearly.

Naturalness: the quality of the generated speech in terms of its timing structure, pronunciation, and rendering of emotions.

Preference: the listeners' choice of the better TTS; preference and naturalness are influenced by the TTS system, signal quality, and voice, in isolation and in combination.

Comprehensibility: the degree to which received messages are understood.

Developments in Computer Science and Artificial Intelligence influence the approaches to speech synthesis, which have evolved through the years in response to recent trends and new possibilities in data collection and processing. While for a long time the two main methods of Text-to-Speech conversion were concatenative TTS and parametric TTS, the Deep Learning revolution has added a new perspective to the problem of speech synthesis, shifting the focus from human-developed speech features to fully machine-obtained parameters [1, 2].

Concatenative TTS relies on high-quality audio clip recordings, which are combined to form the speech. In the first step, voice actors are recorded saying a range of speech units, from whole sentences to syllables, which are then labeled and segmented by linguistic units from phones to phrases and sentences, forming a huge database. During speech synthesis, a Text-to-Speech engine searches this database for speech units that match the input text, concatenates them together, and produces an audio file. A toy sketch of the concatenation step is shown below.

Pros:
- High quality of audio in terms of intelligibility;
- Possibility to preserve the original actor's voice.

Cons:
- Such systems are very time-consuming, because they require huge databases and hard-coding of the unit combinations that form words;
- The resulting speech may sound unnatural and emotionless, because it is nearly impossible to record all possible words spoken in all combinations of emotions, prosody, stress, etc.

Examples: Singing Voice Synthesis is the type of speech synthesis that best fits the strengths of concatenative TTS. With the possibility to record a specific singer, such systems are able to preserve our heritage by restoring records of stars of the past, as in Acapella Group, or to make your favorite singer perform another song to your liking, as in Vocaloid.
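To make the concatenation step concrete, here is a minimal sketch that joins pre-recorded unit waveforms with a short crossfade; the lexicon and unit files are hypothetical, and a real system would select units from a large labeled database rather than a toy dictionary.

```python
import numpy as np

def concatenate_units(units, sr=22050, crossfade_ms=10):
    """Join pre-recorded unit waveforms with a short linear crossfade between them."""
    fade = int(sr * crossfade_ms / 1000)
    out = units[0].astype(np.float32)
    for unit in units[1:]:
        unit = unit.astype(np.float32)
        ramp = np.linspace(0.0, 1.0, fade)
        overlap = out[-fade:] * (1.0 - ramp) + unit[:fade] * ramp
        out = np.concatenate([out[:-fade], overlap, unit[fade:]])
    return out

# Hypothetical lexicon of recorded units; in practice these come from a segmented corpus.
# lexicon = {"hello": np.load("units/hello.npy"), "world": np.load("units/world.npy")}
# speech = concatenate_units([lexicon[w] for w in "hello world".split()])
```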
The formant synthesis technique is a rule-based TTS technique. It produces speech segments by generating artificial signals based on a set of specified rules that mimic the formant structure and other spectral properties of natural speech. The synthesized speech is produced using additive synthesis and an acoustic model. The acoustic model uses parameters like voicing, fundamental frequency, and noise levels that vary over time. Formant-based systems can control all aspects of the output speech, producing a wide variety of emotions and tones of voice with the help of prosodic and intonation modeling techniques.

Pros:
- Highly intelligible synthesized speech, even at high speeds, avoiding acoustic glitches;
- Less dependent on a speech corpus to produce the output speech;
- Well suited for embedded systems, where memory and microprocessor power are limited.

Cons:
- Low naturalness: the technique produces artificial, robotic-sounding speech that is far from natural human speech;
- It is difficult to design rules that specify the timing of the source and the dynamic values of all filter parameters even for simple words.

Examples: The formant synthesis technique is widely used for voice mimicking: such systems take speech as input and find the synthesis parameters that reproduce the target speech. One of the most famous examples is espeak-ng, an open-source multilingual speech synthesis system based on the Klatt synthesizer. This system is included as the default speech synthesizer in the NVDA open-source screen reader for Windows, and it is available for Android, Ubuntu, and other Linux distributions. Moreover, its predecessor eSpeak was used by Google Translate for 27 languages in 2010. A minimal source-filter sketch of the idea follows below.
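Below is a minimal source-filter sketch, assuming NumPy and SciPy, that generates a rough /a/-like vowel by passing a glottal impulse train through a cascade of second-order formant resonators. The formant frequencies and bandwidths are illustrative values, not those of any particular synthesizer.

```python
import numpy as np
from scipy.signal import lfilter

def formant_vowel(f0=120, formants=((730, 90), (1090, 110), (2440, 170)),
                  duration=0.5, sr=16000):
    """Crude /a/-like vowel: a glottal impulse train filtered by formant resonators."""
    n = int(duration * sr)
    source = np.zeros(n)
    source[::int(sr / f0)] = 1.0                 # impulse train at the fundamental frequency
    speech = source
    for freq, bw in formants:                    # cascade of second-order resonators
        r = np.exp(-np.pi * bw / sr)
        theta = 2 * np.pi * freq / sr
        b = [1 - 2 * r * np.cos(theta) + r ** 2]
        a = [1, -2 * r * np.cos(theta), r ** 2]
        speech = lfilter(b, a, speech)
    return speech / np.max(np.abs(speech))       # normalize amplitude
```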
To address the limitations of concatenative TTS, a more statistical method was developed. The idea behind it is that if we can approximate the parameters that make up speech, we can train a model to generate all kinds of speech. The parametric method combines parameters such as fundamental frequency and magnitude spectrum and processes them to generate speech. In the first step, the text is processed to extract linguistic features, such as phonemes or durations. The second step extracts vocoder features, such as cepstra, spectrograms, or fundamental frequency, that represent inherent characteristics of human speech and are used in audio processing (a short feature-extraction sketch follows below). These features are hand-engineered and, along with the linguistic features, are fed into a mathematical model called a vocoder. While generating a waveform, the vocoder transforms the features and estimates parameters of speech like phase, speech rate, intonation, and others. The technique uses Hidden Semi-Markov models: transitions between states still exist, and the model is Markov at that level, but the explicit model of duration within each state is not Markov.

Pros:
- Increased naturalness of the audio. The technology to create emotional voices is not yet perfected, but this is something that parametric TTS is capable of. Besides emotional voices, it has much potential in areas such as speaker adaptation and speaker interpolation;
- Flexibility: it is easier to modify pitch for emotional change or to use MLLR adaptation to change voice characteristics;
- Lower development cost: it requires merely 2–3 hours of voice actor recording time, which entails fewer recordings, a smaller database, and less data processing.

Cons:
- Lower audio quality in terms of intelligibility: there are many artifacts resulting in muffled speech, with an ever-present buzzing sound and noisy audio;
- The voice can sound robotic: in TTS based on a statistical model, the muffled sound makes the voice stable but unnatural and robotic.

Examples: Though first introduced in the 1990s, the parametric TTS engine became popular around 2007, with the Festival Speech Synthesis System from the University of Edinburgh and Carnegie Mellon University's Festvox being examples of such engines lying at the heart of speech synthesis systems such as FreeTTS.
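As an illustration of the vocoder features mentioned above, here is a minimal sketch, assuming the librosa library and a hypothetical speech.wav file, that extracts a fundamental frequency track, a mel spectrogram, and cepstral (MFCC) features from a recording.

```python
import librosa

# Load a speech recording (hypothetical file path) at 16 kHz.
y, sr = librosa.load("speech.wav", sr=16000)

# Fundamental frequency (F0) track estimated with the pYIN algorithm.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr)

# Spectral and cepstral features of the kind fed to statistical vocoders.
mel_spectrogram = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
```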
The DNN (Deep Neural Network) based approach is another variation of the statistical synthesis approaches, used to overcome the inefficiency of the decision trees that HMMs use to model complex context dependencies. A step forward and an eventual breakthrough was letting machines design features without human intervention. The features designed by humans are based on our understanding of speech, which is not necessarily correct. In DNN techniques, the relationship between input texts and their acoustic realizations is modeled by a DNN, and the acoustic features are generated using maximum likelihood parameter generation with trajectory smoothing. Features obtained with the help of Deep Learning are not human-readable, but they are computer-readable, and they represent the data required by a model.

Pros:
- A huge improvement both in terms of intelligibility and naturalness;
- No need for extensive human preprocessing and feature development.

Cons:
- As a recent development, Deep Learning speech synthesis techniques still require research.

Examples: Deep learning now dominates the field, being at the core of practically all successful TTS systems, such as WaveNet, Nuance TTS, and SampleRNN. Nuance TTS and SampleRNN are two systems that rely on recurrent neural networks. SampleRNN, for instance, uses a hierarchy of recurrent layers with different clock rates to process the audio. Multiple RNNs form a hierarchy, where the top level takes large chunks of input, processes them, and passes them to the lower level, which processes smaller chunks, and so on down to the bottom level, which generates a single sample. These techniques render less intelligible results but work fast. WaveNet, the core of Google Cloud Text-to-Speech, is a fully convolutional neural network that takes a digitized raw audio waveform as input, passes it through its convolutional layers, and outputs a waveform sample. Though close to perfect in its intelligibility and naturalness, WaveNet is unacceptably slow (the team reported that it takes around 4 minutes to generate 1 second of audio). Finally, the new wave of end-to-end training brought Google's Tacotron model, which learns to synthesize speech directly from (text, audio) pairs. It takes the characters of the text as input, passes them through different neural network submodules, and generates the spectrogram of the audio.

As we can see, the evolution of speech synthesis increasingly relies on machines both for determining the necessary features and for processing them without the assistance of human-developed rules. This approach improves the overall quality of the produced audio and significantly simplifies data collection and preprocessing. However, each approach has its niche, and even less efficient concatenative systems may be the optimal choice depending on business needs and resources.

References:
1. King, Simon. "A beginners' guide to statistical parametric speech synthesis." (2010).
2. Kuligowska, K., Kisielewicz, P., and Wlodarz, A. (2018). Speech synthesis systems: disadvantages and limitations. International Journal of Engineering & Technology, vol. 7, no. 2.28, pp. 234–239.
3. Pisoni, D. B. et al. "Perception of synthetic speech generated by rule." Proceedings of the IEEE, 1985, pp. 1665–1676.
4. Stevens, C. et al. "Online experimental methods to evaluate text-to-speech (TTS) synthesis: effects of voice gender and signal quality on intelligibility, naturalness and preference." Computer Speech and Language, vol. 19, pp. 129–146, 2005.

Our Expectations from INTERSPEECH 2019

In less than a month, from Sep. 15–19, 2019, Graz, Austria will become home to INTERSPEECH, the world's most prominent conference on spoken language processing. The conference unites science and technology under one roof and becomes a platform for over 2,000 participants who will share their insights, listen to eminent speakers, and attend tutorials, challenges, exhibitions, and satellite events. What are our expectations of it as participants and presenters?

Tanja Schultz, the spokesperson of the University of Bremen area "Minds, Media, Machines", will talk on biosignal processing for human-machine interaction. As human interaction involves a wide range of biosignals from speech, gestures, motion, and brain activity, it is crucial to interpret all of them correctly to ensure truly effective human-machine interaction. We are waiting for Tanja Schultz to describe her work on Silent Speech Interfaces, which rely on articulatory muscle movement to recognize and synthesize silently produced speech, and Brain-Computer Interfaces, which use brain activity to recognize speech and convert electrocortical signals into audible speech. Let's move to the new era of brain-to-text and brain-to-speech technology!

Manfred Kaltenbacher of Vienna University of Technology will discuss the physiology and physics of voice production. This talk has a more medical slant, as it looks at voice production from the point of view of physiology and physics. At the same time, it will discuss current computer simulations for pre-surgical predictions of voice quality and the development of examination and training for voice professionals, which is an interesting step away from the usual technology-oriented talks.

Mirella Lapata, Professor of natural language processing in the School of Informatics at the University of Edinburgh, will talk about learning natural language interfaces with neural models. Back to technology and AI, the talk will address the structured prediction problem of mapping natural language onto machine-interpretable representations. We definitely think it will be useful for any NLP specialist to know more about a neural network-based general modeling framework, the most promising approach of recent years.

There are eight tutorials, and we love them all! They tackle diverse topics, but they all discuss the most interesting recent developments and breakthroughs. Two tutorials concern generative adversarial networks, showing once again the power of this approach. The tutorial we are going to attend is offered by National Taiwan University and Academia Sinica. It is dedicated to speech signal processing, including speech enhancement, voice conversion, speech synthesis, and, more specifically, sentence generation. Moreover, we can expect real-life GAN algorithms for text style transformation, machine translation, and abstractive summarization without paired data. The second tutorial, by Carnegie Mellon University and Bar-Ilan University, shows how GANs can be used for speech and speaker recognition and other systems. The tutorial will discuss whether it is possible to fool systems with carefully crafted inputs and how to identify and avoid attacks with such crafted "adversarial" inputs. Finally, it will cover recent work on introducing "backdoors" into systems through poisoned training examples, such that the system can be triggered into false behaviors when provided with specific types of inputs, but not otherwise.
We are also waiting for the tutorial on another popular technique in speech processing, the end-to-end approach. We expect the tutorial by Mitsubishi, Nagoya University, NTT, and Johns Hopkins University to offer interesting insights into advanced methods for neural end-to-end speech processing, namely unification, integration, and implementation. The tutorial will explore ESPnet, a new open-source end-to-end speech processing toolkit built on a unified framework with integrated systems.

Nagoya University specialists will offer a tutorial on statistical voice conversion with direct waveform modeling. The tutorial will give an overview of this approach and introduce freely available software: "Sprocket" as a statistical VC toolkit and "PytorchWaveNetVocoder" as a neural vocoder toolkit. It looks like a good chance to try your hand at voice conversion.

Google AI is preparing a tutorial on another hot topic, neural machine translation. However, the tutorial looks more like an overview of the history, mainstream techniques, and recent advancements. Expanding on the keynote speech, specialists from Université Grenoble Alpes and Maastricht University will present biosignal-based speech processing, from silent speech to brain-computer interfaces, with real data and code. Uber AI will present its approach to modeling and deploying dialog systems from scratch with open-source tools. Finally, Paderborn University and NTT will present their insights into microphone array signal processing and deep learning for speech enhancement, with hybrid techniques uniting signal processing and neural networks.

Special sessions and challenges constitute a separate part of the conference and focus on relevant "special" topics, ranging from computational paralinguistics, distant speech recognition, and zero-resource speech processing to processing of children's speech, emotional speech, and code-switching. All papers have already been submitted, but it will be very interesting to see the finals and discuss the winners' approaches.

Besides the challenges, there will be many more special events and satellite conferences to meet the needs of all specialists working in the field of speech processing: from a workshop for young female researchers to a special event for high school teachers. Participants will be able to join the first-ever INTERSPEECH Hackathon or choose between nine specialized conferences and satellite workshops.

SLaTE. The special event most important for us is the workshop held by the Special Interest Group (SIG) of the International Speech Communication Association (ISCA) as part of the Speech and Language Technology in Education (SLaTE) events. The event brings together practitioners and researchers working on the use of speech and natural language processing for education. This year's workshop will not only have an extensive general session with 19 papers, but it will also feature a special session about the Spoken CALL Shared Task (version 3) with 4 papers, plus 4 demo papers.

Our biggest expectation is, of course, our own participation! The article written by our stellar specialists Ievgen Karaulov and Dmytro Tkanov, entitled "Attention model for articulatory features detection," was accepted for a poster session. Our approach is a variation of end-to-end speech processing. The article shows that using binary phonological features in the Listen, Attend, and Spell (LAS) architecture can yield good results for phone recognition even on a small training set like TIMIT.
More specifically, the attention model is used to train manner and place of articulation detectors end-to-end and to explore joint phone recognition and articulatory feature detection in a multitask learning setting. And yes, we are presenting not just one paper! Since our solution showed the best result on the text subset of the CALL v3 shared task, we wrote a paper exploring our approach, and now we are going to present it at SLaTE. The paper, called "Embedding-based system for the text part of CALL v3 shared task," by four of our team members (Volodymyr Sokhatskyi, Olga Zvyeryeva, Ievgen Karaulov, and Dmytro Tkanov), focuses on NNLM and BERT text embeddings and their use in a scoring system that measures the grammatical and semantic correctness of students' phrases. Our approach does not rely on the reference grammar file for scoring, proving that it is possible to achieve the highest results without a predefined set of correct answers. Even now, we anticipate this INTERSPEECH to be a gala event that will give us more knowledge, ideas, and inspiration, and a great adventure for our teammates.

Top Speech-to-Speech Translation Apps — What’s New?

A year ago, we wrote a post about the best speech-to-speech translation apps as of 2017. Even though the same giants still dominate the market (who would imagine the modern world without, for example, Google Translate, or Baidu in the East?), the market landscape is changing, with new products and trends emerging that are worth mentioning.

We cannot call Google Assistant just a translation app, of course, but, among other functionalities, it can be used as one. Based on Google Translate, it is another way to quickly and easily launch multilingual interpreting services. The biggest con so far is that the assistant requires an Internet connection to translate text.

This app was not born last year, but recently it has surfaced as an important player in the translation apps market. It boasts impressive accuracy and full voice-translation support for a range of the most popular languages, as well as speech-to-text and text-only support for additional, less common languages. Its biggest cons are the need for an Internet connection or mobile data, its availability only on Apple devices, and its subscription-based pricing.

One of the most highly praised translation apps, this free app offers 80 input languages and 44 output languages for immediate voice translation. Among other useful features, it corrects spelling, suggests the correct word, preserves the translation history, and shares texts directly with other applications. Main con: while it features a robust speech recognition engine, the app has a weaker translation module for a number of languages.

Though voice translation apps have improved over the past year to cover more languages and ensure more accurate speech recognition and translation, the clear trend is toward wearables: specialized devices providing real-time translation.

In late 2017, Google presented its first earbud headphones packed with the power of Google Translate. However, it turned out that real-time translation was only available when using Pixel Buds with a Pixel phone. Besides, they could not work without a data connection. Now, finally, the concept has changed to cover more devices: virtually all Assistant-optimized headphones and Android phones. At present, when your Google Assistant-enabled headphones and phone are paired, you can simply say "Help me interpret Spanish" (or any other supported language), and you will hear translations and be able to respond through your headphones while holding out your phone to the person you are talking to. According to the Google Buds support page, real-time translation is available in 40 languages, but only 27 languages are listed under "Talk" for speech translation.

A much-talked-about crowdfunded project, Travis® Touch (formerly Travis the Translator) is a handheld device that is supposed to rid us of language barriers forever. The goal of Travis is to translate conversations into 80 languages, 20 of which will have offline support. However, there is a difference in quality between online and offline translations. Besides, Travis is still being developed and improved, so for now it remains a product with the potential to turn our world upside down. When connected to the Internet, Travis offers smart AI-assisted translation for 105 languages (though some of them do not support voice-in/voice-out functionality). When offline, it supports 16 languages with basic word-for-word translation.

Another futuristic-looking wearable device that hangs around your neck and translates speech in real time is ili.
Positioning itself as the first wearable translator for travelers, ili is lightweight, fast, and does not require Wi-Fi to work. Conveniently, its library is optimized for travel-related scenarios, such as restaurants, shopping, and transportation. However, ili's translation capability is currently limited to Spanish, Mandarin, and Japanese, and the translation is one-way, so the question of whether you understand the person you are talking to remains open.

As early as 2016, a wearable translation device called the Pilot shook the tech world. The device consists of two earpieces, one for each speaker, and a mobile app. In this way, real-time translation is delivered right into your ear canal. Designed with noise-canceling microphones, the Pilot earpiece detects speech and filters out ambient noise. The Pilot then performs speech recognition, machine translation, and speech synthesis to output the translated version into the second earpiece.

Integrated with iTranslate, Bragi's Dash Pro earbuds offer real-time translation in almost 40 languages (16 of which are supported offline) and are billed as the first truly wireless smart earphones. In theory, if two people own Dash Pro earphones, they can converse as usual. However, that is rather unlikely, so the person wearing the earphones can hand their phone over to the other person, who can hear the translation through the app. Though the translation accuracy still needs improvement, the device has already become a valuable assistant that does its job at an acceptable level.

Another similar product, which appeared in the summer of 2017, is a set of earpieces called One2One that relies on IBM's Watson to translate 9 languages in real time. It uses its own SIM, meaning that you need to be in range of a data connection but you do not need Wi-Fi. Even though the earpiece does not depend directly on a smartphone, it will not work if you are completely offline. Besides, the translation accuracy still remains at around 85%, so do not expect perfect translations.

As you can see, 2018 continues the developments of 2017, making the apps better, faster, and more integrated into wearable devices. We hope the trend continues; next year we will dictate this post into a microphone, and you will be able to read or hear it in a language of your choice.

Towards Automatic Text Summarization: Extractive Methods

For those who have done academic writing, summarization (the task of producing a concise and fluent summary while preserving key information content and overall meaning) was, if not a nightmare, then a constant challenge close to guesswork about what the professor would find important. Though the basic idea looks simple (find the gist, cut off all opinions and details, and write a couple of perfect sentences), the task inevitably ends up in toil and turmoil. On the other hand, in real life we are perfect summarizers: we can describe the whole of War and Peace in one word, be it "masterpiece" or "rubbish". We can read tons of news about state-of-the-art technologies and sum them up in "Musk sent Tesla to the Moon". We would expect the computer to be even better. Where humans are imperfect, artificial intelligence, deprived of emotions and opinions of its own, would do the job.

The story began in the 1950s. An important study of that time introduced a method to extract salient sentences from the text using features such as word and phrase frequency. In this work, Luhn proposed to weight the sentences of a document as a function of high-frequency words, ignoring very high-frequency common words, an approach that became one of the pillars of NLP.

Word-frequency diagram: the abscissa represents individual words arranged in order of frequency.

Since then, a whole branch of natural language processing dedicated to summarization has emerged, covering a variety of tasks:

· headlines (from around the world);
· outlines (notes for students);
· minutes (of a meeting);
· previews (of movies);
· synopses (soap opera listings);
· reviews (of a book, CD, movie, etc.);
· digests (TV guides);
· biographies (resumes, obituaries);
· abridgments (Shakespeare for children);
· bulletins (weather forecasts/stock market reports);
· sound bites (politicians on a current issue);
· histories (chronologies of salient events).

The approaches to text summarization vary depending on the number of input documents (single or multiple), purpose (generic, domain-specific, or query-based), and output (extractive or abstractive). Extractive summarization means identifying important sections of the text and reproducing them verbatim, producing a subset of the sentences from the original text, while abstractive summarization reproduces important material in a new way after interpretation and examination of the text, using advanced natural language techniques to generate a new, shorter text that conveys the most critical information from the original. Obviously, abstractive summarization is more advanced and closer to human-like interpretation. Though it has more potential (and is generally more interesting for researchers and developers), so far the more traditional methods have proved to yield better results. That is why in this blog post we will give a short overview of the traditional approaches that have beaten a path to advanced deep learning techniques.

By now, the core of all extractive summarizers is formed by three independent tasks:

1) Construction of an intermediate representation of the input text.

There are two types of representation-based approaches: topic representation and indicator representation. Topic representation transforms the text into an intermediate representation and interprets the topic(s) discussed in the text. The techniques used for this differ in complexity and are divided into frequency-driven approaches, topic word approaches, latent semantic analysis, and Bayesian topic models.
Indicator representation describes every sentence as a list of formal features (indicators) of importance, such as sentence length, position in the document, or the presence of certain phrases.

2) Scoring the sentences based on the chosen representation.

Each sentence is assigned an importance score that reflects how well it captures the main topics of the text.

3) Selection of a summary comprising a number of sentences.

The summarizer system selects the top k most important sentences to produce a summary. Some approaches use greedy algorithms to select the important sentences, while others convert sentence selection into an optimization problem: a collection of sentences is chosen under the constraint that it should maximize overall importance and coherence and minimize redundancy.

Let's have a closer look at the approaches we mentioned and outline the differences between them.

Topic words. This common technique aims to identify words that describe the topic of the input document. An advance on Luhn's initial idea was to use a log-likelihood ratio test to identify explanatory words known as the "topic signature". Generally speaking, there are two ways to compute the importance of a sentence: as a function of the number of topic signatures it contains, or as the proportion of topic signatures in the sentence. While the first method gives higher scores to longer sentences with more words, the second one measures the density of topic words.

Frequency-driven approaches. This approach uses the frequency of words as an indicator of importance. The two most common techniques in this category are word probability and TFIDF (Term Frequency Inverse Document Frequency). The probability of a word w is determined as P(w) = f(w) / N: the number of occurrences of the word, f(w), divided by the number N of all words in the input (which can be a single document or multiple documents). Words with the highest probability are assumed to represent the topic of the document and are included in the summary. TFIDF, a more sophisticated technique, assesses the importance of words and identifies very common words (which should be omitted from consideration) by giving low weights to words that appear in most documents. TFIDF has given way to centroid-based approaches that rank sentences by computing their salience using a set of features. After creating TFIDF vector representations of documents, the documents that describe the same topic are clustered together and centroids are computed: pseudo-documents consisting of the words whose TFIDF scores are higher than a certain threshold and that form the cluster. Afterwards, the centroids are used to identify the sentences in each cluster that are central to the topic.

Latent semantic analysis. Latent semantic analysis (LSA) is an unsupervised method for extracting a representation of text semantics based on observed words. The first step is to build a term-sentence matrix, where each row corresponds to a word from the input (n words) and each column corresponds to a sentence. Each entry of the matrix is the weight of word i in sentence j, computed with the TFIDF technique. Then singular value decomposition (SVD) is applied, transforming the initial matrix into three matrices: a term-topic matrix holding word weights, a diagonal matrix where each entry corresponds to the weight of a topic, and a topic-sentence matrix. Multiplying the diagonal weight matrix by the topic-sentence matrix describes how much a sentence represents a topic, in other words, the weight of topic i in sentence j. A minimal sketch combining TFIDF weighting with LSA-based sentence scoring follows below.
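Here is a minimal sketch of these two ingredients put together, assuming scikit-learn and NumPy: sentences are turned into a TFIDF sentence-term matrix, truncated SVD extracts latent topics, and the sentences with the largest topic weights are selected. It illustrates the idea rather than reimplementing any published system.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def lsa_summary(sentences, n_topics=2, top_k=2):
    """Select the sentences with the strongest presence in the latent topics."""
    tfidf = TfidfVectorizer(stop_words="english")
    x = tfidf.fit_transform(sentences)          # sentence-term TFIDF matrix
    svd = TruncatedSVD(n_components=n_topics)
    sentence_topics = svd.fit_transform(x)      # rows are already scaled by the singular values
    scores = np.linalg.norm(sentence_topics, axis=1)
    best = sorted(np.argsort(scores)[-top_k:])  # keep the original sentence order
    return [sentences[i] for i in best]
```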
A logical development of analyzing semantics is to perform discourse analysis, finding the semantic relations between textual units in order to form a summary. The study of cross-document relations was initiated by Radev, who came up with the Cross-document Structure Theory (CST) model (http://www.aclweb.org/anthology/W00-1009). In his model, words, phrases, or sentences can be linked with each other if they are semantically connected. CST has indeed proved useful for document summarization, both for determining sentence relevance and for treating repetition, complementarity, and inconsistency among diverse data sources. Nonetheless, the significant limitation of this method is that the CST relations have to be explicitly determined by humans.

While other approaches do not have very clear probabilistic interpretations, Bayesian topic models are probabilistic models that, by describing topics in more detail, can represent information that is lost in other approaches. In topic modeling of text documents, the goal is to infer the words related to a certain topic and the topics discussed in a certain document, based on prior analysis of a corpus of documents. This is possible with the help of Bayesian inference, which calculates the probability of an event based on a combination of common-sense assumptions and the outcomes of previous related events. The model is constantly improved through many iterations in which a prior probability is updated with observational evidence to produce a new posterior probability.

The second large group of techniques aims to represent the text based on a set of features and use them to rank the sentences directly, without representing the topics of the input text.

Graph methods. Influenced by the PageRank algorithm, these methods represent documents as a connected graph, where sentences form the vertices and edges between the sentences indicate how similar two sentences are. The similarity of two sentences is measured with cosine similarity over TFIDF weights for words, and if it is greater than a certain threshold, the sentences are connected (a minimal sketch follows at the end of this section). This graph representation yields two outcomes: the sub-graphs included in the graph create the topics covered in the documents, and the important sentences are identified. Sentences that are connected to many other sentences in a sub-graph are likely to be central to the graph and will be included in the summary. Since this method does not need language-specific linguistic processing, it can be applied to various languages. At the same time, measuring only the formal side of sentence structure, without syntactic and semantic information, limits the applicability of the method.

Machine learning methods. Machine learning approaches that treat summarization as a classification problem are now widely used, applying Naive Bayes, decision trees, support vector machines, Hidden Markov models, and Conditional Random Fields to obtain a true-to-life summary. As it has turned out, the methods that explicitly model the dependency between sentences (Hidden Markov models and Conditional Random Fields) often outperform other techniques.

Figure 1: Summary extraction Markov model to extract 2 lead sentences and additional supporting sentences.

Figure 2: Summary extraction Markov model to extract 3 sentences.

Yet the problem with classifiers is that if we utilize supervised learning methods for summarization, we need a set of labeled documents to train the classifier, which means developing a corpus. A possible way out is to apply semi-supervised approaches that combine a small amount of labeled data with a large amount of unlabeled data during training.
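Here is a minimal sketch of the graph-based ranking just described, assuming scikit-learn and networkx: sentences become nodes, TFIDF cosine similarity above a threshold becomes a weighted edge, and PageRank picks the central sentences. It illustrates the general TextRank/LexRank idea rather than reproducing either algorithm exactly.

```python
import numpy as np
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def graph_summary(sentences, top_k=2, threshold=0.1):
    """Rank sentences with PageRank over a TFIDF cosine-similarity graph."""
    x = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    sim = cosine_similarity(x)
    np.fill_diagonal(sim, 0.0)        # no self-loops
    sim[sim < threshold] = 0.0        # connect only sufficiently similar sentences
    graph = nx.from_numpy_array(sim)  # weighted, undirected sentence graph
    scores = nx.pagerank(graph, weight="weight")
    best = sorted(sorted(scores, key=scores.get, reverse=True)[:top_k])
    return [sentences[i] for i in best]
```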
Overall, machine learning methods have proved to be very effective and successful in both single- and multi-document summarization, especially in class-specific summarization such as drafting scientific paper abstracts or biographical summaries. Though abundant, all the summarization methods we have mentioned cannot produce summaries similar to human-created ones. In many cases, the soundness and readability of the created summaries are not satisfactory, because they fail to cover all the semantically relevant aspects of the data in an effective way, and then they fail to connect sentences in a natural way. In our next post, we will talk more about ways to overcome these problems and about new approaches and techniques that have recently appeared in the field.

Interspeech 2018 Highlights

This year the Sciforce team traveled as far as India for one of the most important events in the speech processing community, the Interspeech conference. It is a truly scientific conference, where every talk, poster, or demo is accompanied by a paper published in the ISCA proceedings. As usual, it covered most speech-related topics, and even more: automatic speech recognition (ASR) and generation (TTS), voice conversion and denoising, speaker verification and diarization, spoken dialogue systems, language education, and healthcare-related topics.

● This year's keynote was "Speech research for emerging markets in multilingual society". Together with several sessions on providing speech technologies for the dozens of languages spoken in India, it shows an important shift from focusing on a few well-researched languages in developed markets to broader coverage.
● Quite in line with that, while ASR for endangered languages is still a matter of academic research funded by non-profit organizations, ASR for under-resourced languages with a sufficient number of speakers is found attractive by industry.
● End-to-end (attention-based) models are gradually becoming mainstream in speech recognition. More traditional hybrid HMM+DNN models (mostly based on the Kaldi toolkit) nevertheless remain popular and provide state-of-the-art results in many tasks.
● Speech technologies in education are gaining momentum, and healthcare-related speech technologies have already formed a big domain.
● Though Interspeech is a speech-processing conference, there are many overlaps with other areas of ML, such as Natural Language Processing (NLP) or video and image processing. Spoken language understanding, multimodal systems, and dialogue agents were widely presented.
● The conference covered some fundamental theoretical aspects of machine learning, which can be equally applied to speech, computer vision, and other areas.
● More and more researchers share their code so that their results can be checked and reproduced.
● Finally, ready-to-use open-source solutions were presented, e.g. HALEF and S4D.

Our top picks. At the conference, we focused on topics related to the application of speech technologies to language education and on more general topics such as automatic speech recognition, learning speech signal representations, etc. We also visited two pre-conference tutorials: End-to-End Models for ASR and the Information Theory of Deep Learning.

The tutorial given by Rohit Prabhavalkar and Tara Sainath from Google Inc., USA, was undeniably one of the most valuable events of the conference, bringing new ideas and uncovering important details even for quite experienced specialists. Conventional pipelines involve several separately trained components such as an acoustic model, a pronunciation model, a language model, and second-pass rescoring. In contrast, end-to-end models are typically sequence-to-sequence models that output words or graphemes directly and greatly simplify the pipeline. The tutorial presented several end-to-end ASR models, starting with the earliest one, Connectionist Temporal Classification (CTC), which receives acoustic data at the input, passes it through an encoder, and outputs a softmax representing the distribution over characters or (sub)words, and its development, RNN-T, which incorporates a jointly trained language model component. Yet most state-of-the-art end-to-end solutions use attention-based models, where the attention mechanism summarizes the encoder features relevant to predicting the next label.
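As a side note on what the CTC objective looks like in practice, here is a minimal hedged sketch using PyTorch's nn.CTCLoss, with random tensors standing in for encoder outputs and label sequences; the shapes and sizes are arbitrary illustrations, not values from any paper at the conference.

```python
import torch
import torch.nn as nn

# Toy dimensions: T encoder time steps, a batch of N utterances, C output symbols (index 0 = blank).
T, N, C = 50, 2, 30
logits = torch.randn(T, N, C, requires_grad=True)      # stand-in for encoder outputs
log_probs = logits.log_softmax(dim=2)                   # CTC expects log-probabilities

targets = torch.randint(low=1, high=C, size=(N, 10))    # label sequences (no blank symbol)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()                                         # gradients flow back into the encoder
```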
Most modern architectures are improvements on Listen, Attend and Spell (LAS), proposed by Chan et al. and Chorowski et al. in 2015. The LAS model consists of an encoder (similar to an acoustic model) with a pyramidal structure that reduces the time resolution, an attention (alignment) model, and a decoder, which is an analog of a pronunciation or language model. LAS offers good results without an additional language model and is able to recognize out-of-vocabulary words. However, to decrease the word error rate (WER) further, special techniques are used, such as shallow fusion, which integrates a separately trained language model by combining its scores with the decoder's outputs.

One of the most notable events of this year's Interspeech was a tutorial by Naftali Tishby from the Hebrew University of Jerusalem. Although the author first proposed this approach more than a decade ago, so it is familiar to the community, and the tutorial was delivered as a Skype teleconference, there were no free seats at the venue. Naftali Tishby started with an overview of deep learning models and information theory. He covered information-plane-based analysis, described the learning dynamics of neural networks and other models, and, finally, showed the impact of multiple layers on the learning process. Although the tutorial is highly theoretical and requires a mathematical background to understand, deep learning practitioners can take away the following useful tips:

● The information plane is a useful tool for analyzing the behavior of complex DNNs.
● If a model can be presented as a Markov chain, it will likely have predictable learning dynamics in the information plane.
● There are two learning phases: capturing the input-target relation and compressing the representation.

Though his research covers a very small subset of modern neural network architectures, N. Tishby's theory has spawned lots of discussion in the deep learning community.

There are two major speech-related tasks for foreign language learners: computer-aided language learning (CALL) and computer-aided pronunciation training (CAPT). The main difference is that CALL applications focus on checking vocabulary, grammar, and semantics, while CAPT applications assess pronunciation. Most CALL solutions use ASR at their back end. However, a conventional ASR system trained on native speech is not suitable for this task, due to students' accents, language errors, lots of incorrect words, and out-of-vocabulary (OOV) words. Therefore, techniques from Natural Language Processing (NLP) and Natural Language Understanding (NLU) should be applied to determine the meaning of the student's utterance and detect errors. Most systems are trained on in-house corpora of non-native speech with a fixed native language.

Most CAPT papers use ASR models in a specific way, for forced alignment: a student's waveform is aligned in time with the textual prompt, and the confidence score for each phone is used to estimate how well the user pronounced that phone. However, some novel approaches were presented in which, for example, the relative distance between different phones is used to assess a student's language proficiency, with end-to-end training involved.

Bonus: the CALL shared task is an annual competition based on a real-world task. Participants from both academia and industry presented their solutions, which were benchmarked on an open dataset consisting of two parts: speech processing and text processing.
Bonus: the Spoken CALL shared task is an annual competition based on a real-world task. Participants from both academia and industry presented solutions benchmarked on an open dataset consisting of two parts: speech processing and text processing. Both parts contain German prompts and English answers given by students. The language (vocabulary and grammar) and the meaning of the responses were assessed independently by human experts. The task is open-ended, i.e. there are multiple ways to say the same thing, and only a few of them are specified in the dataset.

This year, A. Zeyer and colleagues presented a new ASR model showing the best results to date on the LibriSpeech corpus (1,000 hours of clean English speech): the reported WER is 3.82%. It is another end-to-end model, an improvement on LAS, and uses Byte-Pair-Encoding subword units, with 10K subword targets in total. On the smaller Switchboard corpus (300 hours of telephone-quality speech), the best result, 7.5% WER, was achieved by a modification of the Lattice-Free MMI (Maximum Mutual Information) approach by H. Hadian et al.

Despite the success of end-to-end neural network approaches, one of their main shortcomings is that they require huge training databases. For endangered languages with few native speakers, creating such a database is close to impossible. As in previous years, there was a session on ASR for such languages. The most popular approach is transfer learning, i.e. training a model on one or more well-supported languages and then retraining it on the under-resourced one. Unsupervised discovery of (sub)word units is another widely used approach.

A somewhat different task is ASR for under-resourced languages, where a relatively small dataset (dozens of hours) is usually available. This year, Microsoft organized a challenge on ASR for Indian languages and shared a dataset containing circa 40 hours of training material and 5 hours of test data in Tamil, Telugu, and Gujarati. The winner was a system named "BUT Jilebi," which uses Kaldi-based ASR with the LF-MMI objective, speaker adaptation via feature-space maximum likelihood linear regression (fMLLR), and data augmentation with speed perturbation.

This year also brought many presentations on voice conversion. For example, one voice conversion tool, trained on the VCTK corpus (40 hours of native English speech), computes the speaker embedding or i-vector of a new target speaker from a single utterance. The results sound a bit robotic, yet the target voice is recognizable. Another interesting approach to word-level speech processing is Speech2Vec. It resembles Word2Vec, widely used in natural language processing, and learns fixed-length embeddings for variable-length spoken word segments. Under the hood, Speech2Vec uses an encoder-decoder model with attention. Other topics included speech synthesis, discrimination of manners of articulation, unsupervised phone recognition, and many more.

With the development of Deep Learning, the Interspeech conference, originally intended for the speech processing and DSP community, has gradually transformed into a broader platform for machine learning scientists regardless of their field of interest. It has become a place to share ideas across different areas of machine learning and to inspire multimodal solutions where speech processing appears together (and sometimes in the same pipeline) with video and natural language processing. Sharing ideas between fields undoubtedly speeds up progress, and this year's Interspeech showed several examples of such sharing.

References

1. A. Graves, S. Fernández, F. Gomez, J. Schmidhuber. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. ICML 2006. [pdf]
2. A. Graves. Sequence Transduction with Recurrent Neural Networks. Representation Learning Workshop, ICML 2012. [pdf]
3. W. Chan, N. Jaitly, Q. V. Le, O. Vinyals. Listen, Attend and Spell. 2015. [pdf]
4. J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, Y. Bengio. Attention-Based Models for Speech Recognition. 2015. [pdf]
5. G. Pundak, T. Sainath, R. Prabhavalkar, A. Kannan, Ding Zhao. Deep Context: End-to-end Contextual Speech Recognition. 2018. [pdf]
6. N. Tishby, F. Pereira, W. Bialek. The Information Bottleneck Method. Invited paper, Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, pages 368–377, 1999. [pdf]
7. Evanini, K., Timpe-Laughlin, V., Tsuprun, E., Blood, I., Lee, J., Bruno, J., Ramanarayanan, V., Lange, P., Suendermann-Oeft, D. Game-based Spoken Dialog Language Learning Applications for Young Students. Proc. Interspeech 2018, 548–549. [pdf]
8. Nguyen, H., Chen, L., Prieto, R., Wang, C., Liu, Y. Liulishuo's System for the Spoken CALL Shared Task 2018. Proc. Interspeech 2018, 2364–2368. [pdf]
9. Tu, M., Grabek, A., Liss, J., Berisha, V. Investigating the Role of L1 in Automatic Pronunciation Evaluation of L2 Speech. Proc. Interspeech 2018, 1636–1640. [pdf]
10. Kyriakopoulos, K., Knill, K., Gales, M. A Deep Learning Approach to Assessing Non-native Pronunciation of English Using Phone Distances. Proc. Interspeech 2018, 1626–1630. [pdf]
11. Zeyer, A., Irie, K., Schlüter, R., Ney, H. Improved Training of End-to-end Attention Models for Speech Recognition. Proc. Interspeech 2018, 7–11. [pdf]
12. Hadian, H., Sameti, H., Povey, D., Khudanpur, S. End-to-end Speech Recognition Using Lattice-free MMI. Proc. Interspeech 2018, 12–16. [pdf]
13. He, D., Lim, B.P., Yang, X., Hasegawa-Johnson, M., Chen, D. Improved ASR for Under-resourced Languages through Multi-task Learning with Acoustic Landmarks. Proc. Interspeech 2018, 2618–2622. [pdf]
14. Chen, W., Hasegawa-Johnson, M., Chen, N.F. Topic and Keyword Identification for Low-resourced Speech Using Cross-Language Transfer Learning. Proc. Interspeech 2018, 2047–2051. [pdf]
15. Hermann, E., Goldwater, S. Multilingual Bottleneck Features for Subword Modeling in Zero-resource Languages. Proc. Interspeech 2018. [pdf]
16. Feng, S., Lee, T. Exploiting Speaker and Phonetic Diversity of Mismatched Language Resources for Unsupervised Subword Modeling. Proc. Interspeech 2018, 2673–2677. [pdf]
17. Godard, P., Boito, M.Z., Ondel, L., Berard, A., Yvon, F., Villavicencio, A., Besacier, L. Unsupervised Word Segmentation from Speech with Attention. Proc. Interspeech 2018, 2678–2682. [pdf]
18. Glarner, T., Hanebrink, P., Ebbers, J., Haeb-Umbach, R. Full Bayesian Hidden Markov Model Variational Autoencoder for Acoustic Unit Discovery. Proc. Interspeech 2018, 2688–2692. [pdf]
19. Holzenberger, N., Du, M., Karadayi, J., Riad, R., Dupoux, E. Learning Word Embeddings: Unsupervised Methods for Fixed-size Representations of Variable-length Speech Segments. Proc. Interspeech 2018, 2683–2687. [pdf]
20. Pulugundla, B., Baskar, M.K., Kesiraju, S., Egorova, E., Karafiát, M., Burget, L., Černocký, J. BUT System for Low Resource Indian Language ASR. Proc. Interspeech 2018, 3182–3186. [pdf]
21. Liu, S., Zhong, J., Sun, L., Wu, X., Liu, X., Meng, H. Voice Conversion Across Arbitrary Speakers Based on a Single Target-Speaker Utterance. Proc. Interspeech 2018, 496–500. [pdf]
22. Chung, Y., Glass, J. Speech2Vec: A Sequence-to-Sequence Framework for Learning Word Embeddings from Speech. Proc. Interspeech 2018, 811–815. [pdf]
23. Lee, J.Y., Cheon, S.J., Choi, B.J., Kim, N.S., Song, E. Acoustic Modeling Using Adversarially Trained Variational Recurrent Neural Network for Speech Synthesis. Proc. Interspeech 2018, 917–921. [pdf]
24. Tjandra, A., Sakti, S., Nakamura, S. Machine Speech Chain with One-shot Speaker Adaptation. Proc. Interspeech 2018, 887–891. [pdf]
25. Renkens, V., van Hamme, H. Capsule Networks for Low Resource Spoken Language Understanding. Proc. Interspeech 2018, 601–605. [pdf]
26. Prasad, R., Yegnanarayana, B. Identification and Classification of Fricatives in Speech Using Zero Time Windowing Method. Proc. Interspeech 2018, 187–191. [pdf]
27. Liu, D., Chen, K., Lee, H., Lee, L. Completely Unsupervised Phoneme Recognition by Adversarially Learning Mapping Relationships from Audio Embeddings. Proc. Interspeech 2018, 3748–3752.

Interspeech 2017 flashback and 2018 expectations

Interspeech is the world's largest and most comprehensive conference on the science and technology of spoken language processing. Interspeech 2017 gathered circa 2,000 participants in Stockholm, Sweden, exceeding the expected capacity of the conference. There were lots of great people to meet and listen to: Hideki Kawahara, Simon King, Jan Chorowski, Tara N. Sainath, and many others. The paper acceptance rate was, as usual, rather high at 51%; ICASSP 2017 had a similar figure, while other ML-related conferences keep this metric closer to 20–30%. Most of the works can be classified into one of the following groups:

● Deep neural networks: how can we interpret what they have learned? This should be a nice session to check after the information theory tutorial.
● The low-resource speech recognition challenge for Indian languages. Being low on data is a common situation for anyone working with languages outside the mainstream setting, so any tips and tricks would be really valuable.
● The Spoken CALL shared task, second edition: the core event for sampling approaches to language learning.

Hundreds of papers will be presented, and it is impossible to cover all of them; there is also a lot of overlap between sections, especially on day 2. We will try to focus on the following sections:

Top 8 Speech-to-Speech Translation Apps of 2017

One of the undeniable focuses of the coming year will be voice recognition technologies. The voice-control, voice-assistant revolution is pushing us to talk to objects in our homes and offices. Already in 2017, users began interacting more and more with their machines the same way we interact with each other: by talking. Alexa, Cortana, Einstein, Google, Siri, and Watson are already becoming valuable assistants, and practically members of the family to some of us. But will we go beyond interaction with machines? Will combining voice recognition with machine translation bring us the long-awaited worldwide communication without borders? GP Bullhound, a London-based tech investment firm, expects in its report on the 10 biggest tech predictions that this will be the case in 2018: as many as one billion people will start using voice recognition technology for translation in 2018, bringing the world closer to a fundamental shift in human-to-human communication. With advances in neural network machine learning and a pool of almost 5 billion smartphone users, computers will be able to understand not just words but also grammar, providing more naturally flowing translation and boosting the use of language translation in 2018.

_Google Translate_
Evidently, when it comes to speech recognition and instant translation, Google Translate is the first to come to mind, and for good reason. Google supports more languages than its competitors and dominates the field, serving as the basis for other web apps. Yet, in 2017 both its biggest rivals and smaller companies offered apps with outstanding text-to-speech and voice-to-voice translation functionality.

_Microsoft Translator_
Microsoft's answer to Google Translate has free apps for Windows, iOS, and Android that can translate speech, text, and images (though not video). Microsoft Translator supports only 60 languages, and not all features are available for all of them, yet it outperforms Google in real-time conversation mode, which makes it easier to have natural conversations with foreigners.

_iTranslate Voice_
iTranslate Voice provides instant text-to-speech and voice-to-voice translation on iOS and Android devices. It supports 44 languages and dialects, though not all to the same degree. According to some reviews, it already features better voice input and output than even Google Translate. One of its features, called AirTranslate, can translate a conversation between two people on two iOS devices on the go, which makes it worth trying.

_TripLingo_
An app for travelers, TripLingo combines an interactive phrasebook with an instant voice translator, along with other useful travel and language learning tools. It offers instant voice translation in 42 languages, including formal, casual, and slang variants of commonly used phrases.

_SayHi_
SayHi offers instant speech-to-speech translation in 90 languages and dialects (including multiple Arabic dialects) for the iPhone and Kindle. The app claims 95% accuracy for voice recognition. In addition, it lets users choose a male or female voice and adjust its speed.

With more people relying on speech recognition apps in Asia, the market has seen a boom in instant translation software designed specifically for the region.

_Baidu Translate_
Baidu Translate provides translation services for 16 popular languages in 186 translation directions.
Drawing on 5 million authoritative dictionaries, the app offers real-time speech-to-speech translation and camera translation for multiple languages, including English, Chinese, Japanese, and Korean. For offline translation, Baidu Translate provides authoritative phrasebook packs and offline voice packs for Japanese, Korean, and American English. As an additional feature for users traveling to Asian countries or to the USA, the app provides useful expressions for everyday conversation. With Baidu, the leading Internet-search company in China, having developed a voice system that in some cases recognizes English and Mandarin words better than people, and its new Deep Speech 2 system relying entirely on machine learning, Baidu Translate has the potential to revolutionize the technology.

_iFlytek Input_
In China, over 500 million people use iFlytek Input to overcome obstacles in multilingual communication or to communicate with speakers of other Chinese dialects. The app was developed by iFlytek, a leading Chinese AI company that applies deep learning in a number of fields, including speech recognition, natural language processing, and machine translation, and was placed among the "50 Smartest Companies 2017".

_Naver Papago_
This Korean app specializes in translation between English, Korean, Simplified Chinese (Mandarin), and Japanese. It can do voice translation for real-time results and translate conversations, images, and text. Helpful features include letting users choose between two images to establish the correct context, as well as conversation and offline modes.
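All of these apps implement some variant of the same cascade: speech recognition, then machine translation, and usually speech synthesis to speak the result. As a rough illustration of that idea (not of any particular app's internals), here is a minimal sketch using open-source components; the model checkpoints named below are assumptions chosen for illustration, and the text-to-speech step is left as a placeholder.

```python
from transformers import pipeline

# Cascade sketch: ASR -> MT (-> TTS).  Any ASR and translation models with
# compatible languages would do; these are common public checkpoints.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
translate = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")

def speech_translation(audio_path: str) -> str:
    """Recognize English speech from an audio file and translate it to German."""
    english_text = asr(audio_path)["text"]
    german_text = translate(english_text)[0]["translation_text"]
    return german_text

# A production speech-to-speech app would add a TTS step to voice the output,
# plus streaming, voice activity detection, and automatic language detection.
print(speech_translation("sample_utterance.wav"))
```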

SciForce: who we are

SciForce is a Ukraine-based IT company specializing in the development of software solutions built on science-driven information technologies. We have wide-ranging expertise in many key AI technologies, including Data Mining, Digital Signal Processing, Natural Language Processing, Machine Learning, Image Processing, and Computer Vision. This focus allows us to offer state-of-the-art solutions in data science-related projects for commerce, banking and finance, healthcare, gaming, media and publishing, and education. We offer AI solutions to any organization or industry that deals with massive amounts of data. Our applications help reduce costs, improve customer satisfaction and productivity, and increase revenues.

Our team boasts over 40 versatile specialists in two offices in Kharkiv and Lviv, the regional capitals and most rapidly developing IT centers of Eastern and Western Ukraine. Our specialists include managers, architects, developers, designers, and QA specialists, as well as data scientists, medical professionals, and linguists. With such an organizational structure, we have the flexibility both to help our customers launch short-term small or medium projects and to build long-term partnerships that strengthen our partners' in-house teams and change the perception of an offshore team from mere contractors to an important part of the organization, with the corresponding motivation and loyalty. Aside from the development of software _per se_, SciForce renders the full range of consulting services for deploying a new project or fine-tuning an ongoing project that needs re-evaluation or restructuring.

The philosophy of SciForce is not only to hire experienced professionals but also to foster specialists through mentoring and knowledge sharing. The dedicated SciForce Academy project helps us find young talents, which not only facilitates internal hiring and transfers but also creates a productive atmosphere of mutual respect, trust, and patience. In our corporate blog, we are going to share our knowledge and insights into frontier information technologies, provide expert opinions from our specialists, and offer you a glance at our daily life. Stay with us!