SciForce Blog


Stay informed and inspired in the world of AI with us.

How to Scale AI in Your Organization

According to WEKA's 2023 Global Trends in AI Report, 69% of organizations now have AI projects up and running, and 28% are using AI across their whole business. This shows a big shift from merely experimenting with AI to making it a key part of how companies operate and succeed. However, this is just the beginning: the point is not simply to have AI but to have it work to your benefit. Organizations have to address challenges such as data collection, hiring the right skills, and fitting AI into their existing systems. This guide serves both new companies and big businesses, giving clear examples and direct advice on how to work around these problems. We will discuss what specific things you can do to make the most of AI, whether you want to improve your processes, give better customer service, or make better business decisions. We can help you not only use AI but make the best use of it, so you can lead the competition in your area.

Artificial Intelligence (AI) and Machine Learning (ML) are two modern technologies that are restructuring the way businesses work. An AI study by 451 Research has revealed that most companies start using AI/ML not just to cut expenses but to generate revenue as well. They are using AI/ML to revamp their profit systems, sharpen their sales strategies, and enhance their product and service offerings. This demonstrates a change of viewpoint: AI/ML is becoming a driver of business growth, not just a hands-on tool. For AI integration to be effective, you need clear goals and an implementation plan. We have put together a short guide to get you started in a smart direction for your business.

1. Identify Objectives

The first step in your AI integration is clearly stating your goals, for example improving your processes, your customer service, or the quality of your business decisions.

2. Assess Your Current Setup

It's important to note that about 80% of AI projects don't move past the testing phase or lab setup. This often happens because standardizing the way models are built, trained, deployed, and monitored can be tough. AI projects usually need a lot of resources, which makes them challenging to manage and set up. However, this doesn't mean small businesses can't use AI. With the right approach, even smaller companies can make AI work for them, effectively bringing it into their operations.

Computational Resources

AI models, especially those using machine learning or deep learning, need a lot of computing power to process large datasets. This is important for training the AI, doing calculations, and handling user queries in real time. Small businesses that don't have massive infrastructure can choose cloud computing services like AWS, Google Cloud, or Microsoft Azure, which provide the necessary hardware and can scale performance to your needs.

Data Quality and Quantity

AI requires access to a lot of clean, organized data, which is essential for training AI to identify patterns, make correct predictions, and answer questions. Collecting and preparing this kind of high-quality, error-free data in large amounts can be difficult, often taking up to 80% of the time from the start of the project to its deployment. Businesses that don't have massive amounts of structured data can turn to public datasets or purchase data.
Expertise

Effective AI implementation requires a strong team capable of creating algorithms, analyzing data, and training models. It involves complex math and statistics and advanced software skills like programming in Python or R, using machine learning frameworks (e.g. TensorFlow or PyTorch), and applying data visualization tools. For businesses that can't afford to gather and maintain a professional AI team, the solution is to partner with niche companies that focus on AI development services, like SciForce. Specialized service providers have the technical skills and business experience to create AI solutions tailored to your needs.

Integration

Integrating AI into existing business operations requires planning to ensure smooth incorporation with current software and workflows, avoiding significant disruptions. Challenges include resolving compatibility issues, ensuring data synchronization, and maintaining workflow efficiency as AI features are introduced. To overcome them, choose AI solutions that are compatible with standard business software, focusing on those with APIs and SDKs for seamless integration, and prefer AI platforms with plug-and-play features for CRM and ERP systems. SciForce offers integration services, specializing in AI solutions that integrate effortlessly with existing software, hardware, and operations with zero disruption.

Ongoing Maintenance and Updates

Before implementing AI solutions in your company, remember that AI systems need regular updates, including a consistent data stream and software improvements. This helps AI adapt, learn from new inputs, and stay secure against threats. If you create AI from scratch, you will need a permanent internal team to maintain it. If you opt for an out-of-the-box solution, the vendor will deliver automatic updates. Partnering with SciForce, you receive managed AI services, with our professionals handling the maintenance and updates of your system.

3. Choose Your AI Tools and Technologies

With the variety of AI/ML tools available on the market, it's hard to choose the one that will suit your needs, especially if it's your first AI project. Here, we asked our ML experts to share the top tools they use in their everyday work.

Databases

AI/ML can't exist without databases, which are the foundation for data handling, training, and analysis. SciForce's top choice is Qdrant, a specialized vector database that excels in this role by offering flexibility, high performance, and secure hosting options. It's particularly useful for creating AI assistants over organizational data.

Machine Learning

Here is our selection of tools that make AI model management and deployment easier.

Speech Processing Frameworks

These tools help our team refine voice recognition and teach computers to understand human language better.

Large Language Models

There are lots of tools for working with LLMs, but many of them are complex inside and not straightforward. Our team picked some tools that simplify working with LLMs.

Data Science

Our Data Science team considers the DoWhy library a valuable tool for causal analysis. It helps to analyze and work with data in more depth, focusing on the cause-and-effect connections between different elements.

4. Start Small and Scale Gradually

Begin with small AI projects to see what works best for your business. Learn from these projects and gradually implement more complex AI solutions.

- Be focused. Start with a small, well-defined AI project that addresses a specific business need or pain point. This could be automating a single task or improving a specific process. Define clear, achievable objectives for your initial AI project; this helps in measuring success and learning from the experience.
- Gather a cross-functional team. Assemble a team with diverse skills, including members from the relevant business unit, IT, and people with the specific AI skills you need. This ensures the project benefits from different perspectives. You can also turn to a service provider with relevant expertise.
- Use Available Data. Begin with the data you already have. This approach helps in understanding the quality and availability of your data for AI applications. If you lack data, consider using public datasets or purchasing them.
- Scale Based on Learnings. Once you have the first results, review them and plan your next steps. Having achieved your first goals, you can plan to expand the scope of AI within your business.
- Build on Success. Use the success of your initial projects to encourage the wider use of AI in your organization. Share what worked and what you learned to get support from key decision-makers.
- Monitor and Adjust. In managing AI initiatives, it's critical to regularly assess their impact and adapt as needed. Define key performance indicators (KPIs) relevant to each project, such as process efficiency or customer engagement metrics, and employ analytics tools for ongoing monitoring to ensure continuous alignment with business goals. Read on to learn how to assess AI performance within your business.

To make the most of AI for your business, it's essential to measure its impact using Key Performance Indicators (KPIs). These indicators help track AI performance and guide improvements, ensuring that AI efforts deliver clear results and drive your business forward.

1. Defining Success Metrics

To benefit from AI in your business, it's crucial to pick the right KPIs. These should align with your main business objectives and clearly show how your AI projects are performing:

1. Align with Business Goals. Start by reviewing your business objectives. Whether it's growth, efficiency, or customer engagement, ensure your KPIs are directly linked to these goals.
2. Identify AI Impact Areas. Pinpoint where AI is expected to make a difference. Is it streamlining operations, enhancing customer experiences, or boosting sales?
3. Choose Quantifiable Metrics. Select metrics that offer clear quantification. This might include numerical targets, percentages, or specific performance benchmarks.
4. Ensure Relevance and Realism. KPIs should be both relevant to the AI technology being used and realistic in terms of achievable outcomes.
5. Plan for Continuous Review. Set up a schedule for regular KPI reviews to adapt and refine your metrics as needed, based on evolving business needs and AI capabilities.

Baseline Measurement and Goal Setting

Record key performance metrics before integrating AI to serve as a reference point. This helps in directly measuring AI's effect on your business, such as tracking improvements in customer service response times and satisfaction scores. Once you have a baseline, set realistic goals for what you want to achieve with AI. These should be challenging but achievable, tailored to the AI technology you're using and the areas you aim to enhance.

Regular Monitoring and Reporting

Regularly checking KPIs and keeping up with consistent reports is essential. This ongoing effort makes sure AI efforts stay in line with business targets, enabling quick changes based on real results and feedback.

1. Reporting Schedule. Establish a fixed schedule for reports, such as monthly or quarterly, to consistently assess KPI trends and impacts.
2. Revenue Monitoring. Monitor revenue shifts, especially those related to AI projects, to measure their direct impact on sales.
3. Operational Costs Comparison. Analyze operational expenses before and after AI adoption to evaluate financial savings or efficiencies gained.
4. Customer Satisfaction Tracking. Regularly survey customer satisfaction, noting changes that correlate with AI implementations, to assess AI's effect on service quality.

ROI Analysis of AI Projects

Determining the Return on Investment (ROI) of any project is essential for smart investment in technology. Here's a concise guide to calculating ROI for AI projects:

1. Cost-Benefit Analysis. List all expenses for your AI project, such as development costs, software and hardware purchases, maintenance fees, and training for your team. Then, determine the financial benefits the AI project brings, such as increased revenue and cost savings.
2. ROI Calculation. Determine the financial gains your AI project brings, including any increase in sales or cost reductions. Calculate the net benefits by subtracting the total costs from these gains. Then, find the ROI: ROI = (net benefits / total costs) × 100%.
3. Ongoing Evaluation. Continuously revise your ROI analysis to include any new data on costs or benefits. This keeps your assessment accurate and helps adjust your AI approach as necessary.

Future Growth Opportunities

Use the success of your current AI projects as a springboard for more growth and innovation. By looking at how these projects have improved your business, you can plan new ways to use AI for even better results.

Expanding AI Use. Search for parts of your business that haven't yet benefited from AI, using your previous successes as a guide. For example, if AI has already enhanced your customer service, you might also apply it to make your supply chain more efficient.

Building on Success. Review your best-performing AI projects to see why they succeeded. Plan to apply these effective strategies more broadly or deepen their impact for even better results.

Staying Ahead with AI. Keep an eye on the latest in AI and machine learning to spot technologies that could address your current needs or open new growth opportunities. Use the insights from your AI projects to make smart, data-informed choices about where to focus your AI efforts next.

AI transforms business operations by making them more efficient and intelligent. It upgrades product quality, personalizes services, and streamlines inventory with predictive analytics. Crucial for maintaining a competitive edge, AI optimizes customer experiences and enables quick adaptation to market trends, helping businesses lead in their sectors.

Computer Vision

Computer Vision (CV) empowers computers to interpret and understand visual data, allowing them to make informed decisions and take actions based on what they "see." By automating tasks that require visual inspection and analysis, businesses can increase accuracy, reduce costs, and open up new opportunities for growth and customer engagement.

- Quality Control in Manufacturing. CV streamlines the inspection process by quickly and accurately identifying product flaws, surpassing manual checks. This ensures customers receive only top-quality products.
- Retail Customer Analytics. CV analyzes store videos to gain insights into how customers shop, what they prefer, and how they move around. Retailers can use this data to tailor marketing efforts and arrange stores in ways that increase sales and improve shopping experiences.
- Automated Inventory Management. CV helps manage inventory by using visual recognition to keep track of stock levels, making restocking automatic and reducing the need for manual stock checks. This increases operational efficiency, keeps stock at ideal levels, and avoids overstocking or running out of items.

Case: EyeAI – Space Optimization & Queue Management System

Leveraging Computer Vision, we created EyeAI, SciForce's custom video analytics product for space optimization and queue management. It doesn't require purchasing additional hardware or complex integrations: you can start using it immediately, even with a single camera in your space.

- Customer Movement Tracking: our system observes how shoppers move and what they buy, allowing us to personalize offers and improve their shopping journey.
- Store Layout Optimization: we use these insights to arrange stores more intuitively, placing popular items along common paths to encourage purchases.
- Traffic Monitoring: by tracking shopper numbers and behavior, we adjust staffing and marketing to better match customer flow.
- Checkout Efficiency: we analyze line lengths and waiting times, adjusting staff to reduce waits and streamline checkout.
- Identifying Traffic Zones: we pinpoint high- and low-traffic areas to optimize product placement and store design, enhancing the overall shopping experience.

Targeted at the HoReCa, retail, public security, and healthcare sectors, EyeAI analyzes customer behavior and movement and provides insights for space optimization, better security, and better customer service.

Natural Language Processing

Natural Language Processing (NLP) allows computers to handle and make sense of human language, letting them respond appropriately to text and spoken words. Automating language-related tasks helps businesses improve accuracy, cut down on costs, and create new ways to grow and connect with customers.

Customer Service Chatbots. NLP enables chatbots to answer customer questions instantly and accurately, improving satisfaction by cutting down wait times. This technology helps businesses expand their customer service without significantly increasing costs.

Sentiment Analysis for Market Research. NLP examines customer opinions in feedback, social media, and reviews to gauge feelings towards products or services. These insights guide better marketing, product development, and customer service strategies.

Automated Document Processing. NLP automates the handling of large amounts of text data, from emails to contracts. It simplifies tasks like extracting information, organizing data, and summarizing documents, making processes faster and reducing human errors.

Case: Recommendation and Classification System for Online Learning Platform

We improved a top European online learning platform using advanced AI to make the user experience even better. Knowing that personalized recommendations are key (around 80% of Netflix views and 60% of YouTube views come from them), our client wanted a powerful system to recommend and categorize courses for each user's tastes. The goal was to make users more engaged and loyal to the platform. We needed to enhance how users experience the platform and introduce a new feature that automatically sorts new courses based on what users like. We approached this project in several steps:

- Gathering Data: first, we set up a system to collect and organize the data we needed.
- Building a Recommendation System: we created a system that suggests courses to users based on their preferences, using techniques that understand natural language and content similarity.
- Creating a Classification System: we developed a way to continually classify new courses so they can be recommended accurately.
- Integrating Systems: we smoothly added these new systems into the platform, making sure users get personalized course suggestions.

The platform now automatically personalizes content for each user, making learning more tailored and engaging. Engagement went up by 18%, and the value users get from the platform increased by 13%.

Adopting AI and ML is about setting bold goals, upgrading tech, using resources smartly, accessing top data, building an expert team, and aiming for continuous improvement. It isn't just about competing successfully: it's about being a trendsetter. Here at SciForce, we combine AI innovation with practical solutions, delivering clear business results. Contact us for a free consultation.
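We cannot share the production code from the case study above, but the content-similarity idea behind such recommendations can be shown in a minimal sketch. It assumes scikit-learn and uses a few hypothetical course descriptions in place of real platform data:

```python
# Minimal sketch of content-based course recommendation via text similarity.
# Hypothetical data; not the production system described in the case study.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

courses = {
    "Intro to Python": "Learn Python basics, variables, loops and functions.",
    "Data Analysis with Pandas": "Analyze tabular data in Python using pandas.",
    "Watercolor Painting": "Brush techniques, color mixing and composition.",
}

titles = list(courses)
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(courses.values())   # one TF-IDF vector per course

def recommend(liked_title: str, top_n: int = 2):
    """Return the courses most similar to one the user liked."""
    idx = titles.index(liked_title)
    scores = cosine_similarity(matrix[idx], matrix).ravel()
    ranked = scores.argsort()[::-1]
    return [(titles[i], round(float(scores[i]), 3)) for i in ranked if i != idx][:top_n]

print(recommend("Intro to Python"))
```

In a production system the TF-IDF vectors would typically be replaced with richer language-model embeddings stored in a vector database such as Qdrant, mentioned above.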

Face Detection Explained: State-of-the-Art Methods and Best Tools

So many of us have used Facebook applications to see ourselves aging, turned into rock stars, or wearing festive make-up. Such waves of facial transformations are usually accompanied by warnings not to share images of your face, or else they will be processed and misused. But how does AI use faces in reality? Let's discuss state-of-the-art applications for face detection and recognition.

First, detection and recognition are different tasks. _Face detection_ is the crucial first part of face recognition: it determines the number and location of faces in a picture or video without remembering or storing any details. It may infer some demographic data like age or gender, but it cannot recognize individuals. _Face recognition_ identifies a face in a photo or video image against a pre-existing database of faces. Faces first need to be enrolled into the system to create a database of unique facial features. Afterward, the system breaks a new image down into key features and compares them against the information stored in the database.

First, the computer examines a photo or a video image and tries to distinguish faces from other objects in the background. There are several methods a computer can use to achieve this, compensating for illumination, orientation, or camera distance. Yang, Kriegman, and Ahuja presented a classification of face detection methods. The methods are divided into four categories, and a face detection algorithm can belong to two or more groups.

The knowledge-based method relies on a set of rules developed by humans according to our knowledge: we know that a face must have a nose, eyes, and a mouth at certain distances and positions from each other. The problem with this method is building an appropriate set of rules. If the rules are too general or too detailed, the system ends up with many false positives. The approach also does not work equally well for all skin colors and depends on lighting conditions that can change the exact hue of a person's skin in the picture.

The template matching method uses predefined or parameterized face templates to locate or detect faces by the correlation between the templates and input images. The face model can be constructed from edges using an edge detection method. A variation of this approach is the _controlled background technique_: if you are lucky to have a frontal face image and a plain background, you can remove the background, leaving only the face boundaries. For this approach, the software has several classifiers for detecting various types of front-on faces and some for profile faces, such as detectors of eyes, a nose, a mouth, and in some cases even a whole body. While the approach is easy to implement, it is usually inadequate for face detection on its own.

The feature-based method extracts structural features of the face. It is trained as a classifier and then used to differentiate facial and non-facial regions. One example is color-based face detection, which scans colored images or videos for areas with typical skin color and then looks for face segments. _Haar Feature Selection_ relies on properties common to human faces to form matches from facial features: the location and size of the eyes, mouth, and bridge of the nose, and the oriented gradients of pixel intensities. There are 38 layers of cascaded classifiers, obtaining a total of 6,061 features for each frontal face. Pre-trained classifiers are publicly available.
Histogram of Oriented Gradients (HOG) is a feature extractor for object detection. The features extracted are the distributions (histograms) of the directions of gradients (oriented gradients) of the image. Gradients are typically large around edges and corners, which allows us to detect those regions. Instead of considering pixel intensities, HOG counts the occurrences of gradient vectors representing the light direction to localize image segments. The method uses overlapping local contrast normalization to improve accuracy.

The more advanced appearance-based method depends on a set of representative training face images to learn face models. It relies on machine learning and statistical analysis to find the relevant characteristics of face images and extract features from them. This method unites several algorithms:

_Eigenface-based algorithms_ efficiently represent faces using Principal Component Analysis (PCA). PCA is applied to a set of images to lower the dimensionality of the dataset while best describing the variance of the data. In this method, a face can be modeled as a linear combination of eigenfaces (a set of eigenvectors). Face recognition, in this case, is based on comparing the coefficients of this linear representation.

_Distribution-based_ algorithms like PCA and Fisher's Discriminant define the subspace representing facial patterns. They usually have a trained classifier that identifies instances of the target pattern class among background image patterns.

_Hidden Markov Models_ are a standard method for detection tasks. Their states would be the facial features, usually described as strips of pixels.

_Sparse Network of Winnows_ defines two linear units or target nodes: one for face patterns and one for non-face patterns.

_Naive Bayes Classifiers_ compute the probability of a face appearing in the picture based on the frequency of occurrence of a series of patterns over the training images.

_Inductive learning_ uses algorithms such as Quinlan's C4.5 or Mitchell's FIND-S to detect faces, starting with the most specific hypothesis and generalizing.

_Neural networks_, such as GANs, are among the most recent and most powerful methods for detection problems, including face detection, emotion detection, and face recognition.

In video, you can also use movement as a guide. One specific face movement is blinking, so if the software can determine a regular blinking pattern, it concludes there is a face. Various other motions indicate that the image may contain a face, such as flared nostrils, raised eyebrows, wrinkled foreheads, and opened mouths. When a face is detected and a particular face model matches a specific movement, the model is laid over the face, enabling face tracking to pick up further face movements. State-of-the-art solutions usually combine several methods, for example extracting features to be used in machine learning or deep learning algorithms.

There are dozens of face detection solutions, both proprietary and open-source, that offer various features, from simple face detection to emotion detection and face recognition.
Amazon Rekognition is based on deep learning and is fully integrated into the Amazon Web Services ecosystem. It is a robust solution for both face detection and recognition, and it can detect eight basic emotions like "happy", "sad", and "angry". You can detect up to 100 faces in a single image with this tool. There is an option for video, and the pricing differs by usage type.

Face++ is a face analysis cloud service that also has an offline SDK for iOS and Android. You can perform an unlimited number of requests, but only three per second. It supports Python, PHP, Java, JavaScript, C++, Ruby, iOS, and Matlab, providing services like gender and emotion recognition, age estimation, and landmark detection. They primarily operate in China, are exceptionally well funded, and are known for their inclusion in Lenovo products. However, bear in mind that its parent company, Megvii, was sanctioned by the US government in late 2019.

Face Recognition and Face Detection API (Lambda Labs) provides face recognition, face detection, eye position, nose position, mouth position, and gender classification. It offers 1,000 free requests per month.

Kairos offers a variety of image recognition solutions. Their API endpoints include identifying gender, age, facial recognition, and emotional depth in photos and videos. They offer a 14-day free trial with a limit of 10,000 requests and provide SDKs for PHP, JS, .NET, and Python.

Microsoft Azure Cognitive Services Face API allows 30,000 requests per month (20 requests per minute) on the free tier. For paid use, the price depends on the number of recognitions per month, starting from $1 per 1,000 recognitions. Features include age estimation, gender and emotion recognition, and landmark detection. SDKs support Go, Python, Java, .NET, and Node.js.

Paravision is a face recognition company for enterprises providing self-hosted solutions. Face and activity recognition and COVID-19 solutions (face recognition with masks, integration with thermal detection, etc.) are among their services. The company has SDKs for C++ and Python.

Trueface also serves enterprises, providing features like gender recognition, age estimation, and landmark detection as a self-hosted solution.

Ageitgey/face_recognition is a GitHub repository with 40k stars and one of the most extensive face recognition libraries. The contributors claim it to be the "simplest facial recognition API for Python and the command line." Its drawbacks are that the latest release dates back to 2018 and the model's recognition accuracy of 99.38%, which could be better by 2021 standards. It also does not have a REST API.

Deepface is a framework for Python with 1.5k stars on GitHub, providing facial attribute analysis like age, gender, race, and emotion. It also provides a REST API.

FaceNet, developed by Google, is implemented as a Python library. The repository boasts 11.8k stars, though the last significant updates were in 2018. The recognition accuracy is 99.65%, and it does not have a REST API.

InsightFace is another Python library, with 9.2k stars on GitHub, and the repository is actively updated. The recognition accuracy is 99.86%. They claim to provide a variety of algorithms for face detection, recognition, and alignment.

InsightFace-REST is an actively updated repository that "aims to provide convenient, easily deployable and scalable REST API for InsightFace face detection and recognition pipeline using FastAPI for serving and NVIDIA TensorRT for optimized inference."

OpenCV isn't an API, but it is a valuable tool with over 3,000 optimized computer vision algorithms. It offers many options for developers, including the EigenFaceRecognizer and LBPHFaceRecognizer face recognition modules.
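Since OpenCV is open-source and the quickest of these tools to try, here is a minimal detection sketch using the pre-trained Haar cascade discussed earlier. It assumes the opencv-python package and a hypothetical local image file:

```python
# Minimal face detection sketch with OpenCV's pre-trained Haar cascade.
# Assumes opencv-python is installed; "group.jpg" is a hypothetical image path.
import cv2

cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
detector = cv2.CascadeClassifier(cascade_path)

image = cv2.imread("group.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)   # Haar features work on grayscale

faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5, minSize=(40, 40))
print(f"Detected {len(faces)} face(s)")

for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)  # draw bounding boxes
cv2.imwrite("group_detected.jpg", image)
```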
OpenFace is a Python and Torch implementation of face recognition with deep neural networks. It rests on the CVPR 2015 paper "FaceNet: A Unified Embedding for Face Recognition and Clustering."

Face detection is the first step for further face analysis, including recognition, emotion detection, or face generation, and it is crucial for collecting all the data needed for further processing. Robust face detection is a prerequisite for sophisticated recognition, tracking, and analytics tools, and the cornerstone of computer vision.
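To make the detection-versus-recognition distinction from the beginning of this post concrete, here is a minimal sketch with the face_recognition library mentioned above. The image file names are hypothetical placeholders:

```python
# Sketch of detection vs. recognition with the ageitgey/face_recognition library.
# "group.jpg" and "alice.jpg" are hypothetical local images.
import face_recognition

# Detection: locate faces without identifying anyone.
group = face_recognition.load_image_file("group.jpg")
locations = face_recognition.face_locations(group)       # list of (top, right, bottom, left)
print(f"{len(locations)} face(s) found")

# Recognition: compare detected faces against one "enrolled" reference face.
known = face_recognition.face_encodings(
    face_recognition.load_image_file("alice.jpg"))[0]     # 128-d embedding of a known person
candidates = face_recognition.face_encodings(group, known_face_locations=locations)

for i, enc in enumerate(candidates):
    match = face_recognition.compare_faces([known], enc, tolerance=0.6)[0]
    print(f"face {i}: {'matches Alice' if match else 'unknown'}")
```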

Brain-Computer Interfaces: Your Favorite Guide

At the beginning of April 2021, Neuralink's new video featuring a monkey playing Pong with his mind hit the headlines. The company's as-always-bold statements promise to give back the freedom of movement to people with disabilities. We decided to look beyond the hype and define what these brain-computer systems are capable of in reality. Let's dive right into it.

Brain-computer interfaces (BCIs), or brain-machine interfaces (BMIs), capture a user's brain activity and translate it into commands for an external application. Though both terms are near-synonymous, BCI refers to externally recorded signals (e.g., electroencephalography), while BMI gathers signals from implanted sources. We use the term BCI further as an inclusive one, implying that both brain and system are on par in the interactive, adaptive control crucial for a successful BCI.

What are the BCI applications?

Initially, the development of BCIs aimed to help paralyzed patients control assistive devices with their thoughts. It is also crucial for stroke patients' rehabilitation devices. BCI has proved to be efficient with various mental activities like higher-order cognitive tasks (e.g., calculation), language, imagery, and selective attention tasks (auditory, tactile, and visual attention). In practice, BCIs can help people who have lost the freedom of movement to restore their independence in daily life. In March 2021, the BrainGate research consortium presented a wireless brain-computer interface replacing the "gold standard" wired system. Though the wireless BCI system is only the first step towards the primary goal, it can provide the ability to move for patients without a caregiver's interaction. Moreover, BCIs are entering the mass market with new use cases, which we will dwell on further.

What are the types of BCIs?

Brain-computer interfaces can be divided into three major groups, depending on the technique used to measure the brain's signal.

What types of brain signals does a BCI acquire?

The system can use any of the brain's electrical signals measured on the scalp, on the cortical surface, or in the cortex to control an external application. Formally speaking, several classes of signals are the most researched.

Signal acquisition means measuring the brain's signals, using EEG techniques for the brain's electrical signals or fMRI for the brain's blood flow, to define the user's intentions; the principle is the same for other approaches. Feature extraction means analyzing the digital signals to define the user's intent, filtering out irrelevant signals and "compressing" them into a form suitable for feature translation. Feature translation is when the signals are converted into commands for the output device, reflecting the user's intent. Device output supports functions like letter selection, robotic arm operation, cursor control, etc. It also provides feedback for the user, closing the control loop.

Per the European Commission initiative for BCI research, Brain/Neural Computer Interaction Horizon 2020, the actual applications include the following.

Machine learning techniques. Since online usage of BCIs generates unlabeled data, the authors of recent research (David Hübner et al.) opted for unsupervised learning to design a novel classification approach. K. Palani Thanaraj et al. demonstrated the effectiveness of using deep learning networks for epilepsy detection. This implies broader usage of these techniques in further BCI development.
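The studies above do not share a single canonical pipeline, but the signal acquisition, feature extraction, and feature translation loop can be illustrated with a toy sketch on synthetic "EEG". It assumes NumPy, SciPy, and scikit-learn; the sampling rate and frequency bands are illustrative choices, not a recommendation:

```python
# Toy sketch of the BCI loop described above (acquisition -> feature extraction ->
# translation), using synthetic "EEG" and standard scientific Python libraries.
import numpy as np
from scipy.signal import butter, filtfilt, welch
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

fs = 250  # sampling rate in Hz (typical for consumer EEG; an assumption here)

def bandpower(trial, low, high):
    """Average spectral power of one channel in a frequency band."""
    freqs, psd = welch(trial, fs=fs, nperseg=fs)
    return psd[(freqs >= low) & (freqs <= high)].mean()

def features(trial):
    # Mu (8-12 Hz) and beta (13-30 Hz) band power, commonly used in motor-imagery BCIs.
    b, a = butter(4, [1, 40], btype="bandpass", fs=fs)
    clean = filtfilt(b, a, trial)                      # basic band-pass filtering
    return [bandpower(clean, 8, 12), bandpower(clean, 13, 30)]

# Synthetic training data: 40 one-second single-channel trials, two "imagined movement" classes.
rng = np.random.default_rng(0)
X = np.array([
    features(rng.normal(size=fs) + label * np.sin(2 * np.pi * 10 * np.arange(fs) / fs))
    for label in (0, 1) for _ in range(20)
])
y = np.repeat([0, 1], 20)

clf = LinearDiscriminantAnalysis().fit(X, y)             # feature translation
command = clf.predict([features(rng.normal(size=fs))])   # device output, e.g. move cursor left/right
print("decoded command:", command[0])
```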
TeleBCI. Simultaneously, the spike in telehealth is affecting BCI too. By telehealth, we mean delivering healthcare services remotely via modern electronic devices; check our post "Top 5 Medical Specialties Most Interested in Telehealth" for details. As one study shows, a teleBCI (telemedical BCI) could provide an alternative way of communication, like a virtual keyboard, for paralyzed patients (Andrew Geronimo & Zachary Simmons). As the BrainGate use case of wireless BCI shows, assisting patients who use remote BCI systems is particularly helpful under the pandemic.

Computer vision. The combination of neuroscience and computer vision is an emerging trend. Research shows the efficiency of using machine learning techniques to understand the brain's activity patterns in the "EEG-as-image" approach (Jacob Jiexun Liao et al.).

Restoring communication and motor functions. Improvements in restoring motor and communication functions include using noninvasive EEG-based BCIs (Aziz Koçanoğulları et al.). Alborz Rezazadeh Sereshkeh et al. showed how combined EEG and fNIRS measurements could enhance the classification accuracy of BCIs for imagined speech recognition.

It is already happening. The Canadian startup _Muse_ developed an EEG-based application to measure sleep and focus quality and assist in meditation. _Dream_ is another consumer-targeted BCI headband for sleep improvement. Moreover, market segments like virtual gaming, military communication, and home control systems are the primary drivers in the industry. For example, _NeuroSky_, a US company, has been developing EEG-based headsets for gaming and development since 2009. Other companies, like _Neurosity_ and _NextMind_, are developing devices for visual attention decoding and productivity enhancement.

What are the BCI market size and forecasts?

The forecasts for the BCI field are promising. The expected revenue in 2027 is USD 3.7 billion, growing from USD 1,390.49 million in 2020 at a compound annual growth rate (CAGR) of 15%. Trends in BCI promise that more user-friendly and portable devices will spread both to narrow clinical niches and to the mass market. Moreover, high-fidelity signal acquisition and processing and the application of machine learning techniques contribute to the industry's development. Additionally, researchers state that the future of BCI relies heavily on the following factors:

Memorability in Computer Vision

Among the many things that define us as humans is our ability to remember things such as images in great detail, sometimes after a single view. What is even more interesting, humans tend to remember and forget the same things, suggesting that there might be some general internal capability to encode and discard the same types of information. What makes certain images more memorable than others? Research suggests that pictures of people and of salient actions and events are more memorable than natural landscapes, while images that lack distinctiveness are soon forgotten. We can conclude that memorable and forgettable images must have certain intrinsic visual features that make some information easier to remember than other information. Supporting this, a number of computer vision projects, such as Isola 2011, Khosla 2013, and Dubey 2015, managed to reliably estimate the memorability ranks of novel pictures. However, the task of predicting image memorability is quite complex: images that are memorable do not even look alike. A baby elephant, a kitchen, an abstract painting, and an old man's face can have the same level of memorability, but no visual recognition algorithm would cluster them together. So what are the common visual features of memorable, or forgettable, images? And is it even possible to predict which images people will remember?

Memorability is a relatively new concept in computer vision that assesses the chance that a particular image will be stored in short-term or long-term memory. From the psychological perspective, visual memory has been a focus of research for decades: thanks to psychological research, we know that different images are remembered more or less well depending on many factors concerning intrinsic visual appearance and the user's context. In computer vision, researchers have found that color, simple image features derived from pixel statistics, and object statistics, such as the number of objects, do not correlate strongly with memorability. The factors that do play a role are object and scene semantics, aesthetics and interestingness, and high-level visual attributes (such as emotions, actions, movements, and the appearance of objects). Besides, people tend to memorize the same images, which gives us hope that memorability is something we can measure and predict.

It is well known that the basis of any successful ML project is the availability of extensive and meaningful data. As memorability research advanced, several small datasets were developed and publicly released as parts of specific projects, covering face photographs, scene categories, visualization pictures, and the affective impact on image memorability. The most important is MIT's large-scale image memorability dataset (LaMem), containing roughly 60,000 images annotated by crowdsourcing, which was published together with a memorability prediction model (MemNet) for benchmarking the task. As visual memory studies progressed, new research expanded to video memorability, resulting in the creation of the large-scale VideoMem dataset containing 10,000 soundless videos of 7 seconds each.

In recent years, a number of deep learning projects have emerged to address the task of memorability prediction. The models they introduced managed to achieve results close to human consistency (0.68), with MemNet being the most prominent one. The established idea is to treat memorability prediction as a regression task.
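As an illustration of that regression framing (not MemNet itself), here is a minimal sketch assuming PyTorch, a pretrained torchvision ResNet standing in for the Hybrid-CNN, and random tensors in place of LaMem images and scores:

```python
# Minimal sketch of the regression framing: fine-tune a pretrained CNN to output a
# single memorability score. A torchvision ResNet stands in for the Hybrid-CNN used
# by MemNet; images and scores here are random placeholders, not LaMem data.
import torch
import torch.nn as nn
from torchvision import models
from scipy.stats import spearmanr

model = models.resnet18(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, 1)       # single real-valued output
criterion = nn.MSELoss()                             # "Euclidean" loss for regression
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

images = torch.randn(16, 3, 224, 224)                # placeholder batch of images
scores = torch.rand(16, 1)                           # placeholder memorability scores in [0, 1]

model.train()
for _ in range(5):                                   # a few toy optimization steps
    optimizer.zero_grad()
    loss = criterion(model(images), scores)
    loss.backward()
    optimizer.step()

# Evaluation: rank agreement between predictions and ground truth (Spearman's rho),
# the same consistency measure quoted for human observers above.
model.eval()
with torch.no_grad():
    rho, _ = spearmanr(model(images).squeeze(1).numpy(), scores.squeeze(1).numpy())
print(f"Spearman correlation on the toy batch: {rho:.2f}")
```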
Among the proposed models, MemNet, developed by MIT, is considered the most successful and well-known one. It is based on convolutional neural networks (CNNs), which have proven successful in various visual recognition tasks [10, 21, 29, 35, 32]. As memorability depends on both scenes and objects, the first step in developing the model was to initialize training with the pre-trained Hybrid-CNN [37], trained on both ILSVRC 2012 [30] and the Places dataset [37]. Since memorability is a single real-valued output, the Hybrid-CNN was fine-tuned with a Euclidean loss layer. A similar approach is used to predict the memorability of videos on the basis of the VideoMem dataset.

Treating memorability prediction not as a regression but as a classification task, Technicolor developed a model that could even surpass MemNet. The model used semantic features derived from an image captioning (IC) system. Such an IC model builds an encoder comprising a CNN and a long short-term memory recurrent network (LSTM) for learning a joint image-text embedding. The CNN image feature and the word2vec representation of the image caption are projected onto a 2D embedding space, which enforces the alignment between an image and its corresponding semantic caption and can be used to predict image memorability. A set of hyper-parameters, including the number of neurons per layer, the dropout coefficient, the activation function, and the optimizer, was selected with the Bayesian optimization library Hyperas to maximize the average Spearman correlation coefficient between the predicted scores and the ground-truth scores in a 5-fold validation process.

As the ultimate goal of many computer vision tasks is to attract the user's attention, there is much research on the different factors that might increase the chance that users will look at an image or video and gain the desired information from it. In this quest for factors that contribute to image relevance, researchers try to find out whether memorability is related to the interestingness and aesthetics of an image. In general, interestingness is the power of attracting or holding one's attention. Like memorability, it is largely studied in psychology, which has identified its various facets, such as novelty, uncertainty, conflict, and complexity, according to Berlyne. Also as with memorability, users show significant agreement, though finding something interesting is clearly subjective and depends on personal preferences and experiences. However, the bases of image memorability and interestingness are quite different, so there is little correlation between them.

Another aspect of images believed to be correlated with memorability is image aesthetics. Studies show that people are more attracted to highly aesthetically appealing pictures and choose more aesthetically appealing pictures for authentication purposes. However, aesthetics is a fairly ephemeral concept that has to do with beauty and the human appreciation of an object. Though a number of computer vision papers have tried to rate, assess, and predict image aesthetics, this aspect of an image is subjectively derived, and the aesthetic value of an image varies from subject to subject. Hence, contrary to popular belief, unusual or aesthetically pleasing scenes are not necessarily highly memorable.
Evolution has shaped our brain to remember only the information relevant to our survival, reproduction, and happiness. That is why what we remember, and what we forget, is largely shared across people, and this can be used in present-day technology to capture our attention. If machines can predict what we will remember, this can be applied in many areas, including education and learning, content retrieval and search, content summarization, storytelling, content filtering, and advertising, making us even more efficient in our everyday lives.

Machine learning: Changing the Beauty Industry

Several years ago a friend of mine, a biologist, told me about their weird experiments. They were shaving mice, and if this was done in a particular way, the mice remained bald for the rest of their short lives. The team dreamed of extrapolating this magic technique to humans and shaking up the waxing industry. It is unlikely that they succeeded, but the gist of the story is that even serious science tries to serve beauticians when it gets the chance. So maybe it is time for machine learning to transform beauty salons where traditional biology and pharmacy fail?

Machine learning can help the beauty industry in several ways, from providing a statistical basis for attractiveness and helping people look more attractive, to developing products that tackle the specific needs of customers. The core of the future technology is, without doubt, computer vision: the part of AI that deals with the theory and technology of building artificial systems that obtain information from images or multi-dimensional data and process it further. In the beauty industry, computer vision is expected to help recognize facial features, analyze the data obtained, and come up with a prediction or a conclusion about a person's appearance.

On the one hand, the ability of AI-driven computer vision to properly analyze a human face is incredibly handy for testing purposes, and it can help end users choose products and techniques that are perfect for them. In the past, it was nearly impossible to know how a new eye shadow or face cream would actually look on the skin without physically testing it. At present, armies of data scientists are working on AI systems that can understand the human face. Once this is mastered, testing out new looks and products will become exceptionally easy and realistic.

On the other hand, AI can enable a breakthrough in the development of new formulas. Data has always been used to create better products and optimize formulas. Traditionally, a perfume is physically tested, reviewed, and compared before being released. Now data can be used to optimize specific scent ratios to create the next hit. Similarly, data analysis will lead to better cosmetics: leveraging data means better, longer-lasting formulas.

Let's now have a look at how businesses incorporate these ideas into their products. The first and most obvious approach is to use big data to determine what is attractive from the statistical perspective:

NLP and Computer Vision Integrated

Integration and interdisciplinarity are the cornerstones of modern science and industry. One example of recent attempts to combine everything is the integration of computer vision and natural language processing (NLP). Both fields are among the most actively developing machine learning research areas. Yet, until recently, they were treated as separate areas without many ways to benefit from each other. Now, with the expansion of multimedia, researchers have started exploring the possibilities of applying both approaches to achieve a single result.

The most natural way for humans is to extract and analyze information from diverse sources. This conforms to the theory of semiotics (Greenlee 1978), the study of the relations between signs and their meanings at different levels. Semiotics studies the relationship between signs and meaning, the formal relations between signs (roughly equivalent to syntax), and the way humans interpret signs depending on the context (pragmatics in linguistic theory). If we consider purely visual signs, this leads to the conclusion that semiotics can also be approached by computer vision, which extracts interesting signs for natural language processing to realize the corresponding meanings.

Malik summarizes computer vision tasks as the 3Rs (Malik et al. 2016): reconstruction, recognition, and reorganization. Reconstruction refers to the estimation of the 3D scene that gave rise to a particular visual image, by incorporating information from multiple views, shading, texture, or direct depth sensors. The process results in a 3D model, such as a point cloud or depth image. Recognition involves assigning labels to objects in the image. For 2D objects, examples of recognition are handwriting or face recognition, while 3D tasks tackle problems such as object recognition from point clouds, which assists in robotic manipulation. Reorganization means bottom-up vision, where raw pixels are segmented into groups that represent the structure of the image. Low-level vision tasks include edge, contour, and corner detection, while high-level tasks involve semantic segmentation, which partially overlaps with recognition.

It is recognition that is most closely connected to language, because its output can be interpreted as words: objects can be represented by nouns, activities by verbs, and object attributes by adjectives. In this sense, vision and language are connected by means of semantic representations (Gardenfors 2014; Gupta 2009).

NLP tasks are more diverse than computer vision tasks and range from syntax, including morphology and compositionality, through semantics as the study of meaning, including relations between words, phrases, sentences, and discourses, to pragmatics, the study of shades of meaning at the level of natural communication. Some complex tasks in NLP include machine translation, dialogue interfaces, information extraction, and summarization. Switching from images to words is believed to be closest to machine translation. Still, such "translation" between the low-level pixels or contours of an image and a high-level description in words or sentences, the task known as Bridging the Semantic Gap (Zhao and Grosky 2002), remains a wide gap to cross.
The integration of vision and language did not proceed smoothly in a top-down, deliberate manner, with researchers first coming up with a set of principles. Integrated techniques were rather developed bottom-up, as pioneers identified certain rather specific and narrow problems, attempted multiple solutions, and found satisfactory outcomes. The new trajectory started with the understanding that most present-day files are multimedia and contain interrelated images, videos, and natural language texts. For example, a typical news article contains text written by a journalist and a photo related to the news content. Furthermore, there may be a video clip featuring a reporter or a snapshot of the scene where the event in the news occurred. Language and visual data provide two sets of information that are combined into a single story, forming the basis for appropriate and unambiguous communication. This understanding gave rise to multiple applications of an integrated approach to visual and textual content, not only in working with multimedia files but also in the fields of robotics, visual translation, and distributional semantics.

The multimedia-related tasks for NLP and computer vision fall into three main categories: visual properties description, visual description, and visual retrieval.

Visual properties description: a step beyond classification, the descriptive approach summarizes object properties by assigning attributes. Such attributes may be binary values for easily recognizable properties or relative attributes describing a property with the help of a learning-to-rank framework. The key is that the attributes provide a set of contexts as a knowledge source for recognizing a specific object by its properties. The attribute words become an intermediate representation that helps bridge the semantic gap between the visual space and the label space.

Visual description: in real life, the task of visual description is to provide image or video captioning. It is believed that sentences provide a more informative description of an image than a bag of unordered words. To generate a sentence that describes an image, a certain amount of low-level visual information has to be extracted that provides the basic information of "who did what to whom, and where and how they did it". From the part-of-speech perspective, quadruplets of "nouns, verbs, scenes, prepositions" can represent the meaning extracted from visual detectors. Visual modules extract the objects that are either a subject or an object in the sentence. A Hidden Markov Model is then used to decode the most probable sentence from a finite set of quadruplets, along with some corpus-guided priors for verb and scene (preposition) predictions. The meaning is represented using objects (nouns), visual attributes (adjectives), and spatial relationships (prepositions). The sentence is then generated with the help of a phrase fusion technique, using web-scale n-grams to determine probabilities.

Visual retrieval: Content-Based Image Retrieval (CBIR) is another field in multimedia that utilizes language in the form of query strings or concepts. As a rule, images are indexed by low-level vision features like color, shape, and texture. CBIR systems try to annotate an image region with a word, similarly to semantic segmentation, so that the keyword tags are close to human interpretation. CBIR systems use keywords to describe an image for image retrieval, whereas visual attributes describe an image for image understanding. Nevertheless, visual attributes provide a suitable middle layer for CBIR, with an adaptation to the target domain.
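As a toy illustration of the retrieval idea (not any particular CBIR system from the literature), images can be indexed by generic CNN features and ranked by similarity to a query. The sketch below assumes PyTorch/torchvision, and the image file names are hypothetical:

```python
# Minimal CBIR-style sketch: index images by CNN features and retrieve by similarity.
# A pretrained torchvision ResNet serves as a generic feature extractor; file names
# are hypothetical placeholders.
import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image

backbone = models.resnet18(weights="IMAGENET1K_V1")
backbone.fc = torch.nn.Identity()        # keep the 512-d pooled feature, drop the classifier
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(path: str) -> torch.Tensor:
    image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    return F.normalize(backbone(image), dim=1)   # unit-length feature vector

index = {name: embed(name) for name in ["beach.jpg", "city.jpg", "forest.jpg"]}

query = embed("query.jpg")
ranked = sorted(index, key=lambda name: float(query @ index[name].T), reverse=True)
print("most similar images:", ranked)
```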
Robotics vision: robots need to perceive their surroundings through more than one mode of interaction. Similarly to humans, who process perceptual inputs by using their knowledge about things in the form of words, phrases, and sentences, robots need to integrate their perceived picture with language to obtain the relevant knowledge about objects, scenes, actions, or events in the real world, make sense of them, and perform a corresponding action. For example, if an object is far away, a human operator may verbally request an action to reach a clearer viewpoint. Robotics vision tasks relate to how a robot can perform sequences of actions on objects to manipulate the real-world environment, using hardware sensors like depth or motion cameras and having a verbalized image of its surroundings in order to respond to verbal commands.

Situated language: robots use language to describe the physical world and understand their environment. Moreover, spoken language and natural gestures are more convenient ways for a human being to interact with a robot, provided the robot is trained to understand this mode of interaction. From the human point of view, this is a more natural way of interaction. Therefore, a robot should be able to perceive and transform the information from its contextual perception into language using semantic structures. The most well-known approach to representing meaning is Semantic Parsing, which transforms words into logical predicates. Semantic Parsing tries to map a natural language sentence to a corresponding meaning representation, which can be a logical form like λ-calculus, using Combinatory Categorial Grammar (CCG) as rules to compositionally construct a parse tree.

Early multimodal Distributional Semantics Models: the idea behind Distributional Semantics Models (DSMs) is that words in similar contexts should have similar meanings; therefore, word meaning can be recovered from co-occurrence statistics between words and the contexts in which they appear. This approach is believed to be beneficial in computer vision and natural language processing in the form of image embeddings and word embeddings. DSMs are applied to jointly model semantics based on both visual features, like colors, shape, or texture, and textual features, like words. The common pipeline is to map visual data to words and apply distributional semantics models like LSA or topic models on top of them. Visual attributes can approximate the linguistic features for a distributional semantics model.

Neural multimodal Distributional Semantics Models: neural models have surpassed many traditional methods in both vision and language by learning better distributed representations from the data. For instance, Multimodal Deep Boltzmann Machines can model joint visual and textual features better than topic models. In addition, neural models can model some cognitively plausible phenomena such as attention and memory. For attention, an image can first be given an embedding representation using CNNs and RNNs; an LSTM network can then be placed on top and act like a state machine that simultaneously generates outputs, such as image captions, and looks at relevant regions of interest in the image one at a time. For memory, commonsense knowledge is integrated into visual question answering.

If combined, the two tasks can solve a number of long-standing problems in multiple fields, including:

Interspeech 2018 Highlights

This year the SciForce team traveled as far as India to one of the most important events in the speech processing community, the Interspeech conference. It is a truly scientific conference, where every talk, poster, or demo is accompanied by a paper published in the ISCA archive. As usual, it covered most speech-related topics, and even more: automatic speech recognition (ASR) and generation (TTS), voice conversion and denoising, speaker verification and diarization, spoken dialogue systems, language education, and healthcare-related topics.

● This year's keynote theme was "Speech research for emerging markets in multilingual society". With several sessions on providing speech technologies for the dozens of languages spoken in India, it shows an important shift from focusing on a few well-researched languages in developed markets to broader coverage.
● Quite in line with that, while ASR for endangered languages is still a matter of academic research funded by non-profit organizations, ASR for under-resourced languages with a sufficient number of speakers is attractive for industry.
● End-to-end (attention-based) models are gradually becoming mainstream in speech recognition. More traditional hybrid HMM+DNN models (mostly based on the Kaldi toolkit) nevertheless remain popular and provide state-of-the-art results in many tasks.
● Speech technologies in education are gaining momentum, and healthcare-related speech technologies have already formed a big domain.
● Though Interspeech is a speech processing conference, there are many overlaps with other areas of ML, such as Natural Language Processing (NLP) or video and image processing. Spoken language understanding, multimodal systems, and dialogue agents were widely presented.
● The conference covered some fundamental theoretical aspects of machine learning, which apply equally to speech, computer vision, and other areas.
● More and more researchers share their code so that their results can be checked and reproduced.
● Finally, ready-to-use open-source solutions were presented, e.g. HALEF and S4D.

Our Top

At the conference, we focused on topics related to the application of speech technologies to language education and on more general topics such as automatic speech recognition and learning speech signal representations. We also visited two pre-conference tutorials: End-To-End Models for ASR and Information Theory of Deep Learning.

The End-To-End Models for ASR tutorial, given by Rohit Prabhavalkar and Tara Sainath from Google, was undeniably one of the most valuable events of the conference, bringing new ideas and uncovering important details even for quite experienced specialists. Conventional pipelines involve several separately trained components, such as an acoustic model, a pronunciation model, a language model, and second-pass rescoring. In contrast, end-to-end models are typically sequence-to-sequence models that output words or graphemes directly and simplify the pipeline greatly. The tutorial presented several end-to-end ASR models, starting with the earliest approach, Connectionist Temporal Classification (CTC), which receives acoustic data at the input, passes it through an encoder, and outputs a softmax representing the distribution over characters or (sub)words, and its successor RNN-T, which incorporates a jointly trained language model component. Yet most state-of-the-art end-to-end solutions use attention-based models, where the attention mechanism summarizes the encoder features relevant to predicting the next label.
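As a side note, here is roughly what the CTC objective mentioned above boils down to in code. The sketch assumes PyTorch and uses random tensors in place of real encoder outputs and transcripts:

```python
# Minimal sketch of the CTC objective: an encoder emits per-frame distributions over
# characters plus a blank symbol, and CTC sums over all alignments between those
# frames and the (shorter) target transcript. All shapes and values are toy examples.
import torch
import torch.nn as nn

vocab_size = 29        # e.g. 26 letters + space + apostrophe + blank (index 0); an assumption
frames, batch = 50, 2  # 50 encoder time steps, 2 utterances

log_probs = torch.randn(frames, batch, vocab_size, requires_grad=True).log_softmax(dim=-1)  # (T, N, C)
targets = torch.randint(1, vocab_size, (batch, 12))   # character indices, no blanks
input_lengths = torch.full((batch,), frames)
target_lengths = torch.full((batch,), 12)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()        # in real training, gradients flow back into the encoder
print(f"CTC loss: {loss.item():.2f}")
```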
Most modern architectures are improvements on Listen, Attend and Spell (LAS), proposed by Chan et al. in 2015, and on the closely related attention-based models of Chorowski et al. The LAS model consists of an encoder (similar to an acoustic model) with a pyramidal structure that reduces the time resolution, an attention (alignment) model, and a decoder, which plays the role of a pronunciation or language model. LAS offers good results without an additional language model and is able to recognize out-of-vocabulary words. However, to decrease the word error rate (WER) further, special techniques are used, such as shallow fusion, in which a separately trained language model is combined with the end-to-end model's scores during beam-search decoding.

One of the most notable events of this year's Interspeech was a tutorial by Naftali Tishby from the Hebrew University of Jerusalem. Although the author first proposed his information-theoretic approach more than a decade ago, so it is well familiar to the community, and the tutorial was delivered as a Skype teleconference, there were no free seats at the venue. Naftali Tishby started with an overview of deep learning models and information theory. He covered information-plane-based analysis, described the learning dynamics of neural networks and other models, and, finally, showed the impact of multiple layers on the learning process. Although the tutorial is highly theoretical and requires a mathematical background, deep learning practitioners can take away the following useful tips:

● The information plane is a useful tool for analyzing the behavior of complex DNNs.

● If a model can be represented as a Markov chain, it will likely have predictable learning dynamics in the information plane.

● There are two learning phases: capturing the input-target relation and representation compression.

Though his research covers a very small subset of modern neural network architectures, N. Tishby's theory has spawned a lot of discussion in the deep learning community.

There are two major speech-related tasks for foreign language learners: computer-aided language learning (CALL) and computer-aided pronunciation training (CAPT). The main difference is that CALL applications focus on checking vocabulary, grammar, and semantics, while CAPT applications perform pronunciation assessment. Most CALL solutions use ASR at their back end. However, a conventional ASR system trained on native speech is not suitable for this task, due to students' accents, language errors, and many incorrect or out-of-vocabulary (OOV) words. Therefore, techniques from Natural Language Processing (NLP) and Natural Language Understanding (NLU) should be applied to determine the meaning of the student's utterance and detect errors. Most systems are trained on non-native speech corpora with a fixed native language, usually in-house ones.

Most CAPT papers use ASR models in a specific way: for forced alignment. A student's waveform is aligned in time with the textual prompt, and the confidence score for each phone is used to estimate how well the user pronounced that phone (a toy illustration of this scoring idea is sketched below). However, some novel approaches were also presented, where, for example, the relative distance between different phones is used to assess the student's proficiency, with end-to-end training.

Bonus: the Spoken CALL shared task is an annual competition based on a real-world task. Participants from both academia and industry presented solutions that were benchmarked on an open dataset with two parts: speech processing and text processing. The data contain German prompts and English answers given by students. The language (vocabulary, grammar) and the meaning of the responses were assessed independently by human experts. The task is open-ended, i.e. there are multiple ways to say the same thing, and only a few of them are specified in the dataset.
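As a toy illustration of the forced-alignment-based scoring mentioned above, the sketch below assumes that an acoustic model has already produced per-frame phone posteriors and that the prompt has been force-aligned; the average log posterior per phone is a simplified, GOP-style confidence, not the method of any particular paper, and all inputs here are hypothetical placeholders.

```python
# Toy pronunciation scoring from a forced alignment (illustrative only).
# Inputs are hypothetical: per-frame phone posteriors from some acoustic model
# and (phone, start_frame, end_frame) segments from a forced aligner.
import numpy as np

def phone_confidences(posteriors, alignment, phone_to_idx):
    """Average log posterior of each aligned phone segment."""
    scores = {}
    for phone, start, end in alignment:
        segment = posteriors[start:end, phone_to_idx[phone]]
        scores[(phone, start, end)] = float(np.mean(np.log(segment + 1e-10)))
    return scores

rng = np.random.default_rng(0)
posteriors = rng.dirichlet(np.ones(40), size=30)     # 30 frames, 40-phone inventory
phone_to_idx = {"K": 11, "AE": 3, "T": 27}           # hypothetical phone indices
alignment = [("K", 0, 10), ("AE", 10, 22), ("T", 22, 30)]

for segment, score in phone_confidences(posteriors, alignment, phone_to_idx).items():
    verdict = "needs work" if score < -3.0 else "ok"  # illustrative threshold
    print(segment, round(score, 2), verdict)
```

In practice, such raw confidences are usually normalized against competing phones and calibrated on human ratings before being shown to a learner.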
This year, A. Zeyer and colleagues presented a new ASR model showing the best-ever results on the LibriSpeech corpus (about 1000 hours of read English speech): the reported WER is 3.82%. This is another example of an end-to-end model, an improvement of LAS. It uses Byte-Pair Encoding (BPE) subword units, with 10K subword targets in total. For a smaller English corpus, Switchboard (300 hours of telephone-quality speech), the best result, 7.5% WER, was shown by a modification of the lattice-free MMI (Maximum Mutual Information) approach by H. Hadian et al.

Despite the success of end-to-end neural network approaches, one of their main shortcomings is that they need huge databases for training. For endangered languages with few native speakers, creating such a database is close to impossible. This year, as usual, there was a session on ASR for such languages. The most popular approach to this task is transfer learning, i.e. training a model on well-supported language(s) and retraining it on an under-resourced one. Unsupervised discovery of (sub)word units is another widely used approach.

A somewhat different task is ASR for under-resourced languages, where a relatively small dataset (dozens of hours) is usually available. This year, Microsoft organized a challenge on ASR for Indian languages and even shared a dataset containing circa 40 hours of training material and 5 hours of test data in Tamil, Telugu, and Gujarati. The winning system, named "BUT Jilebi", uses Kaldi-based ASR with the LF-MMI objective, speaker adaptation with feature-space maximum likelihood linear regression (fMLLR), and data augmentation with speed perturbation.

This year we also saw many presentations on voice conversion. For example, one voice conversion system, trained on the VCTK corpus (40 hours of native English speech), computes the speaker embedding or i-vector of a new target speaker from a single utterance of that speaker. The results sound a bit robotic, yet the target voice is recognizable. Another interesting approach to word-level speech processing is Speech2Vec. It resembles Word2Vec, widely used in natural language processing, and learns fixed-length embeddings for variable-length spoken word segments. Under the hood, Speech2Vec uses an encoder-decoder model with attention. Other topics included speech synthesis, manner discrimination, unsupervised phone recognition, and many more.
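As a small illustration of the Speech2Vec idea just mentioned (fixed-length embeddings for variable-length spoken words), here is a minimal encoder sketch, assuming PyTorch; the real model is an attention-based encoder-decoder trained with Word2Vec-like objectives, so this only shows how segments of different lengths can be mapped to vectors of one size.

```python
# Minimal sketch: an RNN encoder maps a variable-length segment of acoustic
# frames to a fixed-length vector. Dimensions and inputs are illustrative.
import torch
import torch.nn as nn

class SegmentEncoder(nn.Module):
    def __init__(self, n_features=13, embed_dim=50):
        super().__init__()
        self.rnn = nn.GRU(n_features, embed_dim, batch_first=True)

    def forward(self, segment):          # segment: (1, n_frames, n_features)
        _, h = self.rnn(segment)         # final hidden state: (1, 1, embed_dim)
        return h.squeeze(0).squeeze(0)   # fixed-length embedding: (embed_dim,)

encoder = SegmentEncoder()
short_word = torch.randn(1, 20, 13)      # e.g. a short spoken word (20 MFCC frames)
long_word = torch.randn(1, 55, 13)       # a longer one
print(encoder(short_word).shape, encoder(long_word).shape)  # both torch.Size([50])
```

In Speech2Vec proper, such embeddings are trained so that different realizations of the same word, and words occurring in similar contexts, end up close to each other, mirroring Word2Vec's skip-gram and CBOW training.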
With the development of Deep Learning, the Interspeech conference, originally intended for the speech processing and DSP community, has gradually transformed into a broader platform for communication among machine learning scientists, irrespective of their field of interest. It is becoming the place to share ideas across different areas of machine learning and to inspire multimodal solutions where speech processing works together (and sometimes in the same pipeline) with video and natural language processing. Sharing ideas between fields undoubtedly speeds up progress, and this year's Interspeech has shown several examples of such sharing.

References

1. Graves, A., Fernández, S., Gomez, F., Schmidhuber, J. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. ICML 2006. [pdf]
2. Graves, A. Sequence Transduction with Recurrent Neural Networks. Representation Learning Workshop, ICML 2012. [pdf]
3. Chan, W., Jaitly, N., Le, Q.V., Vinyals, O. Listen, Attend, and Spell. 2015. [pdf]
4. Chorowski, J., Bahdanau, D., Serdyuk, D., Cho, K., Bengio, Y. Attention-Based Models for Speech Recognition. 2015. [pdf]
5. Pundak, G., Sainath, T., Prabhavalkar, R., Kannan, A., Zhao, D. Deep Context: End-to-end Contextual Speech Recognition. 2018. [pdf]
6. Tishby, N., Pereira, F., Bialek, W. The Information Bottleneck Method. Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, 368–377, 1999. [pdf]
7. Evanini, K., Timpe-Laughlin, V., Tsuprun, E., Blood, I., Lee, J., Bruno, J., Ramanarayanan, V., Lange, P., Suendermann-Oeft, D. Game-based Spoken Dialog Language Learning Applications for Young Students. Proc. Interspeech 2018, 548–549. [pdf]
8. Nguyen, H., Chen, L., Prieto, R., Wang, C., Liu, Y. Liulishuo's System for the Spoken CALL Shared Task 2018. Proc. Interspeech 2018, 2364–2368. [pdf]
9. Tu, M., Grabek, A., Liss, J., Berisha, V. Investigating the Role of L1 in Automatic Pronunciation Evaluation of L2 Speech. Proc. Interspeech 2018, 1636–1640. [pdf]
10. Kyriakopoulos, K., Knill, K., Gales, M. A Deep Learning Approach to Assessing Non-native Pronunciation of English Using Phone Distances. Proc. Interspeech 2018, 1626–1630. [pdf]
11. Zeyer, A., Irie, K., Schlüter, R., Ney, H. Improved Training of End-to-end Attention Models for Speech Recognition. Proc. Interspeech 2018, 7–11. [pdf]
12. Hadian, H., Sameti, H., Povey, D., Khudanpur, S. End-to-end Speech Recognition Using Lattice-free MMI. Proc. Interspeech 2018, 12–16. [pdf]
13. He, D., Lim, B.P., Yang, X., Hasegawa-Johnson, M., Chen, D. Improved ASR for Under-resourced Languages through Multi-task Learning with Acoustic Landmarks. Proc. Interspeech 2018, 2618–2622. [pdf]
14. Chen, W., Hasegawa-Johnson, M., Chen, N.F. Topic and Keyword Identification for Low-resourced Speech Using Cross-Language Transfer Learning. Proc. Interspeech 2018, 2047–2051. [pdf]
15. Hermann, E., Goldwater, S. Multilingual Bottleneck Features for Subword Modeling in Zero-resource Languages. Proc. Interspeech 2018. [pdf]
16. Feng, S., Lee, T. Exploiting Speaker and Phonetic Diversity of Mismatched Language Resources for Unsupervised Subword Modeling. Proc. Interspeech 2018, 2673–2677. [pdf]
17. Godard, P., Boito, M.Z., Ondel, L., Berard, A., Yvon, F., Villavicencio, A., Besacier, L. Unsupervised Word Segmentation from Speech with Attention. Proc. Interspeech 2018, 2678–2682. [pdf]
18. Glarner, T., Hanebrink, P., Ebbers, J., Haeb-Umbach, R. Full Bayesian Hidden Markov Model Variational Autoencoder for Acoustic Unit Discovery. Proc. Interspeech 2018, 2688–2692. [pdf]
19. Holzenberger, N., Du, M., Karadayi, J., Riad, R., Dupoux, E. Learning Word Embeddings: Unsupervised Methods for Fixed-size Representations of Variable-length Speech Segments. Proc. Interspeech 2018, 2683–2687. [pdf]
20. Pulugundla, B., Baskar, M.K., Kesiraju, S., Egorova, E., Karafiát, M., Burget, L., Černocký, J. BUT System for Low Resource Indian Language ASR. Proc. Interspeech 2018, 3182–3186. [pdf]
21. Liu, S., Zhong, J., Sun, L., Wu, X., Liu, X., Meng, H. Voice Conversion Across Arbitrary Speakers Based on a Single Target-Speaker Utterance. Proc. Interspeech 2018, 496–500. [pdf]
22. Chung, Y., Glass, J. Speech2Vec: A Sequence-to-Sequence Framework for Learning Word Embeddings from Speech. Proc. Interspeech 2018, 811–815. [pdf]
23. Lee, J.Y., Cheon, S.J., Choi, B.J., Kim, N.S., Song, E. Acoustic Modeling Using Adversarially Trained Variational Recurrent Neural Network for Speech Synthesis. Proc. Interspeech 2018, 917–921. [pdf]
24. Tjandra, A., Sakti, S., Nakamura, S. Machine Speech Chain with One-shot Speaker Adaptation. Proc. Interspeech 2018, 887–891. [pdf]
25. Renkens, V., van Hamme, H. Capsule Networks for Low Resource Spoken Language Understanding. Proc. Interspeech 2018, 601–605. [pdf]
26. Prasad, R., Yegnanarayana, B. Identification and Classification of Fricatives in Speech Using Zero Time Windowing Method. Proc. Interspeech 2018, 187–191. [pdf]
27. Liu, D., Chen, K., Lee, H., Lee, L. Completely Unsupervised Phoneme Recognition by Adversarially Learning Mapping Relationships from Audio Embeddings. Proc. Interspeech 2018, 3748–3752.

Interspeech 2017 flashback and 2018 expectations

Interspeech is the world's largest and most comprehensive conference on the science and technology of spoken language processing. Interspeech 2017 gathered circa 2000 participants in Stockholm, Sweden, exceeding the expected capacity of the conference. There were lots of great people to meet and listen to: Hideki Kawahara, Simon King, Jan Chorowski, Tara N. Sainath, and many, many others. The paper acceptance rate was traditionally rather high, at 51%; ICASSP 2017 had a similar number, while other ML-related conferences keep this metric closer to 20–30%.

Most of the works can be classified into one of the following groups:

● Deep neural networks: how can we interpret what they have learned? This should be a nice session to check after the information theory tutorial.

● Low-resource speech recognition challenge for Indian languages. Being low on data is a common situation for anyone working with languages outside the mainstream setting, so any tips and tricks would be really valuable.

● Spoken CALL shared task, second edition: the core event for sampling approaches to language learning.

There will be hundreds of papers presented, and it is impossible to cover all of them; there is also a lot of overlap between sections, especially on day 2. We will try to focus on the following sections:

SciForce: who we are

SciForce is a Ukraine-based IT company specializing in the development of software solutions based on science-driven information technologies. We have wide-ranging expertise in many key AI technologies, including Data Mining, Digital Signal Processing, Natural Language Processing, Machine Learning, Image Processing, and Computer Vision. This focus allows us to offer state-of-the-art solutions in data science-related projects for commerce, banking and finance, healthcare, gaming, media and publishing, and education. We offer AI solutions to any organization or industry that deals with massive amounts of data. Our applications help reduce costs, improve customer satisfaction and productivity, and increase revenues.

Our team boasts over 40 versatile specialists in two offices in Kharkiv and Lviv, the regional capitals and the most rapidly developing IT centers of Eastern and Western Ukraine. Our specialists include managers, architects, developers, designers, and QA specialists, as well as data scientists, medical professionals, and linguists. With such an organizational structure, we have the flexibility both to help our customers launch small or medium short-term projects and to build long-term partnerships. Such partnerships strengthen our partners' in-house teams and change the perception of an offshore team from mere contractors to an important part of the organization, with the corresponding motivation and loyalty. Aside from the development of software _per se_, SciForce renders the full range of consulting services for deploying a new project or fine-tuning an ongoing project that needs re-evaluation or restructuring.

The philosophy of SciForce is not only to hire experienced professionals but to foster specialists through mentoring and knowledge sharing. The dedicated SciForce Academy project helps us find young talents, which not only facilitates internal hiring and transfers but also creates a productive atmosphere of mutual respect, trust, and patience.

In our corporate blog, we are going to share our knowledge and insights into frontier information technologies, provide expert opinions from our specialists, and offer you a glance at our daily life. Stay with us!