SciForce Blog

Read our blog and carry on: NLP

Top-5 NLP news of December by a CTO of Sciforce, Max Ved

Text Preprocessing for NLP and Machine Learning Tasks

As soon as you start working on a data science task, you realize how much your results depend on data quality. The initial step of any data science project, data preparation, sets the basis for the effective performance of any sophisticated algorithm. In textual data science tasks, this means that any raw text needs to be carefully preprocessed before the algorithm can digest it. In the most general terms, we take some predetermined body of text and perform some basic analysis and transformations on it, so that we are left with artifacts that are much more useful for a meaningful analytic task afterward. The preprocessing usually consists of several steps that depend on the given task and text but can be roughly categorized into segmentation, cleaning, normalization, annotation, and analysis.
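To make this concrete, here is a minimal sketch of such a pipeline in Python with NLTK (our illustration; the article itself does not prescribe a library, and the exact steps always depend on the task):

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads of the required NLTK resources.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

def preprocess(text):
    """Segment, clean, and normalize a raw text into lemmatized tokens."""
    # Segmentation: split the text into sentences, then into words.
    sentences = nltk.sent_tokenize(text)
    tokens = [t for s in sentences for t in nltk.word_tokenize(s)]
    # Cleaning: lowercase and drop punctuation and other non-alphabetic tokens.
    tokens = [t.lower() for t in tokens if re.fullmatch(r"[A-Za-z]+", t)]
    # Normalization: remove stopwords and lemmatize what remains.
    stop = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t) for t in tokens if t not in stop]

print(preprocess("The cats were sitting on the mats, watching birds."))
# ['cat', 'sitting', 'mat', 'watching', 'bird']
```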

Biggest Open Problems in Natural Language Processing

The NLP domain reports great advances, to the extent that a number of problems, such as part-of-speech tagging, are considered fully solved. At the same time, tasks such as text summarization or machine dialog systems are notoriously hard to crack and have remained open for decades. However, if we look deeper into such tasks, we'll see that the problems behind them are rather similar and fall into two groups:

Google’s BERT changing the NLP Landscape

We write a lot about open problems in Natural Language Processing. We complain a lot when working on NLP projects. We pick on inaccuracies and blatant errors of different models. But what we need to admit is that NLP has already changed, and new models have solved problems that may still linger in our memory. One such drastic development is the launch of Google’s Bidirectional Encoder Representations from Transformers, or BERT: the model often called the best NLP model ever, based on its superior performance over a wide variety of tasks. When Google researchers presented a deep bidirectional Transformer model that addresses 11 NLP tasks and surpassed even human performance in the challenging area of question answering, it was seen as a game-changer in NLP/NLU.
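For readers who want to try it, a pre-trained BERT can be loaded in a few lines. The snippet below uses the Hugging Face transformers package, which is our choice for illustration rather than anything mandated by Google’s release:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Download a pre-trained BERT checkpoint and its matching tokenizer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("BERT reads text bidirectionally.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual 768-dimensional vector per (sub)word token.
print(outputs.last_hidden_state.shape)  # torch.Size([1, n_tokens, 768])
```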

NLP for Low-Resource Settings

Natural language processing (NLP) is a field of Artificial Intelligence that tries to establish human-like communication with computers. Although it can boast significant success, computers still struggle with comprehending many facets of language, such as pragmatics, that are difficult to characterize formally. Moreover, most of the success is achieved in popular languages like English or other languages that have text corpora of hundreds of millions of words. But we should understand that these are only about 20 languages out of approximately 7,000 languages in the world. The majority of human languages are in dire need of tools and resources to overcome the resource barrier, so that NLP can deliver more widespread benefits. They are called low-resource languages: languages lacking large monolingual or parallel corpora and/or manually crafted linguistic resources sufficient for building statistical NLP applications. It might look like we need only a dozen languages to do fine in the world, so why bother with minor or extinct languages? Yet building NLP applications for such languages can strengthen ties across the world while preserving its linguistic diversity. Several families of approaches help bridge the resource gap.

_Transfer of annotations_ (such as POS tags, syntactic or semantic features) works via cross-lingual bridges (e.g., word or phrase alignments). However, training such models with cross-lingual transfer learning usually requires linguistic knowledge and resources about the relation between the source language and the target language. Recent developments, though, offer techniques that do not require ancillary resources such as parallel corpora. In Kim et al. (2017), for instance, a cross-lingual model utilizes a common BLSTM that enables knowledge transfer from other languages, and private BLSTMs for language-specific representations, without exploiting any linguistic knowledge about the relation between the source language and the target language. The cross-lingual model is trained with language-adversarial training and bidirectional language modeling to represent language-general information and preserve the information about a specific target language.

_Transfer of models_ refers to training a model in a resource-rich language and applying it in a resource-poor language in zero-shot or one-shot learning. Zero-shot learning refers to training a model in one domain and assuming it generalizes more or less out-of-the-box to a low-resource domain. One-shot learning is a similar approach that uses a very limited number of examples from the low-resource domain to adapt the model trained in the rich-resource domain. This approach is particularly popular in machine translation, where the weights learned for a rich-resource language pair are transferred to low-resource pairs. An example of such an approach is the model by Zoph et al. (2016). A “parent” model is trained on a high-resource language pair (French to English) and some of the trained weights are reused as the initialization for a “child” model which is further trained on a specific low-resource language pair (Hausa, Turkish, and Uzbek into English). A similar approach was explored by Nguyen and Chiang (2017), where the parent language pair is also low-resource but related to the child language pair.
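The model-transfer recipe is easy to sketch. Below is a deliberately tiny, hypothetical PyTorch illustration of the parent-child initialization idea (not the actual architecture from Zoph et al.):

```python
import torch
import torch.nn as nn

# A deliberately tiny stand-in for a real translation model.
class TinyTranslator(nn.Module):
    def __init__(self, vocab_size=8000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, src):
        hidden, _ = self.encoder(self.embed(src))
        return self.out(hidden)

# 1. Train the parent on the high-resource pair (training loop omitted).
parent = TinyTranslator()
# ... train parent on e.g. French-English data ...

# 2. Initialize the child with the parent's weights instead of random ones.
child = TinyTranslator()
child.load_state_dict(parent.state_dict())

# 3. Fine-tune the child on the low-resource pair with a small learning rate,
#    so the transferred weights are adapted rather than overwritten.
optimizer = torch.optim.Adam(child.parameters(), lr=1e-4)
```

The point is simply that the child starts from the parent’s weights, so the scarce low-resource data is spent on adaptation rather than on learning from scratch.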
_Joint Multilingual or “Polyglot” Learning_ converts data in all languages to a shared representation (e.g., phones or multilingual word vectors) and trains a single model on a mix of datasets in all languages, to enable parameter sharing where possible. This approach is closely related to recent efforts to train a cross-lingual Transformer language model on the 100 most popular languages, and to cross-lingual sentence embeddings. The latter approach learns joint multilingual sentence representations for 93 languages, belonging to more than 30 different language families and written in 28 scripts. With the help of a single BiLSTM encoder with a shared BPE vocabulary for all languages, coupled with an auxiliary decoder and trained on parallel corpora, it becomes possible to learn a classifier on top of the resulting sentence embeddings using English annotated data only and transfer it to any of the 93 languages without any modification.

In conclusion, we can once more say that the actual reason many specialists work on NLP problems is to build systems that break down barriers. Given the potential impact on mankind, building systems for low-resource languages is one of the most important areas to work on. There are already many promising approaches to low-data settings, which may include low-resource languages, dialects, sociolects, and domains, but notwithstanding the pursuit of linguistic universals, there is still no universal solution that covers all the languages of the world.

Whitepapers

NLP and Computer Vision Integrated

Integration and interdisciplinarity are the cornerstones of modern science and industry. One example of recent attempts to combine everything is the integration of computer vision and natural language processing (NLP). Both fields are among the most actively developing machine learning research areas. Yet, until recently, they have been treated as separate areas without many ways to benefit from each other. It is only now, with the expansion of multimedia, that researchers have started exploring the possibilities of applying both approaches to achieve one result.

The most natural way for humans is to extract and analyze information from diverse sources. This conforms to the theory of semiotics (Greenlee 1978): the study of the relations between signs and their meanings at different levels. Semiotics studies the relationship between signs and meaning, the formal relations between signs (roughly equivalent to syntax), and the way humans interpret signs depending on the context (pragmatics in linguistic theory). If we consider purely visual signs, this leads to the conclusion that semiotics can also be approached by computer vision, which extracts interesting signs for natural language processing to realize the corresponding meanings.

Malik summarizes Computer Vision tasks in 3Rs (Malik et al. 2016): reconstruction, recognition, and reorganization. Reconstruction refers to the estimation of the 3D scene that gave rise to a particular visual image, by incorporating information from multiple views, shading, texture, or direct depth sensors. The process results in a 3D model, such as point clouds or depth images. Recognition involves assigning labels to objects in the image. For 2D objects, examples of recognition are handwriting or face recognition; 3D tasks tackle such problems as object recognition from point clouds, which assists in robotics manipulation. Reorganization means bottom-up vision, when raw pixels are segmented into groups that represent the structure of an image. Low-level vision tasks include edge, contour, and corner detection, while high-level tasks involve semantic segmentation, which partially overlaps with recognition tasks.

It is recognition that is most closely connected to language, because its output can be interpreted as words. For example, objects can be represented by nouns, activities by verbs, and object attributes by adjectives. In this sense, vision and language are connected by means of semantic representations (Gardenfors 2014; Gupta 2009).

NLP tasks are more diverse than Computer Vision tasks and range from syntax, including morphology and compositionality, through semantics as a study of meaning, including relations between words, phrases, sentences, and discourses, to pragmatics, a study of shades of meaning at the level of natural communication. Some complex tasks in NLP include machine translation, dialog interfaces, information extraction, and summarization. It is believed that switching from images to words is the closest thing to machine translation. Still, such “translation” between low-level pixels or contours of an image and a high-level description in words or sentences, the task known as Bridging the Semantic Gap (Zhao and Grosky 2002), remains a wide gap to cross.

The integration of vision and language did not proceed in a top-down deliberate manner, with researchers first formulating a set of principles.
Integrated techniques were rather developed bottom-up, as pioneers identified certain rather specific and narrow problems, attempted multiple solutions, and found satisfactory outcomes. The new trajectory started with the understanding that most present-day files are multimedia and contain interrelated images, videos, and natural language texts. For example, a typical news article contains a text written by a journalist and a photo related to the news content. Furthermore, there may be a video clip featuring a reporter or a snapshot of the scene where the event in the news occurred. Language and visual data provide two sets of information that are combined into a single story, making the basis for appropriate and unambiguous communication. This understanding gave rise to multiple applications of an integrated approach to visual and textual content, not only in working with multimedia files, but also in the fields of robotics, visual translations, and distributional semantics. The multimedia-related tasks for NLP and computer vision fall into three main categories: visual properties description, visual description, and visual retrieval.

Visual properties description: A step beyond classification, the descriptive approach summarizes object properties by assigning attributes. Such attributes may be either binary values for easily recognizable properties or relative attributes describing a property with the help of a learning-to-rank framework. The key is that the attributes provide a set of contexts as a knowledge source for recognizing a specific object by its properties. The attribute words become an intermediate representation that helps bridge the semantic gap between the visual space and the label space.

Visual description: In real life, the task of visual description is to provide image or video captioning. It is believed that sentences provide a more informative description of an image than a bag of unordered words. To generate a sentence that describes an image, a certain amount of low-level visual information should be extracted that provides the basic information on “who did what to whom, and where and how they did it”. From the part-of-speech perspective, quadruplets of “Nouns, Verbs, Scenes, Prepositions” can represent the meaning extracted by visual detectors. Visual modules extract objects that are either a subject or an object in the sentence. Then a Hidden Markov Model is used to decode the most probable sentence from a finite set of quadruplets, along with some corpus-guided priors for verb and scene (preposition) predictions. The meaning is represented using objects (nouns), visual attributes (adjectives), and spatial relationships (prepositions). Then the sentence is generated with the help of the phrase fusion technique, using web-scale n-grams for determining probabilities.

Visual retrieval: Content-based Image Retrieval (CBIR) is another field in multimedia that utilizes language in the form of query strings or concepts. As a rule, images are indexed by low-level vision features like color, shape, and texture. CBIR systems try to annotate an image region with a word, similar to semantic segmentation, so that the keyword tags are close to human interpretation. CBIR systems use keywords to describe an image for image retrieval, whereas visual attributes describe an image for image understanding. Nevertheless, visual attributes provide a suitable middle layer for CBIR with an adaptation to the target domain.
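As a toy illustration of the quadruplet idea (ours, far simpler than the HMM-based decoding described above), a caption can be assembled from detector outputs with a template:

```python
# Toy sketch: turn detector outputs into a sentence via the
# "Nouns, Verbs, Scenes, Prepositions" quadruplet idea. Real systems
# score many candidate quadruplets with corpus-guided priors instead.
def caption(detections):
    subject = detections["noun"]
    verb = detections["verb"]
    scene = detections["scene"]
    preposition = detections["preposition"]
    return f"A {subject} is {verb} {preposition} the {scene}."

print(caption({
    "noun": "dog",            # object detector output
    "verb": "running",        # action recognizer output
    "scene": "park",          # scene classifier output
    "preposition": "in",      # spatial-relation prediction
}))
# A dog is running in the park.
```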
Robotics Vision: Robots need to perceive their surroundings through more than one mode of interaction. Similar to humans processing perceptual inputs by using their knowledge about things in the form of words, phrases, and sentences, robots also need to integrate their perceived picture with language to obtain the relevant knowledge about objects, scenes, actions, or events in the real world, make sense of them, and perform a corresponding action. For example, if an object is far away, a human operator may verbally request an action to reach a clearer viewpoint. Robotics Vision tasks relate to how a robot can perform sequences of actions on objects to manipulate the real-world environment, using hardware sensors like depth cameras or motion cameras and having a verbalized image of its surroundings to respond to verbal commands.

Situated Language: Robots use language to describe the physical world and understand their environment. Moreover, spoken language and natural gestures are more convenient ways for a human to interact with a robot, if the robot is trained to understand this mode of interaction. From the human point of view, this is a more natural way of interaction. Therefore, a robot should be able to perceive and transform the information from its contextual perception into language using semantic structures. The most well-known approach to representing meaning is Semantic Parsing, which transforms words into logical predicates. Semantic parsing tries to map a natural language sentence to a corresponding meaning representation, which can be a logical form like λ-calculus, using Combinatory Categorial Grammar (CCG) as rules to compositionally construct a parse tree.

Early Multimodal Distributional Semantics Models: The idea behind Distributional Semantics Models (DSMs) is that words in similar contexts should have similar meaning; therefore, word meaning can be recovered from co-occurrence statistics between words and the contexts in which they appear. This approach is believed to be beneficial in computer vision and natural language processing alike, in the form of image embeddings and word embeddings. DSMs are applied to jointly model semantics based on both visual features like colors, shape, or texture and textual features like words. The common pipeline is to map visual data to words and apply distributional semantics models like LSA or topic models on top of them. Visual attributes can approximate the linguistic features for a distributional semantics model.

Neural Multimodal Distributional Semantics Models: Neural models have surpassed many traditional methods in both vision and language by learning better-distributed representations from the data. For instance, Multimodal Deep Boltzmann Machines can model joint visual and textual features better than topic models. In addition, neural models can model some cognitively plausible phenomena such as attention and memory. For attention, an image can first be given an embedding representation using CNNs and RNNs. An LSTM network can be placed on top and act like a state machine that simultaneously generates outputs, such as image captions, or looks at relevant regions of interest in an image one at a time. For memory, commonsense knowledge is integrated into visual question answering.

If combined, the two fields can solve a number of long-standing problems in multiple areas, including:
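To sketch the early multimodal DSM pipeline (our toy example with invented counts), visual attributes detected in images can serve as the "contexts" of a word-by-attribute matrix, which is then reduced LSA-style:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Rows: words; columns: visual attributes detected in images associated
# with each word (made-up counts, for illustration only).
words = ["banana", "lemon", "sky", "sea"]
attributes = ["yellow", "blue", "round", "vast"]
counts = np.array([
    [42.0, 0.0, 5.0, 0.0],    # banana
    [35.0, 0.0, 30.0, 0.0],   # lemon
    [0.0, 50.0, 0.0, 20.0],   # sky
    [1.0, 45.0, 0.0, 25.0],   # sea
])

# LSA-style dimensionality reduction over the word-by-attribute matrix.
svd = TruncatedSVD(n_components=2, random_state=0)
vectors = svd.fit_transform(counts)

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Words with similar visual profiles end up close together.
print(cos(vectors[0], vectors[1]))  # banana vs lemon: high
print(cos(vectors[0], vectors[2]))  # banana vs sky: low
```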

NLP vs. NLU: from Understanding a Language to Its Processing

As artificial intelligence progresses and technology becomes more sophisticated, we expect existing concepts to embrace this change, or to change themselves. Similarly, in the domain of computer-aided processing of natural languages, shall the concept of natural language processing give way to natural language understanding? Or is the relation between the two concepts subtler and more complicated than merely the linear progress of a technology?

In this post, we’ll scrutinize the concepts of NLP and NLU and their niches in AI-related technology. Importantly, though sometimes used interchangeably, they are actually two different concepts that have some overlap. First of all, they both deal with the relationship between natural language and artificial intelligence. They both attempt to make sense of unstructured data, like language, as opposed to structured data like statistics, actions, etc. In this respect, NLP and NLU stand in contrast to many other data mining techniques.

Source: https://nlp.stanford.edu/~wcmac/papers/20140716-UNLU.pdf

NLP is an already well-established, decades-old field operating at the cross-section of computer science, artificial intelligence, and, increasingly, data mining. The ultimate goal of NLP is to read, decipher, understand, and make sense of human languages by machines, taking certain tasks off humans and allowing a machine to handle them instead. Common real-world examples of such tasks are online chatbots, text summarizers, auto-generated keyword tabs, and tools analyzing the sentiment of a given text. NLP in its broadest sense can refer to a wide range of tools, such as speech recognition, natural language recognition, and natural language generation. Yet, the most common tasks of NLP are historically:

Natural Language Technologies in 2019

No matter what technology domain you are looking at, the unanimous expectation for the coming year is that natural language processing will take the leading role in that domain’s advancement. Be it Business Intelligence, FinTech, or Healthcare, NLP seems to be becoming a field-shaping technology. In 2018, businesses used NLP techniques in several areas:

Top 10 Books on NLP and Text Analysis

When it comes to choosing the right book, you immediately become overwhelmed with the abundance of possibilities: should you choose a classic for a solid base or a fresh-from-the-oven book for the newest trends? What level to stick to? Will a beginner’s guide be too easy? In this review, we have collected our Top 10 NLP and Text Analysis books of all time, ranging from beginner to expert level.

1. Natural Language Processing with Python, by Steven Bird, Ewan Klein and Edward Loper. It is so popular that every top list seems to have it. Well, it is a timeless classic that provides an introduction to NLP using Python and its NLTK library.
Target readers: beginners in NLP, computational linguists and AI developers.
Why it is good: the book is very practice-oriented: you won’t be introduced to complex theories, just plenty of code and concepts to start experimenting right away.
Where to find:

2. Target readers: beginners in natural language processing with no required knowledge of linguistics or statistics.
Why it is good: though rather old, this book gives a strong foundation in linguistics and statistical methods that helps to better understand newer methods and encodings.
Where to find:

3. Target readers: beginners in natural language and speech processing.
Why it is good: the book provides solid foundational knowledge, introducing linguistics, computer science and statistics at comprehensive depth.
Where to find:

4. Target readers: linguists as well as researchers in informatics, artificial intelligence, language engineering, and cognitive science.
Why it is good: it is an academic edition, meaning that it is theory-oriented and provides a deeper understanding of major concepts and their functioning.
Where to find:

5. Target readers: practitioners at least slightly familiar with R.
Why it is good: it is quite new, so it has a practical and modern feel to the demonstrations and provides examples of real text mining problems.
Where to find:

6. Target readers: software developers and industry practitioners who are already familiar with neural networks.
Why it is good: the book offers a thorough overview of state-of-the-art neural network models that may be useful for NLP.
Where to find:

7. Target readers: software developers who want to familiarize themselves with enterprise-grade NLP tools for work projects.
Why it is good: this book offers first-hand insights into Apache-based NLP from a cofounder of the Apache Mahout project. Besides, it is a rare book with Java code examples.
Where to find:

8. Target readers: advanced undergraduate and graduate students in computational linguistics and computer science, as well as academic and industrial researchers.
Why it is good: first of all, it is a 2018 edition, so it reviews the real state of the art. Besides, it provides deep and fundamental knowledge of deep learning far beyond practical applications.
Where to find:

9. Target readers: software developers in Python who are interested in applying natural language processing and machine learning to their software development toolkit.
Why it is good: this practical book presents a data scientist’s perspective on building language-aware products with applied machine learning techniques.
Where to find:

10. Target readers: software developers with at least minor previous experience in machine learning.
Why it is good: the book gives a comprehensive overview of the most recent developments in machine learning, starting from simple linear regression and progressing to deep neural networks, all built on the two most popular libraries: Scikit-Learn and TensorFlow.
Where to find:

Word Vectors in Natural Language Processing: Global Vectors (GloVe)

Another well-known model that learns vectors for words from their co-occurrence information, i.e. how frequently they appear together in large text corpora, is Global Vectors (GloVe). While word2vec is a predictive model (a feed-forward neural network that learns vectors to improve its predictive ability), GloVe is a count-based model. Generally speaking, count-based models learn vectors by performing dimensionality reduction on a co-occurrence counts matrix. First, they construct a large matrix of co-occurrence information, which records how frequently each “word” (stored in the rows) is seen in some “context” (the columns). The number of “contexts” needs to be large, since it is essentially combinatorial in size. Afterwards, they factorize this matrix to yield a lower-dimensional matrix of words and features, where each row yields a vector representation for a word. This is achieved by minimizing a “reconstruction loss” that looks for lower-dimensional representations which can explain the variance in the high-dimensional data. In the case of GloVe, the counts matrix is preprocessed by normalizing the counts and log-smoothing them. Compared to word2vec, GloVe allows for parallel implementation, which means that it is easier to train over more data.

GloVe is believed to combine the benefits of the word2vec skip-gram model on word analogy tasks with those of matrix factorization methods that exploit global statistical information. On the project page it is stated that GloVe is essentially a log-bilinear model with a weighted least-squares objective. The model rests on a rather simple idea: ratios of word-word co-occurrence probabilities have the potential for encoding some form of meaning, which can be captured as vector differences. Therefore, the training objective is to learn word vectors such that their dot product equals the logarithm of the words’ probability of co-occurrence. As the logarithm of a ratio equals the difference of logarithms, this objective associates ratios of co-occurrence probabilities with vector differences in the word vector space. The resulting word vectors perform well on word analogy tasks as well as on similarity tasks and named entity recognition.

Almost all unsupervised methods for learning word representations use the statistics of word occurrences in a corpus as the primary source of information, yet the question remains how we can generate meaning from these statistics, and how the resulting word vectors might represent that meaning. Pennington et al. (2014) present a simple example based on the words ice and steam to illustrate it. The relationship of these words can be revealed by studying the ratio of their co-occurrence probabilities with various probe words k. Let P(k|w) be the probability that the word k appears in the context of word w: ice co-occurs more frequently with solid than it does with gas, whereas steam co-occurs more frequently with gas than it does with solid. Both words co-occur frequently with water (as it is their shared property) and infrequently with the unrelated word fashion. In other words, P(solid | ice) will be relatively high, and P(solid | steam) will be relatively low. Therefore, the ratio P(solid | ice) / P(solid | steam) will be large. If we take a word such as gas that is related to steam but not to ice, the ratio P(gas | ice) / P(gas | steam) will instead be small.
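A small numeric sketch makes the pattern visible; the probabilities below are close to those reported in Pennington et al. (2014), rounded for illustration:

```python
# Co-occurrence probabilities in the spirit of the ice/steam example
# from Pennington et al. (2014); only their relative sizes matter.
P = {
    ("solid", "ice"): 1.9e-4,   ("solid", "steam"): 2.2e-5,
    ("gas", "ice"): 6.6e-5,     ("gas", "steam"): 7.8e-4,
    ("water", "ice"): 3.0e-3,   ("water", "steam"): 2.2e-3,
    ("fashion", "ice"): 1.7e-5, ("fashion", "steam"): 1.8e-5,
}

for k in ["solid", "gas", "water", "fashion"]:
    ratio = P[(k, "ice")] / P[(k, "steam")]
    print(f"P({k}|ice) / P({k}|steam) = {ratio:.2f}")

# solid: large (>1), gas: small (<1), water and fashion: close to 1.
# Only the discriminative probe words produce ratios far from 1.
```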
For a word related to both ice and steam, such as water, we expect the ratio to be close to one. We can see that the appropriate starting point for word vector learning might indeed be ratios of co-occurrence probabilities rather than the probabilities themselves.

Rather than predicting context words one at a time, GloVe fits the word vectors directly to these corpus statistics. Before training the actual model, a _co-occurrence matrix_ _X_ is constructed, where a cell _Xij_ is a “strength” representing how often the word _i_ appears in the context of the word _j_. Once _X_ is ready, it is necessary to decide vector values in continuous space for each word in the corpus, in other words, to build word vectors that reflect how every pair of words _i_ and _j_ co-occurs. We produce vectors with a soft constraint that for each pair of word _i_ and word _j_

w_i ⋅ w̃_j + b_i + b̃_j = log(X_ij),

where _bi_ and _bj_ are scalar bias terms associated with words _i_ and _j_, respectively. We do this by minimizing an objective function _J_, which evaluates the sum of all squared errors based on the above equation, weighted with a function _f_:

J = Σ_{i,j=1..V} f(X_ij) (w_i ⋅ w̃_j + b_i + b̃_j − log X_ij)²,

where V is the size of the vocabulary. Yet, some co-occurrences that happen rarely or never are noisy and carry less information than the more frequent ones. To deal with them, a weighted least squares regression model is used. One class of weighting functions found to work well can be parameterized as

f(x) = (x / x_max)^α if x < x_max, and f(x) = 1 otherwise,

with x_max = 100 and α = 3/4 in the original paper.

The model generates two sets of word vectors, W and W̃. When X is symmetric, W and W̃ are equivalent and differ only as a result of their random initializations; the two sets of vectors should perform equivalently. Since, for certain types of neural networks, training multiple instances of the network and then combining the results can help reduce overfitting and noise (Ciresan et al., 2012), W and W̃ are summed to produce the final word vectors. Doing so gives a small boost in performance, with the biggest increase on the semantic analogy task.

The model utilizes the main benefit of count data, the ability to capture global statistics, while simultaneously capturing the meaningful linear substructures prevalent in recent log-bilinear prediction-based methods like word2vec. As a result, GloVe is a global log-bilinear regression model for the unsupervised learning of word representations that outperforms other models on word analogy, word similarity, and named entity recognition tasks.

Advantages

Drawbacks
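For concreteness, here is a minimal NumPy sketch of this objective with plain SGD on a toy counts matrix (our illustration; the reference implementation uses AdaGrad and many other refinements):

```python
import numpy as np

rng = np.random.default_rng(0)
V, dim = 5, 8                                       # toy vocabulary and vector size
X = rng.integers(0, 50, size=(V, V)).astype(float)  # toy co-occurrence counts

W = rng.normal(scale=0.1, size=(V, dim))        # word vectors
W_t = rng.normal(scale=0.1, size=(V, dim))      # context ("tilde") vectors
b = np.zeros(V)                                 # word biases
b_t = np.zeros(V)                               # context biases

def f(x, x_max=100.0, alpha=0.75):
    """GloVe weighting: down-weights rare counts, caps frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

lr = 0.05
for epoch in range(200):
    for i in range(V):
        for j in range(V):
            if X[i, j] == 0:
                continue                        # log 0 is undefined; skipped
            # Error of the soft constraint  w_i . w~_j + b_i + b~_j = log X_ij
            err = W[i] @ W_t[j] + b[i] + b_t[j] - np.log(X[i, j])
            g = f(X[i, j]) * err                # weighted error drives gradients
            grad_wi, grad_wj = g * W_t[j], g * W[i]
            W[i] -= lr * grad_wi
            W_t[j] -= lr * grad_wj
            b[i] -= lr * g
            b_t[j] -= lr * g

vectors = W + W_t                               # summed, as in the paper
```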
