SciForce Blog

Read our blog and carry on

Learn who we are and why we stand out from the rest.

Data Analytics
Big Data
Predictive Analytics
Data Science
Machine Learning
Computer Vision
Natural Language Processing
Follow Sciforce on Medium
OHDSI Global Symposium 2022
What is GPT-3, How Does It Work, and What Does It Actually Do?

GitHub and OpenAI presented a new code-generating tool, Copilot, which is now part of Visual Studio Code and autocompletes code snippets. Copilot is based on Codex, a descendant of GPT-3, presented a year ago. It seems the hype around GPT-3 is not going to evaporate, so we decided to delve into the details step by step. Check it out.

GPT-3 stands for Generative Pre-trained Transformer 3, the third version of the language model that OpenAI released in May 2020. It is generative, as GPT-3 can generate long passages of unique text as output; notice that most neural networks are capable only of spitting out yes-or-no answers or simple sentences. Pre-trained means that the language model has not been built with any special domain knowledge, yet it can complete domain-specific tasks like translation. Thus, GPT-3 is among the most innovative language models that have ever existed.

Ok, but what is a Transformer, then? Simply put, it is a neural network architecture developed by Google's scientists in 2017, and it uses a self-attention mechanism that is a good fit for language understanding. Given that the attention mechanism enabled a breakthrough in the NLP domain in 2015, the Transformer became the ground for GPT-1 and for Google's BERT, another great language model. In essence, attention is a function that calculates the probability of the next word appearing, given the words surrounding it. By the way, we have developed an explainer for BERT.

Wait, but what makes GPT-3 so unique? The GPT-3 language model has 175 billion parameters, i.e., values that a neural network optimizes during training (compare that with the 1.5 billion parameters of GPT-2). Thus, this language model has excellent potential for automation across various industries, from customer service to documentation generation. You can play around with the GPT-3 Playground beta yourself.

How can I use GPT-3 for my applications? As of July 2021, you can join the waitlist, since the company offers a private beta version of its API on an LMaaS (language-model-as-a-service) basis. Here are some examples that you might have already heard of. GPT-3 is writing stunning fiction: Gwern, who is experimenting with both GPT-2 and GPT-3, states that "GPT-3, however, is not merely a quantitative tweak yielding "GPT-2 but better" — it is qualitatively different." The beauty of GPT-3 for text generation is that you do not need to train anything in the usual way; instead, you write prompts for GPT-3 to teach it whatever you want. Sharif Shameem used GPT-3 for debuild, a platform that generates code on request: you can type a request like "create a watermelon-style button" and grab the code to use in an app. You can even use GPT-3 to generate substantial business guidelines, as @zebulgar did.

Let us look under the hood and define the nuts and bolts of GPT-3.

Larger models learn efficiently from in-context information. To put it bluntly, GPT-3 calculates how likely a word is to appear in the text given the other words in that text. This is known as the conditional probability of words. For example, in the sentence "Margaret is arranging a garage sale... Maybe we could buy that old ___", the word chair is much more likely to appear than, let us say, elephant. That means the probability of the word chair occurring in the prompted text is higher than the probability of elephant.
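To make the idea of conditional word probability concrete, here is a minimal sketch that scores candidate next words with the publicly available GPT-2 through the Hugging Face transformers library (GPT-3 itself is only reachable through OpenAI's private-beta API, so GPT-2 stands in purely for illustration):

```python
# A minimal sketch: scoring candidate next words with GPT-2 (a stand-in for GPT-3,
# which is only available through OpenAI's API). Requires `pip install transformers torch`.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "Margaret is arranging a garage sale... Maybe we could buy that old"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(input_ids).logits          # shape: (1, seq_len, vocab_size)
next_token_probs = torch.softmax(logits[0, -1], dim=-1)

for word in [" chair", " elephant"]:
    token_id = tokenizer.encode(word)[0]      # probability of the first sub-token
    print(f"P({word.strip()!r} | prompt) = {next_token_probs[token_id].item():.6f}")
```

The model assigns a much higher probability to "chair" than to "elephant", which is exactly the conditional-probability behaviour described above.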
GPT-3 uses a form of data compression while consuming millions of sample texts to convert words into vectors, i.e., numeric representations. Later, the language model unpacks the compressed text into human-friendly sentences. Thus, compressing and decompressing text develops the model's accuracy in calculating the conditional probability of words.

Dataset used to train GPT-3. Since GPT-3 performs well in the "few-shot" setting, it can respond consistently to an example piece of text it has never been exposed to before. It needs only a few examples to produce a relevant response, because it has already been trained on a huge number of text samples. Check out the research paper for more technical details: Language Models are Few-Shot Learners. The scheme illustrates the mechanics of English-to-French translation.

After training, when the language model's conditional probabilities are as accurate as possible, it can predict the next word given an input word, sentence, or fragment as a prompt. Formally speaking, prediction of the next word relates to natural language inference. In essence, GPT-3 is a text predictor: its output is a statistically plausible response to the given input, grounded in the data it was trained on. However, some critics argue that GPT-3 is not the best AI system for question answering and text summarization. GPT-3 is mediocre compared to the SOTA (state-of-the-art) methods for each NLP task taken separately, but it is much more general than any previous system, and upcoming systems are likely to resemble GPT-3.

In general, GPT-3 can perform NLP tasks after being given just a few prompts. It demonstrated high performance in the few-shot setting on the following tasks.

GPT-3 reached a perplexity of 20.5 (perplexity defines how well a probability language model predicts a sample) in the zero-shot setting on the Penn Tree Bank (PTB). The closest rival, BERT-Large-CAS, has 31.3, which makes GPT-3 the leader in language modelling on the Penn Tree Bank.

GPT-3 also demonstrates 86.4% accuracy (an 18% increase over previous SOTA models) in the few-shot setting on the LAMBADA dataset. For this test, the model predicts the last word of a sentence, which requires "reading" the whole paragraph. Important notice: GPT-3 demonstrated these results thanks to fill-in-the-blank examples like: "George bought some baseball equipment, a ball, a glove, and a ____."

Moreover, researchers report about 79.3% accuracy in picking the best ending of a story on the HellaSwag dataset in the few-shot setting, and 87.7% accuracy on the StoryCloze 2016 dataset (which is still "4.1% lower than the fine-tuned SOTA using a BERT based model").

What about testing broad factual knowledge with GPT-3? As per the GPT-3 research paper, it was tested on the Natural Questions, WebQuestions, and TriviaQA datasets; in the few-shot setting it outperforms fine-tuned SOTA models only on the TriviaQA dataset.

As for translation, supervised SOTA neural machine translation (NMT) models are the clear leaders in this domain. However, GPT-3 reflects its strength as an English LM, mainly when translating into English.
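To show what "a few prompts" looks like in practice, here is a hedged sketch of the English-to-French few-shot prompt style referred to above, sent through the private-beta Completion endpoint as it existed around 2021. The engine name, parameters, and client interface are illustrative (the openai package has changed since), so treat this purely as a sketch:

```python
# A sketch of few-shot prompting against the 2021 private-beta OpenAI API.
# The engine name, parameters, and even the client interface are illustrative;
# the current openai package exposes a different interface.
import openai

openai.api_key = "YOUR_API_KEY"  # obtained after joining the waitlist

# Few-shot prompt: a handful of English->French pairs, then the query to complete.
prompt = (
    "English: sea otter\nFrench: loutre de mer\n\n"
    "English: cheese\nFrench: fromage\n\n"
    "English: I do not speak French.\nFrench:"
)

response = openai.Completion.create(
    engine="davinci",      # illustrative engine name from the beta period
    prompt=prompt,
    max_tokens=20,
    temperature=0.0,       # deterministic output for translation
    stop=["\n"],           # stop at the end of the translated line
)
print(response.choices[0].text.strip())
```

The two example pairs are the "few shots"; the model infers the task from them and completes the last line.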
Researchers also state that "GPT-3 significantly outperforms prior unsupervised NMT work when translating into English but underperforms when translating in the other direction." In general, across all three language pairs tested (English combined with French, German, and Romanian), there is a smooth upward trend with model capacity.

Winograd-style tasks are classical NLP tasks: determining which word a pronoun refers to when the sentence is grammatically ambiguous but semantically unambiguous for a human. Fine-tuned methods have recently reached human-like performance on the Winograd dataset but still lag behind on the more complex Winogrande dataset. GPT-3's results are the following: "On Winograd GPT-3 achieves 88.3%, 89.7%, and 88.6% in the zero-shot, one-shot, and few-shot settings, showing no clear in-context learning but in all cases achieving strong results just a few points below state-of-the-art and estimated human performance."

As for physical or scientific reasoning, GPT-3 does not outperform fine-tuned SOTA methods, and it is still not good at arithmetic. However, when it comes to news article generation, human detection of GPT-3-written news (few-shot setting) is close to chance, with a mean accuracy of 52%.

Well, even OpenAI CEO Sam Altman tweeted that GPT-3 is overhyped, and here is what the researchers themselves state. GPT-3 is not good at text synthesis: while the overall quality of the generated text is high, it starts repeating itself at the document level or in long passages. It also lags in the domain of discrete language tasks, having difficulty with "common sense physics"; thus, it is hard for GPT-3 to answer the question "If I put cheese into the fridge, will it melt?" GPT-3 has some notable gaps in reading comprehension and comparison tasks. Tasks that empirically benefit from bidirectionality are also areas of improvement for GPT-3. These may include "fill-in-the-blank tasks, tasks that involve looking back and comparing two pieces of content, or tasks that require re-reading or carefully considering a long passage and then generating a very short answer," as the researchers state.

Models like GPT-3 have a lot of skills and become "overqualified" for some specific tasks. Moreover, it is a computing-power-hungry model: "training the GPT-3 175B consumed several thousand petaflop/s-days of compute during pre-training, compared to tens of petaflop/s-days for a 1.5B parameter GPT-2 model", as the researchers state. Since the model was trained on content that humans generated on the internet, there are still troubles related to bias, fairness, and representation; thus, GPT-3 can generate prejudiced or stereotyped content. You may have already read a lot about it online, or you can check the research paper, where the authors dwell on it pretty well.

GPT-3 is a glimpse of a bright future in NLP: it helps generate code and meaningful pieces of text, translates, and does well on many different tasks. It also has its limitations and ethical issues, like generating biased fragments of text. All in all, we are witnessing something interesting, as has always been the case in NLP. Clap for this blog and give us some more inspiration.

Check out more of our posts on NLP: Text Preprocessing for NLP and Machine Learning Tasks, Biggest Open Problems in Natural Language Processing, A Comprehensive Guide to Natural Language Generation, and NLP vs. NLU: from Understanding a Language to Its Processing.

How to Tell a Fantastic Data Story

What are the central parts of any data-driven story, and how do you apply the Gestalt laws to your data visualization reports? What role does context play in your data story? We dwell on these questions and provide you with a list of the best books and tools for stunning data visualization. Check it out!

First, let us start with the central part of any good data story: the data-ink ratio. You've probably heard of or read something by Edward Tufte, the father of data visualization, who coined the term. By data-ink ratio, we mean the amount of data-ink divided by the total ink needed for your visualization. Every element of your infographic requires a reason. Simply put, Tufte says that you should use only the required details and remove the visual noise. How far can you go? Until your visualization still communicates the overall idea.

Daniel Haight calls it the visualization spectrum: the constant trade-off between clarity and engagement. Daniel proposes to measure clarity by the time needed to understand the visualization, which depends on information density. You can then measure the engagement of your data story by the emotional connection it creates (besides the shares and mentions on social media). Take the data story by FiveThirtyEight about women of color in the US Congress as an example. The authors use simple elements and do not overload the viewer with needless details, while still communicating the overall story crystal clear. At the same time, looking at the timelines of different colors, you can see the drastic changes behind them. That is a pretty good compromise between clarity and engagement.

Edward Tufte also coined the term chartjunk, which stands for all the ugly visualizations you may see endlessly online. 11 Reasons Infographics Are Poison And Should Never Be Used On The Internet Again by Walt Hickey, dating back to 2013, is still topical. Thus, it is better to follow principles that keep your infographics off that list.

Gestalt principles are heuristics, mental shortcuts for the brain, that explain how we group small objects to form larger ones. For example, we tend to perceive things located close to each other as a group; this is the principle of proximity. Also check out the post by Becca Selah, rich in quick and easy tips on clear data visualization.

In essence, color conveys information in the most effective way for the human brain, but choosing it according to the rules might be frustrating. As a rule of thumb, remember that choosing colors of the same saturation helps a viewer perceive them as a group. Also, check out the excellent guide from Lisa Charlotte Rost (Datawrapper), since it is the best thing we have seen for beginners looking for color-picking tools. Pro tip: gray does not fight for human attention and should become your best friend. Andy Kirk tells more about it here. A small sketch of these decluttering ideas follows below.
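As an illustration of Tufte's "remove the visual noise" advice and the pro tip about gray, here is a hedged matplotlib sketch; the dataset and series names are invented purely for demonstration, and the styling choices are one possible reading of the advice, not a canonical recipe:

```python
# A minimal sketch of increasing the data-ink ratio with matplotlib.
# The numbers and series names are invented for demonstration only.
import matplotlib.pyplot as plt

years = [2016, 2017, 2018, 2019, 2020]
series = {
    "Product A": [12, 15, 14, 18, 21],
    "Product B": [10, 11, 13, 12, 14],
    "Product C": [8, 9, 9, 10, 11],
}

fig, ax = plt.subplots(figsize=(6, 4))

# Gray for context lines, one accent color for the line that carries the story.
for name, values in series.items():
    color = "#d62728" if name == "Product A" else "#bbbbbb"
    ax.plot(years, values, color=color, linewidth=2, label=name)

# Remove chartjunk: the box around the plot and tick marks we do not need.
for spine in ["top", "right"]:
    ax.spines[spine].set_visible(False)
ax.tick_params(length=0)
ax.set_title("Product A pulls ahead after 2018")
ax.legend(frameon=False)

plt.tight_layout()
plt.show()
```

The title states the insight directly, the accent color points the eye at it, and everything else stays quiet in gray.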
Firstly, let us make it clear: a data visualization specialist is a data analyst. Thus, your primary task is to differentiate the signals from the noise, i.e., to find the hidden gems in tons of data. But presenting your findings with good design in mind is not enough if no context is introduced to your audience. Here is when storytelling comes in handy. Who is your audience, and how will they see it? At the outset of your data-driven story, define your focus: is it broad or narrow? In the first case, you will spend lots of time digging into the data, so it is crucial to ask your central question. Working with a narrow focus is different: you have specific prerequisites at the very beginning and harness several datasets to find the answer. Thus, it might be easier to cope with one specific inquiry than to look for insights across datasets.

Consider placing simple insights at the beginning of your story. Thus, you draw in the reader immediately and can add relevant points to illustrate the primary idea better. But when it comes to "well-it-depends-what-we-are-talking-about" answers, be gentle with your audience and guide them step by step into your story. You can use comparisons across different elements or periods, or apply analogies. It is also helpful to use individual data points on a small scale before delving into the large-scale story. Besides the links to the datasets you have been working with while crafting your story, it is worth sharing your methodology; your savvy audience can then relax while looking at the results.

You may read tons of blogs on data visualization, but we believe in the old-fashioned style: build your stable foundation first with the classics. Here is the list we recommend, depending on the tasks you are solving.

Edward Tufte is the father of data visualization, so we recommend starting with his books to master the main ideas: The Visual Display of Quantitative Information, Envisioning Information, Beautiful Evidence, and Visual Explanations; after those, you are a rock star of data visualization.

Naked Statistics: Stripping the Dread from the Data by Charles Wheelan. Applying statistics to your analysis is crucial, and you will delve into principal concepts like inference, correlation, and regression analysis.

How Not to Be Wrong: The Power of Mathematical Thinking by Jordan Ellenberg. We recommend it for those coming to DataViz with a non-tech background.

Statistics Unplugged by Sally Caldwell. Again, statistics explained in a way you will love.

Interactive Data Visualization for the Web: An Introduction to Designing with D3 by Scott Murray comes in handy for creating online visualizations even if you have no experience with web development.

D3.js in Action: Data visualization with JavaScript by Elijah Meeks is a guide to creating interactive graphics with D3.

R for Data Science: Import, Tidy, Transform, Visualize, and Model Data by Hadley Wickham will brush up your coding skills with R.

Data Visualisation: A Handbook for Data Driven Design by Andy Kirk. This one can help you choose the best visualization for your data, as your insights should ideally be as clear as they are engaging.

Visualization Analysis and Design by Tamara Munzner. This one offers a comprehensive, systematic approach to design for DataViz.

Information Visualization: Perception for Design by Colin Ware is the cherry on the cake of your design skills.

However, if you have data visualization tasks right now and no time for upskilling, we recommend online tools like Flourish or Datawrapper. Mala Deep covers five free data visualization tools pretty clearly, check it out! Also, we would appreciate your suggestions in the comments section!

High-Quality Software: to Pay or Not to Pay

Software development companies are always under pressure to launch their software onto the market faster, as releasing ahead of the competition gives an advantage that can be vital. Fast release times and more frequent releases can, at the same time, corrupt the quality of the product, increasing the chances of defects and bugs. It is quite a common debate in software development projects whether to spend time on improving software quality or to release more valuable features faster. The pressure to deliver functionality often cuts off time that could be dedicated to working on architecture and code quality. However, reality shows that high-performing IT companies can release fast (Amazon, for example, deploys new software to production through its Apollo deployment service every 11.7 seconds) with 60 times fewer failures. So, do we actually need to choose between quality, time, and price?

Software quality refers to many things. It measures whether the software satisfies its functional and non-functional requirements. Functional requirements specify what the software should do, including technical details, data manipulation and processing, or any other specific function. Non-functional requirements, or quality attributes, include things like disaster recovery, portability, privacy, security, supportability, and usability.

To understand software quality, we can explore the CISQ software quality model, which outlines all quality aspects and relevant factors needed for a holistic view. It rests on four important indicators of software quality:

Reliability – the risk of software failure and the stability of a program when exposed to unexpected conditions. Quality software should have minimal downtime, good data integrity, and no errors that directly affect users.

Performance efficiency – an application's use of resources and how it affects scalability, customer satisfaction, and response time. It rests on the software architecture, source code design, and individual architectural components.

Security – protection of information against the risk of software breaches, which relies on coding and architectural strength.

Maintainability – the amount of effort needed to adjust software, adapt it for other goals, or hand it over from one development team to another. The key principles here are compliance with software architectural rules and consistent coding across the application.

Of course, there are other factors that ensure software quality and provide a more holistic view of quality and the development process:

Rate of delivery – how often new versions of the software are shipped to customers.

Testability – finding faults in software with high testability is easier, making such systems less likely to contain errors when shipped to end users.

Usability – the user interface is the only part of the software visible to users, so it is crucial to have a great UI. Simplicity and task execution speed are two factors that facilitate a better UI.

User sentiment – measuring how end users feel when interacting with an application or system helps companies get to know them better, incorporate their needs into upcoming sprints, and ultimately broaden the product's impact and market presence.

Continuous improvement – implementing the practice of constant process improvement is central to quality management. It can help your team develop its own best practices and share them further, justify investments, and increase self-organization.
There are obviously a lot of aspects that describe quality software; however, not all of them are evident to the end user. A user can tell if the user interface is good, and an executive can assess whether the software is making the staff more efficient. Most probably, users will notice defects, bugs, and inconsistencies. What they do not see is the architecture of the software.

Software quality can thus fall into two major categories: external (such as the UI and defects) and internal (architecture). A user can see what makes up the high external quality of a software product, but cannot tell the difference between higher and lower internal quality. Therefore, a user can judge whether to pay more to get a better user interface, since they can assess what they get. But users do not see the internal modular structure of the software, let alone judge whether it is better, so they might be reluctant to pay for something that they neither see nor understand. And why should any software-developing company put time and effort into improving the internal quality of its product if it has no direct effect?

When users do not see or appreciate extra effort spent on the product architecture, and the demand for software delivery speed keeps increasing along with the demand for cost reduction, companies are tempted to release more new features that show progress to their customers. However, this is a trap: it reduces the initial time and cost of the software but makes it more expensive to modify and upgrade in the long run.

One of the principal features of internal quality is making it easier to figure out how the application works so developers can add things easily. For example, if the software is divided into separate modules, you do not have to read the whole codebase; you can look through a few hundred lines in a couple of modules to find the necessary information. More robust architecture, and therefore better internal quality, makes adding new features easier, which means faster and cheaper development.

Besides, software customers have only a rough idea of what features they need in a product and learn gradually as the software is built, particularly after the early versions are released to their users. This entails constant changes to the software, including languages, libraries, and even platforms. With poor internal quality, even small changes require developers to understand large areas of code, which is quite tough. When they make changes, unexpected breakages happen, leading to long test times and defects that need to be fixed. Therefore, concentrating only on external quality yields fast initial progress, but as time goes on, it gets harder and harder to add new features. High internal quality means reducing that drop-off in productivity.

But how can you achieve high external and internal quality when you do not have endless time and resources? Following the build life cycle from story to code on a developer desktop could be an answer. While performing testing, use automation throughout the process, including automated functional, security, and other modes of testing. This provides teams with quality metrics and automated pass/fail rates. When your most frequent tests are fully automated and only manual tests on the highest-quality releases are left, you get automated build-life quality metrics that cover the full life cycle, enabling developers to deliver high-quality software quickly and reduce costs through higher efficiency.
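As a tiny illustration of the automated pass/fail metrics mentioned above, here is a hedged sketch of a unit test that a CI pipeline could run on every build; the `parse_price` function is made up purely for this example, not taken from any real project:

```python
# A sketch of an automated test that produces pass/fail data on every build.
# `parse_price` is a hypothetical function used only to illustrate the idea.
import pytest

def parse_price(text: str) -> float:
    """Convert a user-entered price string like '$1,299.99' to a float."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    return float(cleaned)

@pytest.mark.parametrize(
    "raw, expected",
    [
        ("$1,299.99", 1299.99),
        ("15", 15.0),
        ("  $0.50 ", 0.50),
    ],
)
def test_parse_price(raw, expected):
    assert parse_price(raw) == pytest.approx(expected)

def test_parse_price_rejects_garbage():
    with pytest.raises(ValueError):
        parse_price("not a price")
```

Run with `pytest` on every commit, the pass/fail counts become exactly the kind of build-life quality metric the paragraph above describes.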
Neglecting internal quality leads to a rapid build-up of work that eventually slows down new feature development. Keeping internal quality high keeps you in control, which pays off when adding features and changing the product. Therefore, to answer the question in the title: it is actually a myth that high-quality software is more expensive, so no such trade-off exists. And you definitely should spend more time and effort on building a robust architecture to have a good basis for further development, unless you are just working on a college assignment that you'll forget in a month.

Top 5 NLP News of December by Sciforce CTO Max Ved

We are excited to announce that we are launching a new section at Sciforce today: "Top AI news of the month." We decided to create this column to keep our valued clients aware of the latest news and technologies in the AI world. So here we will share some interesting SciTech information with you. Today we will discuss the top 5 NLP news of December:


Generative models under a microscope: Comparing VAEs, GANs, and Flow-Based Models

Two years after Generative Adversarial Networks were introduced in a 2014 paper by Ian Goodfellow and other researchers, Facebook's AI research director and one of the most influential AI scientists, Yann LeCun, called adversarial training "the most interesting idea in the last ten years in ML." Interesting and promising as they are, GANs are only one part of a family of generative models that offer a completely different angle on solving traditional AI problems.

Generative algorithms

When we think of Machine Learning, the first algorithms that come to mind are probably discriminative. Discriminative models, which predict a label or a category of some input data depending on its features, are the core of all classification and prediction solutions. In contrast to such models, generative algorithms help us tell a story about the data, providing a possible explanation of how the data was generated. Instead of mapping features to labels, as discriminative algorithms do, generative models attempt to predict features given a label. While discriminative models define the relation between a label y and a feature x, generative models answer "how you get x." Generative models model P(Observation | Cause) and then use Bayes' theorem to compute P(Cause | Observation). In this way, they can capture p(x|y), the probability of x given y, i.e., the probability of features given a label or category. So generative algorithms can be used as classifiers, but they can do much more, as they model the distribution of individual classes. There are many generative algorithms, yet the most popular models that belong to the Deep Generative Models category are Variational Autoencoders (VAE), GANs, and Flow-Based Models.

Variational Autoencoders

A Variational Autoencoder (VAE) is a generative model that "provides probabilistic descriptions of observations in latent spaces." Simply put, this means VAEs store latent attributes as probability distributions. The idea of the Variational Autoencoder (Kingma & Welling, 2014), or VAE, is deeply rooted in variational Bayesian and graphical model methods.

A standard autoencoder comprises a pair of connected networks, an encoder and a decoder. The encoder takes an input and converts it into a smaller representation, which the decoder can use to convert it back to the original input. However, the latent space to which they convert their inputs, and where their encoded vectors lie, may not be continuous or allow easy interpolation. For a generative model this becomes a problem, since you want to sample randomly from the latent space or generate variations on an input image from a continuous latent space.

Variational Autoencoders have latent spaces that are continuous by design, allowing easy random sampling and interpolation. To achieve this, the hidden nodes of the encoder do not output an encoding vector but rather two vectors of the same size: a vector of means and a vector of standard deviations. Each of these hidden nodes will act as its own Gaussian distribution. The new vectors form the parameters of a so-called latent vector of random variables: the ith element of the mean and standard deviation vectors corresponds to the ith random variable's mean and standard deviation. We sample from this vector to obtain the sampled encoding that is passed to the decoder; the decoder can then sample randomly from the probability distributions for input vectors. This process is stochastic generation: even for the same input, while the mean and standard deviation remain the same, the actual encoding will vary somewhat on every pass simply due to sampling.
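Here is a hedged PyTorch sketch of the encoder-outputs-two-vectors idea and the reparameterized sampling just described, together with the two loss terms (reconstruction and latent/KL) that the next paragraph discusses. Layer sizes are arbitrary and the whole thing is an illustration, not a production model:

```python
# A minimal VAE sketch in PyTorch: the encoder outputs a mean vector and a
# (log-)variance vector, and the latent code is sampled from them.
# Layer sizes are arbitrary and chosen only for illustration; inputs are
# assumed to be flattened images scaled to [0, 1].
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.to_mean = nn.Linear(hidden_dim, latent_dim)     # vector of means
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)   # vector of log-variances
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.encoder(x)
        mean, logvar = self.to_mean(h), self.to_logvar(h)
        std = torch.exp(0.5 * logvar)
        z = mean + std * torch.randn_like(std)   # reparameterized stochastic sampling
        return self.decoder(z), mean, logvar

def vae_loss(reconstruction, x, mean, logvar):
    # Reconstruction loss: how similar the output is to the input.
    recon = F.binary_cross_entropy(reconstruction, x, reduction="sum")
    # Latent (KL) loss: how close the latent distribution is to a standard normal.
    kl = -0.5 * torch.sum(1 + logvar - mean.pow(2) - logvar.exp())
    return recon + kl
```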
The autoencoder's loss function minimizes both the reconstruction loss (how similar the autoencoder's output is to its input) and the latent loss (how close its hidden nodes are to a normal distribution). The smaller the latent loss, the less information can be encoded, which boosts the reconstruction loss. As a result, the VAE is locked in a trade-off between the latent loss and the reconstruction loss. When the latent loss is small, the generated images resemble the training images too closely, but they look bad; when the reconstruction loss is small, the reconstructed training images look good, but novel generated images are far from them. Obviously, we want both, so it is important to find a nice equilibrium.

VAEs work with remarkably diverse types of data, sequential or non-sequential, continuous or discrete, even labeled or completely unlabelled, which makes them highly powerful generative tools. A major drawback of VAEs is the blurry outputs they generate. As suggested by Dosovitskiy & Brox, VAE models tend to produce unrealistic, blurry samples. This is caused by the way data distributions are recovered and loss functions are calculated. A 2017 paper by Zhao et al. suggested modifying VAEs to not use the variational Bayes method, in order to improve output quality.

Generative Adversarial Networks

Generative Adversarial Networks, or GANs, are a deep-learning-based generative model that is able to generate new content. The GAN architecture was first described in the 2014 paper by Ian Goodfellow et al. titled "Generative Adversarial Networks." GANs adopt a supervised learning approach using two sub-models:

Generator – the model used to generate new plausible examples from the problem domain.

Discriminator – the model used to classify examples as real (from the domain) or fake (generated).

The two models compete against each other in a zero-sum game. The generator directly produces samples. Its adversary, the discriminator, attempts to distinguish between samples drawn from the training data and samples drawn from the generator. The game continues until the discriminator model is fooled about half the time, meaning the generator model is generating plausible examples. (Image credit: Thalles Silva)

"Zero-sum" means that when the discriminator successfully identifies real and fake samples, it is rewarded, or its parameters remain the same, while the generator is penalized with updates to its model parameters, and vice versa. Ideally, the generator generates perfect replicas from the input domain every time, and the discriminator cannot tell the difference, predicting "unsure" (e.g., 50% for real and fake) in every case. This is essentially an actor-critic model.

It is important to remember that each model can overpower the other. If the discriminator is too good, it returns values so close to 0 or 1 that the generator struggles to read the gradient. If the generator is too good, it exploits weaknesses in the discriminator, leading to false negatives. Both neural networks must have a similar "skill level," achieved by choosing respective learning rates.
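To make the zero-sum training dynamic concrete, here is a hedged PyTorch sketch of one training step for a generator/discriminator pair; the network definitions are omitted, the names are illustrative, and the loop is a simplification of the usual recipe rather than anyone's exact implementation:

```python
# A simplified sketch of one GAN training step in PyTorch.
# `generator` and `discriminator` are assumed to be already-defined nn.Modules;
# the discriminator is assumed to output a probability that its input is real.
import torch
import torch.nn.functional as F

def gan_training_step(generator, discriminator, g_opt, d_opt, real_batch, latent_dim=64):
    batch_size = real_batch.size(0)
    real_labels = torch.ones(batch_size, 1)
    fake_labels = torch.zeros(batch_size, 1)

    # 1) Update the discriminator: reward it for telling real from fake.
    z = torch.randn(batch_size, latent_dim)
    fake_batch = generator(z).detach()          # do not backprop into the generator here
    d_loss = (F.binary_cross_entropy(discriminator(real_batch), real_labels) +
              F.binary_cross_entropy(discriminator(fake_batch), fake_labels))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # 2) Update the generator: penalize it when the discriminator spots its fakes.
    z = torch.randn(batch_size, latent_dim)
    g_loss = F.binary_cross_entropy(discriminator(generator(z)), real_labels)
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()

    return d_loss.item(), g_loss.item()
```

When the two losses stabilize with the discriminator hovering around chance, the generator is producing samples it can no longer tell apart from real data, which is the equilibrium described above.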
Generator model

The generator takes a fixed-length random vector as input and generates a sample in the domain. The vector is drawn randomly from a Gaussian distribution. After training, points in this multidimensional vector space correspond to points in the problem domain, forming a compressed representation of the data distribution. As with VAEs, this vector space is called a latent space, i.e., a vector space comprised of latent variables. In the case of GANs, the generator applies meaning to points in a chosen latent space: new points drawn from the latent space can be provided to the generator model as input and used to generate new and different output examples. After training, the generator model is kept and used to generate new samples.

Discriminator model

The discriminator model takes an example as input (real, from the training dataset, or generated by the generator model) and predicts a binary class label of real or fake (generated). The discriminator is a normal (and well-understood) classification model. After the training process, the discriminator is discarded, since we are interested in a robust generator.

GANs can produce viable samples and have stimulated a lot of interesting research and writing. However, there are downsides to using a GAN in its plain version. Images are generated from some arbitrary noise: when generating a picture with specific features, you cannot determine what initial noise values would produce that picture; instead, you have to search over the entire distribution. A GAN only distinguishes between "real" and "fake" images, and there are no constraints saying that a picture of a cat has to look like a cat, so the generated image may contain no actual object at all even though its style resembles the training pictures. Finally, GANs take a long time to train: a GAN might take hours on a single GPU, and more than a day on a single CPU.

Flow-based models

Flow-based generative models are exact log-likelihood models with tractable sampling and latent-variable inference. In general terms, Flow-Based Models apply a stack of invertible transformations to a sample from a prior so that the exact log-likelihood of observations can be computed. Unlike the previous two algorithms, the model learns the data distribution explicitly, and therefore the loss function is the negative log-likelihood. As a rule, a flow model f is constructed as an invertible transformation that maps a high-dimensional random variable x to a standard Gaussian latent variable z = f(x), as in nonlinear independent component analysis. The key idea in the design of a flow model is that it can be any bijective function and can be formed by stacking individual simple invertible transformations. Explicitly, the flow model f is constructed by composing a series of invertible flows as f(x) = f1 ∘ ··· ∘ fL(x), with each fi having a tractable inverse and a tractable Jacobian determinant. Flow-based models fall into two large categories: models with normalizing flows and models with autoregressive flows, which try to enhance the performance of the basic model.
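As an illustration of "stacking simple invertible transformations with tractable Jacobians," here is a hedged sketch of a toy flow built from element-wise affine layers. Real models such as RealNVP or Glow use far more expressive invertible coupling layers, so treat this purely as a demonstration of the log-likelihood bookkeeping:

```python
# A toy normalizing-flow sketch: a stack of invertible element-wise affine layers.
# Each layer z = exp(log_scale) * x + shift has a trivial inverse and a
# log-Jacobian-determinant equal to sum(log_scale). Real flows (RealNVP, Glow)
# use much richer invertible layers; this only illustrates the bookkeeping.
import math
import torch
import torch.nn as nn

class AffineFlowLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.log_scale = nn.Parameter(torch.zeros(dim))
        self.shift = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        z = x * torch.exp(self.log_scale) + self.shift
        log_det = self.log_scale.sum()            # log |det Jacobian| of this layer
        return z, log_det

class ToyFlow(nn.Module):
    def __init__(self, dim, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(AffineFlowLayer(dim) for _ in range(n_layers))

    def log_prob(self, x):
        # Change of variables: log p(x) = log N(f(x); 0, I) + sum of log-dets.
        log_det_total = x.new_zeros(())
        z = x
        for layer in self.layers:
            z, log_det = layer(z)
            log_det_total = log_det_total + log_det
        log_pz = -0.5 * (z ** 2 + math.log(2 * math.pi)).sum(dim=-1)
        return log_pz + log_det_total

# Training minimizes the negative log-likelihood:
# loss = -flow.log_prob(batch).mean()
```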
Normalizing Flows

Being able to do good density estimation is essential for many machine learning problems, but it is intrinsically complex: when we need to run backpropagation in deep learning models, the embedded probability distribution needs to be simple enough to calculate the derivative efficiently. The conventional solution is to use a Gaussian distribution in latent-variable generative models, even though most real-world distributions are much more complicated. Normalizing Flow (NF) models, such as RealNVP or Glow, provide a robust distribution approximation. They transform a simple distribution into a complex one by applying a sequence of invertible transformation functions. Flowing through a series of transformations, we repeatedly substitute the variable for the new one according to the change-of-variables theorem and eventually obtain a probability distribution of the final target variable.

Models with Autoregressive Flows

When the flow transformation in a normalizing flow is framed as an autoregressive model, where every dimension in a vector variable is conditioned on the preceding dimensions, this variation of a flow model is referred to as an autoregressive flow. It is a step forward compared to models with plain normalizing flows. The popular autoregressive flow models are PixelCNN for image generation and WaveNet for 1-D audio signals. They both consist of a stack of causal convolutions, i.e., convolutions that respect the ordering: the prediction at a specific timestamp consumes only data observed in the past. In PixelCNN, the causal convolution is performed by a masked convolution kernel. WaveNet shifts the output by several timestamps into the future, so that the output is aligned with the last input element.

Flow-based models are conceptually attractive for modeling complex distributions but are limited by density-estimation performance issues compared to state-of-the-art autoregressive models. Besides, though flow models might initially substitute for GANs in producing decent output, there is a significant gap in the computational cost of training between them, with Flow-Based Models taking several times as long as GANs to generate images of the same resolution.

Conclusion

Each of the algorithms has its benefits and limitations in terms of accuracy and efficiency. While GANs and Flow-Based Models usually generate better or closer-to-life images than VAEs, VAEs are more time- and parameter-efficient than Flow-Based Models. In short, GANs are parallel and efficient but not reversible; Flow Models are reversible and parallel but not efficient; and VAEs are reversible and efficient but not parallel. In practice, this implies a constant trade-off between output quality, the training process, and efficiency.

How to Build Ethical AI: Guiding Principles and Actionable Steps

What springs to mind when you hear or read something about AI ethics? Timnit Gebru, data biases, automated inequality, discrimination at scale? But have you ever thought about what we actually mean by the term "ethical AI"? Which ideas about ethical AI are shared and agreed upon? And how do you develop actionable steps toward ethical AI practices at your company? Let's dive right into it.

There is no universal, conventionally shared practice worldwide, but the need for an ethical AI framework is rising. Meanwhile, some principles are mentioned more often than others when drafting ethical AI policies, as per the research The global landscape of AI ethics guidelines by Anna Jobin et al. Given their findings, AI ethics mainly concentrates on the following five ethical principles: "transparency, justice and fairness, non-maleficence, responsibility and privacy." Skip to the next block if you are eager to learn the actionable steps to implement ethical AI at your company; here, we'd like to provide you with some foundational principles and useful links for further research.

Based on the OECD recommendations, you can think of transparency in AI as facilitating a general understanding of AI systems and providing all stakeholders with awareness of possible outcomes of and interactions with AI systems. It is part of a bigger story related to the black-box and white-box AI principles. If you are interested in transparency practices in AI, check out our blog Introduction to the White-Box AI: the Concept of Interpretability for more details.

As per the guidelines of the European Commission, the justice and fairness principle relates to "ensuring equal and just distribution of both benefits and costs, and ensuring that individuals and groups are free from unfair bias, discrimination and stigmatisation." To make this principle more tangible, take automated gender recognition in commercial solutions as an example: there is still a lot to change there.

Non-maleficence means that "AI systems and the environments in which they operate must be safe and secure" and defended from malicious use, as the European Commission defines. One of the interpretations proposed by Anna Jobin et al. implies "avoidance of specific risks or potential harms — for example, intentional misuse via cyberwarfare and malicious hacking — and suggest risk-management strategies."

Privacy is closely related to data governance, at least per the European Commission's guidelines, and relates to the principle of non-maleficence: AI actors should ensure that data collected or produced as an output by the AI system does not harm its users. Responsibility, in turn, relates to "responsibility and accountability for AI systems and their outcomes, both before and after their development, deployment and use." Also check out this great list on AI ethics curated by Eirini Malliaraki for further research.

However, developing ethical AI practices is easier said than done. Shipping a bias-free AI-powered product can be tricky, and tech giants have demonstrated this multiple times. So, how do you transform nebulous theory into actionable steps at your company? First, AI ethics should not exist in a vacuum but should be adjusted to the values and principles of your company. You will ensure a sustainable and scalable ethical AI program when you develop ethical AI practices on top of your mission and values. Consider the following steps. AI ethics is a responsibility of the whole company, but it relates to the C-level first.
Thus, executives set the tempo for how employees adopt these guidelines. Usually, the data governance department deals with compliance and privacy issues, so it could also handle AI ethics challenges. Consider inviting other relevant experts and ethicists to strengthen your ethical AI team.

An internal document with your company's ethical standards, explained in clear terms, comes in handy in case of emergencies: product owners, data collectors, and managers will then know their scenarios when a crisis comes knocking at the door. Do not forget to define precise KPIs and quality assurance practices, and keep the framework updated.

Consider the specifics of your domain. In retail, for instance, it is crucial to measure the quality of recommendations, which should be free of biased associations related to particular groups of society. Someone has probably already faced this challenge and found out how to deal with it. Take the healthcare industry as an example: patients cannot be treated until they express their consent, and it should be the same for users' data being collected, used, and shared further. All privacy details should be communicated clearly to ensure that users understand the possible outcomes. Check out our blog HIPAA vs. GDPR: major acts regulating health data protection for more details.

Ensure that everyone knows about your code of conduct regarding ethical AI. For example, just a decade ago companies paid little attention to cybersecurity, but now every organization faces 22 security breaches every year, as per the State of Cybersecurity Report 2020 by Accenture, which makes cybersecurity a point of primary concern in every organization. Developing ethical AI practices is going the same way, so it is better to define the crucial ethical infrastructure at the outset. Start building a culture that nurtures an appropriate attitude toward ethical AI: keeping everyone involved informed and motivated to respect the principles is an investment in the reputation of your AI-driven product. It is also crucial to constantly monitor changes in the AI world, since it is not possible to foresee all the outcomes. But developing an ethical AI framework, bearing the best practices in mind, fostering an AI-bias-aware culture at your company, and monitoring the changes around ethical AI will make your future safer.

Ethical AI is not a globally agreed-upon, conventionalized practice, but researchers suggest the following principles: every organization should consider "transparency, justice and fairness, non-maleficence, responsibility and privacy" when working with AI. In practice, we propose to use your existing organizational infrastructure, adopt the best methods available, facilitate an AI-bias-aware culture, and constantly monitor changes in this domain.

What’s Next for Generative Adversarial Networks (GANs): Latest Techniques and Applications

Generative Adversarial Networks (GANs) are improving year over year. In October 2021, NVIDIA presented a new model, StyleGAN3, that outperforms StyleGAN2 with its hierarchical refinement. The new model resolves the "sticking" issues of StyleGAN2 and learns to mimic camera motion. Moreover, StyleGAN3 promises to improve models for video and animation generation. That is impressive progress compared to 2014, when GANs entered the picture with low-resolution images. We are also witnessing applications beyond simple image generation. They include, but are not limited to: medical products, training data, scientific simulation development, improvements to the augmented reality (AR) experience, and speech enhancement and generation. Let's delve into the most impressive applications we've got so far!

Lots of articles illustrate GANs' abilities when it comes to image generation and editing. You've probably read about functions like face aging, text-to-image translation, frontal face view or pose generation, and so on. As a starter, we recommend 18 Impressive Applications of Generative Adversarial Networks (GANs) and Best Resources for Getting Started With GANs by Jason Brownlee. In this article, we'd like to go deeper into the latest applications of GANs in real life.

Labeling medical datasets is expensive and time-consuming, but it seems GANs have something to offer. Since GANs predominantly belong to data augmentation techniques, we'd like to dwell on the latest updates in healthcare. Here, data augmentation means GANs helping computer vision professionals who struggle with class imbalance in training datasets, which leads to biased models. Data augmentation with GANs helps fight overfitting and the inability to generalize to novel examples. This is how GANs increase performance for underrepresented classes in chest X-ray classification, as per the 2021 research by Sundaram et al. They showed that GAN-based data augmentation is more efficient than standard data augmentation. Meanwhile, the researchers point out that GAN data augmentation was most effective when applied to small, significantly imbalanced datasets, and it has a limited impact on large datasets.

Researchers from the University of New Hampshire in the US also demonstrated that GAN-based data augmentation is beneficial for neuroimaging. Functional near-infrared spectroscopy (fNIRS) is a neuroimaging technique for mapping the functioning human cortex. fNIRS applies to brain-computer interfaces, so a large amount of new data for training deep learning classifiers is crucial. A Conditional Generative Adversarial Network (CGAN) combined with a CNN classifier led to 96.67% task classification accuracy, as per the 2021 research by Sajila D. Wickramaratne and Shaad Mahmud. A sketch of how such augmentation is typically wired in follows below.
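The hypothetical snippet below assumes you already have a `trained_generator` conditioned on class labels and simply uses it to top up the minority class before training a classifier; every name is illustrative, and the cited studies each use their own pipelines:

```python
# A hedged sketch of GAN-based augmentation for an underrepresented class.
# `trained_generator` is assumed to be a class-conditional generator trained
# beforehand; every name here is illustrative, not taken from the cited papers.
import torch
from torch.utils.data import TensorDataset, ConcatDataset

def augment_minority_class(real_images, real_labels, trained_generator,
                           minority_class, n_synthetic, latent_dim=128):
    """Generate extra samples for the minority class and append them."""
    z = torch.randn(n_synthetic, latent_dim)
    class_labels = torch.full((n_synthetic,), minority_class, dtype=torch.long)
    with torch.no_grad():
        synthetic_images = trained_generator(z, class_labels)

    real_ds = TensorDataset(real_images, real_labels)
    synthetic_ds = TensorDataset(synthetic_images, class_labels)
    # The classifier is then trained on the combined, rebalanced dataset.
    return ConcatDataset([real_ds, synthetic_ds])
```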
Researchers from the University of California, Berkeley and Glidewell Dental Labs presented one of the first real applications for medical product development. With the help of generative models, dental crowns can be designed to reach the same morphology quality as dental experts achieve. It takes years of training for a professional in the dental industry to develop synthetic crowns, so this work paves the way for mass customization of products in the healthcare industry. At the same time, GANs are a good fit for super-resolution medical imaging such as low-dose computed tomography (CT) and low-field magnetic resonance imaging (MRI). A GAN-based method proposed in 2020, Medical Images Super-Resolution using Generative Adversarial Networks (MedSRGAN), increases radiologists' efficiency: it helps improve the quality of scans and avoid the harmful effects these procedures may bring.

Automatic speech recognition (ASR) is one of our areas of expertise. Speech enhancement GANs (SEGAN) are applied to noisy inputs to refine them and produce a high-quality output. This function is crucial for people with speech impairments, for example, so GANs could enhance their quality of life. Recently, Huy Phan et al. proposed using "multiple generators that are chained to perform a multi-stage enhancement." As the researchers state, the new models, ISEGAN and DSEGAN, perform better than SEGAN.

GANs also come in handy for augmented reality (AR) scenes thanks to their creative generation capabilities. For example, recent use cases include completing environmental maps with lighting, reflections, and shadows. ARShadowGAN, presented in 2020 by Daquan Liu et al., generates shadows for virtual objects in single-light scenes. This technology bridges the real-world environment and the virtual object's shadow without 3D geometric details or any explicit estimation of illumination.

When it comes to advertising, the phrase "time is money" means a lot. One of our use cases involved automated generation of advertising images at scale. It costs time and money for a designer to resize images for marketing campaigns across channels, from social platforms to Amazon. However, Super-Resolution Using a Generative Adversarial Network (SRGAN) for single-image super-resolution can deal with it: using SRGAN, it is possible to produce high-quality resized images without any human involvement in the design work.

"What could be better than data for a data scientist? More data!" This joke became a real application thanks to GANs. As any neural-network-based model is hungry for training data, generative models that can create labeled training data on demand could become a game-changer. For instance, Zhenghao Fei et al. (University of California, Davis) demonstrated how a semantically constrained GAN (CycleGAN plus a task-constrained network) can eliminate the labor-, cost-, and time-consuming process of data labeling, ensuring more data-efficient and generalizable fruit detection. Simply put, the semantically constrained GAN can generate realistic day and night images of grapevines from 3D-rendered images while retaining grape position and size. Labeled data generation could also be beneficial in the NLP domain, supporting research on low-resource languages: for instance, Sangramsing Kayte used GANs for text-to-speech for low-resource languages of the Indian subcontinent.

Recent research shows that ML models can leak sensitive information from their training samples. For example, the 2021 paper This Person (Probably) Exists. Identity Membership Attacks Against GAN Generated Faces illustrates that many face images produced by GANs strongly resemble real faces taken from the training data. Researchers propose differential privacy, which could help networks learn the data distribution while protecting the privacy of the training data.

GANs have demonstrated impressive progress compared to 2014, when Ian Goodfellow first introduced them. Though the technique is still young, we are already witnessing how GANs improve the design of medical products, automate image editing for advertising, and merge with AR technology.
At the same time, the privacy of generated data remains a live issue, and differential privacy is one of the techniques to consider.

SciForce is now the European Health Data and Evidence Network (EHDEN) certified partner

We are pleased to announce that SciForce is now the first and only enterprise in Ukraine to become a certified partner of the European Health Data and Evidence Network (EHDEN). With this, we contribute to the wider Observational Health Data Sciences and Informatics (OHDSI) initiative to take advantage of large-scale health data analytics. This step brings us into a community of 47 European SMEs delivering their expertise in health data, data standardization, and interoperability. Read on for details.

EHDEN, with support from the EU's Horizon 2020 program and the European Federation of Pharmaceutical Industries and Associations (EFPIA), aims to take advantage of so-called Big Health Data. In the case of EHDEN, this data may include any information related to patients in Europe that facilitates research, supports the conduct of healthcare research, or manages payment or HR data. As the official website states, EHDEN was launched "to support patients, clinicians, payers, regulators, governments, and the industry in understanding wellbeing, disease, treatments, outcomes, and new therapeutics and devices."

Since August 2020, EHDEN has been certifying European SMEs on standardizing health data to the Observational Medical Outcomes Partnership (OMOP) common data model (CDM) and setting up the needed technical ecosystem. At the moment, EHDEN has built a community of 47 enterprises taking part in this initiative. Before joining the community of certified SMEs cooperating with the European data partners, our professionals completed training within the EHDEN Academy, grasping the technical ecosystem and work standards required to become a certified Data Partner.

Within EHDEN, we can now provide services related to the OHDSI software and tools ecosystem, technical infrastructure services, OHDSI training, OMOP CDM ETL, and the OMOP Standardized Vocabularies. Speaking of healthcare expertise, we are not limiting our capabilities to observational research, data mapping, and ETL only; our healthcare professionals also focus on prediction, digitizing, telehealth services, and impaired speech recognition. Finally, we are excited to share our expertise with our European partners, moving medical science forward.
