SciForce Blog

Read our blog and carry on - Data Science

Stay informed and inspired in the world of AI with us.

Turning Chaos into Clarity: Mastering Unstructured Healthcare Data with AI
SciForce medical team attended the 2023 OHDSI Global Symposium

Since 2015, SciForce has been an active contributor to the OHDSI scientific community. Our medical team is consistently at the forefront of OHDSI events, sharing groundbreaking research and driving advancements in health data harmonization that empower better health decisions and elevate care standards. The fall event was no exception: from October 20 to 22, Polina Talapova, Denys Kaduk, and Lucy Kadets represented our company at the Global OHDSI Symposium in East Brunswick, New Jersey, USA, joining more than 440 collaborators from around the world. The symposium affords unique opportunities to dive deeper into the OMOP common data model standards and tools, multinational and multi-center collaborative research strategies, and completed large-scale multinational research projects.

Our Medical Team Lead, Polina Talapova, presented the topic “Mapping of Critical Care EHR Flowsheet data to the OMOP CDM via SSSOM” during a lightning talk session, emphasizing the significance of mapping metadata generation and storage for producing trustworthy evidence. In turn, Denys and Lucy participated in the OHDSI Collaborator Showcase, where they successfully presented a prototype and poster detailing Jackalope Plus, an AI-enhanced solution that streamlines the creation, visualization, and management of mappings, reducing manual effort and ensuring precision in capturing details from real-world health data.

Our colleagues also had the opportunity to meet in person with leading OHDSI researchers such as George Hripcsak, Patrick Ryan, Marc Suchard, Andrew Williams, Rimma Belenkaya, Paul Nagy, Mui Van Zandt, Christian Reich, Anna Ostropolets, Martijn Schuemie, Dmytro Dymshyts, Dani Prieto-Alhambra, Juan M. Banda, Seng Chan You, Kimmo Porkka, Alexander Davydov, and Aleh Zhuk, among other distinguished individuals. The event was truly transformative and rewarding, expanding participants’ minds and horizons.
The SciForce team is profoundly grateful to the OHDSI community for the opportunity to be a part of this fantastic journey!

Evolution of Forecasting from the Stone Age to Artificial Intelligence

Each nation has its ancient monument or pagan holiday — the relic of the days when our ancestors tried to persuade their gods to give them more rain, no rain, a better harvest, fewer wars and many other things considered essential for survival. However, neither Stonehenge nor jumping over fire could predict the gods’ reaction. It was a totally reactive world with no forecast. As time passed, though, people started to look into the future more inquisitively, trying to understand what would be waiting for them, and the science of prediction emerged. In this article, we’ll see how prediction evolved over time, shaping our technologies, expectations and worldview.

Naïve forecasting is an estimating technique in which the last period’s values are used as this period’s forecast, without adjusting them or attempting to establish causal factors. In other words, a naive forecast is just the most recently observed value. It is calculated by the formula Fₜ₊ₖ = yₜ, where the _k_-step-ahead naive forecast (Fₜ₊ₖ) equals the observed value at time _t_ (yₜ). In ancient times, before such formulas, communities typically relied on observed patterns and recognized sequences of events for weather forecasting. We can see the remnants of these techniques in our everyday lives: we foresee next Monday’s routine based on the previous Monday, or expect spring to come in March (even though such expectations rely more on our imagination than on recorded seasonal changes). In industry and commerce, naive forecasting is used mainly for comparison with the forecasts generated by better, more sophisticated techniques. However, sometimes it is the best that can be done for many time series, including most stock price data. It also helps to baseline the forecast: by tracking the naive forecast over time you can estimate the value that forecasting adds to the planning process.
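In code, the naive forecast described above is almost a one-liner; a minimal sketch (the function name and sample data are ours):

```python
# Naive forecast: every k-step-ahead forecast equals the last observed value.
def naive_forecast(series, k=1):
    """Return the k-step-ahead naive forecast: F(t+k) = y(t)."""
    if not series:
        raise ValueError("series must contain at least one observation")
    return [series[-1]] * k  # each future step just repeats the last value

prices = [101.2, 102.5, 101.9, 103.4]
print(naive_forecast(prices, k=3))  # [103.4, 103.4, 103.4]
```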
It reveals how difficult products are to forecast, whether it is worthwhile to spend time and effort on forecasting with more sophisticated methods, and how much a given method adds to the forecast. Even if it is not the most accurate forecasting method, it provides a useful benchmark for other approaches.

Statistical forecasting is a method based on a systematic statistical examination of data representing the past observed behavior of the system to be forecast, including observations of useful predictors outside the system. In simple terms, it uses statistics based on historical data to project what could happen in the future. As the late 19th and early 20th centuries were stricken by a series of crises that led to severe panics — in 1873, 1893, 1907, and 1920 — and also by substantial demographic change, as countries moved from being predominantly agricultural to being industrial and urban, people were struggling to find stability in a volatile world. Statistics-based forecasting, invented at the beginning of the 20th century, showed that economic activity was not random but followed discernible patterns that could be predicted. The two major statistical forecasting approaches are time series forecasting and model-based forecasting. _Time series forecasting_ is a short-term, purely statistical method that predicts short-term changes based on historical data. It works on time-based data (years, days, hours, and minutes) to find hidden insights. The simplest time series forecasting technique is the simple moving average (SMA), calculated by adding up the last _n_ periods’ values and dividing by _n_. The moving average value is then used as the forecast for the next period. _Model-based forecasting_ is more strategic and long-term; it accounts for changes in the business environment and for events with little data, and it requires management input.
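The simple moving average described above can be sketched as follows (the function name and data are ours):

```python
def sma_forecast(series, n=3):
    """Forecast the next period as the mean of the last n observations."""
    window = series[-n:]  # the last n periods' values
    return sum(window) / len(window)

demand = [120, 135, 128, 140, 138]
print(sma_forecast(demand, n=3))  # mean of 128, 140, 138
```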
Model-based forecasting techniques are similar to conventional predictive models, which have independent and dependent variables, but the independent variable is now time. The simplest of such methods is linear regression: given a training set, we estimate the values of the regression coefficients to forecast future values of the target variable. Over time, the basic statistical methods of forecasting have seen significant improvements, forming a spectrum of data-driven forecasting methods and modeling techniques. Data-driven forecasting refers to a number of time-series forecasting methods in which there is no difference between a predictor and a target. The most commonly employed data-driven time series forecasting methods are exponential smoothing (including the Holt and Holt-Winters variants) and ARIMA. Exponential smoothing was first suggested in the statistical literature, without citation to previous work, by Robert Goodell Brown in 1956. It is a way of “smoothing” out data by removing much of the “noise”, thereby giving a better forecast. It assigns exponentially decreasing weights as observations get older: ŷₓ = α·yₓ + (1−α)·ŷₓ₋₁, a weighted moving average with two weights, α and 1−α. This simplest form of exponential smoothing can be used for a short-term forecast of a time series that can be described by an additive model with a constant level and no seasonality. Charles C. Holt proposed a variation of exponential smoothing in 1957 for a time series that can be described by an additive model with an increasing or decreasing trend and no seasonality. For a time series with both trend and seasonality, Holt-Winters exponential smoothing, or triple exponential smoothing, is more accurate. It is an improvement of Holt’s algorithm that Peter R. Winters offered in 1960.
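The simplest form of exponential smoothing can be written in a few lines (a sketch; the function name and sample series are ours):

```python
def exponential_smoothing(series, alpha=0.3):
    """Smooth a series: s[x] = alpha * y[x] + (1 - alpha) * s[x-1]."""
    smoothed = [series[0]]  # initialize with the first observation
    for y in series[1:]:
        smoothed.append(alpha * y + (1 - alpha) * smoothed[-1])
    return smoothed

series = [3.0, 10.0, 12.0, 13.0, 12.0, 10.0, 12.0]
print(exponential_smoothing(series, alpha=0.5))
```

The last smoothed value serves as the one-step-ahead forecast; larger α weights recent observations more heavily.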
The idea behind this algorithm is to apply exponential smoothing to the seasonal components in addition to level and trend. The smoothing is applied across seasons: for example, the seasonal component of the 3rd point into the season is exponentially smoothed with the 3rd point of last season, the 3rd point two seasons ago, and so on. Evident seasonal trends can thus be expected to continue in the proposed forecast.

ARIMA is a statistical technique that uses time series data to predict the future. It is similar to exponential smoothing in that it is adaptive, can model trends and seasonal patterns, and can be automated. However, ARIMA models are based on autocorrelations (patterns in time) rather than a structural view of level, trend and seasonality. All in all, ARIMA models take trends, seasonality, cycles, errors and non-stationary aspects of a data set into account when making forecasts. ARIMA checks for stationarity in the data and whether the data shows a constant variance in its fluctuations over time. The idea behind ARIMA is that the final residual should look like white noise; otherwise, there is still information in the data to extract. ARIMA models tend to perform better than exponential smoothing models for longer, more stable data sets, and not as well for noisier, more volatile data. While many time-series models can be built in spreadsheets, the fact that they are based on historical data makes them easy to automate, so software packages can produce large numbers of these models automatically across large data sets. In particular, since data can vary widely and the implementation of these models varies as well, automated statistical software can assist in determining the best fit on a case-by-case basis. A step forward compared to pure time series models, dynamic regression models allow incorporating causal factors such as prices, promotions and economic indicators into forecasts.
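To illustrate the autoregressive core of ARIMA, here is a minimal AR(1) model fitted by least squares with NumPy. This is a sketch of the AR part only; full ARIMA implementations (e.g. in statsmodels) also handle differencing and moving-average terms. The function names and data are ours:

```python
import numpy as np

def fit_ar1(series):
    """Fit y[t] = c + phi * y[t-1] by ordinary least squares."""
    y_prev, y_next = series[:-1], series[1:]
    X = np.column_stack([np.ones(len(y_prev)), y_prev])
    (c, phi), *_ = np.linalg.lstsq(X, y_next, rcond=None)
    return c, phi

def forecast_ar1(series, c, phi, steps=3):
    """Iterate the fitted recurrence forward to forecast future values."""
    out, last = [], series[-1]
    for _ in range(steps):
        last = c + phi * last
        out.append(last)
    return out

series = np.array([1.0, 1.5, 1.75, 1.875, 1.9375])  # generated by y = 1 + 0.5*y_prev
c, phi = fit_ar1(series)
print(forecast_ar1(series, c, phi, steps=2))
```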
The models combine standard OLS (Ordinary Least Squares) regression (as offered in Excel) with the ability to use dynamic terms to capture trend, seasonality and time-phased relationships between variables. A dynamic regression model lends insight into relationships between variables and allows for “what if” scenarios. For example, if we study the relationship between sales and price, the model allows us to create forecasts under varying price scenarios: “What if we raise the price?” “What if we lower it?” Generating these alternative forecasts can help you determine an effective pricing strategy. A well-specified dynamic regression model captures the relationship between the dependent variable (the one you wish to forecast) and one or more independent variables (simple and multiple regression, respectively). To generate a forecast, you must supply forecasts for your independent variables. However, some independent variables are not under your control — think of weather, interest rates, prices of materials, competitive offerings, etc. — and poor forecasts for the independent variables will lead to poor forecasts for the dependent variable. A classic example is forecasting demand for electricity using data on the weather (e.g. when people are likely to run their heat or AC). In contrast to time series forecasting, regression models require knowledge of the technique and experience in data science. Building a dynamic regression model is generally an iterative procedure: you begin with an initial model and experiment with adding or removing independent variables and dynamic terms until you arrive at an acceptable model. Everyone who has ever had a look at data or computer science knows that linear regression is in fact the basic prediction model in machine learning, which brings us to the final destination of our journey — Artificial Intelligence.
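A toy version of such a regression with a causal factor, fitted by OLS with NumPy (the variable names and data are invented for illustration):

```python
import numpy as np

# Demand explained by temperature, a causal factor outside our control.
temperature = np.array([20.0, 25.0, 30.0, 35.0, 40.0])
demand = np.array([200.0, 250.0, 300.0, 350.0, 400.0])

# Ordinary least squares: demand = intercept + slope * temperature.
X = np.column_stack([np.ones_like(temperature), temperature])
(intercept, slope), *_ = np.linalg.lstsq(X, demand, rcond=None)

# "What if" scenario: forecast demand for a hotter day.
forecast = intercept + slope * 42.0
print(round(slope, 2), round(forecast, 1))
```

A poor temperature forecast fed into the model would, of course, produce a correspondingly poor demand forecast.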
Artificial intelligence and machine learning are considered the tools that can revolutionize forecasting. An AI that can take into account all possible factors that might influence the forecast gives business strategists and planners breakthrough capabilities to extract knowledge from massive datasets assembled from any number of internal and external sources. The application of machine learning algorithms in so-called predictive modeling unearths insights and identifies trends missed by traditional human-configured forecasts. Besides, AI can simultaneously test and learn, constantly refining hundreds of advanced models. The optimal model can then be applied at a highly granular SKU-location level to generate a forecast with improved accuracy. Among the multiple models and techniques for prediction in the ML and AI inventory, we have chosen the one closest to our notion of a truly independent artificial intelligence. An artificial neural network (ANN) is a machine learning approach that models the human brain and consists of a number of artificial neurons. Neural networks can derive meaning from complicated or imprecise data and are used to detect patterns and trends that are not easily detectable either by humans or by machines. We can make use of NNs in any type of industry, as they are very flexible and do not require explicitly hand-crafted algorithms. They are also regularly used to model parts of living organisms and to investigate the internal mechanisms of the brain. The simplest neural network is a fully connected model, which consists of a series of fully connected layers: each neuron is connected to every neuron in the previous layer, and each connection has its own weight. Such a model resembles a simple regression model that takes one input and spits out one output — it basically takes the price from the previous day and forecasts the price of the next day. Such models repeat the previous values with a slight shift.
However, fully connected models are not able to predict the future from a single previous value. With the emergence of deep learning techniques, neural networks have seen significant improvements in accuracy and in their ability to tackle the most sophisticated and complex tasks. Recurrent neural networks, introduced more recently, deal with sequence problems: they can retain state from one iteration to the next by using their own output as input for the next step. In programming terms, this is like running a fixed program with certain inputs and some internal variables. Such models can learn to reproduce the yearly shape of the data and do not have the lag associated with a simple fully connected feed-forward neural network. With the development of artificial intelligence, forecasting as we knew it has transformed into a new phenomenon. Traditional forecasting is a technique that takes data and predicts its future values by looking at its unique trends. Artificial intelligence and Big Data introduced predictive analysis, which factors in a variety of inputs and predicts future behavior — not just a number. In forecasting, there is no separate input or output variable, but in predictive analysis you can use several input variables to arrive at an output variable. While forecasting is insightful and certainly helpful, predictive analytics can provide you with some pretty helpful people analytics insights. People analytics leaders have definitely caught on.
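As a toy version of the fully connected next-day predictor described earlier — a single input (the previous value), a single connection weight, a single output — here is a one-neuron model trained by gradient descent on squared error (all names and data are ours):

```python
import numpy as np

# Series where each value is 1.1x the previous one.
series = np.array([10.0, 11.0, 12.1, 13.31, 14.641])
x, y = series[:-1], series[1:]  # input: previous day; target: next day

w = 0.0      # the single connection weight
lr = 0.001   # learning rate
for _ in range(5000):
    grad = 2 * np.mean((w * x - y) * x)  # gradient of mean squared error
    w -= lr * grad

print(round(w, 3))            # the weight converges close to 1.1
print(round(w * series[-1], 2))  # next-day forecast
```

As the article notes, such a model essentially shifts the previous value forward by a learned factor; it cannot capture longer-range structure in the sequence.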

Big Data is not so big: Data Science for small- and medium-sized enterprises

Current advances in technology are in many ways fueled by the growing flow of data coming from multiple sources and analyzed to create competitive advantage. Both individual users and businesses are switching to digital systems¹, which in turn generate pools of information. Organizations, for their part, share data with other companies, giving rise to digital ecosystems that begin to blur traditional industry borders. As the amount of data available grows, its size, diversity, and applications are accelerating at a near-exponential rate, and businesses are discovering that traditional data management systems and strategies do not have the means to support the demands of the new data-driven world. If several years ago data analytics was used mostly in finance, sales and marketing (such as customer targeting) and risk analysis, today analytics are everywhere²: HR, manufacturing, customer service, security, crime prevention and much more. As Ashish Thusoo, co-founder and CEO of Qubole, pointed out, “A new generation of cloud-native, self-service platforms have become essential to the success of data programs, especially as companies look to expand their operations with new AI, machine learning and analytics initiatives.” According to a report by Qubole³, while only 9% of businesses already support self-service analytics, 61% express plans to move to a self-service analytics model. With different forms of data collected and connected to aid businesses in drawing analogies between datasets, coming up with actionable insights and improving decision-making, Big Data and Data Science have moved to the foreground of the industrial and commercial sector. However, the volume of data may not be the decisive factor for optimizing business operations. Small- and medium-sized businesses need to understand the benefits that intelligent data analytics can bring and the opportunities for data collection and management.
Big Data is a term covering large collections of heterogeneous data whose size or type is beyond the ability of traditional databases to capture, manage, and process. Big Data is characterized along several dimensions. _Volume_: with data coming from sensors, business transactions, social media and machines, the problem of gathering the sheer amount of data required for analytics is considered solved. _Velocity_: the pace and regularity at which data flows in. It is critical that the flow of data is massive and continuous, and that the data can be obtained in real time or with only a milliseconds-to-seconds delay. _Variety_: for the data to be representative, it should come from various sources and in many types and formats. The initial concept has since evolved to capture other factors that impact the effectiveness of manipulations with data. _Variability_: in addition to increasing velocities and varieties, data flows can be highly inconsistent, with periodic daily, seasonal or event-triggered peaks that need to be taken into account in analytics. _Veracity_: in the most general terms, data veracity is the degree of accuracy or truthfulness of a data set in terms of its source, type, and processing techniques. As technology evolves, more aspects of data come into the foreground, giving rise to new big Vs. Yet even when the amount of data collected is sufficient for analytics, that alone cannot guarantee that the analytical findings will be useful for the company. The problems that companies face in their quest for effective analytics can be triangulated into the overabundance of versatile data, the lack of tools, and the talent shortage. On the data processing and machine learning side, analyzing extremely large data sets (40%), ensuring adequate staffing and resources (38%) and integrating new data into existing pipelines (38%) were called the primary obstacles to implementing projects.
The research firm Gartner forecasts that in 2019 we will see 14.2 billion connected things in use⁵, resulting in a never-ending stream of information that can become a challenge for drawing meaningful insights. To successfully compete in today’s marketplace, small businesses need the tools larger companies use. In its 2018 Big Data Trends and Challenges report⁶, Qubole, the data activation company, stated that 75 percent of respondents reported a sizeable gap between the potential value of the data available to them and the tools and talent dedicated to delivering it. The spread of new technologies will shift the core skills required to perform a job: the Future of Jobs Report⁷ estimates that by 2022, no less than 54% of employees will require re- and upskilling, and according to Qubole, 83 percent of companies say it is difficult to find data professionals with the right skills and experience. For businesses, these challenges mean choosing between retraining their existing personnel or hiring new talent with the required skills, and between investing in developing their own tools for data collection and processing, purchasing third-party analytical products, or finding subcontractors for Big Data analytics. Big Data affects organizations across practically every industry and of any size, ranging from governments and banking institutions to retailers. Armed with the power of Big Data, industries can turn to predictive manufacturing that can improve quality and output and minimize waste and downtime. Data Science and Big Data analytics can track process and product defects, plan supply chains, forecast output, optimize energy consumption, and support mass customization of manufacturing. The retail industry largely depends on customer relationship building.
Retailers need to know their customers, the most effective way to handle transactions, and the most strategic way to bring back lapsed business — and Big Data provides the best solution for this. Having originated in the financial sector, the use of large amounts of data for customer profiling, expenditure prediction and risk management has become an essential Data Science task in the retail industry. The digital marketing spectrum is probably the biggest application of Data Science and machine learning. From display banners on websites to digital billboards at airports — almost all digital advertisement is driven by Data Science algorithms. Based on the user’s past behavior, digital advertisement ensures a higher CTR than traditional advertisement by targeting the audience in a timely and more demand-based manner. Another facet of digital marketing is recommender systems, or suggestions about similar products that businesses use to promote their products and services in accordance with the user’s interests and the relevance of information. Logistics, while still a new application area for Data Science, benefits from its insights to improve operational efficiency: data science is used to determine the best routes to ship, the best-suited delivery times, and the best mode of transport, ensuring cost efficiency. Furthermore, the data that logistics companies generate using the GPS installed on their vehicles in turn creates new possibilities to explore using Data Science. Current consumers’ search patterns and the requirement of accessing content anywhere, any time, on any device are leading to new business models in media and entertainment. Big Data provides actionable points of information about millions of individuals, predicting what the audience wants, optimizing scheduling, increasing acquisition and retention, and supporting content monetization and new product development.
In education, data-driven insight can impact school systems, students and curriculums by identifying at-risk students and implementing a better system for the evaluation and support of teachers and principals. Big Data analytics is also known as a critical factor in improving healthcare by providing personalized medicine and prescriptive analytics: researchers mine data to see what treatments are effective for particular conditions, identify patterns related to drug side effects, strategize diagnostics, and plan for stocking serums and vaccines.

The first step in processing data is discovering the sources that might be useful for your business. The sources for Big Data generally fall into one of three categories. _Streaming data_: the data that reaches your IT systems from a web of connected devices, often part of the IoT. _Social media data_: the data on social interactions that might be used for marketing, sales and support functions. _Publicly available sources_: massive amounts of data are available through open data sources like the US government’s data.gov, the CIA World Factbook or the European Union Open Data Portal. Harnessing this information is the next step, and it requires choosing strategies for storing and managing the data. _Data storage and management_: at present, there are low-cost options for storing data in clouds that can be used by small businesses. _Amount of data to analyze_: while some organizations don’t exclude any data from their analyses, relying on grid computing or in-memory analytics, others try to determine upfront which data is relevant in order to spare machine resources. _Potential of insights_: generally, the more knowledge you have, the more confident you are in making business decisions; however, so as not to be overwhelmed, it is critical to select only the insights relevant to the specific business or market. The final step in making Big Data work for your business is to research the technologies that help you make the most of Big Data analytics.
Nowadays there is a variety of ready-made solutions for small businesses, such as SAS, ClearStory Data, or Kissmetrics, to name a few. Another option, tailored to your specific needs, is to develop — or subcontract — your own solution. In making this choice, it is useful to weigh the considerations above: storage and management, the amount of data to analyze, and the potential of the insights.

Inside recommendations: how a recommender system recommends

If we think of the most successful and widespread applications of machine learning in business, one example would be recommender systems. Each time you visit Amazon or Netflix, you see recommended items or movies that you might like — the product of the recommender systems incorporated by these companies. Though a recommender system is a rather simple algorithm that discovers patterns in a dataset, rates items and shows the user the items that they might rate highly, such systems have the power to boost the sales of many e-commerce and retail companies. In simple words, these systems predict users’ interests and recommend relevant items. Recommender systems rely on a combination of explicit and implicit information on users and items, including _characteristic information_ (information about items, such as categories and keywords, and about users, with their preferences and profiles) and _user-item interactions_ (information about ratings, number of purchases, likes, and so on). Based on this, recommender systems fall into two categories: _content-based_ systems that use characteristic information, and _collaborative filtering_ systems based on user-item interactions. Besides, there is a complementary method called a _knowledge-based_ system, which relies on explicit knowledge about the item, the user and the recommendation criteria, as well as the class of _hybrid_ systems that combine different types of information.

Content-based systems make recommendations based on the user’s item and profile features. The idea underlying them is that if a user was interested in an item in the past, they will be interested in similar items later. User profiles are constructed using historical interactions or by explicitly asking users about their interests. Of course, pure content-based systems tend to make too-obvious recommendations — because of excessive specialization — and to offer too many similar items in a row.
Well suited for movies (if you want to watch all films starring the same actor), these systems fall short in e-commerce, spamming you with hundreds of watches or shoes. Several similarity measures are commonly used to compare item and profile vectors. _Cosine similarity_: the algorithm finds the cosine of the angle between the profile vector and the item vector; based on the cosine value, which ranges from -1 to 1, items are arranged in descending order of similarity. _Euclidean distance_: since similar items lie in close proximity to each other when plotted in n-dimensional space, we can calculate the distance between items and use it to recommend items to the user; however, Euclidean distance performance degrades in high-dimensional spaces, which limits the scope of its application. _Pearson’s correlation_: the algorithm shows how much two items are correlated, or similar. A major drawback of this family of algorithms is that they are limited to recommending items of the same type.

Unlike content-based systems, collaborative filtering systems utilize user interactions and the preferences of other users to filter for items of interest. The baseline approach to collaborative filtering is matrix factorization. The goal is to complete the unknowns in the matrix of user-item interactions (let’s call it R). Suppose we have two matrices U and I such that U×I is equal to R in the known entries. The product U×I then also gives values for the unknown entries of R, which can be used to generate recommendations. A smart way to find the matrices U and I is by using a neural network. An interesting way of looking at this method is to think of it as a generalization of classification and regression. Though more intricate and smarter, collaborative systems need enough interaction data to work, which means a cold-start problem for new e-commerce websites and new users.
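The three similarity measures above can be sketched with NumPy (the vectors are invented for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (ranges from -1 to 1)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    """Straight-line distance between two points (smaller = more similar)."""
    return np.linalg.norm(a - b)

def pearson_correlation(a, b):
    """Correlation between two vectors: cosine similarity after centering."""
    return cosine_similarity(a - a.mean(), b - b.mean())

profile = np.array([5.0, 3.0, 4.0])  # e.g. a user's weights for three features
item = np.array([4.0, 2.0, 5.0])     # an item's values for the same features
print(cosine_similarity(profile, item))
print(euclidean_distance(profile, item))
print(pearson_correlation(profile, item))
```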
There are two types of collaborative models: memory-based and model-based. _Memory-based methods_ offer two approaches: the first identifies clusters of users similar to a given user and utilizes their interactions to predict that user’s interactions; the second identifies clusters of items that have been rated by a certain user and utilizes them to predict the user’s interaction with a similar item. Memory-based techniques are simple to implement and transparent, but they encounter major problems with large sparse matrices, since the number of user-item interactions can be too low for generating high-quality clusters. In the algorithms that measure similarity between users, the prediction of an item _i_ for a user _u_ is calculated as the weighted sum of the ratings given to item _i_ by other users _v_, with the user similarities as weights:

Pᵤ,ᵢ = Σᵥ sim(u, v) · rᵥ,ᵢ / Σᵥ |sim(u, v)|

where rᵥ,ᵢ is the rating user _v_ gave to item _i_. _Model-based methods_ use machine learning and data mining techniques to predict users’ ratings of unrated items. These methods are able to recommend a larger number of items to a larger number of users than memory-based methods; examples include decision trees, rule-based models, Bayesian methods and latent factor models.

Knowledge-based recommender systems use explicit information about the item assortment and the client’s preferences, and generate corresponding recommendations based on this knowledge. If no item satisfies all the requirements, products satisfying a maximal set of constraints are ranked and displayed. Unlike the other approaches, a knowledge-based system does not depend on large bodies of statistical data about items or user ratings, which makes it especially useful for rarely sold items, such as houses, or when the user wants to specify requirements manually. This approach also avoids the ramp-up, or cold-start, problem, since recommendations do not depend on a base of user ratings.
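The weighted-sum prediction used by memory-based methods can be sketched as follows (the rating data and similarity values are invented; in practice the similarities would be computed from co-rated items):

```python
def predict_rating(target_user, item, ratings, similarity):
    """P(u,i): similarity-weighted average of other users' ratings of item."""
    num, den = 0.0, 0.0
    for other, their_ratings in ratings.items():
        if other == target_user or item not in their_ratings:
            continue
        sim = similarity[target_user][other]
        num += sim * their_ratings[item]
        den += abs(sim)
    return num / den if den else None

ratings = {
    "alice": {"matrix": 5},
    "bob": {"matrix": 4, "up": 3},
    "carol": {"matrix": 2, "up": 5},
}
# Assumed precomputed user-user similarities (illustrative values).
similarity = {"alice": {"bob": 0.9, "carol": 0.1}}
print(predict_rating("alice", "up", ratings, similarity))  # 3.2
```

Alice's predicted rating for "up" leans toward Bob's rating, since Bob is the more similar user.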
Knowledge-based recommender systems have a conversational style, offering a dialog that effectively walks the user down a discrimination tree of product features. They work on two approaches: constraint-based, relying on an explicitly defined set of recommendation rules, and case-based, taking intelligence from different types of similarity measures and retrieving items similar to the specified requirements. Constraint-based recommender systems try to mediate between the weighted hard and soft user requirements (constraints) and the item features. The system asks the user which requirements should be relaxed or modified so that some items exist that do not violate any constraint, and finds a subset of items that satisfy the maximal set of weighted constraints. The items are then ranked according to the weights of the constraints they satisfy and shown to the user with an explanation of their placement in the ranking. The case-based approach relies on a similarity measure: for each customer requirement r ∈ REQ, sim(p, r) expresses how closely the item’s attribute value p matches that requirement, wᵣ is the importance weight for requirement r, and items are ranked by the weighted sum of these similarities. However, in real life, some users may want to maximize certain requirements, minimize others, or simply not be sure what they want, submitting queries that might look like “similar to Item A, but better”. In contrast to content-based systems, the conversational approach of knowledge-based recommenders allows for such scenarios by eliciting users’ feedback, called critiques.

Recent research shows that to improve the effectiveness of recommender systems, it is worth combining collaborative and content-based recommendation. Hybrid approaches can be implemented by making content-based and collaborative-based predictions separately and then combining them; by adding content-based capabilities to a collaborative-based approach (or vice versa); or by unifying the two approaches into one model.
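A minimal sketch of case-based scoring, assuming numeric attributes and a simple distance-based similarity (the attribute names, weights, scales and similarity function are all our assumptions, not a standard):

```python
def attribute_similarity(value, requirement, scale):
    """Map |value - requirement| onto [0, 1]; 1 means a perfect match."""
    return max(0.0, 1.0 - abs(value - requirement) / scale)

def score(item, requirements, weights, scales):
    """Weighted sum of per-requirement similarities: sum of w_r * sim(p, r)."""
    total_w = sum(weights.values())
    return sum(
        weights[r] * attribute_similarity(item[r], req, scales[r])
        for r, req in requirements.items()
    ) / total_w

houses = {
    "house_a": {"price": 300_000, "rooms": 4},
    "house_b": {"price": 450_000, "rooms": 5},
}
requirements = {"price": 320_000, "rooms": 4}  # what the client asked for
weights = {"price": 2.0, "rooms": 1.0}         # price matters more
scales = {"price": 200_000, "rooms": 4}        # normalization per attribute

ranked = sorted(
    houses,
    key=lambda h: score(houses[h], requirements, weights, scales),
    reverse=True,
)
print(ranked)  # ['house_a', 'house_b']
```

This fits the houses example from the text: no rating history is needed, only explicit requirements and item attributes.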
Netflix, for instance, makes recommendations by comparing the watching and searching habits of similar users (i.e., collaborative filtering) as well as by offering movies that share characteristics with films that a user has rated highly (content-based filtering).

Since a product recommendation engine mainly runs on data, data mining and storage are of primary concern. The data can be collected explicitly and implicitly. Explicit data is information that is provided intentionally, i.e. input from the users such as movie ratings. Implicit data is information that is not provided intentionally but gathered from available data streams like search history, clicks, order history, etc. Data scraping is one of the most useful techniques for mining these types of data from websites.

The type of data plays an important role in deciding the type of storage to be used: a standard SQL database, a NoSQL database, or some kind of object storage. To store large amounts of data, you can use distributed frameworks like Hadoop or Spark, which let you spread data across multiple machines to reduce dependence on any single one. Hadoop uses HDFS to split files into large blocks and distribute them across nodes in a cluster, which means the dataset will be processed faster and more efficiently.

There are multiple ready-made systems and libraries for different languages, from Python to C++. To make a simple recommender system from scratch, the easiest way may be to try your hand at Python's pandas, NumPy, or SciPy.

Of course, recommender systems are the heart of e-commerce. However, the most straightforward way may not be the best, and showing long lines of similar products will not win customers' loyalty.
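As a taste of the from-scratch approach mentioned above, the sketch below ranks items by cosine similarity of their user-rating vectors, written in plain Python to stay dependency-free (pandas or SciPy would shorten it further). All item names and ratings are toy values.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two {user: rating} vectors; the dot product
    runs over co-rating users, the norms over each full vector."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[u] * b[u] for u in common)
    return dot / (sqrt(sum(v * v for v in a.values())) *
                  sqrt(sum(v * v for v in b.values())))

# item -> {user: rating} matrix (toy data)
items = {
    "laptop":   {"ann": 5, "bob": 3, "cat": 4},
    "mouse":    {"ann": 4, "bob": 3},
    "keyboard": {"bob": 5, "cat": 1},
}

# items most similar to "laptop", ranked by similarity
ranked = sorted((i for i in items if i != "laptop"),
                key=lambda i: cosine(items["laptop"], items[i]), reverse=True)
print(ranked[0])  # mouse
```

A real catalog would precompute and cache these similarities rather than rescanning the matrix per query, but the ranking logic is the same.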
The only way to truly engage with customers is to communicate with each as an individual, and advanced, non-traditional techniques, such as deep learning, social learning, and tensor factorization based on machine learning and neural networks, can be a step forward.

A Layman’s Guide to Data Science. Part 3: Data Science Workflow

By now you have already gained enough knowledge and skills in Data Science and have built your first (or even your second and third) project. At this point, it is time to improve your workflow to facilitate the further development process. There is no specific template for solving any data science problem (otherwise you'd see it in the first textbook you come across). Each new dataset and each new problem will lead to a different roadmap. However, there are similar high-level steps in many different projects. In this post, we offer a clean workflow that can be used as a basis for data science projects. Every stage and step in it, of course, can be addressed on its own and can even be implemented by different specialists in larger-scale projects.

Data science workflow

As you already know, at the starting point you're asking questions and trying to get a handle on what data you need. Therefore, think of the problem you are trying to solve. What do you want to learn more about? For now, forget about modeling, evaluation metrics, and other data science-related things. Clearly stating your problem and defining goals are the first step to providing a good solution. Without them, you could lose track in the data-science forest.

In any Data Science project, getting the right kind of data is critical. Before any analysis can be done, you must acquire the relevant data, reformat it into a form that is amenable to computation, and clean it.

Acquire data

The first step in any data science workflow is to acquire the data to analyze. Data can come from a variety of sources.

_Data provenance_: It is important to accurately track provenance, i.e. where each piece of data comes from and whether it is still up-to-date, since data often needs to be re-acquired later to run new experiments. Re-acquisition can be helpful if the original data sources get updated or if researchers want to test alternate hypotheses.
Besides, we can use provenance to trace back downstream analysis errors to the original data sources.

_Data management_: To avoid data duplication and confusion between different versions, it is critical to assign proper names to the data files you create or download and then organize those files into directories. When new versions of those files are created, corresponding names should be assigned to all versions to keep track of their differences. For instance, scientific lab equipment can generate hundreds or thousands of data files that scientists must name and organize before running computational analyses on them.

_Data storage_: With modern, almost limitless access to data, it often happens that there is so much of it that it cannot fit on a hard drive, so it must be stored on remote servers. While cloud services are gaining popularity, a significant amount of data analysis is still done on desktop machines with data sets that fit on modern hard drives (i.e., less than a terabyte).

Reformat and clean data

Raw data is usually not in a convenient format to run an analysis, since it was formatted by somebody else without that analysis in mind. Moreover, raw data often contains semantic errors, missing entries, or inconsistent formatting, so it needs to be “cleaned” prior to analysis. _Data wrangling (munging)_ is the process of cleaning data, putting everything together into one workspace, and making sure your data has no faults in it. It is possible to reformat and clean the data either manually or by writing scripts. Getting all of the values into the correct format can involve stripping characters from strings, converting integers to floats, or many other things. Afterwards, it is necessary to deal with the missing values and null values that are common in sparse matrices. The process of handling them is called _missing data imputation_, where the missing data are replaced with substituted values.
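One simple form of missing data imputation, replacing the gaps with the column mean, can be sketched as follows. The records are invented for the example; in practice a library like pandas offers fillna() for the same job.

```python
def impute_mean(rows, column):
    """Mean imputation: replace missing (None) entries of a column with the
    mean of the observed values in that column."""
    observed = [r[column] for r in rows if r[column] is not None]
    mean = sum(observed) / len(observed)
    for r in rows:
        if r[column] is None:
            r[column] = mean
    return rows

data = [{"age": 25}, {"age": None}, {"age": 35}]
print(impute_mean(data, "age")[1]["age"])  # the gap is filled with the mean, 30.0
```

Mean imputation is only the crudest option; median imputation or model-based imputation preserves more of the distribution when the data is skewed.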
_Data integration_ is a related challenge, since data from all sources needs to be integrated into a central database (e.g., a MySQL relational database) that serves as the master data source for the analyses. It usually consumes a lot of time and cannot be fully automated, but at the same time it can provide insights into the data structure and quality, as well as into the models and analyses that might be optimal to apply.

Explore the data

Here's where you'll start getting summary-level insights into what you're looking at and extracting the large trends. At this step, there are three dimensions to explore: does the data imply supervised or unsupervised learning? Is this a classification problem or a regression problem? Is this a prediction problem or an inference problem? These three sets of questions can offer a lot of guidance when solving your data science problem.

There are many tools that help you understand your data quickly. You can start with checking out the first few rows of the data frame to get an initial impression of the data organization. Automatic tools incorporated in multiple libraries, such as pandas' .describe(), can quickly give you the count, mean, and standard deviation, and you might already see things worth diving deeper into. With this information you'll be able to determine which variable is your target and which features you think are important.

Analysis is the core phase of data science that includes writing, executing, and refining computer programs to analyze and obtain insights from the data prepared at the previous phase. Though there are many programming languages for data science projects, ranging from interpreted “scripting” languages such as Python, Perl, R, and MATLAB to compiled ones such as Java, C, C++, or even Fortran, the workflow for writing analysis software is similar across languages.
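The kind of quick summary that pandas' .describe() produces can be mimicked in a few lines of standard-library Python, which also shows why these numbers are worth a first look. The price list is invented for the example.

```python
from statistics import mean, stdev

def describe(values):
    """A minimal stand-in for pandas' .describe(): count, mean, std, min, max."""
    return {"count": len(values), "mean": mean(values),
            "std": stdev(values), "min": min(values), "max": max(values)}

prices = [12.0, 15.5, 11.0, 90.0, 14.5]
summary = describe(prices)
print(summary["mean"])  # 28.6: the outlier (90.0) pulls the mean far above the bulk of the data
```

A mean that sits far from most of the raw values, as here, is exactly the kind of thing worth diving deeper into before any modeling.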
As you can see, analysis is a repeated _iteration cycle_ of editing scripts or programs, executing them to produce output files, inspecting the output files to gain insights and discover mistakes, debugging, and re-editing.

Baseline Modeling

As a data scientist, you will build a lot of models with a variety of algorithms to perform different tasks. At the first approach to the task, it is worthwhile to avoid advanced, complicated models and stick to simpler, more traditional ones: _linear regression_ for regression problems and _logistic regression_ for classification problems, as a baseline upon which you can improve. At the model preprocessing stage you can separate out features from dependent variables, scale the data, and use a train-test split or cross-validation to prevent overfitting, the problem where a model too closely tracks the training data and doesn't perform well with new data. With the model ready, it can be fitted on the training data and tested by having it predict _y_ values for the _X\_test_ data. Finally, the model is evaluated with the help of metrics appropriate for the task, such as R-squared for regression problems and accuracy or ROC-AUC scores for classification tasks.

Secondary Modeling

Now it is time to go into deeper analysis and, if necessary, use more advanced models, such as neural networks, XGBoost, or Random Forests. It is important to remember that such models can initially render worse results than simple, easy-to-understand models, either because a small dataset cannot provide enough data or because of the collinearity problem, with features providing similar information. Therefore, the key task of the secondary modeling step is parameter tuning. Each algorithm has a set of parameters you can optimize. Parameters are the variables that a machine learning technique uses to adjust to the data.
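A baseline of this kind can be sketched end to end without any libraries: ordinary least squares for a simple linear regression, a train-test split, and an R-squared score on held-out data. The synthetic dataset (y = 2 + 3x plus noise) is an assumption made for the example.

```python
import random

def fit_simple_linear(xs, ys):
    """Ordinary least squares for y = a + b*x, the classic regression baseline."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b  # intercept a, slope b

def r_squared(ys, preds):
    """1 minus the ratio of residual to total sum of squares."""
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - sum(ys) / len(ys)) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

random.seed(0)
x = [i / 10 for i in range(100)]
y = [2 + 3 * v + random.gauss(0, 0.5) for v in x]

# train-test split to guard against overfitting: fit on 80%, score on the rest
x_train, y_train, x_test, y_test = x[:80], y[:80], x[80:], y[80:]
a, b = fit_simple_linear(x_train, y_train)
preds = [a + b * v for v in x_test]
print(round(r_squared(y_test, preds), 2))
```

In practice the same steps are a few lines of scikit-learn, but writing them out once makes clear what the library calls are doing.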
Hyperparameters, the variables that govern the training process itself, such as the number of nodes or hidden layers in a neural network, are tuned by running the whole training job, looking at the aggregate accuracy, and adjusting.

Data scientists frequently alternate between the _analysis_ and _reflection_ phases: whereas the analysis phase focuses on programming, the reflection phase involves thinking and communicating about the outputs of analyses. After inspecting a set of output files, a data scientist, or a group of data scientists, can make comparisons between output variants and explore alternative paths by adjusting script code and/or execution parameters. Much of the data analysis process is trial and error: a scientist runs tests, graphs the output, reruns them, graphs the output, and so on. Therefore, graphs are the central comparison tool and can be displayed side by side on monitors to visually compare and contrast their characteristics. A supplementary tool is taking notes, both physical and digital, to keep track of the line of thought and experiments.

The final phase of data science is disseminating results, either in the form of a data science product or as written reports such as internal memos, slideshow presentations, business/policy white papers, or academic research publications. A _data science product_ implies getting your model into production. In most companies, data scientists will work with the software engineering team to write the production code. The software can be used both to reproduce the experiments or play with the prototype systems and as an independent solution to tackle a known issue on the market, like, for example, assessing the risk of financial fraud. Alternatively to the data product, you can create a data science report. You can showcase your results with a presentation and offer a technical overview of the process.
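The tune-by-adjusting loop for hyperparameters described earlier can be sketched as a plain grid search: train with each candidate value, score on held-out data, keep the best. The tiny 1-D k-nearest-neighbours classifier and its data are illustrative assumptions.

```python
def knn_predict(train, query, k):
    """1-D k-nearest-neighbours: majority label among the k closest points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - query))[:k]
    labels = [lbl for _, lbl in nearest]
    return max(set(labels), key=labels.count)

train = [(0.1, "a"), (0.2, "a"), (0.35, "a"), (0.8, "b"), (0.9, "b"), (0.95, "b")]
val = [(0.3, "a"), (0.85, "b"), (0.15, "a"), (0.7, "b")]

def accuracy(k):
    """Fraction of held-out points the model labels correctly for this k."""
    return sum(knn_predict(train, x, k) == y for x, y in val) / len(val)

# the tuning loop: evaluate each hyperparameter value on the validation set
best_k = max([1, 3, 5], key=accuracy)
print(best_k)
```

The same pattern scales up to real grids over several hyperparameters at once; the only essentials are that scoring happens on data the model never trained on, and that every candidate is scored the same way.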
Remember to keep your audience in mind: go into more detail if presenting to fellow data scientists or focus on the findings if you address the sales team or executives. If your company allows publishing the results, it is also a good opportunity to have feedback from other specialists. Additionally, you can write a blog post and push your code to GitHub so the data science community can learn from your success. Communicating your results is an important part of the scientific process, so this phase should not be overlooked.

A Layman’s Guide to Data Science. Part 2: How to Build a Data Project

Quite often in our blog we explore intricate connections between state-of-the-art technologies or dive into the mesmerizing depths of a new technique. However, AI or data science is not only about bragging of new exciting methods that boost accuracy by 2% (which is a big gain); it is about making data and technology work for you. It will help you increase sales, understand your customers, predict future faults in process lines, make an insightful presentation, submit a term project, or have a good time with your friends working on a new idea that will change the world. And in this sense, everyone can, and to some extent should, become a data scientist. We already discussed what makes a good data scientist and what you should learn before you set out on a real project. In this post, we'll walk you through the process of building a backbone data project in simple steps.

Find a story behind an idea

You have an excellent idea in your head: the one you have cherished since you were a child about having a toy-cleaning robot, or the one that just came into your mind about engaging the customers in your shop by sending them fortune cookies with predictions based on their purchase preferences. However, to make your idea work you need the attention of others. Find a compelling narrative for it; make sure that it has a hook or a captivating purpose, and that it is up-to-date and relevant. Finding the narrative structure will help you decide whether you actually have a story to tell. Such a narrative will be the basis for your business model. Ask yourself: What is it that you develop, what resources do you need, and what value do you provide to the customer? What value are customers going to pay for? A nice way to do this is the business model canvas. It's simple and cheap; you can create it on a sheet of paper.

Prepare the data

The first practical step is collecting data to fuel your project.
Depending on your field and goals, you can search for ready-made datasets available on the Internet, such as, for example, this collection. You can choose to scrape data from websites or access data from social networks through public APIs. For the latter option, you need to write a small program that can download data from social networks, in whatever programming language you feel most comfortable with. For the cloud option, you can spin up a simple AWS EC2 Linux instance (nano or micro) and run your software on it. A simple way to store the data is the .csv format, with each line including the text and meta information, such as the person, timestamp, replies, and likes.

As to the amount of data needed, the rule of thumb is to get as much data as possible in a reasonable time, for example, a few days of running your program. Another important consideration is to collect only as much data as the machine you are using for analytics can handle. How much data to get is not an exact science; it depends on the technical limitations and the question you have. Finally, in collecting and managing data it is crucial to be devoid of bias and not be selective about the inclusion or exclusion of data. This selectivity includes using discrete values when the data is continuous; how you deal with missing, outlier, and out-of-range values; arbitrary temporal ranges; capped values, volumes, ranges, and intervals. Even if the analysis is meant to argue or influence, it should be based upon what the data says, not what you want it to say.

Choose the right tools

To perform a valid analysis, you need to find the proper tools. After getting the data, select the tool you will use to explore it: write down a list of analytics features you think you need and compare the available tools. Sometimes you can use user-friendly graphical tools like Orange, RapidMiner, or KNIME. In other cases, you'll have to write the analysis on your own in a language such as Python or R.
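Storing collected records in a .csv with the text and meta information, as suggested above, might look like this sketch. The field names and records are hypothetical, and an in-memory buffer stands in for a real file so the round trip can be shown in one snippet.

```python
import csv
import io

# hypothetical collected records: the text plus meta information
records = [
    {"text": "Loved the keynote!", "person": "ann",
     "timestamp": 1700000000, "replies": 2, "likes": 10},
    {"text": "Slides available?", "person": "bob",
     "timestamp": 1700000100, "replies": 0, "likes": 3},
]

buf = io.StringIO()  # swap for open("posts.csv", "w", newline="") to write a real file
writer = csv.DictWriter(buf, fieldnames=["text", "person", "timestamp", "replies", "likes"])
writer.writeheader()
writer.writerows(records)

# read it back to confirm the round trip
rows = list(csv.DictReader(io.StringIO(buf.getvalue())))
print(len(rows), rows[0]["person"])  # 2 ann
```

One caveat of the plain .csv approach: everything comes back as a string, so numeric fields like likes need to be cast on read.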
Prove your theory

With the data and tools available, you can prove your theory. In Data Science, theories are statements of how the world should be or is, and are derived from axioms, which are assumptions about the world, or from precedent theories (Das, 2013). Models are implementations of a theory; in data science, they are often algorithms based on theories that are run on data. The results of running a model lead to a deeper understanding of the world based on theory, model, and data.

To assess your theory at an initial step, in line with more general and conventional content analysis, you can pinpoint trends present in the data. One way we use quite a lot is to select significant events that have been reported, and then try to create an analytics process that finds these trends. If the analytics can find the trends you specified, you are on the right track. Then look for instances where the analytics finds new trends, and confirm them, for instance by searching the Internet. The results are not going to be reliable 100% of the time, so you'll need to decide how many falsely reported trends (the error rate) you want to tolerate.

Build a minimum viable product

When you have your business model and a proven theory, it is time to build the first version of your product, the so-called minimum viable product (MVP). Basically, this is the first version that you offer to customers. As an MVP is a product with just enough features to satisfy early customers and to provide feedback for future development, it should focus only on the core functionality, without any fancy solutions. Stick to simple functions that will work in the beginning and expand your system later. At this stage, the system could look something like this:

Automate your system

In principle, your focus should be on the future development of your product, not on system operation.
For this, you need to automate as much as possible: uploading to S3, starting the analysis, or storing data. In this article we discussed automation in more detail.

The other face of automation is logging. When everything is automated, you can feel that you are losing control over your system and do not know how it performs. Besides, you need to know what to develop next, both in terms of new features and of fixing problems. For this, set up a system for logging, monitoring, and measuring all meaningful data. For instance, you should log statistics for the download of your data or the upload to S3, the duration of the analytics process, and the users' behavior. There are multiple tools to help you log server statistics like CPU, RAM, network, code-level performance, and error monitoring, many of them with a user-friendly interface.

Reiterate and expand

You probably know that AI, Machine Learning, Data Science, and other new developments are all about reiteration and fine-tuning. So, when you have your MVP running and automation and monitoring in place, you can start enhancing your system. It is time to get rid of weaknesses, optimize overall performance and stability, and add new functions. Implementing new features will also allow you to offer new services or products.

Present your product

Finally, when your product is ready, you need to present it to the customers. This is where the story behind your data and your business model come to help. First of all, think about your target audience. Who are your customers and how are you going to sell your product to them? What does the audience you are going to present your product to know about the topic? The story needs to be framed around the level of information the audience already has, correct and incorrect:

Soft skills detection

The new world we live in gives us more help, and more doubts. Machines stand behind everything, and the scope of this everything is only growing. To what extent can we trust such machines? We are used to relying on them in market trends, traffic management, and maybe even in healthcare. Machines are now analysts, medical assistants, secretaries, and teachers. Are they reliable enough to work as HR specialists? Psychologists? What can they tell about us? Let's see how text analysis can assess your soft skills and tell a potential employer whether you can join the team smoothly.

In this project, we used text analysis techniques to analyze the soft skills of young men (aged 15–24) looking for career opportunities. What we had in mind was to perform a number of tests, or to choose the most effective one, to determine ground truth values. The tests we were experimenting with included:

A Layman’s Guide to Data Science: How to Become a (Good) Data Scientist

Sometimes, when you hear data scientists rattle off a dozen algorithms while discussing their experiments or go into the details of Tensorflow usage, you might think that there is no way a layman can master Data Science. Big Data looks like another mystery of the Universe that will be shut up in an ivory tower with a handful of present-day alchemists and magicians. At the same time, you hear about the urgent necessity to become data-driven from everywhere.

The trick is, we used to have only limited and well-structured data. Now, with the global Internet, we are swimming in never-ending flows of structured, unstructured, and semi-structured data. This gives us more power to understand industrial, commercial, or social processes, but at the same time it requires new tools and technologies. Data Science is merely a 21st-century extension of the mathematics that people have been doing for centuries. In its essence, it is the same skill of using available information to gain insight and improve processes. Whether it's a small Excel spreadsheet or 100 million records in a database, the goal is always the same: to find value. What makes Data Science different from traditional statistics is that it tries not only to explain values, but to predict future trends. In other words, we use Data Science for:

Data Science is a newly developed blend of machine learning algorithms, statistics, business intelligence, and programming. This blend helps us reveal hidden patterns in raw data, which in turn provides insights into business and manufacturing processes. To go into Data Science, you need the skills of a business analyst, a statistician, a programmer, and a Machine Learning developer. Luckily, for the first dive into the world of data, you do not need to be an expert in any of these fields. Let's see what you need and how you can teach yourself the necessary minimum.
When we first look at Data Science and Business Intelligence, we see the similarity: they both focus on “data” to provide favorable outcomes, and they both offer reliable decision-support systems. The difference is that while BI works with static and structured data, Data Science can handle high-speed and complex, multi-structured data from a wide variety of data sources. From the practical perspective, BI helps interpret past data for reporting (Descriptive Analytics), while Data Science analyzes past data to make future predictions (Predictive or Prescriptive Analytics). Theories aside, to start a simple Data Science project you do not need to be an expert Business Analyst. What you need is a clear idea of the following points:

Analytical Mindset: it is a general requirement for any person working with data. However, while common sense might suffice at the entry level, your analytical thinking should be further backed up by a statistical background and knowledge of data structures and machine learning algorithms.

Focus on Problem Solving: when you master a new technology, it is tempting to use it everywhere. However, while it is important to know recent trends and tools, the goal of Data Science is to solve specific problems by extracting knowledge from data. A good data scientist first understands the problem, then defines the requirements for the solution, and only then decides which tools and techniques are best fit for the task. Don't forget that stakeholders will never be captivated by the impressive tools you use, only by the effectiveness of your solution.

Domain Knowledge: data scientists need to understand the business problem and choose the appropriate model for it. They should be able to interpret the results of their models and iterate quickly to arrive at the final model. They need to have an eye for detail.
Communication Skills: there’s a lot of communication involved in understanding the problem and delivering constant feedback in simple language to the stakeholders. But this is just the surface of the importance of communication — a much more important element of this is asking the right questions. Besides, data scientists should be able to clearly document their approach so that it is easy for someone else to build on that work and, vice versa, understand research work published in their area. As you can see, it is the combination of various technical and soft skills that make up a good data scientist.

Best Libraries and Platforms for Data Visualization

In one of our previous posts we discussed data visualization and the techniques used both in regular projects and in Big Data analysis. However, knowing the plot types does not let you go beyond a theoretical understanding of what tool to apply to certain data. With the abundance of techniques, the data visualization world can overwhelm the newcomer. Here we have collected some of the best data visualization libraries and platforms.

Though all of the most popular languages in Data Science have built-in functions to create standard plots, building a custom plot usually requires more effort. To address the necessity of plotting versatile formats and types of data, some of the most effective libraries for popular Data Science languages include the following.

The R language provides numerous opportunities for data visualization, with around 12,500 packages in the CRAN repository of R packages. This means there are packages for practically any data visualization task, regardless of the discipline. However, if we had to choose several that suit most of the tasks, we'd select the following:

ggplot2 is based on _The Grammar of Graphics_, a system for understanding graphics as composed of various layers that together create a complete plot. Its powerful model of graphics simplifies building complex multi-layered graphics. Besides, the flexibility it offers allows you, for example, to start building your plot with axes, then add points, then a line, a confidence interval, and so on. Though ggplot2 is slower than base R and rather difficult to master, it pays huge dividends for any data scientist working in R.

Lattice is a system of plotting inspired by Trellis graphics. It helps visualize multi-variate data, creating tiled panels of plots to compare different values or subgroups of a given variable. Lattice is built using the grid package for its underlying implementation, and it inherits many of grid's features.
Therefore, the logic of Lattice should feel familiar to many R users, making it easier to work with.

The rgl package is used to create interactive 3D plots. Like Lattice, it's inspired by the grid package, though it's not compatible with it. rgl features a variety of 3D shapes to choose from, lighting effects, various “materials” for the objects, as well as the ability to make an animation.

The Python Package Index has libraries for practically every data visualization need; however, the most popular ones, offering the broadest range of functionalities, are the following:

Matplotlib is the first Python data visualization library and the most widely used library for generating simple and powerful visualizations in the Python community. It allows building a wide range of graphs, from histograms to heat maps to line plots. Matplotlib is also the basis for many other libraries designed to work in conjunction with it for analysis: such “wrapper” libraries give access to a number of Matplotlib's methods with less code. An example of a popular library built on top of Matplotlib is Seaborn. Seaborn's default styles and color palettes are much more sophisticated than Matplotlib's. Beyond that, Seaborn is a higher-level library, so it is easier to generate certain kinds of plots, including heat maps, time series, and violin plots.

Similar to the ggplot library for R, Bokeh is based on The Grammar of Graphics. It supports streaming and real-time data. Unlike the majority of other data visualization libraries, Bokeh can create interactive, web-ready plots, which can easily be output as JSON objects, HTML documents, or interactive web applications. Bokeh has three interfaces with varying degrees of control, to accommodate different types of users, from those wishing to create simple charts quickly to developers and engineers who wish to define every element of the chart.
Python and R remain the leading languages for rapid data analysis; however, Scala is becoming the key language in the development of functional products that work with big data, as the latter need stability, flexibility, high speed, scalability, etc.

Probably the most functional Scala library for data visualization, Vegas allows plotting specifications such as filtering, transformations, and aggregations. It is similar in structure to Python's Bokeh and Plotly. Vegas provides declarative visualization, so that the user can focus on specifying what needs to be done with the data, without having to worry about the code implementation.

Breeze-viz is based on the prominent Java charting library JFreeChart and has a MATLAB-like syntax. Although Breeze-viz offers far fewer capabilities than MATLAB, matplotlib in Python, or R, it is still quite helpful in the process of developing and establishing new models.

Javascript may not be among the languages adopted for Data Science, but it offers vast opportunities for data visualization, and many libraries for other languages are actually wrappers for JS packages.

D3 is called the mother of all visualization libraries, since it is the basis for many of them. Being the oldest, it remains the most popular and extensive Javascript data visualization library. It uses web standards and is framework-agnostic, working smoothly with any Javascript framework. D3 is built for manipulating documents based on data and bringing data to life using HTML, SVG, and CSS. D3's emphasis on web standards gives you the capabilities of modern browsers without coupling to a proprietary framework, combining visualization components and a data-driven approach to DOM manipulation.

Chart.js is a lightweight library that has fully responsive charts including Bar, Bubble, Doughnut, Line, PolarArea, Radar, and Scatter. It is an open-source library based on HTML5. V.2 provides mixed chart types, new chart axis types, and beautiful animations.
Designs are simple and elegant, with 8 basic chart types, and you can combine the library with moment.js for time axes.

At a certain time, especially at the beginning of a project, it is important to generate a lot of charts quickly in order to explore the scope, depth, and texture of the data and find interesting stories to develop further. There are quite a lot of online platforms to generate data visualization.

Plotly is an online platform for data visualization that, among other things, can be accessed from an R or Python notebook. It is an advanced, online data visualization program with a colorful design. Its forte is making interactive plots, and it offers some charts you won't find in most packages, like contour plots, candlestick charts, and 3D charts. You can use the chart studio to create web-based reporting templates. You can also modify your own dashboards and interactive graphs for your collaborators to comprehend better.

Tableau is a business intelligence system that takes a new approach to data analysis. Tableau lets you create charts, graphs, maps, and many other graphics. A big advantage of Tableau is the availability of several versions: desktop, server, and cloud. You can create and publish dashboards, share them with colleagues, and analyze data using different methods. We recommend it for its simplified drag-and-drop system, round-the-clock technical support, and flexible package fees.

Of course, this is just a small fraction of all the platforms, tools, and libraries available for you to visualize your data in the most effective and transparent way. The data itself, as well as the project goals, be it scientific analysis, business intelligence, or creating a website that should incorporate some charts, will suggest the approach, or, most usually, a combination of approaches, from quick online plotting to base functions and specialized packages.