What the Tech?

Primer: Demystifying Data Science

May 06, 2020  | by Levon Paradzhanyan

This article was originally published in The New Stack on March 26, 2020

Artificial intelligence entered our lives many years ago, first as science fiction and today embedded in real products. It has since been followed by newer buzzwords such as data science, machine learning, and deep learning. Yet there are many misconceptions about these terms. Most people think they mean the same thing. Even developers who are starting their journey in this field struggle to understand the difference between them.

In part one of this primer series, we will focus on data science, the discipline all these other methodologies belong to. Over the course of these articles, we’ll explore each technology and the fundamental asset they have in common — data.

Humanity’s First Attempts to Create Human-Like Abilities

For centuries, bright minds have invented new technologies, bringing automation to activities previously performed manually and enabling new possibilities. While these innovations lacked human-like intelligence, there are many examples in ancient Greek, Chinese, Jewish and other nations’ histories of the “automaton,” a self-operating “intelligent” machine designed to automatically perform a predetermined sequence of operations. Today, we call them “robots,” and their capabilities keep expanding.

But robots aren’t the only intelligent technology. Everything is getting smarter. For instance, social media and audio or video streaming services, such as Facebook, Spotify and Netflix, recommend content that matches the user’s preferences. Similarly, companies are increasingly processing data gathered about consumer behavior in order to improve the services they offer, detect and prevent fraud, or provide new services. And, finally, Tesla has developed self-driving, intelligent cars that keep improving autonomously.

Data is fueling all this.

Your Data, Your Goldmine

Think about all the data that companies and organizations are generating or collecting these days. The amount and, most importantly, the structure of it are sometimes unimaginably big and complex. Hence the term “Big Data.”

Big data is promoted as a source of business-changing analytics and insights. Unfortunately, much of that is mere hype. The bottom line is that big data is just another term for a large amount of data. In fact, at EastBanc Technologies, we’ve been advocating for years to start with small sets of data, an approach we’ve coined a Minimal Viable Prediction. The accuracy, quality and completeness of the data are what really matter, along with a good approach to analysis.

A few notable industries where this is relevant include healthcare, education, media and entertainment, government, transportation, and banking.

Using the power of data, these industries can uncover trends, make predictions, and improve decision-making. Data can help public health organizations understand what preventative measures they should take to minimize the risk of epidemic outbreaks. It can help guide a student towards the most suitable career based on his or her interests, strengths, and academic results. Data can help predict where traffic congestion will occur due to public events, holidays, accidents, or news alerts.

But data may also hide patterns and insights that no one has thought to look for yet, ones that could be valuable for a company’s business intelligence and operations. For instance, data may reveal new patterns that fraudsters use to steal money from an online banking system. It can also show that if one person switches their cell phone provider, their friends are likely to follow suit, or that men who buy diapers are more likely to buy beer. Or, perhaps, that individuals seeking a business loan who complete their application form in the correct case are more dependable debtors.
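As a rough illustration of how a co-purchase pattern like the diapers-and-beer example might be measured, the sketch below computes "lift," a common association metric. The baskets are invented toy data and the function name `lift` is our own choice for this example; it is not taken from any particular study or library.

```python
# Hypothetical toy shopping baskets, invented purely for illustration.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"diapers", "beer", "milk"},
]


def lift(baskets, a, b):
    """Lift = P(a and b) / (P(a) * P(b)).

    A value above 1 suggests the two items appear together more often
    than they would if purchases were independent.
    Assumes both items occur in at least one basket.
    """
    n = len(baskets)
    p_a = sum(a in basket for basket in baskets) / n
    p_b = sum(b in basket for basket in baskets) / n
    p_ab = sum(a in basket and b in basket for basket in baskets) / n
    return p_ab / (p_a * p_b)


diaper_beer_lift = lift(baskets, "diapers", "beer")
print(f"lift(diapers, beer) = {diaper_beer_lift:.3f}")
```

A lift only slightly above 1 on six baskets proves nothing, of course; the point is that "hidden patterns" come down to measurable quantities that can be checked against real data.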

These are all real-life examples. So, what do we have here? Data is the foundation of any business. That’s why understanding it, finding hidden correlations and new patterns, and discovering new insights can positively impact business success.

The challenge is that today most of an organization’s data is unstructured and scattered across different sources, such as analytical software, log files, legacy systems, cloud-based services, third-party enterprise solutions, and much more. There are no universal tools that can intelligently determine the relationships between these sources and process their data in huge volumes.

This is where data science comes into play.

What Is Data Science?

A data scientist is a person who can easily juggle many different fields and disciplines, a symbiosis of Sherlock Holmes, business analyst, and software developer. Indeed, a wide variety of skills and knowledge, plus the ability to dive into very different industries, analyze them, and create predictive models, are essential prerequisites for any data scientist.

A data scientist is responsible for navigating the data science lifecycle, which looks like this:

[Figure: data science lifecycle]

Let’s break down each of these phases:

  • Discovery. This involves understanding and analyzing business problems or objectives.
  • Data Preparation and Mining. Here the data scientist collects data from different sources, cleans it, and transforms it into a format suitable for machine learning algorithms. He or she then creates new features (individual measurable properties) by making assumptions or examining the hidden patterns in the data. This is probably the most important and time-consuming part of the work.
  • Modeling. Now the data scientist trains machine learning models and evaluates and validates performance (accuracy).
  • Visualize and Communicate the Results. Results are demonstrated using diagrams and other methods to clearly communicate and explain findings in a digestible way.
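
To make the phases above more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the loan records are invented, the debt-to-income feature is our own choice, and the one-threshold "model" stands in for a real machine learning algorithm. The discovery phase is represented only by the question in the opening comment.

```python
# Discovery (assumed business question): can we flag loans likely to be repaid?

# Hypothetical raw records, as if merged from several sources; one is incomplete.
raw = [
    {"income": "52000", "debt": "12000", "repaid": "yes"},
    {"income": "48000", "debt": "30000", "repaid": "no"},
    {"income": "", "debt": "5000", "repaid": "yes"},
    {"income": "75000", "debt": "15000", "repaid": "yes"},
    {"income": "31000", "debt": "22000", "repaid": "no"},
    {"income": "64000", "debt": "20000", "repaid": "yes"},
    {"income": "40000", "debt": "26000", "repaid": "no"},
    {"income": "58000", "debt": "9000", "repaid": "yes"},
    {"income": "36000", "debt": "27000", "repaid": "no"},
]


def prepare(records):
    """Data preparation: drop incomplete rows, cast types, derive one feature."""
    rows = []
    for r in records:
        if not r.get("income") or not r.get("debt") or r.get("repaid") not in ("yes", "no"):
            continue  # cleaning: skip records with missing or invalid fields
        ratio = float(r["debt"]) / float(r["income"])  # engineered feature
        rows.append((ratio, r["repaid"] == "yes"))
    return rows


def accuracy(rows, cut):
    """Share of rows where the rule 'ratio <= cut means repaid' matches the label."""
    return sum((ratio <= cut) == repaid for ratio, repaid in rows) / len(rows)


def train_threshold(rows):
    """Modeling: choose the cutoff that best separates the training labels."""
    return max((ratio for ratio, _ in rows), key=lambda cut: accuracy(rows, cut))


rows = prepare(raw)
train, test = rows[:6], rows[6:]  # hold out the last rows for validation
cut = train_threshold(train)

# Communicate: a one-line summary stands in for charts and reports.
print(f"rows kept: {len(rows)}, cutoff: {cut:.4f}, "
      f"held-out accuracy: {accuracy(test, cut):.0%}")
```

Even in this toy version, most of the code is spent on preparing data rather than on the model, which mirrors the point that preparation is the most time-consuming phase.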

Data scientists can be involved in all manner of business use cases. For example, they can help a marketing department improve the marketing techniques for retailers based on customers’ shopping and wish lists. A data scientist can also help optimize the process of price formation by analyzing consumer purchasing power, competitor price offers, sales history, popularity of the product, and more. They can also reveal hidden patterns in public trials by analyzing thousands of different factors and properties.

Data Scientist Versus Machine Learning Engineer

In some cases, the same person performs every step in the data science lifecycle, but it doesn’t have to be this way. In many cases, a Machine Learning Engineer (ML Engineer) will take over the steps of engineering the model. This typically occurs when the data scientist lacks the programming skills required for full-stack development of the solution.

A data scientist and ML engineer have two distinct roles:

The data scientist must fully understand the business needs and the data, find the right approach to the problem, and help build and verify the results.

The ML engineer isn’t required to understand the algorithms and the overall science behind the solution, but may participate in building the Machine Learning Model (ML Model) using specialized technologies and programming frameworks, and in ensuring that the data is gathered from data pipelines, cleansed, and suited to the ML Model.

In this way, machine learning has enabled the function of traditional data analysis to grow into data science.

In summary, data has huge potential to uncover unknown correlations and patterns, which can improve decision-making and ultimately business outcomes. Most companies today have accumulated massive amounts of data, but it’s mostly unstructured and siloed. With no universal tool to determine relationships and effectively process the data, uncovering those trends is no easy task. That’s where data science comes in.

Data science can unify and process data to uncover hidden patterns and build predictive and prescriptive analytics tools for better decision-making. In our next article, we’ll examine machine learning and how it compares with data science.