This article was originally published in KMWorld
on March 18, 2020
Unstructured text is found in many, if not all, business functions and can be a source of valuable insight. Product reviews can reveal
your customers’ preferences, customer support chats can identify weak points in your service, and the information locked in unstructured
news articles can enhance the work of your analyst team. However, all these examples gain real power when applied to data at scale, which creates
a critical need for unstructured text analysis and exploration tools.
What is unstructured text?
We define “unstructured text data” as human-written text in all its forms: news or scientific articles, messages, blog posts, books,
product reviews, chats, and so on. While this data holds highly valuable information for decision making, it cannot be leveraged directly by
analytical methods or computers, because they require a numerical representation. Obtaining a numerical representation of text is therefore
the first step of every text analysis method. This step, known as word embedding, is part of natural language processing (NLP), and it starts
with a multidimensional space where each word is represented by its own dimension.
The objective of the embedding process is to shrink this space to a much smaller one, say 300 dimensions, where each word is
represented by a value in each dimension. These 300 values together form a vector for each word. The process is designed so that similar,
or in some way related, words are represented by vectors that are close together, while unrelated words are represented by vectors that are far
apart. Each text then becomes a set of 300-dimensional vectors, and a whole variety of methods can be applied: classifying texts
by content or style, extracting keywords, generating topics and summaries, analyzing changes over time, or generating new content from
a starting piece.
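The closeness of word vectors is typically measured with cosine similarity. Here is a minimal sketch using made-up 3-dimensional vectors (real embeddings have hundreds of dimensions, and the words and values here are illustrative, not from a trained model):

```python
import math

# Toy "embeddings": the vectors below are invented for illustration only.
# A trained model would produce ~300-dimensional vectors per word.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "apple": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(vectors["king"], vectors["queen"]))  # close to 1
print(cosine_similarity(vectors["king"], vectors["apple"]))  # much smaller
```

Related words (“king” and “queen”) score near 1, while unrelated words score much lower, which is exactly the property downstream text analysis methods rely on.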
More recently, spoken language analysis has become a very popular area of text analysis, powering, for example, audio-based virtual assistant systems
and the analysis of customer service phone calls. The process starts by converting audio recordings to text with speech-to-text recognition
algorithms, then applies text embedding and text analysis methods. However, spoken language tends to have less structured
sentences, may jump between topics, and can contain errors introduced by the speech-to-text step. It may also contain dialogs
in which each party plays a specific and important role for the analysis, as in a conversation between a service center agent and a client. In addition,
meaningful gestures are not reflected in words and are therefore lost to text analysis. While spoken language analysis presents challenges,
it is a rapidly developing field, thanks in part to industry leaders like Google and Amazon competing for audio-based virtual assistant solutions.
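The pipeline above can be sketched as follows. The transcribe() function here is a stub standing in for a real speech-to-text service, and its speaker labels are an assumption about what such a service would return; the point is how a transcribed dialog is separated by role before analysis:

```python
# Minimal sketch of a spoken-language analysis pipeline.
# transcribe() is a stand-in for a real speech-to-text engine; the
# file name, speaker labels, and utterances are hypothetical.

def transcribe(audio_path):
    # A real implementation would call a speech-to-text API here.
    return [
        ("agent", "Thank you for calling, how can I help?"),
        ("client", "My order arrived damaged and I want a refund."),
        ("agent", "I am sorry to hear that, let me process a refund."),
    ]

def split_by_role(turns):
    """Group a transcribed dialog by speaker role for downstream analysis."""
    by_role = {}
    for role, text in turns:
        by_role.setdefault(role, []).append(text)
    return by_role

dialog = split_by_role(transcribe("call.wav"))
print(dialog["client"])  # the client's side of the conversation
```

Separating the agent's and the client's utterances lets each side be embedded and analyzed with its own models, reflecting the different roles each party plays.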
Advancements in text analytics
Many of the advancements in text analysis over the last decade are due to improvements in, and the adoption of, artificial neural networks (NNs).
Artificial NNs, or connectionist systems, are computing systems (sets of algorithms), modeled loosely on the human brain, that are designed to
recognize patterns. Such systems “learn” to perform tasks by considering examples, and they have proven to be a very effective tool for text classification,
emotion detection, enrichment with additional information, discovering relations between linguistic entities, and even generating novel content.
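To make “learning from examples” concrete, here is a deliberately tiny sketch: a single neuron (logistic regression, the simplest building block of a neural network) trained on a few invented toy reviews to classify sentiment. Real systems use deep networks over learned embeddings; this only illustrates the principle:

```python
import math

# Invented toy training data: (text, label) with 1 = positive, 0 = negative.
train = [
    ("great product love it", 1),
    ("excellent service very happy", 1),
    ("terrible quality waste of money", 0),
    ("awful experience very disappointed", 0),
]

vocab = sorted({w for text, _ in train for w in text.split()})

def featurize(text):
    """Bag-of-words vector: count of each vocabulary word in the text."""
    words = text.split()
    return [words.count(w) for w in vocab]

weights = [0.0] * len(vocab)
bias = 0.0
for _ in range(200):                       # gradient-descent epochs
    for text, label in train:
        x = featurize(text)
        z = sum(w * xi for w, xi in zip(weights, x)) + bias
        pred = 1 / (1 + math.exp(-z))      # sigmoid activation
        err = pred - label
        weights = [w - 0.1 * err * xi for w, xi in zip(weights, x)]
        bias -= 0.1 * err

def classify(text):
    z = sum(w * xi for w, xi in zip(weights, featurize(text))) + bias
    return 1 if z > 0 else 0

print(classify("love this great service"))   # expected: 1 (positive)
print(classify("terrible awful quality"))    # expected: 0 (negative)
```

The network is never told which words are positive; it infers that from the labeled examples, which is the same principle, at vastly larger scale, behind the NN applications listed above.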
By combining several NLP, NN, and other models, we take a step closer to an AI system, for example, one that can also make suggestions based on
history or empirical experience. In one of our recent efforts, we created an AI system that used a combination of NLP and NN models to analyze
customer interactions and condense the data into an easy-to-read summary. This summary helps the company work more efficiently with its customers,
ultimately gaining a better understanding of each customer’s needs. To enhance the AI system even further, we examined a set of text classification
models that increase the company’s ability to recognize comparable situations. Such a component helps ensure the successful and efficient adoption of the
most effective solutions.
In another example, we had access to huge amounts of business texts and news articles. We used these datasets to extract possible relations
between companies. The solution combines several NLP methods with statistical and NN models and uses data from various sources. In the end, we had built
a comprehensive and effective AI solution that reduces the time spent reading business and news articles.
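The article does not detail that solution, but a common starting baseline for relation extraction is sentence-level co-occurrence counting: companies named in the same sentence are candidates for a relation. The company names and text below are invented for illustration:

```python
import itertools
import re

# Hypothetical company list and news snippet; a real system would add
# named-entity recognition and NN models on top of this baseline.
companies = ["Acme Corp", "Globex", "Initech"]

text = (
    "Acme Corp announced a partnership with Globex on Monday. "
    "Initech reported record earnings. "
    "Globex and Acme Corp will jointly develop new products."
)

cooccurrence = {}
for sentence in re.split(r"(?<=[.!?])\s+", text):
    present = [c for c in companies if c in sentence]
    for a, b in itertools.combinations(sorted(present), 2):
        cooccurrence[(a, b)] = cooccurrence.get((a, b), 0) + 1

print(cooccurrence)  # ('Acme Corp', 'Globex') co-occur in two sentences
```

Pairs with high co-occurrence counts become candidate relations for an analyst, or for more sophisticated models, to verify, which is one way such a system cuts down reading time.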
It has become clear that unstructured text data analysis has a very wide range of applications and has been adopted successfully by many companies.
But often these applications and the solutions they require carry a high development and adoption cost, and are therefore not easily embraced by small
and medium businesses. This is partly because the required technologies do not exist today as ready-to-use, out-of-the-box solutions,
but rather as components. However, the industry may see rapid change on this front from leading vendors, including AWS and Microsoft, whose
machine learning development platforms, such as AWS SageMaker and Microsoft Azure ML Studio, will help companies harness the power
of text analysis and AI.
MVP: Minimal valuable prediction guidelines
Despite these technical obstacles, the most difficult part is often not technical, but rather finding a way to formulate business problems as
mathematical problems. This is a difficult and very creative task. Contrary to widespread belief, it does not require the magical powers of the
most experienced data scientist. It requires a tedious, step-by-step, trial-and-error process executed with patience and a bit of meticulousness.
There is one approach that can help in this process: MVP.
MVP here stands for minimal valuable prediction, the well-known minimum viable product concept applied to data science. It aims to simplify the difficult
task of breaking a novel text analysis idea into actionable steps with quality control and progress metrics. The idea is to focus on delivering a
minimal valuable prediction as fast as you can and iterate from there. For a text analysis project, that means taking the following steps:
Focus only on the most pressing problem you want to solve.
Find the most useful text data for your problem and pay attention to its quality. You do not want to start with data that is difficult to obtain or clean.
Then work “backwards” as you identify other related datasets, essentially following the data breadcrumbs that lead to actionable outcomes.
By following the MVP guidelines and combining NLP methods with NN and statistical models, we can leverage our unstructured text data in the best
way while minimizing financial risk.
In the coming years, we expect that the growing experience of the data science community, combined with the growing number of tools from leading
solution providers, will allow more and more small and medium businesses to add artificial intelligence to their toolsets.