One of the hardest things for me to grasp when embarking on a career in NLP (Natural Language Processing) and ML (Machine Learning) was defining what exactly NLP is and how it differs from Machine Learning. Or, more precisely: where does NLP fall in the ML landscape? I have read a lot about the history of NLP, from its humble beginnings in the 1950s all the way up to the recent groundbreaking innovations from Google (text summarization, language translation, etc.), but it never became quite clear exactly where the line was between what aspects of these innovations are achieved through “NLP” (and its associated techniques) and what is achieved through Machine Learning. This article hopes to shed some light on this mystery.
NLP and its associated techniques have been around, and used extensively, for decades. Even before Google arrived in the 1990s, NLP was being used in industry in a very real way. Among many others, the predominant use cases for NLP were:
- Search: Search engines within industry (e.g. patent searches, medical research, lawyers using LexisNexis, Westlaw, etc.)
- Document Similarity: (e.g. patent duplicate matching, legal case precedent, plagiarism detection, etc.)
In a nutshell, NLP includes:
- A set of techniques to format and represent raw text (say, documents) into useful data structures so that these documents can be used for these text processing use cases.
- The associated algorithms using the structured data as part of an implemented solution to these text processing use cases.
So, up until the 1990s, the two bullets above were responsible for the predominant solutions to Search and Document Similarity problems. It is in the second bullet, the “algorithms” portion of the solution, where Machine Learning (ML) has encroached significantly on NLP’s share of the overall solution to these use cases. To this day, however, the formatting/structuring/quantizing of data falls squarely in the NLP domain. At this point in history, NLP is often thought of as only that first bullet (and has been rebranded “Feature Engineering”), while Machine Learning has taken over the algorithm portion of the equation.
Below is a diagram to depict this relationship:
The above diagram is an oversimplification of how NLP interacts with ML. In reality, the text processing solutions that have been attempted and solved with pure NLP-based techniques since the 1950s also include most of the items shown on the Machine Learning side of the fence (e.g. duplicate detection, classification, etc.). But due to factors such as time and expense (slow computers take a long time to compute), accuracy, and others, industry never fully embraced those earlier solutions to even a fraction of the extent of the ML-based solutions today.
The primary innovation these ML approaches introduced was the use of statistical modelling and probability in the solution of text processing problems. Before ML was applied to NLP problems, probability models played very little role in the solution. The situation is not dissimilar to Heisenberg’s Uncertainty Principle in physics: we don’t really understand how things work at the quantum scale, but we can nevertheless use probability and statistics to model it accurately and thereby produce real-life innovations. Although not fully explainable, and acting only as a “proxy” or “approximation” to the fundamental understanding of language attempted in pure NLP approaches, the probabilistic solutions coming from ML turn out to provide far better and more accurate results for these problems than pure NLP ones (at least at this point in time).
In fact, similar to Einstein’s reluctance to believe in the probabilistic nature of quantum mechanics, some feel that ML is a short-sighted solution. Whether or not you agree, there is no question that ML produces real-life innovations for text processing problems that work, regardless of the methods employed to get us there. The famous quote in data science is “All models are wrong. But some are useful.” More recently, with such notions as confirmation bias and other controversial byproducts following ML’s success, an addendum has emerged: “All models are wrong. Some are useful. And some are even ethical.”
Although there is much to be learned from the many NLP-exclusive solutions employed (and I plan on writing a blog post on some of these techniques), and indeed many of the underlying principles inside the ML algorithms themselves contain NLP underpinnings (e.g. VSM distance/similarity measurements, etc.), this article is primarily focused on NLP with respect to its usage in Machine Learning. Be mindful, however, that NLP encroaches on many other disciplines such as graphs, search engines, and others.
NLP is a broad topic with extensive roots in academia and a large set of “first principles” underlying it. Courses and degrees can be obtained in this topic. This document created by Stanford University is an excellent and very thorough primer on many of these “first principles” and will aid in understanding just how broad and extensive NLP is.
So, in a broad sense, NLP is simply a set of techniques that plugs into the Machine Learning framework and becomes a collaborative technology in the overall solution. In fact, Machine Learning in general is structured this way. It is a general problem solver: it solves many “classes” of problems, but cannot solve any of them without these use-case-specific “plugins” to help. The specific class of problems that NLP helps ML solve is popularly referred to as “Text Processing”. There are literally hundreds of applications that can rely on Machine Learning for solutions, but below is a subset that holds a large share of the pie:
- Text Processing: Text summarization, Language translation, Fraud/Spam detection, etc.
- Image/Video Processing: Tumor detection and many medical uses, facial/image recognizers, etc.
- Audio Processing: Speech recognition, noise filtering, speech isolation, etc.
- Predictive Modelling: Should we pay this claim? Should we give this loan? etc.
The ML process (or even AI as a whole) is a use-case-agnostic, algorithmic way to solve any probability-based modelling problem. As long as you can turn whatever set of data you are starting with into numbers, you can solve the prediction problem with ML. However, the “hole in the donut” of any ML solution always needs to be filled with use-case-specific components that take care of the Feature Engineering aspects of the system depicted above.
Feature Engineering: What is it?
At this point you might be wondering: what is Feature Engineering, and how does it play into Machine Learning? Let’s start with the fundamentals.
What is a feature?
Let’s start with some basics about how machine learning works…
The first step of all machine learning problems is converting the raw data into ML input data, a.k.a. numbers. Computers only speak numbers, and machine learning algorithms are no different. There are many reasons for this that are beyond the scope of this article, but suffice it to say that the “magic” of these algorithms’ predictive abilities perches firmly on top of centuries of innovation in mathematics (mostly linear algebra, probability and statistics, and calculus). Since the various model implementations make use of these long-established mathematical methods, there has been no reason thus far to express these algorithms in anything other than the native input of the mathematics domain, a.k.a. numbers. This makes it the job of the data scientist to figure out how to turn the raw data into a set of numbers so that these algorithms can be used for predictions.
This numeric representation of the raw data thus becomes a sort of “proxy” for the raw data itself, e.g. the “image” (in image processing) or the “document” (in text processing) converted into numbers. Each atomic unit of informative data about that “document” or “image” included in the mapping for each record is called a Feature (database developers: think of a column cell of data inside a row). The encoding schemes and transformations that turn the raw data into features are called “Feature Engineering”. Beyond simple encoding, Feature Engineering also encompasses the related activities of transforming and sharpening the predictive quality of these features in order to yield the most accurate models. Note that in data science circles the term “Feature” is sometimes also referred to as a “Signal” or a “Dimension” (e.g. Dimensionality Reduction, etc.).
Before diving into NLP, let’s take a look at a Computer Vision example to help comprehend this Feature Engineering process:
Image Feature Engineering
We will borrow from the ‘Hello World’ of the Computer Vision problem domain within ML, called MNIST. The task is to train an ML model to recognize hand-written digits from 0 to 9.
Let’s start with determining how to turn these individual images into numbers. Say we want to encode the number 1 into numeric format. The world is our oyster as far as choosing how to encode this image pixel data into numbers; as long as we can go back and forth between the numeric form and the image form, we are good. So, simplistically, we might decide to represent each pixel in this 28-pixel-by-28-pixel image as an entry in a matrix (a vector of vectors, which can also be used as an input to an ML algorithm) like follows:
You can see that the slots in the array that are zero represent the white pixels. Each non-zero value in the matrix is a fraction from 0 to 1 (0 = white, 1 = black), with the fractional values indicating the intensity of darkness displayed in the image. The above image might be one of, say, 20,000 examples of images we might use to train our algorithm (including other numbers, like the “5”, “4”, and “0” you see above). Each of these images will also include a human’s manual “label” or judgment of what number the image actually represents. This important “judgment” variable is often called the “outcome”, “class”, “label”, or “target” variable (boy, I wish they would just come up with a naming consensus!), and it helps the ML algorithms group the data to make predictions.
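As a minimal sketch of this encoding step (the pixel values below are made up, and a real MNIST image is 28×28 grayscale integers from 0 to 255, not the tiny 3×5 grid used here):

```python
# Hypothetical raw pixels as integers 0-255 (0 = white, 255 = black).
raw_pixels = [
    [0,   0, 128, 255,   0],
    [0,  64, 255, 255,   0],
    [0,   0, 255,  64,   0],
]

# Normalize each pixel to a fraction between 0 and 1,
# matching the 0 = white, 1 = black convention described above.
matrix = [[px / 255 for px in row] for row in raw_pixels]

# Most ML algorithms expect one flat feature vector per image,
# so a 28x28 image would become a vector of 784 features.
feature_vector = [value for row in matrix for value in row]

print(len(feature_vector))  # 15 features for this tiny example
```

A real pipeline would do this for every image in the training set, producing one row of features per labelled example.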
The algorithm then, essentially, will look for similarities in all the images in the training set that have been grouped and labelled with a “1”; upon completion of that analysis, it will move on to the images labelled “2”, and so on. In analyzing each grouped set of data, it is comparing and contrasting the similarities in the matrix values between each like-classified set of images. For instance, perhaps (over-simplified again) all of the “1” images tend to have zeros in the first 6 columns and last 4 columns, whereas “0”s almost never do. There are very robust and effective linear algebra techniques for comparing vectors for similarity. Along with this, the benefit of a large set of “training” data is that it gives the algorithm an opportunity to see a large number of variations in the way someone might write the number “1” (more accurately: how variably the number “1” might get encoded into a matrix). Once the ML algorithm has trained on this data and produced a model, you might expose this model to, say, a web page, where a user could upload a hand-written image of a number from 0 to 9, and we can use our model to attempt to predict what that number is. Although this is called “Computer Vision”, as you can see, the computer is not actually seeing anything. It is simply comparing vectors to each other for similarity and trying to classify the next inbound vector (the uploaded image encoded into a vector).
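One of the workhorse linear algebra techniques alluded to above is cosine similarity, which measures the angle between two vectors rather than their raw values. A minimal sketch, using made-up 4-element “image” vectors for brevity:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: values near 1.0 mean the
    vectors point in nearly the same direction; 0.0 means no overlap."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Two flattened "images" with similar dark-pixel patterns score high...
v1 = [0.0, 0.9, 1.0, 0.0]
v2 = [0.0, 0.8, 0.9, 0.1]
# ...while a very different pattern scores low.
v3 = [1.0, 0.0, 0.0, 1.0]

print(cosine_similarity(v1, v2))  # close to 1.0
print(cosine_similarity(v1, v3))  # 0.0: no shared non-zero slots
```

This is the same kind of vector comparison that shows up again on the NLP side (VSM distance/similarity measurements).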
Text Feature Engineering
NLP contains the techniques used to perform most feature engineering tasks related to text processing. As mentioned, NLP has lots of advanced techniques to improve the quality of features significantly, but we will start with an extremely simple example to gain a foundation.
Just like in the image example above, the world is our oyster as far as how we decide to map or encode our text documents into numbers. The advanced techniques expand on many different ways to do this, but perhaps the most simplistic NLP technique is as follows:
Given a set of documents (the full set of documents we plan on using in our training is often called a “corpus”):
We might write a program to “tokenize” each document into a set of words, and then create a map, or “vocabulary”, of words where we assign an ordinal position to each term:
Then, for each document, we can encode each word contained in the text of the document as follows (we will call this “binary encoding”: place a 1 in the array if the term exists, 0 if not):
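The whole tokenize → vocabulary → binary-encode pipeline can be sketched in a few lines of plain Python (the two-document corpus below is made up purely for illustration):

```python
# A tiny, hypothetical corpus.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

def tokenize(document):
    """Split a document into lowercase word tokens."""
    return document.lower().split()

# Build the vocabulary: each distinct term gets an ordinal position.
vocabulary = {}
for doc in corpus:
    for token in tokenize(doc):
        if token not in vocabulary:
            vocabulary[token] = len(vocabulary)

def binary_encode(document):
    """Place a 1 in the vector if the term exists in the document, 0 if not."""
    vector = [0] * len(vocabulary)
    for token in tokenize(document):
        if token in vocabulary:
            vector[vocabulary[token]] = 1
    return vector

print(vocabulary)
print(binary_encode("the cat chased the dog"))
```

Note that every document in the corpus now maps to a fixed-length vector (one slot per vocabulary term), which is exactly the shape an ML algorithm expects.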
This binary encoding scheme provides a starter’s understanding of how we might represent a document as numbers. If used in a real model, however, it would not yield very good results, because it does not account for some important characteristics of the terms within the document that would help us.
Advanced Feature Engineering
For all the various problem domains solved through ML algorithms, simply encoding the raw data into numbers is usually not good enough, because there is often a lot of noise or other problems with the data that make prediction quality quite low. So, within each problem domain, various techniques have been invented over the years to strengthen the signal and dampen the noise represented in each feature vector. To use an analogy from the domain of signal processing: always strive to optimize the signal-to-noise ratio. The same applies here. A “pure” signal, conceptually, is one devoid of noise: it highlights the truly differentiating features of a given document or image versus all others, and dampens the heavily shared features that add little or no predictive power to the model (and often actually worsen the model’s ability to predict). At least some of these advanced techniques are almost always necessary in order to obtain quality modelling in almost any problem domain, because, most often, real-life data is messy. Below are some of the real-life feature engineering techniques used across the problem domains to attain higher levels of accuracy in modelling:
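As one concrete (and deliberately simple) sketch of dampening noise in text features: very common “stop words” such as “the” or “of” appear in nearly every document, so they carry almost no differentiating power and are often removed before encoding. The stop-word list below is a tiny, made-up subset; real lists run to hundreds of terms:

```python
# A tiny, hypothetical stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "on", "and", "is", "in"}

def remove_noise(tokens):
    """Drop extremely common terms that add little predictive power."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = "the cat sat on the mat".split()
print(remove_noise(tokens))  # ['cat', 'sat', 'mat']
```

After this filtering, the surviving terms (“cat”, “sat”, “mat”) are the ones that actually distinguish this document from others in the corpus.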
As you can see from above, there are numerous methods across all of the problem domains to provide features with higher predictive quality.
In the next blog post we will discuss the set of NLP techniques listed below that are used to produce high quality ML models:
- TF-IDF Weighting
- Stemming / Lemmatization
- n-grams / shingling
- Word Sense Disambiguation
- POS Tagging