Novelty and Outlier Detection

In my last few articles, I've looked at a number of ways machine learning can help make predictions. The basic idea is that you create a model using existing data and then ask that model to predict an outcome based on new data.

So, it's not surprising that one of the most amazing ways machine learning is being applied is in predicting the future. Just a few days before writing this piece, it was announced that machine learning models actually might be able to predict earthquakes—a goal that has eluded scientists for many years and that has the potential to save thousands, and maybe even millions, of lives.

But as you've also seen, machine learning can be used to "cluster" data—that is, to find patterns that humans either can't or won't see, and to try to put the data into various "clusters", or machine-driven categories. By asking the computer to divide data into distinct groups, you gain the opportunity to find and make use of previously undetected patterns.

Just as clustering can be used to divide data into a number of coherent groups, it also can be used to decide which data points belong inside a group and which don't. In "novelty detection", you have a data set that contains only good data, and you're trying to determine whether new observations fit within the existing data set. In "outlier detection", the data may contain outliers, which you want to identify.

Where could such detection be useful? Consider just a few questions you could answer with such a system:

  • Is there an unusual number of login attempts from a particular IP address?

  • Are any customers buying more than the typical number of products at a given hour?

  • Which homes are consuming above-average amounts of water during a drought?

  • Which judges convict an unusual number of defendants?

  • Should a patient's blood tests be considered normal, or are there outliers that require further checks and examinations?

In all of those cases, you could set thresholds for minimum and maximum values and then tell the computer to use those thresholds in determining what's suspicious. But machine learning changes that around, letting the computer figure out what is considered "normal" and then identify the anomalies, which humans then can investigate. This allows people to concentrate their energies on understanding whether the outliers are indeed problematic, rather than on identifying them in the first place.

So in this article, I look at a number of ways you can try to identify outliers using the tools and libraries that Python provides for working with data: NumPy, Pandas and scikit-learn. Just which techniques and tools will be appropriate for your data depends on what you're doing, but the basic theory and practice presented here should at least provide you with some food for thought.

Finding Anomalies

Humans are excellent at finding patterns, and they're also quite good at finding things that don't fit a pattern. But what sort of algorithm can look at a set of data points and figure out which ones are unlike the others?

One simple way to do this is to set a cutoff, often at one or two standard deviations from the mean. For those of you without a background in statistics (or who have forgotten what a "standard deviation" is), it's a measurement of how spread out the data is. For example:

>>> import numpy as np
>>> a = np.array([10,10,10,10,10,10,10])
>>> print("std = {}, mean = {}".format(a.std(), a.mean()))

std = 0.0, mean = 10.0

In the above example, I have a NumPy array containing seven instances of the number ten. People often think of the mean as describing the data, and it does, but it's only when combined with the standard deviation that you can know how much the numbers differ from one another. In this case, they're all identical, so the standard deviation is 0.

In this example, the mean remains the same, but the standard deviation is quite different:

>>> a = np.array([5,15,0,20,-5,25,10])
>>> print("std = {}, mean = {}".format(a.std(), a.mean()))

std = 10.0, mean = 10.0

You can see, from just those two numbers, that although the values remain centered around 10, they are spread out quite a bit.
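
To see what that number measures, here is a minimal sketch that computes the same result by hand, using only the standard library. It assumes the "population" form of the standard deviation (dividing by the number of values), which is what NumPy's std method uses by default; the variable names are my own.

```python
import math

values = [5, 15, 0, 20, -5, 25, 10]

# The mean is the sum of the values divided by how many there are.
mean = sum(values) / len(values)

# For each value, measure how far it is from the mean, and square it
# so that distances above and below the mean don't cancel out.
squared_diffs = [(v - mean) ** 2 for v in values]

# The variance is the average squared distance from the mean.
variance = sum(squared_diffs) / len(values)

# The standard deviation is the square root of the variance, which
# puts the result back into the same units as the original data.
std = math.sqrt(variance)

print("std = {}, mean = {}".format(std, mean))
```

This prints the same figures as the NumPy version: a standard deviation of 10.0 and a mean of 10.0.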

One simple way to detect unusual data is to look for all of the values that lie more than two standard deviations from the mean. If the data is roughly normally distributed, about 95% of it falls within that range. (You can go further out if you want; about 99.73% of normally distributed data points are within three standard deviations, and 99.994% are within four.) If you're looking for outliers in an existing data set, you can do something like this:

>>> a = np.array([-5,15,0,20,-5,25,1000])
>>> print("{:.2f}".format(a.std()))

347.19

>>> min_cutoff = a.mean() - a.std()*2
>>> max_cutoff = a.mean() + a.std()*2

>>> print(a[(a<min_cutoff) | (a>max_cutoff)])

[1000]

Sure enough, that found an outlier in the data.
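
The same cutoff idea can be wrapped in a few small functions, which makes it easier to try different numbers of standard deviations and to apply cutoffs learned from one data set to new observations. What follows is only a sketch: the function names (expected_coverage, std_cutoffs, is_unusual) and the "new" values are my own inventions for illustration, and the coverage figures assume that the data is normally distributed.

```python
import math
import numpy as np

def expected_coverage(n_std):
    """Share of normally distributed data within n_std standard deviations."""
    return math.erf(n_std / math.sqrt(2))

def std_cutoffs(data, n_std=2.0):
    """Return (low, high) cutoffs, n_std standard deviations from the mean."""
    data = np.asarray(data, dtype=float)
    center, spread = data.mean(), data.std()
    return center - n_std * spread, center + n_std * spread

def is_unusual(values, low, high):
    """Return a boolean array marking values outside the cutoffs."""
    values = np.asarray(values, dtype=float)
    return (values < low) | (values > high)

# How much normally distributed data should fall inside each cutoff?
for n_std in (2, 3, 4):
    print("within {} std: {:.4%}".format(n_std, expected_coverage(n_std)))

# Outlier detection: the cutoffs come from the data being checked.
a = np.array([-5, 15, 0, 20, -5, 25, 1000])
low, high = std_cutoffs(a)
outliers = a[is_unusual(a, low, high)]
print(outliers)

# Novelty detection: the cutoffs come from data known to be good,
# and are then applied to new (here, made-up) observations.
good = np.array([5, 15, 0, 20, -5, 25, 10])
low, high = std_cutoffs(good)
new = np.array([12, 31, -11, 29])
novel = new[is_unusual(new, low, high)]
print(novel)
```

In the outlier case, the value 1000 is itself part of the mean and standard deviation used to judge it, which pulls both numbers upward. In the novelty case, the cutoffs are based only on the good data, so the new values 31 and -11 are flagged against a range of -10 to 30.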