Novelty and Outlier Detection

It's even easier if you have a bunch of new data and want to determine whether those values would fit inside or outside your existing data set:

>>> new_data = np.array([-5000, -3000, -1000, -500, 20, 60, 500, 800,
...                      900])
>>> new_data[(new_data < min_cutoff) | (new_data > max_cutoff)]

array([-5000, -3000, -1000,   900])
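The cutoffs themselves come from earlier in the article. One common way to define such thresholds is the mean plus or minus two standard deviations; here's a self-contained sketch assuming that definition (the actual cutoffs used above may have been computed differently):

```python
import numpy as np

# Made-up "existing" data set for illustration.
data = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])

# Hypothetical cutoffs: mean plus/minus two standard deviations.
min_cutoff = data.mean() - 2 * data.std()
max_cutoff = data.mean() + 2 * data.std()

new_data = np.array([-5000, -3000, -1000, -500, 20, 60, 500, 800, 900])

# Boolean masks select the values that fall outside the cutoffs.
outliers = new_data[(new_data < min_cutoff) | (new_data > max_cutoff)]
```

With this particular (made-up) data set, everything below about -2 or above about 112 counts as an outlier; your cutoffs will depend entirely on your data.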

The good news is that this is simple—simple to understand, simple to implement and simple to automate.

However, it's also too simple for most data. You're unlikely to be looking at a single-dimensional vector. The baseline (mean) is likely to shift over time. And besides, there must be other, better ways to measure whether something is "inside" or "outside", right?

Getting More Sophisticated

For real-world anomaly detection, you're going to need to improve on a few fronts. You'll need to consider the data and determine what's "in" and what's "out". You'll also need to figure out ways to evaluate your model.

Let's consider novelty detection: there is initial data, and you want to know if a new piece of data would fit inside the existing data or if it would be considered an outlier. For example, consider a patient who comes in with values from a blood test. Do those tests indicate that the patient is normal, because the data's values are similar to the ones you've already seen? Or are those new values statistical outliers, indicating that the patient needs additional attention?
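The blood-test scenario can be sketched in a few lines of scikit-learn. This example uses LocalOutlierFactor in novelty mode (not the estimator used later in this article) with entirely made-up "blood test" numbers:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical training data: two blood-test measurements per patient,
# drawn from a "healthy" distribution.
rng = np.random.default_rng(0)
normal_patients = rng.normal(loc=[100, 5], scale=[10, 1], size=(200, 2))

# novelty=True tells LocalOutlierFactor to score *new* points
# against the training data, rather than the training data itself.
model = LocalOutlierFactor(novelty=True)
model.fit(normal_patients)

# predict() returns 1 for inliers and -1 for outliers.
predictions = model.predict([[102, 5.1], [250, 20]])
```

The first new patient's values sit well inside the training distribution; the second's are far outside it.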

In order to experiment with novelty and outlier detection, I downloaded historic precipitation data for an area of Pennsylvania (Wyncote), just outside Philadelphia, for every day in 2016. Because I'm a scientific kind of guy, I downloaded the data in metric units. The data came from the US government.

That site contains clear instructions for downloading the data.

It's quite amazing what government data is freely available, and the sorts of analysis you can do with it once you've retrieved it.

I downloaded the data as a CSV file and then used Pandas to read it into a data frame:

>>> df = pd.read_csv('/Users/reuven/downloads/914914.csv',
...                  usecols=['PRCP', 'DATE'])

Notice that I was interested only in PRCP (precipitation) and DATE (the date, in YYYYMMDD format). I then manipulated the data to break apart the DATE column and then to remove it:

>>> df['DATE'] = df['DATE'].astype(str)
>>> df['MONTH'] = df['DATE'].str[4:6].astype(np.int8)
>>> df['DAY'] = df['DATE'].str[6:8].astype(np.int8)
>>> df.drop('DATE', inplace=True, axis=1)

Why would I break the date apart? Because it'll likely be easier for models to work with three separate numeric columns, rather than a single date-time column. Besides, having these columns as part of my model will make it easier to understand whether snow in July is abnormal. I ignore the year, since it's the same for every record, which means that it can't help me as a predictor in this model.
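The same split also can be done by parsing the column as a real datetime first. Here's a sketch with a couple of made-up rows in the same YYYYMMDD format:

```python
import pandas as pd

# A couple of made-up rows in the same layout as the downloaded CSV.
df = pd.DataFrame({'DATE': ['20160704', '20161225'],
                   'PRCP': [0.0, 12.2]})

# Alternative to string slicing: parse DATE as a datetime,
# then pull out its components with the .dt accessor.
dates = pd.to_datetime(df['DATE'], format='%Y%m%d')
df['MONTH'] = dates.dt.month
df['DAY'] = dates.dt.day
df = df.drop('DATE', axis=1)
```

Either approach ends up with the same three numeric columns; the .dt accessor just validates the dates along the way.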

My data frame now contains 353 rows of data from 2016 (I'm not sure why it's not 365), with columns indicating the amount of rain (in mm), the month and the day of the month.

Based on this, how can you build a model to indicate whether rainfall on a given day is normal or an outlier?

In scikit-learn, you always use the same method: you import the estimator class, create an instance of that class and then fit the model. In the case of supervised learning, "fitting" means teaching the model which inputs go with which outputs. In the case of unsupervised learning, which I'm doing here, you use "fit" with just a set of inputs, allowing the model to distinguish between inliers and outliers.
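That pattern is the same across estimators. Here's a toy sketch (with made-up numbers) putting a supervised fit next to an unsupervised one:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import IsolationForest

X = np.array([[1], [2], [3], [4]])
y = np.array([2, 4, 6, 8])

# Supervised: fit() receives both inputs and outputs.
reg = LinearRegression()
reg.fit(X, y)

# Unsupervised: fit() receives only the inputs.
iso = IsolationForest(random_state=0)
iso.fit(X)

# The trained supervised model can now predict an output
# for an input it has never seen.
prediction = reg.predict([[5]])
```

In both cases, the model object is created first and learns nothing until fit() is called.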

Creating a Model

In the case of this data, there are several types of models that I can build. I experimented a bit and found that the IsolationForest estimator gave me the best results. Here's how I create and train the model:

>>> from sklearn.ensemble import IsolationForest
>>> model = IsolationForest()
>>> model.fit(df)

The model now has been trained, so I can find out whether a given amount of rain, on a certain month and day, is considered normal.
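Asking about a single hypothetical day is one predict() call. Here's a self-contained sketch with made-up training data in the same column layout (PRCP in mm, MONTH, DAY):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Made-up year of mostly light rainfall, in the article's column layout.
train = pd.DataFrame({
    'PRCP': rng.exponential(scale=3, size=365),
    'MONTH': rng.integers(1, 13, size=365),
    'DAY': rng.integers(1, 29, size=365),
})

model = IsolationForest(random_state=0)
model.fit(train)

# Ask about one specific (hypothetical) day: 500mm of rain on July 4th.
one_day = pd.DataFrame({'PRCP': [500.0], 'MONTH': [7], 'DAY': [4]})
verdict = model.predict(one_day)  # 1 = inlier, -1 = outlier
```

Half a meter of rain in a single day is far outside this made-up training distribution, so the model flags it as an outlier.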

To try things out, I check the model against its own inputs:

>>> pd.Series(model.predict(df)).value_counts()

In the above code, I run model.predict(df). This hands the model its own training inputs and asks it to predict whether each one is a normal, expected value (indicated by 1) or an outlier (indicated by -1). By turning the result into a Pandas series and then calling value_counts, I see:

 1    317
-1     36

The model marked 36 days as outliers. Perhaps those are false alarms, or perhaps those days really were unusual. The model certainly would be improved if it had multiple years' worth of data, rather than just one year's worth.
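One knob worth knowing about here is IsolationForest's contamination parameter, which tells the model what fraction of the training data to treat as outliers, instead of relying on the default 'auto' threshold. A sketch with made-up numbers:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(353, 1))  # 353 made-up one-column observations

# contamination=0.05 asks the model to label roughly 5% of the
# training data as outliers.
model = IsolationForest(contamination=0.05, random_state=0)
labels = model.fit_predict(X)

n_outliers = int((labels == -1).sum())
```

Lowering contamination would shrink that count of flagged days; raising it would grow the count. Which setting is right depends on how much follow-up attention each flagged day costs you.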