Novelty and Outlier Detection

Now what? I can ask the system to make some predictions:


for i in range(1, 13):
    print(model.predict([[15, i, 16]]))

This tells me whether it's normal to get 15 mm of rain on the 16th of each month; predict returns 1 for an inlier and -1 for an outlier. The model's conclusion: yes, it's perfectly normal in February–July, but not in August–January. And what if there's zero precipitation?


for i in range(1, 13):
    print(model.predict([[0, i, 16]]))

It turns out that no matter what month, it's never an outlier to have zero rain on the 16th of the month.
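If a bare 1 or -1 feels too coarse, here is a minimal sketch of running the same twelve queries in one call and printing the model's continuous score next to each label. The data is synthetic, and both the choice of IsolationForest and the column order (precipitation, month, day) are assumptions made for illustration, so the numbers it prints say nothing about real weather:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the weather data (an assumption, not real NOAA data).
# Columns: precipitation in mm, month, day of month.
rng = np.random.default_rng(0)
n_rows = 500
X = np.column_stack([
    rng.gamma(shape=1.0, scale=5.0, size=n_rows),  # precipitation
    rng.integers(1, 13, size=n_rows),              # month, 1-12
    rng.integers(1, 29, size=n_rows),              # day, 1-28
])

# IsolationForest is assumed here; any scikit-learn outlier detector
# with predict() and decision_function() could be swapped in.
model = IsolationForest(random_state=0)
model.fit(X)

# One query row per month: 15 mm of rain on the 16th.
queries = np.array([[15, month, 16] for month in range(1, 13)])

labels = model.predict(queries)            # 1 = inlier, -1 = outlier
scores = model.decision_function(queries)  # negative score = outlier

for month, label, score in zip(range(1, 13), labels, scores):
    print(f"month {month:2d}: label {label:2d}, score {score:+.3f}")
```

The score shows how close a given month is to the boundary, which the label alone hides.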

Of course, those are just crude tests. The real thing to do is use our old friend train_test_split:


>>> from pandas import Series
>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test = train_test_split(df)
>>> model.fit(X_train)
>>> Series(model.predict(X_test)).value_counts()

The model did pretty well, given that I didn't even try to tune it:


 1    77
-1    12
dtype: int64

In other words, given data that should all be classified as inliers, the model labeled 77 of the 89 test rows (about 87%) as inliers and flagged only 12 as outliers.
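To make that check repeatable, here is a rough sketch that turns the value counts into a single inlier fraction. It runs on synthetic data rather than the NOAA download; the column names PRCP, MONTH and DAY, the IsolationForest model and the fixed random_state values are all assumptions for the sake of a self-contained example:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split

# Synthetic data frame; the column names are hypothetical.
rng = np.random.default_rng(1)
n_rows = 400
df = pd.DataFrame({
    "PRCP": rng.gamma(shape=1.0, scale=5.0, size=n_rows),
    "MONTH": rng.integers(1, 13, size=n_rows),
    "DAY": rng.integers(1, 29, size=n_rows),
})

# No labels are needed, so only the feature frame is split.
# The default split holds out 25% of the rows.
X_train, X_test = train_test_split(df, random_state=0)

model = IsolationForest(random_state=0)  # assumed estimator
model.fit(X_train)

# Count how many held-out rows come back as inliers (1) or outliers (-1).
counts = pd.Series(model.predict(X_test)).value_counts()
inlier_fraction = counts.get(1, 0) / len(X_test)

print(counts)
print(f"{inlier_fraction:.1%} of held-out rows classified as inliers")
```

Fixing random_state makes the split and the model reproducible, which helps when comparing one run against another after changing a parameter.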

There are other types of estimators you can use as well. In particular, the One-Class SVM estimator has a good track record. That, combined with a larger data set, might well improve the results shown above, although when I tried One-Class SVM for this article, I didn't see any such improvement. It's possible that if I were to add several more years' worth of data, other estimators would work better.
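For anyone who wants to try the same experiment, swapping in One-Class SVM might look something like the sketch below. It is not the code I ran for this article: the data is synthetic, the scaling step is my own addition (SVMs tend to be sensitive to feature scale), and the nu value of 0.1 is an arbitrary starting point rather than a tuned setting:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Synthetic stand-in data: precipitation in mm, month, day of month.
rng = np.random.default_rng(2)
n_rows = 400
X = np.column_stack([
    rng.gamma(shape=1.0, scale=5.0, size=n_rows),
    rng.integers(1, 13, size=n_rows),
    rng.integers(1, 29, size=n_rows),
])

X_train, X_test = train_test_split(X, random_state=0)

# Scale the features, then fit the one-class model.
# nu bounds the fraction of training rows allowed outside the boundary;
# 0.1 is an arbitrary choice for this sketch.
svm_model = make_pipeline(
    StandardScaler(),
    OneClassSVM(kernel="rbf", gamma="scale", nu=0.1),
)
svm_model.fit(X_train)

svm_labels = svm_model.predict(X_test)  # 1 = inlier, -1 = outlier
n_inliers = int((svm_labels == 1).sum())
n_outliers = int((svm_labels == -1).sum())

print(f"inliers: {n_inliers}, outliers: {n_outliers}")
```

Because the pipeline exposes the same fit and predict methods, the rest of the evaluation code doesn't need to change when the estimator does.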

Conclusion

Novelty and outlier detection is (yet another) large, exciting and growing use for machine learning. As usual with machine learning, the problem is not one of coding, but rather of massaging the data into a format that you can use, and then tinkering with model definitions until you find one that predicts or identifies outliers with a high degree of confidence.

Resources

I used Python and many parts of the SciPy stack (NumPy, SciPy, Pandas, Matplotlib and scikit-learn) in this article. All are available from PyPI or from SciPy.org.

The scikit-learn documentation has some (but not a great deal of) material on novelty/outlier detection.

A simple Python package for detecting anomalies, lsanomaly, is available on PyPI and GitHub. It might be worth consideration for simple data sets.

As I mentioned previously, the US government's NOAA (National Oceanic and Atmospheric Administration) site contains a treasure trove of weather and climate data, which you can download for free.
