Testing Models

In my last few articles, I've been dipping into the waters of "machine learning"—a powerful idea that has been moving steadily into the mainstream of computing, and that has the potential to change lives in numerous ways. The goal of machine learning is to produce a "model"—a piece of software that can make predictions with new data based on what it has learned from old data.

One common type of problem that machine learning can help solve is classification. Given some new data, how can you categorize it? For example, if you're a credit-card company, and you have data about a new purchase, does the purchase appear to be legitimate or fraudulent? The degree to which you can categorize a purchase accurately depends on the quality of your model. And, the quality of your model will generally depend on not only the algorithm you choose, but also the quantity and quality of data you use to "train" that model.

Implied in the above statement is that given the same input data, different algorithms can produce different results. For this reason, it's not enough to choose a machine-learning algorithm. You also must test the resulting model and compare its quality against other models as well.
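To see that concretely, here's a minimal sketch that trains two different classifiers on identical data and compares how well each one fits. The synthetic data set, sample counts and choice of algorithms here are arbitrary illustrations of mine, not part of the study discussed below:

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (sizes chosen arbitrarily)
X, y = make_classification(n_samples=500, n_features=10,
                           random_state=0)

results = {}
for model in (KNeighborsClassifier(),
              DecisionTreeClassifier(random_state=0)):
    model.fit(X, y)                    # train on identical data
    results[type(model).__name__] = model.score(X, y)

print(results)    # the two algorithms fit the data differently
```

Same inputs, two algorithms, two different scores.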

So in this article, I explore the notion of testing models. I show how Python's scikit-learn package, which you can use to build and train models, also lets you test them, and I describe the tools it provides for comparing model effectiveness.

Testing Models

What does it even mean to "test" a model? After all, if you have built a model based on available data, doesn't it make sense that the model will work with future data?

Perhaps, but you need to check, just to be sure. Perhaps the algorithm isn't quite appropriate for the type of data you're examining, or perhaps there wasn't enough data to train the model well. Or, perhaps the data was flawed and, thus, didn't train the model effectively.

But, one of the biggest problems with modeling is that of "overfitting". Overfitting means that the model does a great job of describing the training data, but that it is tied to the training data so closely and specifically that it cannot be generalized further.

For example, let's assume that a credit-card company wants to model fraud. You know that in a large number of cases, people use credit cards to buy expensive electronics. An overfit model wouldn't just give extra weight to someone buying expensive electronics in its determination of fraud; it might look at the exact price, location and type of electronics being bought. In other words, the model will precisely describe what has happened in the past, limiting its ability to generalize and predict the future.

Imagine if you could read letters only when they were printed in a font you had previously learned, and you can understand the limitations of overfitting.

How do you avoid overfit models? You check them with a variety of input data. If the model performs well across a number of different inputs, it's more likely to generalize to data it hasn't yet seen.
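A common way to perform that check is to hold back part of your data, train on the rest, and compare the model's performance on the two sets. Here's a sketch, again with made-up data; the n_neighbors=1 setting is my own choice, deliberately prone to overfitting, since it effectively memorizes the training data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data
X, y = make_classification(n_samples=500, n_features=10,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0)

# n_neighbors=1 memorizes the training data -- a classic overfit
model = KNeighborsClassifier(n_neighbors=1)
model.fit(X_train, y_train)

train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print(train_score)    # 1.0 -- perfect on data it has already seen
print(test_score)     # typically noticeably lower on unseen data
```

A large gap between the two scores is a strong hint that the model is overfit.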

In my last article, I continued to look at data from a semi-humorous study in which evaluations were made of burritos at a variety of restaurants in Southern California. Examining this data allowed me to identify which elements of a burrito were important (or not) in the overall burrito's quality assessment. Here, in summary, are the steps I took inside a Jupyter notebook window in order to load the data and build the model:

%pylab inline
import pandas as pd                     # load pandas with an alias
from pandas import Series, DataFrame    # load useful Pandas classes
df = pd.read_csv('burrito.csv')         # read into a data frame

burrito_data = df.iloc[:, 11:24].copy() # columns 11-23, by position
burrito_data.drop(['Circum', 'Volume', 'Length'], axis=1, inplace=True)
burrito_data.dropna(inplace=True, axis=0)

y = burrito_data['overall']
X = burrito_data.drop(['overall'], axis=1)

from sklearn.neighbors import KNeighborsRegressor  # import
                                                   # regressor
KNR = KNeighborsRegressor()                        # create a model
KNR.fit(X, y)                                      # train the model

So, is the model good or not? You can know only if you try to make some predictions for which you know the answers, and see whether the model predicts things correctly.

Where can you find data about which you already know the answers? In the input data, of course! You can ask the model (KNR) to make predictions about X and compare those predictions with y. If the model were performing categorization, you could even examine the results by hand to get a basic assessment. But with regression, or even with a large-scale categorization model, you're going to need a more serious set of metrics.
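For a regression model, that hand comparison might look like the following sketch. I'm substituting synthetic data from scikit-learn's make_regression here, since the burrito CSV isn't included:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the burrito data
X, y = make_regression(n_samples=200, n_features=5,
                       noise=10, random_state=0)

KNR = KNeighborsRegressor()
KNR.fit(X, y)

predicted = KNR.predict(X)
residuals = y - predicted            # prediction errors
print(residuals[:5])                 # eyeball a few of them
print(np.abs(residuals).mean())      # mean absolute error, by hand
```

Computing error measures by hand like this gets old quickly, which is exactly why scikit-learn ships its own metrics.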

Fortunately, scikit-learn comes with a number of metrics you can use. If you say:

from sklearn import metrics

then you have access to methods that compare the values your model computes against the known, correct values (that is, the original "y" vector). You can apply several scores to the model; one of them is the "explained variance score". You can get that as follows:

y_pred = KNR.predict(X)    # predictions for the training inputs

from sklearn import metrics
metrics.explained_variance_score(y, y_pred)
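scikit-learn can even automate the whole train-and-check cycle with cross-validation, repeatedly splitting the data, training on one part and scoring on the rest. Here's a sketch, once again using synthetic stand-in data rather than the burrito CSV:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data
X, y = make_regression(n_samples=200, n_features=5,
                       noise=10, random_state=0)

# Five train/test splits, scored (by default) with R^2
scores = cross_val_score(KNeighborsRegressor(), X, y, cv=5)
print(scores)
print(scores.mean())
```

The mean of those five scores gives a more honest picture of the model's quality than a single score computed on the training data ever could.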