Testing Models

In my last few articles, I've been dipping into the waters of "machine learning"—a powerful idea that has been moving steadily into the mainstream of computing, and that has the potential to change lives in numerous ways. The goal of machine learning is to produce a "model"—a piece of software that can make predictions with new data based on what it has learned from old data.

One common type of problem that machine learning can help solve is classification. Given some new data, how can you categorize it? For example, if you're a credit-card company, and you have data about a new purchase, does the purchase appear to be legitimate or fraudulent? The degree to which you can categorize a purchase accurately depends on the quality of your model. And, the quality of your model will generally depend on not only the algorithm you choose, but also the quantity and quality of data you use to "train" that model.

Implied in the above statement is that given the same input data, different algorithms can produce different results. For this reason, it's not enough to choose a machine-learning algorithm. You also must test the resulting model and compare its quality against other models as well.

So in this article, I explore the notion of testing models. I show how Python's scikit-learn package, which you can use to build and train models, also provides the ability to test them. I also describe how scikit-learn provides tools to compare model effectiveness.

Testing Models

What does it even mean to "test" a model? After all, if you have built a model based on available data, doesn't it make sense that the model will work with future data?

Perhaps, but you need to check, just to be sure. Perhaps the algorithm isn't quite appropriate for the type of data you're examining, or perhaps there wasn't enough data to train the model well. Or, perhaps the data was flawed and, thus, didn't train the model effectively.

But, one of the biggest problems with modeling is that of "overfitting". Overfitting means that the model does a great job of describing the training data, but that it is tied to the training data so closely and specifically, it cannot be generalized further.

For example, let's assume that a credit-card company wants to model fraud. You know that in a large number of cases, people use credit cards to buy expensive electronics. An overfit model wouldn't just give extra weight to someone buying expensive electronics in its determination of fraud; it might look at the exact price, location and type of electronics being bought. In other words, the model will precisely describe what has happened in the past, limiting its ability to generalize and predict the future.

Imagine that you could read only letters set in a font you had previously learned, and you can further understand the limitations of overfitting.

How do you avoid overfit models? You check them with a variety of input data. If the model performs well with a number of different inputs, it should also work well with data it hasn't yet seen.

In my last article, I continued to look at data from a semi-humorous study in which evaluations were made of burritos at a variety of restaurants in Southern California. Examining this data allowed one to identify which elements of a burrito were important (or not) in the overall burrito's quality assessment. Here, in summary, are the steps I took inside a Jupyter notebook window in order to load the data and create the model:

%pylab inline
import pandas as pd                     # load pandas with an alias
from pandas import Series, DataFrame    # load useful Pandas classes
df = pd.read_csv('burrito.csv')         # read into a data frame

burrito_data = df.iloc[:, 11:24]        # columns 11-23, by position
burrito_data = burrito_data.drop(['Circum', 'Volume', 'Length'], axis=1)
burrito_data = burrito_data.dropna(axis=0)   # drop rows with missing values

y = burrito_data['overall']
X = burrito_data.drop(['overall'], axis=1)

from sklearn.neighbors import KNeighborsRegressor  # import the
                                                   # regressor
KNR = KNeighborsRegressor()                        # create a model
KNR.fit(X, y)                                      # train the model

So, is the model good or not? You can know only if you try to make some predictions for which you know the answers, and see whether the model predicts things correctly.

Where can you find data about which you already know the answers? In the input data, of course! You can ask the model (KNR) to make predictions about X and compare those with y. If the model were performing categorization, you even could examine it by hand to get a basic assessment. But using regression, or even a large-scale categorization model, you're going to need a more serious set of metrics.

Fortunately, scikit-learn comes with a number of metrics you can use. If you say:

from sklearn import metrics

then you have access to functions that compare the values predicted by the model with the values you already know (that is, the original "y" vector). You can apply several scores to the model; one of them is the "explained variance score". You can get that as follows:

y_pred = KNR.predict(X)

from sklearn import metrics
metrics.explained_variance_score(y, y_pred)

Notice what's happening here. You're reusing the input matrix X, asking the model to predict its outputs. But, you already know those outputs; those are in y. So now you see how closely the model comes to predicting outputs that already were fed into it.

On my system, I get 0.64538408898281541. A perfect model would score 1, so this model is okay, but not amazing.
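For reference, the explained variance score can be computed by hand as 1 - Var(y - y_pred) / Var(y). Here's a minimal sketch that checks this formula against scikit-learn. The numbers are made up for illustration and have nothing to do with the burrito data, and the explained_variance helper is my own name, not part of scikit-learn:

```python
import numpy as np
from sklearn import metrics

# Hypothetical true values and predictions, for illustration only
y_true = np.array([3.0, 4.0, 2.5, 5.0, 3.5])
y_pred = np.array([2.8, 4.2, 3.0, 4.5, 3.5])

def explained_variance(y_true, y_pred):
    """Return 1 - Var(residuals) / Var(y_true)."""
    residuals = y_true - y_pred
    return 1.0 - np.var(residuals) / np.var(y_true)

by_hand = explained_variance(y_true, y_pred)
from_sklearn = metrics.explained_variance_score(y_true, y_pred)

print(by_hand, from_sklearn)   # the two values should agree
```

If the predictions match the true values exactly, the residuals have no variance and the score is 1. The further the residuals spread out, the lower the score.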

However, at least you now have a way of evaluating the model and comparing it against other models that might be better or worse. You even can run KNR for different numbers of neighbors and see how well (or poorly) each model does:

for k in range(1,10):
    KNR = KNeighborsRegressor(n_neighbors=k)
    KNR.fit(X, y)
    y_pred = KNR.predict(X)
    print("\t{}".format(metrics.mean_squared_error(y, y_pred)))
    print("\t{}".format(metrics.explained_variance_score(y, y_pred)))

The good news is that you have now looked at how the KNR model changes when configured with different values of n_neighbors. Moreover, you see that when n_neighbors = 1, you get no error and 100% explained variance. The model is a success!
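The perfect score for n_neighbors=1 isn't specific to this data set. Here's a minimal sketch, using random made-up data rather than the burrito data, in which the targets are pure noise and there is nothing to learn. The seed and array sizes are arbitrary choices of mine:

```python
import numpy as np
from sklearn import metrics
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.RandomState(0)          # fixed seed, arbitrary choice
X_demo = rng.rand(50, 3)                # 50 made-up samples, 3 features
y_demo = rng.rand(50)                   # targets are pure noise

memorizer = KNeighborsRegressor(n_neighbors=1)
memorizer.fit(X_demo, y_demo)
y_demo_pred = memorizer.predict(X_demo)

# Each training point is its own nearest neighbor, so the model
# simply returns the target value it stored for that point
train_mse = metrics.mean_squared_error(y_demo, y_demo_pred)
train_ev = metrics.explained_variance_score(y_demo, y_demo_pred)
print(train_mse, train_ev)
```

Even with nothing to learn, the error on the training data is zero and the explained variance is 1, which suggests the score says more about the test than about the model.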

Split Testing

But wait. The above test is a bit silly. If you test the model using data that was part of the training, you would be surprised if the model didn't get it at least partly right. The real test of a model is how well it works when it encounters new data.

It's a bit of a dilemma. You want to test the model with real-world data, but if you do that, you don't necessarily know what answer should appear. And, that means you can't really test it after all.

The modeling world has a simple solution to this problem. Use only a subset of the training data to train the model, and use the rest for testing it.

scikit-learn supports this "train-test split" with its train_test_split function. You invoke train_test_split on your original X and y values, getting two X values (for training and testing) and two y values (for training and testing) back. As you might expect, you then can train the model with the X_train and y_train values and test it with X_test and y_test:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

KNR = KNeighborsRegressor(n_neighbors=1)
KNR.fit(X_train, y_train)
y_pred = KNR.predict(X_test)
print("\t{}".format(metrics.mean_squared_error(y_test, y_pred)))
print("\t{}".format(metrics.explained_variance_score(y_test, y_pred)))

Suddenly, this amazing model no longer seems so amazing. Checked against values it hadn't seen before, it gives a mean squared error of 0.4 and an explained variance of 0.2.

This doesn't mean the model is terrible, but it does mean you might want to check it a bit further. Perhaps you should (again) check additional values of n_neighbors. Or, perhaps you should try something other than KNeighborsRegressor. Again though, the key takeaway is that you are now using a real, reasonable way to evaluate that model, rather than just eyeballing the numbers and assuming (hoping) that all is well.
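Checking additional values of n_neighbors against held-out data might look something like the following. This is only a sketch: the data is made up (a target that depends on one feature, plus noise), and the test_size and random_state values are arbitrary choices of mine. Passing random_state fixes the shuffle, so every value of k is judged on the same split:

```python
import numpy as np
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Made-up regression data: y depends on the first feature plus noise
rng = np.random.RandomState(42)
X_demo = rng.rand(200, 4)
y_demo = 3 * X_demo[:, 0] + rng.normal(scale=0.3, size=200)

# Hold out 25% of the rows; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42)

scores = {}
for k in range(1, 10):
    model = KNeighborsRegressor(n_neighbors=k)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    scores[k] = metrics.mean_squared_error(y_test, y_pred)
    print("k={}: MSE={:.3f}".format(k, scores[k]))

best_k = min(scores, key=scores.get)    # k with the lowest test error
print("lowest test error at k =", best_k)
```

Without a fixed random_state, each run shuffles the rows differently, so the scores (and possibly the best k) can change from run to run.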

Multiple Splits

The split-test that you do might somehow tickle the model in such a way that it gives particularly good (or bad) results. What you really need to do is try different splits, so you can see how the model performs no matter which part of the data you use for training. Then, you can average the results over a bunch of different splits.

In the world of scikit-learn, this is done using KFold. You indicate the number of "folds" (that is, split tests) you'll want to run:

from sklearn.model_selection import KFold, cross_val_score
kfold = KFold(n_splits=10)

With the kfold object in place, you then can pass it to the cross_val_score function in the model_selection module. You pass it the model (KNR, in this case), X, y and the kfold object you created:

cv_results = cross_val_score(KNR, X, y, cv=kfold)

The cv_results object you get back describes the cross validation and typically is analyzed by looking at its mean (that is, what was the average score across those runs) and the standard deviation (that is, how much variance was there across runs):

print(cv_results.mean())
print(cv_results.std())

In this particular case, the results aren't that promising:


In other words, although n_neighbors=1 seemed so terrific when first analyzed, using all of the training data for testing, that no longer appears to be the case.
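To see what cross-validation with KFold actually does, you can write the loop out by hand. Here's a sketch on made-up data (the arrays, seed and n_neighbors value are my own arbitrary choices). It trains on nine folds, scores on the tenth, repeats for each fold and then compares the results with what cross_val_score returns:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.RandomState(0)
X_demo = rng.rand(100, 3)                               # made-up features
y_demo = X_demo.sum(axis=1) + rng.normal(scale=0.1, size=100)

model = KNeighborsRegressor(n_neighbors=5)
kfold = KFold(n_splits=10)

manual_scores = []
for train_idx, test_idx in kfold.split(X_demo):
    # Train on nine folds, score on the one fold held out
    model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))
manual_scores = np.array(manual_scores)

# For a regressor, cross_val_score defaults to the estimator's
# score method, which is R^2 (1 is perfect; it can go negative)
auto_scores = cross_val_score(model, X_demo, y_demo, cv=kfold)

print(manual_scores.mean(), manual_scores.std())
print(np.allclose(manual_scores, auto_scores))
```

Note that unless you pass a scoring argument, the numbers you get back for a regressor are R^2 values, not mean squared errors, so higher is better.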

Even if you stick with KNR as your model, you still can incorporate KFold, checking to see when (if) a different value of n_neighbors might be better than the value of 1 you gave here:

from sklearn.model_selection import KFold, cross_val_score

for k in range(1,10):
    KNR = KNeighborsRegressor(n_neighbors=k)
    kfold = KFold(n_splits=10)
    cv_results = cross_val_score(KNR, X, y, cv=kfold)
    print("\t{}".format(cv_results.mean()))
    print("\t{}".format(cv_results.std()))

Sure enough, when k=9, you get results that are significantly better than when k=1:


That said, I do believe it's likely you can create a better model. Perhaps a different regression algorithm would improve things. Perhaps using categorization, rather than regression, in which you round the values in y to the nearest integer and treat scores as 5 distinct categories, would work. Perhaps, as mentioned before, I should have paid more attention to which columns were most (and least) important and done some better feature selection.
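The categorization idea might be sketched as follows. I haven't run this against the burrito data; the ratings here are made up, clipped to the range 1 through 5, and KNeighborsClassifier is simply one classifier you could choose. The shuffle and random_state settings are also my own assumptions:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(1)
X_demo = rng.rand(150, 4)                        # made-up features
# Made-up ratings between 1 and 5, loosely tied to the features
ratings = 1 + 4 * X_demo.mean(axis=1) + rng.normal(scale=0.3, size=150)
ratings = np.clip(ratings, 1, 5)

# Round each rating to the nearest integer to get class labels
y_class = np.rint(ratings).astype(int)

KNC = KNeighborsClassifier(n_neighbors=5)
kfold = KFold(n_splits=5, shuffle=True, random_state=1)

# For a classifier, the default score is accuracy
# (the fraction of labels predicted correctly)
cv_results = cross_val_score(KNC, X_demo, y_class, cv=kfold)
print(cv_results.mean(), cv_results.std())
```

Keep in mind that accuracy treats a prediction of 2 for a true 5 as no worse than a prediction of 4, so the scores aren't directly comparable with the regression scores above.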

Regardless, with a proper test system in place, you're now able to start tackling these questions intelligently with a way to evaluate your progress.
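As one example of such an evaluation, here's a sketch that compares two kinds of models on the same folds. The data is made up, LinearRegression is just an arbitrary second candidate, and the seed and n_neighbors value are my own assumptions. I ask for mean squared error explicitly via the scoring parameter; scikit-learn reports it negated, so that higher always means better:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.RandomState(7)
X_demo = rng.rand(120, 5)                                # made-up features
weights = np.array([2.0, -1.0, 0.5, 0.0, 0.0])           # made-up weights
y_demo = X_demo.dot(weights) + rng.normal(scale=0.2, size=120)

candidates = {
    "KNeighborsRegressor": KNeighborsRegressor(n_neighbors=9),
    "LinearRegression": LinearRegression(),
}

# A fixed random_state means both models see identical folds
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

summary = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=kfold,
                             scoring="neg_mean_squared_error")
    # Flip the sign back to get an ordinary mean squared error
    summary[name] = (-scores.mean(), scores.std())
    print("{}: MSE {:.3f} (+/- {:.3f})".format(name, *summary[name]))
```

Because both candidates are scored on identical splits with the same metric, a difference in the means reflects the models rather than the luck of the split.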


It's not enough to create a machine-learning model; testing it is also important. As you saw here, scikit-learn makes it relatively easy to create, split-test and then evaluate one model or even a whole bunch of them.

Supervised learning isn't the only type of machine learning out there. In many cases, you can ask the computer to divide your data into multiple groups based on heuristics it develops, rather than categories that you have trained. In my next article, I plan to look at how (and when) to build "unsupervised learning" models.

Reuven Lerner teaches Python, data science and Git to companies around the world. You can subscribe to his free, weekly "better developers" e-mail list, and learn from his books and courses at http://lerner.co.il. Reuven lives with his wife and children in Modi'in, Israel.
