Teaching Your Computer

A bit of Python code can establish this for you, trying each value of n_neighbors in turn:


from sklearn.neighbors import KNeighborsRegressor

# Try k from 1 through 9, printing each model's predictions
# for the worst and best ingredient combinations
for k in range(1, 10):
    print(k)
    KNR = KNeighborsRegressor(n_neighbors=k)
    KNR.fit(X, y)
    print("\tTerrible: {0}".format(KNR.predict([terrible_ingredients])))
    print("\tBest: {0}".format(KNR.predict([great_ingredients])))

After running the above code, it seems that the model with the highest high and the lowest low (that is, the widest spread of predictions) is the one in which n_neighbors is equal to 1. That's not quite what I would have expected, but it's also why it's important to try different models.
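If you'd rather have the code identify that spread for you, here's a minimal sketch, assuming the X, y, terrible_ingredients and great_ingredients defined earlier, that picks whichever value of n_neighbors yields the widest gap between the two predictions:

from sklearn.neighbors import KNeighborsRegressor

# Track the k whose best/worst predictions are farthest apart
best_k, best_spread = None, 0
for k in range(1, 10):
    KNR = KNeighborsRegressor(n_neighbors=k)
    KNR.fit(X, y)
    spread = (KNR.predict([great_ingredients])[0] -
              KNR.predict([terrible_ingredients])[0])
    if spread > best_spread:
        best_k, best_spread = k, spread

print("Widest spread with n_neighbors={0}".format(best_k))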

And yet, this way of checking which value of n_neighbors is best is rather primitive and has lots of issues. In my next article, I plan to look into checking models using more sophisticated techniques than the ones I used here.

Using Another Classifier

So far, I've described how you can create multiple models from a single classifier, but scikit-learn comes with numerous classifiers, and it's usually a good idea to try several.

So in this case, let's also try a simple regression model. Whereas KNN uses existing, known data points to decide what outputs to predict for new inputs, linear regression uses good old statistical techniques, finding the coefficients of a line (or hyperplane) that best fits the training data. You can use it as follows:


from sklearn.linear_model import LinearRegression

# Fit a linear model to the same data, then predict
# scores for the worst and best ingredient combinations
LR = LinearRegression()
LR.fit(X, y)
print("\tTerrible: {0}".format(LR.predict([terrible_ingredients])))
print("\tBest: {0}".format(LR.predict([great_ingredients])))

Once again, I want to stress that just because a model doesn't cover the entire spread of output values, from best to worst, doesn't mean you should discount it. And, a model that works well with one data set often will not work with other data sets.

But as you can see, scikit-learn makes it easy—almost trivially easy, in fact—to create and experiment with different models. You can, thus, try different classifiers, and types of classifiers, in order to create a model that describes your data.
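For instance, here's a sketch of how you might compare several model types in a single loop; it assumes the same X, y and ingredient vectors as before, and relies on the fact that every scikit-learn model shares the same fit/predict API:

from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression

# Every scikit-learn model exposes fit() and predict(),
# so trying several types is just a loop
for model in [KNeighborsRegressor(n_neighbors=1), LinearRegression()]:
    model.fit(X, y)
    print(type(model).__name__)
    print("\tTerrible: {0}".format(model.predict([terrible_ingredients])))
    print("\tBest: {0}".format(model.predict([great_ingredients])))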

Now that you've created several models, the big question is: Which one is the best? Which one not only describes the data, but describes it well? Which one will give the most predictive power moving forward, as you encounter an ever-growing number of burritos? What ingredients should a burrito-maker emphasize in order to maximize eater satisfaction while minimizing costs?

In order to answer these questions, you'll need to have a way of testing your models. In my next article, I'll look at how to test your models, using a variety of techniques to check the validity of a model and even compare numerous classifier types against one another.
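As a small preview, one such technique is cross-validation, which scikit-learn wraps in a single function; this sketch assumes the X and y from earlier:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

# Score the model on five different train/test splits;
# the default scorer for regressors is R^2, where 1.0 is perfect
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print("Mean score across 5 splits: {0}".format(scores.mean()))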

Resources

I used Python and the many parts of the SciPy stack (NumPy, SciPy, Pandas, matplotlib and scikit-learn) in this article. All are available from PyPI or from SciPy.org.

I recommend a number of resources for people interested in data science and machine learning. One long-standing weekly e-mail list is "KDnuggets". You also should consider the "Data Science Weekly" newsletter and "This Week in Data", which describes the latest data sets available to the public.

I am a big fan of podcasts and particularly love "Partially Derivative". Other good ones are "Data Stories" and "Linear Digressions". I listen to all three on a regular basis and learn from them all.

If you're looking to get into data science and machine learning, I recommend Kevin Markham's "Data School" and Jason Brownlee's "Machine Learning Mastery"; the latter sells a number of short, dense, but high-quality ebooks on these subjects.
