Teaching Your Computer

Let's assume you're going to keep the data as it is. You can't use a purely categorical model; instead, you'll need one that incorporates the statistical concept of "regression", in which you try to determine how strongly each of your input factors influences the output. That is, assume the ideal is something like the "y = qX" that you saw above; given that this isn't quite the case, how much influence did meat quality have vs. uniformity vs. temperature? Each of those factors affected the overall quality in some way, but some of them had more influence than others.

One of the easiest to understand, and most popular, types of models uses the K Nearest Neighbors (KNN) algorithm. KNN basically says that you'll take a new piece of data and compare its features with those of existing, known, categorized data. The new data is then classified into the same category as its K closest neighbors, where K is a number that you must determine, often via trial and error.
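To make that concrete, here is a minimal sketch of KNN classification on made-up data; the features and category names below are invented for illustration and aren't from the burrito dataset:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Invented samples: two features each, already categorized
X_known = np.array([[1, 1], [1, 2], [8, 9], [9, 8]])
y_known = np.array(['cheap', 'cheap', 'fancy', 'fancy'])

clf = KNeighborsClassifier(n_neighbors=3)   # here, K = 3
clf.fit(X_known, y_known)

# A new sample gets the majority category of its 3 nearest neighbors
print(clf.predict([[2, 1]]))   # prints ['cheap']
```

The new point at (2, 1) sits next to the two "cheap" samples, so two of its three nearest neighbors vote "cheap", and that's the category it receives.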

However, KNN works only for categories, and this example is dealing with a regression problem, which plain KNN can't handle. Except that Python's scikit-learn happens to come with a version of KNN designed to work with regression problems: the KNeighborsRegressor estimator.

So, how do you use it? Here's the basic way in which all supervised learning happens in scikit-learn:

  1. Import the Python class that implements the classifier.

  2. Create a model—that is, an instance of the classifier.

  3. Train the model using the "fit" method.

  4. Feed data to the model and get a prediction.

Let's try this with the data. You already have an X and a y, which you can plug in to the standard sklearn pattern:

from sklearn.neighbors import KNeighborsRegressor   # import classifier
KNR = KNeighborsRegressor()                         # create a model
KNR.fit(X, y)                                       # train the model

Without the dropna above (in which I removed any rows containing one or more NaN values), you still would have "dirty" data, and sklearn would be unable to proceed. Some classifiers can handle NaN data, but as a general rule, you'll need to get rid of NaN values—either to satisfy the classifier's rules, or to ensure that your results are of high quality, or even (in some cases) valid.
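As a reminder of what that cleanup step looks like, here's a small sketch using an invented DataFrame; the name df and these column names are illustrative stand-ins, not the article's actual dataset:

```python
import numpy as np
import pandas as pd

# Invented ratings; NaN marks a missing rating
df = pd.DataFrame({'Meat': [5, 4, np.nan],
                   'Uniformity': [4, np.nan, 3],
                   'overall': [4.5, 4.0, 3.0]})

clean = df.dropna()          # drop every row containing one or more NaNs
print(len(df), len(clean))   # prints 3 1
```

Only the first row survives, because each of the other two rows contains at least one NaN.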

With the trained model in place, you now can ask it: "If you have a burrito with really great ingredients, how highly will it rank?"

All you have to do is create a new, fake sample burrito with all high-quality ingredients:

great_ingredients = np.ones(X.iloc[0].count()) * 5

In the above line of code, I took the first sample from X (that is, X.iloc[0]) and counted how many items it contained. I then created a NumPy array of ones of that length and multiplied it by 5, so that it contained all 5s. I now can ask the model to predict the overall quality of such a burrito:

KNR.predict([great_ingredients])


I get back a result of:

array([ 4.86])

meaning that the burrito would indeed score high—not a 5, but high nonetheless. What if you create a burrito with absolutely awful ingredients? Let's find the predicted quality:

terrible_ingredients = np.zeros(X.iloc[0].count())

In the above line of code, I created a NumPy array containing zeros, the same length as the X's list of features. If you now ask the model to predict the score of this burrito, you get:

array([ 1.96])

The good news is that you have now trained the computer to predict the quality of a burrito from a set of rated ingredients. The other good news is that you can determine which ingredients are more influential and which are less influential.
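The article doesn't show how to measure that influence, but one approach (assuming scikit-learn 0.22 or later, and using synthetic stand-in data rather than the burrito ratings) is permutation importance, which measures how much the model's score drops when a single feature's values are shuffled:

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data: three fake ingredient ratings, where the first
# column is constructed to matter twice as much as the second, and the
# third column has no influence at all
rng = np.random.default_rng(0)
X_demo = rng.uniform(1, 5, size=(100, 3))
y_demo = 2 * X_demo[:, 0] + X_demo[:, 1] + rng.normal(0, 0.1, 100)

model = KNeighborsRegressor().fit(X_demo, y_demo)
result = permutation_importance(model, X_demo, y_demo,
                                n_repeats=10, random_state=0)
print(result.importances_mean)  # the first feature should score highest
```

Shuffling the influential first column destroys most of the model's predictive power, while shuffling the irrelevant third column barely changes it.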

At the same time, there is a problem: how do you know that KNN regression is the best model you could use? And when I say "best", I mean the most accurate at predicting burrito quality. For example, maybe a different model would predict scores across a wider range, or would track the actual ratings more closely.
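One standard way to answer that question is cross-validation: score each candidate model on held-out folds of the data and compare the results. This is a sketch using synthetic stand-in data, since in the article X and y come from the burrito ratings:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the burrito features and overall scores
rng = np.random.default_rng(1)
X = rng.uniform(1, 5, size=(120, 4))
y = X.mean(axis=1) + rng.normal(0, 0.2, 120)

for model in (KNeighborsRegressor(), LinearRegression()):
    scores = cross_val_score(model, X, y, cv=5)  # R^2 on each held-out fold
    print(type(model).__name__, round(scores.mean(), 2))
```

Whichever model earns the higher mean score generalizes better to data it hasn't seen, which is a much fairer comparison than scoring on the training data itself.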

It's also possible that the classifier is a good one, but that one of its parameters—parameters that you can use to "tune" the model—wasn't set correctly. And I suspect that you indeed could do better, since the best burrito actually sampled got a score of 5, and the worst burrito had a score of 1.5. This means that the model is not a bad start, but that it doesn't quite handle the entire range that one would have expected.

One possible solution to this problem is to adjust the parameters that you hand the classifier when creating the model. In the case of any KNN-related model, one of the first parameters to tune is n_neighbors. By default, it's set to 5, but what if you set it higher or lower?
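A sketch of that tuning loop, again with synthetic stand-in data (in the article, X and y are the burrito features and overall scores):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the burrito features and overall scores
rng = np.random.default_rng(2)
X = rng.uniform(1, 5, size=(120, 4))
y = X.mean(axis=1) + rng.normal(0, 0.2, 120)

# Try several values of n_neighbors and compare cross-validated scores
for k in (1, 3, 5, 10, 20):
    scores = cross_val_score(KNeighborsRegressor(n_neighbors=k), X, y, cv=5)
    print(f"n_neighbors={k}: mean R^2 = {scores.mean():.2f}")
```

Very small values of K tend to overfit (each prediction depends on a single noisy neighbor), while very large values blur everything toward the average, so the best setting usually sits somewhere in between.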