Teaching Your Computer

The goal here, then, will be to combine the burrito data and an algorithm to create a model for burrito tastiness. The next step will be to see if the model can predict the tastiness of a burrito based on its inputs.

But, how do you create such a model?

In theory, you could create it from scratch, reading the appropriate statistical literature and implementing it all in code. But I'm using Python, and Python's scikit-learn package, which has been tuned and improved over several years, already offers a variety of model types that others have implemented.

Before building the model, however, let's get the data into the necessary format. As I mentioned in my last article and alluded to above, Python's machine-learning package (scikit-learn) expects two things when training a supervised-learning model: a set of sample inputs, traditionally placed in a two-dimensional matrix called X (yes, uppercase X), and a set of sample outputs, traditionally placed in a vector called y (lowercase). You can get there as follows, inside the Jupyter notebook:


%pylab inline
import pandas as pd                     # load pandas with an alias
from pandas import Series, DataFrame    # load useful Pandas classes
df = pd.read_csv('burrito.csv')         # read into a data frame

Once you have loaded the CSV file containing burrito data, you'll keep only those columns that contain the features of interest, as well as the output score:


burrito_data = df.iloc[:, 11:24].copy()   # columns 11-23, by position
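
If selecting columns by position is unfamiliar, it is easiest to see on a small frame. The following is only a sketch: the column names and values are made up rather than taken from the burrito CSV, and it assumes a recent pandas version, where iloc is the position-based indexer:

```python
import pandas as pd

# Hypothetical toy frame standing in for the real CSV file
toy = pd.DataFrame({
    'Location': ['A', 'B', 'C'],
    'Cost': [6.5, 7.0, 8.25],
    'Tortilla': [3.0, 4.0, 3.5],
    'Meat': [4.0, 3.5, 4.5],
    'overall': [3.8, 3.9, 4.2],
})

# Keep the columns at positions 2, 3 and 4; the end of the slice is excluded.
# .copy() gives an independent frame that is safe to modify later.
subset = toy.iloc[:, 2:5].copy()
print(list(subset.columns))
```

The same slice notation, with different bounds, picks out the feature and score columns from the full data frame.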

You'll then remove the columns that are highly correlated to one another and/or for which a great deal of data is missing. In this case, it means removing all of the features having to do with burrito size:


burrito_data = burrito_data.drop(['Circum', 'Volume', 'Length'], axis=1)
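
One way to spot columns that are candidates for removal is to look at pairwise correlations and at the share of missing values per column. The sketch below uses invented numbers (the Volume column is deliberately built to track Length), so it shows only the mechanics, not the actual burrito measurements:

```python
import pandas as pd

# Hypothetical measurements; Volume is constructed to follow Length exactly
sizes = pd.DataFrame({
    'Length': [18.0, 20.0, 22.0, 24.0],
    'Volume': [0.60, 0.70, 0.80, 0.90],
    'Salsa': [3.0, 4.5, 2.0, 4.0],
})

corr = sizes.corr()               # pairwise Pearson correlation
print(corr.round(2))

missing_share = sizes.isna().mean()   # fraction of NaN values per column
print(missing_share)
```

A pair of columns with a correlation close to 1 (or -1) carries largely the same information, so keeping both adds little to the model.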

Let's also drop any of the samples (that is, rows) in which one or more values is NaN ("not a number"), since missing values would otherwise throw off the calculations, and most scikit-learn models refuse to train on them:


burrito_data.dropna(inplace=True, axis=0)
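
To see what dropna does to the rows, here is a minimal sketch on a made-up three-row frame (the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical scores with two gaps
scores = pd.DataFrame({
    'Tortilla': [3.0, np.nan, 4.0],
    'Meat': [4.0, 3.5, np.nan],
    'overall': [3.8, 3.9, 4.2],
})

# Drop every row that contains at least one NaN
clean = scores.dropna(axis=0)
print(len(scores), len(clean))
```

Only the first row survives, because each of the other two is missing a value. On a real data set, it's worth comparing the row counts before and after to see how many samples you are giving up.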

Once you've done this, the data frame is ready to be used in a model. Separate out the X and y values:


y = burrito_data['overall']
X = burrito_data.drop(['overall'], axis=1)

The goal is now to create a model that describes, as well as possible, the way the values in X lead to a value in y. In other words, if you look at X.iloc[0] (that is, the input values for the first burrito sample) and at y.iloc[0] (that is, the output value for the first burrito sample), it should be possible to understand how those inputs map to that output. Moreover, after training the computer with the data, the computer should be able to predict the overall score of a burrito, given those same kinds of inputs.
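
The relationship between X, y and a single sample can be sketched on a toy frame. The names X_toy and y_toy, along with the values, are hypothetical and used only for illustration:

```python
import pandas as pd

# Hypothetical two-burrito data set
burritos = pd.DataFrame({
    'Tortilla': [3.0, 4.0],
    'Meat': [4.0, 3.5],
    'overall': [3.8, 3.9],
})

y_toy = burritos['overall']                  # outputs: one value per sample
X_toy = burritos.drop(['overall'], axis=1)   # inputs: one row per sample

first_inputs = X_toy.iloc[0]    # feature values for the first burrito
first_output = y_toy.iloc[0]    # overall score for the first burrito
print(X_toy.shape, y_toy.shape)
```

X_toy is two-dimensional (samples by features), while y_toy has a single dimension with one entry per sample, which is the layout scikit-learn expects.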

Creating a Model

Now that the data is in order, you can build a model. But which algorithm should you use for the model? (scikit-learn calls these algorithms "estimators"; one that predicts categories is often called a "classifier".) This is, in many ways, the big question in machine learning, and is often answerable only via a combination of experience and trial and error. The more machine-learning problems you work to solve, the more of a feel you'll get for the types of models you can try. However, there's always the chance that you'll be wrong, which is why it's often worth creating several different types of models and comparing their validity against one another. I plan to talk more about validity testing in my next article; for now, it's important to understand how to build a model.

Different algorithms are meant for different kinds of machine-learning problems. In this case, each sample in the data already has a known score, meaning that you can use a supervised learning model. The output from the model is a numeric score that ranges from 0 to 5, which means that you'll have to use a numeric model (also known as a regression model), rather than a categorical one (a classification model).

The difference is that a categorical model's outputs will (as the name implies) indicate into which of several categories, identified by integers, the input should be placed. For example, modern political parties hire data scientists who try to determine which way someone will vote based on input data. The result, namely a political party, is categorical.
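
As a rough sketch of what a categorical model looks like in scikit-learn, the following trains a nearest-neighbor classifier on invented data. The features (age and income), the party codes and the choice of KNeighborsClassifier are all assumptions made for illustration, not anything a real campaign is known to use:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical voters: [age, income in thousands]
X_voters = np.array([[25, 30], [30, 35], [60, 90], [65, 100]])
# Hypothetical categories, identified by integers: 0 = party A, 1 = party B
y_party = np.array([0, 0, 1, 1])

# With n_neighbors=1, each prediction copies the label of the closest sample
classifier = KNeighborsClassifier(n_neighbors=1)
classifier.fit(X_voters, y_party)

predicted = classifier.predict(np.array([[28, 32], [62, 95]]))
print(predicted)
```

The output is always one of the category labels seen during training, never a value in between.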

In this case, however, you have numeric data. In this kind of model, you expect the output to vary along a numeric range. A pricing model, determining how much someone might be willing to pay for a particular item or how much to charge for an advertisement, will use this sort of model.
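
For comparison, here is a minimal sketch of a numeric model. It uses scikit-learn's LinearRegression purely as an example of the fit-then-predict pattern; that choice, along with the made-up inputs and the rule used to generate the outputs, is an assumption for illustration and not necessarily the model best suited to the burrito data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical inputs: two feature scores per sample
X_demo = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0],
                   [4.0, 3.0], [5.0, 5.0]])
# Hypothetical outputs, generated from a known rule so the fit can be checked
y_demo = 0.5 * X_demo[:, 0] + 0.3 * X_demo[:, 1] + 1.0

model = LinearRegression()
model.fit(X_demo, y_demo)        # learn the mapping from inputs to outputs

# Predict the output for a sample the model has not seen
prediction = model.predict(np.array([[2.0, 2.0]]))
print(prediction)
```

Unlike a classifier, the model returns a floating-point value that can fall anywhere along the numeric range.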

I should note that if you want, you can turn the numeric data into categorical data simply by rounding or truncating the floating-point y values, such that you get integer values. It is this sort of transformation that you'll likely need to consider, try and test in a machine-learning project. And it's this myriad of choices and options that can make a data-science project so involved, drawing on your experience and insights as well as on brute-force tests of a variety of possible models.
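
Here is a small sketch of both transformations, using a hypothetical Series of scores rather than the real y values:

```python
import pandas as pd

# Hypothetical floating-point scores
y_scores = pd.Series([3.8, 4.2, 2.4, 4.9, 3.1])

y_rounded = y_scores.round().astype(int)   # nearest integer
y_truncated = y_scores.astype(int)         # fractional part discarded

print(y_rounded.tolist())
print(y_truncated.tolist())
```

Note that the two approaches assign some scores to different categories (3.8 becomes 4 when rounded but 3 when truncated), which is exactly the kind of choice you would want to test.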

______________________