Preparing Data for Machine Learning

When I go to Amazon.com, the online store often recommends products I should buy. I know I'm not alone in thinking that these recommendations can be rather spooky—often they're for products I've already bought elsewhere or that I was thinking of buying. How does Amazon do it? For that matter, how do Facebook and LinkedIn know to suggest that I connect with people whom I already know, but with whom I haven't yet connected online?

The answer, in short, is "data science", a relatively new field that marries programming and statistics in order to make sense of the huge quantity of data we're creating in the modern world. Within the world of data science, machine learning uses software to create statistical models to find correlations in our data. Such correlations can help recommend products, predict highway traffic, personalize pricing, display appropriate advertising or identify images.

So in this article, I take a look at machine learning and some of the amazing things it can do. I increasingly feel that machine learning is sort of like the universe—already vast and expanding all of the time. By this, I mean that even if you think you've missed the boat on machine learning, it's never too late to start. Moreover, everyone else is struggling to keep up with all of the technologies, algorithms and applications of machine learning as well.

For this article, I'm looking at a simple application of categorization and "supervised learning", solving a problem that has vexed scientists and researchers for many years: just what makes the perfect burrito? Along the way, you'll hopefully start to understand some of the techniques and ideas in the world of machine learning.

The Problem

The problem, as stated above, is a relatively simple one to understand: burritos are a popular food, particularly in southern California. You can get burritos in many locations, typically with a combination of meat, cheese and vegetables. Burritos' prices vary widely, as do their sizes and quality. Scott Cole, a PhD student in neuroscience, argued with his friends not only over where they could get the best burritos, but also over which factors made a burrito better or worse. Clearly, the best way to solve this problem was by gathering data.

Now, you can imagine a simple burrito-quality rating system, as used by such services as Amazon: ask people to rate the burrito on a scale of 1–5. Given enough ratings, that would indicate which burritos were best and which were worst.
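
As a minimal sketch of such a one-dimensional system, the Python below averages a handful of 1-5 ratings per burrito and ranks the results. The burrito names and ratings are invented for illustration; they aren't from Cole's data.

```python
from statistics import mean

# Hypothetical 1-5 ratings, invented for illustration only.
ratings = {
    "Burrito A": [5, 4, 4, 5],
    "Burrito B": [2, 3, 3, 2],
}

# One number per burrito: the average of all of its ratings.
averages = {name: mean(scores) for name, scores in ratings.items()}

# Rank the burritos from best to worst by average rating.
ranked = sorted(averages, key=averages.get, reverse=True)

for name in ranked:
    print(name, averages[name])
```

A ranking like this says which burrito people preferred, but nothing about why.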

But Cole, being a good researcher, understood that a simple, one-dimensional rating was probably not sufficient. A multi-dimensional rating system would keep ratings closer together (since they would be more focused), but it also would allow him to understand which aspects of a burrito were most essential to its high quality.

The result is documented on Cole's GitHub page, in which he describes the meticulous and impressive work that he and his fellow researchers did, bringing tape measures and scales to lunch (in order to measure and weigh the burritos) and sacrificing themselves for the betterment of science.

Beyond the amusement factor—and I have to admit, it's hard for me to stop giggling whenever I read about this project—this can be seen as a serious project in data science. By creating a machine-learning model, you can not only describe burrito quality, but you also can determine, without any cooking or eating, the quality of a potential or theoretical burrito.

The Data

Once Cole established that he and his fellow researchers would rate burritos along more than one dimension, the next obvious question was: which dimensions should be measured?

This is a crucial question to ask in data science. If you measure the wrong things, then even with the best analysis methods, your output and conclusions will be wrong. Indeed, a fantastic new book, Weapons of Math Destruction by Cathy O'Neil, shows how the collection and usage of the wrong inputs can lead to catastrophic results for people's jobs, health care and safety.

So, you want to measure the right things. But it's just as important to measure distinct things. For many kinds of statistical analysis to work well, each of your measures should be as independent of the others as possible. For example, let's assume that the size of the burrito will be factored into the quality measurement. You don't want to measure both the volume and the length, because those two factors are related: a longer burrito generally has a larger volume. It's often difficult or impossible to separate two related factors completely, but you can and should try to do so.
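
One rough way to check whether two measures are related is to compute the correlation between them. The sketch below does this in plain Python for a few hypothetical measurements; the numbers and variable names are invented for illustration and are not Cole's data. A Pearson coefficient near 1 or -1 suggests that two measures largely carry the same information, while a value near 0 suggests they're closer to independent (at least linearly).

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical measurements for five burritos (invented values).
length_cm = [17.0, 18.5, 20.0, 21.5, 23.0]
volume_l = [0.60, 0.68, 0.77, 0.85, 0.95]
temp_rating = [4.0, 2.5, 4.5, 3.0, 3.5]

r_len_vol = pearson(length_cm, volume_l)      # length vs. volume
r_len_temp = pearson(length_cm, temp_rating)  # length vs. temperature rating

print(f"length vs. volume:      {r_len_vol:.2f}")
print(f"length vs. temperature: {r_len_temp:.2f}")
```

With these made-up numbers, length and volume are almost perfectly correlated, so keeping both would add little, whereas length and the temperature rating are only weakly related.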

At the same time, consider how this research is being done. Researchers are going into the field (which is researcher-speak for "going out to lunch") and eating their burritos. They might have only one chance to collect data. This means it'll likely make sense to collect more data than necessary, and then use only some of it in creating the model. This is known as "feature selection" and is an important aspect of building a machine-learning model.
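
In its simplest form, using only some of the collected data just means picking which columns go into the model. Here is a minimal sketch, assuming each burrito is recorded as a dictionary; the field names and values are hypothetical, chosen for illustration rather than taken from Cole's data set.

```python
# Everything collected in the field (hypothetical records and field names).
collected = [
    {"location": "Shop A", "cost": 6.50, "volume": 0.70, "length": 19.0,
     "temp": 4.0, "salsa": 3.5, "overall": 4.0},
    {"location": "Shop B", "cost": 5.25, "volume": 0.62, "length": 17.5,
     "temp": 3.0, "salsa": 2.5, "overall": 3.0},
    {"location": "Shop C", "cost": 7.75, "volume": 0.88, "length": 22.0,
     "temp": 4.5, "salsa": 4.0, "overall": 4.5},
]

# The features chosen for the model. Here "length" is left out because
# "volume" already describes size, and "location" is just a label.
selected_features = ["cost", "volume", "temp", "salsa"]
target = "overall"

# X holds one row of selected inputs per burrito; y holds the value to predict.
X = [[row[name] for name in selected_features] for row in collected]
y = [row[target] for row in collected]

print(X)
print(y)
```

The unused fields aren't thrown away. They stay in the collected data, so a different set of features can be tried later without going back out to lunch.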

Cole and his colleagues decided to measure ten different aspects of burrito quality, ranging from volume to temperature to salsa quality. They recorded the price as well to see whether price was a factor in quality. They also had two general measurements: an overall rating and a recommendation. All of these measurements were taken on a 0–5 scale, with 0 indicating that it was very bad and 5 indicating that it was very good.
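
A single observation of this kind could be represented and sanity-checked in Python roughly as follows. This is only a sketch: the class name, the field names and the choice to range-check the scores while treating the price as a plain amount are my assumptions, not a description of how Cole stored his data.

```python
from dataclasses import dataclass

# Bounds of the rating scale described above.
SCALE_MIN, SCALE_MAX = 0.0, 5.0

@dataclass
class BurritoReview:
    """One reviewer's measurements for one burrito (hypothetical layout)."""
    cost: float            # price paid; not on the 0-5 scale
    ratings: dict          # per-aspect scores, e.g. {"temp": 4.0}
    overall: float         # general rating
    recommendation: float  # general recommendation

    def __post_init__(self):
        # Check every score, including the two general ones, against the scale.
        scores = dict(self.ratings, overall=self.overall,
                      recommendation=self.recommendation)
        for name, score in scores.items():
            if not SCALE_MIN <= score <= SCALE_MAX:
                raise ValueError(f"{name} must be between 0 and 5, got {score}")

# Example with invented values.
review = BurritoReview(cost=6.50, ratings={"temp": 4.0, "salsa": 3.5},
                       overall=4.0, recommendation=4.5)
print(review)
```

Checking the range at the moment a review is recorded catches data-entry mistakes early, before they can quietly skew a model.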
