Unsupervised Learning

In my last few articles, I've looked into machine learning and how you can build a model that describes the world in some way. All of the examples I looked at were of "supervised learning", meaning that you loaded data that already had been categorized or classified in some way, and then created a model that "learned" the ways the inputs mapped to the outputs. With a good model, you then were able to predict the output for a new set of inputs.

Supervised learning is a very useful technique and is quite widespread. But there is another set of techniques in machine learning known as unsupervised learning. These techniques, broadly speaking, ask the computer to find the hidden structure in the data. In other words, they ask it to "learn" what the data means: what relationships it contains, which features are important, and which records should be considered outliers or anomalies.

Unsupervised learning also can be used for what's known as "dimensionality reduction", in which the model functions as a preprocessing step, reducing the number of features in order to simplify the inputs that you'll hand to another model.
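As a rough illustration of dimensionality reduction, here is a minimal sketch using principal component analysis (PCA), which is only one of several techniques scikit-learn offers for this purpose and is my choice for the example rather than anything the text prescribes. It assumes scikit-learn is installed and uses a small flower dataset that ships with the library as stand-in input; the choice of two output features is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Stand-in input: 150 rows with 4 numeric features each.
X = load_iris().data

# Ask PCA to squeeze the 4 original features into 2 new ones.
# (2 is an arbitrary choice for this sketch.)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)

# Fraction of the original variance that each new feature retains.
print(pca.explained_variance_ratio_)
```

The reduced array, X_reduced, is what you would then hand to another model in place of the original features.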

In other words, in supervised learning, you teach the computer about your data and hope that it understands the relationships and categorization well enough to successfully categorize data it hasn't seen before.

In unsupervised learning, by contrast, you're asking the computer to tell you something interesting about the data.

This month, I take an initial look at the world of unsupervised learning. Can a computer categorize data as well as a human? How can you use Python's scikit-learn to create such models?

Unsupervised Learning

There's a children's card game called Set that is a useful way to think about machine learning. Each card in the game contains a picture of one, two or three shapes. There are several different shapes, and each shape has a color and a fill pattern. In the game, players are supposed to identify groups of three cards that share any one of those properties. Thus, you could create a group based on the color green, in which all cards are green (but may differ in the number of shapes, the kind of shape and the fill pattern). Or you could create a group based on the number of shapes, in which every card has two shapes, but those shapes can be of any color, any kind and any fill pattern.

The idea behind the game is that players can create a variety of different groups and should take advantage of this in order to win the game.

I often think of unsupervised learning as asking the computer to play a game of Set. You give the computer a dataset and ask it to divide that large bunch of data into separate categories. The model may choose any feature, or set of features, and that might (or might not) be a feature that humans would consider to be important. But it will find those connections, or at least try to do so.

One of the most common datasets for machine-learning beginners is the "iris" dataset, containing measurements of 150 flowers, 50 from each of three types of irises. Several months ago, I showed how you could create a supervised model to identify irises. In other words, you could create and train a model that would categorize irises accurately based on their petal and sepal sizes.
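To make the shape of that data concrete, here is a short sketch that loads the copy of the iris dataset bundled with scikit-learn and counts the flowers of each type. It assumes scikit-learn is installed; the variable names are mine.

```python
from collections import Counter

from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# One row per flower, one column per measurement.
print(X.shape)

# The four measurements: sepal length/width and petal length/width.
print(iris.feature_names)

# How many flowers belong to each of the three types.
counts = Counter(str(iris.target_names[label]) for label in y)
print(counts)
```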

Can unsupervised learning achieve the same goal? That is, can you create a model that will divide the flowers into three different groups, doing the same job (or close to it) that humans have done?
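One way to try this is sketched below, using k-means clustering. K-means is my choice for illustration, not the only option, and it needs to be told in advance how many groups to look for. The sketch assumes scikit-learn is installed; the n_init and random_state values are arbitrary settings chosen only to make the run repeatable. Notice that the model never sees the species labels.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Only the measurements are used; the species labels are ignored on purpose.
X = load_iris().data

# Ask for three groups, matching the number of iris types.
model = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = model.fit_predict(X)

# Number of flowers assigned to each cluster.
print(np.bincount(cluster_ids))

# The "typical" flower (mean measurements) at the center of each cluster.
print(model.cluster_centers_)
```

The cluster numbers (0, 1 and 2) are arbitrary names chosen by the algorithm; they carry no meaning of their own and need not line up with the numbering of the species.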

Another way of asking this question is whether the way in which biologists distinguish between varieties of flowers is supported by the underlying measurement data.
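A hedged sketch of how you might check this: cluster the measurements without labels, then compare the resulting groups against the biologists' labels. The comparison here uses a contingency table and the adjusted Rand index, both from scikit-learn; picking those two measures (and k-means as the clustering method) is my assumption, not something the question itself dictates.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.cluster import contingency_matrix

iris = load_iris()

# Cluster using the measurements only.
cluster_ids = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(iris.data)

# Rows are the true species, columns are the clusters the model found.
# A perfect match would put all 50 flowers of each species in one column.
table = contingency_matrix(iris.target, cluster_ids)
print(iris.target_names)
print(table)

# Adjusted Rand index: 1.0 means the groupings agree exactly,
# values near 0 mean agreement no better than chance.
score = adjusted_rand_score(iris.target, cluster_ids)
print(score)
```

The adjusted Rand index ignores how the clusters happen to be numbered, so it is a fairer comparison than checking whether cluster 0 equals species 0.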