Unsupervised Learning

Let's load the iris data and then start to create an unsupervised model. Assuming that I'm working within the Jupyter notebook, I can execute the following:


%pylab inline
import pandas as pd
from pandas import DataFrame, Series

from sklearn.datasets import load_iris
iris = load_iris()

df = DataFrame(iris.data, columns=iris.feature_names)
df['response'] = iris.target

In other words, I've created a Pandas data frame containing five columns—the four features and also the response (that is, the classification). You won't be passing the classification to the model (although that might improve the model's ability to classify the flowers), but it's convenient to keep everything together in this way.
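Before going further, it can help to confirm that the data frame looks the way the description above says it should. The following is a minimal sketch of such a check; it rebuilds the same data frame so that it can run on its own, and the particular checks shown are just one reasonable choice:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['response'] = iris.target

# Number of rows and columns: one row per flower,
# four feature columns plus the 'response' column
print(df.shape)

# Column names: the four measurements, then 'response'
print(list(df.columns))

# How many flowers carry each classification code
print(df['response'].value_counts())
```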

Creating a Model

Once you've loaded the data, it's time to create a model. You're looking to do what's known as "clustering", which means that the computer will divide the data set into categories or clusters.

So, now what? In supervised learning, you would create a new model from a classifier and then train it using scikit-learn's "fit" method. You then could give the trained model one or more data points and ask it to categorize those based on the model.
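For comparison, here is a rough sketch of that supervised workflow. The choice of KNeighborsClassifier and the two sample measurements are assumptions made for illustration only; any scikit-learn classifier follows the same fit-then-predict pattern:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# Supervised: the model sees both the features (X) and the known answers (y)
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X, y)

# Two illustrative flowers: sepal length, sepal width,
# petal length, petal width (all in cm)
new_points = [[5.1, 3.5, 1.4, 0.2],
              [6.7, 3.0, 5.2, 2.3]]

# predict returns one category code per data point
predicted = clf.predict(new_points)
print(iris.target_names[predicted])
```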

In unsupervised learning, it's a bit trickier—after all, you're asking the computer to do the categorization. If you don't have any pre-labeled categories, it's going to be hard to know whether your model is useful, accurate or both.

But before getting into the evaluation, let's build a model. Sklearn comes with a number of algorithms that handle clustering. One popular algorithm is known as "K-means". In K-means clustering, the idea is that the model puts each data point inside the cluster whose mean (also called its center or centroid) is the closest. Thus, if there are three clusters, each one will contain the points that are closer to its mean than to the means of the other two. The "inertia" is a measurement of how coherent the groups are: it is the sum of the squared distances between each point and the mean of its cluster, so a lower inertia means the points that have been grouped together sit more tightly around their centers.
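To make the assignment rule and the inertia described above concrete, here is a small sketch in plain NumPy. The six points and two centers are made up for illustration, and this shows only a single assignment step; a full K-means run would also move the centers and repeat until the assignments stop changing:

```python
import numpy as np

# Six made-up 2-D points and two made-up cluster centers
points = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.0],
                   [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]])
centers = np.array([[1.0, 1.0], [8.5, 9.0]])

# Squared Euclidean distance from every point to every center;
# the result has one row per point and one column per center
diffs = points[:, np.newaxis, :] - centers[np.newaxis, :, :]
sq_dist = (diffs ** 2).sum(axis=2)

# Each point joins the cluster whose center is closest
labels = sq_dist.argmin(axis=1)

# Inertia: sum of squared distances from each point to its own center
inertia = sq_dist.min(axis=1).sum()

print(labels)   # [0 0 0 1 1 1]
print(inertia)  # 4.0
```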

I should note that because K-means uses distances to calculate how to compose a group, you likely will want all of your features to be on the same scale. In the case of the flowers, all are within the same order of magnitude. But, you can imagine that if three measurements are on a scale of 1–10 and a fourth is on a scale of 1–1 million, the fourth would dominate every distance calculation, and the other three would barely affect which cluster a point lands in. For this reason, it can be a good idea to use a scaler (several of which come with sklearn) to put all of your data onto the same scale. Such scaling is often important when creating models; it lets every feature contribute to the decision about whether two or more items are close together.
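As a sketch of what such scaling might look like, the following uses StandardScaler, one of the scalers that ships with sklearn. Choosing this scaler over the others is an assumption made for illustration; it rescales each feature to have a mean of 0 and a standard deviation of 1:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# Learn each column's mean and standard deviation, then rescale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After scaling, every column has (approximately) mean 0 and std 1
print(np.round(X_scaled.mean(axis=0), 3))
print(np.round(X_scaled.std(axis=0), 3))
```

The scaled matrix then could be passed to the clustering model in place of the original measurements.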

So, using Python's scikit-learn, you can say:


from sklearn.cluster import KMeans
k = KMeans(n_clusters=3)

The above code indicates that you're going to use the K-means algorithm. You create a new model, indicating when you do so that you want three groups.

Now, right away you might be asking yourself how to know that there will be three categories—and the cop-out answer is that you guess. You can try different values for n_clusters and evaluate the model to see how well it does. But in many cases, you'll have to experiment a bit.
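One way to run that kind of experiment is to fit a model for each candidate number of clusters and compare the resulting inertia values. The sketch below is only one possible approach; the range of 1 to 6 clusters and the n_init and random_state settings are assumptions chosen to make the run repeatable:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data

# Fit one model per candidate cluster count and record its inertia
inertias = {}
for n in range(1, 7):
    model = KMeans(n_clusters=n, n_init=10, random_state=0)
    model.fit(X)
    inertias[n] = model.inertia_

for n, value in inertias.items():
    print(n, round(value, 1))
```

Inertia keeps shrinking as clusters are added, so the lowest number is not automatically the best. A common rule of thumb is to look for the point at which adding another cluster stops producing a large drop.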

Let's now run K-means on the data. The X (that is, input matrix) is going to be the data frame, minus the "response" column. You can create that as follows:


X = df.drop('response', axis=1)

With supervised learning, the "fit" method is the process in which you teach the model to make associations between the input matrix X and the output vector y. In unsupervised learning, there is no y to pass in; you're asking the model itself to make such divisions and to create an output vector. You do this with "fit", giving it only X:


k.fit(X)
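Once "fit" has run, the model holds the output vector it created. The following sketch repeats the earlier steps so that it can run on its own, then inspects the result; the n_init and random_state arguments and the cross-tabulation against the "response" column are additions for illustration, not part of the code above:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['response'] = iris.target
X = df.drop('response', axis=1)

# random_state makes the run repeatable (an assumption for this sketch)
k = KMeans(n_clusters=3, n_init=10, random_state=0)
k.fit(X)

# labels_ is the output vector: one cluster number per flower
print(k.labels_[:10])

# cluster_centers_ holds the mean of each cluster, one row per cluster
print(k.cluster_centers_.shape)

# Compare the clusters with the known classifications
print(pd.crosstab(df['response'], k.labels_,
                  rownames=['species'], colnames=['cluster']))
```

Keep in mind that the cluster numbers are arbitrary. Cluster 0 need not correspond to classification 0, so the table should be read by looking at which cluster holds most of each species.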
