Unsupervised Learning

Evaluating the Model

The first question you'll ask the model is: "How did it divide up the flowers?" You know that the irises should be divided into three different groups, each with 50 flowers. How did K-means do?

You can ask the model itself using a variety of attributes. These attributes end with an underscore (_), which is scikit-learn's convention for values that are learned from the data when the model is fit, and that may therefore change if the model is trained more.

And indeed, this is an important point to make. When you invoke the "fit" method, you are teaching the model from scratch. However, there are times when you have so much data that you cannot reasonably teach the model all at once. For such cases, you might want to try an algorithm that supports the "partial_fit" method, which allows you to feed in inputs a little bit at a time, teaching the model iteratively. Not all algorithms support partial_fit, though, so a large number of data points might force your hand and reduce the number of algorithms from which you can choose.
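
To make the idea concrete, here is a minimal sketch of incremental training. It assumes scikit-learn's MiniBatchKMeans class (a K-means variant that does offer partial_fit) and uses the iris measurements as a stand-in for a data set that's too big to load at once; the batch size and variable names are my own choices for illustration:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import load_iris

X = load_iris().data  # stand-in for data too large to fit in one go

# Shuffle the rows so that each batch contains a mix of flowers
rng = np.random.default_rng(0)
X_shuffled = X[rng.permutation(len(X))]

model = MiniBatchKMeans(n_clusters=3, n_init=3, random_state=0)

# Teach the model 30 rows at a time instead of calling fit() once
for batch in np.array_split(X_shuffled, 5):
    model.partial_fit(batch)

print(model.cluster_centers_.shape)  # one row per cluster, one column per measurement
```

Each call to partial_fit updates the existing cluster centers rather than starting over, which is what distinguishes it from fit.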

In this example, you cannot teach the model incrementally: scikit-learn's KMeans class doesn't provide partial_fit (although the related MiniBatchKMeans class does). Let's ask the model for its measure of inertia:
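
A minimal sketch of that query follows. The names k and X are assumptions on my part (they may differ from the variables you created earlier), and I've fixed random_state and n_init only so that repeated runs behave the same way:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data  # assumed name for the 150 x 4 iris measurements

k = KMeans(n_clusters=3, n_init=10, random_state=0)  # hypothetical variable name
k.fit(X)

# inertia_ only exists after fit() has been called
print(k.inertia_)
```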


(Again, notice the trailing underscore.) The value that I get is 78.9. Inertia isn't measured on a fixed scale; it's the sum of the squared distances from each data point to the center of its cluster. The general sense is that the lower the inertia score, the better, with zero being the best.
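
If you want to see where that number comes from, you can recompute it by hand. The following is a sketch under the same assumptions as before (iris loaded via load_iris, my own variable names): it adds up the squared distance from each flower to the center of the cluster it was assigned to and prints that next to the model's own figure:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data
k = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# For each flower, look up the center of its assigned cluster: shape (150, 4)
assigned_centers = k.cluster_centers_[k.labels_]

# Sum of squared distances from each flower to that center
manual_inertia = ((X - assigned_centers) ** 2).sum()

print(manual_inertia, k.inertia_)
```

The two numbers should agree to within floating-point rounding, which also shows why inertia depends on the units and spread of your data.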

What if I were to divide the flowers into only two groups, or four groups? Using scikit-learn, I can do that pretty quickly and determine whether the computer thinks the manual classification (into three groups) was a good choice:

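from pandas import DataFrame
from sklearn.cluster import KMeans

# X: the iris measurements loaded earlier (assumed name)
output = []
for i in range(2, 20):
    model = KMeans(n_clusters=i)
    model.fit(X)  # inertia_ is available only after fitting
    output.append((i, model.inertia_))
kmeans = DataFrame(output, columns=['i', 'inertia'])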

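Now, it might seem ridiculous to group 150 flowers into up to 19 different groups! And indeed, the lowest inertia value that I get is when I set n_clusters=19, with the inertia rising as the number of groups goes down. That is to be expected: every additional cluster center lets the flowers sit closer to their nearest center, so inertia generally keeps falling as n_clusters grows.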

Perhaps this means that every flower is unique and cannot be categorized? Perhaps. But inertia alone can't settle that question, and it seems more likely that our data isn't a great fit for K-means. Maybe it's the wrong shape. Maybe its values aren't varied enough. And indeed, when you look at the way in which the flowers were clustered for n_clusters=3, you see that the clustering was quite different from what people came up with. I can turn the automatically labeled flowers into a Pandas Series, and then count how many flowers were assigned to each group:
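
Here's a sketch of that count, again assuming the iris data comes from load_iris and using my own variable names; the labels_ attribute holds one cluster number per flower:

```python
from pandas import Series
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data
k = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# labels_ contains a cluster number (0, 1 or 2) for each of the 150 flowers
counts = Series(k.labels_).value_counts()
print(counts)
```

The cluster numbers themselves are arbitrary, so which label ends up with which count can differ from one run to the next.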


I get:

2    62
1    50
0    38

Well, it could be worse—but it also could be much better. Perhaps you can and should try another algorithm and see if it's better able to group the flowers together.

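I should note that comparing clusters against known labels goes beyond purely unsupervised learning. It is sometimes loosely described as "semi-supervised learning," although that term more commonly refers to training on a mix of labeled and unlabeled data. Here, the point is to see whether an unsupervised technique can achieve the same results, or at least similar results, as a classification that was previously made by other means.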

In such a case, you can evaluate your model using not just statistical tests, but also one of the techniques I described in my previous articles on supervised learning, namely a train-test split (scikit-learn's train_test_split function). You fit the unsupervised model on a portion of the input data and then predict on the remaining part. Comparing the model's outputs with the expected outputs for that subset can help you evaluate and tune your model. Keep in mind that cluster numbers are arbitrary, so cluster 0 need not correspond to species 0.
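
As a rough sketch of that workflow, the following code holds back a quarter of the flowers, fits K-means on the rest, and predicts clusters for the held-out portion. The comparison metric is my own choice rather than anything prescribed above: adjusted_rand_score ignores the actual label numbers and measures only whether the same flowers were grouped together, with 1.0 meaning perfect agreement and values near 0 meaning agreement no better than chance.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0,
    stratify=iris.target)

# Fit on the training portion only; the known species are not used here
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)

# Assign each held-out flower to the nearest learned cluster center
predicted = model.predict(X_test)

# Compare the groupings, not the raw label numbers
score = adjusted_rand_score(y_test, predicted)
print(score)
```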

A Different Algorithm

But in this case, let's try using a different model to achieve a different result, simply to see how easily sklearn allows you to try different models. One common choice in unsupervised learning is Gaussian Mixture, known in previous versions of scikit-learn as GMM. Let's use it:

from sklearn.mixture import GaussianMixture
model = GaussianMixture(n_components=3)
model.fit(X)  # X: the iris measurements loaded earlier (assumed name)

Now, let's have the model predict with the data used to train it, which will return a NumPy array with the categories:
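
A minimal sketch of the fit-then-predict step, assuming the iris data is loaded with load_iris (the variable names and the fixed random_state are mine):

```python
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X = load_iris().data
model = GaussianMixture(n_components=3, random_state=0)
model.fit(X)

# predict() returns a NumPy array with one component number per flower
labels = model.predict(X)
print(labels[:10])
```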


How did that do? Let's pop this data into a Pandas Series object and then count the values:
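
Counting works just as it did for K-means. This sketch makes the same assumptions as before about how the data is loaded and named:

```python
from pandas import Series
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X = load_iris().data
model = GaussianMixture(n_components=3, random_state=0).fit(X)

# Count how many flowers were assigned to each mixture component
gmm_counts = Series(model.predict(X)).value_counts()
print(gmm_counts)
```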


And sure enough, the results:

2    55
1    50
0    45