Classifying Text

How, then, can you deal with textual data? It's true that bytes are numbers, but that won't really help here; you want to deal with words and sentences, not with individual characters.

The answer is to turn documents into a DTM, a "document-term matrix" in which each row represents a document, each column represents a word used somewhere across the documents, and each cell indicates how many times (if at all) that word appears in that document.

For example, take the following three sentences:

  • I'm hungry, and need to eat lunch.

  • Call me, and we'll go eat.

  • Do you need to eat?

Let's turn the above into a DTM:

i'm  hungry  and  need  to  eat  lunch  call  me  we'll  go  do  you
  1       1    1     1   1    1      1     0   0      0   0   0    0
  0       0    1     0   0    1      0     1   1      1   1   0    0
  0       0    0     1   1    1      0     0   0      0   0   1    1
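To make the mechanics concrete, here's a minimal, hand-rolled sketch of the same DTM in pure Python. The tokenize helper (and its regex) is my own simplification; real tokenizers handle many more edge cases:

```python
import re
from collections import Counter

sentences = [
    "I'm hungry, and need to eat lunch.",
    "Call me, and we'll go eat.",
    "Do you need to eat?",
]

def tokenize(text):
    # lowercase, then grab runs of letters and apostrophes,
    # which keeps contractions like "i'm" together
    return re.findall(r"[a-z']+", text.lower())

# the columns: every word used across the documents,
# in order of first appearance
vocabulary = []
for sentence in sentences:
    for word in tokenize(sentence):
        if word not in vocabulary:
            vocabulary.append(word)

# the rows: one per document, counting each vocabulary word
dtm = [[Counter(tokenize(sentence))[word] for word in vocabulary]
       for sentence in sentences]
```

The resulting dtm is exactly the table above, as a list of lists.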

Now, this DTM certainly does a good job of summarizing which words appeared in which documents. But with just three short sentences that I constructed to have overlapping vocabulary, it's already starting to get fairly wide. Imagine what would happen if you were to categorize a large number of documents; the DTM would be massive! Moreover, the DTM would mostly consist of zeros.

For this reason, a DTM usually is implemented as a "sparse matrix", which stores only the coordinates and values of the non-zero entries. That tends to crunch down its size, and thus its processing time, quite a lot.
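SciPy's sparse-matrix types are a common way to do this in Python. A short sketch, assuming SciPy and NumPy are installed, using the DTM from above:

```python
import numpy as np
from scipy.sparse import csr_matrix

# the dense DTM from the example above
dense = np.array([
    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1],
])

# CSR ("compressed sparse row") stores only the non-zero entries
sparse = csr_matrix(dense)

print(sparse.nnz)  # 18 non-zero entries, instead of 39 stored cells
```

Here the savings are modest, but on a real corpus, where the vast majority of cells are zero, the difference is dramatic.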

It's this DTM that you'll feed into your model. Because it's numeric, the model can handle it and, thus, can make predictions. Note that you'll actually need to make two different DTMs: one for training the model and another for the text you want to categorize. Both DTMs must have the same columns, based on the vocabulary learned from the training documents; otherwise, the model's inputs won't line up.
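In Python, scikit-learn's CountVectorizer handles both steps: fit_transform() learns the vocabulary from the training documents and builds their DTM (as a sparse matrix), and transform() builds a second DTM from new text using that same vocabulary, so the columns line up. A minimal sketch, assuming scikit-learn is installed; note that its default tokenizer differs a bit from the table above (it drops one-letter tokens, for example):

```python
from sklearn.feature_extraction.text import CountVectorizer

training_docs = [
    "I'm hungry, and need to eat lunch.",
    "Call me, and we'll go eat.",
    "Do you need to eat?",
]

vectorizer = CountVectorizer()

# DTM #1: learn the vocabulary and count words in the training documents
training_dtm = vectorizer.fit_transform(training_docs)

# DTM #2: count words in new text, reusing the training vocabulary
# so that its columns match the training DTM's columns
new_dtm = vectorizer.transform(["Do you want to eat lunch?"])
```

Words in the new text that never appeared during training (such as "want" here) are simply ignored, since the model has no column for them.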

Creating a DTM

I decided to do a short experiment to see if I could create a machine-learning model that knows how to differentiate between Python and Ruby code. Not only do I have a fair amount of such code on my computer, but the languages have similar vocabularies, and I was wondering how accurately a model could actually do some categorization.

So, the first task was to create a Python list of strings, with a parallel list of numeric categories. I did this using some list comprehensions, as follows:

from glob import glob

# read Ruby files
ruby_files = [open(filename).read()
              for filename in glob("Programs/*.rb")]

# read Python files
python_files = [open(filename).read()
                for filename in glob("Programs/*.py")]

# all input files
input_text = ruby_files + python_files

# set up categories
input_text_categories = [0] * len(ruby_files) + [1] * len(python_files)

After this code is run, I have a list (input_text) of strings and another list (input_text_categories) of integers representing the two categories into which these strings should be classified: 0 for Ruby and 1 for Python.