Examining Data Using Pandas
You don't need to be a data scientist to use Pandas for some basic analysis.
Traditionally, people who program in Python use the data types that come with the language, such as integers, strings, lists, tuples and dictionaries. Sure, you can create objects in Python, but those objects typically are built out of those fundamental data structures.
If you're a data scientist working with Pandas though, most of your time is spent with NumPy. NumPy might feel like a Python data structure, but it acts differently in many ways. That's not just because all of its operations work via vectors, but also because the underlying data is actually a C-style array. This makes NumPy extremely fast and efficient, consuming far less memory for a given array of numbers than traditional Python objects would do.
The thing is, NumPy is designed to be fast, but it's also a bit low level for some people. To get more functionality and a more flexible interface, many people use Pandas, a Python package that provides two basic wrappers around NumPy arrays: one-dimensional Series objects and two-dimensional Data Frame objects.
I often describe Pandas as "Excel within Python", in that you can perform all sorts of calculations as well as sort data, search through it and plot it.
For all of these reasons, it's no surprise that Pandas is a darling of the data science community. But here's the thing: you don't need to be a data scientist to enjoy Pandas. It has a lot of excellent functionality that's good for Python developers who otherwise would spend their time wrestling with lists, tuples and dictionaries.
So in this article, I describe some basic analysis that everyone can do with Pandas, regardless of whether you're a data scientist. If you ever work with CSV files (and you probably do), I definitely recommend thinking about using Pandas to open, read, analyze and even write to them. And although I don't cover it in this article, Pandas handles JSON and Excel very well too.Creating Data Frames
Although it's possible to create a data frame from scratch using Python data structures or NumPy arrays, it's more common in my experience to do so from a file. Fortunately, Pandas can load data from a variety of file formats.
Before you can do anything with Pandas, you have to load it. In a Jupyter notebook, do:
%pylab inline import pandas as pd
For example, Python comes with a
csv module that knows how to handle
files in CSV (comma-separated value) format. But, then you need to
iterate over the file and do something with each of those
lines/rows. I often find it easier to use Pandas to work with
such files. For example, here's a CSV file:
a,b,c,d e,f,g,h "i,j",k,l,m n,o.p,q
You can turn this into a data frame with:
df = pd.read_csv('mycsv.csv')
Now you can view the contents of the data frame in Jupyter with:
The thing is, there are all sorts of CSV files you probably don't
even think of as CSV files. For example, consider the Linux
/etc/passwd file. It's not really a CSV file, but if you think about
it for a moment, you'll realize that the format is the same. Each
record is on a single line, and the fields are separated with ":"
characters. So in this case, you'll need to use the
sep parameter to indicate
filename = '/etc/passwd' df = pd.read_csv(filename, sep=':') df.head()
df.head(), much like the UNIX
head utility, shows the
first few rows of the data frame. I often use this method to inspect
the data and make sure it came through okay.
In this particular case, the data came through just fine, but the first line (the root user's record!) was interpreted as header rows. Fortunately, you can fix that with another parameter:
df = pd.read_csv(filename, sep=':', header=None) df.head()
If you pass the
header parameter, you're telling Pandas which
row of the file should be considered the headers. But if you pass
as a value, you're telling Pandas that none of the rows is the
header. That's fine, but then you're going to get integers (starting
with 0) as your column names. I'd rather use real names, so
to specify them, do:
df = pd.read_csv(filename, sep=':', header=None, names=['username', 'password', 'uid', 'gid', 'name', 'homedir', 'shell']) df.head()
The thing is, do you really need the
password column? Given that on
modern computers, it'll always contain "x", I think that you can leave
it out. So, you can say:
df = pd.read_csv(filename, sep=':', header=None, usecols=[0,2,3,4,5,6], names=['username', 'uid', 'gid', 'name', 'homedir', 'shell']) df.head()
usecols parameter indicates which columns you want to read from
Now, why would you want to use Pandas on your /etc/passwd file? You probably don't, but as you can imagine, if it works on that file, it'll work on other files too.
For example, many cities now are making data available to the general public. Chicago publishes much of its data online, so let's take a look and see, for example, what you can find out about what cars were towed in the last 90 days in Chicago.
You can go to the Chicago data portal at https://data.cityofchicago.org/browse. I clicked "towed vehicles"→"export"→"CSV" to get a CSV file. Then I took the downloaded file and imported it into Pandas with:
df = pd.read_csv('towed.csv')
Nah, I'm just kidding. You think that I have the time and energy to download the file and then load it? One of my favorite features of Pandas is the fact that most methods that work with files also can work with URLs. So, I can say:
url = 'https://data.cityofchicago.org/api/views/ygr5-vcbg/ ↪rows.csv?accessType=DOWNLOAD' df = pd.read_csv(url)
(Note that the URL might well have changed by the time you read this, or it might be keyed to my session information. If not, all the better, from my perspective!)
I then take a look at the data frame and discover that it has headers, that the separator is a comma and that it worked just fine. The only adjustment I'm going to make is to parse the "tow date" column as a date, so I can play with that a bit. I also indicate that because the date is in US format, the day comes first:
df = pd.read_csv(url, parse_dates=['Tow Date'], dayfirst=False) df.head()
You'll notice that the first column now appears a bit differently, in
year-month-day format. That shows there is a datestamp. You also
can see that if you run the
df.info method, which tells you about the
data frame itself:
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5560 entries, 0 to 5559 Data columns (total 10 columns): Tow Date 5560 non-null datetime64[ns] Make 5537 non-null object Style 5538 non-null object Model 509 non-null object Color 5536 non-null object Plate 4811 non-null object State 5392 non-null object Towed to Address 5560 non-null object Tow Facility Phone 5559 non-null object Inventory Number 5560 non-null int64 dtypes: datetime64[ns](1), int64(1), object(8) memory usage: 434.5+ KB
As you can see, the "Tow Date" column is now being stored as a
datetime64, a 64-bit field containing a date and time.
So, now that you have this data, what can you do with it? One of my
favorite methods is
value_counts, which tells how many times a
particular value appears in a column. So, you can say the following to find
out how many vehicles were towed from each state:
that's going to be a long list, so it's probably better to limit it by
head on the series you get back from
Now you'll see which ten states have the most towed vehicles in Chicago in the last 90 days:
IL 4948 IN 148 TX 48 WI 48 MN 28 IA 27 MI 19 GA 14 FL 14 TN 13
Not surprisingly, most of the vehicles towed in Chicago were from Illinois, with the second-highest number from Indiana, which is nearby. Less expected (to me, at least) was the large number of towed vehicles from Texas, which is not exactly nearby.
To make use of the fact that the timestamp is now a true
datetime object, you can find out the most
common dates on which vehicles were towed:
2018-03-03 215 2018-03-04 195 2018-02-24 165 2018-03-25 148 2018-03-26 140 2018-03-24 135 2018-03-15 126 2018-03-21 122 2018-02-03 120 2018-03-22 117
As you can see, there's a great deal of variation. Maybe that
variation is due to the day of the week? In order to find out,
you can use the
dt proxy object Pandas provides for
columns. You can use that to extract information from the
column and then analyze it.
For example, you can say:
This retrieves the day of the week for each of the tow dates. Then you can
value_counts to summarize how many vehicles were towed on
each day of the week:
The results are as follows:
5 1143 4 831 3 820 6 798 2 794 0 640 1 534
The above indicates that Chicago towed far more cars on Fridays than on other days. This might always be true, or it might just be true for this 90-day sample.
You also can check to see what brand of cars are towed most often or whether there's a correlation between the color and the chances of being towed.
What if you want to know, week by week, how many cars were towed? For that, you can take advantage of a great Pandas feature, in which you can set the data frame's index to be a timestamp column:
df.set_index('Tow Date', inplace=True)
Now, instead of accessing rows by an integer index, you'll use a
timestamp. But the point is not to access things that way. Rather,
resample, similar to a
GROUP BY query in SQL.
Resampling can be done in a variety of ways; here's asking for it
to be on a one-week basis:
By itself, this does nothing. But if you then count the number of rows
for each week, you get much more interesting output. For every column, you
get the number of entries per week. The numbers differ, because many
columns have missing (
NaN) information, which isn't included in the
count. That's fine; you can just use the
Make column, which seems to
be filled in for every towed vehicle:
Now you should get a very nice report, showing how many cars were towed each week:
Tow Date 2017-12-31 103 2018-01-07 321 2018-01-14 209 2018-01-21 241 2018-01-28 250 2018-02-04 399 2018-02-11 248 2018-02-18 328 2018-02-25 587 2018-03-04 862 2018-03-11 495 2018-03-18 601 2018-03-25 754 2018-04-01 139 Freq: W-SUN, Name: Make, dtype: int64
Humans see patterns more easily with graphics than with tables of numbers, so you might want to plot this, also using Pandas:
That makes a line plot showing that the number of towed vehicles shot up at the start of March. Why? You could look at temperatures and a calendar to start to guess, but the fact that you can download, inspect and graph this information so easily is a good starting point.Conclusion
Pandas is more than merely a tool for data scientists to do numeric analysis. The fact that it can read (and write) formats such as CSV, Excel and JSON make it my primary tool for working with those formats. Add to that the fact that it can summarize data, including based on timestamps, and you probably can see why Pandas is so popular.Resources
The home page for Pandas is here, and the Jupyter notebook, which I mentioned in this article, is here.
Finally, the list of "datetime components", which you can access via
dt object on timestamp columns in Pandas, is