Pandas

Serious practitioners of data science use the full scientific method, starting with a question and a hypothesis, followed by an exploration of the data to determine whether the hypothesis holds up. But in many cases, such as when you aren't quite sure what your data contains, it helps to perform some exploratory data analysis—just looking around, trying to see if you can find something.

And, that's what I'm going to cover here, using tools provided by the amazing Python ecosystem for data science, sometimes known as the SciPy stack. It's hard to overstate the number of people I've met in the past year or two who are learning Python specifically for data science needs. Back when I was analyzing data for my PhD dissertation, just two years ago, I was told that Python wasn't yet mature enough to do the sorts of things I needed, and that I should use the R language instead. I do have to wonder whether the tables have turned by now; the number of contributors and contributions to the SciPy stack is phenomenal, making it a more compelling platform for data analysis.

In my article "Analyzing Data", I described how to filter through logfiles, turning them into CSV files containing the information that was of interest. Here, I explain how to import that data into Pandas, which provides an additional layer of flexibility and will let you explore the data in all sorts of ways—including graphically. Although I won't necessarily reach any amazing conclusions, you'll at least see how you can import data into Pandas, slice and dice it in various ways, and then produce some basic plots.

Pandas

NumPy is a Python package, downloadable from the Python Package Index

(PyPI), which provides a data structure known an a NumPy array. These arrays, although accessible from Python, are mainly implemented in C for maximum speed and efficiency. They also operate on a vector basis, so if you add 1 to a NumPy array, you're adding 1 to every single element in that array. It takes a while to get used to this way of thinking, and to the fact that the array should have a uniform data type.

Now, what can you do with your NumPy array? You could apply any number of functions to it. Fortunately, SciPy has an enormous number of functions defined and available, suitable for nearly every kind of scientific and mathematical investigation you might want to perform.

But in this case, and in many cases in the data science world, what I really want to do is read data from a variety of formats and then explore that data. The perfect tool for that is Pandas, an extensive library designed for data analysis within Python.

The most basic data structure in Pandas is a "series", which is basically a wrapper around a NumPy array. A series can contain any number of elements, all of which should be of the same type for maximum efficiency (and reasonableness). The big deal with a series is that you can set whatever indexes you want, giving you more expressive power than would be possible in a NumPy array. Pandas also provides some additional functionality for series objects in the form of a large number of methods.

But the real powerhouse of Pandas is the "data frame", which is something like an Excel spreadsheet implemented inside of Python. Once you get a table of information inside a data frame, you can perform a wide variety of manipulations and calculations, often working in similar ways to a relational database. Indeed, many of the methods you can invoke on a data frame are similar or identical in name to the operations you can invoke in SQL.

______________________