Learning Data Science

In my last few articles, I've written about data science and machine learning. In case my enthusiasm wasn't obvious from my writing, let me say it plainly: it has been a long time since I last encountered a technology that was so poised to revolutionize the world in which we live.

Think about it: you can download, install and use open-source data science libraries, for free. You can download rich data sets on nearly every possible topic you can imagine, for free. You can analyze that data, publish your findings on a blog, and get reactions from governments and companies.

I remember learning in high school that the difference between freedom of speech and freedom of the press is that not everyone has a printing press. Not only has the internet provided everyone with the equivalent of a printing press, but it has given us the power to perform the sort of analysis that until recently was exclusively available to governments and wealthy corporations.

During the past year, I have increasingly heard that data science is the sexiest profession of the 21st century and the one that will be in greatest demand. Needless to say, those two things make for a very appealing combination! It's no surprise that I've seen a major uptick in the number of companies inviting me to teach on this subject.

The upshot is that you (yes, you, dear reader) should spend time in the coming weeks, months and years learning whatever you can about data science. This isn't because you will change jobs and become a data scientist. Rather, it's because everyone is going to become a data scientist. No matter what work you do, you'll be better at it, because you will be able to use the tools of data science to analyze past performance and make predictions based on it.

Back when I started to develop web applications, it was the norm to have a database team that created the tables and queries. Nowadays, although there certainly are places that have a full-time database staff, the assumption is that every developer has at least a passing familiarity with relational (or even NoSQL) databases and how to work with them. In the same way that developers who understand databases are more powerful than those who don't, people in the computer field who understand data science are more powerful than those who don't.

There is a bit of bad news on this front, though. If you thought that technological change in programming and the web moved at a breakneck pace, you haven't seen anything yet! The world of data science (the tools, the algorithms, the applications) is moving at an overwhelming speed. The good news is that everyone is struggling to keep up, which means if you find yourself overwhelmed, you're probably in very good company. Just be sure to keep moving ahead, aiming to increase your understanding of the theory, algorithms, techniques and software that data scientists use.

Where should you start? In this article, I describe some of the resources I've found to be the most helpful as I've been diving deeper and deeper into data science.


There's no way around it. If you're going to do data science, you're going to need to learn some statistics. I took a year of it in graduate school, and then I did some analysis as part of my dissertation, but there's a lot I don't know, so I've been trying to improve my understanding. Every little bit helps! Whether you're simply learning Bayes' Theorem, figuring out how linear regression works or learning how to modify your data to minimize errors, statistics is a crucial part of this.
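To give a taste of how small these ideas can be once they're written down, here is a minimal sketch of Bayes' Theorem in Python. The function name and all of the numbers (a 1% base rate, a test that catches 95% of true cases and wrongly flags 5% of healthy ones) are hypothetical, chosen only for illustration.

```python
def posterior(prior, true_positive_rate, false_positive_rate):
    """Return P(hypothesis | positive evidence) using Bayes' Theorem.

    prior               -- P(hypothesis) before seeing any evidence
    true_positive_rate  -- P(evidence | hypothesis is true)
    false_positive_rate -- P(evidence | hypothesis is false)
    """
    # Total probability of seeing the evidence at all
    evidence = (true_positive_rate * prior
                + false_positive_rate * (1 - prior))
    return true_positive_rate * prior / evidence


if __name__ == "__main__":
    # Hypothetical numbers, not real medical data
    p = posterior(prior=0.01,
                  true_positive_rate=0.95,
                  false_positive_rate=0.05)
    print(f"P(condition | positive test) = {p:.3f}")
```

With these made-up inputs, the answer comes out to roughly 16%, far lower than the 95% that many people would guess. That sort of surprise is a good reason to get comfortable with the statistics.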

So, where do you start? There are a number of courses, often for free or at very low cost, at edX, Udemy and Coursera. A particularly popular introduction to machine learning, which includes the basic statistical knowledge you'll need, is taught by Stanford professor Andrew Ng via Coursera. If you're looking for something more hard-core, I definitely recommend the Udemy courses by LazyProgrammer.

Two good and standard textbooks on the subject are An Introduction to Statistical Learning (by James, Witten, Hastie and Tibshirani) and The Elements of Statistical Learning (by Hastie, Tibshirani and Friedman). Both books are published by Springer, and both are available in PDF form, as free downloads. You probably should download and read those books; over time, the ideas and methods they describe will help you to reason about what you're doing.

I also want to recommend the various books and courses offered by Jason Brownlee at his site. His writing is clear, and he tries to be very practical about what he shows you. Especially if you're using Python for machine learning, his books are a great way to get started and improve your understanding.

Note that you definitely should not wait until you have read through books, watched lectures and taken courses to start playing with machine learning. That would be akin to saying you should try to learn a language only after you have mastered its grammar. As with language, you should be trying to use it at the same time that you're learning how it works.

Along with understanding the math, it's also important to develop a skeptical, statistical outlook on the world. Jake VanderPlas has a talk called "Statistics for Hackers" that not only translates the mathematical ideas into code, but also concentrates on the aspects that are most likely to be of interest in data science.
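As a small illustration of what "translating the math into code" can look like, here is a minimal sketch that answers a statistical question by simulation rather than by formula. It is written in that spirit and is not taken from the talk. The scenario (22 heads in 30 tosses) and the function name are my own assumptions for the example.

```python
import random


def simulated_p_value(n_tosses=30, observed_heads=22,
                      trials=20_000, seed=0):
    """Estimate how often a fair coin gives at least `observed_heads`
    heads in `n_tosses` tosses, by simply simulating it many times."""
    rng = random.Random(seed)  # seeded so repeated runs give the same answer
    at_least_as_extreme = 0
    for _ in range(trials):
        # One experiment: toss a fair coin n_tosses times, count the heads
        heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
        if heads >= observed_heads:
            at_least_as_extreme += 1
    return at_least_as_extreme / trials


if __name__ == "__main__":
    p = simulated_p_value()
    print(f"Estimated probability under a fair coin: {p:.4f}")
```

The exact binomial calculation gives a probability of about 0.008, and the simulation should land close to that. The point is that a loop and a random number generator can stand in for a formula you may not remember, as long as you can describe the experiment you would like to repeat.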

Two other books worth mentioning are The Cartoon Guide to Statistics (by Larry Gonick and Woollcott Smith) and Statistics Done Wrong (by Alex Reinhart). Both books are good for getting you to think in this way. By that I mean that when someone presents you with data, or when you are about to present data to others, you'll at least be able to find some of the holes in the argument, or alternative explanations to yours.