Learning Data Science

Data Science Theory

Although statistics certainly is an important part of data science, it's not the only part. Indeed, there are a number of model types that aren't statistical, such as K Nearest Neighbors.

Knowing the different types of algorithms that are available, when each is appropriate and how to tweak them will be invaluable. In many cases, you'll just want to throw a bunch of algorithms at the problem—and if your data set is small and/or easy to understand, that'll be just fine. But if it takes a long time to train your model, trying a dozen different algorithms is neither smart nor effective. Just as an expert cook knows which knife to use, and a good programmer should know which language is appropriate for a given task, someone building machine learning models should know which algorithms are more likely to be useful. (It's not always 100% obvious, but you do want to narrow down your starting set.)
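For a small data set, "throwing a bunch of algorithms at the problem" might look something like the following sketch. It assumes scikit-learn is installed and uses the iris data set that ships with it. The three candidate models and the five-fold cross-validation are my own illustrative choices, not a recommendation for any particular problem.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Small, well-understood data set bundled with scikit-learn
X, y = load_iris(return_X_y=True)

# A deliberately short starting set of candidate algorithms (illustrative choice)
candidates = {
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

# Score each candidate with 5-fold cross-validation and keep the mean accuracy
scores = {}
for name, model in candidates.items():
    fold_scores = cross_val_score(model, X, y, cv=5)
    scores[name] = fold_scores.mean()

# Print the candidates from best to worst mean accuracy
for name, mean_accuracy in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {mean_accuracy:.3f}")
```

On a data set this size, the whole comparison runs in well under a second. When each fit takes hours, the same loop becomes expensive, which is why it pays to narrow the list of candidates first.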

In addition to the books I mentioned above, some others are well worth reading and reviewing. Doing Data Science by Cathy O'Neil and Rachel Schutt, as well as the Python Data Science Handbook by Jake VanderPlas, introduce the ideas behind data science, but they also include working code and examples that you can and should play with.

A phenomenal resource is the Analytics Vidhya site, which summarizes, describes and instructs in a truly staggering number of technologies, algorithms and theories. Daily email messages from this site always are interesting and useful, and, quite frankly, overwhelming in their number and scope.

Data Science Hacking

Although statisticians have been using software for many years, one of the key differences between statistics and data science is that the latter requires programming knowledge. It's no surprise, given its shallow learning curve and huge, friendly community, that Python has become the leading language for data science. If you choose to use Python (which I definitely recommend), you'll need to learn a number of libraries that don't always adhere to the standard Python way of doing things: NumPy and Pandas provide data structures, and then there's also scikit-learn, which provides the algorithms and support for machine learning.
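To give a rough sense of how these three libraries fit together, here is a minimal sketch. The measurements and the "size" labels are made up purely for illustration, and the choice of a k-nearest neighbors classifier with three neighbors is an assumption on my part.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# NumPy: a typed array with vectorized operations (made-up values)
heights_cm = np.array([150.0, 160.0, 170.0, 180.0, 190.0, 200.0])
print(heights_cm.mean())

# Pandas: a labeled table built on top of NumPy arrays (made-up values)
df = pd.DataFrame({
    "height_cm": heights_cm,
    "weight_kg": [50.0, 55.0, 65.0, 80.0, 90.0, 100.0],
    "size": ["small", "small", "small", "large", "large", "large"],
})

# Selecting rows with a boolean mask, one of the idioms that
# looks unusual if you are used to plain Python lists
tall = df[df["height_cm"] > 175]
print(tall)

# scikit-learn: fit a model on the feature columns, then predict
model = KNeighborsClassifier(n_neighbors=3)
model.fit(df[["height_cm", "weight_kg"]], df["size"])

new_person = pd.DataFrame({"height_cm": [155.0], "weight_kg": [52.0]})
prediction = model.predict(new_person)[0]
print(prediction)
```

Most scikit-learn estimators follow the same pattern of creating the model, calling fit and then calling predict, so swapping in a different algorithm usually means changing only the line that creates the model.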

The websites for each of these packages, but especially scikit-learn, are huge, and they likely will make you think you never can learn it all. And indeed, no one is expecting you to know everything that those packages can do by heart. But over time, you will be expected to understand more and more algorithms and ideas, and also how to implement them.

If you're using Python, the Jupyter notebook is likely to be your day-to-day tool of choice. Jupyter continues to gain impressive functionality, with new versions released every few weeks. If you're new to Python or to dynamic languages in general, Jupyter can feel a bit odd, but it quickly grows on you and will become a fluid part of your day-to-day work.

As you can see, it's important to practice. I often say that programming languages are like human (natural) languages, in that you need to practice using them to gain true fluency. Data science is the same, but it's also different, in that you need fluency in several related disciplines in order to succeed.

Fortunately, the world of data science is large and growing, providing a lot of interesting data sets for people to analyze, both for fun and practice, and also for serious use. "I Quant NY" is a blog that not only provides interesting information about New York City from city-supplied data sets, but it also shows how data scientists can ask questions and provide answers that affect many people. If you're looking for data sets, it's hard to know just where to start or what sort of analysis might be most appropriate. The weekly newsletter "Data is Plural" by Jeremy Singer-Vine, the "data sets" subreddit and Data.World all offer a staggering number of data sets on a variety of topics. Choose something that's of interest to you, and see what questions you can ask and answer.

I would be remiss if I didn't mention a few of the podcasts to which I listen. Not only do they provide me with the latest news, information, anecdotes and updates from the world of data science, they also allow me to understand the trends better—for example, in favor of neural networks and deep learning. "Partially Derivative" and "Linear Digressions" are my two favorites, but there are some others, such as "Data Science at Home" and "Data Skeptic". Podcasts aren't going to help you to code better; only more coding can really do that. But they will give you perspective and understanding that make the code more obvious.

Finally, although I believe that data science is changing our world for the better, we do need to be on the lookout for potential issues. Cathy O'Neil's book, "Weapons of Math Destruction", is a must-read for anyone entering this world. Even if you aren't writing algorithms that will affect millions of people, it is important to be aware of our biases as humans, and of our need to be transparent when implementing policy via machine. This is easily one of the best books I've read in the last few years.

I'll definitely return to data science topics in the future, given its importance to developers. But for my next article, I plan to return to the world of web applications and databases, looking at the languages, libraries and packages we use to create modern applications.
