Data science


My goal here is to provide a structured and (moderately) scaffolded path for an engaged person to acquire skills, vocabulary, and concepts necessary to build and communicate data science projects. As a framework, we’ll step through specific statistical learning techniques.

Stretch goals for the highly-motivated learner would be to work towards
a) designing or leading such projects,
b) working at the cutting edge of data science applications, or
c) building a foundation for doing research in the field (for example, pursue a PhD).

I assume familiarity with linear algebra, calculus, and programming. You’ll need to know how to write and run scripts in R or Python. You should know what the following terms mean (and know the notation that describes them): dot products, eigenvalues and eigenvectors, partial derivatives (okay, that’s kind of optional, but you’ll want to know what a derivative is), integrals, and computational complexity. These concepts aren’t always key to doing the work (with modern tools, a decent programmer or business analyst can do some pretty cool data science), but they are key to understanding what you are doing.
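If you want a rough self-check on those prerequisites, here’s a sketch in plain Python (no libraries, purely illustrative; in practice you’d reach for NumPy or base R rather than hand-rolling any of this):

```python
# Prerequisite self-check: if these four functions make sense to you,
# you have the math background this course assumes.

def dot(u, v):
    # dot product: sum of elementwise products
    return sum(a * b for a, b in zip(u, v))

def eig2(a, b, c, d):
    # eigenvalues of the 2x2 matrix [[a, b], [c, d]],
    # via the characteristic polynomial (trace/determinant form)
    tr, det = a + d, a * d - b * c
    disc = (tr * tr - 4 * det) ** 0.5
    return (tr + disc) / 2, (tr - disc) / 2

def derivative(f, x, h=1e-6):
    # central-difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

def integral(f, a, b, n=10_000):
    # trapezoidal approximation of the integral of f from a to b
    h = (b - a) / n
    return h * (f(a) / 2 + sum(f(a + i * h) for i in range(1, n)) + f(b) / 2)

print(dot([1, 2, 3], [4, 5, 6]))                  # 32
print(eig2(2, 0, 0, 3))                           # (3.0, 2.0)
print(round(derivative(lambda x: x**2, 3), 3))    # 6.0
print(round(integral(lambda x: x, 0, 1), 4))      # 0.5
```

If reading that felt routine, you’re in good shape; if not, a linear algebra or calculus refresher before the course will pay off.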

weekly handouts

The journal entry guidelines describe generally how I’m assessing your write-ups. And try to follow these principles for developing, documenting, and sharing computational work (items 1-4 are particularly relevant to the work for this course).


  • DataCamp. We will make extensive use of DataCamp for getting up to speed on the tools. There are more than a hundred mini-courses at a variety of levels. It looks like they’ve done a good job organizing material into focused chunks, and they’ve got some instructors whose work I really respect. I’m pretty excited about the opportunity to try their courses out, and I would love to hear your feedback. Contact me directly if you’re formally enrolled in the class and either
    1. need to be re-invited to the group or
    2. want help developing a study-plan that meets your goals.

good data sources

for the course

I’m still considering a few others, such as

  • the US Dept of Transportation flight delay data,
  • a home ownership dataset (Trulia? Zillow? Federal data?), and
  • an image dataset (perhaps one at …).

If you have a suggestion or an opinion, please let me know. Also, I said that I didn’t want to use Kaggle data because it was cleaned to the point of having a lot of the meaning stripped from it. After checking out a few of the publicly contributed ones, they still look fairly clean, but they seem to have more substance than the often-sterile competition data sets.


blogs, news

I follow about 50 data science blogs and news sources. When one jumps out at me as particularly interesting to the class, I’ll try to add it to the list below. I don’t know that these are the best, but I do pay attention to their posts.

  • Kaggle’s blog: Mostly news, but also some decent, quick tutorials.
  • Revolutions: Highlights cool projects and tutorials, as well as data science news.
  • FlowingData: I think Nathan Yau does a great job of producing original analyses while also keeping in touch with the cool and innovative dataviz going on.
  • R-bloggers: A blog aggregator and, overall, a mixed bag of fluff, highly technical posts, announcements of R packages, news, cool projects, and more… Most contributors are high-quality and/or interesting most of the time.
  • DataScience+: This is an exception to the “I pay attention to the posts” statement, because I just learned about it (thanks, Jason!). Pretty cool examples and tutorials.
