My goal here is to provide a structured and (moderately) scaffolded path for an engaged person to acquire skills, vocabulary, and concepts necessary to build and communicate data science projects. As a framework, we’ll step through specific statistical learning techniques.
Stretch goals for the highly motivated learner would be to work towards
a) designing or leading such projects,
b) working at the cutting edge of data science applications, or
c) building a foundation for doing research in the field (for example, pursuing a PhD).
I assume familiarity with linear algebra, calculus, and programming. You’ll need to know how to write and run scripts in R or Python. You should know what the following terms mean (and the notation that describes them): dot products, eigenvalues and eigenvectors, partial derivatives (okay, that one is kind of optional, but you’ll want to know what a derivative is), integrals, and computational complexity. These concepts aren’t always key to doing the work; with modern tools, a decent programmer or business analyst can do some pretty cool data science. But they are key to understanding what you are doing.
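If you want a quick self-check on those prerequisites, here’s a short NumPy sketch (my own illustration, not course material) that exercises a few of the concepts: a dot product, the eigendecomposition of a small matrix, and a numerical derivative.

```python
import numpy as np

# Dot product: the workhorse of linear models and similarity measures.
u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])
dot = u @ v  # 1*4 + 2*5 + 3*6 = 32

# Eigenvalues and eigenvectors: these show up later in PCA and clustering.
A = np.array([[2.0, 0.0],
              [0.0, 3.0]])
eigvals, eigvecs = np.linalg.eig(A)  # a diagonal matrix's eigenvalues are its diagonal

# Derivative, approximated numerically: the slope of f(x) = x**2 at x = 3 is 6.
f = lambda x: x ** 2
h = 1e-6
deriv = (f(3 + h) - f(3 - h)) / (2 * h)

print(dot, sorted(eigvals.tolist()), deriv)
```

If each of those lines makes sense to you (and you could predict the outputs), you’re in good shape.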
The journal entry guidelines describe, in general terms, how I’m assessing your write-ups. Also try to follow these principles for developing, documenting, and sharing computational work (items 1-4 are particularly relevant to the work for this course).
- week 1: setting up your environment
- week 2: data visualization, data, and distributions
- week 3: model evaluation and regression
- week 4: hypothesis testing
- week 5: Bayesian estimation
- week 6: Bayesian prediction
- weeks 7-8: project planning
- weeks 8-9: clustering
- week 10: Big Data symposium
- week 11: ensemble learning
- week 12: natural language processing
- week 13: TBD
- week 14: project presentations
- DataCamp. We will make extensive use of DataCamp for getting up to speed on the tools. There are more than a hundred mini-courses at a variety of levels. It looks like they’ve done a good job organizing material into focused chunks, and they’ve got some instructors whose work I really respect. I’m pretty excited about the opportunity to try their courses out, and I would love to hear your feedback. Contact me directly if you’re formally enrolled in the class and either
- need to be re-invited to the group or
- want help developing a study-plan that meets your goals.
Good data sources for the course
- Quandl. Open and free time-series financial datasets.
- DrivenData. Data science competitions for social good.
- Kaggle Datasets. They’ve put together a clean portal for open datasets, with community vetting and good search functionality.
- General Social Survey from NORC. American Community Survey from the US census.
- Pothole images. With paper here.
- Million Song Dataset. Wasn’t someone interested in music analysis?
- Nature Scientific Data. An open-access, peer-reviewed journal publishing descriptions of datasets.
I’m still considering a few others, such as
- the US Dept of Transportation flight delay data,
- a home ownership dataset (Trulia? Zillow? Federal data?), and
- an image dataset (perhaps one at deeplearning.net?).
If you have a suggestion or an opinion, please let me know. Also, I said earlier that I didn’t want to use Kaggle data because it was cleaned to the point of having a lot of the meaning stripped from it. After checking out a few of the publicly contributed datasets, they still look fairly clean, but they seem to have more substance than the often-sterile competition data sets.
- Detroit Open Data: You can find such portals for a bunch of medium-to-large cities in the US. For example, see this Forbes article.
- StatLine. Netherlands Central Bureau of Statistics.
- UCI Machine Learning Repository. Datasets to test machine learning algorithms.
- The Harvard Dataverse. Scientific data for reproducible research.
- R datasets package. A bunch of built-in datasets for building examples in R.
- The Collection of Really Great, Interesting, Situated (CORGIS) Datasets. Just what it sounds like.
- data.gov: “The home of the U.S. Government’s open data.” Includes many, many datasets from both federal and non-federal sources.
I follow about 50 data science blogs and news sources. When one jumps out at me as particularly interesting for the class, I’ll try to add it to the list below. I don’t know that these are the best, but I do pay attention to their posts.
- Kaggle’s blog: Mostly news, but also some decent, quick tutorials.
- Revolutions: Highlights cool projects and tutorials, as well as data science news.
- FlowingData: I think Nathan Yau does a great job of producing original analyses while also keeping in touch with the cool and innovative dataviz going on.
- R-bloggers: A blog aggregator and, overall, a mixed bag of fluff, highly technical posts, announcements of R packages, news, cool projects, and more… Most contributors are high-quality and/or interesting most of the time.
- DataScience+: This is an exception to the “I pay attention to the posts” statement, because I just learned about it (thanks, Jason!). Pretty cool examples and tutorials.