Winter 2017

Last updated: March 07, 2017

Note: This is a draft of the syllabus. Details are subject to change, though the overall topics and flavor of the course will not.

Grading information

There will be several homework assignments/projects (the goal is 3-4). They will be open-ended where possible, though there will also be some standard practice questions. Assignments should be done in RMarkdown, with all code documented and choices justified. (If you want to use something besides RMarkdown, talk to me ASAP to see if we can work it out.)

Programming is not a solo activity, so I encourage you to seek out resources online or to discuss with classmates. However:

  1. Each student must submit their own assignment.
  2. If you borrow code, attribute it (whether to a website or another student).

Prerequisites

Stats 506 (Computational methods and tools in statistics) or equivalent experience. The course will be taught using R. We will use basic statistical techniques.

Course description

This course will use real-world data to explore the issues surrounding the handling of raw data. Often in courses, the data are provided by the instructor and have already undergone some cleaning. We will be using real-world data from sources such as data.gov, as well as obtaining our own data by web scraping. Time permitting, we’ll also discuss approaches to dealing with big data.

Students are encouraged (not required) to bring laptops to class to aid with hands-on learning.

Tentative Topics

This list of topics is ambitious, and we will most likely not have time to get to everything. If any of these topics are of special interest to you, please let me know as soon as possible and I will try to ensure we cover them.

  1. Basics
    • Developing a programming style
    • RMarkdown
  2. Obtaining Data
    • String manipulation
    • Web scraping
    • SQL
  3. Visualization & Data Cleaning
    • ggplot
    • High dimensional data visualization
    • Best practices in data cleaning
    • Multiple Imputation
  4. Simulations
    • Randomization studies
    • Simulations (especially for power)
  5. Software Development
    • Debugging
    • Profiling
    • Documenting with Roxygen
    • Testing with testthat
  6. Parallel programming
    • Working on Flux