Datasets and case studies

A major part of this course involves analyzing datasets drawn from various fields. We have created a repository of datasets and developed case studies based on them. Each case study is a Jupyter notebook that you will run on the Greatlakes system. The data are already stored on Greatlakes and are ready to be accessed.

The datasets and their case studies are listed below, with a brief description of the learning goals for each case study.

  • Austin emobility: This is an extract from a log of all e-scooter rentals in Austin, Texas, for the period from April 2018 to July 2020. The ‘scooters_basic.ipynb’ notebook illustrates how to load and manipulate data, work with dates and times, pivot dataframes, and produce basic statistical summaries including proportions and quantiles.

  • American Community Survey: This survey, known as the ACS, is the premier survey of the US population. The ‘acs_basic.ipynb’ notebook looks at summary statistics of ACS variables to illustrate statistical concepts such as location, dispersion and skew, quantiles, moments, grouping/aggregation, and some basic plotting.

  • Global Historical Climatology Network: This is a highly curated collection of weather data from stations around the world. The ‘ghcn_scatterplot.ipynb’ notebook focuses on using plotting and scatterplot smoothing to explore relationships among quantitative variables in the dataset.

  • National Health and Nutrition Examination Survey: This is a high-quality survey of US adults focusing on aspects of health and nutrition. The ‘nhanes_sampling.ipynb’ notebook uses the NHANES data to illustrate the sampling properties of the mean and other statistics, as well as the Central Limit Theorem.

  • World Health Organization mortality: This is a dataset compiled by the World Health Organization (WHO) that reports sex- and cause-specific mortality by country and by year. The ‘who_mortality.ipynb’ notebook uses these data to illustrate methods for statistically comparing proportions (mortality rates) between groups using standard errors and Z-scores.
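
To give a flavor of the kind of calculation involved, here is a minimal sketch of comparing two proportions using a standard error and a Z-score, as in the last case study above. The counts are made up for illustration and are not taken from the WHO data, the function name `two_proportion_z` is hypothetical, and the notebook itself may take a different approach (for example, a pooled standard error).

```python
import math


def two_proportion_z(events_a, n_a, events_b, n_b):
    """Compare two independent proportions.

    Returns the difference in proportions, its (unpooled) standard
    error, and the Z-score (difference divided by standard error).
    """
    p_a = events_a / n_a
    p_b = events_b / n_b
    # Variances of independent estimates add, so the standard error of
    # the difference is the square root of the summed variances.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_a - p_b
    return diff, se, diff / se


# Made-up counts: deaths and population size in two hypothetical groups.
diff, se, z = two_proportion_z(120, 10_000, 90, 10_000)
print(f"difference = {diff:.4f}, SE = {se:.5f}, Z = {z:.2f}")
```

A Z-score far from zero (a common rule of thumb is beyond about 2 in absolute value) suggests that the difference between the two rates is larger than would be expected from sampling variation alone.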