Goals and expectations

The goal of this course is to introduce methods and theoretical principles for data analysis. In addition, you will learn some computing skills that will allow you to analyze data.

To do well in this course, you should master all of the conceptual material covered in the course ebook. You will also need to study the Jupyter notebooks that illustrate various techniques of data management and analysis. Lectures and lab discussions will cover important topics that you are likewise expected to master.

Mathematics

Statistics and data science have important origins in the field of mathematics. In that field, precise definitions and appropriate use of terminology are very important. A data scientist must be able to communicate effectively about their findings, and using terminology appropriately is an important aspect of being a proficient communicator. Therefore, you will be expected to master all of the terminology introduced in this course – you should be able to use the terms that we cover precisely, and explain them correctly in your own words.

This course relies on high-school mathematics. We will not use any calculus, but you should be comfortable with algebra, pre-calculus (e.g. functions), and mathematical tools such as logarithms. You may be asked to do some very basic calculations by hand to show that you understand a concept, but for the most part we will use computers to do our calculations. It is much more important to understand the relevant mathematical concepts than to be good at arithmetic and other calculations.
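
As a small illustration of using a computer to check a mathematical concept rather than doing arithmetic by hand, the sketch below verifies the product rule for logarithms, log(a * b) = log(a) + log(b). The values of `a` and `b` are arbitrary choices made for this example and do not come from the course materials.

```python
import math

# Arbitrary example values (an assumption for illustration only).
a, b = 8.0, 32.0

# Product rule for logarithms: log2(a * b) equals log2(a) + log2(b).
log_of_product = math.log2(a * b)
sum_of_logs = math.log2(a) + math.log2(b)

# Compare with a tolerance, since floating-point results can differ slightly.
rule_holds = math.isclose(log_of_product, sum_of_logs)
print(log_of_product, sum_of_logs, rule_holds)
```

The point is the concept (a logarithm turns multiplication into addition), not the arithmetic, which the computer handles.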

Application domains

Data science techniques are used in a “domain” from which the motivating question arises. This could be a scientific domain, such as chemistry or psychology, or a domain from business, industry, commerce, or government. It is rarely possible to completely separate the data science from the context in which the data and motivating questions arise. Therefore, successful data scientists must have broad intellectual curiosity and be unafraid to learn some basic background information about a variety of fields. Do not be surprised if the discussions in this course involve basic concepts from physics, economics, medicine, sociology, or other areas. We will not expect you to know much about these subjects, but you must be willing to think critically about the subjects to which our datasets are relevant.

Computing

This course will involve a fair amount of computer programming. You are not expected to have previous experience with any programming language. However, you should be willing to engage seriously with this aspect of the course. Although Python is sometimes said to be an easy programming language to learn, any programming language has a steep learning curve at the beginning. The purpose of programming in this course is to allow you to analyze data. At the beginning of the course, you will mainly be taking code that is provided to you and making small modifications to it. But by the end of the course, you should be able to do moderately sophisticated data analyses on your own.

The exams in this course will focus primarily on statistical principles. You will need to understand the purpose and basic theoretical properties of the statistical tools that we cover. You will also need to demonstrate that you understand how to formulate questions in statistical terms, how to interpret the results that different statistical procedures produce, and how to communicate these results effectively. Coding is a means to an end for us, rather than an end in itself, but being able to write clear and organized code is very important because it enables other people to read your code and verify what it does. In an exam setting, you will not be asked to write code unaided, but you will be asked to read fragments of code that are provided to you, explain what they do, and identify any problems in them.
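
To give a sense of what reading a code fragment and identifying a problem might look like, here is a hypothetical sketch. It is not taken from an actual exam, and the function names `mean_buggy` and `mean_fixed` are invented for this illustration. The first function is meant to compute the average of a list of numbers but contains a subtle error.

```python
def mean_buggy(values):
    """Intended to return the average of the numbers in `values`."""
    total = 0
    for v in values:
        total += v
    # Problem: `//` is floor division, so the fractional part is discarded.
    return total // len(values)


def mean_fixed(values):
    """Return the average of the numbers in `values` using true division."""
    return sum(values) / len(values)


data = [1, 2, 4]  # hypothetical example data
print(mean_buggy(data))  # floor of 7/3
print(mean_fixed(data))  # 7/3 as a floating-point number
```

A good answer would explain that both functions sum the values and divide by the number of values, and that the first uses floor division (`//`) where true division (`/`) is needed, so it understates the mean whenever the sum is not an exact multiple of the count.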