Course Content

This course introduces participants to machine learning (also known as statistical learning), a fascinating field of statistics. We will see how statistical models can be used to gain a deeper understanding of data, which often involves finding optimal predictions and classifications. The field is developing quickly and is being applied in areas such as business analytics, political science, and sociology.

Course Objectives

This course provides an introduction to the data science approach to quantitative data analysis using the methods of statistical learning, an approach that blends classical statistical methods with recent advances in computation and machine learning. The course covers the main analytical methods from this field with hands-on applications using example datasets, so that students gain experience with and confidence in using these methods.

By the end of the course, students will know how to apply a number of tools and models for supervised and unsupervised learning. After a short probability refresher, students will learn how to evaluate competing methods using cross-validation. We will then see how to build optimal prediction models. Building a good prediction model requires choosing an appropriate set of explanatory variables. To this end, we will rely on subset selection and on shrinkage methods such as the lasso and ridge regression. Classification is another prominent topic, and we will use decision trees and random forests to solve such problems. Finally, for data reduction, we will rely on principal component analysis. Together, these tools provide the foundation for solving real-world problems, potentially by combining several of these approaches. The focus of this class is on giving students sufficient practical training to apply these methods fruitfully in their own work.

Course Prerequisites

Students are expected to have a solid understanding of linear regression models and, preferably, some familiarity with binary outcome models. Prior exposure to statistical software is beneficial but not required; the course begins with a short introduction to RStudio. More important than prior training is a willingness to engage with the topics of the class.