This lesson is being piloted (Beta version)
If you teach this lesson, please tell the authors and provide feedback by opening an issue in the source repository

Introduction to Machine Learning in Python

This lesson provides an introduction to some of the common methods and terminologies used in machine learning research. We cover areas such as data preparation and resampling, model building, and model evaluation.

It is a prerequisite for the other lessons in the machine learning curriculum. In later lessons we explore tree-based models for prediction, neural networks for image classification, and responsible machine learning.

Predicting the outcome of critical care patients

Critical care units are home to sophisticated monitoring systems, helping carers to support the lives of the sickest patients within a hospital. These monitoring systems produce large volumes of data that could be used to improve patient care.

Figure: A patient in the ICU.

Our goal will be to predict the outcome of critical care patients using physiological data available on the first day of admission to the intensive care unit. These predictions could be used for resource planning or to assist with family discussions.

The dataset used in this lesson was extracted from the eICU Collaborative Research Database, a publicly available dataset comprising deidentified physiological data collected from critically ill patients.
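To give a flavour of what the lesson builds towards, the sketch below loads a table of first-day measurements, partitions it, fits a logistic regression model, and reports the AUROC. It is a minimal illustration only: the file name and column names are placeholders rather than the lesson's actual variable names, so check the Setup page and the later episodes for the real workflow.

```python
# Minimal sketch of the workflow covered in this lesson.
# NOTE: the file name and column names below are placeholders (assumptions),
# not necessarily the names used in the lesson data.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

features = ["age", "heartrate", "meanbp"]   # assumed physiological features
outcome = "actualhospitalmortality"         # assumed outcome column

# Load the first-day measurements and drop rows with missing values
cohort = pd.read_csv("eicu_cohort.csv").dropna(subset=features + [outcome])

# Partition the data, fixing the random state so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    cohort[features], cohort[outcome], test_size=0.3, random_state=42
)

# Fit a simple classifier and evaluate it on the held-out test set
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_prob = model.predict_proba(X_test)[:, 1]
print(f"AUROC: {roc_auc_score(y_test, y_prob):.2f}")
```

Later episodes unpack each of these steps, including why the data should be partitioned before other processing and how cross-validation and bootstrapping give more robust estimates of performance.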

Prerequisites

You need to understand the basics of Python before tackling this lesson. The lesson sometimes references Jupyter Notebook, although you can use any of the Python interpreters mentioned in the Setup.

Getting Started

To get started, follow the directions on the “Setup” page to download data and install a Python interpreter.

Schedule

Setup  Download files required for the lesson

00:00  1. Introduction
       What is machine learning?
       What is the relationship between machine learning, AI, and statistics?
       What is meant by supervised learning?

00:30  2. Data preparation
       What are some common steps in data preparation?
       What is SQL and why is it often needed?
       Why do we partition data at the start of a project?
       What is the purpose of setting a random state when partitioning?
       Should we impute missing values before or after partitioning?

01:00  3. Learning
       How do machines learn?
       How can machine learning help us to make predictions?
       Why is it important to be able to quantify the error in our models?
       What is an example of a loss function?

01:30  4. Modelling
       Broadly speaking, when talking about regression and classification, how does the prediction target differ?
       Would linear regression be most useful for a regression or classification task? How about logistic regression?

02:10  5. Validation
       What is meant by model accuracy?
       What is the purpose of a validation set?
       What are two types of cross-validation?
       What is overfitting?

02:40  6. Evaluation
       What kind of values go into a confusion matrix?
       What do the letters AUROC stand for?
       Does an AUROC of 0.5 indicate that our predictions were good, bad, or average?
       In the context of evaluating the performance of a classifier, what is TP?

03:10  7. Bootstrapping
       Why do we "boot up" computers?
       How is bootstrapping commonly used in machine learning?

03:40  8. Data leakage
       What are common types of data leakage?
       How does data leakage occur?
       What are the implications of data leakage?

04:10  Finish

The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.