This lesson is being piloted (Beta version)

Machine Learning for Tabular Data in R

This lesson introduces a selection of machine learning techniques for analyzing tabular data, including random forests and gradient boosted trees. No experience in machine learning is necessary, but learners should be familiar with data analysis and visualization in R.

Prerequisites

This lesson assumes some familiarity with R, including the dplyr and ggplot2 packages. Learners who have completed an introductory Data Carpentry lesson in R should be able to follow the presentation. For a refresher on prerequisite material, consider the lessons Data Analysis and Visualization in R for Ecologists or R for Social Scientists.

For Instructors

If you are teaching this lesson in a workshop, please see the Instructor notes.

Schedule

        Setup
            Download files required for the lesson
00:00   1. A Brief Introduction to Machine Learning
            What is machine learning?
            What specific tools will this lesson cover?
00:20   2. Linear and Logistic Regression
            How can a model make predictions?
            How do we measure the performance of predictions?
01:10   3. Decision Trees
            What are decision trees?
            How can we use a decision tree model to make a prediction?
            What are some shortcomings of decision tree models?
02:05   4. Random Forests
            What are random forests?
            How do random forests improve decision tree models?
03:35   5. Gradient Boosted Trees
            What is gradient boosting?
            How can we train an XGBoost model?
            What is the learning rate?
04:45   6. Cross Validation and Tuning
            How can the fit of an XGBoost model be improved?
            What is cross validation?
            What are some guidelines for tuning parameters in a machine learning algorithm?
06:00   Finish

The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.