This lesson is still being designed and assembled (Pre-Alpha version)

Exploring and Modeling High-Dimensional Data

High-dimensional datasets, i.e. tabular data with many features describing each observation, are increasingly commonplace in many research domains. How can researchers find patterns and extract insights from such complex, information-rich data? In this workshop, we will explore several tried-and-true methods that help data analysts better understand their high-dimensional data, including principal component analysis, data visualization, and regularized multivariate regression. As a result of participating in this workshop, learners should be able to…
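As a brief preview of the kind of workflow covered in the lesson, here is a minimal sketch (using scikit-learn on a small synthetic table; the column names and data are invented for illustration, not the lesson's dataset) of reducing a many-feature table to a handful of principal components.

```python
# Illustrative sketch only: synthetic data standing in for a real
# high-dimensional table with many feature columns per observation.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 50)),
                 columns=[f"feature_{i}" for i in range(50)])

# Standardize the features, then project onto the first 5 principal components.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                      # (100, 5)
print(pca.explained_variance_ratio_.sum())  # variance captured by 5 components
```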

Prerequisites

Learners are expected to have the following prerequisite knowledge:

  • Introductory Python programming skills (variable assignments, how to create a function, for loops, etc.) and familiarity with the Pandas package. If you need a refresher on Python before taking this workshop, please review the lesson materials from this Introductory Python Carpentries workshop.
  • Familiarity with basic machine learning concepts, including train/test splits and overfitting. For a refresher on machine learning basics, please review the lesson materials from the Intro to Machine Learning with Sklearn workshop. (A short sketch illustrating these concepts appears after this list.)
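To gauge whether your background matches this prerequisite, the sketch below (with a hypothetical synthetic X and y, not data from the lesson) shows the kind of train/test split and overfitting check learners are assumed to recognize.

```python
# Minimal sketch of the assumed background: a train/test split and a
# quick check for overfitting. X and y are hypothetical placeholders.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=200)

# Hold out 20% of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# A large gap between train and test scores is a sign of overfitting.
print("train R^2:", model.score(X_train, y_train))
print("test R^2:", model.score(X_test, y_test))
```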

Schedule

Setup   Download files required for the lesson
00:00   1. Exploring high dimensional data
        What is a high dimensional dataset?
00:22   2. The Ames housing dataset
        Here we introduce the data we'll be analyzing.
01:09   3. Predictive vs. explanatory regression
        What are the two different goals to keep in mind when fitting machine learning models?
        What kinds of questions can be answered using linear regression?
        How can we evaluate a model's ability to capture a true signal/relationship in the data versus spurious noise?
01:56   4. Model validity - relevant predictors
        What are the benefits/costs of including additional predictors in a regression model?
02:43   5. Model validity - regression assumptions
        How can we rigorously evaluate the validity and accuracy of a multivariate regression model?
        What are the assumptions of linear regression models and why are they important?
        How should we preprocess categorical predictors and sparse binary predictors?
03:30   6. Model interpretation and hypothesis testing
        How can multivariate models be used to detect interesting relationships found in nature?
        How can we interpret statistical results with minor violations of model assumptions?
04:17   7. Feature selection with PCA
        How can PCA be used as a feature selection method?
05:04   8. Unpacking PCA
        What is the intuition behind how PCA works?
05:51   9. Regularization methods - lasso, ridge, and elastic net
        How can LASSO regularization be used as a feature selection method?
06:38   10. Exploring additional datasets
        How can we use everything we have learned to analyze a new dataset?
07:25   11. Introduction to High-Dimensional Clustering
        What is clustering?
        Why is clustering important?
        What are the challenges of clustering in high-dimensional spaces?
        How do we implement K-means clustering in Python?
        How do we evaluate the results of clustering?
07:55   12. Addressing challenges in high-dimensional clustering
        How can dimensionality reduction help in high-dimensional clustering?
        What are specialized clustering algorithms for high-dimensional data?
        How can we visualize high-dimensional data?
        What insights can we gain from visualizing clusters?
08:25   Finish

The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.
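As one concrete example of the techniques listed in the schedule, the sketch below (on synthetic data, not the Ames housing dataset used in the lesson) illustrates how LASSO regularization can shrink uninformative coefficients to exactly zero and thereby act as a feature selector.

```python
# Sketch of LASSO as a feature selector on synthetic data: only a few
# of the 30 features actually influence y, and the L1 penalty should
# shrink most of the remaining coefficients to exactly zero.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 30))
true_coefs = np.zeros(30)
true_coefs[:3] = [5.0, -3.0, 2.0]          # only 3 informative features
y = X @ true_coefs + rng.normal(scale=1.0, size=200)

X_scaled = StandardScaler().fit_transform(X)

# LassoCV chooses the regularization strength by cross-validation.
lasso = LassoCV(cv=5, random_state=0).fit(X_scaled, y)

selected = np.flatnonzero(lasso.coef_)
print("features kept:", selected)
```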