High-dimensional datasets, i.e., tabular data with many features describing each observation, are increasingly common across research domains. How can researchers find patterns and extract insights from such complex, information-rich data? In this workshop, we will explore several tried-and-true methods that help data analysts better understand their high-dimensional data, including principal component analysis (PCA), data visualization, and regularized multivariate regression. As a result of participating in this workshop, learners should be able to…
- Define, identify, and give examples of high-dimensional datasets
- Visualize and explore high-dimensional data to reveal a research story
- Use dimensionality reduction techniques such as PCA to yield useful abstractions/summaries of complex, high-dimensional data
- Understand the challenges associated with fitting both predictive (e.g., overfitting) and explanatory (e.g., avoiding multicollinearity) regression models to high-dimensional datasets
- Optimize high-dimensional multivariate regression models for either predictive or explanatory purposes via a combination of techniques including PCA, feature selection, and regularization (lasso, ridge, and elastic net)
- TODO: (1) How to navigate the common pitfalls of clustering in high dimensions, (2) High-dimensional visualization tools including PaCMAP and t-SNE
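To preview two of the techniques named above, here is a minimal sketch using scikit-learn (the workshop's toolkit, per the prerequisites). The dataset is synthetic, and the specific parameter choices (the 90% variance threshold for PCA and the lasso penalty `alpha=1.0`) are illustrative assumptions, not workshop-prescribed values.

```python
# Minimal sketch: PCA for dimensionality reduction and lasso regression
# for regularization/feature selection on synthetic high-dimensional data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Synthetic "high-dimensional" data: 100 observations, 500 features,
# only 10 of which actually influence the response.
X, y = make_regression(n_samples=100, n_features=500, n_informative=10,
                       noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Dimensionality reduction: keep enough components to explain
# 90% of the variance in the training data (threshold is an assumption).
pca = PCA(n_components=0.9).fit(X_train)
X_train_pca = pca.transform(X_train)
print(f"{X_train.shape[1]} features reduced to "
      f"{X_train_pca.shape[1]} principal components")

# Regularization: lasso shrinks many coefficients to exactly zero,
# performing feature selection as a side effect.
lasso = Lasso(alpha=1.0).fit(X_train, y_train)
n_selected = int(np.sum(lasso.coef_ != 0))
print(f"lasso kept {n_selected} of {X.shape[1]} features")
print(f"test R^2: {lasso.score(X_test, y_test):.2f}")
```

Note the trade-off the sketch hints at: PCA compresses the feature space into a few uncorrelated components (useful when multicollinearity is a concern), while lasso keeps the original features but zeroes out most coefficients, which helps guard against overfitting when features outnumber observations.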
Prerequisites
Learners are expected to have the following prerequisite knowledge:
- Introductory Python programming skills (variable assignments, how to create a function, for loops, etc.) and familiarity with the Pandas package. If you need a refresher on Python before taking this workshop, please review the lesson materials from this Introductory Python Carpentries workshop.
- Familiarity with basic machine learning concepts, including train/test splits and overfitting. For a refresher on machine learning basics, please review the lesson materials from the Intro to Machine Learning with Sklearn workshop.