
High dimensional statistics with R

Key Points

Introduction to high-dimensional data
  • High-dimensional data are data in which the number of features, $p$, is close to or larger than the number of observations, $n$.

  • These data are becoming more common in the biological sciences due to increases in data storage capabilities and computing power.

  • Standard statistical methods, such as linear regression, run into difficulties when analysing high-dimensional data.

  • In this workshop, we will explore statistical methods for analysing high-dimensional data using datasets available on Bioconductor (a short setup sketch follows this list).
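
Bioconductor packages are installed slightly differently from CRAN packages. As a minimal, hedged sketch (the package shown, limma, is only an illustrative choice and not necessarily one used in this lesson):

```r
# Install BiocManager from CRAN if needed, then use it to install
# Bioconductor packages. "limma" is purely an illustrative choice.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("limma")
library("limma")
```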

Regression with many outcomes
  • Performing linear regression in a high-dimensional setting typically means fitting one model per feature, which requires us to approach hypothesis testing differently than in low-dimensional regression.

  • Sharing information between features can increase power and reduce false positives.

  • When running many null hypothesis tests on high-dimensional data, multiple testing correction allows us to retain power and avoid making costly false discoveries (see the sketch after this list).

  • Multiple testing methods can be more conservative or more liberal, depending on our goals.
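
To illustrate these points, here is a minimal sketch (on simulated data, an assumption for demonstration only) of fitting one linear model per feature and applying a Benjamini-Hochberg correction with p.adjust(). Sharing information between features, as mentioned above, is typically done with empirical Bayes methods such as those in the limma package.

```r
# Feature-wise testing with multiple testing correction on
# simulated data (no real effects, so discoveries should be rare).
set.seed(42)
n <- 20    # observations
p <- 1000  # features (p >> n is typical of high-dimensional data)
x <- rnorm(n)                        # a single covariate
y <- matrix(rnorm(n * p), nrow = p)  # one outcome per feature

# Fit one linear model per feature and extract the p-value for x
pvals <- apply(y, 1, function(feature) {
  summary(lm(feature ~ x))$coefficients["x", "Pr(>|t|)"]
})

# Benjamini-Hochberg correction controls the false discovery rate;
# method = "bonferroni" would be a more conservative choice
padj <- p.adjust(pvals, method = "BH")
sum(padj < 0.05)  # number of "discoveries" (expected ~0 here)
```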

Regularised regression
  • Regularisation is a way to fit a model, get better estimates of effect sizes, and perform variable selection simultaneously.

  • Training and test splits, or cross-validation, are useful ways to select models or hyperparameters (see the glmnet sketch after this list).

  • Regularisation can give us a more predictive set of variables, and by restricting the magnitude of coefficients, can give us a better (and more stable) estimate of our outcome.

  • Regularisation is often very fast compared to other methods for variable selection, which makes it easier to practise rigorous variable selection.
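
A minimal sketch of cross-validated LASSO regression using the glmnet package (assumed installed); the simulated data, alpha = 1 (LASSO rather than ridge) and the use of lambda.min are illustrative choices:

```r
# Cross-validated LASSO on simulated data with 5 true effects.
library("glmnet")

set.seed(42)
n <- 100
p <- 500
x <- matrix(rnorm(n * p), nrow = n)
beta <- c(rep(2, 5), rep(0, p - 5))  # only 5 non-zero coefficients
y <- drop(x %*% beta) + rnorm(n)

# cv.glmnet selects the penalty strength (lambda) by cross-validation;
# alpha = 1 gives the LASSO penalty, alpha = 0 gives ridge regression
cv_fit <- cv.glmnet(x, y, alpha = 1)

# Coefficients at the cross-validated lambda: most are shrunk to
# exactly zero, which is how the LASSO performs variable selection
coef(cv_fit, s = "lambda.min")
```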

Principal component analysis
  • Principal component analysis (PCA) is a statistical approach used to reduce dimensionality in high-dimensional datasets (i.e. where $p$ is equal to or greater than $n$).

  • PCA may be used to create a low-dimensional set of features from a larger set of variables. Examples of when a PCA may be useful include reducing high-dimensional datasets to fewer variables for use in a linear regression and for identifying groups with similar features.

  • PCA is a dimensionality reduction technique used widely in the analysis of complex biological datasets (e.g. high-throughput or genetic data).

  • The first principal component represents the dimension along which there is maximum variation in the data. Subsequent principal components represent dimensions with progressively less variation.

  • Scree plots and biplots may be used to show (1) how much variation in the data is explained by each principal component and (2) how data points cluster according to principal component scores and which variables are associated with these scores (see the sketch after this list).
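
A minimal sketch of PCA with base R's prcomp(); the built-in mtcars dataset is used purely for illustration:

```r
# PCA on a small built-in dataset; scale. = TRUE standardises the
# variables so that each contributes comparably to the components
pca <- prcomp(mtcars, scale. = TRUE)

summary(pca)                    # proportion of variance explained
screeplot(pca, type = "lines")  # scree plot of variance per component
biplot(pca)                     # scores and variable loadings together

# Principal component scores, e.g. for use as predictors in a
# downstream linear regression
scores <- pca$x[, 1:2]
```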

Factor analysis
  • Factor analysis is a method for reducing dimensionality that summarises the variation contained in multiple variables as a smaller number of uncorrelated factors.

  • PCA can be used to identify the number of factors to initially use in factor analysis.

  • The factanal() function in R can be used to fit a factor analysis, where the number of factors is specified by the user (see the sketch after this list).

  • Factor analysis can take into account expert knowledge when deciding on the number of factors to use, but a disadvantage is that the output requires careful interpretation.
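
A minimal sketch of factanal() on simulated data with a known two-factor structure; the data and the choice of two factors are assumptions for illustration:

```r
# Simulate six observed variables driven by two latent factors
set.seed(42)
n <- 200
f1 <- rnorm(n)
f2 <- rnorm(n)
x <- cbind(f1 + rnorm(n, sd = 0.5), f1 + rnorm(n, sd = 0.5),
           f1 + rnorm(n, sd = 0.5), f2 + rnorm(n, sd = 0.5),
           f2 + rnorm(n, sd = 0.5), f2 + rnorm(n, sd = 0.5))
colnames(x) <- paste0("V", 1:6)

# The number of factors is specified by the user; the default
# varimax rotation aids interpretation of the loadings
fa <- factanal(x, factors = 2)
fa$loadings   # shows which variables load on which factor
```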

K-means
  • K-means is an intuitive algorithm for clustering data.

  • K-means has various advantages but can be computationally intensive.

  • Apparent clusters in high-dimensional data should always be treated with some scepticism.

  • Silhouette width and bootstrapping can be used to assess how well our clustering algorithm has worked (see the sketch after this list).
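
A minimal sketch of k-means followed by silhouette widths, using silhouette() from the cluster package (assumed installed); the two well-separated simulated clusters are an illustrative assumption:

```r
library("cluster")

# Simulate two well-separated groups of points in two dimensions
set.seed(42)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 3), ncol = 2))

# nstart > 1 re-runs k-means from several random starting points,
# reducing the chance of converging to a poor local optimum
km <- kmeans(x, centers = 2, nstart = 20)

# Silhouette widths near 1 indicate well-separated clusters;
# values near 0 or below suggest poorly assigned points
sil <- silhouette(km$cluster, dist(x))
summary(sil)
plot(sil)
```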

Hierarchical clustering
  • Hierarchical clustering groups similar data points into clusters; in R, the hclust() function performs the clustering, and the relationships between clusters can be plotted as a dendrogram.

  • Hierarchical clustering differs from k-means clustering in that it does not require the user to specify the expected number of clusters in advance.

  • The distance (dissimilarity) matrix can be calculated in various ways, and different clustering algorithms (linkage methods) can affect the resulting dendrogram.

  • The Dunn index can be used to validate clusters using the original dataset (see the sketch after this list).
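
A minimal sketch of hierarchical clustering with hclust(), with cluster validation via the Dunn index; dunn() here is from the clValid package (assumed installed), and the simulated data and the k = 2 cut are illustrative assumptions:

```r
library("clValid")

# Simulate two well-separated groups of points in two dimensions
set.seed(42)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 3), ncol = 2))

d <- dist(x, method = "euclidean")    # the dissimilarity matrix
hc <- hclust(d, method = "complete")  # "complete" is one linkage choice
plot(hc)                              # dendrogram of the clustering

# Cut the tree into two clusters and compute the Dunn index;
# larger values indicate compact, well-separated clusters
clusters <- cutree(hc, k = 2)
dunn(distance = d, clusters = clusters)
```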