
Dimensionality Reduction

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How can we perform unsupervised learning with dimensionality reduction techniques such as Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE)?

Objectives
  • Recall that most data is inherently multidimensional

  • Understand that reducing the number of dimensions can simplify modelling and allow classification to be performed.

  • Recall that PCA is a popular technique for dimensionality reduction.

  • Recall that t-SNE is another technique for dimensionality reduction.

  • Apply PCA and t-SNE with Scikit Learn to an example dataset.

  • Evaluate the relative performance of PCA and t-SNE.

Dimensionality Reduction

Dimensionality reduction is the process of representing a dataset using a smaller set of coordinates, possibly transformed from the originals, that still captures the variation in the features of the dataset. It can be a helpful pre-processing step before other operations on the data, such as classification, regression or visualization.
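
As a tiny illustration of the idea, consider two features where one is nearly a linear function of the other: a single transformed coordinate then captures almost all of the variation. The sketch below is synthetic and separate from the digits example that follows; the numbers are illustrative.

import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 2 * x1 + rng.normal(scale=0.1, size=100)  # nearly redundant feature
data = np.column_stack([x1, x2])

# Singular values of the centred data measure variance along each direction;
# the first ratio printed here is very close to 1, so one coordinate suffices.
_, s, _ = np.linalg.svd(data - data.mean(axis=0))
print(s**2 / (s**2).sum())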

Dimensionality Reduction with Scikit-learn

First, set up our environment and load the scikit-learn digits dataset, a small MNIST-like collection of 8×8 handwritten digit images, which will be used as our initial example.

import numpy as np
import matplotlib.pyplot as plt

from sklearn import decomposition
from sklearn import datasets
from sklearn import manifold

digits = datasets.load_digits()

# Examine the dataset
print(digits.data)
print(digits.target)

X = digits.data
y = digits.target
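
Before reducing anything, it is worth checking how high-dimensional the data actually is. Each sample is an 8×8 image stored as a 64-element row, and scikit-learn also exposes the unflattened images via digits.images (the output filename below is illustrative):

# Each row of X is a flattened 8x8 image, i.e. a point in 64 dimensions
print(X.shape)

# Display the first sample as an image to make those 64 dimensions concrete
plt.matshow(digits.images[0], cmap=plt.cm.gray_r)
plt.title("Label: %d" % digits.target[0])
plt.savefig("digit0.svg")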

Principal Component Analysis (PCA)

PCA is a technique that rotates the data so that it can be decomposed into a set of orthogonal components, that is, linear combinations of the original features, which can be ordered according to the amount of variance in the data that each one carries.
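
One way to see this ordering in practice is to fit a PCA model without limiting the number of components and inspect its explained_variance_ratio_ attribute (a short optional check; the exact values depend on the data):

# Fit PCA with all components and examine how the variance is distributed
pca_full = decomposition.PCA().fit(X)
print(pca_full.explained_variance_ratio_[:5])        # leading components
print(pca_full.explained_variance_ratio_[:2].sum())  # share captured by two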

# PCA: project the 64-dimensional data onto its two leading components
pca = decomposition.PCA(n_components=2)
pca.fit(X)
X_pca = pca.transform(X)

fig = plt.figure(1, figsize=(4, 4))
plt.clf()
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap=plt.cm.nipy_spectral,
            edgecolor='k')
plt.colorbar(boundaries=np.arange(11) - 0.5).set_ticks(np.arange(10))
plt.savefig("pca.svg")

Reduction using PCA

t-distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE is a non-linear technique that embeds high-dimensional data in two or three dimensions while trying to keep similar points close together. It often separates clusters more clearly than PCA, at the cost of considerably more computation.

# t-SNE: embed the 64-dimensional data in two dimensions
tsne = manifold.TSNE(n_components=2, init='pca', random_state=0)
X_tsne = tsne.fit_transform(X)

fig = plt.figure(1, figsize=(4, 4))
plt.clf()
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap=plt.cm.nipy_spectral,
            edgecolor='k')
plt.colorbar(boundaries=np.arange(11) - 0.5).set_ticks(np.arange(10))
plt.savefig("tsne.svg")

Reduction using t-SNE
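
To start evaluating the relative performance of the two techniques, one simple comparison is wall-clock time. The following is a minimal sketch using Python's time module; exact timings depend on your machine and library versions.

import time

start = time.perf_counter()
decomposition.PCA(n_components=2).fit_transform(X)
print("PCA took %.2f s" % (time.perf_counter() - start))

start = time.perf_counter()
manifold.TSNE(n_components=2, init='pca', random_state=0).fit_transform(X)
print("t-SNE took %.2f s" % (time.perf_counter() - start))

On this dataset PCA typically finishes in a fraction of a second, while t-SNE takes noticeably longer; compare the plots above to judge whether the extra cost buys a clearer separation of the digits.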

Exercise: Working in three dimensions

The example above used only two dimensions, since humans can visualize two dimensions very well. However, some datasets require more than two dimensions to be decomposed appropriately. Modify the programs above to use three dimensions and create appropriate plots. Do three dimensions allow you to distinguish between the digits more clearly?

Solution

from mpl_toolkits.mplot3d import Axes3D  # registers the 3d projection

# PCA with three components
pca = decomposition.PCA(n_components=3)
pca.fit(X)
X_pca = pca.transform(X)

fig = plt.figure(1, figsize=(4, 4))
plt.clf()
ax = fig.add_subplot(projection='3d')
ax.scatter(X_pca[:, 0], X_pca[:, 1], X_pca[:, 2], c=y,
           cmap=plt.cm.nipy_spectral, s=9, lw=0)
plt.savefig("pca_3d.svg")

Reduction to 3 components using PCA

# t-SNE embedding with three components
tsne = manifold.TSNE(n_components=3, init='pca', random_state=0)
X_tsne = tsne.fit_transform(X)

fig = plt.figure(1, figsize=(4, 4))
plt.clf()
ax = fig.add_subplot(projection='3d')
ax.scatter(X_tsne[:, 0], X_tsne[:, 1], X_tsne[:, 2], c=y,
           cmap=plt.cm.nipy_spectral, s=9, lw=0)
plt.savefig("tsne_3d.svg")

Reduction to 3 components using t-SNE

Exercise: Parameters

Look up parameters that can be changed in PCA and t-SNE, and experiment with these. How do they change your resulting plots? Might the choice of parameters lead you to make different conclusions about your data?
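
As one possible starting point, the sketch below varies t-SNE's perplexity parameter, which controls the effective neighbourhood size. The values chosen are illustrative; PCA's n_components and t-SNE's learning_rate are other parameters worth exploring.

# Compare t-SNE embeddings for a few perplexity values (illustrative choices)
for perp in [5, 30, 50]:
    tsne = manifold.TSNE(n_components=2, perplexity=perp,
                         init='pca', random_state=0)
    X_p = tsne.fit_transform(X)
    plt.figure(figsize=(4, 4))
    plt.scatter(X_p[:, 0], X_p[:, 1], c=y, cmap=plt.cm.nipy_spectral, s=9)
    plt.title("perplexity=%d" % perp)
    plt.savefig("tsne_perplexity_%d.svg" % perp)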

Exercise: Other Algorithms

There are other algorithms that can be used for dimensionality reduction, for example Higher Order Singular Value Decomposition (HOSVD). Do an internet search for some of these and examine the example data they are used on. Are there cases where they do poorly? What level of care might you need before applying such methods for automation in critical scenarios? What about for interactive data exploration?

Key Points

  • PCA is a linear dimensionality reduction technique for tabular data

  • t-SNE is a non-linear dimensionality reduction technique for tabular data that can capture structure in the data that PCA cannot