Unpacking PCA
Overview
Teaching: 45 min
Exercises: 2 min
Questions
What is the intuition behind how PCA works?
Objectives
explain the overall process of PCA
explain the result of a PCA operation
define a principal component
define dimensionality reduction
explain how PCA can be used as a dimensionality reduction technique
PCA part 2
What Just Happened!?
Intuition:
PCA is a way to rotate the axes of your dataset around the data so that the axes line up with the directions of greatest variation through the data (a short code sketch after the outline below makes this concrete).
2D PCA example, to build orientation and intuition:
- show two variables
- show the PCA result
- the intuition: the axes have been rotated
- each new axis is now made up of proportions of the original axes
- this scales to n dimensions
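To make the "rotation" idea concrete, here is a minimal sketch on synthetic data (the variable names and random data below are illustrative and not part of this lesson's helper functions): the rows of a fitted PCA's components_ matrix are unit-length and mutually perpendicular, so the new axes are just a rotation (possibly with a flip) of the old ones.
import numpy as np
from sklearn.decomposition import PCA
# Illustrative synthetic data: two correlated variables
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.8 * x + rng.normal(scale=0.4, size=500)
X_demo = np.column_stack([x, y])
pca_demo = PCA(n_components=2).fit(X_demo)
# The rows of components_ form an orthonormal set: projecting onto these axes
# preserves distances, it just changes the point of view.
R = pca_demo.components_
print(np.round(R @ R.T, 6))  # ~ 2x2 identity matrix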
relationship between two variables x and y
# Here are some random data features; re-run until you like the result:
from helper_functions import create_feature_scatter_plot
feature_ax, features, axlims = create_feature_scatter_plot(random_state=13)
Reminder: correlated features lead to multicollinearity issues
A common rule of thumb: a VIF score above 10 means you can't meaningfully interpret the model's estimated coefficients or p-values because of confounding effects between predictors.
from statsmodels.stats.outliers_influence import variance_inflation_factor
import pandas as pd
def calc_print_VIF(X):
    # Calculate VIF for each predictor in X
    vif = pd.DataFrame()
    vif["Variable"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

    # Display the VIF values
    print(vif)
calc_print_VIF(pd.DataFrame(features))
Variable VIF
0 0 45.174497
1 1 45.174497
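Where does a number like 45 come from? With only two predictors, the VIF reduces to 1 / (1 - r²), where r is the correlation between them. Here is a back-of-envelope check (a sketch, not part of the lesson's helper code; statsmodels' implementation may differ slightly in how it handles the intercept):
import numpy as np
# correlation between the two features, then the two-predictor VIF formula
r = np.corrcoef(features[:, 0], features[:, 1])[0, 1]
print(1 / (1 - r**2))  # roughly the ~45 reported above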
PCA of those two variables
from sklearn.decomposition import PCA
p = PCA(n_components=2) # instantiate PCA transform
features_pca = p.fit_transform(features) # perform PCA and re-project data
calc_print_VIF(pd.DataFrame(features_pca))
Variable VIF
0 0 1.0
1 1 1.0
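The VIFs drop to 1 because principal-component scores are uncorrelated with one another by construction. A quick check (reusing features_pca from the cell above):
import numpy as np
# Correlation matrix of the PCA scores: the off-diagonal entries are ~0
print(np.round(np.corrcoef(features_pca[:, 0], features_pca[:, 1]), 3))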
plot PCA result
import matplotlib.pyplot as plt
from helper_functions import plot_pca_features
fig, ax, (ax_min, ax_max) = plot_pca_features(features_pca)
plt.show()
original and PCA result comparison
from helper_functions import plot_pca_feature_comparison
point_to_highlight=10
plot_pca_feature_comparison(features, features_pca, ax_max, ax_min, p, point_to_highlight)
plt.show()
The process of PCA is analogous to walking around your data and looking at it from a new angle.
from IPython.display import YouTubeVideo
YouTubeVideo("BorcaCtjmog")
3d data example
https://setosa.io/ev/principal-component-analysis/
This rotation of the axes means that the new PCA axes are made up of proportions of the old axes.
what are those proportions?
The PCA "components_" attribute: the eigenvectors that define each principal component
from helper_functions import show_pcs_on_unit_axes
show_pcs_on_unit_axes(p)
plt.show()
for i, pc in enumerate(p.components_):
    print(f"PC{i}: {pc}")
PC0: [ 0.70710678 -0.70710678]
PC1: [0.70710678 0.70710678]
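Each row of components_ is a unit-length weight vector over the original features, which is what "proportions of the old axes" means here; PC0 weights x and y equally but with opposite signs. A quick check, reusing the fitted p from above:
import numpy as np
# every eigenvector has length 1, so its entries act as relative weights
print(np.linalg.norm(p.components_, axis=1))  # [1. 1.]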
demonstrate transform of one point from original feature space to PC-space
fmt_str = "{:.2f}, {:.2f}"
print("START in feature space:")
print(fmt_str.format(features[point_to_highlight,0],features[point_to_highlight,1]))
print()
print("END: in pca space:")
print(fmt_str.format(features_pca[point_to_highlight,0], features_pca[point_to_highlight,1]))
START in feature space:
1.20, -1.29
END: in pca space:
1.76, -0.07
step 1: scale the feature-space data (note: scikit-learn's PCA only centers the data internally rather than fully standardizing it; in this example the scaled values come out essentially identical to the originals, as the printout below shows)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
features_scaled = scaler.fit(features).transform(features)
print("scaled feature space:")
print("{:.2f}, {:.2f}".format(
features_scaled[point_to_highlight, 0],
features_scaled[point_to_highlight, 1])
)
scaled feature space:
1.20, -1.29
step 2: take the dot product of the feature-space values and the principal-component eigenvectors
# use both x and y coords of original point here as new pc0-coord is combination of both axes!
print("{:.2f}".format(
# first dimension of example point * first dimension of PC0 eigenvector
features_scaled[point_to_highlight, 0] * p.components_[0,0]
+
# second dimension of example point * second dimension of PC0 eigenvector
features_scaled[point_to_highlight, 1] * p.components_[0,1]
)
)
1.76
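The same projection onto PC0 can be written more compactly with np.dot (a small sketch reusing the variables defined above):
import numpy as np
# dot product of the scaled point with the PC0 eigenvector; prints the same 1.76 as the expanded version above
print("{:.2f}".format(np.dot(features_scaled[point_to_highlight], p.components_[0])))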
# Again: use both x and y coords of original point here as new pc1-coord is combination of both axes!
print("{:.2f}".format(
# first dimension of example point * first dimension of PC1 eigenvector
features_scaled[point_to_highlight, 0] * p.components_[1,0]
+
# second dimension of example point * second dimension of PC1 eigenvector
features_scaled[point_to_highlight, 1] * p.components_[1,1]
)
)
-0.07
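In matrix form, steps 1 and 2 happen for every point at once: scikit-learn's transform() subtracts the fitted mean_ and multiplies by the transpose of components_. This sketch (reusing features, p, and features_pca from above) reproduces the PCA output:
import numpy as np
# center with the mean PCA learned, then project onto the principal-component axes
manual_pca = (features - p.mean_) @ p.components_.T
print(np.allclose(manual_pca, features_pca))  # True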
PCA is called a dimensionality REDUCTION technique because, after the rotation, the first dimension explains most of the variability in your data, so later components can be dropped while keeping most of the information.
fig, ax = plt.subplots()
x = ["pc0", "pc1"]
ax.bar(x, p.explained_variance_ratio_)
for i in range(len(x)):
    plt.text(x=i, y=p.explained_variance_ratio_[i],
             s="{:.3f}".format(p.explained_variance_ratio_[i]),
             ha="center")
ax.set_title("explained variance ratio")
plt.show()
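The reduction step itself is just asking PCA for fewer components than you started with, keeping the ones that explain the most variance (per the bar chart above). A minimal sketch:
from sklearn.decomposition import PCA
# keep only PC0: two columns in, one column out
features_1d = PCA(n_components=1).fit_transform(features)
print(features_1d.shape)  # (n_samples, 1)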
Key Points
PCA transforms your original data by projecting it onto new axes
principal components are orthogonal vectors through your data that explain the most variability