
Exploring and Modeling High-Dimensional Data

Exploring high dimensional data

Overview

Teaching: 20 min
Exercises: 2 min
Questions
  • What is a high dimensional dataset?

Objectives
  • Define a dimension, index, and observation

  • Define, identify, and give examples of high dimensional datasets

  • Summarize the dimensionality of a given dataset

Introduction - what is high dimensional data?

What is data?

da·ta

/ˈdadə, ˈdādə/ noun “the quantities, characters, or symbols on which operations are performed by a computer”

—Oxford Languages

Data comes in many formats: structured, semi-structured, or unstructured (e.g., flat files, JSON, raw text).

To analyze any of these, there is a conversion to a numerical representation happening at some point.

A rectangular dataset: the original data set may not be rectangular, and it might require a conversion that produces a high-dimensional rectangular data set.

We’re discussing structured, rectangular data only today.

What is a dimension?

di·men·sion

/dəˈmen(t)SH(ə)n, dīˈmen(t)SH(ə)n/

noun noun: dimension; plural noun: dimensions

  1. a measurable extent of some kind, such as length, breadth, depth, or height.
  2. an aspect or feature of a situation, problem, or thing.

—Oxford Languages

A Tabular/Rectangular Data Context

A schematic of tabular data: columns are features, rows are observations.

Each row is an observation (a sample).

Each column is a feature (a dimension).

The index is not a dimension.

A Dataset

  1. has some number of observations (more than one)
  2. every feature of an observation is a dimension
  3. the number of observations (i.e., the index) is not a dimension (see the short pandas sketch below)
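
A minimal pandas sketch can make these terms concrete; the tiny survey data, the respondent_id index, and the q1/q2 column names here are all made up for illustration.

import pandas as pd

# two survey questions (features -> dimensions), three respondents (observations)
df = pd.DataFrame(
    {'q1': [2, -1, 3], 'q2': [0, 1, -2]},
    index=pd.Index([101, 102, 103], name='respondent_id'),  # the index is NOT a dimension
)
print(df.shape)       # (3, 2): 3 observations, 2 dimensions
print(df.index.name)  # 'respondent_id'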

Examples of datasets with increasing dimensionality

1 D

  1. Likert-scale question (index: respondent_id; question value (-3 to 3))

2 D

  1. scatter plot (x, y)
  2. two question survey (index: respondent_id, q1 answer, q2 answer)
  3. data from temperature logger: (index: logged_value_id, time, value)

3 D

  1. surface (x, y, z)
  2. scatter plot with variable as size per point (x, y, size)
  3. 2d black and white image (x, y, pixel_value)
  4. moves log from a game of ‘battleship’ (index: move number, x-coord, y-coord, hit/miss)
  5. consecutive pulses of CP 1919 (time, x, y)

4 D

  1. surface plus coloration, (x, y, z, color_label)
  2. surface change over time (x, y, z, time)

30 D

  1. Brain connectivity analysis of 30 regions

20,000 D

  1. human gene expression data (e.g., one value per gene for each sample)

Exercise - Battleship moves:

Discussion point: is this dataset 3D or 4D?

Is the move number a dimension or an index?

| move_id | column (A-J) | row (1-10) | hit   |
|---------|--------------|------------|-------|
| 0       | A            | 1          | False |
| 1       | J            | 10         | True  |
| 2       | C            | 7          | False |
| n       | ...          | ...        | ...   |

Solution

3d: move_id is an index!

  1. the order of moves matters, but not the specific value of the move number

4d: move_id is a dimension!

  1. odd or even tells you which player is making each move
  2. the order of moves is important, and when specific moves get made might also matter - what if you wanted to analyze moves as a function of game length?

There is always an index

  1. move_id is an index
  2. that doesn’t mean there is no information there
  3. you can perform some feature engineering with move_id
  4. this would increase the dimensionality of the initial 3D dataset, perhaps adding two more dimensions (see the sketch below):
    1. player
    2. player’s move number
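
A minimal pandas sketch of that feature engineering, assuming two players alternate moves and move_id starts at 0 (the data and column names are made up for illustration):

import pandas as pd

moves = pd.DataFrame(
    {'column': ['A', 'J', 'C'], 'row': [1, 10, 7], 'hit': [False, True, False]},
    index=pd.RangeIndex(3, name='move_id'),
)
# derive two new dimensions from the index
moves['player'] = moves.index % 2                # 0 or 1: which player made the move
moves['player_move_number'] = moves.index // 2   # that player's own move count
print(moves)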

Exercise - Film:

Consider a short, black-and-white, silent film in 4K. It has the following properties:

  1. 1 minute long
  2. 25 frames per second
  3. 4K resolution i.e. 4096 × 2160.
  4. standard color depth 24 bits/pixel

Think of this film as a dataset. How many observations might there be?

Solution:

60 seconds x 25 frames per second = 1500 frames or ‘observations’. Is there another way to think about this?

Exercise: How many dimensions are there per observation?

Solution:

There are three dimensions per observation:

  1. pixel row (0-2159)
  2. pixel col (0-4095)
  3. pixel grey value (0-255)

Exercise: How many dimensions would there be if the film was longer, or shorter?

Solution:

  1. The number of dimensions would NOT change.
  2. There would simply be a greater or fewer number of ‘observations’

Exercise: How many dimensions would there be if the film was in color?

Solution:

4 dimensions.

There is an extra dimension per observation now (see the array-shape sketch after the list below).

  1. channel value (red, green, blue)
  2. pixel row (0-2159)
  3. pixel col (0-4095)
  4. pixel intensity (0-255)
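
One way (of several) to lay these observations out in memory is as NumPy arrays; the shapes below follow directly from the frame count and resolution given above, and np.uint8 storage is an assumption:

import numpy as np

# grayscale film: 1500 frames (observations), each a 2160 x 4096 grid of pixel values
bw_shape = (1500, 2160, 4096)
# color film: an extra channel axis (R, G, B) per pixel
rgb_shape = (1500, 2160, 4096, 3)

# a single grayscale frame (one observation) as an actual array
frame = np.zeros((2160, 4096), dtype=np.uint8)
print(frame.shape, bw_shape, rgb_shape)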

Exercise: Titanic dataset

Look at the Kaggle Titanic dataset.

| passenger_id | pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest | survived |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1216 | 3 | Smyth, Miss. Julia | female | | 0 | 0 | 335432 | 7.7333 | | Q | 13 | | | 1 |
| 699 | 3 | Cacic, Mr. Luka | male | 38.0 | 0 | 0 | 315089 | 8.6625 | | S | | | Croatia | 0 |
| 1267 | 3 | Van Impe, Mrs. Jean Baptiste (Rosalie Paula Govaert) | female | 30.0 | 1 | 1 | 345773 | 24.15 | | S | | | | 0 |
| 449 | 2 | Hocking, Mrs. Elizabeth (Eliza Needs) | female | 54.0 | 1 | 3 | 29105 | 23.0 | | S | 4 | | Cornwall / Akron, OH | 1 |
| 576 | 2 | Veal, Mr. James | male | 40.0 | 0 | 0 | 28221 | 13.0 | | S | | | Barre, Co Washington, VT | 0 |

What column is the index?

Solution:

passenger_id (it uniquely identifies each observation)

Exercise: What columns are the dimensions?

Solution:

  1. pclass
  2. name
  3. sex
  4. age
  5. sibsp
  6. parch
  7. ticket
  8. fare
  9. cabin
  10. embarked
  11. survived

Exercise: how many dimensions are there?

Solution:

11

Exercise: Imagine building a model to predict survival on the Titanic

  1. would you use every dimension?
  2. what makes a dimension useful?
  3. could you remove some dimensions?
  4. could you combine some dimensions?
  5. how would you combine those dimensions?
  6. do you have fewer dimensions after combining?
  7. do you have less information after combining?

Solution:

  1. No, some variables are poor predictors and can be ignored
  2. A dimension is useful if it is (anti-)correlated with survival (in some context), i.e., it carries information.
  3. Yes. Mostly-null columns are not useful (they add no information), and neither are columns that are highly correlated with other columns (no additional information).
  4. Yes.
  5. Maybe add sibsp and parch into one ‘family count’ (see the sketch below).
  6. Yes.
  7. Yes, but you retain more information than if the columns had simply been excluded.
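
A minimal pandas sketch of that combination; the tiny DataFrame below stands in for the full Titanic table and uses the lowercase column names shown above:

import pandas as pd

titanic = pd.DataFrame({'sibsp': [0, 0, 1, 1], 'parch': [0, 0, 1, 3]})

# combine two related dimensions into one engineered feature
titanic['family_count'] = titanic['sibsp'] + titanic['parch']

# dropping the originals reduces dimensionality (with some loss of information)
titanic = titanic.drop(columns=['sibsp', 'parch'])
print(titanic)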

High-Dimensional Data

What is high-dimensional data? Unfortunately, there isn’t a precise definition. Often, when people use the term, they are referring to specific problems and headaches that arise when working with data that has many (typically dozens or more) features (a.k.a. dimensions). These problems are generally referred to as the “curse of dimensionality”.

Curse of Dimensionality

The “curse of dimensionality” refers to the challenges that arise when dealing with data in high-dimensional spaces. These challenges include (among others): data becoming sparse relative to the volume of the space, a greater risk of overfitting, rising computational cost, distance measures becoming less informative, and difficulty visualizing the data.

Throughout this workshop, we’ll see how these challenges, or “curses,” apply to our research goals and explore strategies to address them.

Key Points

  • data can be anything - as long as you can represent it in a computer

  • A dimension is a feature in a dataset - i.e. a column, but NOT an index.

  • an index is not a dimension


The Ames housing dataset

Overview

Teaching: 45 min
Exercises: 2 min
Questions
  • Here we introduce the data we’ll be analyzing

Objectives

Intro to Ames Housing Dataset

Throughout this workshop, we will explore how to efficiently detect patterns and extract insights from high-dimensional data. We will focus on a widely accessible dataset known as the Ames housing data. This dataset will serve as a foundational example, allowing us to grasp the challenges and opportunities presented by high-dimensional data analysis.

Load the dataset

Here we load the dataset from the sklearn library, and see a preview

from sklearn.datasets import fetch_openml

# load the dataset
housing = fetch_openml(name="house_prices", as_frame=True, parser='auto')

df = housing.data.copy(deep=True) # create new DataFrame copy of original dataset
df = df.astype({'Id': int})       # set data type of Id to int
df = df.set_index('Id')           # set Id column to be the index of the DataFrame
df                                # evaluate result
MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig ... ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
Id
1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub Inside ... 0 0 NaN NaN NaN 0 2 2008 WD Normal
2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub FR2 ... 0 0 NaN NaN NaN 0 5 2007 WD Normal
3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub Inside ... 0 0 NaN NaN NaN 0 9 2008 WD Normal
4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub Corner ... 0 0 NaN NaN NaN 0 2 2006 WD Abnorml
5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub FR2 ... 0 0 NaN NaN NaN 0 12 2008 WD Normal
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1456 60 RL 62.0 7917 Pave NaN Reg Lvl AllPub Inside ... 0 0 NaN NaN NaN 0 8 2007 WD Normal
1457 20 RL 85.0 13175 Pave NaN Reg Lvl AllPub Inside ... 0 0 NaN MnPrv NaN 0 2 2010 WD Normal
1458 70 RL 66.0 9042 Pave NaN Reg Lvl AllPub Inside ... 0 0 NaN GdPrv Shed 2500 5 2010 WD Normal
1459 20 RL 68.0 9717 Pave NaN Reg Lvl AllPub Inside ... 0 0 NaN NaN NaN 0 4 2010 WD Normal
1460 20 RL 75.0 9937 Pave NaN Reg Lvl AllPub Inside ... 0 0 NaN NaN NaN 0 6 2008 WD Normal

1460 rows × 79 columns

print(df.columns.tolist())
['MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType', 'SaleCondition']

Dataset Preview Exercises

EXERCISE: How many features are there?

Solution

79

EXERCISE: How many observations are there?

Solution

1460
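
Both counts can be read off at once from the shape of the df we created above:

print(df.shape)  # (1460, 79): 1460 observations, 79 features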

EXERCISE: What are all the feature names?

Solution

print(df.columns.tolist())
['MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType', 'SaleCondition']

How do I find out what these features mean?

Read the Data Dictionary

from IPython.display import display, Pretty

# the housing object we created in step one above contains a Data Dictionary for the Dataset
display(Pretty(housing.DESCR))

Ask a home buyer to describe their dream house, and they probably won’t begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition’s dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

MSSubClass: Identifies the type of dwelling involved in the sale.

20 1-STORY 1946 & NEWER ALL STYLES 30 1-STORY 1945 & OLDER 40 1-STORY W/FINISHED ATTIC ALL AGES 45 1-1/2 STORY - UNFINISHED ALL AGES 50 1-1/2 STORY FINISHED ALL AGES 60 2-STORY 1946 & NEWER 70 2-STORY 1945 & OLDER 75 2-1/2 STORY ALL AGES 80 SPLIT OR MULTI-LEVEL 85 SPLIT FOYER 90 DUPLEX - ALL STYLES AND AGES 120 1-STORY PUD (Planned Unit Development) - 1946 & NEWER 150 1-1/2 STORY PUD - ALL AGES 160 2-STORY PUD - 1946 & NEWER 180 PUD - MULTILEVEL - INCL SPLIT LEV/FOYER 190 2 FAMILY CONVERSION - ALL STYLES AND AGES

MSZoning: Identifies the general zoning classification of the sale.

A Agriculture C Commercial FV Floating Village Residential I Industrial RH Residential High Density RL Residential Low Density RP Residential Low Density Park RM Residential Medium Density

LotFrontage: Linear feet of street connected to property

LotArea: Lot size in square feet

Street: Type of road access to property

Grvl Gravel Pave Paved

Alley: Type of alley access to property

Grvl Gravel Pave Paved NA No alley access

LotShape: General shape of property

Reg Regular IR1 Slightly irregular IR2 Moderately Irregular IR3 Irregular

LandContour: Flatness of the property

Lvl Near Flat/Level Bnk Banked - Quick and significant rise from street grade to building HLS Hillside - Significant slope from side to side Low Depression

Utilities: Type of utilities available

AllPub All public Utilities (E,G,W,& S) NoSewr Electricity, Gas, and Water (Septic Tank) NoSeWa Electricity and Gas Only ELO Electricity only

LotConfig: Lot configuration

Inside Inside lot Corner Corner lot CulDSac Cul-de-sac FR2 Frontage on 2 sides of property FR3 Frontage on 3 sides of property

LandSlope: Slope of property

Gtl Gentle slope Mod Moderate Slope Sev Severe Slope

Neighborhood: Physical locations within Ames city limits

Blmngtn Bloomington Heights Blueste Bluestem BrDale Briardale BrkSide Brookside ClearCr Clear Creek CollgCr College Creek Crawfor Crawford Edwards Edwards Gilbert Gilbert IDOTRR Iowa DOT and Rail Road MeadowV Meadow Village Mitchel Mitchell Names North Ames NoRidge Northridge NPkVill Northpark Villa NridgHt Northridge Heights NWAmes Northwest Ames OldTown Old Town SWISU South & West of Iowa State University Sawyer Sawyer SawyerW Sawyer West Somerst Somerset StoneBr Stone Brook Timber Timberland Veenker Veenker

Condition1: Proximity to various conditions

Artery Adjacent to arterial street Feedr Adjacent to feeder street Norm Normal RRNn Within 200’ of North-South Railroad RRAn Adjacent to North-South Railroad PosN Near positive off-site feature–park, greenbelt, etc. PosA Adjacent to postive off-site feature RRNe Within 200’ of East-West Railroad RRAe Adjacent to East-West Railroad

Condition2: Proximity to various conditions (if more than one is present)

Artery Adjacent to arterial street Feedr Adjacent to feeder street Norm Normal RRNn Within 200’ of North-South Railroad RRAn Adjacent to North-South Railroad PosN Near positive off-site feature–park, greenbelt, etc. PosA Adjacent to postive off-site feature RRNe Within 200’ of East-West Railroad RRAe Adjacent to East-West Railroad

BldgType: Type of dwelling

1Fam Single-family Detached 2FmCon Two-family Conversion; originally built as one-family dwelling Duplx Duplex TwnhsE Townhouse End Unit TwnhsI Townhouse Inside Unit

HouseStyle: Style of dwelling

1Story One story 1.5Fin One and one-half story: 2nd level finished 1.5Unf One and one-half story: 2nd level unfinished 2Story Two story 2.5Fin Two and one-half story: 2nd level finished 2.5Unf Two and one-half story: 2nd level unfinished SFoyer Split Foyer SLvl Split Level

OverallQual: Rates the overall material and finish of the house

10 Very Excellent 9 Excellent 8 Very Good 7 Good 6 Above Average 5 Average 4 Below Average 3 Fair 2 Poor 1 Very Poor

OverallCond: Rates the overall condition of the house

10 Very Excellent 9 Excellent 8 Very Good 7 Good 6 Above Average 5 Average 4 Below Average 3 Fair 2 Poor 1 Very Poor

YearBuilt: Original construction date

YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)

RoofStyle: Type of roof

Flat Flat Gable Gable Gambrel Gabrel (Barn) Hip Hip Mansard Mansard Shed Shed

RoofMatl: Roof material

ClyTile Clay or Tile CompShg Standard (Composite) Shingle Membran Membrane Metal Metal Roll Roll Tar&Grv Gravel & Tar WdShake Wood Shakes WdShngl Wood Shingles

Exterior1st: Exterior covering on house

AsbShng Asbestos Shingles AsphShn Asphalt Shingles BrkComm Brick Common BrkFace Brick Face CBlock Cinder Block CemntBd Cement Board HdBoard Hard Board ImStucc Imitation Stucco MetalSd Metal Siding Other Other Plywood Plywood PreCast PreCast Stone Stone Stucco Stucco VinylSd Vinyl Siding Wd Sdng Wood Siding WdShing Wood Shingles

Exterior2nd: Exterior covering on house (if more than one material)

AsbShng Asbestos Shingles AsphShn Asphalt Shingles BrkComm Brick Common BrkFace Brick Face CBlock Cinder Block CemntBd Cement Board HdBoard Hard Board ImStucc Imitation Stucco MetalSd Metal Siding Other Other Plywood Plywood PreCast PreCast Stone Stone Stucco Stucco VinylSd Vinyl Siding Wd Sdng Wood Siding WdShing Wood Shingles

MasVnrType: Masonry veneer type

BrkCmn Brick Common BrkFace Brick Face CBlock Cinder Block None None Stone Stone

MasVnrArea: Masonry veneer area in square feet

ExterQual: Evaluates the quality of the material on the exterior

Ex Excellent Gd Good TA Average/Typical Fa Fair Po Poor

ExterCond: Evaluates the present condition of the material on the exterior

Ex Excellent Gd Good TA Average/Typical Fa Fair Po Poor

Foundation: Type of foundation

BrkTil Brick & Tile CBlock Cinder Block PConc Poured Contrete Slab Slab Stone Stone Wood Wood

BsmtQual: Evaluates the height of the basement

Ex Excellent (100+ inches) Gd Good (90-99 inches) TA Typical (80-89 inches) Fa Fair (70-79 inches) Po Poor (<70 inches NA No Basement

BsmtCond: Evaluates the general condition of the basement

Ex Excellent Gd Good TA Typical - slight dampness allowed Fa Fair - dampness or some cracking or settling Po Poor - Severe cracking, settling, or wetness NA No Basement

BsmtExposure: Refers to walkout or garden level walls

Gd Good Exposure Av Average Exposure (split levels or foyers typically score average or above) Mn Mimimum Exposure No No Exposure NA No Basement

BsmtFinType1: Rating of basement finished area

GLQ Good Living Quarters ALQ Average Living Quarters BLQ Below Average Living Quarters Rec Average Rec Room LwQ Low Quality Unf Unfinshed NA No Basement

BsmtFinSF1: Type 1 finished square feet

BsmtFinType2: Rating of basement finished area (if multiple types)

GLQ Good Living Quarters ALQ Average Living Quarters BLQ Below Average Living Quarters Rec Average Rec Room LwQ Low Quality Unf Unfinshed NA No Basement

BsmtFinSF2: Type 2 finished square feet

BsmtUnfSF: Unfinished square feet of basement area

TotalBsmtSF: Total square feet of basement area

Heating: Type of heating

Floor Floor Furnace GasA Gas forced warm air furnace GasW Gas hot water or steam heat Grav Gravity furnace OthW Hot water or steam heat other than gas Wall Wall furnace

HeatingQC: Heating quality and condition

Ex Excellent Gd Good TA Average/Typical Fa Fair Po Poor

CentralAir: Central air conditioning

N No Y Yes

Electrical: Electrical system

SBrkr Standard Circuit Breakers & Romex FuseA Fuse Box over 60 AMP and all Romex wiring (Average) FuseF 60 AMP Fuse Box and mostly Romex wiring (Fair) FuseP 60 AMP Fuse Box and mostly knob & tube wiring (poor) Mix Mixed

1stFlrSF: First Floor square feet

2ndFlrSF: Second floor square feet

LowQualFinSF: Low quality finished square feet (all floors)

GrLivArea: Above grade (ground) living area square feet

BsmtFullBath: Basement full bathrooms

BsmtHalfBath: Basement half bathrooms

FullBath: Full bathrooms above grade

HalfBath: Half baths above grade

Bedroom: Bedrooms above grade (does NOT include basement bedrooms)

Kitchen: Kitchens above grade

KitchenQual: Kitchen quality

Ex Excellent Gd Good TA Typical/Average Fa Fair Po Poor

TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)

Functional: Home functionality (Assume typical unless deductions are warranted)

Typ Typical Functionality Min1 Minor Deductions 1 Min2 Minor Deductions 2 Mod Moderate Deductions Maj1 Major Deductions 1 Maj2 Major Deductions 2 Sev Severely Damaged Sal Salvage only

Fireplaces: Number of fireplaces

FireplaceQu: Fireplace quality

Ex Excellent - Exceptional Masonry Fireplace Gd Good - Masonry Fireplace in main level TA Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement Fa Fair - Prefabricated Fireplace in basement Po Poor - Ben Franklin Stove NA No Fireplace

GarageType: Garage location

2Types More than one type of garage Attchd Attached to home Basment Basement Garage BuiltIn Built-In (Garage part of house - typically has room above garage) CarPort Car Port Detchd Detached from home NA No Garage

GarageYrBlt: Year garage was built

GarageFinish: Interior finish of the garage

Fin Finished RFn Rough Finished Unf Unfinished NA No Garage

GarageCars: Size of garage in car capacity

GarageArea: Size of garage in square feet

GarageQual: Garage quality

Ex Excellent Gd Good TA Typical/Average Fa Fair Po Poor NA No Garage

GarageCond: Garage condition

Ex Excellent Gd Good TA Typical/Average Fa Fair Po Poor NA No Garage

PavedDrive: Paved driveway

Y Paved P Partial Pavement N Dirt/Gravel

WoodDeckSF: Wood deck area in square feet

OpenPorchSF: Open porch area in square feet

EnclosedPorch: Enclosed porch area in square feet

3SsnPorch: Three season porch area in square feet

ScreenPorch: Screen porch area in square feet

PoolArea: Pool area in square feet

PoolQC: Pool quality

Ex Excellent Gd Good TA Average/Typical Fa Fair NA No Pool

Fence: Fence quality

GdPrv Good Privacy MnPrv Minimum Privacy GdWo Good Wood MnWw Minimum Wood/Wire NA No Fence

MiscFeature: Miscellaneous feature not covered in other categories

Elev Elevator Gar2 2nd Garage (if not described in garage section) Othr Other Shed Shed (over 100 SF) TenC Tennis Court NA None

MiscVal: $Value of miscellaneous feature

MoSold: Month Sold (MM)

YrSold: Year Sold (YYYY)

SaleType: Type of sale

WD Warranty Deed - Conventional CWD Warranty Deed - Cash VWD Warranty Deed - VA Loan New Home just constructed and sold COD Court Officer Deed/Estate Con Contract 15% Down payment regular terms ConLw Contract Low Down payment and low interest ConLI Contract Low Interest ConLD Contract Low Down Oth Other

SaleCondition: Condition of sale

Normal Normal Sale Abnorml Abnormal Sale - trade, foreclosure, short sale AdjLand Adjoining Land Purchase Alloca Allocation - two linked properties with separate deeds, typically condo with a garage unit Family Sale between family members Partial Home was not completed when last assessed (associated with New Homes)

Downloaded from openml.org.

Feature Exercises

EXERCISE: What information is represented by BsmtFinType2?

Solution

Rating of basement finished area (if multiple types)

EXERCISE: What type of variable is BsmtFinType2 (categorical or numeric, then nominal/ordinal or discrete/continuous)?

Solution

categorical, ordinal

EXERCISE: What information is represented by GrLivArea?

Solution

Above grade (ground) living area square feet

EXERCISE: What type of variable is GrLivArea (categorical or numeric, then nominal/ordinal or discrete/continuous)?

Solution

numeric, discrete

Load the Target (Response) Variable - Housing Price

import pandas as pd

housing_price_df = pd.DataFrame(housing.target) # load data into dataframe
housing_price_df.describe()                     # create numeric summary of that data
SalePrice
count 1460.000000
mean 180921.195890
std 79442.502883
min 34900.000000
25% 129975.000000
50% 163000.000000
75% 214000.000000
max 755000.000000

Price Exercises

EXERCISE: What is the range of housing prices in the dataset?

Solution

min: $34,900, max: $755,000

EXERCISE: Are the price data skewed? What distribution might you expect?

Solution

Yes, the prices show a positive (right) skew. A right skew is expected: most sale prices cluster at lower values, with a long tail of high-priced outliers.
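
One quick way to check this numerically (using the housing_price_df created above) is pandas' built-in skewness estimate; a clearly positive value indicates right skew:

print(housing_price_df['SalePrice'].skew())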

Plot housing price values

import helper_functions
helper_functions.plot_salesprice(
    housing_price_df,
    # ylog=True
)
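
plot_salesprice is a workshop-specific helper; if it isn't available, a rough stand-in histogram can be drawn with matplotlib alone:

import matplotlib.pyplot as plt

plt.hist(housing_price_df['SalePrice'], bins=50)
plt.xlabel('SalePrice')
plt.ylabel('Count')
# plt.yscale('log')  # optional, similar in spirit to the commented-out ylog option above
plt.show()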

Summary

In this session we:

  1. Introduced the Ames Housing Dataset
  2. Determined the number of features and observations
  3. Understood some variables
  4. Viewed the ‘target variable’: SalePrice

Key Points


Predictive vs. explanatory regression

Overview

Teaching: 45 min
Exercises: 2 min
Questions
  • What are the two different goals to keep in mind when fitting machine learning models?

  • What kinds of questions can be answered using linear regression?

  • How can we evaluate a model’s ability to capture a true signal/relationship in the data versus spurious noise?

Objectives
  • Review structure and goals of linear regression

  • Know when to use different model evaluation metrics for different modeling goals

  • Learn how to train and evaluate a predictive machine learning model

  • Understand how to detect underfitting and overfitting in a machine learning model

Linear Regression

Linear regression is a powerful technique that is often used to understand whether and how certain predictor variables (e.g., garage size, year built, etc.) in a dataset linearly relate to some target variable (e.g., house sale prices). Starting with linear models when working with high-dimensional data can offer several advantages, including: interpretability of the fitted coefficients, low computational cost, a reduced tendency to overfit when data are limited, and well-established tools for statistical inference.

While linear models have their merits, it’s important to recognize that they might not capture complex (nonlinear) relationships present in the data. However, they are often the best option available when working in a high-dimensional context unless data is extremely limited.

Goals of Linear Regression

By fitting linear models to the Ames housing dataset, we can…

  1. Predict: Use predictive modeling to predict hypothetical/future sale prices based on observed values of the predictor variables in our dataset (e.g., garage size, year built, etc.).
  2. Explain: Use statistics to make scientific claims concerning which predictor variables have a significant impact on sale price — the target variable (a.k.a. response / dependent variable)

Terminology note: the target variable is also called the response, outcome, or dependent variable; predictor variables are also called features, independent variables, or explanatory variables.

In this workshop, we will explore how we can exploit well-established machine learning methods, including feature selection, and regularization techniques (more on these terms later), to achieve both of the above goals on high-dimensional datasets.

To predict or explain. That is the question.

When trying to model data you use in your work, which goal is typically more prevalent? Do you typically care more about (1) accurately predicting some target variable or (2) making scientific claims concerning the existence of certain relationships between variables?

Solution

In a research setting, explaining relationships typically takes priority over predicting, since explanations hold high value in science, but both goals are often relevant. In industry, the reverse is typically true, as many industry applications place predictive accuracy above explainability. We will explore how these goals align with, and sometimes diverge from, one another throughout the remaining lessons.

Predicting housing prices with a single predictor

We’ll start with the first goal: prediction. How can we use regression models to predict housing sale prices? For clarity, we will approach this question first through the lens of simple univariate regression models.

General procedure for fitting and evaluating predictive models

We’ll follow this general procedure to fit and evaluate predictive models:

  1. Extract predictor(s), X, and target, y, variables
  2. Preprocess the data: check for NaNs and extreme sparsity
  3. Visualize the relationship between X and y
  4. Transform the target variable, if necessary, to get a linear relationship between predictors and target
  5. Train/test split the data
  6. Fit the model to the training data
  7. Evaluate model

    a. Plot the data vs predictions - qualitative assessment

    b. Measure train/test set errors and check for signs of underfitting or overfitting

We’ll start by loading in the Ames housing data as we have done previously in this workshop.

from sklearn.datasets import fetch_openml
housing = fetch_openml(name="house_prices", as_frame=True, parser='auto') #

1) Extract predictor variable and target variable from dataframe

Next, we’ll extract the two variables we’ll use for our model — the target variable that we’ll attempt to predict (SalePrice), and a single predictor variable that will be used to predict the target variable. For this example, we’ll explore how well the “OverallQual” variable (i.e., the predictor variable) can predict sale prices.

OverallQual: Rates the overall material and finish of the house

   10	Very Excellent
   1	Very Poor
# Extract x (predictor) and y (target)
y = housing['target']
predictor = 'OverallQual'
x = housing['data'][predictor]

2) Preprocess the data

# remove columns with NaNs or containing mostly constant values (typically 0's)
from preprocessing import remove_bad_cols
x_good = remove_bad_cols(x, 95)
0 columns removed, 1 remaining.

3) Visualize the relationship between x and y

Before fitting any models in a univariate context, we should first explore the data to get a sense for the relationship between the predictor variable, “OverallQual”, and the response variable, “SalePrice”. If this relationship does not look linear, we won’t be able to fit a good linear model (i.e., a model with low average prediction error in a predictive modeling context) to the data.

import matplotlib.pyplot as plt
plt.scatter(x,y, alpha=.1)
plt.xlabel(predictor)
plt.ylabel('Sale Price');
# plt.savefig('..//fig//regression//intro//scatterplot_x_vs_salePrice.png', bbox_inches='tight', dpi=300, facecolor='white');

4) Transform target variable, if necessary

Unfortunately, sale price appears to grow almost exponentially—not linearly—with the predictor variable. Any line we draw through this data cloud is going to fail in capturing the true trend we see here.

Log scaling

How can we remedy this situation? One common approach is to log transform the target variable. We’ll convert the “SalePrice” variable to its logarithmic form using NumPy’s np.log function. Pandas has a method called apply which can apply an operation to every item in a series, e.g., y.apply(np.log), where y is a pandas series.

import numpy as np
y_log = y.apply(np.log)
plt.scatter(x,y_log, alpha=.1)
plt.xlabel(predictor)
plt.ylabel('Sale Price');
# plt.savefig('..//fig//regression//intro//scatterplot_x_vs_logSalePrice.png', bbox_inches='tight', dpi=300, facecolor='white')

This plot now shows a more linear appearing relationship between the target and predictor variables. Whether or not it is sufficiently linear can be addressed when we evaluate the model’s performance later.

5) Train/test split

Next, we will prepare two subsets of our data to be used for model-fitting and model evaluation. This process is standard for any predictive modeling task that involves a model “learning” from observed data (e.g., fitting a line to the observed data).

During the model-fitting step, we use a subset of the data referred to as training data to estimate the model’s coefficients (the slope and intercept of a univariate model). The univariate model will find a line of best fit through this data.

Next, we can assess the model’s ability to generalize to new datasets by measuring its performance on the remaining, unseen data. This subset of data is referred to as the test data or holdout set. By evaluating the model on the test set, which was not used during training, we can obtain an unbiased estimate of the model’s performance.

If we were to evaluate the model solely on the training data, it could lead to overfitting. Overfitting occurs when the model learns the noise and specific patterns of the training data too well, resulting in poor performance on new data. By using a separate test set, we can identify if the model has overfit the training data and assess its ability to generalize to unseen samples. While overfitting is typically not likely to occur when using only a single predictor variable, it is still a good idea to use a train/test split when fitting univariate models. This can help in detecting unanticipated issues with the data, such as missing values, outliers, or other anomalies that affect the model’s behavior.

The above image is from Badillo et al., 2020. An Introduction to Machine Learning. Clinical Pharmacology & Therapeutics. 107. 10.1002/cpt.1796.

The below code will split our dataset into a training dataset containing 2/3 of the samples and a test set containing the remaining 1/3 of the data. We’ll discuss these different subsets in more detail in just a bit.

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y_log,
                                                    test_size=0.33,
                                                    random_state=0)

print(x_train.shape)
print(x_test.shape)
(978,)
(482,)

Reshape single-var predictor matrix in preparation for model-fitting step (requires a 2-D representation)

x_train = x_train.values.reshape(-1,1)
x_test = x_test.values.reshape(-1,1)
print(x_train.shape)
print(x_test.shape)
(978, 1)
(482, 1)

6) Fit the model to the training dataset

During the model fitting step, we use a subset of the data referred to as training data to estimate the model’s coefficients. The univariate model will find a line of best fit through this data.

The sklearn library

When fitting linear models solely for predictive purposes, the scikit-learn or “sklearn” library is typically used. Sklearn offers a broad spectrum of machine learning algorithms beyond linear regression. Having multiple algorithms available in the same library allows you to switch between different models easily and experiment with various techniques without switching libraries. Sklearn is also optimized for performance and efficiency, which is beneficial when working with large datasets. It can efficiently handle large-scale linear regression tasks, and if needed, you can leverage tools like NumPy and SciPy, which are well-integrated with scikit-learn for faster numerical computations.

from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(x_train,y_train)

7) Evaluate model

a) Plot the data vs predictions - qualitative assessment

y_pred_train=reg.predict(x_train)
y_pred_test=reg.predict(x_test)
from regression_predict_sklearn import plot_train_test_predictions

help(plot_train_test_predictions)
Help on function plot_train_test_predictions in module regression_predict_sklearn:

plot_train_test_predictions(predictors: List[str], X_train: Union[numpy.ndarray, pandas.core.series.Series, pandas.core.frame.DataFrame], X_test: Union[numpy.ndarray, pandas.core.series.Series, pandas.core.frame.DataFrame], y_train: Union[numpy.ndarray, pandas.core.series.Series], y_test: Union[numpy.ndarray, pandas.core.series.Series], y_pred_train: Union[numpy.ndarray, pandas.core.series.Series], y_pred_test: Union[numpy.ndarray, pandas.core.series.Series], y_log_scaled: bool, plot_raw: bool, err_type: Optional[str] = None, train_err: Optional[float] = None, test_err: Optional[float] = None) -> Tuple[Optional[matplotlib.figure.Figure], Optional[matplotlib.figure.Figure]]
    Plot true vs. predicted values for train and test sets and line of best fit.

    Args:
        predictors (List[str]): List of predictor names.
        X_train (Union[np.ndarray, pd.Series, pd.DataFrame]): Training feature data.
        X_test (Union[np.ndarray, pd.Series, pd.DataFrame]): Test feature data.
        y_train (Union[np.ndarray, pd.Series]): Actual target values for the training set.
        y_test (Union[np.ndarray, pd.Series]): Actual target values for the test set.
        y_pred_train (Union[np.ndarray, pd.Series]): Predicted target values for the training set.
        y_pred_test (Union[np.ndarray, pd.Series]): Predicted target values for the test set.
        y_log_scaled (bool): Whether the target values are log-scaled or not.
        plot_raw (bool): Whether to plot raw or log-scaled values.
        err_type (Optional[str]): Type of error metric.
        train_err (Optional[float]): Training set error value.
        test_err (Optional[float]): Test set error value.

    Returns:
        Tuple[Optional[plt.Figure], Optional[plt.Figure]]: Figures for true vs. predicted values and line of best fit.
(fig1, fig2) = plot_train_test_predictions(predictors=[predictor],
                                           X_train=x_train, X_test=x_test,
                                           y_train=y_train, y_test=y_test,
                                           y_pred_train=y_pred_train, y_pred_test=y_pred_test,
                                           y_log_scaled=True, plot_raw=True);

# print(type(fig1))
# import matplotlib.pyplot as plt
# import pylab as pl
# pl.figure(fig1.number)
# plt.savefig('..//fig//regression//intro//univariate_truePrice_vs_predPrice.png',bbox_inches='tight', dpi=300)
# pl.figure(fig2.number)
# fig2.savefig('..//fig//regression//intro//univariate_x_vs_predPrice.png',bbox_inches='tight', dpi=300)

Inspect the plots

  1. Does the model capture the variability in sale prices well? Would you use this model to predict the sale price of a house? Why or why not?

  2. Does the model seem to exhibit any signs of overfitting? What about underfitting?

  3. How might you improve the model?

Solution

  1. Based on visual inspection, this linear model does a fairly good job in capturing the relationship between “OverallQual” and sale price. However, there is a tendency for the model to underpredict more expensive homes and overpredict less expensive homes.

  2. Since the train and test set plots look very similar, overfitting is not a concern. Generally speaking, overfitting is not encountered with univariate models unless you have an incredibly small number of samples to train the model on. Since the model follows the trajectory of sale price reasonably well, it also does not appear to underfit the data (at least not to an extreme extent).

  3. In order to improve this model, we can ask ourselves — is “OverallQual” likely the only variable that contributes to final sale price, or should we consider additional predictor variables? Most outcome variables can be influenced by more than one predictor variable. By accounting for all predictors that have an impact on sales price, we can improve the model.

b) Measure train/test set errors and check for signs of underfitting or overfitting

While qualitative examinations of model performance are extremely helpful, it is always a good idea to pair such evaluations with a quantitative analysis of the model’s performance.

Convert back to original data scale

There are several error measurements that can be used to measure a regression model’s performance. Before we implement any of them, we’ll first convert the log(salePrice) back to original sale price for ease of interpretation.

expY_train = np.exp(y_train)
pred_expY_train = np.exp(y_pred_train)

expY_test = np.exp(y_test)
pred_expY_test = np.exp(y_pred_test)

Measure baseline performance

from math import sqrt
import pandas as pd

baseline_predict = y.mean()
print('mean sale price =', baseline_predict)
# convert to series same length as y sets for ease of comparison
baseline_predict = pd.Series(baseline_predict)
baseline_predict = baseline_predict.repeat(len(y))
baseline_predict
mean sale price = 180921.19589041095
0    180921.19589
0    180921.19589
0    180921.19589
0    180921.19589
0    180921.19589
         ...
0    180921.19589
0    180921.19589
0    180921.19589
0    180921.19589
0    180921.19589
Length: 1460, dtype: float64

Root Mean Squared Error (RMSE): The RMSE provides an easy-to-interpret number that represents error in terms of the units of the target variable. With our univariate model, the “OverallQual” predictor variable (a.k.a. model feature) predicts sale prices within roughly +/- $44,763 of the true sale price (the test-set RMSE computed below). We always use the RMSE of the test set to assess the model’s ability to generalize to unseen data. A train-set error that is much lower than the test-set error is a good indicator of overfitting.

from sklearn import metrics

RMSE_baseline = metrics.mean_squared_error(y, baseline_predict, squared=False)
RMSE_train = metrics.mean_squared_error(expY_train, pred_expY_train, squared=False)
RMSE_test = metrics.mean_squared_error(expY_test, pred_expY_test, squared=False)

print(f"Baseline RMSE = {RMSE_baseline}")
print(f"Train RMSE = {RMSE_train}")
print(f"Test RMSE = {RMSE_test}")
Baseline RMSE = 79415.29188606751
Train RMSE = 45534.34940950763
Test RMSE = 44762.77229823455

Here, both train and test RMSE are very similar to one another. As expected with most univariate models, we do not see any evidence of overfitting. This model performs substantially better than the baseline. However, an average error of +/- $44,763 is likely too high for this model to be useful in practice. That is, the model is underfitting the data, given its poor ability to predict the true housing prices.

Mean Absolute Percentage Error: What if we wanted to know the percent difference between the true sale price and the predicted sale price? For this, we can use the mean absolute percentage error (MAPE)
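
For reference, the direct sklearn call is a one-line change from the RMSE code above (a sketch using the variables already defined):

from sklearn import metrics

# MAPE is returned as a fraction (e.g., 0.17 means 17% average error)
MAPE_train = metrics.mean_absolute_percentage_error(expY_train, pred_expY_train)
MAPE_test = metrics.mean_absolute_percentage_error(expY_test, pred_expY_test)
print(MAPE_train, MAPE_test)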

Practice using helper function, measure_model_err

This code will be identical to the code above except for changing metrics.mean_squared_error to metrics.mean_absolute_percentage_error.

Rather than copying and pasting the code above, let’s try using one of the helper functions provided for this workshop.

from regression_predict_sklearn import measure_model_err
help(measure_model_err)
Help on function measure_model_err in module regression_predict_sklearn:

measure_model_err(y: Union[numpy.ndarray, pandas.core.series.Series], baseline_pred: Union[float, numpy.float64, numpy.float32, int, numpy.ndarray, pandas.core.series.Series], y_train: Union[numpy.ndarray, pandas.core.series.Series], y_pred_train: Union[numpy.ndarray, pandas.core.series.Series], y_test: Union[numpy.ndarray, pandas.core.series.Series], y_pred_test: Union[numpy.ndarray, pandas.core.series.Series], metric: str, y_log_scaled: bool) -> pandas.core.frame.DataFrame
    Measures the error of a regression model's predictions on train and test sets.

    Args:
        y (Union[np.ndarray, pd.Series]): Actual target values for full dataset (not transformed)
        baseline_pred (Union[float, np.float64, np.float32, int, np.ndarray, pd.Series]): Single constant or array of predictions equal to the length of y. Baseline is also not transformed.
        y_train (Union[np.ndarray, pd.Series]): Actual target values for the training set.
        y_pred_train (Union[np.ndarray, pd.Series]): Predicted target values for the training set.
        y_test (Union[np.ndarray, pd.Series]): Actual target values for the test set.
        y_pred_test (Union[np.ndarray, pd.Series]): Predicted target values for the test set.
        metric (str): The error metric to calculate ('RMSE', 'R-squared', or 'MAPE').
        y_log_scaled (bool): Whether the target values are log-scaled or not.

    Returns:
        pd.DataFrame: A DataFrame containing the error values for the baseline, training set, and test set.
error_df = measure_model_err(y=y, baseline_pred=baseline_predict,
                             y_train=expY_train, y_pred_train=pred_expY_train,
                             y_test=expY_test, y_pred_test=pred_expY_test,
                             metric='MAPE', y_log_scaled=False)

error_df.head()
Baseline Error Train Error Test Error
0 0.363222 0.187585 0.16754

With the MAPE measurement (expressed as a fraction, where 1 corresponds to 100%), we can state that our model over/under-estimates sale prices by an average of 16.8% (18.8%) across all houses included in the test set (train set). There certainly seems to be room for improvement based on this measure.

R-Squared: Another useful error measurement to use with regression models is the coefficient of determination $R^2$. Oftentimes pronounced simply “R-squared”, this measure assesses the proportion of the variation in the target variable that is predictable from the predictor variable(s). Using sklearn’s metrics, we can calculate this as follows:

error_df = measure_model_err(y=y, baseline_pred=baseline_predict,
                                                   y_train=expY_train, y_pred_train=pred_expY_train,
                                                   y_test=expY_test, y_pred_test=pred_expY_test,
                                                   metric='R-squared', y_log_scaled=False)

error_df.head()
Baseline Error Train Error Test Error
0 0.0 0.666875 0.690463

Our model explains 69.0% (66.7%) of the variance across sale prices in the test set (train set). The R-squared for the baseline model is 0 because the numerator and denominator in the equation for R-squared are equivalent:

R-squared equation: $R^2 = 1 - \frac{\text{Sum of squared residuals}}{\text{Total sum of squares}}$

Sum of Squared Residuals (SSR): $SSR = \sum\left(\text{Actual Value} - \text{Predicted Value}\right)^2$ over all data points. Residuals are the differences between the actual observed values and the predicted values produced by the model. Squaring these differences and summing them up yields the SSR, which captures the variability of the residuals in a regression model.

Total Sum of Squares (TSS): $TSS = \sum\left(\text{Actual Value} - \text{Mean of Actual Values}\right)^2$ over all data points. The TSS represents the total variability or dispersion in the observed values of the target variable. It measures the total squared differences between each data point’s value and the mean of the observed values.
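
A short sketch computing this formula by hand and checking it against sklearn's r2_score, using the test-set variables defined above:

import numpy as np
from sklearn import metrics

ssr = np.sum((expY_test - pred_expY_test) ** 2)      # sum of squared residuals
tss = np.sum((expY_test - np.mean(expY_test)) ** 2)  # total sum of squares
print(1 - ssr / tss)
print(metrics.r2_score(expY_test, pred_expY_test))   # should match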

To read more about additional error/loss measurements, visit sklearn’s metrics documentation.

More on R-squared

Our above example model is able to explain roughly 69% of the variance in the test dataset. Is this a “good” value for R-squared?

Solution

The answer to this question depends on your objective for the regression model. This relates back to the two modeling goals of explaining vs predicting. Depending on the objective, the answer to “What is a good value for R-squared?” will be different.

Predicting the response variable: If your main objective is to predict the value of the response variable accurately using the predictor variable, then R-squared is important. The value for R-squared can range from 0 to 1. A value of 0 indicates that the response variable cannot be explained by the predictor variable at all. A value of 1 indicates that the response variable can be perfectly explained without error by the predictor variable. In general, the larger the R-squared value, the more precisely the predictor variables are able to predict the value of the response variable. How high an R-squared value needs to be depends on how precise you need to be for your specific model’s application. To find out what is considered a “good” R-squared value, you will need to explore what R-squared values are generally accepted in your particular field of study.

Explaining the relationship between the predictor(s) and the response variable: If your main objective for your regression model is to explain the relationship(s) between the predictor(s) and the response variable, the R-squared is mostly irrelevant. A predictor variable that consistently relates to a change in the response variable (i.e., has a statistically significant effect) is typically interesting, regardless of the effect size. The exception to this rule is if you have a near-zero R-squared, which suggests that the model does not explain any of the variance in the data.

Comparing univariate predictive models

Let’s see how well the other predictors in our dataset can predict sale prices. For simplicity, we’ll compare just continuous predictors for now.

General procedure for comparing predictive models

We’ll follow this general procedure to compare models:

  1. Use get_feat_types() to get a list of continuous predictors
  2. Create an X variable containing only continuous predictors from housing['data']
  3. Extract sale prices from housing['target'] and log scale it
  4. Use the remove_bad_cols helper function to remove predictors with nans or containing > 97% constant values (typically 0’s)
  5. Perform a train/validation/test split using 60% of the data to train, 20% for validation (model selection), and 20% for final testing of the data
  6. Use the compare_models helper function to quickly calculate train/validation errors for all possible single predictors. Returns a df_model_err df that contains the following data stored for each predictor: ‘Predictor Variable’, ‘Train Error’, ‘Validation Error’.
  7. Once the best model is selected, get a final assessment of the model’s generalizability using the test set data
# preprocess
from preprocessing import get_feat_types
predictor_type_dict = get_feat_types()
continuous_fields = predictor_type_dict['continuous_fields']
X = housing['data'][continuous_fields]
y = housing['target']
y_log = np.log(y)

# remove columns with NaNs or containing mostly constant values (typically 0's)
from preprocessing import remove_bad_cols
X_good = remove_bad_cols(X, 99)
4 columns removed, 30 remaining.
Columns removed: ['LotFrontage', 'MasVnrArea', 'GarageYrBlt', 'PoolArea']
# train/holdout split
X_train, X_holdout, y_train, y_holdout = train_test_split(X_good, y_log,
                                                          test_size=0.4,
                                                          random_state=0)

# validation/test split
X_val, X_test, y_val, y_test = train_test_split(X_holdout, y_holdout,
                                                test_size=0.5,
                                                random_state=0)
from regression_predict_sklearn import compare_models
help(compare_models)
Help on function compare_models in module regression_predict_sklearn:

compare_models(y: Union[numpy.ndarray, pandas.core.series.Series], baseline_pred: Union[numpy.ndarray, pandas.core.series.Series], X_train: pandas.core.frame.DataFrame, y_train: Union[numpy.ndarray, pandas.core.series.Series], X_val: pandas.core.frame.DataFrame, y_val: Union[numpy.ndarray, pandas.core.series.Series], predictors_list: List[List[str]], metric: str, y_log_scaled: bool, model_type: str, include_plots: bool, plot_raw: bool, verbose: bool) -> pandas.core.frame.DataFrame
    Compare different models based on predictor variables and evaluate their errors.

    Args:
        y (Union[np.ndarray, pd.Series]): Target variable in its original scale (raw/untransformed).
        baseline_pred (Union[np.ndarray, pd.Series]): Baseline predictions (in same scale as original target, y).
        X_train (pd.DataFrame): Training feature data.
        y_train (Union[np.ndarray, pd.Series]): Actual target values for the training set.
        X_val (pd.DataFrame): Validation feature data.
        y_val (Union[np.ndarray, pd.Series]): Actual target values for the validation set.
        predictors_list (List[List[str]]): List of predictor variables for different models.
        metric (str): The error metric to calculate.
        y_log_scaled (bool): Whether the model was trained on log-scaled target values or not.
        model_type (str): Type of the model being used.
        include_plots (bool): Whether to include plots.

    Returns:
        pd.DataFrame: A DataFrame containing model errors for different predictor variables.
df_model_err = compare_models(y=y, baseline_pred=baseline_predict,
                              X_train=X_train, y_train=y_train,
                              X_val=X_val, y_val=y_val,
                              predictors_list=X_train.columns,
                              metric='RMSE', y_log_scaled=True,
                              model_type='unregularized',
                              include_plots=False, plot_raw=False, verbose=False)

df_model_err.head()
Baseline Error Train Error Validation Error Predictors Trained Model
0 79415.291886 82875.380855 84323.189234 LotArea LinearRegression()
1 79415.291886 67679.790920 69727.341057 YearBuilt LinearRegression()
2 79415.291886 69055.741014 70634.285653 YearRemodAdd LinearRegression()
3 79415.291886 45516.185542 46993.501006 OverallQual LinearRegression()
4 79415.291886 81016.566207 84915.452252 OverallCond LinearRegression()
from regression_predict_sklearn import compare_models_plot
sorted_predictors, train_errs, val_errs = compare_models_plot(df_model_err, 'RMSE');
Best model train error = 45516.18554163278
Best model validation error = 46993.501005708364
Worst model train error = 63479.544551733954
Worst model validation error = 220453.4404000341

Assess best model’s generalizability

Get the final assessment of the model’s generalizability using the test set data.

best_model = df_model_err.loc[df_model_err['Predictors']=='OverallQual', 'Trained Model'].item()
y_test_pred = best_model.predict(np.array(X_test['OverallQual']).reshape(-1,1))
test_err = metrics.mean_squared_error(np.exp(y_test), np.exp(y_test_pred), squared=False)
test_err
42298.77889761536

Examining the worst performers

df_model_err = compare_models(y=y, baseline_pred=baseline_predict,
                              X_train=X_train, y_train=y_train,
                              X_val=X_val, y_val=y_val,
                              predictors_list=sorted_predictors[-3:],
                              metric='RMSE', y_log_scaled=True,
                              model_type='unregularized',
                              include_plots=True, plot_raw=True, verbose=False)

df_model_err.head()
Baseline Error Train Error Validation Error Predictors Trained Model
0 79415.291886 65085.562455 105753.386038 1stFlrSF LinearRegression()
1 79415.291886 60495.941297 106314.048186 GrLivArea LinearRegression()
2 79415.291886 63479.544552 220453.440400 TotalBsmtSF LinearRegression()

Outliers and interactions

It appears the worst performing predictors do not have much of a linear relationship with log(salePrice) and have some extreme outliers in the test set data. If we were only to focus on univariate models, we would want to remove these outliers after carefully considering their meaning and cause. However, outliers in a univariate context may not remain outliers in a multivariate context.

This point is further illustrated by the distributions / data clouds we see with the TotalBsmtSF predictor. The type of basement finish may change the relationship between TotalBsmtSF and SalePrice. If we fit a regression model that accounts for this interaction, the model will follow a linear pattern for each distribution separately. Similarly, certain outliers may stem from other predictors having interactions/relationships with one another. When searching for outliers, it is important to consider such multivariate interactions.

Fitting all predictors

Let’s assume all predictors in the Ames housing dataset are related to sale price to some extent and fit a multivariate regression model using all continuous predictors.

df_model_err = compare_models(y=y, baseline_pred=baseline_predict,
                              X_train=X_train, y_train=y_train,
                              X_val=X_val, y_val=y_val,
                              predictors_list=[X_train.columns],
                              metric='RMSE', y_log_scaled=True,
                              model_type='unregularized',
                              include_plots=True, plot_raw=True, verbose=True)

# of predictor vars = 30
# of train observations = 876
# of test observations = 292
Baseline RMSE = 79415.29188606751
Train RMSE = 34139.3449119712
Holdout RMSE = 134604.1997549234
(Holdout-Train)/Train: 294%

Compare permutations of models with different numbers of predictors

Let’s see how well we can predict house prices with different numbers of predictors.

from preprocessing import get_predictor_combos
sampled_combinations = get_predictor_combos(X_train=X_train, K=2, n=30)
print(sampled_combinations[0:2])
[['TotalBsmtSF', '3SsnPorch'], ['YearBuilt', 'OverallQual']]

Compare efficacy of different numbers of predictors

To quickly assess how well we can predict sale price with varying numbers of predictors, use the code we just prepared in conjunction with a for loop to determine the best train/validation errors possible when testing 25 permutations containing K=1, 2, 5, 10, 20, and 25 predictors. Plot the results (K vs train/validation errors). Is there any trend?

best_train_errs = []
best_val_errs = []
n_predictors  = [1, 2, 5, 10, 20, 25]

for K in n_predictors:
    print('K =', K)
    sampled_combinations = get_predictor_combos(X_train=X_train, K=K, n=25)
    df_model_err = compare_models(y=y, baseline_pred=baseline_predict,
                                  X_train=X_train, y_train=y_train,
                                  X_val=X_val, y_val=y_val,
                                  predictors_list=sampled_combinations,
                                  metric='RMSE', y_log_scaled=True,
                                  model_type='unregularized',
                                  include_plots=False, plot_raw=True, verbose=False)

    sorted_predictors, train_errs, val_errs = compare_models_plot(df_model_err, 'RMSE')
    best_train_errs.append(np.median(train_errs))
    best_val_errs.append(np.median(val_errs))
K = 1
Best model train error = 45516.18554163278
Best model validation error = 46993.501005708364
Worst model train error = 63479.544551733954
Worst model validation error = 220453.4404000341

K = 2
Best model train error = 44052.332399325314
Best model validation error = 47836.516342128445
Worst model train error = 58691.19123135773
Worst model validation error = 215272.44855611687

K = 5
Best model train error = 43122.17607559138
Best model validation error = 45562.43051930849
Worst model train error = 62279.367627167645
Worst model validation error = 333222.31164330646

K = 10
Best model train error = 38240.57808718848
Best model validation error = 50883.52958766488
Worst model train error = 52066.92688665911
Worst model validation error = 357384.2570907179

K = 20
Best model train error = 33729.291727505
Best model validation error = 84522.32196428241
Worst model train error = 39417.4765728046
Worst model validation error = 189314.02234304545

K = 25
Best model train error = 33143.6127150748
Best model validation error = 96832.07020531746
Worst model train error = 38689.73868462601
Worst model validation error = 192248.540723152
plt.plot(n_predictors, best_train_errs, '-o', label='Training Error')
plt.plot(n_predictors, best_val_errs, '-o', label='Validation Error')
plt.xlabel('Number of Predictors')
plt.ylabel('Error')
plt.title('Training and Validation Errors')
plt.legend()  # This adds the legend based on the labels provided above
plt.grid(True)

# Save the plot to a file
# plt.savefig('..//fig//regression//intro//Npredictor_v_error.png', bbox_inches='tight', dpi=300, facecolor='white')

How much data is needed per new predictor? 10X rule of thumb

As the number of observations begins to approach the number of model parameters (i.e., coefficients being estimated), the model will simply memorize the training data rather than learn anything useful. As a general rule of thumb, obtaining reliable estimates from linear regression models requires that you have at least 10X as many observations as model coefficients/predictors. The exact ratio may change depending on the variability of your data and whether or not each observation is truly independent (time-series models, for instance, often require much more data since observations are rarely independent).

Let’s see what the ratio is when we start to hit overfitting effects with our data. We need to determine the number of observations used to train the model as well as the number of estimated coefficients from the model (equal to number of predictors in this simple regression equation).

[X_train.shape[0] / n for n in n_predictors]
[876.0, 438.0, 175.2, 87.6, 43.8, 35.04]

With our data, we start to see overfitting effects even when we have as many as 87.6 times as many observations as estimated model coefficients. If you find that your regression model is more prone to overfitting than the “10X rule” would suggest, it could mean that the training data might not be strictly independently and identically distributed (i.i.d.). Overfitting occurs when a model learns the noise and random fluctuations in the training data instead of the true underlying patterns, leading to poor generalization to new data.

The reasons for overfitting can vary, but common culprits include having too few observations relative to the number of predictors, training observations that are not truly independent (e.g., repeated or time-ordered measurements), and noisy measurements that the model ends up memorizing.

Univariate results as a feature selection method

Given the harsh reality of overfitting concerns, working in high-dimensional settings typically requires researchers to explore a number of “feature selection” methods. When working strictly in a predictive modeling context, you can use brute force methods to test different permutations of predictors. However, such permutation methods can become computationally expensive as the number of predictors begins to increase. One shortcut to brute force permutation testing is to select features based on their performance in a univariate modeling context.
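The lesson uses a pre-built helper for this step (shown just below); as a generic, hedged sketch of the idea using plain scikit-learn (RMSE as the ranking criterion is an assumption), univariate screening fits one single-predictor model per column, scores each on the validation set, and keeps the N best performers:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def univariate_screen(X_train, y_train, X_val, y_val, n_keep=5):
    """Rank predictors by univariate validation RMSE and keep the best n_keep."""
    scores = {}
    for col in X_train.columns:
        model = LinearRegression().fit(X_train[[col]], y_train)
        preds = model.predict(X_val[[col]])
        scores[col] = np.sqrt(mean_squared_error(y_val, preds))  # RMSE
    return sorted(scores, key=scores.get)[:n_keep]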

X_train.shape
(876, 30)
from feature_selection import get_best_uni_predictors

top_features = get_best_uni_predictors(N_keep=5, y=y, baseline_pred=y.mean(),
                                       X_train=X_train, y_train=y_train,
                                       X_val=X_val, y_val=y_val,
                                       metric='RMSE', y_log_scaled=True)

top_features
['OverallQual', 'GarageCars', 'YearBuilt', 'YearRemodAdd', 'FullBath']
from regression_predict_sklearn import fit_eval_model
help(fit_eval_model)
Help on function fit_eval_model in module regression_predict_sklearn:

fit_eval_model(y: Union[numpy.ndarray, pandas.core.series.Series], baseline_pred: Union[numpy.ndarray, pandas.core.series.Series], X_train: Union[numpy.ndarray, pandas.core.frame.DataFrame], y_train: Union[numpy.ndarray, pandas.core.series.Series], X_test: Union[numpy.ndarray, pandas.core.frame.DataFrame], y_test: Union[numpy.ndarray, pandas.core.series.Series], predictors: Union[str, List[str]], metric: str, y_log_scaled: bool, model_type: str, include_plots: bool, plot_raw: bool, verbose: bool, cv: int = None, alphas: Union[numpy.ndarray, List[float]] = None, max_iter: int = None) -> Tuple[float, float, float]
    Fits a linear regression model using specified predictor variables and evaluates its performance.

    Args:
        ... (existing arguments)

        cv (int, optional): Number of cross-validation folds. Applicable when model_type is LassoCV.
        alphas (Union[np.ndarray, List[float]], optional): List of alphas to tune the LassoCV model.
                                                          Applicable when model_type is LassoCV.
        max_iter (int, optional): Maximum number of iterations for the LassoCV solver.

    Returns:
        Tuple[float, float, float]: Baseline error, training error, and test error.
fit_eval_model(y=y,baseline_pred=y.mean(), X_train=X_train, y_train=y_train,
                   X_test=X_test, y_test=y_test,
                   predictors=top_features,
                   metric='RMSE',
                   y_log_scaled=True,
                   model_type='unregularized',
                   include_plots=True,
                   plot_raw=True,
                   verbose=True);
# of predictor vars = 5
# of train observations = 876
# of test observations = 292
Baseline RMSE = 79415.29188606751
Train RMSE = 41084.668354744696
Holdout RMSE = 36522.31626858727
(Holdout-Train)/Train: -11%

7) Explaining models

At this point, we have assessed the predictive accuracy of our model. However, what if we want to interpret our model to understand which predictor(s) have a consistent or above-chance (i.e., statistically significant) impact on sale price? For this kind of question and other questions related to model interpretability, we need to first carefully validate our model. The next two episodes will explore some of the necessary checks you must perform before reading too far into your model’s estimations.

Key Points

  • Linear regression models can be used to predict a target variable and/or to reveal relationships between variables

  • Linear models are most effective when applied to linear relationships. Data transformation techniques can be used to help ensure that only linear relationships are modelled.

  • Train/test splits are used to assess under/overfitting in a model

  • Different model evaluation metrics provide different perspectives of model error. Some error measurements, such as R-squared, are not as relevant for explanatory models.


Model validity - relevant predictors

Overview

Teaching: 45 min
Exercises: 2 min
Questions
  • What are the benefits/costs of including additional predictors in a regression model?

Objectives
  • Understand the importance of including relevant predictors in a model.

Model Validity And Interpretation

While using models strictly for predictive purposes is a completely valid approach for some domains and problems, researchers typically care more about being able to interpret their models such that interesting relationships between predictor(s) and target can be discovered and measured. When interpreting a linear regression model, we can look at the model’s estimated coefficients and p-values associated with each predictor to better understand the model. The coefficient’s magnitude can inform us of the effect size associated with a predictor, and the p-value tells us whether or not a predictor has a consistent (statistically significant) effect on the target.

Before we can blindly accept the model’s estimated coefficients and p-values, we must answer three questions that will help us determine whether or not our model is valid.

Model Validity Assessments

  1. Accounting for relevant predictors: Have we included all relevant predictors in the model?
  2. Bias/variance or under/overfitting: Does the model capture the variability of the target variable well? Does the model generalize well?
  3. Model assumptions: Does the fitted model follow the 5 assumptions of linear regression?

We will discuss the first two assessments in detail throughout this episode.

1. Relevant predictors

Benefits and drawbacks of including all relevant predictors

What do you think might be some benefits of including all relevant predictors in a model that you intend to use to explain relationships? Are there any drawbacks you can think of?

Solution

Including all relevant predictor variables in a model is important for several reasons:

  1. Improving model interpretability: Leaving out relevant predictors can result in model misspecification. Misspecification refers to a situation where the model structure or functional form does not accurately reflect the underlying relationship between the predictors and the outcome. If a relevant predictor is omitted from the model, the coefficients of the remaining predictors may be biased. This occurs because the omitted predictor may have a direct or indirect relationship with the outcome variable, and its effect is not accounted for in the model. Consequently, the estimated coefficients of other predictors may capture some of the omitted predictor’s effect, leading to biased estimates.

  2. Improving predictive accuracy and reducing residual variance: Omitting relevant predictors can increase the residual variance in the model. Residual variance represents the unexplained variation in the outcome variable after accounting for the predictors in the model. If a relevant predictor is left out, the model may not fully capture the systematic variation in the data, resulting in larger residuals and reduced model fit. While it is true that, in a research setting, we typically care more about being able to interpret our model than being able to perfectly predict the target variable, a model that severely underfits is still a cause for concern since the model won’t be capturing the variability of the data well enough to form any conclusions.

  3. Robustness to future changes: This benefit only applies to predictive modeling tasks where models are often being fit to new data. By including all relevant predictors, the model becomes more robust to changes in the data or the underlying system. If a predictor becomes important in the future due to changes in the environment or additional data, including it from the start ensures that the model is already equipped to capture its influence.

Drawbacks to including all relevant predictors: While one should always aim to include as many relevant predictors as possible, this goal needs to be balanced with overfitting concerns. If we include too many predictors in the model and train on a limited number of observations, the model may simply memorize the nuances/noise in the data rather than capturing the underlying trend in the data.

Example

Let’s consider a regression model where we want to evaluate the relationship between FullBath (number of bathrooms) and SalePrice.

from sklearn.datasets import fetch_openml
housing = fetch_openml(name="house_prices", as_frame=True, parser='auto') #
y=housing['target']
X=housing['data']['FullBath']
print(X.shape)
X.head()
(1460,)

0    2
1    2
2    2
3    1
4    2
Name: FullBath, dtype: int64

It’s always a good idea to start by plotting the predictor vs the target variable to get a sense of the underlying relationship.

import matplotlib.pyplot as plt
plt.scatter(X,y,alpha=.3);
# plt.savefig('..//fig//regression//relevant_predictors//scatterplot_fullBath_vs_salePrice.png', bbox_inches='tight', dpi=300, facecolor='white');

Since the relationship doesn’t appear to be quite as linear as we were hoping, we will try a log transformation as we did in the previous episode.

import numpy as np
y_log = y.apply(np.log)
plt.scatter(X,y_log, alpha=.3);
# plt.savefig('..//fig//regression//relevant_predictors//scatterplot_fullBath_vs_logSalePrice.png', bbox_inches='tight', dpi=300, facecolor='white');

The log transform improves the linear relationship substantially!

Standardizing scale of predictors

We’ll compare the coefficients estimated from this model to an additional univariate model. To make this comparison more straightforward, we will z-score the predictor. If you don’t standardize the scale of all predictors being compared, the coefficient size will be a function of the scale of each specific predictor rather than a measure of each predictor’s overall influence on the target.

X = (X - X.mean())/X.std()
X.head()
0    0.789470
1    0.789470
2    0.789470
3   -1.025689
4    0.789470
Name: FullBath, dtype: float64

Statsmodels for model fitting and interpretation

Next, we will import the statsmodels package which is an R-style modeling package that has some convenient functions for rigorously testing and running stats on linear models.

For efficiency, we will skip train/test splits for now. Recall that train/test splits aren’t as essential when working with only a handful of predictors (i.e., when the ratio between number of training observations and model parameters/coefficients is at least 10).

Fit the model. Since we are now turning our attention towards explanatory models, we will use the statsmodels library instead of sklearn. Statsmodels comes with a variety of functions which make it easier to interpret the model and ultimately run hypothesis tests. It closely mirrors the way R builds linear models.

import statsmodels.api as sm

# Add a constant column to the predictor variables dataframe - this acts as the y-intercept for the model
X = sm.add_constant(X)
X.head()
const FullBath
0 1.0 0.789470
1 1.0 0.789470
2 1.0 0.789470
3 1.0 -1.025689
4 1.0 0.789470

Note: statsmodels is smart enough to not re-add the constant if it has already been added

X = sm.add_constant(X)
X.head()
const FullBath
0 1.0 0.789470
1 1.0 0.789470
2 1.0 0.789470
3 1.0 -1.025689
4 1.0 0.789470
# Fit the multivariate regression model
model = sm.OLS(y_log, X)
results = model.fit()

Let’s print the coefs from this model. In addition, we can quickly extract R-squared from the statsmodels model object using…

print(results.params,'\n')
print(results.pvalues,'\n')
print('R-squared:', results.rsquared)
const       12.024051
FullBath     0.237582
dtype: float64

const        0.000000e+00
FullBath    2.118958e-140
dtype: float64

R-squared: 0.3537519976399338

You can also call results.summary() for a detailed overview of the model’s estimates and resulting statistics.

results.summary()
OLS Regression Results
Dep. Variable: SalePrice R-squared: 0.354
Model: OLS Adj. R-squared: 0.353
Method: Least Squares F-statistic: 798.1
Date: Mon, 14 Aug 2023 Prob (F-statistic): 2.12e-140
Time: 21:14:09 Log-Likelihood: -412.67
No. Observations: 1460 AIC: 829.3
Df Residuals: 1458 BIC: 839.9
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 12.0241 0.008 1430.258 0.000 12.008 12.041
FullBath 0.2376 0.008 28.251 0.000 0.221 0.254
Omnibus: 51.781 Durbin-Watson: 1.975
Prob(Omnibus): 0.000 Jarque-Bera (JB): 141.501
Skew: 0.016 Prob(JB): 1.88e-31
Kurtosis: 4.525 Cond. No. 1.00



Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Based on the R-squared, this model explains 35.4% of the variance in the SalePrice target variable.

The model coefficient estimated for the “FullBath” predictor is 0.24. Recall that we fit this model to a log scaled version of the SalePrice. In other words, increasing the FullBath predictor by 1 standard deviation increases the log(SalePrice) by 0.24. While this explanation is completely valid, it is often useful to interpret the coefficient in terms of the original scale of the target variable.

Transforming the coefficient to the original scale of the data.

Exponentiate the coefficient to reverse the log transformation. This gives the multiplicative factor for every one-unit increase in the independent variable. In our model (run the code below), for every standard deviation increase in the predictor, our target variable increases by a factor of about 1.27, or 27%. Recall that multiplying a number by 1.27 is the same as increasing the number by 27%. Likewise, multiplying a number by, say, 0.3 is the same as decreasing the number by 1 – 0.3 = 0.7, or 70%.

Bonus: Unpacking the coefficient transformation

When we have a linear model fit to the logarithm of the target variable, the general form of the model can be written as:

log(y) = b0 + b1 * x1 + b2 * x2 + ... + bn * xn

where log(y) is the natural log of the target variable, b0 is the model’s intercept, b1 through bn are the estimated coefficients, and x1 through xn are the predictor variables.

Now, if you want to express the effect of a one-unit change in one of the predictor variables on the original target variable y, you can take the exponential of both sides of the equation:

exp(log(y)) = exp(b0 + b1 * x1 + b2 * x2 + ... + bn * xn)
y = exp(b0) * exp(b1 * x1) * exp(b2 * x2) * ... * exp(bn * xn)

When you’re specifically interested in the multiplicative change associated with a one-unit change in a particular predictor variable (let’s call it x1), notice that exp(b1 * (x1 + 1)) = exp(b1 * x1) * exp(b1). In other words, increasing x1 by one unit multiplies the predicted y by exp(b1), while every other term in the product stays the same. This is where the “multiplicative change” interpretation comes from.
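A quick numeric check of this identity, using made-up coefficient values purely for illustration:

import numpy as np

b0, b1 = 12.0, 0.24  # hypothetical intercept and coefficient for illustration
x1 = 1.0
y_before = np.exp(b0 + b1 * x1)
y_after = np.exp(b0 + b1 * (x1 + 1))
print(y_after / y_before)  # ratio of predictions after a one-unit increase in x1
print(np.exp(b1))          # identical value: exp(b1) is the multiplicative change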

Log review

Remember that exp() is the inverse of the natural logarithm, log().

e = 2.71828 # approximation of Euler's number (irrational), the base of the natural log
x = 2
e_squared = e**x # raise e to the power of x
e_squared_exact = np.exp(x)

# take the natural log (ln)
print(np.log(e_squared))
print(np.log(e_squared_exact)) # take the natural log (ln)
np.exp(results.params[1]) # First param is the estimated coef for the y-intercept / "const". The second param is the estimated coef for FullBath.
1.2681792421553808

When transformed to the original data scale, this coefficient tells us that increasing bathroom count by 1 standard deviation increases the sale price, on average, by 27%. While bathrooms are a very hot commodity to find in a house, they likely don’t deserve this much credit. Let’s do some further digging by comparing another predictor which likely has a large impact on SalePrice — the total square footage of the house (excluding the basement).

X=housing['data']['GrLivArea']
plt.scatter(X, y_log);
# plt.savefig('..//fig//regression//relevant_predictors//scatterplot_GrLivArea_vs_logSalePrice.png', bbox_inches='tight', dpi=300, facecolor='white');

As before, we will z-score the predictor. This is a critical step when comparing coefficient estimates since the estimates are a function of the scale of the predictor.

X = (X - X.mean())/X.std()
X.head()
0    0.370207
1   -0.482347
2    0.514836
3    0.383528
4    1.298881
Name: GrLivArea, dtype: float64

Fit the model and print coefs/R-squared.

# Add a constant column to the predictor variables dataframe
X = sm.add_constant(X)
print(X.head())
# Fit the multivariate regression model
model = sm.OLS(y_log, X)
results = model.fit()
print(results.params)
print('R-squared:', results.rsquared)
   const  GrLivArea
0    1.0   0.370207
1    1.0  -0.482347
2    1.0   0.514836
3    1.0   0.383528
4    1.0   1.298881
const        12.024051
GrLivArea     0.279986
dtype: float64
R-squared: 0.49129817224671934

Based on the R-squared, this model explains 49.1% of the variance in the target variable (higher than FullBath which is to be expected). Let’s convert the coef to the original scale of the target data before reading into it.

np.exp(results.params[1]) # First param is the estimated coef for the y-intercept / "const". The second param is the estimated coef for GrLivArea.
1.3231118984358705

For every one standard devation increase in the predictor (GrLivArea), our target variable (SalePrice) increases by a factor of about 1.32, or 32%.

Let’s compare our findings with a multivariate regression model that includes both predictors.

predictors = ['GrLivArea', 'FullBath']
X=housing['data'][predictors]
X.head()
GrLivArea FullBath
0 1710 2
1 1262 2
2 1786 2
3 1717 1
4 2198 2

Standardization

X = (X - X.mean())/X.std()
X.head()
GrLivArea FullBath
0 0.370207 0.789470
1 -0.482347 0.789470
2 0.514836 0.789470
3 0.383528 -1.025689
4 1.298881 0.789470

Add constant for modeling y-intercept

# Fit the multivariate regression model
X = sm.add_constant(X)
X.head()
const GrLivArea FullBath
0 1.0 0.370207 0.789470
1 1.0 -0.482347 0.789470
2 1.0 0.514836 0.789470
3 1.0 0.383528 -1.025689
4 1.0 1.298881 0.789470
model = sm.OLS(y_log, X)
results = model.fit()
print(results.params)
print('R-squared:', results.rsquared)
const        12.024051
GrLivArea     0.216067
FullBath      0.101457
dtype: float64
R-squared: 0.530204241994317

Comparing results

  1. How does the R-squared of this model compare to the univariate models? Is the variance explained by the multivariate model equal to the sum of R-squared of each univariate model? Why or why not?
  2. Convert the coefficients to the original scale of the target variable as we did earlier in this episode. How much does SalePrice increase with a 1 standard deviation increase in each predictor?
  3. How do the coefficient estimates compare to the univariate model estimates? Is there any difference? If so, what might be the cause?

Solution

How does the R-squared of this model compare to the univariate models? Is the variance explained by the multivariate model equal to the sum of R-squared of each univariate model? Why or why not?

The R-squared value in the multivariate model (53.0%) is somewhat larger than each of the univariate models (GrLivArea=49.1%, FullBath=35.4%), which illustrates one of the benefits of including multiple predictors. When we add the R-squared values of the univariate models, we get 49.1 + 35.4 = 84.5%. This value is much larger than what we observe in the multivariate model. The reason we can’t simply add the R-squared values together is because each univariate model fails to account for at least one relevant predictor. When we omit one of the predictors, the model assumes the observed relationship is only due to the remaining predictor. This causes the impact of each individual predictor to appear inflated (R-squared and coef magnitude) in the univariate models.

Convert the coefficients to the original scale of the target variable as we did earlier in this episode. How much does SalePrice increase with a 1 standard deviation increase in each predictor?

First we’ll convert the coefficients to the original scale of the target variable using the exp() function (the inverse of log).

print('GrLivArea:', np.exp(.216))
print('FullBath:', np.exp(.101))
GrLivArea: 1.2411023790006717
FullBath: 1.1062766417634236

Based on these results, increasing the GrLivArea by 1 standard deviation increases SalePrice by 24.1% (univariate = 32.3%), while increasing FullBath by 1 standard deviation increases SalePrice by only 10.6% (univariate = 26.8%).

How do the coefficient estimates compare to the univariate model estimates? Is there any difference? If so, what might be the cause?

When using a multivariate model, the coefficients were reduced to a considerable degree compared to the univariate models. Why does this happen? Both GrLivArea and FullBath linearly relate to SalePrice. If we model SalePrice while considering only one of these predictors, the model will think that only one predictor is doing the work of multiple predictors. We call this effect omitted-variable bias or omitted-predictor bias. Omitted-variable bias leads to model misspecification, where the model structure or functional form does not accurately reflect the underlying relationship between the predictors and the outcome. If you want a more truthful model, it is critical that you include as many relevant predictors as possible.
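To see omitted-variable bias in action, here is a minimal simulation on synthetic data (illustrative only, not the Ames housing set) where omitting a correlated predictor inflates the remaining coefficient:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)      # x2 is correlated with x1
y_sim = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)   # both predictors truly matter

full_model = sm.OLS(y_sim, sm.add_constant(np.column_stack([x1, x2]))).fit()
omitted_model = sm.OLS(y_sim, sm.add_constant(x1)).fit()
print(full_model.params)     # x1 coefficient recovered near its true value of 1.0
print(omitted_model.params)  # x1 coefficient inflated toward ~1.8, absorbing x2's effect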

Including ALL predictors - overfitting concerns

While researchers should always strive to include as many relevant predictors as possible, this must also be balanced with overfitting concerns. That is, it is often the case that some of the relevant predictors must be left out in order to ensure that overfitting does not occur. If we include too many predictors in the model and train on a limited number of observations, the model may simply memorize the nuances/noise in the data rather than capturing the underlying trend in the data.

Let’s see how this plays out with the Ames housing dataset.

We’ll first load and prep the full high-dimensional dataset. The following helper function…

  1. loads the full Ames housing dataset
  2. Encodes all categorical predictors appropriately (we’ll discuss this step in detail in the next episode)
  3. Removes sparse predictors with little to no variability (we’ll discuss this step in detail in the next episode)
  4. log scales the target variable, SalePrice
  5. train/test splits the data
from sklearn.datasets import fetch_openml
housing = fetch_openml(name="house_prices", as_frame=True, parser='auto') #
y=housing['target']
X=housing['data']
print('X.shape[1]', X.shape[1])
import numpy as np
y_log = np.log(y)

from preprocessing import encode_predictors_housing_data
X_encoded = encode_predictors_housing_data(X) # use one-hot encoding to encode categorical predictors
print('X_encoded.shape[1]', X_encoded.shape[1])

from preprocessing import remove_bad_cols
X_encoded_good = remove_bad_cols(X_encoded, 98) # remove cols with nans and cols that are 98% constant
X_encoded_good.head()

print(X_encoded_good.shape)
print(y.shape)

X.shape[1] 80
X_encoded.shape[1] 215
99 columns removed, 116 remaining.
Columns removed: ['LotFrontage', 'PoolArea', '3SsnPorch', 'LowQualFinSF', 'GarageYrBlt', 'MasVnrArea', 'RoofMatl_ClyTile', 'RoofMatl_CompShg', 'RoofMatl_Membran', 'RoofMatl_Metal', 'RoofMatl_Roll', 'RoofMatl_Tar&Grv', 'RoofMatl_WdShake', 'RoofMatl_WdShngl', 'SaleCondition_AdjLand', 'SaleCondition_Alloca', 'SaleCondition_Family', "Exterior2nd_'Brk Cmn'", 'Exterior2nd_AsbShng', 'Exterior2nd_AsphShn', 'Exterior2nd_BrkFace', 'Exterior2nd_CBlock', 'Exterior2nd_ImStucc', 'Exterior2nd_Other', 'Exterior2nd_Stone', 'Exterior2nd_Stucco', 'Utilities_AllPub', 'Utilities_NoSeWa', 'LotConfig_FR3', 'Foundation_Slab', 'Foundation_Stone', 'Foundation_Wood', 'Condition2_Artery', 'Condition2_Feedr', 'Condition2_Norm', 'Condition2_PosA', 'Condition2_PosN', 'Condition2_RRAe', 'Condition2_RRAn', 'Condition2_RRNn', 'GarageType_2Types', 'GarageType_Basment', 'GarageType_CarPort', 'Heating_Floor', 'Heating_GasW', 'Heating_Grav', 'Heating_OthW', 'Heating_Wall', 'MSSubClass_40', 'MSSubClass_45', 'MSSubClass_75', 'MSSubClass_85', 'MSSubClass_180', 'Neighborhood_Blmngtn', 'Neighborhood_Blueste', 'Neighborhood_BrDale', 'Neighborhood_ClearCr', 'Neighborhood_MeadowV', 'Neighborhood_NPkVill', 'Neighborhood_SWISU', 'Neighborhood_StoneBr', 'Neighborhood_Veenker', 'RoofStyle_Flat', 'RoofStyle_Gambrel', 'RoofStyle_Mansard', 'RoofStyle_Shed', 'SaleType_CWD', 'SaleType_Con', 'SaleType_ConLD', 'SaleType_ConLI', 'SaleType_ConLw', 'SaleType_Oth', 'Exterior1st_AsbShng', 'Exterior1st_AsphShn', 'Exterior1st_BrkComm', 'Exterior1st_CBlock', 'Exterior1st_ImStucc', 'Exterior1st_Stone', 'Exterior1st_Stucco', 'Exterior1st_WdShing', "MSZoning_'C (all)'", 'MSZoning_RH', 'MasVnrType_BrkCmn', 'Condition1_PosA', 'Condition1_PosN', 'Condition1_RRAe', 'Condition1_RRAn', 'Condition1_RRNe', 'Condition1_RRNn', 'Electrical_FuseF', 'Electrical_FuseP', 'Electrical_Mix', 'MiscFeature_Gar2', 'MiscFeature_Othr', 'MiscFeature_TenC', 'HouseStyle_1.5Unf', 'HouseStyle_2.5Fin', 'HouseStyle_2.5Unf', 'Street']
(1460, 116)
(1460,)

Train/test split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_encoded_good, y_log, test_size=0.33, random_state=0)
print(X_train.shape)
print(X_test.shape)
(978, 116)
(482, 116)

Zscoring all predictors

Since we’re now working with multiple predictors, we will zscore our data such that we can compare coefficient estimates across predictors.

There is some additional nuance to this step when working with train/test splits. For instance, you might wonder which of the following procedures is most appropriate…

  1. Zscore the full dataset prior to train/test splitting
  2. Zscore the train and test sets separately, using each subset’s mean and standard deviation

As it turns out, both are incorrect. Instead, it is best to use only the training set to derive the means and standard deviations used to zscore both the training and test sets. The reason for this is to prevent data leakage, which can occur if you calculate the mean and standard deviation for the entire dataset (both training and test sets) together. This would give the test set information about the distribution of the training set, leading to biased and inaccurate performance evaluation. The test set should be treated as unseen data during the preprocessing steps.

To standardize your data correctly:

  1. Calculate the mean and standard deviation of each feature on the training set.
  2. Use these calculated means and standard deviations to standardize both the training and test sets.
from preprocessing import zscore
help(zscore)
Help on function zscore in module preprocessing:

zscore(df: pandas.core.frame.DataFrame, train_means: pandas.core.series.Series, train_stds: pandas.core.series.Series) -> pandas.core.frame.DataFrame
    return z-scored dataframe using training data mean and standard deviation
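The helper’s implementation isn’t shown here; assuming it simply applies the training-set statistics column-wise (a guess based on its docstring), it might look roughly like:

import pandas as pd

def zscore(df: pd.DataFrame, train_means: pd.Series, train_stds: pd.Series) -> pd.DataFrame:
    """Z-score each column of df using the training set's mean and standard deviation."""
    return (df - train_means) / train_stds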
# get means and stds
train_means = X_train.mean()
train_stds = X_train.std()
X_train_z = zscore(df=X_train, train_means=train_means, train_stds=train_stds)
X_test_z = zscore(df=X_test, train_means=train_means, train_stds=train_stds)
X_train_z.head()
YearRemodAdd MoSold GarageArea OverallCond 2ndFlrSF TotRmsAbvGrd BsmtFinSF1 1stFlrSF TotalBsmtSF GrLivArea ... BldgType_Twnhs BldgType_TwnhsE HouseStyle_1.5Fin HouseStyle_1Story HouseStyle_2Story HouseStyle_SFoyer HouseStyle_SLvl Alley_Grvl Alley_Pave CentralAir
1127 0.940448 1.757876 0.758930 -0.491196 -0.801562 0.276161 1.972614 0.976157 1.140398 0.011263 ... -0.168411 -0.294241 -0.341056 0.997447 -0.674456 -0.148058 -0.198191 -0.171591 -0.180836 0.266685
1424 -0.081668 -0.104345 0.060128 -0.491196 -0.801562 -0.341883 0.027414 0.480164 -0.087155 -0.347499 ... -0.168411 -0.294241 -0.341056 0.997447 -0.674456 -0.148058 -0.198191 -0.171591 -0.180836 0.266685
587 -0.130340 0.268099 0.270726 0.400065 -0.801562 -1.577972 0.523521 -0.810483 -0.533538 -1.281052 ... -0.168411 -0.294241 -0.341056 -1.001535 -0.674456 6.747209 -0.198191 -0.171591 -0.180836 0.266685
1157 1.135137 0.268099 0.739785 -0.491196 -0.801562 -0.341883 1.058855 0.400166 0.616384 -0.405364 ... 5.931796 -0.294241 -0.341056 0.997447 -0.674456 -0.148058 -0.198191 -0.171591 -0.180836 0.266685
938 1.037792 0.640543 1.898074 -0.491196 0.490338 0.276161 0.043566 0.605496 0.803185 0.844517 ... -0.168411 -0.294241 -0.341056 -1.001535 1.481159 -0.148058 -0.198191 -0.171591 -0.180836 0.266685

5 rows × 116 columns

Note: we have a helper function that will do the above steps for us later in this workshop

from preprocessing import prep_fulldim_zdata
X_train_z, X_test_z, y_train, y_test, y = prep_fulldim_zdata(const_thresh= 98, test_size=.33, y_log_scaled=True)
print(X_train_z.shape)
X_train_z.head()
99 columns removed, 116 remaining.
Columns removed: ['LotFrontage', 'PoolArea', '3SsnPorch', 'LowQualFinSF', 'GarageYrBlt', 'MasVnrArea', 'RoofMatl_ClyTile', 'RoofMatl_CompShg', 'RoofMatl_Membran', 'RoofMatl_Metal', 'RoofMatl_Roll', 'RoofMatl_Tar&Grv', 'RoofMatl_WdShake', 'RoofMatl_WdShngl', 'SaleCondition_AdjLand', 'SaleCondition_Alloca', 'SaleCondition_Family', "Exterior2nd_'Brk Cmn'", 'Exterior2nd_AsbShng', 'Exterior2nd_AsphShn', 'Exterior2nd_BrkFace', 'Exterior2nd_CBlock', 'Exterior2nd_ImStucc', 'Exterior2nd_Other', 'Exterior2nd_Stone', 'Exterior2nd_Stucco', 'Utilities_AllPub', 'Utilities_NoSeWa', 'LotConfig_FR3', 'Foundation_Slab', 'Foundation_Stone', 'Foundation_Wood', 'Condition2_Artery', 'Condition2_Feedr', 'Condition2_Norm', 'Condition2_PosA', 'Condition2_PosN', 'Condition2_RRAe', 'Condition2_RRAn', 'Condition2_RRNn', 'GarageType_2Types', 'GarageType_Basment', 'GarageType_CarPort', 'Heating_Floor', 'Heating_GasW', 'Heating_Grav', 'Heating_OthW', 'Heating_Wall', 'MSSubClass_40', 'MSSubClass_45', 'MSSubClass_75', 'MSSubClass_85', 'MSSubClass_180', 'Neighborhood_Blmngtn', 'Neighborhood_Blueste', 'Neighborhood_BrDale', 'Neighborhood_ClearCr', 'Neighborhood_MeadowV', 'Neighborhood_NPkVill', 'Neighborhood_SWISU', 'Neighborhood_StoneBr', 'Neighborhood_Veenker', 'RoofStyle_Flat', 'RoofStyle_Gambrel', 'RoofStyle_Mansard', 'RoofStyle_Shed', 'SaleType_CWD', 'SaleType_Con', 'SaleType_ConLD', 'SaleType_ConLI', 'SaleType_ConLw', 'SaleType_Oth', 'Exterior1st_AsbShng', 'Exterior1st_AsphShn', 'Exterior1st_BrkComm', 'Exterior1st_CBlock', 'Exterior1st_ImStucc', 'Exterior1st_Stone', 'Exterior1st_Stucco', 'Exterior1st_WdShing', "MSZoning_'C (all)'", 'MSZoning_RH', 'MasVnrType_BrkCmn', 'Condition1_PosA', 'Condition1_PosN', 'Condition1_RRAe', 'Condition1_RRAn', 'Condition1_RRNe', 'Condition1_RRNn', 'Electrical_FuseF', 'Electrical_FuseP', 'Electrical_Mix', 'MiscFeature_Gar2', 'MiscFeature_Othr', 'MiscFeature_TenC', 'HouseStyle_1.5Unf', 'HouseStyle_2.5Fin', 'HouseStyle_2.5Unf', 'Street']
(978, 116)
YearRemodAdd MoSold GarageArea OverallCond 2ndFlrSF TotRmsAbvGrd BsmtFinSF1 1stFlrSF TotalBsmtSF GrLivArea ... BldgType_Twnhs BldgType_TwnhsE HouseStyle_1.5Fin HouseStyle_1Story HouseStyle_2Story HouseStyle_SFoyer HouseStyle_SLvl Alley_Grvl Alley_Pave CentralAir
1127 0.940448 1.757876 0.758930 -0.491196 -0.801562 0.276161 1.972614 0.976157 1.140398 0.011263 ... -0.168411 -0.294241 -0.341056 0.997447 -0.674456 -0.148058 -0.198191 -0.171591 -0.180836 0.266685
1424 -0.081668 -0.104345 0.060128 -0.491196 -0.801562 -0.341883 0.027414 0.480164 -0.087155 -0.347499 ... -0.168411 -0.294241 -0.341056 0.997447 -0.674456 -0.148058 -0.198191 -0.171591 -0.180836 0.266685
587 -0.130340 0.268099 0.270726 0.400065 -0.801562 -1.577972 0.523521 -0.810483 -0.533538 -1.281052 ... -0.168411 -0.294241 -0.341056 -1.001535 -0.674456 6.747209 -0.198191 -0.171591 -0.180836 0.266685
1157 1.135137 0.268099 0.739785 -0.491196 -0.801562 -0.341883 1.058855 0.400166 0.616384 -0.405364 ... 5.931796 -0.294241 -0.341056 0.997447 -0.674456 -0.148058 -0.198191 -0.171591 -0.180836 0.266685
938 1.037792 0.640543 1.898074 -0.491196 0.490338 0.276161 0.043566 0.605496 0.803185 0.844517 ... -0.168411 -0.294241 -0.341056 -1.001535 1.481159 -0.148058 -0.198191 -0.171591 -0.180836 0.266685

5 rows × 116 columns

Fit the model and measure train/test performance

# Fit the multivariate regression model
X_train_z = sm.add_constant(X_train_z)
X_train_z.head()
const YearRemodAdd MoSold GarageArea OverallCond 2ndFlrSF TotRmsAbvGrd BsmtFinSF1 1stFlrSF TotalBsmtSF ... BldgType_Twnhs BldgType_TwnhsE HouseStyle_1.5Fin HouseStyle_1Story HouseStyle_2Story HouseStyle_SFoyer HouseStyle_SLvl Alley_Grvl Alley_Pave CentralAir
1127 1.0 0.940448 1.757876 0.758930 -0.491196 -0.801562 0.276161 1.972614 0.976157 1.140398 ... -0.168411 -0.294241 -0.341056 0.997447 -0.674456 -0.148058 -0.198191 -0.171591 -0.180836 0.266685
1424 1.0 -0.081668 -0.104345 0.060128 -0.491196 -0.801562 -0.341883 0.027414 0.480164 -0.087155 ... -0.168411 -0.294241 -0.341056 0.997447 -0.674456 -0.148058 -0.198191 -0.171591 -0.180836 0.266685
587 1.0 -0.130340 0.268099 0.270726 0.400065 -0.801562 -1.577972 0.523521 -0.810483 -0.533538 ... -0.168411 -0.294241 -0.341056 -1.001535 -0.674456 6.747209 -0.198191 -0.171591 -0.180836 0.266685
1157 1.0 1.135137 0.268099 0.739785 -0.491196 -0.801562 -0.341883 1.058855 0.400166 0.616384 ... 5.931796 -0.294241 -0.341056 0.997447 -0.674456 -0.148058 -0.198191 -0.171591 -0.180836 0.266685
938 1.0 1.037792 0.640543 1.898074 -0.491196 0.490338 0.276161 0.043566 0.605496 0.803185 ... -0.168411 -0.294241 -0.341056 -1.001535 1.481159 -0.148058 -0.198191 -0.171591 -0.180836 0.266685

5 rows × 117 columns

We’ll add the constant to the test set as well so that we can feed the test data to the model for prediction.

X_test_z = sm.add_constant(X_test_z)
X_test_z.head()
const YearRemodAdd MoSold GarageArea OverallCond 2ndFlrSF TotRmsAbvGrd BsmtFinSF1 1stFlrSF TotalBsmtSF ... BldgType_Twnhs BldgType_TwnhsE HouseStyle_1.5Fin HouseStyle_1Story HouseStyle_2Story HouseStyle_SFoyer HouseStyle_SLvl Alley_Grvl Alley_Pave CentralAir
529 1.0 -0.471045 -1.221678 0.060128 -2.273717 -0.801562 1.512249 1.785709 3.602784 2.365525 ... -0.168411 -0.294241 -0.341056 0.997447 -0.674456 -0.148058 -0.198191 -0.171591 -0.180836 0.266685
491 1.0 -1.687850 0.640543 -1.107734 1.291325 0.601202 -0.959927 -0.097190 -0.549153 -0.616021 ... -0.168411 -0.294241 2.929070 -1.001535 -0.674456 -0.148058 -0.198191 -0.171591 -0.180836 0.266685
459 1.0 -1.687850 0.268099 -0.571667 -1.382456 -0.294757 -0.959927 -0.600219 -0.493154 -0.851343 ... -0.168411 -0.294241 2.929070 -1.001535 -0.674456 -0.148058 -0.198191 -0.171591 -0.180836 0.266685
279 1.0 -0.373701 -1.221678 0.160640 -0.491196 1.157782 0.894205 -0.122572 -0.021161 0.242780 ... -0.168411 -0.294241 -0.341056 -1.001535 1.481159 -0.148058 -0.198191 -0.171591 -0.180836 0.266685
655 1.0 -0.665734 -1.221678 -0.992863 -0.491196 0.481288 -0.341883 -1.027102 -1.703802 -1.297726 ... 5.931796 -0.294241 -0.341056 -1.001535 1.481159 -0.148058 -0.198191 -0.171591 -0.180836 0.266685

5 rows × 117 columns

Train the model.

model = sm.OLS(y_train, X_train_z)
trained_model = model.fit()

Get model predictions on train and test sets.

y_pred_train=trained_model.predict(X_train_z)
y_pred_test=trained_model.predict(X_test_z)

Compare train/test R-squared.

from regression_predict_sklearn import measure_model_err
errors_df = measure_model_err(y, np.mean(y),
                      y_train,
                      y_pred_train,
                      y_test,
                      y_pred_test,
                      'RMSE', y_log_scaled=True)
errors_df.head()
Baseline Error Train Error Test Error
0 79415.291886 24796.925348 69935.683796

We can see that this model exhibits signs of overfitting. That is, the test set performance is substantially lower than train set performance. Since the model does not generalize well to other datasets — we shouldn’t read too much into the model’s estimated coefficients and p-values. An overfit model is a model that learns nuances/noise in the training data, and the coefficients/p-values may be biased towards uninteresting patterns in the data (i.e., patterns that don’t generalize).

Why do we see overfitting here? Let’s quickly calculate the ratio between the number of observations used to train the model and the number of coefficients that need to be estimated.

X_train_z.shape[0]/X_train_z.shape[1]
8.35897435897436

As the number of observations begins to approach the number of model parameters (i.e., coefficients being estimated), the model will simply memorize the training data rather than learn anything useful. As a general rule of thumb, obtaining reliable estimates from linear regression models requires that you have at least 10X as many observations as model coefficients/predictors. The exact ratio may change depending on the variability of your data and whether or not each observation is truly independent (time-series models, for instance, often require much more data since observations are rarely independent).

“All models are wrong, but some are useful” - George Box

Because of these opposing forces, it’s important to remember the following sage wisdom: “All models are wrong, but some are useful.” This famous quote by the statistician George E. P. Box conveys an essential concept in the field of statistics and modeling.

In essence, the phrase means that no model can perfectly capture the complexities and nuances of real-world data and phenomena. Models are simplifications of reality and are, by their nature, imperfect representations. Therefore, all models are considered “wrong” to some extent because they do not fully encompass the entire reality they attempt to describe.

However, despite their imperfections, some models can still be valuable and “useful” for specific purposes. A useful model is one that provides valuable insights, makes accurate predictions, or aids in decision-making, even if it does not perfectly match the underlying reality. The key is to understand the limitations of the model and interpret its results accordingly. Users of the model should be aware of its assumptions, potential biases, and areas where it might not perform well. Skilled data analysts and scientists know how to leverage models effectively, acknowledging their limitations and using them to gain insights or solve problems within the scope of their applicability.

Feature selection methods

Throughout this workshop, we will explore a few feature (predictor) selection methods that can help you simplify your high-dimensional data — making it possible to avoid overfitting concerns. These methods can involve:

  1. Mathematically combining features to reduce dimensionality (e.g., PCA)
  2. Data-driven/empirical methods such as comparing univariate model performance
  3. Hypothesis-driven feature selection
from feature_selection import hypothesis_driven_predictors
hypothesis_driven_predictors()
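As a small illustration of option 1 (a hedged sketch using scikit-learn, not the workshop’s own PCA material), the standardized predictors can be projected onto a handful of principal components and those components used as the model’s inputs:

from sklearn.decomposition import PCA

pca = PCA(n_components=10)                   # the number of components kept is a tuning choice
X_train_pca = pca.fit_transform(X_train_z)   # fit the projection on training data only
X_test_pca = pca.transform(X_test_z)         # reuse the same projection for the test data
print(pca.explained_variance_ratio_.sum())   # fraction of predictor variance retained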

Summary

In summary, leaving out relevant predictors can lead to biased coefficient estimates and model misspecification. Without including the most essential predictors, the model will place too much focus on the predictors included and over/underestimate their contributions to the target variable.

In addition, while researchers should strive to include as many relevant predictors in their models as possible, this must be balanced with overfitting concerns. Obtaining good coefficient estimates can become difficult as the number of predictors increases. As a general rule of thumb, obtaining reliable estimates from linear regression models requires that you have at least 10X as many observations as model coefficients/predictors. The exact ratio may change depending on the variability of your data and whether or not each observation is truly independent (time-series models, for instance, often require much more data since observations are rarely independent).

Other considerations

So far, we’ve explored the importance of including relevant predictors and checking for overfitting before we attempt to read too far into the model’s estimates. However, recall that there are three critical questions we must ask before we can read too far into our model’s estimations:

  1. Accounting for relevant predictors: Have we included all relevant predictors in the model?
  2. Bias/variance or under/overfitting: Does the model capture the variability of the target variable well? Does the model generalize well?
  3. Model assumptions: Does the fitted model follow the 5 assumptions of linear regression?

In the next episode, we’ll review a handful of assumptions that must be evaluated prior to running any hypothesis tests on a regression model.

Key Points

  • All models are wrong, but some are useful.

  • Before reading into a model’s estimated coefficients, modelers must take care to account for essential predictor variables

  • Models that do not account for essential predictor variables can produce distorted pictures of reality due to omitted variable bias and confounding effects.


Model validity - regression assumptions

Overview

Teaching: 45 min
Exercises: 2 min
Questions
  • How can we rigorously evaluate the validity and accuracy of a multivariate regression model?

  • What are the assumptions of linear regression models and why are they important?

  • How should we preprocess categorical predictors and sparse binary predictors?

Objectives
  • Understand how to assess the validity of a multivariate regression model.

  • Understand how to use statistics to evaluate the likelihood of existing relationships recovered by a multivariate model.

Model Validity And Interpretation

While using models strictly for predictive purposes is a completely valid approach for some domains and problems, researchers typically care more about being able to interpret their models such that interesting relationships between predictor(s) and target can be discovered and measured. When interpreting a linear regression model, we can look at the model’s estimated coefficients and p-values associated with each predictor to better understand the model. The coefficient’s magnitude can inform us of the effect size associated with a predictor, and the p-value tells us whether or not a predictor has a consistent (statistically significant) effect on the target.

Before we can blindly accept the model’s estimated coefficients and p-values, we must answer three questions that will help us determine whether or not our model is valid.

Model Validity Assessments

  1. Accounting for relevant predictors: Have we included as many relevant predictors in the model as possible?
  2. Bias/variance or under/overfitting: Does the model capture the variability of the target variable well? Does the model generalize well?
  3. Model assumptions: Does the fitted model follow the 5 assumptions of linear regression?

We will discuss the third assessment — regression assumptions — in detail throughout this episode.

Linear regression assumptions

When discussing machine learning models, it’s important to recognize that each model is built upon certain assumptions that define the relationship between the input data (features, predictors) and the output (target variable). The main goal of linear regression is to model the relationship between the target variable and the predictor variables using a linear equation. The assumptions that are typically discussed in the context of linear regression are related to how well this linear model represents the underlying data-generating process. These assumptions ensure that the linear regression estimates are accurate, interpretable, and can be used for valid statistical inference.

When interpreting multivariate models (coefficient size and p-values), the following assumptions should be met to ensure the validity of the results.

  1. Linearity: There is a linear relation between Y and X
  2. Normality: The error terms (residuals) are normally distributed
  3. Homoscedasticity: The variance of the error terms is constant over all X values (homoscedasticity)
  4. Independence: The error terms are independent
  5. Limited multicollinearity among predictors: This assumption applies to multivariate regression models but is not relevant in univariate regression since there is only one predictor variable. Multicollinearity refers to the presence of high correlation or linear dependence among the predictor variables in a regression model. It indicates that there is a strong linear relationship between two or more predictor variables. Multicollinearity can make it challenging to isolate the individual effects of predictors and can lead to unstable and unreliable coefficient estimates. It primarily focuses on the relationships among the predictors themselves.

We will assess all five assumptions using a multivariate model built from the Ames housing data.

0. Load and prep data

from sklearn.datasets import fetch_openml
housing = fetch_openml(name="house_prices", as_frame=True, parser='auto') #
y=housing['target']

# Log scale sale prices: In our previous univariate models, we observed that several predictors tend to linearly relate more with the log version of SalePrices. Target variables that exhibit an exponential trend, as we observe with house prices, typically need to be log scaled in order to observe a linear relationship with predictors that scale linearly. For this reason, we'll log scale our target variable here as well.
import numpy as np
y_log = y.apply(np.log)

Set predictors

Let’s assume we have two predictors recorded in this dataset — sale condition and OverallQual (don’t worry, we’ll use more predictors soon!). What values can the sale condition variable take?

cat_predictor = 'SaleCondition'
predictors = ['OverallQual', cat_predictor]
X=housing['data'][predictors]
print(X.head())
print(X[cat_predictor].unique())
   OverallQual SaleCondition
0            7        Normal
1            6        Normal
2            7        Normal
3            7       Abnorml
4            8        Normal
['Normal' 'Abnorml' 'Partial' 'AdjLand' 'Alloca' 'Family']

SaleCondition: Condition of sale

   Normal	Normal Sale
   Abnorml	Abnormal Sale -  trade, foreclosure, short sale
   AdjLand	Adjoining Land Purchase
   Alloca	Allocation - two linked properties with separate deeds, typically condo with a garage unit
   Family	Sale between family members
   Partial	Home was not completed when last assessed (associated with New Homes)

Encode categorical data as multiple binary predictors

import pandas as pd
nominal_fields =['SaleCondition']
X_enc = X.copy()
X_enc[nominal_fields]=X[nominal_fields].astype("category")
one_hot = pd.get_dummies(X_enc[nominal_fields])
one_hot.head()
SaleCondition_Abnorml SaleCondition_AdjLand SaleCondition_Alloca SaleCondition_Family SaleCondition_Normal SaleCondition_Partial
0 0 0 0 0 1 0
1 0 0 0 0 1 0
2 0 0 0 0 1 0
3 1 0 0 0 0 0
4 0 0 0 0 1 0
# Join the encoded df
X_enc = X.join(one_hot)

# Drop column SaleCondition as it is now encoded
X_enc = X_enc.drop(cat_predictor,axis = 1)
X_enc.head()
OverallQual SaleCondition_Abnorml SaleCondition_AdjLand SaleCondition_Alloca SaleCondition_Family SaleCondition_Normal SaleCondition_Partial
0 7 0 0 0 0 1 0
1 6 0 0 0 0 1 0
2 7 0 0 0 0 1 0
3 7 1 0 0 0 0 0
4 8 0 0 0 0 1 0

Alternatively, we could use one of the preprocessing helpers here.

from preprocessing import encode_predictors_housing_data
X_enc = encode_predictors_housing_data(X)
X_enc.head()
OverallQual SaleCondition_Abnorml SaleCondition_AdjLand SaleCondition_Alloca SaleCondition_Family SaleCondition_Normal SaleCondition_Partial
0 7 0 0 0 0 1 0
1 6 0 0 0 0 1 0
2 7 0 0 0 0 1 0
3 7 1 0 0 0 0 0
4 8 0 0 0 0 1 0

Handling sparse or near-constant predictors

After we’ve split our category variable into 6 binary predictors, it is important to assess the quantity of information present in each predictor. If a predictor shows very little variability (e.g., nearly all 0’s or 1’s in a binary variable), then it will be challenging to detect a meaningful relationship between that predictor and the target. The model needs to observe examples from both classes of a binary variable in order to reveal a measurable effect between a binary variable and the target.

To assess not only sparsity (the presence of nearly all 0’s) but also the presence of any single dominant value in a predictor, we will use the following procedure:

import pandas as pd

# Calculate value counts
value_counts_df = X_enc['OverallQual'].value_counts().reset_index()
value_counts_df.columns = ['Value', 'Count']

# Calculate percentages separately
value_counts_df['Percentage'] = value_counts_df['Count']/len(X_enc) * 100

# Sort the DataFrame
value_counts_df = value_counts_df.sort_values(by='Count', ascending=False)

# Display the DataFrame
value_counts_df.head()
Value Count Percentage
0 5 397 27.191781
1 6 374 25.616438
2 7 319 21.849315
3 8 168 11.506849
4 4 116 7.945205

A few of the predictors (AdjLand, Alloca, Family) contain very little information since they are filled almost entirely with 0’s. With few observations to rely on, this makes it difficult to assess how house price changes when these predictors become active (1 instead of 0). If you encounter extremely sparse predictors, it’s best to remove them from the start to avoid wasting computational resources. While it’s important to pick a threshold that encourages different values of each predictor to land in both the training and test sets, the exact threshold chosen is somewhat arbitrary. It’s more important that you follow up this choice with a thorough investigation of the resulting model and adjust the model downstream, if necessary.

Preprocessing helper function does the above calculation and looks at the top value’s percentage

We have a pre-made helper function that will run the above code for all predictors and remove any that are nearly constant. Here, we’ll remove predictors that are at least 99% constant.
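The helper’s exact implementation may differ; under the assumption that it drops columns containing missing values or any single value occupying more than the threshold percentage of rows, a rough sketch of such a filter is:

import pandas as pd

def drop_near_constant(df: pd.DataFrame, thresh: float = 99) -> pd.DataFrame:
    """Drop columns with missing values or whose most common value exceeds thresh percent of rows."""
    keep = []
    for col in df.columns:
        top_pct = df[col].value_counts(normalize=True, dropna=False).iloc[0] * 100
        if top_pct <= thresh and not df[col].isna().any():
            keep.append(col)
    return df[keep]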

from preprocessing import remove_bad_cols
X_good = remove_bad_cols(X_enc, 99)
X_good.head()
2 columns removed, 5 remaining.
Columns removed: ['SaleCondition_AdjLand', 'SaleCondition_Alloca']
OverallQual SaleCondition_Abnorml SaleCondition_Family SaleCondition_Normal SaleCondition_Partial
0 7 0 0 1 0
1 6 0 0 1 0
2 7 0 0 1 0
3 7 1 0 0 0
4 8 0 0 1 0

Why not just use variance instead of counts? Variance = (Sum of the squared differences between each value and the mean) / (Number of values)

Using the variance measure to detect extremely sparse or near-constant features might not be the most suitable approach. Variance can be influenced by outliers and can produce misleading results.

import numpy as np

# Array of continuous values (1 to 100)
continuous_data = np.arange(1, 101)
print(continuous_data)

# Array with mostly zeros and an outlier
sparse_data_with_outlier = np.array([0] * 100 + [1000])
print(sparse_data_with_outlier)

# Calculate variances
variance_continuous = np.var(continuous_data)
variance_sparse = np.var(sparse_data_with_outlier)

print("Continuous Data Variance:", variance_continuous)
print("Sparse Data Variance:", variance_sparse)
[  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
  91  92  93  94  95  96  97  98  99 100]
[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0 1000]
Continuous Data Variance: 833.25
Sparse Data Variance: 9802.96049406921

2. Check for multicollinearity

Multicollinearity: In statistics, multicollinearity is a phenomenon in which one or more predictors in a multivariate regression model can be linearly predicted from the others with a substantial degree of accuracy. In other words, it means that one or more predictors are highly correlated with one another.

Multicollinearity presents a problem in multivariate regression because, without independent/uncorrelated predictors, it is difficult to know how much each predictor truly contributes to predicting the target variable. In other words, when two or more predictors are closely related or measure almost the same thing, the underlying impact on the target variable is being accounted for twice (or more) across the predictors.

While multicollinearity does not reduce a model’s overall predictive power, it can produce estimates of the regression coefficients that are not statistically valid.

Variance Inflation Factor (VIF)

The VIF (Variance Inflation Factor) is a statistical measure used to detect multicollinearity in regression analysis. VIF helps to quantify the extent to which multicollinearity is present in the model.

The intuition behind the VIF score is based on the idea that if two or more predictor variables are highly related, it becomes difficult for the model to distinguish the individual effects of these variables on the dependent variable. Consequently, the coefficient estimates for these variables become unstable, and their standard errors become inflated, making the results less reliable.

The VIF is calculated for each independent variable in the regression model. Specifically, to calculate the VIF for predictor i, the following steps are taken (a worked sketch follows this list):

  1. Fit a regression model with predictor variable i as the target/dependent variable and all other independent variables (excluding i) as predictors.

  2. Calculate the R-squared value (R²) for this regression model. R² represents the proportion of variance in variable i that can be explained by the other independent variables.

  3. Calculate the VIF for variable i using the formula: VIF(i) = 1 / (1 - R²)

  4. The interpretation of the VIF score is as follows:

    • A VIF of 1 indicates that there is no multicollinearity for the variable, meaning it is not correlated with any other independent variable in the model.

    • VIF values between 1 and 5 (or an R² between 0 and .80) generally indicate low to moderate multicollinearity, which is typically considered acceptable in most cases.

    • VIF values greater than 5 (some sources use a threshold of 10) suggest high multicollinearity, indicating that the variable is highly correlated with other independent variables, and its coefficient estimates may be unreliable.
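To make the formula concrete, here is a minimal sketch that computes the VIF for a single predictor by hand (using the X_good DataFrame created above and OverallQual as an example):

import statsmodels.api as sm

# Regress the predictor of interest on all of the other predictors (plus a constant)
predictor_i = 'OverallQual'
X_others = sm.add_constant(X_good.drop(columns=[predictor_i]))
aux_model = sm.OLS(X_good[predictor_i], X_others).fit()

# VIF(i) = 1 / (1 - R²) from the auxiliary regression
vif_i = 1 / (1 - aux_model.rsquared)
print(f"VIF for {predictor_i}: {vif_i:.2f}")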

We can calculate VIF for all predictors quickly using the variance_inflation_factor function from the statsmodels package.

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calc_print_VIF(X):
    # Calculate VIF for each predictor in X
    vif = pd.DataFrame()
    vif["Variable"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

    # Display the VIF values
    print(vif)

The variance_inflation_factor() function expects a constant column to code for each model’s y-intercept. We will add this constant before calculating VIF scores.

from statsmodels.api import add_constant

# Add constant to the dataframe for VIF calculation
X_with_const = add_constant(X_good)
# Calc VIF scores
calc_print_VIF(X_with_const)
                Variable        VIF
0            OverallQual  19.518808
1  SaleCondition_Abnorml   2.072691
2   SaleCondition_Family   1.229967
3   SaleCondition_Normal  15.774223
4  SaleCondition_Partial   3.441928

It looks like two of the predictors, “Normal” and “OverallQual”, have high VIF scores. We can further investigate this score by creating a plot of the correlation matrix of all predictors.

from check_assumptions import plot_corr_matrix

# Calculate correlation matrix
fig = plot_corr_matrix(X_good.corr())

# import matplotlib.pyplot as plt
# plt.savefig('..//fig//regression//assumptions//corrMat_multicollinearity.png', bbox_inches='tight', dpi=300, facecolor='white');

The Normal variable appears to be highly negatively correlated with both Partial and Abnormal. In fact, Normal has a considerable amount of negative correlation with all predictors. If we think about our predictors holistically, it appears we have several categories describing somewhat rarer sale conditions, and then a more common/default “normal” condition. Regardless of the value of “Normal”, if all other predictors are set to 0, that is a very good indication that it was a “Normal” sale. Since “Normal” tends to negate the remaining predictors’ presence and is redundant, it makes sense to remove it from the list of predictors and only consider the manner in which the sale was unusual.

Correlation matrix vs VIF

You might wonder why we even bother calculating the VIF score given that we could easily inspect the correlation matrices instead. VIF scores give a more reliable estimate of multicollinearity mainly due to their ability to assess multivariate interactions. That is, the correlation matrix only shows pairwise relationships between variables, but it does not reveal the impact of all independent variables simultaneously on each other. The VIF, on the other hand, takes into account all other independent variables when assessing multicollinearity for a specific variable. This individual assessment allows you to pinpoint which variables are causing multicollinearity issues so that you can take appropriate action to address them.

In summary, while the correlation matrix can give you a general idea of potential multicollinearity, the VIF score provides a more comprehensive and quantitative assessment, helping you identify, measure, and address multicollinearity issues in a regression model effectively.
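To see why this matters, here is a small toy example (not from the housing data, and purely illustrative): a third variable is constructed as the sum of the first two plus a little noise. The pairwise correlations look only moderate, but the VIF scores reveal severe multicollinearity.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Toy data: x3 is (almost) an exact linear combination of x1 and x2
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + x2 + rng.normal(scale=0.1, size=500)
toy = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3})

print(toy.corr().round(2))  # pairwise correlations of x3 with x1 and x2 are only ~0.7

toy_const = sm.add_constant(toy)
vifs = [variance_inflation_factor(toy_const.values, i) for i in range(1, toy_const.shape[1])]
print(dict(zip(toy.columns, np.round(vifs, 1))))  # all three VIFs are very large

Returning to the housing model, we now drop the redundant Normal predictor: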

X_better = X_good.drop('SaleCondition_Normal',axis = 1)
X_better.columns
Index(['OverallQual', 'SaleCondition_Abnorml', 'SaleCondition_Family',
       'SaleCondition_Partial'],
      dtype='object')

After dropping the problematic variable with multicollinearity, we can recalculate VIF for each predictor in X. This time we’ll call a helper function to save a little time.

from check_assumptions import multicollinearity_test
multicollinearity_test(X_better);
==========================
VERIFYING MULTICOLLINEARITY...
                Variable       VIF
0            OverallQual  1.237386
1  SaleCondition_Abnorml  1.068003
2   SaleCondition_Family  1.014579
3  SaleCondition_Partial  1.154805

Note that we still see some correlation between OverallQual and Partial. However, the correlation is not so strong that those predictors can be reasonably well predicted from one another (low VIF score).

3. Fit the model

Before we can assess the remaining assumptions of the model (linearity, normality, homoscedasticity, and independence), we first must fit the model.

Train/test split

Since we’re working with multiple predictors, we must take care to evaluate evidence of overfitting in this model. The test set will be left out during model fitting/training so that we can measure the model’s ability to generalize to new data.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_better, y_log,
                                                    test_size=0.33,
                                                    random_state=4)

print(X_train.shape)
print(X_test.shape)
(978, 4)
(482, 4)

Zscore the data

As in the previous episode, we will zscore this data so that all predictors are on the same scale. This will allow us to compare coefficient sizes later.

from preprocessing import zscore
X_train_z = zscore(df=X_train, train_means=X_train.mean(), train_stds=X_train.std())
X_test_z = zscore(df=X_test, train_means=X_train.mean(), train_stds=X_train.std())
X_train_z.head()
OverallQual SaleCondition_Abnorml SaleCondition_Family SaleCondition_Partial
368 -0.819197 -0.262263 -0.124741 -0.300348
315 0.630893 -0.262263 -0.124741 -0.300348
135 0.630893 -0.262263 -0.124741 -0.300348
473 1.355938 -0.262263 -0.124741 3.326071
1041 -0.094152 -0.262263 -0.124741 -0.300348
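For reference, the zscore() helper is not shown here, but it essentially standardizes each column using statistics computed from the training set only, so that no information from the test set leaks into the transformation. A minimal sketch of that idea (an assumption about the helper, matching its call signature above) might look like:

def zscore_sketch(df, train_means, train_stds):
    # Subtract the training-set mean and divide by the training-set standard deviation
    # for every column; the same training statistics are reused for the test set
    return (df - train_means) / train_stds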
import statsmodels.api as sm

# Add a constant column to the predictor variables dataframe - this acts as the y-intercept in the model
X_train_z = sm.add_constant(X_train_z)

# Add the constant to the test set as well so we can use the model to form predictions on the test set later
X_test_z = sm.add_constant(X_test_z)
X_test_z.head()
const OverallQual SaleCondition_Abnorml SaleCondition_Family SaleCondition_Partial
280 1.0 0.630893 -0.262263 -0.124741 -0.300348
1365 1.0 0.630893 -0.262263 -0.124741 -0.300348
132 1.0 -0.819197 -0.262263 -0.124741 -0.300348
357 1.0 -0.819197 -0.262263 -0.124741 -0.300348
438 1.0 -0.819197 -0.262263 -0.124741 -0.300348
# Fit the multivariate regression model
model = sm.OLS(y_train, X_train_z)
trained_model = model.fit()

4. Evaluate evidence of overfitting or severe underfitting

Before we go any further in assessing the model’s assumptions and ultimately running hypothesis tests, we should first check to see if there is evidence of overfitting or severe underfitting.

from regression_predict_sklearn import measure_model_err
# to calculate residuals and R-squared for the test set, we'll need to get the model predictions first
y_pred_train = trained_model.predict(X_train_z)
y_pred_test = trained_model.predict(X_test_z)
errors_df = measure_model_err(y, np.mean(y),
                      y_train, y_pred_train,
                      y_test, y_pred_test,
                      'R-squared', y_log_scaled=True)

errors_df
Baseline Error Train Error Test Error
0 0.0 0.662168 0.723296

You can also extract rsquared for the training data directly from the trained statsmodels object.

# R-squared for train set
R2_train = trained_model.rsquared
print(R2_train)
0.6656610907646399

No evidence of overfitting (test and train errors are comparable) or severe underfitting (R-squared is not astonishingly low). You may notice that the test set R-squared is actually slightly higher than the train set R-squared. This may come as a surprise given that the test set was left out during model training. However, test-set performance can sometimes appear slightly better than train-set performance simply because of the limited number of samples (and therefore higher sampling noise) in the test set.

5a) Check linearity assumption

The linearity assumption of multivariate regression states that the overall relationship between the predictors and the target variable should be approximately linear. This doesn’t necessarily imply that each predictor must have a perfectly linear relationship with the target. So long as the sum of the combined effects is linear, the linearity assumption has been met. That said, if you observe a strong nonlinear pattern between one or more predictors and the target, this often cascades into an overall nonlinear effect in the model. We will review one method to investigate each individual predictor’s relationship with the target as well.

Why do we care?

As discussed in the previous episode, the predictions will be inaccurate because our model is underfitting (i.e., not adequately capturing the variance of the data, since you can’t effectively draw a line through nonlinear data). In addition to having a fatal impact on predictive power, violations of linearity can affect the validity of hypothesis tests on the regression coefficients. The p-values associated with the coefficients may not accurately reflect the statistical significance, potentially leading to erroneous conclusions.

Visualizing linearity in multivariate models

When working with univariate models, we are able to assess the linearity assumption PRIOR to model fitting simply by creating a scatterplot between the predictor and target. With multivariate models, however, we need a different approach in order to isolate the relationship between individual predictors and the target. That is, we need to account for effects of the remaining predictors.

Partial regression plots

Partial regression plots, otherwise known as added variable plots, help visualize the relationship between a single predictor and the target variable while taking into account the effects of other predictor variables. By plotting the partial regressions against the target variable of interest, we can assess whether the relationship is approximately linear for each predictor.

Partial regression plots are formed by:

  1. Computing the residuals of regressing the target variable against the predictor variables but omitting Xi (predictor of interest)
  2. Computing the residuals from regressing Xi against the remaining independent variables.
  3. Plotting the residuals from (1) against the residuals from (2).

By looking at the visualization, we can assess the impact of adding individual predictors to a model that has all remaining predictors. If we see a non-zero slope, this indicates a predictor has a meaningful relationship with the target after accounting for effects from other predictors.
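To make those three steps concrete, here is a minimal sketch that builds a single partial regression plot by hand, assuming the X_train_z (with constant) and y_train objects from above and using OverallQual as the predictor of interest; the statsmodels function used next automates this for every predictor:

import statsmodels.api as sm
import matplotlib.pyplot as plt

predictor = 'OverallQual'
X_others = X_train_z.drop(columns=[predictor])  # all remaining predictors (plus the constant)

# 1) residuals of the target after regressing it on the other predictors
resid_y = sm.OLS(y_train, X_others).fit().resid
# 2) residuals of the predictor of interest after regressing it on the other predictors
resid_x = sm.OLS(X_train_z[predictor], X_others).fit().resid
# 3) plot the residuals from (1) against the residuals from (2)
plt.scatter(resid_x, resid_y, alpha=0.3)
plt.xlabel(f'{predictor} residuals (vs other predictors)')
plt.ylabel('log(SalePrice) residuals (vs other predictors)')
plt.show()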

# Create the partial regression plots using statsmodels
import matplotlib.pyplot as plt
from statsmodels.graphics.regressionplots import plot_partregress_grid;

fig = plt.figure(figsize=(12, 8));
plot_partregress_grid(trained_model, fig=fig, exog_idx=list(range(1,X_train_z.shape[1])))
# fig.savefig('..//fig//regression//assumptions//partialRegression.png', bbox_inches='tight', dpi=300, facecolor='white');
eval_env: 1
eval_env: 1
eval_env: 1
eval_env: 1

Inspect the plots

Based on these partial regression plots, each individual predictor appears to have an approximately linear relationship with the target.

5b) A more holistic view of linearity (and homoscedasticity)

What if, instead of 4 predictors, we have 100 predictors in our model? Partial regression plots become burdensome to look through when working with many predictors. Furthermore, we still need to assess whether the overall relationship revealed by the model is linear. For this analysis, we can create two plots that help evaluate both homoscedasticity and linearity.

Homoscedasticity refers to when the variance of the model residuals is constant over all X values. Homoscedasticity is desirable because we want the residuals to be the same across all values of the independent variables / predictors. For hypothesis testing, confidence intervals, and p-values to be valid and meaningful, it is crucial that the underlying assumptions, including homoscedasticity, are met. If the assumption is violated, the inference drawn from the hypothesis tests may not be accurate.

We will call a pre-baked helper function to generate the plots and calculate a test statistic that assesses homoscedasticity (Goldfeld-Quandt test).

Fig1. Predictions vs actual values

Fig2. Residuals vs predicted values

Goldfeld-Quandt test

The Goldfeld-Quandt test is a statistical test used to check for heteroscedasticity (unequal variances) in a regression model. It splits the data into two groups based on a specified split point (default is the median) and then estimates separate residual variances for each group. The test statistic is the ratio of these two variances, which follows an F-distribution under the null hypothesis, and the p-value is obtained by comparing the test statistic to the critical value from the F-distribution.

If the p-value is greater than your chosen significance level (e.g., 0.05), you fail to reject the null hypothesis, indicating no evidence of heteroscedasticity. In this case, the variance of residuals is assumed to be equal for both groups. If the p-value is less than your significance level, you can reject the null hypothesis, suggesting the presence of heteroscedasticity. This means that the variance of residuals is different for different groups, indicating potential issues with the regression model’s assumptions.
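The helper likely wraps statsmodels’ implementation of this test; a minimal sketch of calling it directly on the training data might look like the following (the split point and alternative hypothesis used by the helper are assumptions here):

from statsmodels.stats.diagnostic import het_goldfeldquandt

# y_train is the (log-scaled) target; X_train_z is the design matrix with constant
f_stat, p_value, ordering = het_goldfeldquandt(y_train, X_train_z, alternative='two-sided')
print(f"Goldfeld-Quandt F statistic: {f_stat:.3f}, p-value: {p_value:.3f}")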

from check_assumptions import homoscedasticity_linearity_test
fig = homoscedasticity_linearity_test(trained_model=trained_model,
                                      y=y_train, y_pred=y_pred_train,
                                      y_log_scaled=True, plot_raw=False)
# fig.savefig('..//fig//regression//assumptions//linearity-homoscedasticity_pred_v_residuals.png',bbox_inches='tight', dpi=300)

=======================================
VERIFYING LINEARITY & HOMOSCEDASTICITY...

 Goldfeld-Quandt test (homoscedasticity) ----
                value
F statistic  0.920670
p-value      0.818216
Homoscedasticity test: Passes (homoscedasticity is assumed)

 Residuals plots ----

Inspect the plot and Goldfeld-Quandt test

Fig1. Predictions vs actual values

Fig2. Residuals vs predicted values

How to remedy issues with linearity

If you encounter issues with the linearity assumption, you can try the following solutions:

  1. Transformations: Apply nonlinear transformations to X and/or Y. Common transformations are the natural logarithm, square root, and inverse. A Box-Cox transformation of the outcome may help, as well. Partial regression plots can help identify predictors that have a nonlinear relationship with Y.
  2. Remove nonlinear predictors: Remove predictors that exhibit a nonlinear trend with the target (e.g., via inspecting partial regression plots)
  3. Limit range of training data: With data like ours, you could try to limit the training data to sale prices in the 25-75th percentile range. This would yield an accurate (linear) description of the data, but the model would not generalize well to sale prices outside this range. If your goal is only to describe data and not extrapolate to unseen datasets, this approach may be sufficient.
  4. Adding additional predictors: Add predictors to help capture the relationship between the predictors and the label. Remember, we really just need the overall relationship between target and predictors to be linear. Sometimes, adding additional predictors that relate to the target can help produce an overall linear model. With our housing data, there may be additional house characteristics that, once included, help the combined relationship become linear.
  5. Try spline or polynomial regression: In polynomial regression, you add polynomial terms to some of the predictors. In a similar vein to solution 1, this allows you to include transformed predictors which may linearly relate to the target. Spline regression is similar; however, it allows you to fit separate polynomials to different segments of the data.

If none of those approaches work, you can also consider nonlinear models if you have a sufficiently large dataset (learning nonlinear relationships requires lots of data).

5c) Evaluate normality of residuals

Given that the linearity assumption is in question with this model, we would typically begin exploring alternative models/predictors/transformations for our analysis. However, we will check the remaining assumptions here as practice for future work.

Normal residuals: In a linear regression model, it is assumed that the model residuals/errors are normally distributed. If the residuals deviate substantially from normality, the errors are not behaving like random noise, which suggests that the model is not fully capturing the relationships in the data.

Many statistical tests and estimators used in multivariate regression, such as t-tests and confidence intervals, also rely on the assumption of normality. If the residuals are not approximately normally distributed, the results of these tests and estimators may be invalid or biased.

Histograms

Histograms can help give a preliminary sense of the overall distribution of the residuals. We can also quickly calculate the “skew” of the distribution using the pandas skew() method.

# Extract the residuals and calculate median — should lie close to 0 if it is a normal distribution
resids = y_train - y_pred_train
print('Median of residuals:', np.median(resids))
plt.hist(resids);
resids.skew()
# plt.savefig('..//fig//regression//assumptions//normal_resids_histogram.png',bbox_inches='tight', dpi=300)

Median of residuals: 0.0047744607860336075





-0.3768190722299776

Quantile-quantile (QQ) plots

While histograms are helpful as a preliminary glance at the data, it can be difficult to tell by eye just how different the distribution is from a normal distribution. Instead, we can use a popular type of plot known as a quantile-quantile plot (QQ-plot) of the model residuals. Quantiles, often referred to as percentiles, indicate values in your data below which a certain proportion of the data falls. For instance, if data comes from a classical bell-curve normal distribution with a mean of 0 and a standard deviation of 1, the 0.5 quantile, or 50th percentile, is 0 (half the data falls above 0, half below 0).
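As a quick sanity check of that definition (a small illustrative sketch, not part of the lesson code), the theoretical 0.5 quantile of a standard normal is exactly 0, and the 0.5 quantile of a large random sample from it lands very close to 0:

import numpy as np
from scipy import stats

print(stats.norm.ppf(0.5))  # theoretical 0.5 quantile of a standard normal -> 0.0

sample = np.random.default_rng(0).normal(size=10_000)
print(np.quantile(sample, 0.5))  # sample 0.5 quantile -> approximately 0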

import statsmodels.graphics.gofplots as smg

# Plot the QQ-plot of residuals
smg.qqplot(resids, line='s')

# Add labels and title
plt.xlabel('Theoretical Quantiles')
plt.ylabel('Sample Quantiles')
plt.title('QQ-Plot of Residuals')

# plt.savefig('..//fig//regression//assumptions//normal_resids_QQplot.png',bbox_inches='tight', dpi=300)
Text(0.5, 1.0, 'QQ-Plot of Residuals')

Unpacking the QQ-plot

To construct a QQ-plot, the raw data is first sorted from smallest to largest values. Then, empirical quantiles can be assigned to each sample in the dataset. These measurements can then be compared to theoretical quantiles from a normal distribution. Oftentimes, QQ-plots show z-scores rather than raw quantile values since z-scores can be interpreted more easily.

X-axis: Theoretical Quantiles. The x-axis represents the z-scores (theoretical quantiles) of a standard normal distribution.

Y-axis: Sample Quantiles. The y-axis represents the quantiles of your observed data (i.e., the sorted values of the residuals).

Red diagonal line Data drawn from a normal distribution fall along the line y = x in the Q-Q plot. If the residuals are perfectly normal, they will follow this diagonal perfectly.
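If it helps to demystify the plot, here is a rough sketch of how the same figure could be constructed by hand from the resids series defined earlier (the built-in qqplot() call above already does all of this for you):

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Sort the residuals (empirical quantiles) and pair them with theoretical
# quantiles of a standard normal at the corresponding plotting positions
sorted_resids = np.sort(resids)
n = len(sorted_resids)
probs = (np.arange(1, n + 1) - 0.5) / n
theoretical_q = stats.norm.ppf(probs)

# z-score the residuals so both axes are on the same (z-score) scale
plt.scatter(theoretical_q, (sorted_resids - resids.mean()) / resids.std(), s=5)
plt.plot(theoretical_q, theoretical_q, color='red')  # reference line y = x
plt.xlabel('Theoretical Quantiles')
plt.ylabel('Sample Quantiles (z-scored)')
plt.show()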

Common Diagnostics

  1. Right-skewed: If the data falls above the red line (where y=x) where x > 0, you have a right-skewed distribution (long tail on the right side of the distribution). A right-skewed distribution will have higher than expected z-scores for data that is greater than the mean (z-score = 0).
  2. Left-skewed: If the data falls below the red line (where y=x) where x < 0, you have a left-skewed distribution (long tail on the left side of the distribution). This causes the sample distribution to have lower (more negative) than expected z-scores for data that is less than the mean (z-score = 0).
  3. Long tails / tall peak: A combination of 1 & 2 above; points below the mean (z-score = 0) fall below the red line, and points above the mean fall above the red line.

Quantitative assessments of normality

Shapiro-Wilk test and Kolmogorov–Smirnov test: There are a couple of methods that can yield quantitative assessments of normality. However, they are both very sensitive to sample size and pale in comparison to visual inspection of the residuals (via histogram/density plot or a QQ-plot). With larger sample sizes (>1000 observations), even minor deviations from normality may lead to rejecting the null hypothesis, making these tests less useful.

from scipy import stats
# Shapiro-Wilk
shapiro_stat, shapiro_p = stats.shapiro(resids)
print(f"Shapiro-Wilk test: statistic={shapiro_stat:.4f}, p-value={shapiro_p:.10f}")

# Perform the Kolmogorov-Smirnov test on the test_residuals
ks_stat, ks_p = stats.kstest(resids, 'norm')
print(f"Kolmogorov-Smirnov test: statistic={ks_stat:.4f}, p-value={ks_p:.10f}")

# Display any plots generated above
plt.show()
Shapiro-Wilk test: statistic=0.9825, p-value=0.0000000019
Kolmogorov-Smirnov test: statistic=0.3115, p-value=0.0000000000
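One caveat worth knowing: by default, scipy’s kstest compares the sample to a standard normal distribution (mean 0, standard deviation 1), so part of the large statistic above likely reflects the residuals’ much smaller spread rather than their shape. A quick sketch of a fairer comparison is to standardize the residuals first (or pass the estimated mean and standard deviation via the args parameter):

# Standardize the residuals before comparing them to a standard normal
resids_standardized = (resids - resids.mean()) / resids.std()
ks_stat_z, ks_p_z = stats.kstest(resids_standardized, 'norm')
print(f"Kolmogorov-Smirnov test (standardized residuals): statistic={ks_stat_z:.4f}, p-value={ks_p_z:.10f}")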

Causes of non-normal residuals

Violations of normality often arise due to…

Later in the workshop, we can use the following helper function to run the normality tests/visualizations.

from check_assumptions import normal_resid_test
fig = normal_resid_test(resids)
# fig.savefig('..//fig//regression//assumptions//normal_resids_fullTest.png',bbox_inches='tight', dpi=300)
==========================
VERIFYING NORMAL ERRORS...
Median of residuals: 0.0047744607860336075
Skewness of resids (+/- 0.5 is bad): -0.3768190722299776
Shapiro-Wilk test: statistic=0.9825, p-value=0.0000000019
Shapiro-Wilk test passes: False
Kolmogorov-Smirnov test: statistic=0.3115, p-value=0.0000000000
Kolmogorov-Smirnov test passes: False

6. Independent errors test

Independent errors assumption: In the context of linear regression, the independent errors assumption states that the errors (also called residuals) in the model are not correlated with each other; knowing the error for one observation should tell you nothing about the error for any other observation. Combined with the assumption that errors are centered on zero, this means the model is equally likely to overestimate as it is to underestimate the true values of the dependent variable, rather than making systematic, patterned errors.

The importance of this assumption lies in its role in ensuring that the estimated coefficients obtained from the regression analysis are unbiased. When the errors have a mean of zero and are independent, the regression coefficients are estimated in such a way that they minimize the sum of squared errors between the observed data and the model’s predictions. This property ensures that the estimates of the coefficients are not systematically biased in one direction, and they tend to converge to the true population parameters as the sample size increases.

Mathematically, for a linear regression model, the independent errors assumption can be written as:

Corr(εi, εj) = 0 for all i ≠ j

Where:

  • εi and εj are the model errors (residuals) for observations i and j, and Corr denotes their correlation.

Durbin-Watson Test

The Durbin-Watson test is a statistical test that checks for autocorrelation in the residuals. If the residuals are independent, there should be no significant autocorrelation. The test statistic ranges between 0 and 4. A value around 2 indicates no autocorrelation, while values significantly below 2 indicate positive autocorrelation, and values above 2 indicate negative autocorrelation. You can use the statsmodels library to calculate the Durbin-Watson test statistic:

from statsmodels.stats.stattools import durbin_watson
durbin_watson_statistic = durbin_watson(resids)
print(f"Durbin-Watson test statistic: {durbin_watson_statistic}")
Durbin-Watson test statistic: 1.9864287360736372
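For reference, the Durbin-Watson statistic is simple to compute directly from the residuals: it is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. A minimal sketch using the resids series from earlier:

import numpy as np

# DW = sum((e_t - e_{t-1})^2) / sum(e_t^2), computed over residuals in their fitted order
resid_arr = np.asarray(resids)
dw_manual = np.sum(np.diff(resid_arr) ** 2) / np.sum(resid_arr ** 2)
print(f"Durbin-Watson (manual): {dw_manual:.4f}")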

It can also be helpful to re-examine the model predictions vs residual plot to look for violations of this assumption. If the errors are independent, the residuals should be randomly scattered around zero with no specific patterns or trends. We’ll call a pre-baked helper function to quickly create this plot and run the Durbin-Watson test.

from check_assumptions import independent_resid_test
fig = independent_resid_test(y_pred_train, resids, include_plot=True)
# fig.savefig('..//fig//regression//assumptions//independentResids_fullTest.png',bbox_inches='tight', dpi=300)
==========================
VERIFYING INDEPENDENT ERRORS...
Durbin-Watson test statistic: 1.9864287360736372
Durbin-Watson test statistic is within the expected range (1.5 to 2.5) for no significant autocorrelation.

Running all assumptions tests at once

from check_assumptions import eval_regression_assumptions
eval_regression_assumptions(trained_model=trained_model, X=X_train_z, y=y_train,
                            y_pred=y_pred_train, y_log_scaled=True, plot_raw=False, threshold_p_value=.05);
==========================
VERIFYING MULTICOLLINEARITY...
                Variable       VIF
0            OverallQual  1.098041
1  SaleCondition_Abnorml  1.009233
2   SaleCondition_Family  1.003844
3  SaleCondition_Partial  1.100450









=======================================
VERIFYING LINEARITY & HOMOSCEDASTICITY...

 Goldfeld-Quandt test (homoscedasticity) ----
                value
F statistic  0.920670
p-value      0.818216
Homoscedasticity test: Passes (homoscedasticity is assumed)

 Residuals plots ----









==========================
VERIFYING NORMAL ERRORS...
Median of residuals: 0.0047744607860336075
Skewness of resids (+/- 0.5 is bad): -0.3768190722299776
Shapiro-Wilk test: statistic=0.9825, p-value=0.0000000019
Shapiro-Wilk test passes: False
Kolmogorov-Smirnov test: statistic=0.3115, p-value=0.0000000000
Kolmogorov-Smirnov test passes: False









==========================
VERIFYING INDEPENDENT ERRORS...
Durbin-Watson test statistic: 1.9864287360736372
Durbin-Watson test statistic is within the expected range (1.5 to 2.5) for no significant autocorrelation.

Exploring an alternative set of predictors

Now that you know how to assess the 5 assumptions of linear regression, try validating a multivariate linear model that predicts log(SalePrice) from the following predictors:

  • LotArea
  • YearBuilt
  • YearRemodAdd
  • GarageArea
  • GarageCars
  • Neighborhood
  1. First, extract the data you’ll be using to fit the model.

  2. Next, preprocess the data using encode_predictors_housing_data(X) and remove_bad_cols(X, 95) from preprocessing.py.

  3. Use multicollinearity_test() from check_assumptions.py to check for multicollinearity and remove any problematic predictors. Repeat the test until multicollinearity is removed (VIF scores < 10).

  4. Perform a train/test split leaving 33% of the data out for the test set. Use random_state=0 to match the same results as everyone else.

  5. Use the zscore() helper from preprocessing.py to zscore the train and test sets. This will allow us to compare coefficient sizes later.

  6. Train the model using the statsmodels package. Don’t forget to add the constant to both train/test sets (so you can do prediction on both)

  7. Check for evidence of extreme underfitting or overfitting using measure_model_err() from regression_predict_sklearn.py

  8. Check all model assumptions

Solution

1) First, extract the data you’ll be using to fit the model.

# Extract target, `y` and predictors, `X`.
y = housing['target']
y_log = np.log(y)
predictors = ['LotArea', 'YearBuilt', 'YearRemodAdd', 'GarageArea', 'GarageCars', 'Neighborhood']
X=housing['data'][predictors]
X.head()

2) Next, preprocess the data using encode_predictors_housing_data(X) and remove_bad_cols(X, 95) from preprocessing.py.

# Preprocess the data
from preprocessing import encode_predictors_housing_data
X_enc = encode_predictors_housing_data(X)
X_enc.head()

from preprocessing import remove_bad_cols
X_good = remove_bad_cols(X_enc, 95)

3) Use multicollinearity_test() from check_assumptions.py to check for multicollinearity and remove any problematic predictors. Repeat the test until multicollinearity is removed (VIF scores < 10).

multicollinearity_test(X_good);
X_better = X_good.drop(['GarageCars','YearBuilt'],axis = 1)
multicollinearity_test(X_better);

4) Perform a train/test split leaving 33% of the data out for the test set. Use random_state=0 to match the same results as everyone else.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_better, y_log, test_size=0.33, random_state=0)

5) Use the zscore() helper from preprocessing.py to zscore the train and test sets. This will allow us to compare coefficient sizes later.

from preprocessing import zscore
X_train_z = zscore(df=X_train, train_means=X_train.mean(), train_stds=X_train.std())
X_test_z = zscore(df=X_test, train_means=X_train.mean(), train_stds=X_train.std())
X_train_z.head()

6) Train the model using the statsmodels package. Don’t forget to add the constant to both train/test sets (so you can do prediction on both)

# Add a constant column to the predictor variables dataframe
X_train_z = sm.add_constant(X_train_z)
# Add the constant to the test set as well so we can use the model to form predictions on the test set later
X_test_z = sm.add_constant(X_test_z)
# Fit the multivariate regression model
model = sm.OLS(y_train, X_train_z)
trained_model = model.fit()

7) Check for evidence of extreme underfitting or overfitting using measure_model_err() from regression_predict_sklearn.py

from regression_predict_sklearn import measure_model_err
# to calculate residuals and R-squared for the test set, we'll need to get the model predictions first
y_pred_train = trained_model.predict(X_train_z)
y_pred_test = trained_model.predict(X_test_z)
errors_df = measure_model_err(y, np.mean(y),
                              y_train, y_pred_train,
                              y_test, y_pred_test,
                              'RMSE', y_log_scaled=True)

errors_df.head()

8) Check all model assumptions

eval_regression_assumptions(trained_model=trained_model, X=X_train_z, y=y_train,
                            y_pred=y_pred_train, y_log_scaled=True, plot_raw=False, threshold_p_value=.05);

Key Points

  • All models are wrong, but some are useful.

  • Before reading into a model’s estimated coefficients, modelers must take care to check for evidence of overfitting and assess the 5 assumptions of linear regression.

  • One-hot encoding, while necessary, can often produce very sparse binary predictors which carry little information. Predictors with very little variability should be removed prior to model fitting.


Model interpretation and hypothesis testing

Overview

Teaching: 45 min
Exercises: 2 min
Questions
  • How can multivariate models be used to detect interesting relationships found in nature?

  • How can we interpret statistical results with minor violations of model assumptions?

Objectives
  • Understand how to use statistics to evaluate the likelihood of existing relationships recovered by a multivariate model.

  • Understand how to interpret a regression model and evaluate the strength/direction of different predictor and target relationships.

Hypothesis testing and significant features/predictors

We will ultimately use hypothesis testing to determine if our model has found any statistically significant relationships. What does it mean to be statistically significant? It means that an observed relationship is unlikely to have occurred by chance alone (e.g., less than a 5% chance when using a significance threshold of p = .05).

To run statistics on a regression model, we start with two hypotheses — one null and one alternative.

In other words, we are testing to see if a predictor has a consistent effect on some target variable. We are NOT testing the magnitude of the effect (we will discuss effect sizes later); we are simply testing whether or not an observed effect is likely due to chance. In statistics, we start with the null hypothesis as our default and review evidence (the fitted model and its estimated parameters and error measurements) to see if the observed data suggests that the null hypothesis should be rejected.

Overview of hypothesis testing procedure

The procedure for testing whether predictor(s) have a statistically significant effect on a target variable in a regression model typically involves the following steps:

  1. Formulate the null hypothesis (H₀) and alternative hypothesis (H₁) for the test. The null hypothesis typically states that the predictor has no effect on the response variable (coef=0), while the alternative hypothesis suggests that there is a significant effect (coef!=0).

  2. If using multiple predictors, check for multicollinearity. Multicollinearity can be an especially pervasive problem when several predictors carry overlapping information.

  3. Fit the regression model to your data. Obtain the estimated coefficients for each predictor, along with their standard errors.

  4. Check for evidence of overfitting: If overfitting occurs, it doesn’t make much sense to make general claims about observed relationships in the model.

  5. Evaluate the linearity assumption: If using a univariate model, you can do this step before model fitting via a simple scatterplot.

  6. Evaluate normality of errors assumption.

  7. Calculate the test statistic: Calculate the test statistic based on the estimated coefficient and its standard error. The test statistic depends on the specific regression model and hypothesis being tested. Common test statistics include t-statistic, z-statistic, or F-statistic.

  8. Determine the critical value: Determine the critical value or significance level (α) at which you want to test the hypothesis. The significance level typically ranges from 0.01 to 0.05, depending on the desired level of confidence.

  9. Compare the test statistic and critical value: Compare the calculated test statistic with the critical value. If the test statistic falls within the critical region (i.e., the calculated p-value is less than the significance level), you reject the null hypothesis and conclude that the predictor is statistically significant. If the test statistic does not fall within the critical region, you fail to reject the null hypothesis, indicating that the predictor is not statistically significant.

  10. Interpret the results: Based on the conclusion from the hypothesis test, interpret the significance of the predictor. If the predictor is deemed statistically significant, it suggests that there is evidence of a relationship between the predictor and the response variable. If the predictor is not statistically significant, it implies that there is no significant evidence of an effect.

It’s important to note that significance tests provide statistical evidence for or against the null hypothesis, but they should be interpreted alongside other factors such as effect size, practical significance, and the context of the problem being studied. Additionally, it’s crucial to consider the assumptions and limitations of the regression model and the underlying data when interpreting the model.

Steps 1-6

We’ll begin by working with the model from the last episode’s exercise. Recall that this model had residuals which deviated slightly from normality. Of all the model assumptions, normality of residuals is the least strict. We will discuss how to report this deviation from normality when we write up the results of the statistical tests.

from exercise_solutions import regression_assumptions_e1_explore_altPredictors
trained_model = regression_assumptions_e1_explore_altPredictors()
16 columns removed, 14 remaining.
Columns removed: ['Neighborhood_Blmngtn', 'Neighborhood_Blueste', 'Neighborhood_BrDale', 'Neighborhood_BrkSide', 'Neighborhood_ClearCr', 'Neighborhood_Crawfor', 'Neighborhood_IDOTRR', 'Neighborhood_MeadowV', 'Neighborhood_Mitchel', 'Neighborhood_NPkVill', 'Neighborhood_NoRidge', 'Neighborhood_SWISU', 'Neighborhood_SawyerW', 'Neighborhood_StoneBr', 'Neighborhood_Timber', 'Neighborhood_Veenker']

==========================
VERIFYING MULTICOLLINEARITY...
                Variable          VIF
0             GarageArea    29.329312
1             GarageCars    33.374751
2                LotArea     2.245048
3   Neighborhood_CollgCr     1.391947
4   Neighborhood_Edwards     1.254815
5   Neighborhood_Gilbert     1.242131
6     Neighborhood_NAmes     1.503415
7    Neighborhood_NWAmes     1.168004
8   Neighborhood_NridgHt     1.300592
9   Neighborhood_OldTown     1.486816
10   Neighborhood_Sawyer     1.163274
11  Neighborhood_Somerst     1.268948
12             YearBuilt  9570.221184
13          YearRemodAdd  9448.656444









==========================
VERIFYING MULTICOLLINEARITY...
                Variable       VIF
0             GarageArea  7.757408
1                LotArea  2.239206
2   Neighborhood_CollgCr  1.350229
3   Neighborhood_Edwards  1.236703
4   Neighborhood_Gilbert  1.166287
5     Neighborhood_NAmes  1.465777
6    Neighborhood_NWAmes  1.159278
7   Neighborhood_NridgHt  1.288820
8   Neighborhood_OldTown  1.249199
9    Neighborhood_Sawyer  1.157742
10  Neighborhood_Somerst  1.249040
11          YearRemodAdd  9.285893









==========================
VERIFYING MULTICOLLINEARITY...
                Variable       VIF
0             GarageArea  1.403296
1                LotArea  1.057810
2   Neighborhood_CollgCr  1.275199
3   Neighborhood_Edwards  1.166506
4   Neighborhood_Gilbert  1.142521
5     Neighborhood_NAmes  1.264983
6    Neighborhood_NWAmes  1.098685
7   Neighborhood_NridgHt  1.256041
8   Neighborhood_OldTown  1.152291
9    Neighborhood_Sawyer  1.116459
10  Neighborhood_Somerst  1.221762
11          YearRemodAdd  1.503561









=======================================
VERIFYING LINEARITY & HOMOSCEDASTICITY...

 Goldfeld-Quandt test (homoscedasticity) ----
                value
F statistic  0.831952
p-value      0.977493
Homoscedasticity test: Passes (homoscedasticity is assumed)

 Residuals plots ----









==========================
VERIFYING NORMAL ERRORS...
Median of residuals: 0.004087877077902924
Skewness of resids (+/- 0.5 is bad): -0.42300070977189735
Shapiro-Wilk test: statistic=0.9730, p-value=0.0000000000
Shapiro-Wilk test passes: False
Kolmogorov-Smirnov test: statistic=0.3038, p-value=0.0000000000
Kolmogorov-Smirnov test passes: False









==========================
VERIFYING INDEPENDENT ERRORS...
Durbin-Watson test statistic: 2.022691389055886
Durbin-Watson test statistic is within the expected range (1.5 to 2.5) for no significant autocorrelation.

Steps 7-10

  1. Calculate the test statistic: Calculate the test statistic based on the estimated coefficient and its standard error. The test statistic depends on the specific regression model and hypothesis being tested. Common test statistics include t-statistic, z-statistic, or F-statistic.

  2. Determine the critical value: Determine the critical value or significance level (α) at which you want to test the hypothesis. The significance level typically ranges from 0.01 to 0.05, depending on the desired level of confidence.

  3. Compare the test statistic and critical value: Compare the calculated test statistic with the critical value. If the test statistic falls within the critical region (i.e., the calculated p-value is less than the significance level), you reject the null hypothesis and conclude that the predictor is statistically significant. If the test statistic does not fall within the critical region, you fail to reject the null hypothesis, indicating that the predictor is not statistically significant.

  4. Interpret the results: Based on the conclusion from the hypothesis test, interpret the significance of the predictor. If the predictor is deemed statistically significant, it suggests that there is evidence of a relationship between the predictor and the response variable. If the predictor is not statistically significant, it implies that there is no significant evidence of an effect.

Calculating the test statistic

t-statistic: The t-statistic is typically used to test the statistical significance of individual coefficient estimates in the regression model. It measures the ratio of the estimated coefficient to its standard error. The t-test helps assess whether a specific predictor variable has a significant effect on the response variable while accounting for the uncertainty in the coefficient estimate.

P-values for t-statistics are calculated based on the t-distribution. The t-distribution is a probability distribution that is used when the population standard deviation is unknown and needs to be estimated from the sample.

To calculate the p-value for a t-statistic, you follow these general steps:

  1. Formulate the null hypothesis (H0) and alternative hypothesis (H1) for the test you are conducting.

  2. Calculate the t-statistic for the test using the formula:

    t = (estimate - null_value) / standard_error

    where “estimate” is the estimated coefficient or difference, “null_value” is the value specified under the null hypothesis (often 0), and “standard_error” is the standard error of the coefficient or difference estimate.

  3. Look up the p-value associated with the calculated t-value and degrees of freedom in the t-distribution table or use statistical software to calculate it. The p-value represents the probability of observing a t-value as extreme as, or more extreme than, the calculated value under the null hypothesis.

  4. Compare the p-value to the predetermined significance level (commonly 0.05). If the p-value is less than the significance level, you reject the null hypothesis in favor of the alternative hypothesis. If the p-value is greater than or equal to the significance level, you fail to reject the null hypothesis.

By calculating the p-value for the t-statistic, you can assess the statistical significance of the coefficient estimate or the difference being tested. A lower p-value indicates stronger evidence against the null hypothesis and suggests a higher likelihood that a significant relationship or effect exists. It does not mean that the effect is stronger.

The more manual route of calculating p-values…

In this code, we calculate the t-values by dividing the coefficient estimates by the standard errors. The t-value represents the ratio of the estimated coefficient (or difference) to its standard error. It measures the number of standard errors by which the estimated coefficient differs from zero. The standard error reflects the precision of the estimated coefficient, and a larger t-value indicates a larger difference relative to the standard error.

Next, we use the t-values to calculate the two-sided p-values using the stats.t.sf function from the SciPy library. The np.abs(t_values) ensures that we consider the absolute values of the t-values to calculate the p-values for both positive and negative t-values. We multiply the resulting p-values by 2 to obtain the two-sided p-values. The p-value is the probability of observing a t-value as extreme as, or more extreme than, the one calculated, assuming the null hypothesis is true. By convention, if the p-value is smaller than a predetermined significance level (commonly 0.05), we reject the null hypothesis in favor of the alternative hypothesis, indicating that the coefficient is statistically significant.

from scipy import stats
import numpy as np

# Get the coefficient estimates and standard errors
coefs = trained_model.params
std_errs = trained_model.bse

# Calculate the t-values and p-values
t_values = coefs / std_errs
p_values = stats.t.sf(np.abs(t_values), df=trained_model.df_resid) * 2
p_values
array([0.00000000e+00, 5.77683408e-39, 1.51433066e-15, 5.63916857e-67,
       1.11482797e-01, 1.04392453e-05, 6.54771638e-01, 1.98182267e-02,
       1.73282954e-01, 1.99578527e-05, 6.51408046e-09, 1.59324911e-05,
       6.94929557e-01])

The faster route

p_values = trained_model.pvalues
print(p_values)
p_values[p_values < .05]
const                   0.000000e+00
GarageArea              5.639169e-67
LotArea                 1.514331e-15
YearRemodAdd            5.776834e-39
Neighborhood_CollgCr    1.114828e-01
Neighborhood_Edwards    1.043925e-05
Neighborhood_Gilbert    6.547716e-01
Neighborhood_NAmes      1.981823e-02
Neighborhood_NWAmes     1.732830e-01
Neighborhood_NridgHt    1.995785e-05
Neighborhood_OldTown    6.514080e-09
Neighborhood_Sawyer     1.593249e-05
Neighborhood_Somerst    6.949296e-01
dtype: float64





const                   0.000000e+00
GarageArea              5.639169e-67
LotArea                 1.514331e-15
YearRemodAdd            5.776834e-39
Neighborhood_Edwards    1.043925e-05
Neighborhood_NAmes      1.981823e-02
Neighborhood_NridgHt    1.995785e-05
Neighborhood_OldTown    6.514080e-09
Neighborhood_Sawyer     1.593249e-05
dtype: float64
Alternatively, we can use the model summary (a nice perk of the statsmodels package):
trained_model.summary()
OLS Regression Results
Dep. Variable: SalePrice R-squared: 0.621
Model: OLS Adj. R-squared: 0.616
Method: Least Squares F-statistic: 131.7
Date: Tue, 15 Aug 2023 Prob (F-statistic): 1.19e-193
Time: 06:52:26 Log-Likelihood: -27.825
No. Observations: 978 AIC: 81.65
Df Residuals: 965 BIC: 145.2
Df Model: 12
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 12.0260 0.008 1500.602 0.000 12.010 12.042
GarageArea 0.1779 0.009 18.725 0.000 0.159 0.196
LotArea 0.0669 0.008 8.111 0.000 0.051 0.083
YearRemodAdd 0.1343 0.010 13.660 0.000 0.115 0.154
Neighborhood_CollgCr -0.0144 0.009 -1.593 0.111 -0.032 0.003
Neighborhood_Edwards -0.0384 0.009 -4.431 0.000 -0.055 -0.021
Neighborhood_Gilbert 0.0038 0.009 0.447 0.655 -0.013 0.021
Neighborhood_NAmes -0.0210 0.009 -2.334 0.020 -0.039 -0.003
Neighborhood_NWAmes 0.0115 0.008 1.363 0.173 -0.005 0.028
Neighborhood_NridgHt 0.0385 0.009 4.287 0.000 0.021 0.056
Neighborhood_OldTown -0.0504 0.009 -5.856 0.000 -0.067 -0.034
Neighborhood_Sawyer -0.0367 0.008 -4.337 0.000 -0.053 -0.020
Neighborhood_Somerst -0.0035 0.009 -0.392 0.695 -0.021 0.014
Omnibus: 86.068 Durbin-Watson: 2.023
Prob(Omnibus): 0.000 Jarque-Bera (JB): 260.402
Skew: -0.422 Prob(JB): 2.85e-57
Kurtosis: 5.383 Cond. No. 2.36



Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Interpret the results

  1. Which predictors are significant?
  2. How much variance is explained by this model?
  3. How do you interpret these results in the context of our assumption tests?

The model performance is as follows: the model explains roughly 62% of the variance in log(SalePrice) on the training data (R-squared = 0.621, adjusted R-squared = 0.616).

The following predictors were found to have p-values < .05: GarageArea, LotArea, YearRemodAdd, Neighborhood_Edwards, Neighborhood_NAmes, Neighborhood_NridgHt, Neighborhood_OldTown, and Neighborhood_Sawyer.

Contextual Note: The significance of predictors like YearRemodAdd, GarageArea, and LotArea suggests that they have a meaningful impact on SalePrice. However, the Shapiro-Wilk and Kolmogorov-Smirnov tests for normality and the associated p-values indicate that the residuals deviate from a normal distribution. Specifically, the model has a tendency to overestimate low-value homes and underestimate high-value homes. It’s important to keep in mind that while normality of residuals is an assumption of linear regression, minor deviations may not invalidate the entire model, especially if the deviation is not severe and the other assumptions (such as linearity, homoscedasticity, and independence of residuals) are met. Our model still provides valuable insights into the relationships between predictor variables and the target SalePrice. The R-squared value of 0.621 suggests that approximately 62.1% of the variance in SalePrice can be explained by the predictor variables in the model. However, given the observed pattern in residuals, it is important to pay closer attention to how these predictors affect homes in different SalePrice ranges. The model’s accuracy could vary depending on whether you’re predicting prices for low-value or high-value homes.

In summary, while the model’s residuals show some deviation from normality, the provided regression results still offer insights into the relationships between predictor variables and SalePrice. Careful consideration of the model’s assumptions, limitations, and context will help you interpret the results accurately. As the statistician George Box famously said, “All models are wrong, but some are useful.” This reminds us that models are simplifications of reality, and while they may not perfectly capture all aspects, they can still provide valuable insights and guidance.

Feature importance

In addition to running hypothesis tests, we should look more closely at the specific strength of the relationships found between predictors and target. A straightforward way to do this (assuming you have z-scored the predictors) is to plot the coefficient values and their magnitudes.

# Get coefficient values in terms of percent change in sale price.
# Since the model was fit on log(SalePrice) with z-scored predictors, exp(coef) - 1
# gives the proportional change in SalePrice per one standard deviation increase in each predictor.
coefs = (np.exp(trained_model.params) - 1)*100
coefs
const                   1.670383e+07
GarageArea              1.946508e+01
LotArea                 6.917864e+00
YearRemodAdd            1.437357e+01
Neighborhood_CollgCr   -1.432055e+00
Neighborhood_Edwards   -3.764838e+00
Neighborhood_Gilbert    3.840818e-01
Neighborhood_NAmes     -2.082547e+00
Neighborhood_NWAmes     1.151902e+00
Neighborhood_NridgHt    3.927266e+00
Neighborhood_OldTown   -4.915032e+00
Neighborhood_Sawyer    -3.608071e+00
Neighborhood_Somerst   -3.470761e-01
dtype: float64
from interpret_model import coef_plot
help(coef_plot)
Help on function coef_plot in module interpret_model:

coef_plot(coefs: pandas.core.series.Series, plot_const: bool = False, index: bool = None) -> matplotlib.figure.Figure
    Plot coefficient values and feature importance based on sorted feature importance.
    
    Args:
        coefs (pd.Series or np.ndarray): Coefficient values.
        plot_const (bool, optional): Whether or not to plot the y-intercept coef value. Default is False.
        index (list or pd.Index, optional): Index labels for the coefficients. Default is None.
    
    Returns:
        plt.Figure: The figure containing the coefficient plots.
fig = coef_plot(coefs=coefs, plot_const=False)
# Retrieve the axes from the figure
ax1 = fig.get_axes()[0]  # Assuming the first axis is what you want to modify

# Adjust the x-axis label of the first axis
ax1.set_xlabel("Percent Change In Sale Price")
# fig.savefig('..//fig//regression//interpret//coef_plot.png', bbox_inches='tight', dpi=300, facecolor='white');
Text(0.5, 27.722222222222214, 'Percent Change In Sale Price')

Key Points

  • All models are wrong, but some are useful.


Feature selection with PCA

Overview

Teaching: 45 min
Exercises: 2 min
Questions
  • How can PCA be used as a feature selection method?

Objectives

Multivariate Regression with PCA

Predict whether a house’s sale price will be high for the market from house characteristics

Ames housing dataset

load dataset

from sklearn.datasets import fetch_openml
housing = fetch_openml(name="house_prices", as_frame=True, parser='auto')

view data

df = housing.data.copy(deep=True)
df = df.astype({'Id':int})  # set data type of Id to int
df = df.set_index('Id')  # set Id column to be the index of the DataFrame
df

all feature names

print(df.columns.tolist())
['MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType', 'SaleCondition']

Reminder to access the Data Dictionary

from IPython.display import display, Pretty
display(Pretty(housing.DESCR))

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

MSSubClass: Identifies the type of dwelling involved in the sale.

        20	1-STORY 1946 & NEWER ALL STYLES
        30	1-STORY 1945 & OLDER
        40	1-STORY W/FINISHED ATTIC ALL AGES
        45	1-1/2 STORY - UNFINISHED ALL AGES
        50	1-1/2 STORY FINISHED ALL AGES
        60	2-STORY 1946 & NEWER
        70	2-STORY 1945 & OLDER
        75	2-1/2 STORY ALL AGES
        80	SPLIT OR MULTI-LEVEL
        85	SPLIT FOYER
        90	DUPLEX - ALL STYLES AND AGES
       120	1-STORY PUD (Planned Unit Development) - 1946 & NEWER
       150	1-1/2 STORY PUD - ALL AGES
       160	2-STORY PUD - 1946 & NEWER
       180	PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
       190	2 FAMILY CONVERSION - ALL STYLES AND AGES

MSZoning: Identifies the general zoning classification of the sale.

       A	Agriculture
       C	Commercial
       FV	Floating Village Residential
       I	Industrial
       RH	Residential High Density
       RL	Residential Low Density
       RP	Residential Low Density Park
       RM	Residential Medium Density

LotFrontage: Linear feet of street connected to property

LotArea: Lot size in square feet

Street: Type of road access to property

       Grvl	Gravel
       Pave	Paved

Alley: Type of alley access to property

       Grvl	Gravel
       Pave	Paved
       NA 	No alley access

LotShape: General shape of property

       Reg	Regular
       IR1	Slightly irregular
       IR2	Moderately Irregular
       IR3	Irregular

LandContour: Flatness of the property

       Lvl	Near Flat/Level
       Bnk	Banked - Quick and significant rise from street grade to building
       HLS	Hillside - Significant slope from side to side
       Low	Depression

Utilities: Type of utilities available

       AllPub	All public Utilities (E,G,W,& S)
       NoSewr	Electricity, Gas, and Water (Septic Tank)
       NoSeWa	Electricity and Gas Only
       ELO	Electricity only

LotConfig: Lot configuration

       Inside	Inside lot
       Corner	Corner lot
       CulDSac	Cul-de-sac
       FR2	Frontage on 2 sides of property
       FR3	Frontage on 3 sides of property

LandSlope: Slope of property

       Gtl	Gentle slope
       Mod	Moderate Slope
       Sev	Severe Slope

Neighborhood: Physical locations within Ames city limits

       Blmngtn	Bloomington Heights
       Blueste	Bluestem
       BrDale	Briardale
       BrkSide	Brookside
       ClearCr	Clear Creek
       CollgCr	College Creek
       Crawfor	Crawford
       Edwards	Edwards
       Gilbert	Gilbert
       IDOTRR	Iowa DOT and Rail Road
       MeadowV	Meadow Village
       Mitchel	Mitchell
       Names	North Ames
       NoRidge	Northridge
       NPkVill	Northpark Villa
       NridgHt	Northridge Heights
       NWAmes	Northwest Ames
       OldTown	Old Town
       SWISU	South & West of Iowa State University
       Sawyer	Sawyer
       SawyerW	Sawyer West
       Somerst	Somerset
       StoneBr	Stone Brook
       Timber	Timberland
       Veenker	Veenker

Condition1: Proximity to various conditions

       Artery	Adjacent to arterial street
       Feedr	Adjacent to feeder street
       Norm	Normal
       RRNn	Within 200' of North-South Railroad
       RRAn	Adjacent to North-South Railroad
       PosN	Near positive off-site feature--park, greenbelt, etc.
       PosA	Adjacent to postive off-site feature
       RRNe	Within 200' of East-West Railroad
       RRAe	Adjacent to East-West Railroad

Condition2: Proximity to various conditions (if more than one is present)

       Artery	Adjacent to arterial street
       Feedr	Adjacent to feeder street
       Norm	Normal
       RRNn	Within 200' of North-South Railroad
       RRAn	Adjacent to North-South Railroad
       PosN	Near positive off-site feature--park, greenbelt, etc.
       PosA	Adjacent to postive off-site feature
       RRNe	Within 200' of East-West Railroad
       RRAe	Adjacent to East-West Railroad

BldgType: Type of dwelling

       1Fam	Single-family Detached
       2FmCon	Two-family Conversion; originally built as one-family dwelling
       Duplx	Duplex
       TwnhsE	Townhouse End Unit
       TwnhsI	Townhouse Inside Unit

HouseStyle: Style of dwelling

       1Story	One story
       1.5Fin	One and one-half story: 2nd level finished
       1.5Unf	One and one-half story: 2nd level unfinished
       2Story	Two story
       2.5Fin	Two and one-half story: 2nd level finished
       2.5Unf	Two and one-half story: 2nd level unfinished
       SFoyer	Split Foyer
       SLvl	Split Level

OverallQual: Rates the overall material and finish of the house

       10	Very Excellent
       9	Excellent
       8	Very Good
       7	Good
       6	Above Average
       5	Average
       4	Below Average
       3	Fair
       2	Poor
       1	Very Poor

OverallCond: Rates the overall condition of the house

       10	Very Excellent
       9	Excellent
       8	Very Good
       7	Good
       6	Above Average
       5	Average
       4	Below Average
       3	Fair
       2	Poor
       1	Very Poor

YearBuilt: Original construction date

YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)

RoofStyle: Type of roof

       Flat	Flat
       Gable	Gable
       Gambrel	Gabrel (Barn)
       Hip	Hip
       Mansard	Mansard
       Shed	Shed

RoofMatl: Roof material

       ClyTile	Clay or Tile
       CompShg	Standard (Composite) Shingle
       Membran	Membrane
       Metal	Metal
       Roll	Roll
       Tar&Grv	Gravel & Tar
       WdShake	Wood Shakes
       WdShngl	Wood Shingles

Exterior1st: Exterior covering on house

       AsbShng	Asbestos Shingles
       AsphShn	Asphalt Shingles
       BrkComm	Brick Common
       BrkFace	Brick Face
       CBlock	Cinder Block
       CemntBd	Cement Board
       HdBoard	Hard Board
       ImStucc	Imitation Stucco
       MetalSd	Metal Siding
       Other	Other
       Plywood	Plywood
       PreCast	PreCast
       Stone	Stone
       Stucco	Stucco
       VinylSd	Vinyl Siding
       Wd Sdng	Wood Siding
       WdShing	Wood Shingles

Exterior2nd: Exterior covering on house (if more than one material)

       AsbShng	Asbestos Shingles
       AsphShn	Asphalt Shingles
       BrkComm	Brick Common
       BrkFace	Brick Face
       CBlock	Cinder Block
       CemntBd	Cement Board
       HdBoard	Hard Board
       ImStucc	Imitation Stucco
       MetalSd	Metal Siding
       Other	Other
       Plywood	Plywood
       PreCast	PreCast
       Stone	Stone
       Stucco	Stucco
       VinylSd	Vinyl Siding
       Wd Sdng	Wood Siding
       WdShing	Wood Shingles

MasVnrType: Masonry veneer type

       BrkCmn	Brick Common
       BrkFace	Brick Face
       CBlock	Cinder Block
       None	None
       Stone	Stone

MasVnrArea: Masonry veneer area in square feet

ExterQual: Evaluates the quality of the material on the exterior

       Ex	Excellent
       Gd	Good
       TA	Average/Typical
       Fa	Fair
       Po	Poor

ExterCond: Evaluates the present condition of the material on the exterior

       Ex	Excellent
       Gd	Good
       TA	Average/Typical
       Fa	Fair
       Po	Poor

Foundation: Type of foundation

       BrkTil	Brick & Tile
       CBlock	Cinder Block
       PConc	Poured Contrete
       Slab	Slab
       Stone	Stone
       Wood	Wood

BsmtQual: Evaluates the height of the basement

       Ex	Excellent (100+ inches)
       Gd	Good (90-99 inches)
       TA	Typical (80-89 inches)
       Fa	Fair (70-79 inches)
       Po	Poor (<70 inches
       NA	No Basement

BsmtCond: Evaluates the general condition of the basement

       Ex	Excellent
       Gd	Good
       TA	Typical - slight dampness allowed
       Fa	Fair - dampness or some cracking or settling
       Po	Poor - Severe cracking, settling, or wetness
       NA	No Basement

BsmtExposure: Refers to walkout or garden level walls

       Gd	Good Exposure
       Av	Average Exposure (split levels or foyers typically score average or above)
       Mn	Mimimum Exposure
       No	No Exposure
       NA	No Basement

BsmtFinType1: Rating of basement finished area

       GLQ	Good Living Quarters
       ALQ	Average Living Quarters
       BLQ	Below Average Living Quarters
       Rec	Average Rec Room
       LwQ	Low Quality
       Unf	Unfinshed
       NA	No Basement

BsmtFinSF1: Type 1 finished square feet

BsmtFinType2: Rating of basement finished area (if multiple types)

       GLQ	Good Living Quarters
       ALQ	Average Living Quarters
       BLQ	Below Average Living Quarters
       Rec	Average Rec Room
       LwQ	Low Quality
       Unf	Unfinshed
       NA	No Basement

BsmtFinSF2: Type 2 finished square feet

BsmtUnfSF: Unfinished square feet of basement area

TotalBsmtSF: Total square feet of basement area

Heating: Type of heating

       Floor	Floor Furnace
       GasA	Gas forced warm air furnace
       GasW	Gas hot water or steam heat
       Grav	Gravity furnace
       OthW	Hot water or steam heat other than gas
       Wall	Wall furnace

HeatingQC: Heating quality and condition

       Ex	Excellent
       Gd	Good
       TA	Average/Typical
       Fa	Fair
       Po	Poor

CentralAir: Central air conditioning

       N	No
       Y	Yes

Electrical: Electrical system

       SBrkr	Standard Circuit Breakers & Romex
       FuseA	Fuse Box over 60 AMP and all Romex wiring (Average)
       FuseF	60 AMP Fuse Box and mostly Romex wiring (Fair)
       FuseP	60 AMP Fuse Box and mostly knob & tube wiring (poor)
       Mix	Mixed

1stFlrSF: First Floor square feet

2ndFlrSF: Second floor square feet

LowQualFinSF: Low quality finished square feet (all floors)

GrLivArea: Above grade (ground) living area square feet

BsmtFullBath: Basement full bathrooms

BsmtHalfBath: Basement half bathrooms

FullBath: Full bathrooms above grade

HalfBath: Half baths above grade

Bedroom: Bedrooms above grade (does NOT include basement bedrooms)

Kitchen: Kitchens above grade

KitchenQual: Kitchen quality

       Ex	Excellent
       Gd	Good
       TA	Typical/Average
       Fa	Fair
       Po	Poor

TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)

Functional: Home functionality (Assume typical unless deductions are warranted)

       Typ	Typical Functionality
       Min1	Minor Deductions 1
       Min2	Minor Deductions 2
       Mod	Moderate Deductions
       Maj1	Major Deductions 1
       Maj2	Major Deductions 2
       Sev	Severely Damaged
       Sal	Salvage only

Fireplaces: Number of fireplaces

FireplaceQu: Fireplace quality

       Ex	Excellent - Exceptional Masonry Fireplace
       Gd	Good - Masonry Fireplace in main level
       TA	Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
       Fa	Fair - Prefabricated Fireplace in basement
       Po	Poor - Ben Franklin Stove
       NA	No Fireplace

GarageType: Garage location

       2Types	More than one type of garage
       Attchd	Attached to home
       Basment	Basement Garage
       BuiltIn	Built-In (Garage part of house - typically has room above garage)
       CarPort	Car Port
       Detchd	Detached from home
       NA	No Garage

GarageYrBlt: Year garage was built

GarageFinish: Interior finish of the garage

       Fin	Finished
       RFn	Rough Finished
       Unf	Unfinished
       NA	No Garage

GarageCars: Size of garage in car capacity

GarageArea: Size of garage in square feet

GarageQual: Garage quality

       Ex	Excellent
       Gd	Good
       TA	Typical/Average
       Fa	Fair
       Po	Poor
       NA	No Garage

GarageCond: Garage condition

       Ex	Excellent
       Gd	Good
       TA	Typical/Average
       Fa	Fair
       Po	Poor
       NA	No Garage

PavedDrive: Paved driveway

       Y	Paved
       P	Partial Pavement
       N	Dirt/Gravel

WoodDeckSF: Wood deck area in square feet

OpenPorchSF: Open porch area in square feet

EnclosedPorch: Enclosed porch area in square feet

3SsnPorch: Three season porch area in square feet

ScreenPorch: Screen porch area in square feet

PoolArea: Pool area in square feet

PoolQC: Pool quality

       Ex	Excellent
       Gd	Good
       TA	Average/Typical
       Fa	Fair
       NA	No Pool

Fence: Fence quality

       GdPrv	Good Privacy
       MnPrv	Minimum Privacy
       GdWo	Good Wood
       MnWw	Minimum Wood/Wire
       NA	No Fence

MiscFeature: Miscellaneous feature not covered in other categories

       Elev	Elevator
       Gar2	2nd Garage (if not described in garage section)
       Othr	Other
       Shed	Shed (over 100 SF)
       TenC	Tennis Court
       NA	None

MiscVal: $Value of miscellaneous feature

MoSold: Month Sold (MM)

YrSold: Year Sold (YYYY)

SaleType: Type of sale

       WD 	Warranty Deed - Conventional
       CWD	Warranty Deed - Cash
       VWD	Warranty Deed - VA Loan
       New	Home just constructed and sold
       COD	Court Officer Deed/Estate
       Con	Contract 15% Down payment regular terms
       ConLw	Contract Low Down payment and low interest
       ConLI	Contract Low Interest
       ConLD	Contract Low Down
       Oth	Other

SaleCondition: Condition of sale

       Normal	Normal Sale
       Abnorml	Abnormal Sale -  trade, foreclosure, short sale
       AdjLand	Adjoining Land Purchase
       Alloca	Allocation - two linked properties with separate deeds, typically condo with a garage unit
       Family	Sale between family members
       Partial	Home was not completed when last assessed (associated with New Homes)

Downloaded from openml.org.

Understanding the data

What does TotRmsAbvGrd refer to?

Solution

TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
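If you want to look up a definition programmatically rather than scrolling, here is a minimal sketch (assuming the housing object loaded above):

# print every line of the data dictionary that mentions TotRmsAbvGrd
for line in housing.DESCR.split('\n'):
    if 'TotRmsAbvGrd' in line:
        print(line)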

Variable Classes

How many variables are numeric?

Solution

36 numeric variables (are these all continuous, or are some discrete?) and 43 categorical variables (are these all nominal, or are some ordinal?)
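One way to check this yourself - a minimal sketch, assuming the DataFrame df from above (the exact counts can depend on how the columns were typed when the data was loaded):

# numeric columns vs. non-numeric (categorical) columns
print(df.select_dtypes(include='number').shape[1], 'numeric columns')
print(df.select_dtypes(exclude='number').shape[1], 'non-numeric columns')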

Where Are All The NaNs?

How many NaN entries are there per variable?

Solution

df.isna().sum().sort_values(ascending=False).head(20)

PoolQC          1453
MiscFeature     1406
Alley           1369
Fence           1179
FireplaceQu      690
LotFrontage      259
GarageYrBlt       81
GarageCond        81
GarageType        81
GarageFinish      81
GarageQual        81
BsmtExposure      38
BsmtFinType2      38
BsmtCond          37
BsmtQual          37
BsmtFinType1      37
MasVnrArea         8
MasVnrType         8
Electrical         1
MSSubClass         0
dtype: int64


Which of these variables could be the best predictor of house sale price? Why?

Solution

Possible answers: a square-footage variable (e.g., GrLivArea), OverallQual, YearBuilt. Intuitively, these are going to be correlated with SalePrice - but NB: they are also correlated with each other!

Target Feature: SalePrice

# add target variable 'sales price' to data df from housing object
df[housing.target_names[0]] = housing.target.tolist()
df.describe()
MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2 ... WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold SalePrice
count 1460.000000 1201.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1452.000000 1460.000000 1460.000000 ... 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000
mean 56.897260 70.049958 10516.828082 6.099315 5.575342 1971.267808 1984.865753 103.685262 443.639726 46.549315 ... 94.244521 46.660274 21.954110 3.409589 15.060959 2.758904 43.489041 6.321918 2007.815753 180921.195890
std 42.300571 24.284752 9981.264932 1.382997 1.112799 30.202904 20.645407 181.066207 456.098091 161.319273 ... 125.338794 66.256028 61.119149 29.317331 55.757415 40.177307 496.123024 2.703626 1.328095 79442.502883
min 20.000000 21.000000 1300.000000 1.000000 1.000000 1872.000000 1950.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 2006.000000 34900.000000
25% 20.000000 59.000000 7553.500000 5.000000 5.000000 1954.000000 1967.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 5.000000 2007.000000 129975.000000
50% 50.000000 69.000000 9478.500000 6.000000 5.000000 1973.000000 1994.000000 0.000000 383.500000 0.000000 ... 0.000000 25.000000 0.000000 0.000000 0.000000 0.000000 0.000000 6.000000 2008.000000 163000.000000
75% 70.000000 80.000000 11601.500000 7.000000 6.000000 2000.000000 2004.000000 166.000000 712.250000 0.000000 ... 168.000000 68.000000 0.000000 0.000000 0.000000 0.000000 0.000000 8.000000 2009.000000 214000.000000
max 190.000000 313.000000 215245.000000 10.000000 9.000000 2010.000000 2010.000000 1600.000000 5644.000000 1474.000000 ... 857.000000 547.000000 552.000000 508.000000 480.000000 738.000000 15500.000000 12.000000 2010.000000 755000.000000

8 rows × 37 columns

what does SalePrice look like?

import helper_functions
helper_functions.plot_salesprice(
    df,
    # ylog=True
)

Is this a normal distribution? Will that distribution influence modelling this value? How?
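One quick, minimal way to quantify this (values well above 0 indicate a right-skewed, non-normal distribution):

# skewness of the raw sale prices
print(df['SalePrice'].skew())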

Log Sales price

import numpy as np
import helper_functions

# log-transform SalePrice so its distribution is closer to normal
df['SalePrice'] = np.log(df['SalePrice'])

helper_functions.plot_salesprice(
   df,
    # ylog=True
)

Use All Features - One-hot encode Categorical Variables

# Original DataFrame dimensions (+ SalePrice)
print(f"{df.shape=}")
df.shape=(1460, 80)
# one hot encode categorical variables
import pandas as pd
numeric_variables = df.describe().columns.tolist()
nominative_variables = [x for x in df.columns.tolist() if x not in numeric_variables]

dummy_df = pd.get_dummies(df[nominative_variables])
print(dummy_df.shape)
dummy_df
(1460, 252)
MSZoning_'C (all)' MSZoning_FV MSZoning_RH MSZoning_RL MSZoning_RM Street_Grvl Street_Pave Alley_Grvl Alley_Pave LotShape_IR1 ... SaleType_ConLw SaleType_New SaleType_Oth SaleType_WD SaleCondition_Abnorml SaleCondition_AdjLand SaleCondition_Alloca SaleCondition_Family SaleCondition_Normal SaleCondition_Partial
Id
1 0 0 0 1 0 0 1 0 0 0 ... 0 0 0 1 0 0 0 0 1 0
2 0 0 0 1 0 0 1 0 0 0 ... 0 0 0 1 0 0 0 0 1 0
3 0 0 0 1 0 0 1 0 0 1 ... 0 0 0 1 0 0 0 0 1 0
4 0 0 0 1 0 0 1 0 0 1 ... 0 0 0 1 1 0 0 0 0 0
5 0 0 0 1 0 0 1 0 0 1 ... 0 0 0 1 0 0 0 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1456 0 0 0 1 0 0 1 0 0 0 ... 0 0 0 1 0 0 0 0 1 0
1457 0 0 0 1 0 0 1 0 0 0 ... 0 0 0 1 0 0 0 0 1 0
1458 0 0 0 1 0 0 1 0 0 0 ... 0 0 0 1 0 0 0 0 1 0
1459 0 0 0 1 0 0 1 0 0 0 ... 0 0 0 1 0 0 0 0 1 0
1460 0 0 0 1 0 0 1 0 0 0 ... 0 0 0 1 0 0 0 0 1 0

1460 rows × 252 columns

# merge one-hot encoded columns with numeric columns
model_df = pd.concat([df[numeric_variables], dummy_df], axis=1) #.drop('SalePrice', axis=1)

# how many total columns now?
print(model_df.shape)
model_df
(1460, 289)
MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2 ... SaleType_ConLw SaleType_New SaleType_Oth SaleType_WD SaleCondition_Abnorml SaleCondition_AdjLand SaleCondition_Alloca SaleCondition_Family SaleCondition_Normal SaleCondition_Partial
Id
1 60 65.0 8450 7 5 2003 2003 196.0 706 0 ... 0 0 0 1 0 0 0 0 1 0
2 20 80.0 9600 6 8 1976 1976 0.0 978 0 ... 0 0 0 1 0 0 0 0 1 0
3 60 68.0 11250 7 5 2001 2002 162.0 486 0 ... 0 0 0 1 0 0 0 0 1 0
4 70 60.0 9550 7 5 1915 1970 0.0 216 0 ... 0 0 0 1 1 0 0 0 0 0
5 60 84.0 14260 8 5 2000 2000 350.0 655 0 ... 0 0 0 1 0 0 0 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1456 60 62.0 7917 6 5 1999 2000 0.0 0 0 ... 0 0 0 1 0 0 0 0 1 0
1457 20 85.0 13175 6 6 1978 1988 119.0 790 163 ... 0 0 0 1 0 0 0 0 1 0
1458 70 66.0 9042 7 9 1941 2006 0.0 275 0 ... 0 0 0 1 0 0 0 0 1 0
1459 20 68.0 9717 5 6 1950 1996 0.0 49 1029 ... 0 0 0 1 0 0 0 0 1 0
1460 20 75.0 9937 5 6 1965 1965 0.0 830 290 ... 0 0 0 1 0 0 0 0 1 0

1460 rows × 289 columns

# How many numerical columns now?
model_df.describe()
MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2 ... SaleType_ConLw SaleType_New SaleType_Oth SaleType_WD SaleCondition_Abnorml SaleCondition_AdjLand SaleCondition_Alloca SaleCondition_Family SaleCondition_Normal SaleCondition_Partial
count 1460.000000 1201.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1452.000000 1460.000000 1460.000000 ... 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000
mean 56.897260 70.049958 10516.828082 6.099315 5.575342 1971.267808 1984.865753 103.685262 443.639726 46.549315 ... 0.003425 0.083562 0.002055 0.867808 0.069178 0.002740 0.008219 0.013699 0.820548 0.085616
std 42.300571 24.284752 9981.264932 1.382997 1.112799 30.202904 20.645407 181.066207 456.098091 161.319273 ... 0.058440 0.276824 0.045299 0.338815 0.253844 0.052289 0.090317 0.116277 0.383862 0.279893
min 20.000000 21.000000 1300.000000 1.000000 1.000000 1872.000000 1950.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 20.000000 59.000000 7553.500000 5.000000 5.000000 1954.000000 1967.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000
50% 50.000000 69.000000 9478.500000 6.000000 5.000000 1973.000000 1994.000000 0.000000 383.500000 0.000000 ... 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000
75% 70.000000 80.000000 11601.500000 7.000000 6.000000 2000.000000 2004.000000 166.000000 712.250000 0.000000 ... 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000
max 190.000000 313.000000 215245.000000 10.000000 9.000000 2010.000000 2010.000000 1600.000000 5644.000000 1474.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

8 rows × 289 columns

Modelling Dataset Description

  1. how many observations are there in our dataset?
  2. how many features are there in the whole dataset?
  3. how many numerical features are there in the whole dataset?

Solution

  1. 1460 observations (len(df))
  2. 79 features total (len(df.columns) - 1; SalePrice is the target, not a feature)
  3. 36 numerical features (len(df[numeric_variables].columns) - 1; again excluding SalePrice)

Modelling Feature Selection

  1. Can all of those features be used in a model?
  2. Would you want to use all of those features?

    Solution

    1. Yes, all of the features could be used, though with possible implications for the quality of the model.
    2. Features that are not (anti)correlated with the target variable may not add any useful information to the model.
    3. Features that are correlated with other features may not add much more information and may produce a poorer quality model.

Model Feature Count

  1. how many features should be used total?

    Solution

    A possible approach:

    1. n = number of observations
    2. if features are uncorrelated, the optimal feature count is roughly (n - 1)
    3. as correlation between features increases, the optimal feature count drops toward sqrt(n)
    4. assuming some correlation: sqrt(1460) ≈ 38.21, per: Optimal number of features as a function of sample size for various classification rules

    Data analysis and modeling can be very empirical

    You need to try things out to see what works. Whether or not your features are independent and identically distributed will impact how many observations are required.

    Generally, for a classification model:

    1. The distribution of features per target class matters a ton
    2. More observations mean you can use more features

Overfitting

What is model overfitting? How does a model become overfit?

Solution

An overfit model is unable to generalize - it has ‘memorized’ the training data rather than the underlying patterns in it. A telltale sign is a very low training error paired with a much higher test/holdout error (we will see this below).


Model Feature Quality

  1. which features should be used to predict the target variable? (which variables are good predictors?)

    Solution

Many possible answers here; some general ideas:

  1. those that are most correlated with the target variable
  2. those that are not correlated with each other

Build regression model to predict sales price

Plot correlations and histograms of those columns
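A minimal sketch of one way to do this, assuming the model_df built above (which now contains the log-transformed SalePrice column); the GrLivArea pick is only illustrative:

import matplotlib.pyplot as plt

# rank features by absolute correlation with (log) SalePrice
correlations = model_df.corr()['SalePrice'].drop('SalePrice')
print(correlations.abs().sort_values(ascending=False).head(10))

# histogram of one candidate feature
model_df['GrLivArea'].hist(bins=50)
plt.show()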

Reminder:

  1. What features should go in a model to predict high house price?
  2. What features are correlated with high house price?

Remove nulls from features

# which columns have the most nulls
model_df.isnull().sum().sort_values(ascending=False).head(5)
LotFrontage        259
GarageYrBlt         81
MasVnrArea           8
MSSubClass           0
BsmtExposure_Av      0
dtype: int64
# assume null means none - replace all nulls with zeros for lotFrontage and MasVnrArea
no_null_model_df = model_df
no_null_model_df['LotFrontage'] = no_null_model_df['LotFrontage'].fillna(0)
no_null_model_df['MasVnrArea'] = no_null_model_df['MasVnrArea'].fillna(0)

# GarageYrBlt 0 makes no sense - replace with mean
no_null_model_df['GarageYrBlt'] = no_null_model_df['GarageYrBlt'].fillna(no_null_model_df['GarageYrBlt'].mean())
no_null_model_df.isnull().sum().sort_values(ascending=False).head(5)
MSSubClass            0
Exterior1st_Stucco    0
BsmtFinType1_GLQ      0
BsmtFinType1_BLQ      0
BsmtFinType1_ALQ      0
dtype: int64

separate features from target

# define features and target
features = no_null_model_df.drop('SalePrice', axis=1)
features
target = no_null_model_df['SalePrice']
# confirm features do not contain target
[x for x in features.columns if x == 'SalePrice']
[]

Establish Model Performance Baseline

How well does always guessing the mean do in terms of RMSE?

from math import sqrt

mean_sale_price = model_df.SalePrice.mean()
print(f"{mean_sale_price=:.2f}")

diffs = model_df.SalePrice - mean_sale_price
rse = (diffs * diffs).apply(sqrt)
baseline_rmse = rse.mean()
print(f'baseline rmse: {baseline_rmse:.2f}')
mean_sale_price=12.02
baseline rmse: 0.31
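Note that the cell above averages the absolute deviations from the mean, which is strictly a mean absolute error; a textbook RMSE squares the differences first, averages them, and then takes the square root. A minimal sketch of the stricter calculation, reusing diffs from above (the value will differ somewhat from the printout):

# strict RMSE of always guessing the mean
strict_baseline_rmse = sqrt((diffs * diffs).mean())
print(f'strict baseline rmse: {strict_baseline_rmse:.2f}')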

Define function to fit and assess a Linear model

from collections import defaultdict
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split, KFold
import matplotlib.pyplot as plt
from typing import Tuple
from math import sqrt
import numpy as np
import pandas as pd


def run_linear_regression_with_kf(features: pd.DataFrame, labels: pd.Series,
                                    n_splits=5, title='linear regression model'
                                   ) -> Tuple[float,float,float,float]:
    """
    Scale, split, and model the data with K-fold cross-validation. Return model
    performance statistics and plot true vs. predicted values.
    features: dataframe of feature columns to model
    labels: series of labels (target values) to model against
    n_splits: number of folds to use during cross-validation
    title: title for chart
    return: stats (r2 mean, rmse mean, r2 sd, rmse sd), train/test rmse errors,
            model data and predictions, figure and axis
    """
    # set up splits/folds and array for stats.
    kf = KFold(n_splits=n_splits)
    r2s = np.zeros(n_splits)
    rmses = np.zeros(n_splits)
    train_rmses = np.zeros(n_splits)

    # fit model for each split/fold
    for i, (train_idx, test_idx) in enumerate(kf.split(X=features, y=labels)):
        # split features data for dataframes
        try:
            X_train = features.iloc[train_idx]
            y_train = labels.iloc[train_idx]
            X_test = features.iloc[test_idx]
            y_test = labels.iloc[test_idx]

        # or split features data for ndarrays (pca transformed features)
        except AttributeError:
            X_train = features[train_idx]
            y_train = labels.iloc[train_idx]
            X_test = features[test_idx]
            y_test = labels.iloc[test_idx]


        # scale all features to training features
        scaler = preprocessing.StandardScaler().fit(X_train)
        X_train = scaler.transform(X_train)
        X_test = scaler.transform(X_test)

        # fit model, evaluate
        regr = LinearRegression().fit(X_train, y_train)
        y_pred = regr.predict(X_test)
        r2s[i] = r2_score(y_test, y_pred)
        rmses[i] = sqrt(mean_squared_error(y_test, y_pred))
        y_pred_train = regr.predict(X_train)
        train_rmses[i] = sqrt(mean_squared_error(y_train, y_pred_train))

    r2_mean = r2s.mean()
    r2_sd = r2s.std()
    rmse_mean = rmses.mean()
    rmse_sd = rmses.std()
    train_rmse_mean = train_rmses.mean()
    train_rmse_sd = train_rmses.std()

    # plot y_true vs y_pred
    fig, ax = plt.subplots(1, 1, figsize=(6, 6))
    ax.scatter(y_test, y_pred, alpha=0.3)
    ax.set_title(f'{title}\n' \
                 'mean r2: {:.2f},\n'\
                 'mean rmse {:.2f}'
                 .format(r2_mean, rmse_mean)
    )
    ax.set_xlabel('True Value')
    ax.set_ylabel('Predicted Value')
    stats = (r2_mean, rmse_mean, r2_sd, rmse_sd)
    train_test_errors = (rmse_mean, rmse_sd, train_rmse_mean, train_rmse_sd)
    model_data_and_pred = (regr, X_train, y_train, X_test, y_test, y_pred)
    fig_and_ax = (fig, ax)

    return stats, train_test_errors, model_data_and_pred, fig_and_ax



fit a linear model with all features

# set kfold splits
n_splits = 3
# keep all model stats in one dict
all_stats = {}

#r2_mean, rmse_mean, r2_sd, rmse_sd, regr, fig, ax =
plt.ion()
stats, train_test_errors, model_data_and_pred, fig_and_ax = run_linear_regression_with_kf(features=features, labels=target, n_splits=n_splits, title='all features')
all_stats['all'] = stats

The model has difficulty making predictions for some observations, producing several very large outliers.

# plot ignoring outliers
fig, ax = fig_and_ax
ax.set_ylim(10.5, 14)
fig

Overfitting - bias and variance

# the model is overfit
rmse_mean, rmse_sd, train_rmse_mean, train_rmse_sd = train_test_errors
print(f'test rmse ± sd: \t {rmse_mean} ± {rmse_sd}')
print(f'train rmse ± sd:\t {train_rmse_mean} ± {train_rmse_sd}')

test rmse ± sd: 	 27282372629.48491 ± 28180380572.225414
train rmse ± sd:	 0.08863829483269596 ± 0.00626317568881641

PCA all features to 100 dimensions

from sklearn.decomposition import PCA
n_components = 100
p = PCA(n_components=n_components )
features_pca = p.fit_transform(features)
stats, train_test_errors, model_data_and_pred, fig_and_ax = run_linear_regression_with_kf(features=features_pca, labels=target,
                                                title=f'{n_components} Principal Components', n_splits=n_splits)

all_stats['all_pca'] = stats
# fewer dimensions - higher bias error?

overfitting?

# the model is NOT overfit
rmse_mean, rmse_sd, train_rmse_mean, train_rmse_sd = train_test_errors
print(f'test rmse ± sd: \t {rmse_mean} ± {rmse_sd}')
print(f'train rmse ± sd:\t {train_rmse_mean} ± {train_rmse_sd}')
test rmse ± sd: 	 0.14793978100161198 ± 0.012474628694285275
train rmse ± sd:	 0.12223878649819635 ± 0.005870162810591587

Model Comparison

# create combined stats df
stats_df = pd.DataFrame.from_dict(all_stats).set_index(
    pd.Index(['r2_mean', 'rmse_mean', 'r2_sd', 'rmse_sd'], name='statistics')
)

# plot figures
fig, axs = plt.subplots(1, 2, figsize=(10, 5))
stats_df.loc['r2_mean'].plot(ax=axs[0], kind='bar', yerr=stats_df.loc['r2_sd'], title='mean r2',  color='lightblue', ylim=(-stats_df.loc['r2_mean']['all_pca']/2, stats_df.loc['r2_mean']['all_pca']*2))
stats_df.loc['rmse_mean'].plot(ax=axs[1], kind='bar', yerr=stats_df.loc['rmse_sd'], title=f'mean RMSE',  color='orange', ylim=(0, stats_df.loc['rmse_mean']['all_pca']*3))

# plot baseline - guess mean every time RMSE
xmin, xmax = plt.xlim()
axs[1].hlines(baseline_rmse, xmin=xmin, xmax=xmax)
axs[1].text(xmax/3, baseline_rmse + baseline_rmse*0.05, 'Baseline RMSE')

# title and show
plt.suptitle(f'model statistics\nerrbars=sd n={n_splits}')
plt.show()

Fit a PCA model with fewer dimensions.

What do you think the outcome will be?

Solution

Increasing the number of variables (components) available to the model reduces bias error. The tail ends of distributions can have high predictive power - a small amount of variance can be impactful - so reducing the dimensionality too aggressively can hurt performance.
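A minimal sketch of how you might try this, reusing the objects defined above (the choice of 10 components is arbitrary):

# PCA down to 10 components, then fit the same linear model
n_components_small = 10
p_small = PCA(n_components=n_components_small)
features_pca_small = p_small.fit_transform(features)
stats, train_test_errors, model_data_and_pred, fig_and_ax = run_linear_regression_with_kf(
    features=features_pca_small, labels=target,
    title=f'{n_components_small} Principal Components', n_splits=n_splits)
all_stats['all_pca_small'] = stats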

What Is Going On?

Intuition:

PCA is a way to rotate the axes of your dataset around the data so that the axes line up with the directions of the greatest variation through the data.

reviewed

  1. explored the Ames housing dataset
  2. looked for variables that would correlate with/be good predictors for housing prices
  3. indicated that PCA might be a way to approach this problem

We’ll go into more detail on PCA in the next episode

Key Points


Unpacking PCA

Overview

Teaching: 45 min
Exercises: 2 min
Questions
  • What is the intuition behind how PCA works?

Objectives
  • explain the overall process of PCA

  • explain the result of a PCA operation

  • define a principal component

  • define dimensionality reduction

  • explain how PCA can be used as a dimensionality reduction technique

PCA part 2

What Just Happened !?

Intuition:

PCA is a way to rotate the axes of your dataset around the data so that the axes line up with the directions of the greatest variation through the data.

2d PCA example - get orientation/intuition.

  1. show two variables
  2. show PCA result
  3. intuition is rotated axes
  4. each new axis is now a weighted combination (a percentage) of the original axes
  5. you can scale that to n dimensions.

The detailed mechanics of PCA are kept to a minimum below; the five steps above carry the main intuition.

relationship between two variables x and y

# Here is a randomly generated pair of correlated features; re-run until you like the result:
from helper_functions import create_feature_scatter_plot
feature_ax, features, axlims = create_feature_scatter_plot(random_state=13)

Reminder: correlated features lead to multicollinearity issues

A VIF score above 10 is a common rule of thumb indicating that you can’t meaningfully interpret the model’s estimated coefficients/p-values due to confounding (multicollinearity) effects.

from statsmodels.stats.outliers_influence import variance_inflation_factor
import pandas as pd

def calc_print_VIF(X):
    # Calculate VIF for each predictor in X
    vif = pd.DataFrame()
    vif["Variable"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

    # Display the VIF values
    print(vif)
    
calc_print_VIF(pd.DataFrame(features))
   Variable        VIF
0         0  45.174497
1         1  45.174497

PCA of those two variables

from sklearn.decomposition import PCA
p = PCA(n_components=2)  # instantiate PCA transform
features_pca = p.fit_transform(features)  # perform PCA and re-project data 
calc_print_VIF(pd.DataFrame(features_pca))
   Variable  VIF
0         0  1.0
1         1  1.0

plot PCA result

import matplotlib.pyplot as plt
from helper_functions import plot_pca_features
fig, ax, (ax_min, ax_max) = plot_pca_features(features_pca)
plt.show()

original and PCA result comparison

from helper_functions import plot_pca_feature_comparison

point_to_highlight=10
plot_pca_feature_comparison(features, features_pca, ax_max, ax_min, p, point_to_highlight)
plt.show()

The process of PCA is analogous to walking around your data and looking at it from a new angle

from IPython.display import YouTubeVideo
YouTubeVideo("BorcaCtjmog")

3d data example

https://setosa.io/ev/principal-component-analysis/

This rotation of the axes means that the new PCA axes are made up of proportions (weighted combinations) of the old axes.

What are those proportions?

The pca “components_” property: the eigenvectors of each principal component

from helper_functions import show_pcs_on_unit_axes

show_pcs_on_unit_axes(p)
plt.show()
    
for i, pc in enumerate(p.components_):
    print(f"PC{i}: {pc}")
    
PC0: [ 0.70710678 -0.70710678]
PC1: [0.70710678 0.70710678]

demonstrate transform of one point from original feature space to PC-space

fmt_str = "{:.2f}, {:.2f}"
print("START in feature space:")
print(fmt_str.format(features[point_to_highlight,0],features[point_to_highlight,1]))
print()
print("END: in pca space:")
print(fmt_str.format(features_pca[point_to_highlight,0], features_pca[point_to_highlight,1]))

START in feature space:
1.20, -1.29

END: in pca space:
1.76, -0.07

step 1: scale the feature-space data

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
features_scaled = scaler.fit(features).transform(features)

print("scaled feature space:")
print("{:.2f}, {:.2f}".format(
    features_scaled[point_to_highlight, 0], 
    features_scaled[point_to_highlight, 1])
    )
scaled feature space:
1.20, -1.29

step 2: take the dot product of the feature-space values and the principal component eigenvectors

# use both x and y coords of original point here as new pc0-coord is combination of both axes!
print("{:.2f}".format( 
    # first dimension of example point * first dimension of PC0 eigenvector
    features_scaled[point_to_highlight, 0] * p.components_[0,0] 
    +
    # second dimension of example point * second dimension of PC0 eigenvector
    features_scaled[point_to_highlight, 1] * p.components_[0,1]
    )
)
1.76
# Again: use both x and y coords of original point here as new pc1-coord is combination of both axes!
print("{:.2f}".format( 
    # first dimension of example point * first dimension of PC1 eigenvector
    features_scaled[point_to_highlight, 0] * p.components_[1,0] 
    +
    # second dimension of example point * second dimension of PC1 eigenvector
    features_scaled[point_to_highlight, 1] * p.components_[1,1]
    )
)
-0.07

This is called a dimensionality REDUCTION technique because the first component now explains most of the variability in your data, so you can keep the first few components and discard the rest with little loss of information.

fig, ax = plt.subplots()
x = ["pc0", "pc1"]
ax.bar(x, p.explained_variance_ratio_)
for i in range(len(x)):
    plt.text(x=i, y=p.explained_variance_ratio_[i], 
             s="{:.3f}".format(p.explained_variance_ratio_[i]), 
             ha="center")
ax.set_title("explained variance ratio")
plt.show()

Key Points

  • PCA transforms your original data by projecting it into new axes

  • principal components are orthogonal vectors through your data that explain the most variability


Regularization methods - lasso, ridge, and elastic net

Overview

Teaching: 45 min
Exercises: 2 min
Questions
  • How can LASSO regularization be used as a feature selection method?

Objectives

Introduction to the LASSO model in high-dimensional data analysis

In the realm of high-dimensional data analysis, where the number of predictors begins to approach or exceed the number of observations, traditional regression methods can become challenging to implement and interpret. The Least Absolute Shrinkage and Selection Operator (LASSO) offers a powerful solution to address the complexities of high-dimensional datasets. This technique, introduced by Robert Tibshirani in 1996, has gained immense popularity due to its ability to provide both effective prediction and feature selection.

The LASSO model is a regularization technique designed to combat overfitting by adding a penalty term to the regression equation. The essence of the LASSO lies in its ability to shrink the coefficients of less relevant predictors towards zero, effectively “shrinking” them out of the model. This not only enhances model interpretability by identifying the most important predictors but also reduces the risk of multicollinearity and improves predictive accuracy.

LASSO’s impact on high-dimensional data analysis is profound. It provides several benefits: automatic feature selection, more interpretable models, reduced multicollinearity, and often improved predictive accuracy.

The L1 penalty

The key concept behind the LASSO is its use of the L1 penalty, which is defined as the sum of the absolute values of the coefficients (parameters) of the model, multiplied by a regularization parameter (usually denoted as λ or alpha).

In the context of linear regression, the L1 penalty can be incorporated into the ordinary least squares (OLS) loss function as follows:

$$\hat{\beta}^{lasso} = \underset{\beta}{\arg\min} \left\{ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}$$

Where:

  • $y_i$ is the observed target value for observation $i$
  • $x_{ij}$ is the value of predictor $j$ for observation $i$
  • $\beta_0$ and $\beta_j$ are the intercept and coefficients being estimated
  • $\lambda$ (called alpha in scikit-learn) is the regularization parameter that controls the strength of the penalty

The L1 penalty has a unique property that it promotes sparsity. This means that it encourages some coefficients to be exactly zero, effectively performing feature selection. In contrast to the L2 penalty (Ridge penalty), which squares the coefficients and promotes small but non-zero values, the L1 penalty tends to lead to sparse solutions where only a subset of predictors are chosen. As a result, the LASSO automatically performs feature selection, which is especially advantageous when dealing with high-dimensional datasets where many predictors may have negligible effects on the outcome.
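A tiny illustration of this sparsity property - a sketch on synthetic data, separate from the housing analysis:

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 20))                 # 50 observations, 20 predictors
y_toy = 2.0 * X_toy[:, 0] + rng.normal(size=50)   # only the first predictor matters

lasso_coefs = Lasso(alpha=0.1).fit(X_toy, y_toy).coef_
ridge_coefs = Ridge(alpha=0.1).fit(X_toy, y_toy).coef_

# LASSO typically zeroes out many of the irrelevant coefficients; ridge does not
print('exactly-zero coefficients (LASSO):', int(np.sum(lasso_coefs == 0)))
print('exactly-zero coefficients (ridge):', int(np.sum(ridge_coefs == 0)))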

Compare full-dim and LASSO results

Load the full-dimensional, z-scored data

We’ll use most of the data for the test set so that this dataset’s dimensionality begins to approach the number of observations. Regularization techniques such as LASSO tend to shine when working in this context. If you have plenty of data to estimate each coefficient, you will typically find that an unregularized model performs better.

from preprocessing import prep_fulldim_zdata
X_train_z, X_holdout_z, y_train, y_holdout, y = prep_fulldim_zdata(const_thresh= 95, test_size=.95, y_log_scaled=True)

from sklearn.model_selection import train_test_split
X_val_z, X_test_z, y_val, y_test = train_test_split(X_holdout_z, y_holdout, test_size=0.5, random_state=0)
X_train_z.head()
133 columns removed, 82 remaining.
Columns removed: ['MasVnrArea', 'LowQualFinSF', 'GarageYrBlt', '3SsnPorch', 'LotFrontage', 'KitchenAbvGr', 'PoolArea', 'Condition1_Artery', 'Condition1_PosA', 'Condition1_PosN', 'Condition1_RRAe', 'Condition1_RRAn', 'Condition1_RRNe', 'Condition1_RRNn', "Exterior2nd_'Brk Cmn'", "Exterior2nd_'Wd Shng'", 'Exterior2nd_AsbShng', 'Exterior2nd_AsphShn', 'Exterior2nd_BrkFace', 'Exterior2nd_CBlock', 'Exterior2nd_CmentBd', 'Exterior2nd_ImStucc', 'Exterior2nd_Other', 'Exterior2nd_Stone', 'Exterior2nd_Stucco', 'LotConfig_FR2', 'LotConfig_FR3', 'Alley_Grvl', 'Alley_Pave', 'RoofMatl_ClyTile', 'RoofMatl_CompShg', 'RoofMatl_Membran', 'RoofMatl_Metal', 'RoofMatl_Roll', 'RoofMatl_Tar&Grv', 'RoofMatl_WdShake', 'RoofMatl_WdShngl', 'Heating_Floor', 'Heating_GasA', 'Heating_GasW', 'Heating_Grav', 'Heating_OthW', 'Heating_Wall', 'MasVnrType_BrkCmn', 'Condition2_Artery', 'Condition2_Feedr', 'Condition2_Norm', 'Condition2_PosA', 'Condition2_PosN', 'Condition2_RRAe', 'Condition2_RRAn', 'Condition2_RRNn', 'HouseStyle_1.5Unf', 'HouseStyle_2.5Fin', 'HouseStyle_2.5Unf', 'HouseStyle_SFoyer', 'HouseStyle_SLvl', 'GarageType_2Types', 'GarageType_Basment', 'GarageType_CarPort', "MSZoning_'C (all)'", 'MSZoning_FV', 'MSZoning_RH', 'RoofStyle_Flat', 'RoofStyle_Gambrel', 'RoofStyle_Mansard', 'RoofStyle_Shed', 'Utilities_AllPub', 'Utilities_NoSeWa', 'SaleCondition_AdjLand', 'SaleCondition_Alloca', 'SaleCondition_Family', 'Neighborhood_Blmngtn', 'Neighborhood_Blueste', 'Neighborhood_BrDale', 'Neighborhood_BrkSide', 'Neighborhood_ClearCr', 'Neighborhood_Crawfor', 'Neighborhood_IDOTRR', 'Neighborhood_MeadowV', 'Neighborhood_Mitchel', 'Neighborhood_NPkVill', 'Neighborhood_NoRidge', 'Neighborhood_SWISU', 'Neighborhood_SawyerW', 'Neighborhood_StoneBr', 'Neighborhood_Timber', 'Neighborhood_Veenker', 'MiscFeature_Gar2', 'MiscFeature_Othr', 'MiscFeature_Shed', 'MiscFeature_TenC', 'BldgType_2fmCon', 'BldgType_Duplex', 'BldgType_Twnhs', 'SaleType_COD', 'SaleType_CWD', 'SaleType_Con', 'SaleType_ConLD', 'SaleType_ConLI', 'SaleType_ConLw', 'SaleType_Oth', 'Foundation_Slab', 'Foundation_Stone', 'Foundation_Wood', 'Electrical_FuseF', 'Electrical_FuseP', 'Electrical_Mix', 'LandContour_Bnk', 'LandContour_HLS', 'LandContour_Low', 'MSSubClass_30', 'MSSubClass_40', 'MSSubClass_45', 'MSSubClass_70', 'MSSubClass_75', 'MSSubClass_80', 'MSSubClass_85', 'MSSubClass_90', 'MSSubClass_160', 'MSSubClass_180', 'MSSubClass_190', 'Exterior1st_AsbShng', 'Exterior1st_AsphShn', 'Exterior1st_BrkComm', 'Exterior1st_BrkFace', 'Exterior1st_CBlock', 'Exterior1st_CemntBd', 'Exterior1st_ImStucc', 'Exterior1st_Stone', 'Exterior1st_Stucco', 'Exterior1st_WdShing', 'Street']
BsmtFinSF1 GarageCars 1stFlrSF YrSold EnclosedPorch WoodDeckSF 2ndFlrSF OpenPorchSF BsmtHalfBath GrLivArea ... MSSubClass_20 MSSubClass_50 MSSubClass_60 MSSubClass_120 Exterior1st_'Wd Sdng' Exterior1st_HdBoard Exterior1st_MetalSd Exterior1st_Plywood Exterior1st_VinylSd CentralAir
633 0.075477 -1.072050 -0.173045 -0.495091 -0.251998 2.007070 -0.849556 -0.694796 4.124766 -0.961978 ... 1.223298 -0.117041 -0.589094 -0.239117 2.357786 -0.372423 -0.462275 -0.323431 -0.85322 0.166683
908 -0.352031 0.350853 -0.589792 -1.203750 -0.251998 0.434247 -0.849556 -0.694796 -0.239117 -1.313923 ... 1.223298 -0.117041 -0.589094 -0.239117 -0.418317 2.648339 -0.462275 -0.323431 -0.85322 0.166683
611 0.374016 0.350853 -0.237993 -0.495091 -0.251998 -0.707093 -0.849556 -0.694796 4.124766 -1.016827 ... -0.806264 -0.117041 -0.589094 -0.239117 -0.418317 2.648339 -0.462275 -0.323431 -0.85322 0.166683
398 -1.070913 -1.072050 -0.116216 -0.495091 -0.251998 -0.707093 -0.849556 -0.694796 -0.239117 -0.913986 ... -0.806264 -0.117041 -0.589094 -0.239117 -0.418317 -0.372423 2.133579 -0.323431 -0.85322 0.166683
91 0.362075 0.350853 0.311355 -1.203750 -0.251998 -0.707093 -0.849556 -0.694796 -0.239117 -0.552900 ... 1.223298 -0.117041 -0.589094 -0.239117 -0.418317 2.648339 -0.462275 -0.323431 -0.85322 0.166683

5 rows × 82 columns

print(X_train_z.shape)
(73, 82)

Intro to LassoCV

LassoCV in scikit-learn performs cross-validation to find the best alpha value (the λ in the traditional LASSO equation) from the specified list of alphas. It does this by fitting a separate LASSO regression for each alpha value on each fold of the cross-validation. The alphas parameter determines the values of alpha to be tested.

Cross-validation is a method used to assess the performance and generalizability of a machine learning model. It involves partitioning the dataset into multiple subsets, training the model on some of these subsets, and then evaluating its performance on the remaining subset(s).

For example, in 5-fold cross-validation, you split your data into 5 partitions (“folds”). You then train the model on 4 of the folds, leaving one out for validation. You do this 5 times — training/validating on different folds each time.
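As a minimal illustration of the splitting itself, using the training data loaded above:

from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kf.split(X_train_z)):
    print(f"fold {fold}: train on {len(train_idx)} rows, validate on {len(val_idx)} rows")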

The LassoCV() function returns the best model that was determined based on cross-validation performance. This best model’s coefficients can be accessed using the .coef_ attribute, and the optimal alpha can be accessed using the .alpha_ attribute.

from sklearn.linear_model import LassoCV
# help(LassoCV)

Specify range of alphas to evaluate

Specify a range of alpha values. Typically, small alphas work well. However, you don’t want to be so close to zero that you get no benefits from regularization (i.e., none of the coefs shrink to zero).

import numpy as np
alphas = np.logspace(-4, 1, 100) # The alphas you want to evaluate during cross-validation
alphas_with_zero = np.insert(alphas, 0, 0) # The unregularized model
print(alphas_with_zero[0:10])

max_iter = 100000 # This is the maximum number of iterations for which we want the model to run if it doesn’t converge before reaching this number. The default value is 1000

cv = 5 # number of folds to use during cross-validation
[0.         0.0001     0.00011233 0.00012619 0.00014175 0.00015923
 0.00017886 0.00020092 0.0002257  0.00025354]

Call LassoCV

lasso_cv = LassoCV(alphas=alphas, cv=cv, max_iter=max_iter, random_state=0)

Randomness in LassoCV

LassoCV uses coordinate descent to solve a convex optimization problem, which means it converges to a global optimum.

However, during coordinate descent, when multiple features are highly correlated, the algorithm can choose any one of them to update at each iteration. This can lead to some randomness in the selection of features and the order in which they are updated. While coordinate descent itself is deterministic, the order in which the correlated features are selected can introduce variability.

The random state argument in LassoCV allows you to control this randomness by setting a specific random seed. This can be helpful for reproducibility when working with models that involve correlated features. By specifying a random seed, you ensure that the same features will be chosen in the same order across different runs of the algorithm, making the results more predictable and reproducible.

In summary, while coordinate descent solves a convex problem, the random_state argument in LassoCV helps manage the potential randomness introduced by the selection of correlated features during the optimization process.

# fit the model
best_lasso_model = lasso_cv.fit(X_train_z, y_train)

Print the best alpha value selected.

best_lasso_model.alpha_
0.0093260334688322

Print the coefs

coefs = best_lasso_model.coef_
coefs
array([ 0.        ,  0.06596183,  0.        , -0.        ,  0.00999046,
        0.02956164,  0.        ,  0.02594706,  0.        ,  0.08300968,
       -0.        ,  0.03157657,  0.06126806,  0.01329162,  0.03171758,
       -0.01378301,  0.        ,  0.        ,  0.        ,  0.04026495,
       -0.        ,  0.0609676 ,  0.        ,  0.00054206,  0.00410852,
        0.00141018,  0.        , -0.        ,  0.01709261, -0.        ,
       -0.        ,  0.        ,  0.00041337,  0.02716718,  0.        ,
        0.        , -0.        ,  0.00381712, -0.        , -0.        ,
        0.        , -0.        , -0.        ,  0.        , -0.        ,
        0.        ,  0.        , -0.04891258,  0.00014635, -0.        ,
       -0.02633333,  0.        ,  0.        ,  0.        , -0.        ,
       -0.        , -0.        ,  0.        ,  0.02453974,  0.02236798,
       -0.02465517,  0.        ,  0.        ,  0.00378062,  0.        ,
        0.        , -0.00136852,  0.        ,  0.00914042, -0.        ,
        0.        , -0.        ,  0.        ,  0.        ,  0.        ,
        0.        , -0.00601784, -0.        ,  0.        ,  0.        ,
        0.        ,  0.01252988])

Assess sparsity

percent_sparsity = np.mean(coefs == 0) * 100
percent_sparsity
63.41463414634146

Evaluate model error

from regression_predict_sklearn import measure_model_err
# to calculate residuals and R-squared for the test set, we'll need to get the model predictions first
y_pred_train = best_lasso_model.predict(X_train_z)
y_pred_test = best_lasso_model.predict(X_test_z)
errors_df = measure_model_err(y, np.mean(y),
                              y_train, y_pred_train,
                              y_test, y_pred_test,
                              'RMSE', y_log_scaled=True) 
errors_df.head()
Baseline Error Train Error Test Error
0 79415.291886 13763.876079 56009.758064

Use fit_eval_model() to quickly compare models

from regression_predict_sklearn import fit_eval_model

# Full-dim model
fulldim_model, error_df = fit_eval_model(y=y, baseline_pred=y.mean(),
               X_train=X_train_z, y_train=y_train,
               X_test=X_test_z, y_test=y_test, 
               predictors=X_train_z.columns,
               metric='RMSE',
               y_log_scaled=True,
               model_type='unregularized',
               include_plots=True, plot_raw=True, verbose=True)

# LASSO
import numpy as np
best_lasso_model, error_df = fit_eval_model(y=y, baseline_pred=y.mean(),
                                         X_train=X_train_z, y_train=y_train,
                                         X_test=X_test_z, y_test=y_test, 
                                         predictors=X_train_z.columns,
                                         metric='RMSE',
                                         y_log_scaled=True,
                                         model_type='LassoCV', alphas=alphas, cv=cv, max_iter=max_iter,
                                         include_plots=True, plot_raw=True, verbose=True)
# of predictor vars = 82
# of train observations = 73
# of test observations = 694
Baseline RMSE = 79415.29188606751
Train RMSE = 2.905594915676157e-10
Holdout RMSE = 2357924.397903089
(Holdout-Train)/Train: 811511744180753792%









# of predictor vars = 82
# of train observations = 73
# of test observations = 694
Baseline RMSE = 79415.29188606751
Train RMSE = 13763.876078966621
Holdout RMSE = 56009.758063855006
(Holdout-Train)/Train: 307%

Investigating the alpha hyperparameter

Here, we will take a closer look at the alpha hyperparameter to see which values performed well.

We will see how different values of alpha impact: (1) training and validation error, and (2) model sparsity (the percentage of coefficients shrunk to exactly zero).

Unfortunately for us, the LassoCV function in scikit-learn doesn’t store the train errors for each model evaluated. It’s primarily designed to perform cross-validation to find the best alpha value for Lasso regression regularization. For our purposes, we will have to manually run Lasso for each alpha value.

Typically when comparing alpha values, you’re in the model selection stage of your analysis. This means that you need to use the validation set to evaluate performance. LassoCV does this implicitly by using portions of the training data as validation data. Here, we’ll use the explicit validation set we defined at the beginning of this episode.

import matplotlib.pyplot as plt
# Calculate the corresponding training scores for each alpha and fold
from sklearn.linear_model import Lasso
train_scores = []
val_scores = []

perc_sparse_vals = []
for alpha in alphas:
    lasso = Lasso(alpha=alpha, max_iter=max_iter)
    lasso.fit(X_train_z, y_train)
    coefs = lasso.coef_
    percent_sparse = np.mean(coefs == 0) * 100
    train_pred = lasso.predict(X_train_z)
    val_pred = lasso.predict(X_val_z)
    train_scores.append(np.mean((train_pred - y_train) ** 2))
    val_scores.append(np.mean((val_pred - y_val) ** 2))
    perc_sparse_vals.append(percent_sparse)

# Create a grid of subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Plot the training and validation scores
ax1.plot(np.log10(alphas), train_scores, label='Train', marker='o')
ax1.plot(np.log10(alphas), val_scores, label='Validation (CV)', marker='o')
ax1.set_xlabel('log(alpha)')
ax1.set_ylabel('Mean Squared Error')
ax1.set_title('LassoCV Train/Validation Scores')
ax1.legend()
ax1.grid(True)

# Plot percent sparsity
ax2.plot(np.log10(alphas), perc_sparse_vals, label='Percent Sparsity', marker='o', color='orange')
ax2.set_xlabel('log(alpha)')
ax2.set_ylabel('Percent Sparsity')
ax2.set_title('Percent Sparsity vs. alpha')
ax2.grid(True)

# Adjust layout and display the plots
plt.tight_layout()
plt.savefig('..//fig//regression//regularize//alpha_cv_results.png', bbox_inches='tight', dpi=300, facecolor='white');

LASSO as a feature selection method

A well-optimized LASSO model can be used as a data-driven feature selection method. One thing to be mindful of: given two correlated features, LASSO may somewhat arbitrarily select one feature’s coefficient to shrink to zero, which can sometimes lead to less desirable model specifications. If you have one or more features that you want to ensure are included in the model, you can follow this procedure (a sketch is shown after the list):

  1. Fit the LassoCV model to determine the optimal features to keep in the model, feats_keep
  2. Extend the list of feats_keep to include other features/predictors that are important to your research questions
  3. Fit a standard unregularized model using feats_keep as your predictors
  4. Take care to check the assumptions of the model, especially if you’re adding features after selecting them with LASSO. LASSO helps prevent multicollinearity issues, but adding features back in could re-introduce multicollinearity
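A minimal sketch of that procedure, assuming the best_lasso_model, X_train_z, and y_train objects defined earlier in this episode; the “must-keep” feature named here is only an illustrative choice:

from sklearn.linear_model import LinearRegression

# 1. features the LASSO model kept (nonzero coefficients)
feats_keep = X_train_z.columns[best_lasso_model.coef_ != 0].tolist()

# 2. extend with any predictors important to your research question (illustrative pick)
if 'YearBuilt' in X_train_z.columns and 'YearBuilt' not in feats_keep:
    feats_keep.append('YearBuilt')

# 3. fit a standard unregularized model using feats_keep as predictors
unreg_model = LinearRegression().fit(X_train_z[feats_keep], y_train)

# 4. re-check assumptions (e.g., VIF scores) before interpreting the coefficients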

Let’s see which features our best LASSO model decided to keep.

from interpret_model import coef_plot

# Get coefficient matrix
coef_matrix = best_lasso_model.coef_

fig = coef_plot(coefs=coef_matrix, plot_const=False, index=X_train_z.columns) 
fig.set_size_inches(6, 8)

Introduction to the ridge model in high-dimensional data analysis

Ridge regression introduces an L2 penalty term to the linear regression equation, which involves adding the squared magnitudes of the coefficients to the loss function. This regularization technique aims to minimize the sum of squared residuals while also constraining the coefficients. Some key points about Ridge regression: it shrinks coefficients toward (but not exactly to) zero, it keeps every predictor in the model, and it is particularly helpful when predictors are highly correlated with one another (multicollinearity).

In summary, Ridge regression and Lasso are regularization techniques that address high-dimensional data analysis challenges in distinct ways. Ridge regression is effective for shrinking coefficients to small values, but keeping all predictors in the model. Lasso excels in situations where feature selection and interpretability are essential. The choice between Ridge regression and Lasso depends on the specific goals of your analysis and the nature of your dataset.

The L2 penalty

The fundamental principle underlying ridge regression is its utilization of the L2 penalty, which involves the summation of the squared values of the coefficients (parameters) of the model, multiplied by a regularization parameter (often represented as λ or alpha).

In the realm of linear regression, the L2 penalty can be integrated into the ordinary least squares (OLS) loss function in the following manner:

$$\hat{\beta}^{ridge} = \underset{\beta}{\arg\min} \left\{ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}$$

from sklearn.linear_model import LinearRegression, RidgeCV

ridge_cv = RidgeCV(alphas=alphas, store_cv_values=True)  # store_cv_values=True to enable cv_values_

# Fit the RidgeCV model
ridge_cv.fit(X_train_z, y_train)

# Print the chosen alpha and corresponding R-squared score
print("Chosen Alpha:", ridge_cv.alpha_)
print("R-squared Score:", ridge_cv.score(X_test_z, y_test))
Chosen Alpha: 10.0
R-squared Score: 0.8108420609911078
# Full-dimensional linear regression
full_dim_model = LinearRegression()
full_dim_model.fit(X_train_z, y_train)
full_dim_r2 = full_dim_model.score(X_test_z, y_test)
print("R-squared Score:", full_dim_r2)
R-squared Score: -7.321193011831792
from interpret_model import coef_plot

# Get coefficient matrix
coef_matrix = ridge_cv.coef_

fig = coef_plot(coefs=coef_matrix, plot_const=False, index=X_train_z.columns) 
fig.set_size_inches(6, 15)

Elastic Net

Elastic Net is a regularization technique used in linear regression models that combines both L1 (LASSO) and L2 (ridge) penalties. It aims to address some limitations of individual LASSO and ridge regression by providing a balance between feature selection (LASSO’s sparsity) and coefficient shrinkage (ridge’s stability).

The Elastic Net regularization term can be represented as a combination of the L1 and L2 penalties, and it depends on two hyperparameters: α (alpha) and λ (lambda). The α parameter controls the balance between the L1 and L2 penalties. When α is 0, Elastic Net behaves like ridge regression, and when α is 1, it behaves like LASSO. Note that in scikit-learn’s ElasticNetCV, this mixing parameter is called l1_ratio (l1_ratio=0 gives ridge-like behaviour, l1_ratio=1 gives LASSO-like behaviour), while alpha controls the overall strength of the penalty.

To implement Elastic Net in Python, you can use libraries like scikit-learn. Here’s how you can do it:

from sklearn.linear_model import ElasticNetCV

# Initialize the ElasticNetCV model
alphas = [0.1, 1.0, 10.0]  # List of alpha values to search over
l1_ratios = [0.1, 0.5, 0.9]  # List of l1_ratio values to search over
elastic_net_cv = ElasticNetCV(alphas=alphas, l1_ratio=l1_ratios, cv=5)
elastic_net_cv.fit(X_train_z, y_train)  # fit on the z-scored training data

Key Points


Exploring additional datasets

Overview

Teaching: 45 min
Exercises: 2 min
Questions
  • How can we use everything we have learned to analyze a new dataset?

Objectives

Preprocessing

Note: Adapt get_feat_types() and encode_predictors_housing_data() for your data by creating new functions with slightly different names.

# get feat types - you'll need to create a similar function for your data that stores the type of each predictor
from preprocessing import get_feat_types
predictor_type_dict = get_feat_types()
continuous_fields = predictor_type_dict['continuous_fields']
# encode predictors (one-hot encoding for categorical data) - note you may have to create a new function starting from a copy of this one
from preprocessing import encode_predictors_housing_data
X_encoded = encode_predictors_housing_data(X)
# remove columns with nans or containing > 95% constant values (typically 0's)
from preprocessing import remove_bad_cols
X_good = remove_bad_cols(X, 95)
# train/test splits
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y_log, 
                                                    test_size=0.33, 
                                                    random_state=0)

print(x_train.shape)
print(x_test.shape)
# zscore
from preprocessing import zscore
# get means and stds
train_means = X_train.mean()
train_stds = X_train.std()
X_train_z = zscore(df=X_train, train_means=train_means, train_stds=train_stds)
X_test_z = zscore(df=X_test, train_means=train_means, train_stds=train_stds)
X_train_z.head()
# get random predictor permutations...
from preprocessing import get_predictor_combos
sampled_combinations = get_predictor_combos(X_train=X_train, K=K, n=25)

Feature selection

from feature_selection import get_best_uni_predictors

top_features = get_best_uni_predictors(N_keep=5, y=y, baseline_pred=y.mean(), 
                                       X_train=X_train, y_train=y_train,
                                       X_val=X_val, y_val=y_val,
                                       metric='RMSE', y_log_scaled=True)

top_features

Fit/eval model (sklearn version)

from regression_predict_sklearn import fit_eval_model
fit_eval_model(y=y, baseline_pred=y.mean(),
               X_train=X_train_z, y_train=y_train,
               X_test=X_test_z, y_test=y_test, 
               predictors=X_train_z.columns,
               metric='RMSE',
               y_log_scaled=True,
               model_type='unregularized',
               include_plots=True, plot_raw=True, verbose=True)

Model eval

# plot (1) true vs predicted vals for train/test sets and (2) best line of fit (only applies for univariate models)
from regression_predict_sklearn import plot_train_test_predictions 
(fig1, fig2) = plot_train_test_predictions(predictors=[predictor],
                                           X_train=x_train, X_test=x_test,
                                           y_train=y_train, y_test=y_test,
                                           y_pred_train=y_pred_train, y_pred_test=y_pred_test,
                                           y_log_scaled=True, plot_raw=True);
# report baseline, train, and test errors
from regression_predict_sklearn import measure_model_err
error_df = measure_model_err(y=y, baseline_pred=baseline_predict,
                             y_train=y_train, y_pred_train=y_pred_train, 
                             y_test=y_test, y_pred_test=y_pred_test, 
                             metric='MAPE', y_log_scaled=False) 

error_df.head()

Comparing models…

from regression_predict_sklearn import compare_models

df_model_err = compare_models(y=y, baseline_pred=baseline_predict,
                              X_train=X_train, y_train=y_train, 
                              X_val=X_val, y_val=y_val,
                              predictors_list=X_train.columns, 
                              metric='RMSE', y_log_scaled=True, 
                              model_type='unregularized', 
                              include_plots=False, plot_raw=False, verbose=False)

Key Points


Introduction to High-Dimensional Clustering

Overview

Teaching: 20 min
Exercises: 10 min
Questions
  • What is clustering?

  • Why is clustering important?

  • What are the challenges of clustering in high-dimensional spaces?

  • How do we implement K-means clustering in Python?

  • How do we evaluate the results of clustering?

Objectives
  • Explain what clustering is.

  • Discuss common clustering algorithms.

  • Highlight the challenges of clustering in high-dimensional spaces.

  • Implement and evaluate basic clustering algorithms.

Introduction to High-Dimensional Clustering

What is Clustering?

Clustering is a method of unsupervised learning that groups a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups. It is used in various applications such as market segmentation, document clustering, image segmentation, and more.

Common Clustering Algorithms

Common approaches include K-means, hierarchical (agglomerative) clustering, and density-based methods such as DBSCAN. We will implement and evaluate each of these later in this episode.

Challenges in High-Dimensional Spaces

High-dimensional spaces present unique challenges for clustering:

  1. Curse of Dimensionality: As the number of dimensions increases, distances between points become less meaningful, making it difficult to distinguish between clusters (see the short demonstration after this list).
  2. Sparsity: In high-dimensional spaces, data points tend to be sparse, which can hinder the performance of clustering algorithms.
  3. Visualization: Visualizing clusters in high-dimensional spaces is challenging since we can only plot in two or three dimensions.
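
To see the first point concretely, here is a minimal sketch (using random data rather than the housing dataset) showing how the contrast between the nearest and farthest neighbor of a point shrinks as dimensionality grows:

import numpy as np

rng = np.random.default_rng(0)

# As dimensionality grows, pairwise distances concentrate: the nearest and
# farthest neighbors of a point end up at nearly the same distance.
for n_dims in [2, 10, 100, 1000]:
    X = rng.random((200, n_dims))
    dists = np.linalg.norm(X[1:] - X[0], axis=1)  # distances from the first point to all others
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"{n_dims:>4} dims: (max - min) / min distance = {contrast:.2f}")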

Exercise: Discuss with your neighbor some applications of clustering you might encounter in your research or daily life. Why is it important?

Data Preparation for Clustering

Loading and Inspecting the Dataset

We’ll use the Ames Housing dataset for our examples. This dataset contains various features about houses in Ames, Iowa.

from sklearn.datasets import fetch_openml

# Load the dataset
housing = fetch_openml(name="house_prices", as_frame=True, parser='auto')

df = housing.data.copy(deep=True)
df = df.astype({'Id': int})
df = df.set_index('Id')
df.head()
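
As a quick check of the dataset's dimensionality (observations by features), you can inspect its shape:

# Number of observations (rows) and features (columns/dimensions)
print(df.shape)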

Preprocessing the Data

Handling Missing Values

# Keep numeric features only (StandardScaler and distance-based clustering require numeric data),
# then fill remaining missing values with each column's mean
df = df.select_dtypes(include='number')
df.fillna(df.mean(), inplace=True)

Scaling Features

Clustering algorithms are sensitive to the scale of the data. We’ll use StandardScaler to scale our features.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)

Exercise: Inspect the first few rows of the scaled dataset. Why is scaling important for clustering?

import pandas as pd

pd.DataFrame(df_scaled, columns=df.columns).head()
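
One way to convince yourself the scaling worked: after z-scoring, each feature should have a mean near 0 and a standard deviation near 1. This quick check is not part of the original exercise:

# Each scaled feature should have mean ~0 and standard deviation ~1
print(df_scaled.mean(axis=0).round(2))
print(df_scaled.std(axis=0).round(2))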

Implementing Clustering Algorithms

K-means Clustering

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Implementing K-means
kmeans = KMeans(n_clusters=5, random_state=42)
kmeans.fit(df_scaled)
labels_kmeans = kmeans.labels_

# Evaluating K-means with the silhouette score (ranges from -1 to 1; higher means better-separated clusters)
silhouette_avg_kmeans = silhouette_score(df_scaled, labels_kmeans)
print(f"K-means Silhouette Score: {silhouette_avg_kmeans}")

Hierarchical Clustering

from sklearn.cluster import AgglomerativeClustering

# Implementing Hierarchical Clustering
hierarchical = AgglomerativeClustering(n_clusters=5)
labels_hierarchical = hierarchical.fit_predict(df_scaled)

# Evaluating Hierarchical Clustering
silhouette_avg_hierarchical = silhouette_score(df_scaled, labels_hierarchical)
print(f"Hierarchical Clustering Silhouette Score: {silhouette_avg_hierarchical}")

DBSCAN Clustering

from sklearn.cluster import DBSCAN

# Implementing DBSCAN
dbscan = DBSCAN(eps=1.5, min_samples=5)
labels_dbscan = dbscan.fit_predict(df_scaled)

# Evaluating DBSCAN
silhouette_avg_dbscan = silhouette_score(df_scaled, labels_dbscan)
print(f"DBSCAN Silhouette Score: {silhouette_avg_dbscan}")

Exercise: Modify the parameters of the DBSCAN algorithm and observe how the silhouette score changes. Why do you think the results differ?
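
One possible starting point for this exercise is a small parameter sweep. This is only a sketch: the eps values are arbitrary, and it assumes df_scaled and the imports from above:

# Sweep eps and report how the clustering and silhouette score change
for eps in [0.5, 1.0, 1.5, 2.0, 3.0]:
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(df_scaled)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # exclude the noise label (-1)
    n_noise = int((labels == -1).sum())
    if len(set(labels)) >= 2:  # silhouette_score needs at least 2 distinct labels
        score = silhouette_score(df_scaled, labels)
        print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points, silhouette={score:.3f}")
    else:
        print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points (silhouette undefined)")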

Key Points

  • Understanding the basics of clustering and its importance.

  • Challenges in high-dimensional spaces.

  • Implementing basic clustering algorithms.


Addressing challenges in high-dimensional clustering

Overview

Teaching: 20 min
Exercises: 10 min
Questions
  • How can dimensionality reduction help in high-dimensional clustering?

  • What are specialized clustering algorithms for high-dimensional data?

  • How can we visualize high-dimensional data?

  • What insights can we gain from visualizing clusters?

Objectives
  • Apply dimensionality reduction techniques to high-dimensional data.

  • Implement specialized clustering algorithms for high-dimensional data.

  • Evaluate the impact of these techniques on clustering performance.

  • Visualize clustering results.

Addressing Challenges in High-Dimensional Clustering

Dimensionality Reduction Techniques

High-dimensional clustering often requires reducing the number of dimensions to make the problem more tractable. The choice of dimensionality reduction technique can significantly impact the clustering results. Here, we will explore several popular techniques and discuss their strengths and weaknesses.

Principal Component Analysis (PCA)

PCA reduces the dimensionality of the data by transforming it into a new set of variables that are linear combinations of the original variables. It aims to capture as much of the variance as possible with as few components as possible.

Local vs global structure

When reducing high-dimensional data to lower dimensions, it’s important to consider both local and global structures. Local structure refers to the relationships and distances between nearby data points. Preserving local structure ensures that points that were close together in the high-dimensional space remain close in the lower-dimensional representation. This is crucial for identifying clusters or neighborhoods within the data. Global structure, on the other hand, refers to the overall arrangement and distances between clusters or distant points in the dataset. Preserving global structure ensures that the broader data topology and the relative positioning of different clusters are maintained.

from sklearn.decomposition import PCA

# Applying PCA
pca = PCA(n_components=10)  # Reduce to 10 dimensions
df_pca = pca.fit_transform(df_scaled)
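
A quick way to gauge how much information those 10 components retain is the explained variance ratio of the fitted pca object above:

# Proportion of variance captured by each component, and the total
print(pca.explained_variance_ratio_)
print("Total variance explained:", pca.explained_variance_ratio_.sum())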

t-Distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE is a non-linear dimensionality reduction technique particularly well-suited for embedding high-dimensional data in a low-dimensional space. It is effective in visualizing high-dimensional data. Techniques like t-SNE excel at preserving local structure, making them great for detailed cluster visualization, but they may distort the global arrangement of clusters.

from sklearn.manifold import TSNE

# Applying t-SNE
tsne = TSNE(n_components=2)
df_tsne = tsne.fit_transform(df_scaled)

Pairwise Controlled Manifold Approximation Projection (PacMAP)

PacMAP is a dimensionality reduction technique that focuses on preserving both global and local structures in the data, providing a balance between t-SNE’s focus on local structure and PCA’s focus on global structure.

from pacmap import PaCMAP

# Applying PacMAP
pacmap = PaCMAP(n_components=2)
df_pacmap = pacmap.fit_transform(df_scaled)

Exercise 1: Compare Silhouette Scores with PCA

Compare the silhouette scores of K-means clustering on the original high-dimensional data and the data reduced using PCA. What do you observe?

# K-means on PCA-reduced data
kmeans_pca = KMeans(n_clusters=5, random_state=42)
kmeans_pca.fit(df_pca)
labels_kmeans_pca = kmeans_pca.labels_
silhouette_avg_kmeans_pca = silhouette_score(df_pca, labels_kmeans_pca)
print(f"K-means on PCA-reduced data Silhouette Score: {silhouette_avg_kmeans_pca}")

Determining whether to prioritize local or global structure in a research clustering context depends on the goals of your analysis and the nature of the data. Here are some key considerations:

Global vs Local Structure Deep Dive

Understanding and choosing the right technique based on the need to preserve either local or global structure (or both) can significantly impact the insights drawn from the data visualization.

When to Care More About Local Structure

  1. Cluster Identification and Separation:
    • If your primary goal is to identify and separate distinct clusters within your data, preserving local structure is crucial. Techniques that focus on local structure, such as t-SNE, ensure that points that are close in high-dimensional space remain close in the reduced space, making clusters more discernible.
    • Example: In gene expression data, where the goal is to identify distinct groups of genes or samples with similar expression patterns, preserving local neighborhoods is essential for accurate clustering.
  2. Neighborhood Analysis:
    • When the analysis requires examining the relationships between nearby data points, preserving local structure becomes important. This is common in studies where understanding local patterns or small-scale variations is key.
    • Example: In image recognition tasks, local structure preservation helps in identifying small groups of similar images, which can be crucial for tasks like facial recognition or object detection.
  3. Anomaly Detection:
    • For tasks like anomaly detection, where identifying outliers or unusual patterns within small regions of the data is important, maintaining local structure ensures that these patterns are not lost during dimensionality reduction.
    • Example: In network security, preserving local structure helps in detecting abnormal user behavior or network activity that deviates from typical patterns.

When to Care More About Global Structure

  1. Overall Data Topology:
    • If understanding the broad, overall arrangement of data points is critical, preserving global structure is essential. This helps in maintaining the relative distances and relationships between distant points, providing a comprehensive view of the data’s topology.
    • Example: In geographical data analysis, maintaining global structure can help in understanding the broader spatial distribution of features like climate patterns or population density.
  2. Data Integration and Comparison:
    • When integrating multiple datasets or comparing data across different conditions, preserving global structure helps in maintaining consistency and comparability across the entire dataset.
    • Example: In multi-omics studies, where different types of biological data (e.g., genomics, proteomics) are integrated, preserving global structure ensures that the overall relationships between data types are maintained.
  3. Data Compression and Visualization:
    • For tasks that require data compression or large-scale visualization, preserving global structure can help in maintaining the integrity of the dataset while reducing its dimensionality.
    • Example: In large-scale data visualization, techniques that preserve global structure, such as PCA, help in creating interpretable visual summaries that reflect the overall data distribution.

Balancing Local and Global Structures

In many cases, it may be necessary to strike a balance between preserving local and global structures. Techniques like PacMAP offer a compromise by maintaining both local and global relationships, making them suitable for applications where both detailed clustering and overall data topology are important.

Key Questions to Consider:

  • Is your primary goal to identify distinct clusters and neighborhoods, or to understand the overall arrangement of the data?
  • Do downstream analyses depend on relationships between nearby points (local structure) or on distances between clusters (global structure)?
  • Will you need to integrate or compare multiple datasets, where global consistency matters?

By carefully considering these factors, you can determine the appropriate emphasis on local or global structure for your specific research context.

Exercise 2: Understanding Global vs Local Structure

Compare t-SNE and PacMAP on the same high-dimensional data. Visualize the results and discuss how each technique handles local and global structures.

import matplotlib.pyplot as plt

# Visualizing t-SNE Clustering
plt.scatter(df_tsne[:, 0], df_tsne[:, 1], c=labels_kmeans, cmap='viridis')
plt.title("t-SNE Clustering")
plt.show()

# Visualizing PacMAP Clustering
plt.scatter(df_pacmap[:, 0], df_pacmap[:, 1], c=labels_kmeans, cmap='viridis')
plt.title("PacMAP Clustering")
plt.show()

Discuss the following:

  • How does t-SNE’s focus on local structures affect the visualization?
  • How does PacMAP’s balance of local and global structures compare to t-SNE?
  • Which method provides a better visualization for understanding the overall data topology?

Specialized Clustering Algorithms

Spectral Clustering

Spectral Clustering uses the eigenvalues of a similarity matrix to perform dimensionality reduction before clustering. It is useful for capturing complex, non-linear relationships.

from sklearn.cluster import SpectralClustering

# Implementing Spectral Clustering
spectral = SpectralClustering(n_clusters=5, affinity='nearest_neighbors', random_state=42)
labels_spectral = spectral.fit_predict(df_scaled)

# Evaluating Spectral Clustering
silhouette_avg_spectral = silhouette_score(df_scaled, labels_spectral)
print(f"Spectral Clustering Silhouette Score: {silhouette_avg_spectral}")

Exercise 3: Implement Spectral Clustering on PCA Data

Implement and evaluate Spectral Clustering on the PCA-reduced data. How does its performance compare to other methods?

# Spectral Clustering on PCA-reduced data
labels_spectral_pca = spectral.fit_predict(df_pca)
silhouette_avg_spectral_pca = silhouette_score(df_pca, labels_spectral_pca)
print(f"Spectral Clustering on PCA-reduced data Silhouette Score: {silhouette_avg_spectral_pca}")

Visualizing Clustering Results

Dimensionality Reduction using PCA

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Applying PCA with 2 components for plotting
# (note: this overwrites the 10-component df_pca created earlier)
pca = PCA(n_components=2)
df_pca = pca.fit_transform(df_scaled)

# Visualizing K-means Clustering
plt.scatter(df_pca[:, 0], df_pca[:, 1], c=labels_kmeans, cmap='viridis')
plt.title("K-means Clustering")
plt.show()

Comparing Clustering Results

# Visualizing Hierarchical Clustering
plt.scatter(df_pca[:, 0], df_pca[:, 1], c=labels_hierarchical, cmap='viridis')
plt.title("Hierarchical Clustering")
plt.show()

# Visualizing DBSCAN Clustering
plt.scatter(df_pca[:, 0], df_pca[:, 1], c=labels_dbscan, cmap='viridis')
plt.title("DBSCAN Clustering")
plt.show()

Visualizing t-SNE and PacMAP Results

# Visualizing t-SNE Clustering
plt.scatter(df_tsne[:, 0], df_tsne[:, 1], c=labels_kmeans, cmap='viridis')
plt.title("t-SNE Clustering")
plt.show()

# Visualizing PacMAP Clustering
plt.scatter(df_pacmap[:, 0], df_pacmap[:, 1], c=labels_kmeans, cmap='viridis')
plt.title("PacMAP Clustering")
plt.show()

Exercise 4: Compare Clustering Visualizations

Compare the visualizations of different clustering algorithms. Which algorithm do you think performed best and why?


Summary and Q&A

Recap

Q&A

Feel free to ask any questions or share your thoughts on today’s lesson.


Follow-up Materials:

Key Points

  • Exploring techniques to mitigate the challenges of high-dimensional clustering.

  • Applying dimensionality reduction techniques.

  • Using specialized clustering algorithms.

  • Visualizing clustering results.