Instructor Notes

Linear regression

helps us link two variables
creates line of best fit
show gapminder example of life epxectancy in UK
straight line since 1950
mathsisfun link for regression

linear regression code

def least_squares(data): x_sum = 0 y_sum = 0 x_sq_sum = 0 xy_sum = 0

assert len(data[0]) == len(data[1]) assert len(data) == 2

n = len(data[0]) for i in range(0, n): x = int(data[0][i]) y = data[1][i] x_sum = x_sum + x y_sum = y_sum + y x_sq_sum = x_sq_sum + (x**2) xy_sum = xy_sum + (x*y)

m = ((n * xy_sum) - (x_sum * y_sum)) m = m / ((n * x_sq_sum) - (x_sum ** 2)) c = (y_sum - m * x_sum) / n

print("Results of linear regression:") print("x_sum=", x_sum, "y_sum=", y_sum, "x_sq_sum=", x_sq_sum, "xy_sum=",xy_sum) print("m=", m, "c=", c) return m, c x_data = [2,3,5,7,9] y_data = [4,5,7,10,15] least_squares([x_data,y_data])

testing accuracy

def measure_error(data1, data2): assert len(data1) == len(data2) err_total = 0 for i in range(0, len(data1)): err_total = err_total + (data1[i] - data2[i]) ** 2 err = math.sqrt(err_total / len(data1)) return err

m, c = least_squares([x_data,y_data]) linear_data = [] for x in x_data: y = m * x + c linear_data.append(y) print(measure_error(y_data,linear_data))

Graphing the data

import matplotlib.pyplot as plt def make_graph(x_data, y_data, linear_data): plt.plot(x_data, y_data, label="Original Data") plt.plot(x_data, linear_data, label="Line of best fit") plt.grid() plt.legend() plt.show() x_data = [2,3,5,7,9] y_data = [4,5,7,10,15]] m,c = least_squares([x_data,y_data]) linear_data = [] for x in x_data: y = m * x + c # add the result to the linear_data list linear_data.append(y) make_graph(x_data, y_data, linear_data)

Predicting life expectancy Lets use real data from gapminder, download gapminder-life-expectancy.csv

Code to load the CSV file and predict

import pandas as pd def process_life_expectancy_data(filename, country, min_date, max_date): df = pd.read_csv(filename, index_col="Life expectancy") life_expectancy = df.loc[country, str(min_date):str(max_date)] x_data = list(range(min_date, max_date + 1)) m, c = least_squares([x_data, life_expectancy]) linear_data = [] for x in x_data: y = m * x + c linear_data.append(y) error = measure_error(life_expectancy, linear_data) print("error is ", error) make_graph(x_data, life_expectancy, linear_data) process_life_expectancy_data("../data/gapminder-life-expectancy.csv", "United Kingdom", 1950, 2010)

Exercises

model life expectancy for Germany 1950-2000
predict german life expectancy 2001-2016

Logarithmic regression

Way around linear limiations, use gapminder graphs to illustrate logarithmis inverse of exponents.

example code to load life expectancy and gdp

def read_data(gdp_file, life_expectancy_file, year): df_gdp = pd.read_csv(gdp_file, index_col="Country Name") gdp = df_gdp.loc[:, year] df_life_expt = pd.read_csv(life_expectancy_file,index_col="Life expectancy") life_expectancy = df_life_expt.loc[:, year] data = [] for country in life_expectancy.index: if country in gdp.index: if (math.isnan(life_expectancy[country]) is False) and (math.isnan(gdp[country]) is False): data.append((country, life_expectancy[country],gdp[country])) else: print("Excluding ", country, ",NaN in data (life_exp = ", life_expectancy[country], "gdp=", gdp[country], ")") else: print(country, "is not in the GDP country data") combined = pd.DataFrame.from_records(data, columns=("Country","Life Expectancy", "GDP")) combined = combined.set_index("Country") # we'll need sorted data for graphing properly later on combined = combined.sort_values("Life Expectancy") return combined

Modify process_data function to take the log of the data

add import math

gdp = data["GDP"].tolist() gdp_log = data["GDP"].apply(math.log).tolist() life_exp = data["Life Expectancy"].tolist() m, c = least_squares([life_exp, gdp_log])

when graphing we can choose either the log or the linear version.

# list for logarithmic version log_data = [] # list for raw version linear_data = [] for x in life_exp: y_log = m * x + c log_data.append(y_log) y = math.exp(y_log) linear_data.append(y) # uncomment for log version, further changes needed in make_graph too # make_graph(life_exp, gdp_log, log_data) make_graph(life_exp, gdp, linear_data)

change line in least_squares function to treat data as floats, previously we had integers on the x axis for years

x = int(data[0][i])

becomes

x = data[0][i]

Now need a scatter graph to instead of line plot.

def make_graph(x_data, y_data, linear_data): plt.scatter(x_data, y_data, label="Original Data") plt.plot(x_data, linear_data, color="orange", label="Line of best fit") plt.grid() plt.legend() plt.show()

Exercises

compare log and linear graphs
remove outliers from the data

Sklearn

sklearn is a library with lots of useful ML functions.

Includes a linear regression library

import numpy as np import sklearn.linear_model as skl_lin

replace our call to least_squares with:

x_data_arr = np.array(x_data).reshape(-1, 1) life_exp_arr = np.array(life_expectancy).reshape(-1, 1) regression = skl_lin.LinearRegression().fit(x_data_arr, life_exp_arr) m = regression.coef_[0][0] c = regression.intercept_[0]

computing output changes to

linear_data = regression.predict(x_data_arr)

test it.

Sklearn also includes error measuring code:

import sklearn.metrics as skl_metrics error = math.sqrt(skl_metrics.mean_squared_error(life_exp_arr, linear_data))

Exercises

compare scikit learn and own implementation of linear regression
predict german life expectancy

Polynomial regression

Useful for non-linear data.

import sklearn.preprocessing as skl_pre

polynomial_features = skl_pre.PolynomialFeatures(degree=5)``x_poly = polynomial_features.fit_transform(x_data_arr)polynomial_model = skl_lin.LinearRegression().fit(x_poly, life_exp_arr)polynomial_data = polynomial_model.predict(x_poly)make_graph(x_data, life_expectancy, polynomial_data)```

do some predicitions:

predictions_x = np.array(list(range(2001,2017))).reshape(-1, 1) predictions_polynomial = polynomial_model.predict(polynomial_features.fit_transform(predictions_x)) predictions_linear = regression.predict(predictions_x)

measure error: linear_error = math.sqrt(skl_metrics.mean_squared_error(life_exp_arr, linear_data)) print("linear error is ", linear_error) polynomial_error = math.sqrt(skl_metrics.mean_squared_error(life_exp_arr, polynomial_data)) print("polynomial error is", polynomial_error)

Exercises:

compare linear and polynomial models

Clustering

Finds groups in data

Also used in data compresssion and pattern recognition.

K Means clustering

Analogy, randomly place a load of cafes in a city, see which ones are more popular, move the unpopular ones closer to the popular ones. Repeat until we have clusters of cafes in a few areas.

sklearn has a kmeans implementation although its relatively simple, we’ll just stick to their version.

advantages/Limitations of kmeans

requires number of clusters to be known in advance, struggles on irregular or overlapping/concentric shapes.
fast and easy to compute
low memory overhead, suitable for large datasets
good default option

Exercises

Kmeans with overlapping clsuters
how many clusters

Spectral clustering

works better with concentric circles. Adds extra dimensions to the data.

Exercises

comparing kmeans and spectral performance

Neural Networks

Based on how the brain works. Concept of artifical neuron. Good at classification tasks, image recognition.

Perceptrons

Multiple inputs, each multiplied by a weight. Usually scaled 0 to 1.0. Sum of all inputs. Activation function for the sum. Threshold in original perceptron.

linear separability problems

Multilayer perceptrons

solves linear separability

sklearn implementation minst data set

test/training data

Exercises

changing learning parameters
using your own handwriting

Cross Validation

use all the data for both training/testing. Multiple iterations.

Exercises

cloud image classification