Instructor Notes

Linear regression

  • helps us link two variables
  • creates a line of best fit (least-squares formulas below)
  • show the gapminder example of life expectancy in the UK
  • roughly a straight line since 1950
  • mathsisfun link for regression
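
For reference, these are the least-squares slope and intercept that the code below computes:

```latex
m = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2},
\qquad
c = \frac{\sum y_i - m \sum x_i}{n}
```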

Linear regression code

```python
def least_squares(data):
    x_sum = 0
    y_sum = 0
    x_sq_sum = 0
    xy_sum = 0

    # the data should be two equal-length columns
    assert len(data[0]) == len(data[1])
    assert len(data) == 2

    n = len(data[0])
    # accumulate the sums needed for the least-squares formulas
    for i in range(0, n):
        x = int(data[0][i])
        y = data[1][i]
        x_sum = x_sum + x
        y_sum = y_sum + y
        x_sq_sum = x_sq_sum + (x ** 2)
        xy_sum = xy_sum + (x * y)

    m = ((n * xy_sum) - (x_sum * y_sum))
    m = m / ((n * x_sq_sum) - (x_sum ** 2))
    c = (y_sum - m * x_sum) / n

    print("Results of linear regression:")
    print("x_sum=", x_sum, "y_sum=", y_sum, "x_sq_sum=", x_sq_sum,
          "xy_sum=", xy_sum)
    print("m=", m, "c=", c)
    return m, c

x_data = [2, 3, 5, 7, 9]
y_data = [4, 5, 7, 10, 15]
least_squares([x_data, y_data])
```

Testing accuracy

```python
import math

def measure_error(data1, data2):
    assert len(data1) == len(data2)
    err_total = 0
    for i in range(0, len(data1)):
        err_total = err_total + (data1[i] - data2[i]) ** 2

    err = math.sqrt(err_total / len(data1))
    return err
```

```python
m, c = least_squares([x_data, y_data])

linear_data = []
for x in x_data:
    y = m * x + c
    linear_data.append(y)

print(measure_error(y_data, linear_data))
```

Graphing the data

```python
import matplotlib.pyplot as plt

def make_graph(x_data, y_data, linear_data):
    plt.plot(x_data, y_data, label="Original Data")
    plt.plot(x_data, linear_data, label="Line of best fit")
    plt.grid()
    plt.legend()
    plt.show()

x_data = [2, 3, 5, 7, 9]
y_data = [4, 5, 7, 10, 15]

m, c = least_squares([x_data, y_data])

linear_data = []
for x in x_data:
    y = m * x + c
    # add the result to the linear_data list
    linear_data.append(y)

make_graph(x_data, y_data, linear_data)
```

Predicting life expectancy

Let's use real data from Gapminder; download gapminder-life-expectancy.csv.

Code to load the CSV file and predict

```python
import pandas as pd

def process_life_expectancy_data(filename, country, min_date, max_date):
    df = pd.read_csv(filename, index_col="Life expectancy")
    life_expectancy = df.loc[country, str(min_date):str(max_date)]
    x_data = list(range(min_date, max_date + 1))

    m, c = least_squares([x_data, life_expectancy])

    linear_data = []
    for x in x_data:
        y = m * x + c
        linear_data.append(y)

    error = measure_error(life_expectancy, linear_data)
    print("error is ", error)

    make_graph(x_data, life_expectancy, linear_data)

process_life_expectancy_data("../data/gapminder-life-expectancy.csv",
                             "United Kingdom", 1950, 2010)
```

Exercises

  • model life expectancy for Germany 1950-2000
  • predict German life expectancy for 2001-2016 (see the sketch below)
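
One possible sketch for the prediction exercise, reusing the CSV-loading pattern from process_life_expectancy_data (the country name "Germany" is assumed to match the CSV):

```python
import pandas as pd

df = pd.read_csv("../data/gapminder-life-expectancy.csv",
                 index_col="Life expectancy")
germany = df.loc["Germany", "1950":"2000"]

# fit on 1950-2000, then extrapolate with y = m * x + c
m, c = least_squares([list(range(1950, 2001)), list(germany)])
for year in range(2001, 2017):
    print(year, m * year + c)
```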

Logarithmic regression

A way around the limitations of linear regression; use the gapminder graphs to illustrate that logarithms are the inverse of exponentiation.
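
A quick way to show that inverse relationship (a minimal sketch):

```python
import math

x = 1000.0
log_x = math.log(x)      # natural logarithm
print(log_x)             # about 6.91
print(math.exp(log_x))   # exp undoes log: back to 1000.0
```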

Example code to load life expectancy and GDP:

```python
def read_data(gdp_file, life_expectancy_file, year):
    df_gdp = pd.read_csv(gdp_file, index_col="Country Name")
    gdp = df_gdp.loc[:, year]

    df_life_expt = pd.read_csv(life_expectancy_file,
                               index_col="Life expectancy")
    life_expectancy = df_life_expt.loc[:, year]

    data = []
    for country in life_expectancy.index:
        if country in gdp.index:
            if (math.isnan(life_expectancy[country]) is False) and \
               (math.isnan(gdp[country]) is False):
                data.append((country, life_expectancy[country],
                             gdp[country]))
            else:
                print("Excluding ", country, ",NaN in data (life_exp = ",
                      life_expectancy[country], "gdp=", gdp[country], ")")
        else:
            print(country, "is not in the GDP country data")

    combined = pd.DataFrame.from_records(data,
                                         columns=("Country",
                                                  "Life Expectancy", "GDP"))
    combined = combined.set_index("Country")
    # we'll need sorted data for graphing properly later on
    combined = combined.sort_values("Life Expectancy")
    return combined
```

Modify the process_data function to take the log of the data.

Add `import math` at the top of the file.

```python
gdp = data["GDP"].tolist()
gdp_log = data["GDP"].apply(math.log).tolist()
life_exp = data["Life Expectancy"].tolist()

m, c = least_squares([life_exp, gdp_log])
```

When graphing, we can choose either the log or the linear version.

```python
# list for logarithmic version
log_data = []
# list for raw version
linear_data = []
for x in life_exp:
    y_log = m * x + c
    log_data.append(y_log)

    y = math.exp(y_log)
    linear_data.append(y)

# uncomment for log version, further changes needed in make_graph too
# make_graph(life_exp, gdp_log, log_data)
make_graph(life_exp, gdp, linear_data)
```

Change this line in the least_squares function to treat the data as floats; previously we had integers for the years on the x axis:

```python
x = int(data[0][i])
```

becomes

```python
x = data[0][i]
```

Now we need a scatter graph instead of a line plot.

```python
def make_graph(x_data, y_data, linear_data):
    plt.scatter(x_data, y_data, label="Original Data")
    plt.plot(x_data, linear_data, color="orange", label="Line of best fit")
    plt.grid()
    plt.legend()
    plt.show()
```

Exercises

  • compare log and linear graphs
  • remove outliers from the data (see the masking sketch below)
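
For the outlier exercise, one simple approach is a boolean mask on the DataFrame returned by read_data (the GDP cutoff here is purely illustrative):

```python
# combined is the DataFrame returned by read_data above;
# keep only the rows below an illustrative GDP cutoff
combined = combined[combined["GDP"] < 1e13]
```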

Sklearn

sklearn is a library with lots of useful ML functions.

It includes a linear regression implementation.

```python
import numpy as np
import sklearn.linear_model as skl_lin
```

Replace our call to least_squares with:

```python
x_data_arr = np.array(x_data).reshape(-1, 1)
life_exp_arr = np.array(life_expectancy).reshape(-1, 1)

regression = skl_lin.LinearRegression().fit(x_data_arr, life_exp_arr)

m = regression.coef_[0][0]
c = regression.intercept_[0]
```

Computing the output changes to:

```python
linear_data = regression.predict(x_data_arr)
```

Test it.

Sklearn also includes error measuring code:

```python
import sklearn.metrics as skl_metrics

error = math.sqrt(skl_metrics.mean_squared_error(life_exp_arr, linear_data))
```

Exercises

  • compare scikit-learn and our own implementation of linear regression
  • predict German life expectancy

Polynomial regression

Useful for non-linear data.

```python
import sklearn.preprocessing as skl_pre
```

```python
polynomial_features = skl_pre.PolynomialFeatures(degree=5)
x_poly = polynomial_features.fit_transform(x_data_arr)

polynomial_model = skl_lin.LinearRegression().fit(x_poly, life_exp_arr)
polynomial_data = polynomial_model.predict(x_poly)

make_graph(x_data, life_expectancy, polynomial_data)
```

Do some predictions:

```python
predictions_x = np.array(list(range(2001, 2017))).reshape(-1, 1)

predictions_polynomial = polynomial_model.predict(
    polynomial_features.fit_transform(predictions_x))
predictions_linear = regression.predict(predictions_x)
```

Measure the error:

```python
linear_error = math.sqrt(
    skl_metrics.mean_squared_error(life_exp_arr, linear_data))
print("linear error is ", linear_error)

polynomial_error = math.sqrt(
    skl_metrics.mean_squared_error(life_exp_arr, polynomial_data))
print("polynomial error is", polynomial_error)
```

Exercises

  • compare linear and polynomial models

Clustering

Finds groups in data.

Also used in data compression and pattern recognition.

K Means clustering

Analogy: randomly place a load of cafes in a city, see which ones are more popular, and move the unpopular ones closer to the popular ones. Repeat until we have clusters of cafes in a few areas.

sklearn has a k-means implementation; although the algorithm is relatively simple to implement ourselves, we'll just stick to their version.
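
A minimal sketch of using it (the blob data here is illustrative, not from the lesson):

```python
import sklearn.cluster as skl_cluster
import sklearn.datasets as skl_datasets

# synthetic blob data to cluster (illustrative)
data, _ = skl_datasets.make_blobs(n_samples=300, centers=4, random_state=1)

# k-means needs the number of clusters up front
kmeans = skl_cluster.KMeans(n_clusters=4)
labels = kmeans.fit_predict(data)

print(labels[:10])
print(kmeans.cluster_centers_)
```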

Advantages/limitations of k-means

  • requires number of clusters to be known in advance, struggles on irregular or overlapping/concentric shapes.
  • fast and easy to compute
  • low memory overhead, suitable for large datasets
  • good default option

Exercises

  • k-means with overlapping clusters
  • how many clusters? (see the inertia sketch below)
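
For the "how many clusters" exercise, one common heuristic is the elbow method: plot the k-means inertia against k and look for the bend. A minimal sketch, assuming blob data as above:

```python
import matplotlib.pyplot as plt
import sklearn.cluster as skl_cluster
import sklearn.datasets as skl_datasets

data, _ = skl_datasets.make_blobs(n_samples=300, centers=4, random_state=1)

# inertia_ is the sum of squared distances to the nearest centre
inertias = []
for k in range(1, 10):
    kmeans = skl_cluster.KMeans(n_clusters=k).fit(data)
    inertias.append(kmeans.inertia_)

plt.plot(range(1, 10), inertias)
plt.xlabel("number of clusters k")
plt.ylabel("inertia")
plt.show()
```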

Spectral clustering

Works better with concentric circles; it adds extra dimensions to the data before clustering.
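
A minimal sketch on concentric-circle data (the make_circles parameters are illustrative):

```python
import sklearn.cluster as skl_cluster
import sklearn.datasets as skl_datasets

# two concentric rings, which k-means cannot separate
circles, _ = skl_datasets.make_circles(n_samples=400, noise=0.05,
                                       factor=0.3, random_state=1)

model = skl_cluster.SpectralClustering(n_clusters=2,
                                       affinity="nearest_neighbors")
labels = model.fit_predict(circles)
print(labels[:20])
```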

Exercises

  • comparing kmeans and spectral performance

Neural Networks

Based on how the brain works. Concept of an artificial neuron. Good at classification tasks, e.g. image recognition.

Perceptrons

Multiple inputs, each multiplied by a weight; inputs are usually scaled to the range 0 to 1.0. The weighted inputs are summed and an activation function is applied to the sum; the original perceptron used a simple threshold.
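
A minimal sketch of a single perceptron with a threshold activation; the weights and bias are hand-picked to implement a logical AND, purely for illustration:

```python
def perceptron(inputs, weights, bias):
    # weighted sum of the inputs
    total = sum(i * w for i, w in zip(inputs, weights))
    # original perceptron activation: a simple threshold
    return 1 if total + bias > 0 else 0

# hand-picked weights/bias implementing logical AND (illustrative)
for a in (0, 1):
    for b in (0, 1):
        print(a, b, perceptron([a, b], [1.0, 1.0], -1.5))
```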

Linear separability problems: a single perceptron cannot learn XOR.

Multilayer perceptrons

Solves the linear separability problem.

sklearn implementation, MNIST data set.

Split the data into test/training sets.
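
A minimal sketch using sklearn's MLPClassifier on the small built-in digits data rather than the full MNIST download; the hyperparameters are illustrative:

```python
import sklearn.datasets as skl_datasets
import sklearn.model_selection as skl_msel
import sklearn.neural_network as skl_nn

# small built-in 8x8 digit images; the real MNIST set is larger
digits = skl_datasets.load_digits()

# hold back a quarter of the data for testing
X_train, X_test, y_train, y_test = skl_msel.train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=1)

mlp = skl_nn.MLPClassifier(hidden_layer_sizes=(50,), max_iter=500,
                           random_state=1)
mlp.fit(X_train, y_train)
print("test accuracy:", mlp.score(X_test, y_test))
```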

Exercises

  • changing learning parameters
  • using your own handwriting

Cross Validation

Use all the data for both training and testing over multiple iterations: split the data into k folds, train on all but one fold, test on the held-out fold, and rotate until every fold has been the test set.
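
A minimal sketch with sklearn's cross_val_score (5 folds assumed; model and data as in the MLP sketch above):

```python
import sklearn.datasets as skl_datasets
import sklearn.model_selection as skl_msel
import sklearn.neural_network as skl_nn

digits = skl_datasets.load_digits()
mlp = skl_nn.MLPClassifier(hidden_layer_sizes=(50,), max_iter=500,
                           random_state=1)

# 5 iterations: each fold takes a turn as the held-out test set
scores = skl_msel.cross_val_score(mlp, digits.data, digits.target, cv=5)
print(scores)
print("mean accuracy:", scores.mean())
```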

Exercises

  • cloud image classification
