This lesson is in the early stages of development (Alpha version)

Introduction to Machine Learning with Scikit Learn: Instructor Notes

Linear regression

Linear regression code

    def least_squares(data):
        # data is a pair of equal-length lists: [x_values, y_values]
        x_sum = 0
        y_sum = 0
        x_sq_sum = 0
        xy_sum = 0

        assert len(data[0]) == len(data[1])
        assert len(data) == 2

        # accumulate the sums needed by the closed-form solution
        n = len(data[0])
        for i in range(0, n):
            x = int(data[0][i])
            y = data[1][i]
            x_sum = x_sum + x
            y_sum = y_sum + y
            x_sq_sum = x_sq_sum + (x**2)
            xy_sum = xy_sum + (x*y)

        # gradient (m) and intercept (c) of the line of best fit
        m = ((n * xy_sum) - (x_sum * y_sum))
        m = m / ((n * x_sq_sum) - (x_sum ** 2))
        c = (y_sum - m * x_sum) / n

        print("Results of linear regression:")
        print("x_sum=", x_sum, "y_sum=", y_sum, "x_sq_sum=", x_sq_sum, "xy_sum=", xy_sum)
        print("m=", m, "c=", c)
        return m, c

    x_data = [2,3,5,7,9]
    y_data = [4,5,7,10,15]
    least_squares([x_data,y_data])
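A quick way to check the hand-written result is to compare it against NumPy. The sketch below is not part of the lesson code; it assumes NumPy is installed and recomputes the same sums inline, so it runs on its own.

```python
import numpy as np

x_data = [2, 3, 5, 7, 9]
y_data = [4, 5, 7, 10, 15]

# np.polyfit with degree 1 returns [slope, intercept] of the least-squares line
m_np, c_np = np.polyfit(x_data, y_data, 1)

# the same closed-form sums that least_squares accumulates in its loop
n = len(x_data)
x_sum, y_sum = sum(x_data), sum(y_data)
x_sq_sum = sum(x * x for x in x_data)
xy_sum = sum(x * y for x, y in zip(x_data, y_data))
m = (n * xy_sum - x_sum * y_sum) / (n * x_sq_sum - x_sum ** 2)
c = (y_sum - m * x_sum) / n

print("closed form:", m, c)
print("np.polyfit: ", m_np, c_np)
```

Both lines should print the same gradient and intercept, up to floating-point rounding.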

Testing accuracy

    import math

    def measure_error(data1, data2):
        # root mean squared error between two equal-length lists
        assert len(data1) == len(data2)
        err_total = 0
        for i in range(0, len(data1)):
            err_total = err_total + (data1[i] - data2[i]) ** 2
        err = math.sqrt(err_total / len(data1))
        return err

    m, c = least_squares([x_data,y_data])
    linear_data = []
    for x in x_data:
        y = m * x + c
        linear_data.append(y)
    print(measure_error(y_data,linear_data))
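The error measure above is the root mean squared error (RMSE). A small worked case with numbers that can be checked by hand may help when explaining it; the function name `rmse` and the values are illustrative only, not part of the lesson.

```python
import math

def rmse(observed, predicted):
    """Root mean squared error between two equal-length sequences."""
    assert len(observed) == len(predicted)
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# the differences are 0, 0 and 3, so the error is sqrt(9 / 3) = sqrt(3)
print(rmse([1, 2, 6], [1, 2, 3]))
```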

Graphing the data

    import matplotlib.pyplot as plt

    def make_graph(x_data, y_data, linear_data):
        plt.plot(x_data, y_data, label="Original Data")
        plt.plot(x_data, linear_data, label="Line of best fit")
        plt.grid()
        plt.legend()

    x_data = [2,3,5,7,9]
    y_data = [4,5,7,10,15]
    m, c = least_squares([x_data,y_data])
    linear_data = []
    for x in x_data:
        y = m * x + c
        # add the result to the linear_data list
        linear_data.append(y)
    make_graph(x_data, y_data, linear_data)

Predicting life expectancy

Let's use real data from Gapminder: download gapminder-life-expectancy.csv

Code to load the CSV file and predict

    import pandas as pd

    def process_life_expectancy_data(filename, country, min_date, max_date):
        df = pd.read_csv(filename, index_col="Life expectancy")
        life_expectancy = df.loc[country, str(min_date):str(max_date)]
        x_data = list(range(min_date, max_date + 1))
        m, c = least_squares([x_data, life_expectancy])
        linear_data = []
        for x in x_data:
            y = m * x + c
            linear_data.append(y)
        error = measure_error(life_expectancy, linear_data)
        print("error is ", error)
        make_graph(x_data, life_expectancy, linear_data)

    process_life_expectancy_data("../data/gapminder-life-expectancy.csv", "United Kingdom", 1950, 2010)


Logarithmic regression

A way around the limitations of linear regression. Use the Gapminder graphs to illustrate that logarithms are the inverse of exponents.
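If a quick demonstration is useful here, the sketch below shows both ideas with made-up numbers (not Gapminder data): the logarithm undoes the exponential, and data that grows exponentially becomes a straight line once logged.

```python
import math

# hypothetical data that doubles at every step: y = 3 * 2**x
x_values = list(range(6))
y_values = [3 * 2 ** x for x in x_values]

# log undoes exp, up to floating-point rounding
print(math.log(math.exp(2.5)))

# after taking logs, the step between neighbouring values is constant,
# which is what a straight line looks like
log_y = [math.log(y) for y in y_values]
steps = [b - a for a, b in zip(log_y, log_y[1:])]
print(steps)  # every step is approximately log(2)
```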

Example code to load life expectancy and GDP

    def read_data(gdp_file, life_expectancy_file, year):
        df_gdp = pd.read_csv(gdp_file, index_col="Country Name")
        gdp = df_gdp.loc[:, year]

        df_life_expt = pd.read_csv(life_expectancy_file, index_col="Life expectancy")
        life_expectancy = df_life_expt.loc[:, year]

        data = []
        for country in life_expectancy.index:
            if country in gdp.index:
                if (math.isnan(life_expectancy[country]) is False) and (math.isnan(gdp[country]) is False):
                    data.append((country, life_expectancy[country], gdp[country]))
                else:
                    print("Excluding ", country, ",NaN in data (life_exp = ", life_expectancy[country], "gdp=", gdp[country], ")")
            else:
                print(country, "is not in the GDP country data")

        combined = pd.DataFrame.from_records(data, columns=("Country", "Life Expectancy", "GDP"))
        combined = combined.set_index("Country")
        # we'll need sorted data for graphing properly later on
        combined = combined.sort_values("Life Expectancy")
        return combined

Modify process_data function to take the log of the data

add import math

    gdp = data["GDP"].tolist()
    gdp_log = data["GDP"].apply(math.log).tolist()
    life_exp = data["Life Expectancy"].tolist()
    m, c = least_squares([life_exp, gdp_log])

When graphing we can choose either the log or the linear version.

    # list for logarithmic version
    log_data = []
    # list for raw version
    linear_data = []
    for x in life_exp:
        y_log = m * x + c
        log_data.append(y_log)
        y = math.exp(y_log)
        linear_data.append(y)

    # uncomment for log version, further changes needed in make_graph too
    # make_graph(life_exp, gdp_log, log_data)
    make_graph(life_exp, gdp, linear_data)

Change one line in the least_squares function to treat the data as floats. Previously we had integers (years) on the x axis. Change

    x = int(data[0][i])

to

    x = data[0][i]

We now need a scatter graph instead of a line plot.

    def make_graph(x_data, y_data, linear_data):
        plt.scatter(x_data, y_data, label="Original Data")
        plt.plot(x_data, linear_data, color="orange", label="Line of best fit")
        plt.grid()
        plt.legend()



sklearn is a library with lots of useful ML functions.

Includes a linear regression library

    import numpy as np
    import sklearn.linear_model as skl_lin

Replace our call to least_squares with:

    x_data_arr = np.array(x_data).reshape(-1, 1)
    life_exp_arr = np.array(life_expectancy).reshape(-1, 1)
    regression = skl_lin.LinearRegression().fit(x_data_arr, life_exp_arr)
    m = regression.coef_[0][0]
    c = regression.intercept_[0]

computing output changes to

linear_data = regression.predict(x_data_arr)

test it.

Sklearn also includes error measuring code:

    import sklearn.metrics as skl_metrics

    error = math.sqrt(skl_metrics.mean_squared_error(life_exp_arr, linear_data))
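For a check that does not need the Gapminder file, the same sklearn calls can be run on the small toy data set from the start of these notes. This is only a sketch: it assumes numpy and scikit-learn are installed, and the gradient and intercept should match what least_squares printed for the same data.

```python
import math
import numpy as np
import sklearn.linear_model as skl_lin
import sklearn.metrics as skl_metrics

x_data = [2, 3, 5, 7, 9]
y_data = [4, 5, 7, 10, 15]

# sklearn expects 2D arrays: one row per sample, one column per feature
x_arr = np.array(x_data).reshape(-1, 1)
y_arr = np.array(y_data).reshape(-1, 1)

regression = skl_lin.LinearRegression().fit(x_arr, y_arr)
m = regression.coef_[0][0]
c = regression.intercept_[0]

# root mean squared error of the fitted line on the same points
predicted = regression.predict(x_arr)
error = math.sqrt(skl_metrics.mean_squared_error(y_arr, predicted))
print("m=", m, "c=", c, "error=", error)
```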


Polynomial regression

Useful for non-linear data.

import sklearn.preprocessing as skl_pre

    polynomial_features = skl_pre.PolynomialFeatures(degree=5)
    x_poly = polynomial_features.fit_transform(x_data_arr)
    polynomial_model = skl_lin.LinearRegression().fit(x_poly, life_exp_arr)
    polynomial_data = polynomial_model.predict(x_poly)
    make_graph(x_data, life_expectancy, polynomial_data)

Do some predictions:

    predictions_x = np.array(list(range(2001,2017))).reshape(-1, 1)
    # reuse the already fitted transformer on the new inputs
    predictions_polynomial = polynomial_model.predict(polynomial_features.transform(predictions_x))
    predictions_linear = regression.predict(predictions_x)

Measure error:

    linear_error = math.sqrt(skl_metrics.mean_squared_error(life_exp_arr, linear_data))
    print("linear error is ", linear_error)
    polynomial_error = math.sqrt(skl_metrics.mean_squared_error(life_exp_arr, polynomial_data))
    print("polynomial error is", polynomial_error)
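A self-contained version of the linear versus polynomial comparison, using made-up curved data rather than the Gapminder file, might look like the sketch below. The data, the choice of degree 2 and the variable names are assumptions for illustration. The error here is measured on the same points the models were fitted to, so a lower polynomial error does not by itself show that the polynomial predicts better outside that range.

```python
import math
import numpy as np
import sklearn.linear_model as skl_lin
import sklearn.metrics as skl_metrics
import sklearn.preprocessing as skl_pre

# hypothetical curved data: y = 0.5 * x**2 - x + 2
x = np.arange(0, 10).reshape(-1, 1)
y = 0.5 * x ** 2 - x + 2

linear_model = skl_lin.LinearRegression().fit(x, y)
linear_error = math.sqrt(skl_metrics.mean_squared_error(y, linear_model.predict(x)))

features = skl_pre.PolynomialFeatures(degree=2)
x_poly = features.fit_transform(x)  # columns: 1, x, x**2
poly_model = skl_lin.LinearRegression().fit(x_poly, y)
poly_error = math.sqrt(skl_metrics.mean_squared_error(y, poly_model.predict(x_poly)))

# new inputs go through the already fitted transformer
new_x = np.array([[10], [11]])
new_y = poly_model.predict(features.transform(new_x))

print("linear error is", linear_error)
print("polynomial error is", poly_error)
print("predictions:", new_y.ravel())
```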



Clustering finds groups in data.

Also used in data compression and pattern recognition.

K Means clustering

Analogy: randomly place a load of cafes in a city, see which ones are more popular, then move the unpopular ones closer to the popular ones. Repeat until we have clusters of cafes in a few areas.
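For instructors who want to show the loop behind the analogy, here is a minimal sketch of the standard k-means update (assign each point to its nearest centre, then move each centre to the mean of its points). The function name `kmeans_sketch`, the fixed iteration count and the test data are all assumptions for illustration; it is not the sklearn implementation.

```python
import numpy as np

def kmeans_sketch(points, k, iterations=20, seed=0):
    """Very small k-means: returns (centres, labels)."""
    rng = np.random.default_rng(seed)
    # start the centres (the "cafes") on k randomly chosen points
    centres = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # distance from every point to every centre, shape (n_points, k)
        distances = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # move each centre to the mean of the points assigned to it
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centre
                centres[j] = members.mean(axis=0)
    return centres, labels

# two hypothetical, well separated groups of customers
rng = np.random.default_rng(1)
group_a = rng.normal(loc=(0, 0), scale=0.5, size=(50, 2))
group_b = rng.normal(loc=(10, 10), scale=0.5, size=(50, 2))
points = np.vstack([group_a, group_b])

centres, labels = kmeans_sketch(points, k=2)
print(centres)
```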

sklearn has a k-means implementation. Although the algorithm is relatively simple, we'll just stick to their version.
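A minimal sketch of the sklearn version, assuming scikit-learn is installed. The blob data from `make_blobs` and the parameter values are illustrative choices, not taken from the lesson.

```python
import sklearn.cluster as skl_cluster
import sklearn.datasets as skl_datasets

# hypothetical test data: 3 blobs of points in 2D
data, true_labels = skl_datasets.make_blobs(
    n_samples=300, centers=3, cluster_std=0.6, random_state=0)

kmeans = skl_cluster.KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(data)

print(kmeans.cluster_centers_)  # one (x, y) centre per cluster
print(labels[:10])              # cluster index assigned to the first 10 points
```

Note that the number of clusters has to be chosen up front.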

Advantages/limitations of k-means


Spectral clustering

Works better than k-means on data such as concentric circles. Adds extra dimensions to the data.
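The concentric circles case can be sketched as below, assuming scikit-learn is installed. The `make_circles` parameters and the nearest-neighbours affinity are assumptions chosen for the demonstration; the adjusted Rand index is used only as a convenient score against the known circle membership.

```python
import sklearn.cluster as skl_cluster
import sklearn.datasets as skl_datasets
import sklearn.metrics as skl_metrics

# two concentric circles; true_labels records which circle each point is on
data, true_labels = skl_datasets.make_circles(
    n_samples=500, factor=0.5, noise=0.05, random_state=0)

kmeans_labels = skl_cluster.KMeans(
    n_clusters=2, n_init=10, random_state=0).fit_predict(data)

spectral_labels = skl_cluster.SpectralClustering(
    n_clusters=2, affinity="nearest_neighbors", random_state=0).fit_predict(data)

# adjusted Rand index: 1.0 is a perfect match with the true circles,
# values near 0 are no better than random labelling
print("k-means :", skl_metrics.adjusted_rand_score(true_labels, kmeans_labels))
print("spectral:", skl_metrics.adjusted_rand_score(true_labels, spectral_labels))
```

sklearn may warn that the neighbour graph is not fully connected; that is expected here because the two circles do not touch.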


Neural Networks

Based on how the brain works. Concept of the artificial neuron. Good at classification tasks, e.g. image recognition.


Multiple inputs, each multiplied by a weight. Inputs are usually scaled 0 to 1.0. Sum all the weighted inputs. Apply an activation function to the sum; this was a simple threshold in the original perceptron.
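The description above can be sketched as a few lines of Python. The function name `perceptron` and the weights and threshold are hypothetical values, chosen here so that the neuron behaves like a logical AND.

```python
def perceptron(inputs, weights, threshold):
    """Single artificial neuron with a step (threshold) activation."""
    assert len(inputs) == len(weights)
    # weighted sum of the inputs
    total = sum(i * w for i, w in zip(inputs, weights))
    # fire (1) only if the sum reaches the threshold
    return 1 if total >= threshold else 0

# try every combination of two binary inputs
for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", perceptron([a, b], weights=[0.5, 0.5], threshold=1.0))
```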

linear separability problems

Multilayer perceptrons

Solves the linear separability problem.
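XOR is the usual example of a problem that is not linearly separable. The sketch below wires three threshold neurons into two layers by hand to compute it; the weights are picked manually for illustration (nothing is trained), and `xor_network` is a hypothetical name.

```python
def perceptron(inputs, weights, threshold):
    """Single neuron: weighted sum followed by a threshold."""
    total = sum(i * w for i, w in zip(inputs, weights))
    return 1 if total >= threshold else 0

def xor_network(a, b):
    # hidden layer: one neuron computes OR, the other computes NAND
    h_or = perceptron([a, b], weights=[1, 1], threshold=1)
    h_nand = perceptron([a, b], weights=[-1, -1], threshold=-1)
    # output layer: AND of the two hidden neurons gives XOR
    return perceptron([h_or, h_nand], weights=[1, 1], threshold=2)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_network(a, b))
```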

sklearn implementation, MNIST data set

test/training data
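A small sketch of training and testing a multilayer perceptron with sklearn. To avoid a large download it uses the 8x8 digits data set bundled with sklearn as a stand-in, which is not MNIST itself; the layer size, iteration limit and 25% test split are illustrative assumptions.

```python
import sklearn.datasets as skl_datasets
import sklearn.model_selection as skl_msel
import sklearn.neural_network as skl_nn

# small 8x8 digit images bundled with sklearn (a stand-in, not MNIST)
digits = skl_datasets.load_digits()
data = digits.data / 16.0  # pixel values are 0..16, scale them to 0..1
labels = digits.target

# hold back 25% of the images for testing
train_x, test_x, train_y, test_y = skl_msel.train_test_split(
    data, labels, test_size=0.25, random_state=0)

mlp = skl_nn.MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=0)
mlp.fit(train_x, train_y)

# accuracy on images the network did not see during training
print("test accuracy:", mlp.score(test_x, test_y))
```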


Cross Validation

Use all the data for both training and testing, over multiple iterations.
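One common way to do this is k-fold cross validation. The sketch below assumes scikit-learn's `KFold` and ten made-up samples, and only shows how the samples are shared out between training and testing in each iteration.

```python
import numpy as np
import sklearn.model_selection as skl_msel

data = np.arange(10).reshape(-1, 1)  # 10 hypothetical samples

kfold = skl_msel.KFold(n_splits=5, shuffle=True, random_state=0)
test_counts = np.zeros(len(data), dtype=int)
train_counts = np.zeros(len(data), dtype=int)

for fold, (train_idx, test_idx) in enumerate(kfold.split(data)):
    print("fold", fold, "train:", train_idx, "test:", test_idx)
    train_counts[train_idx] += 1
    test_counts[test_idx] += 1

# every sample is tested exactly once and trained on in the other folds
print(test_counts, train_counts)
```

In practice a model is fitted on the training indices and scored on the test indices inside the loop, and the scores are averaged.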