Instructor Notes
Linear regression
- helps us link two variables
- creates line of best fit
- show Gapminder example of life expectancy in the UK
- straight line since 1950
- mathsisfun link for regression
Linear regression code
def least_squares(data):
    x_sum = 0
    y_sum = 0
    x_sq_sum = 0
    xy_sum = 0

    # data must hold exactly two lists (x and y) of equal length
    assert len(data) == 2
    assert len(data[0]) == len(data[1])

    n = len(data[0])
    for i in range(0, n):
        x = int(data[0][i])
        y = data[1][i]
        x_sum = x_sum + x
        y_sum = y_sum + y
        x_sq_sum = x_sq_sum + (x**2)
        xy_sum = xy_sum + (x*y)

    # slope (m) and intercept (c) of the line of best fit
    m = ((n * xy_sum) - (x_sum * y_sum))
    m = m / ((n * x_sq_sum) - (x_sum ** 2))
    c = (y_sum - m * x_sum) / n

    print("Results of linear regression:")
    print("x_sum=", x_sum, "y_sum=", y_sum, "x_sq_sum=", x_sq_sum, "xy_sum=", xy_sum)
    print("m=", m, "c=", c)
    return m, c

x_data = [2,3,5,7,9]
y_data = [4,5,7,10,15]
least_squares([x_data,y_data])
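For reference, the function above evaluates the standard closed-form least squares estimates for the slope m and intercept c. Here n is the number of points and each sum runs over all points:

```latex
m = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2},
\qquad
c = \frac{\sum y_i - m \sum x_i}{n}
```

For the sample data, n = 5, x_sum = 26, y_sum = 41, x_sq_sum = 168 and xy_sum = 263, which gives m = 249/164 ≈ 1.518 and c = 25/82 ≈ 0.305.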
Testing accuracy
import math

def measure_error(data1, data2):
    assert len(data1) == len(data2)
    err_total = 0
    for i in range(0, len(data1)):
        err_total = err_total + (data1[i] - data2[i]) ** 2
    # root mean squared error
    err = math.sqrt(err_total / len(data1))
    return err

m, c = least_squares([x_data,y_data])
linear_data = []
for x in x_data:
    y = m * x + c
    linear_data.append(y)
print(measure_error(y_data,linear_data))
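The error measure above is the root mean squared error (RMSE). As a cross-check for the instructor, the following sketch recomputes the fit and the RMSE for the same sample data with NumPy. Using `np.polyfit` here is an assumption made for brevity, not part of the lesson code.

```python
import numpy as np

x = np.array([2, 3, 5, 7, 9], dtype=float)
y = np.array([4, 5, 7, 10, 15], dtype=float)

# np.polyfit with degree 1 returns the slope and intercept of the least squares line
m, c = np.polyfit(x, y, 1)
fitted = m * x + c

# root mean squared error: square the residuals, average them, take the square root
rmse = np.sqrt(np.mean((y - fitted) ** 2))
print("m =", m, "c =", c, "rmse =", rmse)
```

The values of m and c should agree with the output of least_squares to within floating point rounding.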
Graphing the data
import matplotlib.pyplot as plt

def make_graph(x_data, y_data, linear_data):
    plt.plot(x_data, y_data, label="Original Data")
    plt.plot(x_data, linear_data, label="Line of best fit")
    plt.grid()
    plt.legend()
    plt.show()

x_data = [2,3,5,7,9]
y_data = [4,5,7,10,15]
m,c = least_squares([x_data,y_data])
linear_data = []
for x in x_data:
    y = m * x + c
    # add the result to the linear_data list
    linear_data.append(y)
make_graph(x_data, y_data, linear_data)
Predicting life expectancy
Let's use real data from Gapminder: download gapminder-life-expectancy.csv
Code to load the CSV file and predict
import pandas as pd

def process_life_expectancy_data(filename, country, min_date, max_date):
    df = pd.read_csv(filename, index_col="Life expectancy")
    # convert to a plain list so that least_squares and measure_error can index it by position
    life_expectancy = df.loc[country, str(min_date):str(max_date)].tolist()
    x_data = list(range(min_date, max_date + 1))
    m, c = least_squares([x_data, life_expectancy])
    linear_data = []
    for x in x_data:
        y = m * x + c
        linear_data.append(y)
    error = measure_error(life_expectancy, linear_data)
    print("error is ", error)
    make_graph(x_data, life_expectancy, linear_data)

process_life_expectancy_data("../data/gapminder-life-expectancy.csv", "United Kingdom", 1950, 2010)
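To show what the `read_csv` and `df.loc` calls are doing, here is a small sketch on an in-memory CSV. It assumes the file has one row per country, a first column headed "Life expectancy" and one column per year, which is what the calls above imply. The country names and numbers are made up for illustration.

```python
import io
import pandas as pd

# Made-up numbers in the assumed layout: one row per country, one column per year
csv_text = """Life expectancy,1950,1951,1952,1953
Country A,60.0,60.5,61.0,61.5
Country B,50.0,50.2,50.4,50.6
"""
df = pd.read_csv(io.StringIO(csv_text), index_col="Life expectancy")

# Column labels are strings, so the year bounds are converted with str().
# Label slices in .loc include both end points.
row = df.loc["Country A", str(1951):str(1953)]
years = list(range(1951, 1953 + 1))
print(list(zip(years, row.tolist())))
```

Because the slice is inclusive at both ends, the matching list of years needs `max_date + 1` in `range`.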
Logarithmic regression
A way around the limitations of a linear fit. Use Gapminder graphs to illustrate. A logarithm is the inverse of an exponent.
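A quick demonstration that may help when explaining this: values that grow by a constant factor become evenly spaced after taking the log, and log undoes exp. The numbers are arbitrary.

```python
import math

values = [10, 100, 1000, 10000]         # each value is 10 times the previous one
logs = [math.log10(v) for v in values]  # evenly spaced after taking the log
print(logs)

# exp and log undo each other (up to floating point rounding)
x = 2.5
round_trip = math.log(math.exp(x))
print(round_trip)
```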
example code to load life expectancy and gdp
def read_data(gdp_file, life_expectancy_file, year):
    df_gdp = pd.read_csv(gdp_file, index_col="Country Name")
    gdp = df_gdp.loc[:, year]
    df_life_expt = pd.read_csv(life_expectancy_file, index_col="Life expectancy")
    life_expectancy = df_life_expt.loc[:, year]
    data = []
    for country in life_expectancy.index:
        if country in gdp.index:
            if (math.isnan(life_expectancy[country]) is False) and (math.isnan(gdp[country]) is False):
                data.append((country, life_expectancy[country], gdp[country]))
            else:
                print("Excluding ", country, ",NaN in data (life_exp = ", life_expectancy[country], "gdp=", gdp[country], ")")
        else:
            print(country, "is not in the GDP country data")
    combined = pd.DataFrame.from_records(data, columns=("Country", "Life Expectancy", "GDP"))
    combined = combined.set_index("Country")
    # we'll need sorted data for graphing properly later on
    combined = combined.sort_values("Life Expectancy")
    return combined
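If learners ask whether pandas can do this filtering without an explicit loop, the following is one possible sketch. It is an alternative, not the lesson's method: it keeps only countries present in both series and drops rows containing NaN, but prints no messages about excluded countries. The country labels and values are made up.

```python
import pandas as pd

# Made-up stand-ins for the two columns selected in read_data
life = pd.Series({"A": 70.0, "B": float("nan"), "C": 55.0, "E": 60.0},
                 name="Life Expectancy")
gdp = pd.Series({"A": 30000.0, "B": 1000.0, "D": 500.0, "E": 2000.0},
                name="GDP")

# join="inner" keeps only countries present in both series,
# dropna() then removes rows where either value is NaN
combined = pd.concat([life, gdp], axis=1, join="inner").dropna()
combined = combined.sort_values("Life Expectancy")
print(combined)
```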
Modify process_data function to take the log of the data
add import math
gdp = data["GDP"].tolist()
gdp_log = data["GDP"].apply(math.log).tolist()
life_exp = data["Life Expectancy"].tolist()
m, c = least_squares([life_exp, gdp_log])
When graphing we can choose either the log or the linear version.
# list for logarithmic version
log_data = []
# list for raw version
linear_data = []
for x in life_exp:
    y_log = m * x + c
    log_data.append(y_log)
    y = math.exp(y_log)
    linear_data.append(y)
# uncomment for log version, further changes needed in make_graph too
# make_graph(life_exp, gdp_log, log_data)
make_graph(life_exp, gdp, linear_data)
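The idea behind the steps above is to fit a straight line to log(y) and then apply exp to the predictions to return to the original scale. A minimal self-contained sketch on synthetic data follows. The data are generated from a known exponential rather than taken from Gapminder, and `np.polyfit` stands in for least_squares.

```python
import numpy as np

# synthetic data that follows y = exp(0.1 * x + 2) exactly (made-up, not Gapminder values)
x = np.array([50.0, 60.0, 70.0, 80.0])
y = np.exp(0.1 * x + 2.0)

# fit a straight line to log(y) instead of y
m, c = np.polyfit(x, np.log(y), 1)

# predictions in log space lie on a straight line; exp() maps them back to the original scale
y_log_pred = m * x + c
y_pred = np.exp(y_log_pred)
print("m =", m, "c =", c)
```

Because the synthetic data are exactly exponential, the fit should recover a slope close to 0.1 and an intercept close to 2.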
Change the line in the least_squares function to treat data as floats. Previously we had integers on the x axis for years.
x = int(data[0][i])
becomes
x = data[0][i]
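The reason for this change is that `int()` discards the fractional part, which did no harm for whole-number years but would distort life expectancy values. A tiny illustration with a made-up value:

```python
life_exp_value = 72.9          # made-up value for illustration

truncated = int(life_exp_value)  # int() drops the fractional part
print(truncated, life_exp_value)

year = 1950
print(int(year) == year)         # whole-number years are unaffected by int()
```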
We now need a scatter graph instead of a line plot.
def make_graph(x_data, y_data, linear_data):
    plt.scatter(x_data, y_data, label="Original Data")
    plt.plot(x_data, linear_data, color="orange", label="Line of best fit")
    plt.grid()
    plt.legend()
    plt.show()
Sklearn
sklearn is a library with lots of useful ML functions.
Includes a linear regression library
import numpy as np
import sklearn.linear_model as skl_lin
replace our call to least_squares with:
x_data_arr = np.array(x_data).reshape(-1, 1)
life_exp_arr = np.array(life_expectancy).reshape(-1, 1)
regression = skl_lin.LinearRegression().fit(x_data_arr, life_exp_arr)
m = regression.coef_[0][0]
c = regression.intercept_[0]
computing output changes to
linear_data = regression.predict(x_data_arr)
test it.
Sklearn also includes error measuring code:
import sklearn.metrics as skl_metrics
error = math.sqrt(skl_metrics.mean_squared_error(life_exp_arr, linear_data))
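Learners often ask why `coef_[0][0]` and `intercept_[0]` are needed. The sketch below, run on the small sample data set from earlier rather than the Gapminder file, shows how the shapes of `coef_` and `intercept_` depend on whether the target is passed as a 1-D or a 2-D array.

```python
import math
import numpy as np
import sklearn.linear_model as skl_lin
import sklearn.metrics as skl_metrics

x = np.array([2, 3, 5, 7, 9]).reshape(-1, 1)   # shape (5, 1): one feature per row
y_1d = np.array([4, 5, 7, 10, 15])             # shape (5,)
y_2d = y_1d.reshape(-1, 1)                     # shape (5, 1), as in the notes

reg_1d = skl_lin.LinearRegression().fit(x, y_1d)
reg_2d = skl_lin.LinearRegression().fit(x, y_2d)

# 1-D target: coef_ has shape (1,), intercept_ is a scalar
m_1d, c_1d = reg_1d.coef_[0], reg_1d.intercept_
# 2-D target: coef_ has shape (1, 1), intercept_ has shape (1,)
m_2d, c_2d = reg_2d.coef_[0][0], reg_2d.intercept_[0]

rmse = math.sqrt(skl_metrics.mean_squared_error(y_2d, reg_2d.predict(x)))
print(m_1d, c_1d, m_2d, c_2d, rmse)
```

Both fits give the same line, so either form works as long as the indexing matches the target's shape.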
Exercises
- compare scikit-learn and own implementation of linear regression
- predict German life expectancy
Polynomial regression
Useful for non-linear data.
import sklearn.preprocessing as skl_pre
polynomial_features = skl_pre.PolynomialFeatures(degree=5)
x_poly = polynomial_features.fit_transform(x_data_arr)
polynomial_model = skl_lin.LinearRegression().fit(x_poly, life_exp_arr)
polynomial_data = polynomial_model.predict(x_poly)
make_graph(x_data, life_expectancy, polynomial_data)
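To make it clearer what PolynomialFeatures does, the sketch below uses degree 2 on made-up quadratic data: each x value is expanded into the columns 1, x and x**2, and an ordinary linear regression is then fitted to those columns.

```python
import numpy as np
import sklearn.linear_model as skl_lin
import sklearn.preprocessing as skl_pre

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0]).reshape(-1, 1)
y = 1.0 + 2.0 * x + 3.0 * x ** 2          # made-up quadratic data

features = skl_pre.PolynomialFeatures(degree=2)
x_poly = features.fit_transform(x)         # columns: 1, x, x**2
print(x_poly)

# the "polynomial" model is a linear regression on the expanded columns
model = skl_lin.LinearRegression().fit(x_poly, y)
pred = model.predict(features.transform(np.array([[5.0]])))
print(pred)
```

New x values must go through the same feature expansion before being passed to `predict`.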
Do some predictions:
predictions_x = np.array(list(range(2001,2017))).reshape(-1, 1)
# the features were already fitted above, so transform is enough here
predictions_polynomial = polynomial_model.predict(polynomial_features.transform(predictions_x))
predictions_linear = regression.predict(predictions_x)
measure error:
linear_error = math.sqrt(skl_metrics.mean_squared_error(life_exp_arr, linear_data))
print("linear error is ", linear_error)
polynomial_error = math.sqrt(skl_metrics.mean_squared_error(life_exp_arr, polynomial_data))
print("polynomial error is", polynomial_error)
Clustering
Finds groups in data
Also used in data compression and pattern recognition.
K Means clustering
Analogy: randomly place a load of cafes in a city and see which ones are more popular, then move the unpopular ones closer to the popular ones. Repeat until we have clusters of cafes in a few areas.
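For instructors who want to show the mechanics behind the analogy, here is a minimal NumPy sketch of the usual k-means loop: assign every point to its nearest centre, move each centre to the mean of its points, and repeat. The function name `kmeans_sketch`, the fixed iteration count and the sample points are all assumptions for illustration, and this is not how sklearn implements it internally.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    # start from k distinct data points chosen at random
    centres = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # distance from every point to every centre, shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # leave a centre alone if no point is assigned to it
                centres[j] = members.mean(axis=0)
    return centres, labels

# two made-up, well separated groups of points
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centres, labels = kmeans_sketch(points, k=2)
print(centres)
print(labels)
```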
sklearn has a k-means implementation. Although the algorithm is relatively simple, we'll just stick to their version.
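A minimal sketch of the sklearn version, on made-up points that form two obvious groups. The parameter values (`n_clusters=2`, `n_init=10`, `random_state=0`) are choices made for this example only.

```python
import numpy as np
from sklearn.cluster import KMeans

# made-up 2-D points in two well separated groups
points = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                   [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(model.labels_)           # cluster index assigned to each point
print(model.cluster_centers_)  # coordinates of the two cluster centres

# a new point is assigned to the nearest existing centre
new_label = model.predict(np.array([[9.0, 9.0]]))
print(new_label)
```

The cluster numbers themselves are arbitrary; only which points share a number is meaningful.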
Neural Networks
Based on how the brain works. Concept of artificial neuron. Good at classification tasks, image recognition.
Perceptrons
Multiple inputs, each multiplied by a weight. Usually scaled 0 to 1.0. The weighted inputs are summed and an activation function is applied to the sum. The original perceptron used a threshold.
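The description above can be sketched in a few lines. The function name `perceptron` and the hand-picked weights and threshold are assumptions for illustration. Nothing is learned here; the weights are simply chosen so that the perceptron behaves like logical AND.

```python
def perceptron(inputs, weights, threshold):
    # multiply each input by its weight and add the results together
    total = sum(x * w for x, w in zip(inputs, weights))
    # step activation: fire (1) only if the weighted sum exceeds the threshold
    return 1 if total > threshold else 0

# hand-picked (not learned) weights that make the perceptron behave like logical AND
and_weights = [1.0, 1.0]
and_threshold = 1.5
for a in (0, 1):
    for b in (0, 1):
        print(a, b, perceptron([a, b], and_weights, and_threshold))
```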
linear separability problems
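One way to illustrate the problem is to search for weights and a threshold that reproduce a truth table with a single threshold perceptron. The sketch below tries a coarse grid of values: it finds settings for AND and OR but none for XOR, which is the classic example of a problem that is not linearly separable. The grid search is only an illustration, not a proof, and the helper names are made up.

```python
import itertools

def perceptron(inputs, weights, threshold):
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total > threshold else 0

def find_perceptron(truth_table):
    # try every combination of two weights and a threshold on a coarse grid
    grid = [i / 2 for i in range(-6, 7)]   # -3.0, -2.5, ..., 3.0
    for w1, w2, t in itertools.product(grid, repeat=3):
        if all(perceptron(inp, (w1, w2), t) == out
               for inp, out in truth_table.items()):
            return (w1, w2, t)
    return None  # nothing on the grid reproduces the table

AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
OR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1}
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

print("AND:", find_perceptron(AND))
print("OR:", find_perceptron(OR))
print("XOR:", find_perceptron(XOR))
```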