This lesson is in the early stages of development (Alpha version)

# Linear regression

• helps us link two variables
• creates line of best fit
• show Gapminder example of life expectancy in UK
• straight line since 1950
• mathsisfun link for regression
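For reference, the line of best fit $y = mx + c$ through $n$ points $(x_i, y_i)$ can be written in the usual least-squares closed form. The symbols $m$, $c$ and $n$ are chosen to match the variable names in this lesson's code; this is a reference sketch, not a derivation.

$$
m = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2},
\qquad
c = \frac{\sum y_i - m\sum x_i}{n}
$$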

linear regression code

```python
def least_squares(data):
    x_sum = 0
    y_sum = 0
    x_sq_sum = 0
    xy_sum = 0

    assert len(data[0]) == len(data[1])
    assert len(data) == 2

    n = len(data[0])
    for i in range(0, n):
        x = int(data[0][i])
        y = data[1][i]
        x_sum = x_sum + x
        y_sum = y_sum + y
        x_sq_sum = x_sq_sum + (x**2)
        xy_sum = xy_sum + (x*y)

    m = ((n * xy_sum) - (x_sum * y_sum))
    m = m / ((n * x_sq_sum) - (x_sum ** 2))
    c = (y_sum - m * x_sum) / n

    print("Results of linear regression:")
    print("x_sum=", x_sum, "y_sum=", y_sum, "x_sq_sum=", x_sq_sum, "xy_sum=", xy_sum)
    print("m=", m, "c=", c)
    return m, c

x_data = [2,3,5,7,9]
y_data = [4,5,7,10,15]
least_squares([x_data,y_data])
```

testing accuracy

```python
import math

def measure_error(data1, data2):
    assert len(data1) == len(data2)
    err_total = 0
    for i in range(0, len(data1)):
        err_total = err_total + (data1[i] - data2[i]) ** 2
    # root mean squared error
    err = math.sqrt(err_total / len(data1))
    return err
```

```python
m, c = least_squares([x_data,y_data])
linear_data = []
for x in x_data:
    y = m * x + c
    linear_data.append(y)
print(measure_error(y_data,linear_data))
```
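The error measure used here is the root mean squared error (RMSE): square each difference, average the squares, then take the square root. A small worked example with made-up numbers may make the arithmetic easier to follow; the function name `rmse` is only for this illustration.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    assert len(actual) == len(predicted)
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up values: only the last prediction is wrong, by 3
actual = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 6.0]

error = rmse(actual, predicted)
print(error)  # sqrt((0 + 0 + 9) / 3) = sqrt(3)
```

Because the differences are squared before averaging, one large miss affects the result more than several small ones.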

Graphing the data

```python
import matplotlib.pyplot as plt

def make_graph(x_data, y_data, linear_data):
    plt.plot(x_data, y_data, label="Original Data")
    plt.plot(x_data, linear_data, label="Line of best fit")
    plt.grid()
    plt.legend()
    plt.show()

x_data = [2,3,5,7,9]
y_data = [4,5,7,10,15]
m,c = least_squares([x_data,y_data])
linear_data = []
for x in x_data:
    y = m * x + c
    # add the result to the linear_data list
    linear_data.append(y)
make_graph(x_data, y_data, linear_data)
```

Predicting life expectancy

Let's use real data from Gapminder: download gapminder-life-expectancy.csv

Code to load the CSV file and predict

```python
import pandas as pd

def process_life_expectancy_data(filename, country, min_date, max_date):
    df = pd.read_csv(filename, index_col="Life expectancy")
    life_expectancy = df.loc[country, str(min_date):str(max_date)]
    x_data = list(range(min_date, max_date + 1))
    m, c = least_squares([x_data, life_expectancy])
    linear_data = []
    for x in x_data:
        y = m * x + c
        linear_data.append(y)
    error = measure_error(life_expectancy, linear_data)
    print("error is ", error)
    make_graph(x_data, life_expectancy, linear_data)

process_life_expectancy_data("../data/gapminder-life-expectancy.csv", "United Kingdom", 1950, 2010)
```

## Exercises

• model life expectancy for Germany 1950-2000
• predict German life expectancy 2001-2016

## Logarithmic regression

A way around the limitations of linear regression. Use Gapminder graphs to illustrate that the logarithm is the inverse of the exponential.

example code to load life expectancy and gdp

```python
def read_data(gdp_file, life_expectancy_file, year):
    df_gdp = pd.read_csv(gdp_file, index_col="Country Name")
    gdp = df_gdp.loc[:, year]
    df_life_expt = pd.read_csv(life_expectancy_file, index_col="Life expectancy")
    life_expectancy = df_life_expt.loc[:, year]
    data = []
    for country in life_expectancy.index:
        if country in gdp.index:
            if (math.isnan(life_expectancy[country]) is False) and (math.isnan(gdp[country]) is False):
                data.append((country, life_expectancy[country], gdp[country]))
            else:
                print("Excluding ", country, ",NaN in data (life_exp = ", life_expectancy[country], "gdp=", gdp[country], ")")
        else:
            print(country, "is not in the GDP country data")
    combined = pd.DataFrame.from_records(data, columns=("Country", "Life Expectancy", "GDP"))
    combined = combined.set_index("Country")
    # we'll need sorted data for graphing properly later on
    combined = combined.sort_values("Life Expectancy")
    return combined
```

Modify the process_data function to take the log of the data.

Add `import math` at the top of the file if it is not already imported.

```python
gdp = data["GDP"].tolist()
gdp_log = data["GDP"].apply(math.log).tolist()
life_exp = data["Life Expectancy"].tolist()
m, c = least_squares([life_exp, gdp_log])
```

when graphing we can choose either the log or the linear version.

```python
    # list for logarithmic version
    log_data = []
    # list for raw version
    linear_data = []
    for x in life_exp:
        y_log = m * x + c
        log_data.append(y_log)
        y = math.exp(y_log)
        linear_data.append(y)
    # uncomment for log version, further changes needed in make_graph too
    # make_graph(life_exp, gdp_log, log_data)
    make_graph(life_exp, gdp, linear_data)
```

change line in least_squares function to treat data as floats, previously we had integers on the x axis for years

` x = int(data[0][i])`

becomes

` x = data[0][i]`

Now we need a scatter graph instead of a line plot.

```python
def make_graph(x_data, y_data, linear_data):
    plt.scatter(x_data, y_data, label="Original Data")
    plt.plot(x_data, linear_data, color="orange", label="Line of best fit")
    plt.grid()
    plt.legend()
    plt.show()
```

## Exercises

• compare log and linear graphs
• remove outliers from the data

# Sklearn

sklearn is a library with lots of useful ML functions.

Includes a linear regression library

```python
import numpy as np
import sklearn.linear_model as skl_lin
```

replace our call to least_squares with:

```python
x_data_arr = np.array(x_data).reshape(-1, 1)
life_exp_arr = np.array(life_expectancy).reshape(-1, 1)
regression = skl_lin.LinearRegression().fit(x_data_arr, life_exp_arr)
m = regression.coef_[0][0]
c = regression.intercept_[0]
```

computing output changes to

`linear_data = regression.predict(x_data_arr)`

test it.

Sklearn also includes error measuring code:

```python
import sklearn.metrics as skl_metrics
error = math.sqrt(skl_metrics.mean_squared_error(life_exp_arr, linear_data))
```
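To see the sklearn pieces working together without the CSV file, here is a self-contained sketch on the small `x_data`/`y_data` lists from the start of the lesson. It uses the same calls as the snippets above; the variable names `x_arr`, `y_arr` and `predicted` are only for this illustration.

```python
import math

import numpy as np
import sklearn.linear_model as skl_lin
import sklearn.metrics as skl_metrics

x_data = [2, 3, 5, 7, 9]
y_data = [4, 5, 7, 10, 15]

# sklearn expects 2D arrays of shape (n_samples, n_features)
x_arr = np.array(x_data).reshape(-1, 1)
y_arr = np.array(y_data).reshape(-1, 1)

regression = skl_lin.LinearRegression().fit(x_arr, y_arr)
m = regression.coef_[0][0]
c = regression.intercept_[0]

# Predict on the training inputs and measure the RMSE
predicted = regression.predict(x_arr)
error = math.sqrt(skl_metrics.mean_squared_error(y_arr, predicted))

print("m =", m, "c =", c, "error =", error)
```

The slope and intercept should agree with the values printed by `least_squares` for the same lists, up to floating point rounding.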

## Exercises

• compare scikit-learn and own implementation of linear regression
• predict German life expectancy

## Polynomial regression

Useful for non-linear data.

`import sklearn.preprocessing as skl_pre`

```python
polynomial_features = skl_pre.PolynomialFeatures(degree=5)
x_poly = polynomial_features.fit_transform(x_data_arr)
polynomial_model = skl_lin.LinearRegression().fit(x_poly, life_exp_arr)
polynomial_data = polynomial_model.predict(x_poly)
make_graph(x_data, life_expectancy, polynomial_data)
```

do some predictions:

```python
predictions_x = np.array(list(range(2001,2017))).reshape(-1, 1)
predictions_polynomial = polynomial_model.predict(polynomial_features.fit_transform(predictions_x))
predictions_linear = regression.predict(predictions_x)
```

measure error:

```python
linear_error = math.sqrt(skl_metrics.mean_squared_error(life_exp_arr, linear_data))
print("linear error is ", linear_error)
polynomial_error = math.sqrt(skl_metrics.mean_squared_error(life_exp_arr, polynomial_data))
print("polynomial error is", polynomial_error)
```
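The comparison can be tried end to end on made-up curved data where the right answer is known. The sketch below fits a straight line and a degree 2 polynomial to `y = x^2`; the data and the choice of degree are assumptions for illustration only. New inputs are passed through the already-fitted transformer with `transform`.

```python
import math

import numpy as np
import sklearn.linear_model as skl_lin
import sklearn.metrics as skl_metrics
import sklearn.preprocessing as skl_pre

# Made-up curved data: y = x^2 for x = 0..9
x = np.arange(0, 10).reshape(-1, 1)
y = (x ** 2).astype(float)

# Straight-line fit
linear_model = skl_lin.LinearRegression().fit(x, y)
linear_error = math.sqrt(skl_metrics.mean_squared_error(y, linear_model.predict(x)))

# Degree 2 polynomial fit: expand x into [1, x, x^2], then fit linearly
features = skl_pre.PolynomialFeatures(degree=2)
x_poly = features.fit_transform(x)
poly_model = skl_lin.LinearRegression().fit(x_poly, y)
poly_error = math.sqrt(skl_metrics.mean_squared_error(y, poly_model.predict(x_poly)))

# Predict for new x values using the already-fitted transformer
x_new = np.array([[10], [11]])
poly_predictions = poly_model.predict(features.transform(x_new))

print("linear error is", linear_error)
print("polynomial error is", poly_error)
print("predictions for x = 10, 11:", poly_predictions.ravel())
```

A lower error on the fitted data does not by itself mean better predictions: a high degree polynomial can follow the training points closely and still extrapolate poorly, so it is worth comparing both models on years that were held back.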

## Exercises

• compare linear and polynomial models

# Clustering

Finds groups in data

Also used in data compression and pattern recognition.

## K Means clustering

Analogy: randomly place a load of cafes in a city, see which ones are more popular, move the unpopular ones closer to the popular ones. Repeat until we have clusters of cafes in a few areas.

sklearn has a kmeans implementation. Although the algorithm is relatively simple, we’ll just stick to their version.
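A minimal sketch of the sklearn version, assuming made-up data from `make_blobs` (300 points around 3 centres) rather than any dataset from this lesson. The number of clusters has to be given up front via `n_clusters`.

```python
import sklearn.cluster as skl_cluster
import sklearn.datasets as skl_datasets

# Made-up data: 300 two-dimensional points scattered around 3 centres
points, true_labels = skl_datasets.make_blobs(
    n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# n_clusters must be chosen in advance; random_state makes the run repeatable
kmeans = skl_cluster.KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

print(kmeans.cluster_centers_)  # one (x, y) centre per cluster
print(labels[:10])              # cluster index assigned to the first 10 points
```

`cluster_centers_` plays the role of the cafes in the analogy: each point ends up labelled with its nearest centre.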

### Advantages/limitations of kmeans

• requires number of clusters to be known in advance, struggles on irregular or overlapping/concentric shapes.
• fast and easy to compute
• low memory overhead, suitable for large datasets
• good default option

### Exercises

• Kmeans with overlapping clusters
• how many clusters

## Spectral clustering

works better with concentric circles. Adds extra dimensions to the data.
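The difference can be sketched on made-up concentric rings from `make_circles`. The `nearest_neighbors` affinity and the adjusted Rand index used for scoring are choices made for this illustration, not something the lesson prescribes.

```python
import sklearn.cluster as skl_cluster
import sklearn.datasets as skl_datasets
import sklearn.metrics as skl_metrics

# Made-up data: two concentric rings of points
points, true_labels = skl_datasets.make_circles(
    n_samples=300, factor=0.5, noise=0.05, random_state=0)

kmeans_labels = skl_cluster.KMeans(
    n_clusters=2, n_init=10, random_state=0).fit_predict(points)

spectral = skl_cluster.SpectralClustering(
    n_clusters=2, affinity="nearest_neighbors", random_state=0)
spectral_labels = spectral.fit_predict(points)

# Adjusted Rand index: 1.0 means the clusters match the true rings,
# values near 0 mean no better than random assignment
kmeans_score = skl_metrics.adjusted_rand_score(true_labels, kmeans_labels)
spectral_score = skl_metrics.adjusted_rand_score(true_labels, spectral_labels)

print("kmeans:", kmeans_score)
print("spectral:", spectral_score)
```

K-means tends to cut the rings in half with a straight boundary, while spectral clustering can follow the ring shapes. sklearn may warn that the neighbour graph is not fully connected on data like this.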

### Exercises

• comparing kmeans and spectral performance

# Neural Networks

Based on how the brain works. Concept of artificial neuron. Good at classification tasks, image recognition.

## Perceptrons

Multiple inputs, each multiplied by a weight. Inputs are usually scaled from 0 to 1.0. The weighted inputs are summed and an activation function is applied to the sum; the original perceptron used a threshold.

linear separability problems
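A single perceptron can be sketched in a few lines of plain Python. The training loop below uses the classic perceptron update rule; the function names, learning rate and epoch count are assumptions for illustration. It learns AND, which is linearly separable, but cannot get XOR fully right, because no single straight line separates the XOR outputs.

```python
def step(total):
    """Threshold activation: output 1 only if the weighted sum is positive."""
    return 1 if total > 0 else 0

def predict(weights, bias, inputs):
    """Weighted sum of the inputs plus a bias, passed through the threshold."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return step(total)

def train(samples, targets, learning_rate=1.0, epochs=20):
    """Perceptron learning rule: nudge the weights after each wrong answer."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in zip(samples, targets):
            error = target - predict(weights, bias, inputs)
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias

samples = [(0, 0), (0, 1), (1, 0), (1, 1)]

# AND is linearly separable, so the perceptron can learn it
and_targets = [0, 0, 0, 1]
weights, bias = train(samples, and_targets)
and_outputs = [predict(weights, bias, s) for s in samples]
print("AND:", and_outputs)

# XOR is not linearly separable, so at least one output stays wrong
xor_targets = [0, 1, 1, 0]
xor_weights, xor_bias = train(samples, xor_targets)
xor_outputs = [predict(xor_weights, xor_bias, s) for s in samples]
print("XOR:", xor_outputs)
```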

### Multilayer perceptrons

solves linear separability

sklearn implementation, MNIST data set

test/training data
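A minimal sketch of training and testing a multilayer perceptron with sklearn. To keep it self-contained it uses sklearn's small built-in 8x8 digits set (`load_digits`) as a stand-in for MNIST, which would need to be downloaded; the hidden layer size, iteration limit and 25% test split are arbitrary choices for illustration.

```python
import sklearn.datasets as skl_datasets
import sklearn.model_selection as skl_msel
import sklearn.neural_network as skl_nn

# Stand-in for MNIST: sklearn's built-in 8x8 handwritten digits
digits = skl_datasets.load_digits()

# Pixel values run from 0 to 16; scale them to the range 0 to 1
data = digits.data / 16.0
labels = digits.target

# Hold back 25% of the samples for testing
train_x, test_x, train_y, test_y = skl_msel.train_test_split(
    data, labels, test_size=0.25, random_state=1)

# One hidden layer of 50 neurons
mlp = skl_nn.MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=1)
mlp.fit(train_x, train_y)

# Accuracy is measured on data the network never saw during training
accuracy = mlp.score(test_x, test_y)
print("accuracy on held-out test data:", accuracy)
```

Scoring on the held-out portion rather than the training portion is what shows whether the network has learned something general or has just memorised its training examples.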

### Exercises

• changing learning parameters
• using your own handwriting

## Cross Validation

use all the data for both training/testing. Multiple iterations.
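The splitting itself can be sketched with sklearn's `KFold` on ten placeholder sample indices. Each iteration trains on four fifths of the data and tests on the remaining fifth, so every sample is used for testing exactly once. The fold count of 5 is an arbitrary choice for illustration.

```python
import numpy as np
import sklearn.model_selection as skl_msel

# Ten placeholder samples, identified by index only
data = np.arange(10)

kfold = skl_msel.KFold(n_splits=5, shuffle=True, random_state=0)

# Count how many times each sample lands in a test fold
test_counts = np.zeros(len(data), dtype=int)
for fold, (train_idx, test_idx) in enumerate(kfold.split(data)):
    print("fold", fold, "train:", train_idx, "test:", test_idx)
    test_counts[test_idx] += 1

print(test_counts)
```

In a real run, a model would be fitted on `data[train_idx]` and scored on `data[test_idx]` inside the loop, and the scores averaged across folds.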

### Exercises

• cloud image classification