Introduction
What is machine learning?
Machine learning is a set of tools and techniques which let us find patterns in data. This lesson will introduce you to a few of these techniques, but there are many more which we simply don’t have time to cover here.
The techniques break down into two broad categories, predictors and classifiers. Predictors are used to predict a value (or set of values) given a set of inputs, for example predicting the cost of something given the economic conditions and the cost of raw materials, or predicting a country’s GDP given its life expectancy. Classifiers try to sort data into different categories, for example deciding which characters are visible in a picture of some writing, or whether a message is spam or not.
Training Data
Many (but not all) machine learning systems “learn” by taking a series of input data and output data and using it to form a model. The maths behind the machine learning doesn’t care what the data is as long as it can be represented numerically or categorised. Some examples might include:
 predicting a person’s weight based on their height
 predicting commute times given traffic conditions
 predicting house prices given stock market prices
 classifying if an email is spam or not
 classifying whether an image contains a person or not
Typically we will need to train our models with hundreds, thousands or even millions of examples before they work well enough to do any useful predictions or classifications with them.
Some systems will do training as a one-shot process which produces a model. Others might try to continuously refine their training through the real use of the system and human feedback to it. For example, every time you mark an email as spam or not spam you are probably contributing to further training of your spam filter’s model.
Types of output
Predictors will usually involve a continuous scale of outputs, such as the price of something. Classifiers will tell you which class (or classes) are present in the data. For example, a system to recognise handwriting from an input image will need to classify the output into one of a set of potential characters.
Machine learning vs Artificial Intelligence
Artificial Intelligence (AI) is a very broad term. It often means a system with general intelligence, able to solve any problem. ML systems are usually trained to work on one particular problem. They can appear to “learn”, but they are not a general intelligence that can solve anything a human could. They often need hundreds or thousands of examples to learn and are confined to relatively simple classifications. A human-like system could learn from a single example.
Another definition of Artificial Intelligence dates back to the 1950s and Alan Turing’s “Imitation Game”. This said that we could consider a system intelligent when it could fool a human into thinking they were talking to another human when they were actually talking to a computer. Modern attempts at this are getting close to fooling humans, but we are still a very long way from a machine which has full human-like intelligence.
Over Hyping of Artificial Intelligence and Machine Learning
There is a lot of hype around machine learning and artificial intelligence right now. While many real advances have been made, a lot of people are overstating what can be achieved. Recent advances in computer hardware and machine learning algorithms have made machine learning a lot more useful, but it has been around for over 50 years.
The Gartner Hype Cycle looks at which technologies are being overhyped. In the August 2018 analysis AI Platform as a service, Deep Learning chips, Deep learning neural networks, Conversational AI and Self Driving Cars are all shown near the “Peak of inflated expectations”.
Image from Jeremy Kemp via Wikimedia
Applications of machine learning
Machine learning in our daily lives
 Image recognition
 Object classification
 Character recognition
 Insurance payout predictions
 Crime prediction
Example of machine learning in research
 Classifying remote sensing images to find water.
 Looking for breast cancer in medical images
 Predicting what cows are doing from GPS data
Limitations of Machine Learning
Garbage In = Garbage Out
There is a classic expression in Computer Science, “Garbage In = Garbage Out”. This means that if the input data we use is garbage then the output will be too. If, for instance, we try to get a machine learning system to find a link between two unlinked variables then it might still come up with a model that attempts this, but the output will be meaningless.
Bias or lacking training data
Input data may also be lacking enough diversity to cover all examples. Due to how the data was obtained there might be biases in it that are then reflected in the ML system. For example if we collect data on crime reporting it could be biased towards wealthier areas where crimes are more likely to be reported. Historical data might not cover enough history.
Extrapolation
We can only make reliable predictions about data which is in the same range as our training data. If we try to extrapolate beyond what was covered in the training data we’ll probably get wrong answers.
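As a rough illustration of this point, the sketch below fits a straight line to made-up data that actually follows a curve (y = x squared), using numpy’s polyfit function. The data and variable names are assumptions chosen for the example and are not part of the lesson’s datasets.

```python
import numpy as np

# Hypothetical training data: y = x squared, sampled only for x between 0 and 5.
train_x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
train_y = train_x ** 2

# Fit a straight line (a degree 1 polynomial) to the training data.
m, c = np.polyfit(train_x, train_y, 1)

# Compare the prediction error inside and far outside the training range.
inside_x, outside_x = 3.0, 10.0
inside_error = abs((m * inside_x + c) - inside_x ** 2)
outside_error = abs((m * outside_x + c) - outside_x ** 2)

print("error at x=3: ", inside_error)
print("error at x=10:", outside_error)
```

The error at x=10 should be many times larger than the error at x=3, because the model has never seen data from that part of the curve.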
Over fitting
Sometimes ML algorithms become over trained on their training data and struggle to work when presented with real data. In some cases it is best not to train too many times.
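One way to picture over fitting is to give a model so much freedom that it memorises the noise in its training data. The sketch below is only an illustration under assumed, made-up measurements: it compares a straight line with a degree 5 polynomial, both fitted with numpy’s polyfit function.

```python
import numpy as np

# Hypothetical noisy measurements of a roughly linear relationship (y is about x).
train_x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
train_y = np.array([0.0, 1.3, 1.8, 3.2, 3.9, 5.1])

def rms_error(predicted, actual):
    """Root mean square difference between two arrays."""
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))

# A straight line cannot pass through every noisy point...
line = np.polyfit(train_x, train_y, 1)
# ...but a degree 5 polynomial has enough freedom to hit all six of them.
wiggly = np.polyfit(train_x, train_y, 5)

line_train_error = rms_error(np.polyval(line, train_x), train_y)
wiggly_train_error = rms_error(np.polyval(wiggly, train_x), train_y)

# A new, unseen measurement that follows the same underlying trend.
new_x, new_y = 6.0, 6.0
line_new_error = abs(np.polyval(line, new_x) - new_y)
wiggly_new_error = abs(np.polyval(wiggly, new_x) - new_y)

print("training error: line", line_train_error, "polynomial", wiggly_train_error)
print("new point error: line", line_new_error, "polynomial", wiggly_new_error)
```

The polynomial looks better on the training data but does far worse on the new point, which is the pattern to watch for when a model is over fitted.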
Inability to explain answers
Many machine learning techniques will give us an answer given some input data even if that answer is wrong. Most are unable to explain any kind of logic in arriving at that answer. This can make diagnosing and even detecting problems with them difficult.
Where have you encountered machine learning already?
Discuss with the person next to you:
 Where have I seen machine learning in use?
 What kind of input data does that machine learning system use to make predictions/classifications?
 Is there any evidence that your interaction with the system contributes to further training?
 Do you have any examples of the system failing?
Write your answers into the etherpad.
Regression
Linear regression
If we take two variables and graph them against each other we can look for relationships between them. Once this relationship is established we can use it to produce a model which will help us predict future values of one variable given the other.
If the two variables form a linear relationship (a straight line can be drawn to link them) then we can create a linear equation to link them. This will be of the form y = m * x + c, where x is the variable we know, y is the variable we’re calculating, m is the slope of the line linking them and c is the point at which the line crosses the y axis (where x = 0).
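To make the roles of m and c concrete, here is a minimal sketch that works them out for the line passing through two points and then uses the equation to predict a new value. The function name line_through and the example points are made up for illustration.

```python
def line_through(point_a, point_b):
    """Return the slope m and intercept c of the line through two points."""
    (x1, y1), (x2, y2) = point_a, point_b
    m = (y2 - y1) / (x2 - x1)   # how much y changes for each unit of x
    c = y1 - m * x1             # the value of y where x = 0
    return m, c

m, c = line_through((1, 5), (3, 9))
predicted_y = m * 4 + c
print("m =", m, "c =", c, "y at x=4:", predicted_y)
```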
Using the Gapminder website we can graph all sorts of data about the development of different countries. Let’s have a look at the change in life expectancy over time in the United Kingdom.
Since around 1950 life expectancy appears to be increasing along a pretty straight line, in other words a linear relationship. We can use this data to try to calculate a line of best fit that will attempt to draw a perfectly straight line through this data. One method we can use is called linear regression or least squares regression. The linear regression will create a linear equation that minimises the sum of the squared vertical distances from the line of best fit to each point in the graph. It will calculate the values of m and c for a linear equation for us. We could do this manually, but let’s use Python to do it for us.
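For reference, the standard closed-form solution for a least squares line through n points (x_i, y_i) can be written as follows. This is the textbook formula, stated here on the assumption that it is the one the lesson’s code implements; the sums match the x_sum, y_sum, x_sq_sum and xy_sum values that the code prints.

```latex
m = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}
\qquad
c = \frac{\sum y_i - m \sum x_i}{n}
```

These are the values of m and c that make the total squared error, \sum (y_i - (m x_i + c))^2, as small as possible.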
Coding a linear regression with Python
This code will calculate a least squares or linear regression for us.
def least_squares(data):
    x_sum = 0
    y_sum = 0
    x_sq_sum = 0
    xy_sum = 0

    # the list of data should have two equal length columns
    assert len(data) == 2
    assert len(data[0]) == len(data[1])

    n = len(data[0])
    # least squares regression calculation
    for i in range(0, n):
        x = int(data[0][i])
        y = data[1][i]
        x_sum = x_sum + x
        y_sum = y_sum + y
        x_sq_sum = x_sq_sum + (x**2)
        xy_sum = xy_sum + (x*y)

    m = ((n * xy_sum) - (x_sum * y_sum))
    m = m / ((n * x_sq_sum) - (x_sum ** 2))
    c = (y_sum - m * x_sum) / n

    print("Results of linear regression:")
    print("x_sum=", x_sum, "y_sum=", y_sum, "x_sq_sum=", x_sq_sum, "xy_sum=",
          xy_sum)
    print("m=", m, "c=", c)

    return m, c
Let’s test our code by using the example data from the mathsisfun link above.
x_data = [2,3,5,7,9]
y_data = [4,5,7,10,15]
least_squares([x_data,y_data])
We should get the following results:
Results of linear regression:
x_sum= 26 y_sum= 41 x_sq_sum= 168 xy_sum= 263
m= 1.5182926829268293 c= 0.30487804878048763
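As an independent check on these numbers, the same line can be fitted with numpy’s polyfit function, which performs a least squares fit of a polynomial of the requested degree. This is offered as a cross-check only and is not part of the lesson’s own code.

```python
import numpy as np

x_data = [2, 3, 5, 7, 9]
y_data = [4, 5, 7, 10, 15]

# with degree 1, polyfit fits a straight line and returns [slope, intercept]
m, c = np.polyfit(x_data, y_data, 1)
print("m=", m, "c=", c)
```

The values should agree with those above to many decimal places, although the last few digits may differ because the arithmetic is done in a different order.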
Testing the accuracy of a linear regression model
We now have a simple linear model for some data. It would be useful to test how accurate that model is. We can do this by computing the y value for every x value used in our original data and comparing the model’s y value with the original. We can turn this into a single overall error number by calculating the root mean square (RMS) error. This squares each difference, takes the sum of all of them, divides this by the number of items and finally takes the square root of that value. By squaring and then taking the square root we prevent negative errors from cancelling out positive ones. The RMS gives us an overall error number which we can then use to measure our model’s accuracy. The following code calculates RMS in Python.
import math

def measure_error(data1, data2):
    assert len(data1) == len(data2)
    err_total = 0
    for i in range(0, len(data1)):
        err_total = err_total + (data1[i] - data2[i]) ** 2

    err = math.sqrt(err_total / len(data1))
    return err
To calculate the RMS for the test data we just used we need to calculate the y coordinate for every x coordinate (2,3,5,7,9) that we had in the original data.
# get the m and c values from the least_squares function
m, c = least_squares([x_data, y_data])

# create an empty list for the model y data
linear_data = []

for x in x_data:
    y = m * x + c
    # add the result to the linear_data list
    linear_data.append(y)

# calculate the error
print(measure_error(y_data, linear_data))
This will output an error of 0.7986268703523449, which means that the typical difference between our model and the real values is about 0.8. The less linear the data is the bigger this number will be. If the model perfectly matches the data then the value will be zero.
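To see why the squaring step matters, the sketch below lists the individual differences (residuals) for the test data. It reuses the m and c values printed earlier as fixed constants, which is an assumption made so that the example stands alone.

```python
import math

x_data = [2, 3, 5, 7, 9]
y_data = [4, 5, 7, 10, 15]
# slope and intercept reported by least_squares for this data
m, c = 1.5182926829268293, 0.30487804878048763

# difference between each real y value and the model's y value
residuals = [y - (m * x + c) for x, y in zip(x_data, y_data)]
squared = [r ** 2 for r in residuals]
rms = math.sqrt(sum(squared) / len(squared))

for x, r in zip(x_data, residuals):
    print("x =", x, "residual =", round(r, 4))
print("sum of residuals:", round(sum(residuals), 10))
print("RMS error:", rms)
```

The plain sum of the residuals comes out at essentially zero because the positive and negative differences cancel, whereas the RMS error does not suffer from this.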
Graphing the data
To compare our model and data let’s graph both of them using matplotlib.
import matplotlib.pyplot as plt

def calculate_linear(x_data, m, c):
    linear_data = []
    for x in x_data:
        y = m * x + c
        # add the result to the linear_data list
        linear_data.append(y)
    return linear_data

def make_graph(x_data, y_data, linear_data):
    plt.plot(x_data, y_data, label="Original Data")
    plt.plot(x_data, linear_data, label="Line of best fit")
    plt.grid()
    plt.legend()
    plt.show()

x_data = [2,3,5,7,9]
y_data = [4,5,7,10,15]
m, c = least_squares([x_data, y_data])
linear_data = calculate_linear(x_data, m, c)
make_graph(x_data, y_data, linear_data)
Predicting life expectancy
Now let’s try to model some real data with linear regression. We’ll use the Gapminder Foundation’s life expectancy data for this. The download location is given in the Modelling Life Expectancy exercise below.
# put this line at the top of the file
import pandas as pd

def process_life_expectancy_data(filename, country, min_date, max_date):
    df = pd.read_csv(filename, index_col="Life expectancy")

    # get the life expectancy for the specified country/dates
    # we have to convert the dates to strings as pandas treats them that way
    # tolist() gives a plain list that can be indexed by position
    life_expectancy = df.loc[country, str(min_date):str(max_date)].tolist()

    # create a list with the numerical range of min_date to max_date
    # we could use the column labels of the data but they are strings
    # we need numerical data
    x_data = list(range(min_date, max_date + 1))

    # calculate line of best fit
    m, c = least_squares([x_data, life_expectancy])
    linear_data = calculate_linear(x_data, m, c)

    error = measure_error(life_expectancy, linear_data)
    print("error is ", error)

    make_graph(x_data, life_expectancy, linear_data)

process_life_expectancy_data("../data/gapminderlifeexpectancy.csv",
                             "United Kingdom", 1950, 2010)
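The way the year range is selected can be tried out without the real file. The sketch below builds a tiny made-up table in the same layout (one row per country, one column per year) and selects a range of years with .loc. The country names and values are invented for illustration.

```python
import io
import pandas as pd

# A made-up miniature of the file layout: one row per country, one column per year.
csv_text = """Life expectancy,1950,1951,1952,1953
Freedonia,60.0,60.5,61.0,61.5
Sylvania,55.0,55.2,55.9,56.1
"""
df = pd.read_csv(io.StringIO(csv_text), index_col="Life expectancy")

# Column labels read from a CSV header are strings, so the slice uses strings.
# Unlike normal Python slices, a label slice in .loc includes both end points.
selected = df.loc["Freedonia", "1951":"1953"]
print(list(selected.index))
print(selected.tolist())
```

Because the end point is included, a range of 1950 to 2010 contains 61 values, which is why the matching list of x values is built with range(min_date, max_date + 1).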
Modelling Life Expectancy
Combine all the code above into a single Python file and save it into a directory called code.
In the parent directory create another directory called data.
Download the file https://scwaberystwyth.github.io/machinelearningnovice/data/gapminderlifeexpectancy.csv into the data directory. The full code from above is also available to download from https://scwaberystwyth.github.io/machinelearningnovice/code/linear_regression.py
If you’re using a Unix or Unix-like environment the following commands will do this in your home directory:
cd ~
mkdir code
mkdir data
cd data
wget https://scwaberystwyth.github.io/machinelearningnovice/data/gapminderlifeexpectancy.csv
Adjust the program to calculate the life expectancy for Germany between 1950 and 2000. What are the values (m and c) of the linear equation linking date and life expectancy?
Solution
process_life_expectancy_data("../data/gapminderlifeexpectancy.csv", "Germany", 1950, 2000)
m= 0.212219909502 c= -346.784909502
Predicting Life Expectancy
Use the linear equation you’ve just created to predict life expectancy in Germany for every year between 2001 and 2016. How accurate are your answers? If you worked for a pension scheme would you trust your answers to predict the future costs for paying pensioners?
Solution
for x in range(2001, 2017):
    print(x, 0.212219909502 * x - 346.784909502)
Predicted answers:
2001 77.86712941150199
2002 78.07934932100403
2003 78.29156923050601
2004 78.503789140008
2005 78.71600904951003
2006 78.92822895901202
2007 79.140448868514
2008 79.35266877801604
2009 79.56488868751802
2010 79.77710859702
2011 79.98932850652199
2012 80.20154841602402
2013 80.41376832552601
2014 80.62598823502799
2015 80.83820814453003
2016 81.05042805403201
Compare with the real values:
df = pd.read_csv('../data/gapminderlifeexpectancy.csv', index_col="Life expectancy")
for x in range(2001, 2017):
    y = 0.212219909502 * x - 346.784909502
    real = df.loc['Germany', str(x)]
    print(x, "Predicted", y, "Real", real, "Difference", y - real)
2001 Predicted 77.86712941150199 Real 78.4 Difference -0.532870588498
2002 Predicted 78.07934932100403 Real 78.6 Difference -0.520650678996
2003 Predicted 78.29156923050601 Real 78.8 Difference -0.508430769494
2004 Predicted 78.503789140008 Real 79.2 Difference -0.696210859992
2005 Predicted 78.71600904951003 Real 79.4 Difference -0.68399095049
2006 Predicted 78.92822895901202 Real 79.7 Difference -0.771771040988
2007 Predicted 79.140448868514 Real 79.9 Difference -0.759551131486
2008 Predicted 79.35266877801604 Real 80.0 Difference -0.647331221984
2009 Predicted 79.56488868751802 Real 80.1 Difference -0.535111312482
2010 Predicted 79.77710859702 Real 80.3 Difference -0.52289140298
2011 Predicted 79.98932850652199 Real 80.5 Difference -0.510671493478
2012 Predicted 80.20154841602402 Real 80.6 Difference -0.398451583976
2013 Predicted 80.41376832552601 Real 80.7 Difference -0.286231674474
2014 Predicted 80.62598823502799 Real 80.7 Difference -0.074011764972
2015 Predicted 80.83820814453003 Real 80.8 Difference 0.03820814453
2016 Predicted 81.05042805403201 Real 80.9 Difference 0.150428054032
Answers are between 0.15 years over and 0.77 years under the reality. If this was being used in a pension scheme it might lead to a slight under prediction of life expectancy and cost the pension scheme a little more than expected.
Predicting Historical Life Expectancy
Now change your program to measure life expectancy in Canada between 1890 and 1914. Use the resulting m and c values to predict life expectancy in 1918. How accurate is your answer? If your answer was inaccurate, why was it inaccurate? What does this tell you about extrapolating models like this?
Solution
process_life_expectancy_data("../data/gapminderlifeexpectancy.csv", "Canada", 1890, 1914)
m = 0.369807692308 c = -654.215830769
print(1918 * 0.369807692308 - 654.215830769)
The predicted age is 55.0753 but the actual age is 47.17. This is inaccurate due to the First World War and the subsequent flu epidemic. Major events can produce trends that we’ve not seen before (or not for a long time), and our models struggle to take account of things they’ve never seen. Even if we look back to 1800, the earliest date we have data for, we never see a sudden drop in life expectancy like the 1918 one.
Logarithmic Regression
We’ve now seen how we can use linear regression to make a simple model and use that to predict values, but what do we do when the relationship between the data isn’t linear?
As an example let’s take the relationship between income (GDP per capita) and life expectancy. The Gapminder website will graph this for us.
Logarithms Introduction
Logarithms are the inverse of exponentiation (raising a number to a power).
$\log_b(a) = c \iff b^c = a$
For example:
$2^5 = 32 \iff \log_2(32) = 5$
If you need more help on logarithms see the Khan Academy’s page
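The same relationship can be checked with Python’s math module, which is the module used for logarithms later in this lesson. This short sketch only uses standard library functions, and the value 1234.5 is an arbitrary number picked for the example.

```python
import math

# 2 raised to the power 5 is 32, so the base 2 logarithm of 32 is 5
power = 2 ** 5
log_base_2 = math.log2(32)
print(power, log_base_2)

# math.log with one argument is the natural logarithm (base e),
# and math.exp undoes it
value = 1234.5
round_trip = math.exp(math.log(value))
print(round_trip)
```

The round trip may differ from the starting value in the last decimal place or two because of floating point rounding.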
The relationship between these two variables clearly isn’t linear. But there is a trick we can use to make the data appear linear: we can take the logarithm of the Y axis (the GDP) by clicking on the arrow on the left next to GDP/capita and choosing log. The graph now appears to be linear.
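To see why taking a logarithm can straighten out a curve, consider some made-up data that grows exponentially. In the sketch below (an illustration under assumed values, not the Gapminder data) the steps between consecutive y values keep growing, but the steps between their logarithms are constant, which is exactly what a straight line looks like.

```python
import math

# Hypothetical data following y = e^(0.5 * x + 1), which curves upwards.
x_data = [0, 1, 2, 3, 4]
y_data = [math.exp(0.5 * x + 1) for x in x_data]
log_y = [math.log(y) for y in y_data]

def steps(values):
    """Differences between consecutive values."""
    return [b - a for a, b in zip(values, values[1:])]

print("steps in y:     ", [round(s, 3) for s in steps(y_data)])
print("steps in log(y):", [round(s, 3) for s in steps(log_y)])
```

Each step in log(y) is 0.5, the slope of the straight line hidden inside the exponential.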
Coding a logarithmic regression
Downloading the data
Download the GDP data from http://scwaberystwyth.github.io/machinelearningnovice/data/worldbankgdp.csv
Loading the data
We need to modify our code a little to work with this example. Firstly, the data is now stored in two different files so we’ll have to read both of them and combine them together. The two datasets don’t quite have an identical list of countries. The life expectancy data is from Gapminder themselves and includes French Overseas Departments and British Overseas Territories as separate entities, and it also includes Taiwan. The GDP data is from the World Bank and doesn’t differentiate many of the overseas territories/departments and doesn’t include Taiwan. Some countries are also lacking GDP data, life expectancy or both. When we load the data we’ll have to discard any country which doesn’t have valid data in both datasets. Missing data is marked as NaN (not a number), so when loading it we’ll have to check for NaNs using the math.isnan() function.
To match the analysis we just did on the gapminder website we only want to focus on a single year, so we’ll filter the data down to a single year which the user can specify.
Finally, the data is sorted in the files by country name, but to help with graphing it later on we need to sort it by life expectancy instead. For this we can use the Pandas sort_values() function.
def read_data(gdp_file, life_expectancy_file, year):
    df_gdp = pd.read_csv(gdp_file, index_col="Country Name")

    gdp = df_gdp.loc[:, year]

    df_life_expt = pd.read_csv(life_expectancy_file,
                               index_col="Life expectancy")

    # get the life expectancy of every country for the specified year
    # the year must be a string as pandas treats the column names that way
    life_expectancy = df_life_expt.loc[:, year]

    data = []
    for country in life_expectancy.index:
        if country in gdp.index:
            # exclude any country where data is unknown
            if not math.isnan(life_expectancy[country]) and \
               not math.isnan(gdp[country]):
                data.append((country, life_expectancy[country],
                             gdp[country]))
            else:
                print("Excluding ", country, ",NaN in data (life_exp = ",
                      life_expectancy[country], "gdp=", gdp[country], ")")
        else:
            print(country, "is not in the GDP country data")

    combined = pd.DataFrame.from_records(data, columns=("Country",
                                         "Life Expectancy", "GDP"))
    combined = combined.set_index("Country")
    # we'll need sorted data for graphing properly later on
    combined = combined.sort_values("Life Expectancy")
    return combined
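The loop above spells out each step, which is useful for seeing which countries get excluded and why. As a point of comparison only, the same combine, discard and sort steps can be sketched with built-in Pandas operations. The country names and values below are invented, and this is an alternative illustration rather than the lesson’s method.

```python
import pandas as pd

# Made-up values for illustration only
life_expectancy = pd.Series(
    {"Freedonia": 71.0, "Sylvania": 64.5, "Osterlich": float("nan"),
     "Ruritania": 68.0},
    name="Life Expectancy")
gdp = pd.Series(
    {"Freedonia": 9000.0, "Osterlich": 15000.0, "Tomainia": 4000.0,
     "Ruritania": 12000.0},
    name="GDP")

# join="inner" keeps only countries present in both series,
# dropna() then discards rows where either value is NaN
combined = pd.concat([life_expectancy, gdp], axis=1, join="inner").dropna()
# sort by life expectancy, as the graphing code expects
combined = combined.sort_values("Life Expectancy")
print(combined)
```

Here Sylvania and Tomainia are dropped because each appears in only one series, and Osterlich is dropped because its life expectancy is NaN.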
Processing the data
Once the data is loaded we’ll need to convert the GDP data to its logarithmic form by using the math.log() function. Pandas has a method called apply which can apply an operation to every item in a column. The statement data["GDP"].apply(math.log) will calculate the logarithm of every value in the GDP column and return the results as a new series. We’ll convert the data into lists to simplify working with it; these can be used by the least_squares, make_graph and measure_error functions.
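The effect of apply can be seen on a small made-up column. The sketch below also shows numpy’s log function, which is a commonly used alternative that works on the whole column in one call. The values are invented and np.log is not used elsewhere in this lesson.

```python
import math
import numpy as np
import pandas as pd

data = pd.DataFrame({"GDP": [1.0, 10.0, 100.0, 1000.0]})  # made-up values

# apply calls math.log once for every value in the column
gdp_log_apply = data["GDP"].apply(math.log).tolist()
# np.log takes the logarithm of the whole column in one call
gdp_log_numpy = np.log(data["GDP"]).tolist()

print(gdp_log_apply)
print(gdp_log_numpy)
```

Both approaches leave the original GDP column unchanged and return the logarithms as new data.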
Once we’ve calculated the line of best fit with the least_squares function we can graph it. But now we have two choices on how to do the graphing. We can either leave the data in its logarithmic form and draw a straight line of best fit, or we can convert it back to its original form with the math.exp() function and graph the curved line of best fit. To allow us to do either we’ll calculate both forms of the line of best fit. The logarithmic form is stored in the list log_data and the form converted back to the original scale is stored in the list linear_data.
def process_data(gdp_file, life_expectancy_file, year):
    data = read_data(gdp_file, life_expectancy_file, year)

    gdp = data["GDP"].tolist()
    gdp_log = data["GDP"].apply(math.log).tolist()
    life_exp = data["Life Expectancy"].tolist()

    m, c = least_squares([life_exp, gdp_log])

    # list for logarithmic version
    log_data = []
    # list for raw version
    linear_data = []
    for x in life_exp:
        y_log = m * x + c
        log_data.append(y_log)

        y = math.exp(y_log)
        linear_data.append(y)

    # uncomment for log version, further changes needed in make_graph too
    # make_graph(life_exp, gdp_log, log_data)
    make_graph(life_exp, gdp, linear_data)

    err = measure_error(linear_data, gdp)
    print("error=", err)
A small change to the least_squares function is needed to handle this data. Previously we were working with dates on the x-axis and the least_squares function converted each of them into an integer. Now we have life expectancy on the x-axis and that data is already floats, so we need to remove the conversion to integers. Let’s change the line x = int(data[0][i]) in our least_squares function to x = data[0][i].
def least_squares(data):
    x_sum = 0
    y_sum = 0
    x_sq_sum = 0
    xy_sum = 0

    # the list of data should have two equal length columns
    assert len(data) == 2
    assert len(data[0]) == len(data[1])

    n = len(data[0])
    # least squares regression calculation
    for i in range(0, n):
        x = data[0][i]
        y = data[1][i]
        x_sum = x_sum + x
        y_sum = y_sum + y
        x_sq_sum = x_sq_sum + (x**2)
        xy_sum = xy_sum + (x*y)

    m = ((n * xy_sum) - (x_sum * y_sum))
    m = m / ((n * x_sq_sum) - (x_sum ** 2))
    c = (y_sum - m * x_sum) / n

    print("Results of linear regression:")
    print("x_sum=", x_sum, "y_sum=", y_sum, "x_sq_sum=", x_sq_sum, "xy_sum=",
          xy_sum)
    print("m=", m, "c=", c)

    return m, c
Finally, to run everything we need to call the process_data function. This takes three parameters: the GDP filename, the life expectancy filename and the year we want to process as a string.
process_data("../data/worldbankgdp.csv",
"../data/gapminderlifeexpectancy.csv", "1980")
Graphing the data
Previously we drew a line graph showing life expectancy over time. This made sense as a line as it was tracking a single variable over time. But now we are plotting two variables against each other and need to use a scatter graph instead, so we’ll change the first plt.plot call to plt.scatter.
def make_graph(x_data, y_data, linear_data):
    plt.scatter(x_data, y_data, label="Original Data")
    plt.plot(x_data, linear_data, color="orange", label="Line of best fit")
    plt.grid()
    plt.legend()
    plt.show()
The process_data function gave us a choice of plotting either the logarithmic or non-logarithmic version of the data depending on which data we pass to make_graph. If we uncomment the line # make_graph(life_exp, gdp_log, log_data) and comment out the line make_graph(life_exp, gdp, linear_data) then we can switch to showing the logarithmic version.
Comparing the logarithmic and non-logarithmic graphs
Convert the code above to plot the logarithmic version of the graph. Save the graph. Now change back to the non-logarithmic version. Compare the two graphs. Which one do you think is easier to read?
Removing outliers from the data
The correlation of GDP and life expectancy has a few big outliers that are probably increasing the error rate on this model. These are typically countries with very high GDP and sometimes not very high life expectancy. These tend to be either small countries with artificially high GDPs such as Monaco and Luxembourg or oil rich countries such as Qatar or Brunei. Kuwait, Qatar and Brunei have already been removed from this data set, but are available in the file worldbankgdpoutliers.csv. Try experimenting with adding and removing some of these high income countries to see what effect it has on your model’s error rate. Do you think it’s a good idea to remove these outliers from your model? How might you do this automatically?
Introducing Scikit Learn
Scikit-learn (also known as sklearn) is an open source machine learning library for Python which has a very wide range of machine learning algorithms. It makes it very easy for a Python programmer to use machine learning techniques without having to implement them.
Linear Regression with scikit-learn
Let’s adapt our linear regression program to use scikit-learn instead of our own regression function. We can go and remove the least_squares and measure_error functions from our code. We’ll save this under a different filename to the original linear regression code so that we can compare the answers of the two, which should be identical.
First let’s add the import for sklearn. We’re also going to need the numpy library so we’ll import that too:
import numpy as np
import sklearn.linear_model as skl_lin
Now let’s replace the calculation that used our own least_squares function with the one from scikit-learn. The scikit-learn regression function is much more capable than the simple one we wrote earlier and is designed for datasets where multiple parameters are used, so it expects to be given multidimensional arrays of data. To get it to accept single dimension data such as ours we need to convert the list to a numpy array and use numpy’s reshape function. The resulting model is also designed to report multiple coefficients and intercepts, so these values will be arrays. Since we’ve just got one parameter we can grab the first item from each of these arrays. Instead of manually calculating the results we can now use scikit-learn’s predict function. Finally let’s calculate the error. scikit-learn provides a mean squared error function, and we can calculate the root mean squared error simply by taking the square root of its output. The mean_squared_error function is part of the scikit-learn metrics module, so we’ll have to add that to our imports at the top of the file:
import sklearn.metrics as skl_metrics
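The reshaping step described above can be tried on its own. This small sketch uses a made-up list of years to show the shape of the array before and after numpy’s reshape function is applied.

```python
import numpy as np

x_data = [1950, 1951, 1952, 1953]

flat = np.array(x_data)
print(flat.shape)        # one dimension holding 4 values

# -1 asks numpy to work out the number of rows, 1 fixes a single column
column = flat.reshape(-1, 1)
print(column.shape)      # 4 rows, 1 column
print(column)
```

Each year now sits in its own row, which is the layout scikit-learn expects: one row per sample and one column per parameter.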
Let’s go ahead and change the process_life_expectancy_data function to use scikit-learn’s LinearRegression class instead of our own version.
import pandas as pd
import math

def process_life_expectancy_data(filename, country, min_date, max_date):
    df = pd.read_csv(filename, index_col="Life expectancy")

    # get the life expectancy for the specified country/dates
    # we have to convert the dates to strings as pandas treats them that way
    life_expectancy = df.loc[country, str(min_date):str(max_date)]

    x_data = list(range(min_date, max_date + 1))

    # reshape(-1, 1) turns each list into a single column, one row per year
    x_data_arr = np.array(x_data).reshape(-1, 1)
    life_exp_arr = np.array(life_expectancy).reshape(-1, 1)

    regression = skl_lin.LinearRegression().fit(x_data_arr, life_exp_arr)

    m = regression.coef_[0][0]
    c = regression.intercept_[0]
    print("m=", m, "c=", c)

    # old manual version
    #linear_data = calculate_linear(x_data, m, c)

    # new scikit learn version
    linear_data = regression.predict(x_data_arr)

    # old manual version
    #error = measure_error(life_expectancy, linear_data)

    # new scikit learn version
    error = math.sqrt(skl_metrics.mean_squared_error(life_exp_arr, linear_data))
    print("error=", error)

    # uncomment to make the graph
    #make_graph(x_data, life_expectancy, linear_data)

process_life_expectancy_data("../data/gapminderlifeexpectancy.csv",
                             "United Kingdom", 1950, 2016)
Now if we run the new program over the same range of dates as the original one we should get the same answers and the same graph as before.
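A quick way to convince ourselves that the two approaches agree is to run scikit-learn on the small test data used earlier in the lesson, where we already know the answers. The sketch below repeats the same calls on that data. It is a self-contained check and is not meant to replace the function above.

```python
import math
import numpy as np
import sklearn.linear_model as skl_lin
import sklearn.metrics as skl_metrics

x_data = [2, 3, 5, 7, 9]
y_data = [4, 5, 7, 10, 15]

# one row per sample, one column per parameter
x_arr = np.array(x_data).reshape(-1, 1)
y_arr = np.array(y_data).reshape(-1, 1)

regression = skl_lin.LinearRegression().fit(x_arr, y_arr)
m = regression.coef_[0][0]
c = regression.intercept_[0]

predicted = regression.predict(x_arr)
error = math.sqrt(skl_metrics.mean_squared_error(y_arr, predicted))
print("m=", m, "c=", c, "error=", error)
```

The values of m, c and the error should match those from our own least_squares and measure_error functions, apart from possible differences in the last few decimal places.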
Comparing the scikit-learn and our own linear regression implementations
Adjust both the original program (using our own linear regression implementation) and the sklearn version to calculate the life expectancy for Germany between 1950 and 2000. What are the values (m and c) of the linear equation linking date and life expectancy? Are they the same in both?
Solution
process_life_expectancy_data("../data/gapminderlifeexpectancy.csv", "Germany", 1950, 2000)
m= 0.212219909502 c= -346.784909502
They should be identical.
Other types of regression
Linear regression obviously has its limits for working with data that isn’t linear. Scikit-learn has a number of other regression techniques which can be used on non-linear data. Some of these (such as isotonic regression) will only interpolate data in the range of the training data and can’t extrapolate beyond it. One non-linear technique that works with many types of data is polynomial regression. This creates a polynomial equation of the form y = a + bx + cx^2 + dx^3 etc. The more terms we add to the polynomial the more closely it can follow the training data, although too many terms can lead to over fitting.
Scikit-learn includes a polynomial modelling tool as part of its preprocessing library which we’ll need to add to our list of imports.
import sklearn.preprocessing as skl_pre
Now let’s modify the process_life_expectancy_data function to calculate the polynomial. This takes two parts, the first of which is to preprocess the data into polynomial form. We first create a PolynomialFeatures object with the parameter degree. The degree parameter sets the highest power of x in the polynomial, so a polynomial of the form y = a + bx + cx^2 + dx^3 has degree 3 (and four terms). Typically a value between 5 and 10 is sufficient. We must then process the numpy array that we used for the X axis in the linear regression to convert it into a set of polynomial features.
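To see what the polynomial features actually look like, the sketch below transforms two made-up x values with degree=3. A low degree and tiny input are used here purely to keep the output readable.

```python
import numpy as np
import sklearn.preprocessing as skl_pre

# two made-up x values, reshaped into a single column
x_arr = np.array([2, 3]).reshape(-1, 1)

polynomial_features = skl_pre.PolynomialFeatures(degree=3)
x_poly = polynomial_features.fit_transform(x_arr)
print(x_poly)
```

Each row holds 1, x, x^2 and x^3 for one input value, so x = 2 becomes 1, 2, 4, 8. A degree of 3 therefore produces four columns, one for each term of the polynomial.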
This only gets us halfway to being able to create a model that we can use for predictions. To form the complete model we actually have to perform a linear regression on the polynomial model, but we’ll use the polynomial features as the X axis instead of the numpy array. The Y axis will still be the life expectancy numpy array that we used before. The resulting model can now be used to make some predictions like we did before using the predict function.
If we want to draw the line of best fit we can pass the polynomial features in as a parameter to predict() and this will generate the y values for the full range of our data. This can be plotted by passing it to make_graph in place of the linear data.
Finally we can make some predictions of future data. Let’s create a list containing the date range we’d like to predict. As with the other lists and arrays we’ve used, we’ll have to reshape it to make scikit-learn work with it. Now let’s use this list of dates to predict life expectancy using both our linear and polynomial models.
def process_life_expectancy_data_poly(filename, country, min_date, max_date):
    df = pd.read_csv(filename, index_col="Life expectancy")

    # get the life expectancy for the specified country/dates
    # we have to convert the dates to strings as pandas treats them that way
    life_expectancy = df.loc[country, str(min_date):str(max_date)]

    x_data = list(range(min_date, max_date + 1))
    # reshape(-1, 1) gives one row per sample and a single column
    x_data_arr = np.array(x_data).reshape(-1, 1)
    life_exp_arr = np.array(life_expectancy).reshape(-1, 1)

    # convert the dates into polynomial features, then fit a linear model to them
    polynomial_features = skl_pre.PolynomialFeatures(degree=5)
    x_poly = polynomial_features.fit_transform(x_data_arr)
    polynomial_model = skl_lin.LinearRegression().fit(x_poly, life_exp_arr)
    polynomial_data = polynomial_model.predict(x_poly)
    #make_graph(x_data, life_expectancy, polynomial_data)

    # make some predictions
    predictions_x = list(range(2011, 2025))
    predictions_x_arr = np.array(predictions_x).reshape(-1, 1)
    predictions_polynomial = polynomial_model.predict(polynomial_features.transform(predictions_x_arr))

    plt.plot(x_data, life_expectancy, label="Original Data")
    plt.plot(predictions_x, predictions_polynomial, label="Polynomial Prediction")
    plt.grid()
    plt.legend()
    plt.show()
To measure the error, let’s calculate the RMS error on both the linear and polynomial data.
def process_life_expectancy_data_poly(filename, country, min_date, max_date):
    df = pd.read_csv(filename, index_col="Life expectancy")

    # get the life expectancy for the specified country/dates
    # we have to convert the dates to strings as pandas treats them that way
    life_expectancy = df.loc[country, str(min_date):str(max_date)]

    x_data = list(range(min_date, max_date + 1))
    # reshape(-1, 1) gives one row per sample and a single column
    x_data_arr = np.array(x_data).reshape(-1, 1)
    life_exp_arr = np.array(life_expectancy).reshape(-1, 1)

    polynomial_features = skl_pre.PolynomialFeatures(degree=5)
    x_poly = polynomial_features.fit_transform(x_data_arr)
    polynomial_model = skl_lin.LinearRegression().fit(x_poly, life_exp_arr)
    polynomial_data = polynomial_model.predict(x_poly)

    # root mean squared error between the real data and the fitted curve
    polynomial_error = math.sqrt(
        skl_metrics.mean_squared_error(life_exp_arr, polynomial_data))
    print("polynomial error is", polynomial_error)

process_life_expectancy_data_poly("../data/gapminderlifeexpectancy.csv",
                                  "United Kingdom", 1950, 2016)
process_life_expectancy_data("../data/gapminderlifeexpectancy.csv",
                             "United Kingdom", 1950, 2016)
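For reference, the RMS error itself is simple to compute by hand: square each difference between the real and predicted values, take the mean, then take the square root. The sketch below uses only the standard library and a few made-up, rounded values (not output from the model); the function name rms_error is hypothetical.

```python
import math


def rms_error(real_values, predicted_values):
    """Root mean squared error between two equal-length lists."""
    squared_errors = [(real - predicted) ** 2
                      for real, predicted in zip(real_values, predicted_values)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# illustrative values only: every prediction is 0.5 years too low
real = [78.4, 78.6, 78.8]
predicted = [77.9, 78.1, 78.3]
print("RMS error is", rms_error(real, predicted))
```

Because every prediction here is off by the same amount, the RMS error comes out at (very nearly) that amount.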
Exercise: Comparing linear and polynomial models
Train a linear and polynomial model on life expectancy data from China between 1960 and 2000. Then predict life expectancy from 2001 to 2016 using both methods. Compare their root mean squared errors, which is more accurate? Why do you think this model is the more accurate one?
Solution
Modify the call to the process_life_expectancy_data_poly function:
process_life_expectancy_data_poly("../data/gapminderlifeexpectancy.csv", "China", 1960, 2000)
linear prediction error is 5.385162846665607
polynomial prediction error is 28.169167771983528
The linear model is more accurate. Polynomial models often become wildly inaccurate beyond the range they were trained on. Look at the predicted life expectancies: the polynomial model predicts a life expectancy of 131 by 2016!
Clustering with Scikit Learn
Clustering
Clustering is the grouping of data points which are similar to each other. It can be a powerful technique for identifying patterns in data. Clustering analysis does not usually require any training and is known as an unsupervised learning technique. The lack of a need for training means it can be applied quickly.
Applications of Clustering
 Looking for trends in data
 Data compression, all data clustering around a point can be reduced to just that point. For example, reducing colour depth of an image.
 Pattern recognition
K-means Clustering
The K-means clustering algorithm is a simple clustering algorithm that tries to identify the centre of each cluster. It does this by searching for centre points which minimise the distance between each centre and the points assigned to its cluster. The algorithm needs to be told how many clusters to look for, but a common technique is to try different numbers of clusters and combine this with other tests to decide on the best number.
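One common way to carry out this search alternates between two steps: assign every point to its nearest centre, then move each centre to the mean of the points assigned to it. The sketch below is a bare-bones, standard-library version of that loop for 2D points. The function name, the fixed iteration count and the hand-picked starting centres are all assumptions made for illustration; scikit-learn's implementation is considerably more sophisticated.

```python
import math


def kmeans(points, centres, iterations=10):
    """A bare-bones K-means loop for 2D points (illustrative only)."""
    for _ in range(iterations):
        # step 1: assign each point to its nearest centre
        clusters = [[] for _ in centres]
        for point in points:
            distances = [math.dist(point, centre) for centre in centres]
            clusters[distances.index(min(distances))].append(point)

        # step 2: move each centre to the mean of its assigned points
        new_centres = []
        for centre, members in zip(centres, clusters):
            if members:
                mean_x = sum(p[0] for p in members) / len(members)
                mean_y = sum(p[1] for p in members) / len(members)
                new_centres.append((mean_x, mean_y))
            else:
                new_centres.append(centre)  # leave an empty cluster's centre alone
        centres = new_centres
    return centres, clusters


# two obvious groups of three points each
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centres, clusters = kmeans(points, centres=[(0, 0), (10, 10)])
print(centres)
print(clusters)
```

With well separated groups like these the centres settle after the first pass; with overlapping data the result can depend on where the centres start.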
K-means with Scikit Learn
To perform a K-means clustering with Scikit Learn we first need to import the sklearn.cluster module.
import sklearn.cluster as skl_cluster
For this example, we’re going to use scikit learn’s built in random data blob generator instead of using an external dataset. For this we’ll also need the sklearn.datasets module.
import sklearn.datasets as skl_datasets
Now let’s create some random blobs using the make_blobs function. The n_samples argument sets how many points we want to use in all of our blobs. cluster_std sets the standard deviation of the points; the smaller this value, the closer together they will be. centers sets how many clusters we’d like. random_state is the initial state of the random number generator; by specifying this we’ll get the same results every time we run the program. If we don’t specify a random state then we’ll get different points every time we run. This function returns two things: an array of data points and a list of which cluster each point belongs to.
data, cluster_id = skl_datasets.make_blobs(n_samples=400, cluster_std=0.75, centers=4, random_state=1)
Now that we have some data we can go ahead and try to identify the clusters using K-means. First, we need to initialise the KMeans module and tell it how many clusters to look for. Next, we supply it some data via the fit function, in much the same way as we did with the regression functions earlier on. Finally, we run the predict function to find the clusters.
Kmean = skl_cluster.KMeans(n_clusters=4)
Kmean.fit(data)
clusters = Kmean.predict(data)
The data can now be plotted to show all the points we randomly generated. To make it clearer which cluster each point has been assigned to, we can set the colours (the c parameter) to use the clusters list that was returned by the predict function. The K-means algorithm also lets us know where it placed the centre of each cluster. These are stored as a list called cluster_centers_ inside the Kmean object. Let’s go ahead and plot the points from the clusters, colouring them by the output from the K-means algorithm, and also plot the centres of each cluster as a red X.
import matplotlib.pyplot as plt
plt.scatter(data[:, 0], data[:, 1], s=5, linewidth=0, c=clusters)
for cluster_x, cluster_y in Kmean.cluster_centers_:
    plt.scatter(cluster_x, cluster_y, s=100, c='r', marker='x')
plt.show()
import sklearn.cluster as skl_cluster
import sklearn.datasets as skl_datasets
import matplotlib.pyplot as plt
data, cluster_id = skl_datasets.make_blobs(n_samples=400, cluster_std=0.75, centers=4, random_state=1)
Kmean = skl_cluster.KMeans(n_clusters=4)
Kmean.fit(data)
clusters = Kmean.predict(data)
plt.scatter(data[:, 0], data[:, 1], s=5, linewidth=0, c=clusters)
for cluster_x, cluster_y in Kmean.cluster_centers_:
    plt.scatter(cluster_x, cluster_y, s=100, c='r', marker='x')
plt.show()
Working in multiple dimensions
Although this example shows two dimensions, the K-means algorithm can work in more than two; it just becomes very difficult to show this visually once we get beyond 3 dimensions. It’s very common in machine learning to be working with multiple variables and so our classifiers are working in multidimensional spaces.
Limitations of KMeans
 Requires number of clusters to be known in advance
 Struggles when clusters have irregular shapes
 Will always produce an answer finding the required number of clusters even if the data isn’t clustered (or clustered in that many clusters).
 Requires linear cluster boundaries
Advantages of KMeans
 Simple algorithm, fast to compute. A good choice as the first thing to try when attempting to cluster data.
 Suitable for large datasets due to its low memory and computing requirements.
Exercise: KMeans with overlapping clusters
Adjust the program above to increase the standard deviation of the blobs (the cluster_std parameter to make_blobs) and increase the number of samples (n_samples) to 4000. You should start to see the clusters overlapping. Do the clusters that are identified make sense? Is there any strange behaviour from this?
Solution
The resulting image from increasing n_samples to 4000 and cluster_std to 3.0 looks like this: The straight line boundaries between clusters look a bit strange.
Exercise: How many clusters should we look for?
As KMeans requires us to specify the number of clusters to expect, a common strategy to get around this is to vary the number of clusters we are looking for. Modify the program to loop through searching for between 2 and 10 clusters. Which (if any) of the results look more sensible? What criteria might you use to select the best one?
Solution
for cluster_count in range(2, 11):
    Kmean = skl_cluster.KMeans(n_clusters=cluster_count)
    Kmean.fit(data)
    clusters = Kmean.predict(data)
    plt.scatter(data[:, 0], data[:, 1], s=5, linewidth=0, c=clusters)
    for cluster_x, cluster_y in Kmean.cluster_centers_:
        plt.scatter(cluster_x, cluster_y, s=100, c='r', marker='x')
    # give the graph a title with the number of clusters
    plt.title(str(cluster_count) + " Clusters")
    plt.show()
None of these look like very sensible clusterings because all the points really form one large cluster. We might look at a measure of similarity within each cluster to test if it’s really multiple clusters. A simple standard deviation or interquartile range might be a good starting point.
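As one possible starting point for such a measure, the sketch below computes the sum of squared distances from the points in a cluster to its centre, so a tight cluster scores low and a spread out one scores high. This is a swapped-in alternative to the standard deviation suggested above, chosen because a fitted KMeans object reports a comparable total (summed over all clusters) as its inertia_ attribute. The function name and the sample points are made up for illustration.

```python
def within_cluster_sum_of_squares(points, centre):
    """Sum of squared distances from each 2D point to the cluster centre."""
    return sum((x - centre[0]) ** 2 + (y - centre[1]) ** 2 for x, y in points)


# two made-up clusters with the same shape but very different spread
tight = [(0, 0), (0, 1), (1, 0), (1, 1)]
loose = [(0, 0), (0, 10), (10, 0), (10, 10)]

tight_score = within_cluster_sum_of_squares(tight, (0.5, 0.5))
loose_score = within_cluster_sum_of_squares(loose, (5, 5))
print("tight cluster:", tight_score)
print("loose cluster:", loose_score)
```

This total always falls as more clusters are added, so it is usually the point where the improvement levels off that is of interest, not the smallest value.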
Spectral Clustering
Spectral clustering is a technique that attempts to overcome the linear boundary problem of K-means clustering. It works by treating clustering as a graph partitioning problem: it looks for nodes in a graph with a small distance between them. See this introduction to Spectral Clustering if you are interested in more details about how spectral clustering works.
Here is an example of using spectral clustering on two concentric circles
Spectral clustering uses something called a kernel trick to introduce additional dimensions to the data. A common example of this is trying to cluster one circle within another (concentric circles). A Kmeans classifier will fail to do this and will end up effectively drawing a line which crosses the circles. Spectral clustering will introduce an additional dimension that effectively moves one of the circles away from the other in the additional dimension. This has the downside of being more computationally expensive than kmeans clustering.
Spectral Clustering with Scikit Learn
Let’s try out using Scikit Learn’s spectral clustering. To make the concentric circles in the above example we need to use the make_circles function in the sklearn.datasets module. This works in a very similar way to the make_blobs function we used earlier on.
import sklearn.datasets as skl_data
circles, circles_clusters = skl_data.make_circles(n_samples=400, noise=.01, random_state=0)
The code for calculating the SpectralClustering is very similar to the kmeans clustering, instead of using the sklearn.cluster.KMeans class we use the sklearn.cluster.SpectralClustering class.
model = skl_cluster.SpectralClustering(n_clusters=2, affinity='nearest_neighbors', assign_labels='kmeans')
The SpectralClustering class combines the fit and predict functions into a single function called fit_predict.
labels = model.fit_predict(circles)
Here is the whole program combined with the kmeans clustering for comparison. Note that this produces two figures, to view both of them use the “Inline” graphics terminal inside the Python console instead of the “Automatic” method which will open a window and only show you one of the graphs.
import matplotlib.pyplot as plt
import sklearn.cluster as skl_cluster
import sklearn.datasets as skl_data
circles, circles_clusters = skl_data.make_circles(n_samples=400, noise=.01, random_state=0)
# cluster with kmeans
Kmean = skl_cluster.KMeans(n_clusters=2)
Kmean.fit(circles)
clusters = Kmean.predict(circles)
# plot the data, colouring it by cluster
plt.scatter(circles[:, 0], circles[:, 1], s=15, linewidth=0.1, c=clusters,cmap='flag')
plt.show()
# cluster with spectral clustering
model = skl_cluster.SpectralClustering(n_clusters=2, affinity='nearest_neighbors', assign_labels='kmeans')
labels = model.fit_predict(circles)
plt.scatter(circles[:, 0], circles[:, 1], s=15, linewidth=0, c=labels, cmap='flag')
plt.show()
Comparing kmeans and spectral clustering performance
Modify the program we wrote in the previous exercise to use spectral clustering instead of K-means, and save it as a new file. Time how long both programs take to run. Add the line
import time
at the top of both files. As the first line of the program after the imports, get the start time with start_time = time.time(). End the program by getting the time again and subtracting the start time from it to get the total run time. Add end_time = time.time() and print("Elapsed time:", end_time - start_time, "seconds") to the end of both files. Compare how long both programs take to run generating 4,000 samples and testing them for between 2 and 10 clusters. How much did your run times differ? How much do they differ if you increase the number of samples to 8,000? How long do you think it would take to compute 800,000 samples (estimate this, it might take a while to run for real)?
Solution
KMeans version, runtime around 4 seconds (your computer might be faster/slower)
import matplotlib.pyplot as plt
import sklearn.cluster as skl_cluster
from sklearn.datasets import make_blobs
import time

start_time = time.time()
data, cluster_id = make_blobs(n_samples=4000, cluster_std=3, centers=4, random_state=1)

for cluster_count in range(2, 11):
    Kmean = skl_cluster.KMeans(n_clusters=cluster_count)
    Kmean.fit(data)
    clusters = Kmean.predict(data)
    plt.scatter(data[:, 0], data[:, 1], s=15, linewidth=0, c=clusters)
    plt.title(str(cluster_count) + " Clusters")
    plt.show()

end_time = time.time()
print("Elapsed time = ", end_time - start_time, "seconds")
Spectral version, runtime around 9 seconds (your computer might be faster/slower)
import matplotlib.pyplot as plt
import sklearn.cluster as skl_cluster
from sklearn.datasets import make_blobs
import time

start_time = time.time()
data, cluster_id = make_blobs(n_samples=4000, cluster_std=3, centers=4, random_state=1)

for cluster_count in range(2, 11):
    model = skl_cluster.SpectralClustering(n_clusters=cluster_count,
                                           affinity='nearest_neighbors',
                                           assign_labels='kmeans')
    labels = model.fit_predict(data)
    plt.scatter(data[:, 0], data[:, 1], s=15, linewidth=0, c=labels)
    plt.title(str(cluster_count) + " Clusters")
    plt.show()

end_time = time.time()
print("Elapsed time = ", end_time - start_time, "seconds")
When the number of points increases to 8000 the runtimes are 24 seconds for the spectral version and 5.6 seconds for K-means. The runtime numbers will differ depending on the speed of your computer, but the relative difference should be similar. For 4000 points K-means took 4 seconds and spectral took 9 seconds, a 2.25 fold difference. For 8000 points K-means took 5.6 seconds and spectral took 24 seconds, a 4.29 fold difference. K-means was 1.4 times slower for double the data, spectral about 2.7 times slower. The relative difference is diverging: it roughly doubles when we double the amount of data. If we use 100 times more data we might expect a 100 fold divergence in execution times. K-means might take a few minutes, spectral will take hours.
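The ratios quoted in this solution can be checked with a few lines of arithmetic. The sketch below uses the run times reported above (your own timings will differ) and only the standard library.

```python
# run times in seconds as quoted in the solution above
kmeans_times = {4000: 4.0, 8000: 5.6}
spectral_times = {4000: 9.0, 8000: 24.0}

# how much slower spectral clustering is than kmeans at each size
for samples in (4000, 8000):
    ratio = spectral_times[samples] / kmeans_times[samples]
    print(samples, "samples: spectral is", round(ratio, 2), "times slower")

# how much each method slows down when the data doubles
kmeans_growth = kmeans_times[8000] / kmeans_times[4000]
spectral_growth = spectral_times[8000] / spectral_times[4000]
print("kmeans slowdown for double the data:", round(kmeans_growth, 2))
print("spectral slowdown for double the data:", round(spectral_growth, 2))
```

Two data points are a thin basis for extrapolating to 800,000 samples, so any estimate made this way should be treated as a rough guide only.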
Dimensionality Reduction
Dimensionality Reduction
Dimensionality reduction is the process of describing a dataset with fewer coordinates than it started with, chosen (and possibly transformed) so that they still capture most of the variation in the features of the dataset. It can be a helpful preprocessing step before doing other operations on the data, such as classification, regression or visualization.
Dimensionality Reduction with Scikit-learn
First set up our environment and load the handwritten digits dataset bundled with Scikit-learn (a small MNIST-like set of 8x8 pixel images), which will be used as our initial example.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import decomposition
from sklearn import datasets
from sklearn import manifold
digits = datasets.load_digits()
# Examine the dataset
print(digits.data)
print(digits.target)
X = digits.data
y = digits.target
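Principal Component Analysis (PCA)
PCA is a technique that rotates the data, held in a two dimensional array of samples by features, onto a new set of orthogonal axes called principal components. The components are ordered according to the amount of variation (information) in the data that they capture, so keeping only the first few gives a lower dimensional view of the data.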
# PCA
pca = decomposition.PCA(n_components=2)
pca.fit(X)
X_pca = pca.transform(X)
fig = plt.figure(1, figsize=(4, 4))
plt.clf()
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap=plt.cm.nipy_spectral,
            edgecolor='k', label=y)
plt.colorbar(boundaries=np.arange(11)-0.5).set_ticks(np.arange(10))
plt.savefig("pca.svg")
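To show what PCA is doing underneath, here is a small sketch that finds the principal components of four made-up 2D points using numpy's singular value decomposition. The data values and variable names are assumptions for illustration, and this is a simplified stand-in for decomposition.PCA, which exposes a comparable share-of-variance figure as its explained_variance_ratio_ attribute.

```python
import numpy as np

# four made-up 2D points lying close to the line y = x
points = np.array([[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8]])

# centre the data, then decompose it: the rows of `directions` are
# orthogonal axes ordered by how much variance they capture
centred = points - points.mean(axis=0)
_, singular_values, directions = np.linalg.svd(centred, full_matrices=False)

# share of the total variance captured by each direction
explained_variance = singular_values ** 2 / (len(points) - 1)
explained_ratio = explained_variance / explained_variance.sum()

# coordinates of each point along the first direction only (2D reduced to 1D)
projected = centred @ directions[0]

print("first direction:", directions[0])
print("share of variance per component:", explained_ratio)
print("1D coordinates:", projected)
```

Because the points sit almost on a straight line, nearly all of the variance lands on the first component, which is why dropping the second loses very little.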
t-distributed Stochastic Neighbor Embedding (t-SNE)
# t-SNE embedding
tsne = manifold.TSNE(n_components=2, init='pca',
                     random_state=0)
X_tsne = tsne.fit_transform(X)
fig = plt.figure(1, figsize=(4, 4))
plt.clf()
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap=plt.cm.nipy_spectral,
            edgecolor='k', label=y)
plt.colorbar(boundaries=np.arange(11)-0.5).set_ticks(np.arange(10))
plt.savefig("tsne.svg")
Exercise: Working in three dimensions
The above example has considered only two dimensions since humans can visualize two dimensions very well. However, there can be cases where a dataset requires more than two dimensions to be appropriately decomposed. Modify the above programs to use three dimensions and create appropriate plots. Do three dimensions allow one to better distinguish between the digits?
Solution
from mpl_toolkits.mplot3d import Axes3D

# PCA
pca = decomposition.PCA(n_components=3)
pca.fit(X)
X_pca = pca.transform(X)
fig = plt.figure(1, figsize=(4, 4))
plt.clf()
ax = fig.add_subplot(projection='3d')
ax.scatter(X_pca[:, 0], X_pca[:, 1], X_pca[:, 2], c=y,
           cmap=plt.cm.nipy_spectral, s=9, lw=0)
plt.savefig("pca_3d.svg")

# t-SNE embedding
tsne = manifold.TSNE(n_components=3, init='pca', random_state=0)
X_tsne = tsne.fit_transform(X)
fig = plt.figure(1, figsize=(4, 4))
plt.clf()
ax = fig.add_subplot(projection='3d')
ax.scatter(X_tsne[:, 0], X_tsne[:, 1], X_tsne[:, 2], c=y,
           cmap=plt.cm.nipy_spectral, s=9, lw=0)
plt.savefig("tsne_3d.svg")
Exercise: Parameters
Look up parameters that can be changed in PCA and t-SNE, and experiment with these. How do they change your resulting plots? Might the choice of parameters lead you to make different conclusions about your data?
Exercise: Other Algorithms
There are other algorithms that can be used for doing dimensionality reduction, for example the Higher Order Singular Value Decomposition (HOSVD). Do an internet search for some of these and examine the example data that they are used on. Are there cases where they do poorly? What level of care might you need to use before applying such methods for automation in critical scenarios? What about for interactive data exploration?
Neural Networks
Introduction
Neural networks are a machine learning method inspired by how the human brain works. They are particularly good at pattern recognition and classification tasks, often using images as inputs. They are a well-established machine learning technique that has been around since the 1950s, but they have gone through several iterations since then, each overcoming fundamental limitations of the previous one. The current state-of-the-art in neural networks is often referred to as deep learning.
Perceptrons
Perceptrons are the building blocks of neural networks. They are an artificial version of a single neuron in the brain. They typically have one or more inputs and a single output. Each input is multiplied by a weight and the weighted inputs are then summed together. Finally, the summed value is put through an activation function which decides if the neuron “fires” a signal. In some cases, this activation function is simply a threshold step function which outputs zero below a certain input and one above it. Other designs of neurons use other activation functions, but typically they have an output between zero and one and are still step-like in their nature.
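To compare the two kinds of activation function, the sketch below defines a threshold step function alongside the sigmoid, a smooth curve whose output also lies between zero and one. The sigmoid is used here as one common example of a step-like function; the text above does not name a specific one, so treat that choice as an assumption.

```python
import math


def step(total, threshold=0.0):
    """Threshold step function: 0 below the threshold, 1 at or above it."""
    return 0 if total < threshold else 1


def sigmoid(total):
    """A smooth, step-like curve with outputs between zero and one."""
    return 1.0 / (1.0 + math.exp(-total))


# compare the two functions over a small range of summed inputs
for total in (-5.0, -0.5, 0.0, 0.5, 5.0):
    print(total, step(total), round(sigmoid(total), 3))
```

Far from zero the two functions agree closely; near zero the sigmoid changes gradually where the step function jumps.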
Coding a perceptron
Below is an example of a perceptron written as a Python function. The function takes three parameters: inputs is a list of input values, weights is a list of weight values and threshold is the activation threshold.
First let us multiply each input by the corresponding weight. To do this quickly and concisely, we will use the numpy multiply function which can multiply each item in a list by a corresponding item in another list.
We then take the sum of all the inputs multiplied by their weights. Finally, if this value is less than the activation threshold, we output zero, otherwise we output a one.
import numpy as np
def perceptron(inputs, weights, threshold):
    assert len(inputs) == len(weights)

    # multiply the inputs and weights
    values = np.multiply(inputs, weights)

    # sum the results
    total = sum(values)

    # decide if we should activate the perceptron
    if total < threshold:
        return 0
    else:
        return 1
Computing with a perceptron
A single perceptron can perform basic linear classification problems such as computing the logical AND, OR, and NOT functions.
OR
Input 1 | Input 2 | Output
------- | ------- | ------
0 | 0 | 0
0 | 1 | 1
1 | 0 | 1
1 | 1 | 1
AND
Input 1 | Input 2 | Output
------- | ------- | ------
0 | 0 | 0
0 | 1 | 0
1 | 0 | 0
1 | 1 | 1
NOT
Input 1 | Output
------- | ------
0 | 1
1 | 0
We can get a single perceptron to compute each of these functions.
OR:
inputs = [[0.0,0.0],[1.0,0.0],[0.0,1.0],[1.0,1.0]]
for input in inputs:
    print(input, perceptron(input, [0.5,0.5], 0.5))
AND:
inputs = [[0.0,0.0],[1.0,0.0],[0.0,1.0],[1.0,1.0]]
for input in inputs:
    print(input, perceptron(input, [0.5,0.5], 1.0))
NOT:
The NOT function only has a single input, but to make it work in the perceptron we need to introduce a bias term which is always the same value. In this example, it is the second input. It has a weight of 1.0; the weight on the real input is -1.0.
inputs = [[0.0,1.0],[1.0,1.0]]
for input in inputs:
    print(input, perceptron(input, [-1.0,1.0], 1.0))
A perceptron can be trained to compute any function which has linear separability. A simple training algorithm called the perceptron learning algorithm can be used to do this and scikit-learn has its own implementation of it. We are going to skip over the perceptron learning algorithm and move straight onto more powerful techniques. If you want to learn more about it see this page from Dublin City University.
Perceptron limitations
A single perceptron cannot solve any function that is not linearly separable, meaning that we need to be able to divide the classes of inputs and outputs with a straight line. A common example of this is the XOR function shown below:
Input 1 | Input 2 | Output
------- | ------- | ------
0 | 0 | 0
0 | 1 | 1
1 | 0 | 1
1 | 1 | 0
(Make a graph of this)
This function outputs a zero both when all its inputs are one and when they are all zero, and it’s not possible to separate the two output classes with a straight line. This is the problem of linear separability. When this limitation was discovered in the 1960s, it effectively halted development of neural networks for over a decade in a period known as the “AI Winter”.
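We can get a feel for this limitation by brute force. The sketch below tries every combination of two weights and a threshold on a coarse grid and reports the first one that reproduces a given truth table. It uses a plain Python version of the perceptron so that it runs on its own. The grid range and the function name find_weights are arbitrary choices for illustration, and a failed grid search is a demonstration, not a proof.

```python
import itertools


def perceptron(inputs, weights, threshold):
    """Plain Python perceptron: weighted sum followed by a threshold."""
    total = sum(i * w for i, w in zip(inputs, weights))
    return 0 if total < threshold else 1


def find_weights(truth_table):
    """Search a coarse grid of weights and thresholds for a match."""
    grid = [x / 2 for x in range(-4, 5)]  # -2.0, -1.5, ..., 2.0
    for w1, w2, threshold in itertools.product(grid, repeat=3):
        if all(perceptron(inputs, [w1, w2], threshold) == output
               for inputs, output in truth_table):
            return (w1, w2, threshold)
    return None  # nothing on the grid reproduces the table


or_table = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
xor_table = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

print("OR :", find_weights(or_table))
print("XOR:", find_weights(xor_table))
```

The search finds a setting for OR but comes back empty for XOR, however fine the grid is made.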
Multilayer Perceptrons
A single perceptron cannot be used to solve a nonlinearly separable function. For that, we need to use multiple perceptrons and typically multiple layers of perceptrons. They are formed of networks of artificial neurons which each take one or more inputs and typically have a single output. The neurons are connected together in large networks typically of 10s to 1000s of neurons. Typically, networks are connected in layers with an input layer, middle or hidden layer (or layers) and finally an output layer.
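As a small illustration of why layers help, the sketch below wires three perceptrons together by hand to compute XOR: a hidden layer of two perceptrons (one computing OR, one computing NAND) feeding an output perceptron that computes AND. The weights are hand-picked for this example and are not the result of any training; a plain Python perceptron is redefined here so the block runs on its own.

```python
def perceptron(inputs, weights, threshold):
    """Plain Python perceptron: weighted sum followed by a threshold."""
    total = sum(i * w for i, w in zip(inputs, weights))
    return 0 if total < threshold else 1


def xor_network(a, b):
    # hidden layer: one perceptron computes OR, the other computes NAND
    hidden_or = perceptron([a, b], [0.5, 0.5], 0.5)
    hidden_nand = perceptron([a, b], [-0.5, -0.5], -0.5)
    # output layer: AND of the two hidden outputs
    return perceptron([hidden_or, hidden_nand], [0.5, 0.5], 1.0)


for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor_network(a, b))
```

Each hidden perceptron draws one straight line; combining their outputs is what lets the network carve out a region that no single line can.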
Training Multilayer perceptrons
Multilayer perceptrons need to be trained by showing them a set of training data and measuring the error between the network’s predicted output and the true value. Training takes an iterative approach that improves the network a little each time a new training example is presented. There are a number of training algorithms available for a neural network today, but we are going to use one of the best established and well known, the backpropagation algorithm. The algorithm is called back propagation because it takes the error calculated between an output of the network and the true value and takes it back through the network to update the weights. If you want to read more about back propagation, please see this chapter from the book “Neural Networks - A Systematic Introduction”.
Multilayer perceptrons in scikit-learn
We are going to build a multilayer perceptron for recognising handwriting from images. Scikit Learn can download some example handwriting data from the MNIST data set, which consists of 70,000 images of hand written digits. Each image is 28x28 pixels in size (784 pixels in total) and is represented in grayscale with values between zero for fully black and 255 for fully white. This means we will need 784 perceptrons in our input layer, each taking the input of one pixel, and 10 perceptrons in our output layer to represent each digit we might classify. If trained correctly, the only perceptron in the output layer to “fire” will be the one representing the digit in the image (this is a massive oversimplification!).
We can import this dataset from sklearn.datasets and then load it into memory by calling the fetch_openml function.
import sklearn.datasets as skl_data
data, labels = skl_data.fetch_openml('mnist_784', version=1, return_X_y=True)
This creates two arrays of data, one called data which contains the image data and the other called labels which contains the labels for those images, telling us which digit is in each image. A common convention is to call the data X and the labels y.
As neural networks typically want to work with data that ranges between 0 and 1.0 we need to normalise our data to this range. Arithmetic on a whole array or dataframe is applied to every element, so we can divide the entire data array by 255 and store the result in one step. We can simply do:
data = data / 255.0
instead of writing a loop ourselves to divide every pixel by 255. The final result is the same, and it will usually take less time than a Python loop because the work is done in optimised library code.
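Here is a minimal sketch of that shortcut on a tiny numpy array, next to the explicit loop it replaces. The pixel values are made up for illustration; the same element by element division applies to the full data array.

```python
import numpy as np

# four made-up pixel values in the range 0 to 255
pixels = np.array([0, 51, 102, 255])

# vectorised: a single expression divides every element
normalised = pixels / 255.0

# the equivalent explicit loop over each pixel
normalised_loop = [pixel / 255.0 for pixel in pixels]

print(normalised)
print(normalised_loop)
```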
Now we need to initialise a neural network. Scikit learn has an entire library, sklearn.neural_network, for this and the MLPClassifier class handles multilayer perceptrons. This network takes a few parameters including the size of the hidden layer, the maximum number of training iterations we’re going to allow, the exact algorithm to use, if we’d like verbose output about what the training is doing and the initial state of the random number generator.
In this example we specify a multilayer perceptron with 50 hidden nodes, we allow a maximum of 50 iterations to train it, we turn on verbose output to see what’s happening and initialise the random state to 1 so that we always get the same behaviour.
import sklearn.neural_network as skl_nn
mlp = skl_nn.MLPClassifier(hidden_layer_sizes=(50,), max_iter=50, verbose=1, random_state=1)
We now have a neural network but we have not done any training of it yet. Before training, let us split our dataset into two parts: a training set which we will use to train the classifier and a test set which we will use to see how well the training is working. By using different data for the two, we can check that we have not just trained a network which only works with the data it was trained on. This problem is known as overfitting and can end up creating models which do not “generalise”, that is, work with data other than their training data.
Typically, 10 to 20% of the data will be held back as test data. Let us see how big our dataset is to decide how many samples we want to train with. Printing the describe attribute of the Pandas dataframe shows the data and tells us how many rows it has:
print(data.describe)
This tells us we have 70,000 rows in the dataset.
<bound method NDFrame.describe of pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 pixel9 pixel10 ... pixel775 pixel776 pixel777 pixel778 pixel779 pixel780 pixel781 pixel782 pixel783 pixel784
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
69995 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
69996 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
69997 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
69998 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
69999 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
[70000 rows x 784 columns]>
Let us take 90% of the data for training and 10% for testing, so we will use the first 63,000 samples in the dataset as the training data and the last 7,000 as the test data. We can split these using a slice operator.
data_train = data[0:63000]
labels_train = labels[0:63000]
data_test = data[63000:]
labels_test = labels[63000:]
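The slice operator excludes its end index, so the test slice should start at exactly the index where the training slice stopped. The sketch below shows the same 90%/10% split on a stand-in list of ten items; the names samples and split_point are made up for illustration.

```python
# a stand-in for a dataset with 10 rows
samples = list(range(10))

# index where the training data ends and the test data begins
split_point = int(len(samples) * 0.9)

train = samples[:split_point]   # the first 90% (indices 0 to 8)
test = samples[split_point:]    # the remaining 10%, starting where train stopped

print("train:", train)
print("test:", test)
```

Checking that the two lengths add up to the size of the original dataset is a quick way to confirm no rows were dropped or used twice.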
Now let us go ahead and train the network. This line will take about one minute to run. We do this by calling the fit function of the mlp class instance. This needs two arguments: the data itself and the labels showing what class each item should be classified to.
mlp.fit(data_train,labels_train)
Finally, let us score the accuracy of our network against both the original training data and the test data. If the training had converged to the point where each iteration of training was not improving the accuracy, then the accuracy on the training data should be close to 1.0 (100%).
print("Training set score", mlp.score(data_train, labels_train))
print("Testing set score", mlp.score(data_test, labels_test))
Here is the complete program:
import matplotlib.pyplot as plt
import sklearn.datasets as skl_data
import sklearn.neural_network as skl_nn
data, labels = skl_data.fetch_openml('mnist_784', version=1, return_X_y=True)
data = data / 255.0
mlp = skl_nn.MLPClassifier(hidden_layer_sizes=(50,), max_iter=50, verbose=1, random_state=1)
data_train = data[0:63000]
labels_train = labels[0:63000]
data_test = data[63000:]
labels_test = labels[63000:]
mlp.fit(data_train, labels_train)
print("Training set score", mlp.score(data_train, labels_train))
print("Testing set score", mlp.score(data_test, labels_test))
Prediction using a multilayer perceptron
Now that we have trained a multilayer perceptron, we can give it some input data and ask it to perform a prediction. In this case, our input data is a 28x28 pixel image, which can also be represented as a 784-element list of data. The output will be a number between 0 and 9 telling us which digit the network thinks we have supplied. The predict function in the MLPClassifier class can be used to make a prediction. Let us try using the first digit from our test set as an example.
Before we can pass it to the predictor, we have to extract one of the digits from the test set. We can use iloc on the dataframe to get hold of the first element in the test set. In order to present it to the predictor, we have to turn it into a numpy array with the dimensions 1x784, that is, one row holding all 784 pixel values. We can then call the predict function with this array as our parameter. This will return an array of predictions (as it could have been given multiple inputs); the first element of this will be the predicted digit. You may get a warning stating “X does not have valid feature names”. This is because we didn’t encode feature names into our X (digit images) data.
test_digit = data_test.iloc[0].to_numpy().reshape(1,784)
test_digit_prediction = mlp.predict(test_digit)[0]
print("Predicted value",test_digit_prediction)
We can now verify if the prediction is correct by looking at the corresponding item in the labels_test array.
print("Actual value",labels_test.iloc[0])
This should be the same value which is being predicted.
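The reshaping step can be tried on its own. This is a minimal sketch using a hypothetical blank image (an array of zeros) in place of a real digit; the names image and row are illustrative.

```python
import numpy as np

# Hypothetical 28x28 image filled with zeros, standing in for a real digit.
image = np.zeros((28, 28))

# Flatten it into a single row of 784 values: one sample with 784 features.
row = image.reshape(1, 784)
print(image.shape, row.shape)
```

The predictor expects one row per sample, which is why a single image is passed as a 1x784 array.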
Changing the learning parameters
There are several parameters which control the training of the network. One of these is called the learning rate. Increasing this can reduce how many learning iterations we need, but if we make it too large we will end up overshooting. Try tweaking this by adding the parameter learning_rate_init to the MLPClassifier. The default value of this is 0.001. Try increasing it to around 0.1.
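One way to see the effect of the learning rate is to train the same network twice and compare how many iterations each run took. The sketch below assumes Scikit Learn's small bundled digits dataset instead of MNIST so that it runs in a few seconds; the names digits, digit_labels and net are illustrative, and the numbers you see will differ from those for the MNIST network.

```python
import sklearn.datasets as skl_data
import sklearn.neural_network as skl_nn

digits, digit_labels = skl_data.load_digits(return_X_y=True)
digits = digits / 16.0  # pixel values in this dataset range from 0 to 16

for rate in (0.001, 0.1):
    net = skl_nn.MLPClassifier(hidden_layer_sizes=(20,), max_iter=300,
                               learning_rate_init=rate, random_state=1)
    net.fit(digits, digit_labels)
    # n_iter_ records how many training iterations were actually run
    print("learning_rate_init", rate, "iterations", net.n_iter_,
          "training score", net.score(digits, digit_labels))
```

Compare both the iteration counts and the scores: a larger rate may finish sooner, but it is worth checking that the accuracy has not suffered.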
Using your own handwriting
Create an image using Microsoft Paint, the GNU Image Manipulation Program (GIMP) or jspaint. The image needs to be grayscale and 28 x 28 pixels.
Try and draw a digit (0-9) in the image and save it into your code directory.
The code below loads the image (called digit.png, change to whatever your file is called) using the OpenCV library. Some Anaconda installations need this installed either through the package manager or by running the command:
conda install -c conda-forge opencv
from the Anaconda terminal.
OpenCV assumes that images are three-channel red, green, blue, so we have to convert to one-channel grayscale with cvtColor. We also need to normalise the image by dividing each pixel by 255.
To verify the image, we can plot it by using OpenCV’s imshow function (we could also use Matplotlib’s matshow function).
To check what digit it is, we can pass it into mlp.predict, but we have to convert it from a 28x28 array to a one-dimensional array of 784 values with the reshape function.
Did it correctly classify your hand (mouse) writing? Try a few images. If you have time, try drawing images on a touch screen or taking a photo of something you have really written by hand. Remember that you will have to resize it to be 28x28 pixels.
import cv2
import matplotlib.pyplot as plt
digit = cv2.imread("digit.png")
digit_gray = cv2.cvtColor(digit, cv2.COLOR_BGR2GRAY)
digit_norm = digit_gray/255.0
cv2.imshow("Normalised Digit",digit_norm)
print("Your digit is",mlp.predict(digit_norm.reshape(1,784)))
Measuring Neural Network performance
We have now trained a neural network and tested prediction on a few images. This might have given us a feel for how well our network is performing, but it would be much more useful to have a more objective measure. Since recognising digits is a classification problem, we can measure how many predictions were correct in a set of test data. As we already have a test set of data with 7,000 images let us use that and see how many predictions the neural network has got right. We will loop through every image in the test set, run it through our predictor and compare the result with the label for that image. We will also keep a tally of how many images we got right and see what percentage were correct.
correct = 0
for row in data_test.iterrows():
    # row is a tuple of the row index and the image data
    image = row[1].to_numpy().reshape(1,784)
    prediction = mlp.predict(image)[0]
    actual = labels_test[row[0]]
    if prediction == actual:
        correct = correct + 1
print((correct/len(data_test))*100)
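The same tally can be computed without an explicit loop by comparing whole arrays at once. This is a minimal sketch with hypothetical labels and predictions; in the lesson's setting, the arrays would come from labels_test and a single call to mlp.predict on the whole test set.

```python
import numpy as np

# Hypothetical labels and predictions, standing in for labels_test and mlp.predict(data_test).
actual = np.array(["7", "2", "1", "0", "4"])
predicted = np.array(["7", "2", "1", "0", "9"])

# Comparing the arrays element by element gives True/False values;
# the mean of those is the fraction that were correct.
percent_correct = np.mean(predicted == actual) * 100
print(percent_correct)
```

Four of the five hypothetical predictions match, so this reports 80 percent.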
Confusion Matrix
We now know what percentage of images were correctly classified, but we don’t know anything about the distribution of that across our different classes (the digits 0 to 9 in this case). A more powerful technique is known as a confusion matrix. Here we draw a grid with each class along both the x and y axis. In Scikit Learn's layout, each row (the y axis) is the actual class and each column (the x axis) is the predicted class. In a perfect classifier there will be a diagonal line of values across the grid moving from the top left to bottom right corresponding to the number in each class, and all other cells will be zero. If any cell outside of the diagonal is non-zero then it indicates a misclassification. Scikit Learn has a function called confusion_matrix in the sklearn.metrics module which can compute a confusion matrix for us. It needs two inputs: an array of the real labels and an array of the predicted labels. We already have the real data in the labels_test array, but we need to build the array of predictions by classifying each image (in the same order as the real data) and storing the result in another array.
from sklearn.metrics import confusion_matrix
predictions = []
for image in data_test.iterrows():
    # image contains a tuple of the row number and image data
    image = image[1].to_numpy().reshape(1,784)
    predictions.append(mlp.predict(image)[0])
confusion_matrix(labels_test,predictions)
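To see how to read the result, it can help to try confusion_matrix on a handful of made-up labels first. The sketch below uses hypothetical three-class data; the names actual and predicted are illustrative.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for six items across three classes (0, 1 and 2).
actual = [0, 0, 1, 1, 2, 2]
predicted = [0, 1, 1, 1, 2, 0]

# Each row is an actual class and each column is a predicted class.
matrix = confusion_matrix(actual, predicted)
print(matrix)
```

The first row reads as: of the two items that were really class 0, one was predicted as 0 and one as 1. Only class 1 was classified without mistakes, so its row is the only one with nothing off the diagonal.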
A better way to plot a confusion matrix
The ConfusionMatrixDisplay class in the sklearn.metrics package can create a graphical representation of a confusion matrix with colour coding to highlight how many items are in each cell. This colour coding can be useful when working with very large numbers of classes. Try and use the from_predictions() method in the ConfusionMatrixDisplay class to display a graphical confusion matrix.
Solution
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_predictions(labels_test,predictions)
Cross Validation
Previously we split the data into training and test sets. But what happens if the test set includes important features we want to train on that happen to be missing in the training set? By holding back a test set, we are throwing away part of the data that could otherwise have been used for training.
Cross validation runs the training/testing multiple times but splits the data in a different way each time. This way all of the data gets used both for training and testing. We can use multiple iterations of training with different data in each set to eventually include the entire dataset.
example list
[1,2,3,4,5,6,7,8]
train = 1,2,3,4,5,6 test = 7,8
train = 1,2,3,4,7,8 test = 5,6
train = 1,2,5,6,7,8 test = 3,4
train = 3,4,5,6,7,8 test = 1,2
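The splitting of the example list can be reproduced with Scikit Learn's KFold class. This is a minimal sketch; the names items and folds are illustrative, and the folds are produced in a different order from the hand-written list above.

```python
import numpy as np
import sklearn.model_selection as skl_msel

items = np.array([1, 2, 3, 4, 5, 6, 7, 8])
kfold = skl_msel.KFold(4)

folds = []
for train, test in kfold.split(items):
    # train and test hold index positions, which we use to select the items
    folds.append((items[train].tolist(), items[test].tolist()))
    print("train =", items[train], "test =", items[test])
```

Each item appears in the test set exactly once and in the training set for the other three splits.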
Cross Validation code example
The sklearn.model_selection module provides support for doing k-fold cross validation in scikit-learn. It can automatically partition our data for cross validation.
Let us import this and call it skl_msel:
import sklearn.model_selection as skl_msel
Now we can choose how many ways we would like to split our data, three or four are common choices.
kfold = skl_msel.KFold(4)
Now we can loop through our data and test on each combination. The kfold.split function returns two variables and we will have our for loop work through both of them. The train variable will contain a list of which items (by index number) we are currently using to train and the test one will contain the list of which items we are going to test on.
for (train, test) in kfold.split(data):
Now inside the loop, we can select the data by doing data_train = data.iloc[train] and labels_train = labels.iloc[train]. In some versions of Python/Pandas/Scikit Learn, you might be able to do data_train = data[train] and labels_train = labels[train]. This is a useful Python shorthand which will use the list of indices from train to select which items from data and labels we use. We can repeat this process with the test set.
data_train = data.iloc[train]
labels_train = labels.iloc[train]
data_test = data.iloc[test]
labels_test = labels.iloc[test]
Finally, we need to train the classifier with the selected training data and then score it against the test data. The scores for each set of test data should be similar.
mlp.fit(data_train,labels_train)
print("Testing set score", mlp.score(data_test, labels_test))
Once we have established that the cross validation was OK, we can go ahead and train using the entire dataset by doing mlp.fit(data,labels).
Here is the entire example program:
import matplotlib.pyplot as plt
import sklearn.datasets as skl_data
import sklearn.neural_network as skl_nn
import sklearn.model_selection as skl_msel
data, labels = skl_data.fetch_openml('mnist_784', version=1, return_X_y=True)
data = data / 255.0
mlp = skl_nn.MLPClassifier(hidden_layer_sizes=(50,), max_iter=50, random_state=1)
kfold = skl_msel.KFold(4)
for (train, test) in kfold.split(data):
    data_train = data.iloc[train]
    labels_train = labels.iloc[train]
    data_test = data.iloc[test]
    labels_test = labels.iloc[test]
    mlp.fit(data_train,labels_train)
    print("Training set score", mlp.score(data_train, labels_train))
    print("Testing set score", mlp.score(data_test, labels_test))
mlp.fit(data,labels)
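Scikit Learn also provides a cross_val_score function in sklearn.model_selection which runs the split, fit and score loop for us. The sketch below is an alternative to the explicit loop above, not the lesson's own method. It assumes the small bundled digits dataset instead of MNIST so that it finishes in a few seconds; the names digits, digit_labels and net are illustrative.

```python
import sklearn.datasets as skl_data
import sklearn.model_selection as skl_msel
import sklearn.neural_network as skl_nn

digits, digit_labels = skl_data.load_digits(return_X_y=True)
digits = digits / 16.0  # pixel values in this dataset range from 0 to 16

net = skl_nn.MLPClassifier(hidden_layer_sizes=(20,), max_iter=300, random_state=1)

# One test score per fold; the classifier is trained afresh for each split.
scores = skl_msel.cross_val_score(net, digits, digit_labels, cv=skl_msel.KFold(4))
print(scores, scores.mean())
```

As with the explicit loop, the four scores should be broadly similar if the model behaves consistently across the data.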
Deep Learning
Deep learning usually refers to newer neural network architectures, many of which use a special type of network known as a convolutional network. Typically, these have many layers and thousands of neurons. They are very good at tasks such as image recognition but take a long time to train and run. They are often used with GPUs (Graphics Processing Units), which are good at executing multiple operations simultaneously. It is very common to use cloud computing or HPC systems with multiple GPUs attached.
Scikit Learn is not really set up for deep learning. We will have to rely on other libraries. Common choices include Google’s TensorFlow, Keras, (Py)Torch or Darknet. There is, however, an interface layer between sklearn and TensorFlow called skflow. A short example of doing this can be found at https://www.kdnuggets.com/2016/02/scikit-flow-easy-deep-learning-tensorflow-scikit-learn.html.
Cloud APIs
Google, Microsoft, Amazon, and many others now have cloud-based Application Programming Interfaces (APIs) where you can upload an image and have them return the classification result. Most of these services rely on a large pre-trained (and often proprietary) neural network.
Exercise: Try cloud image classification
Take a photo with your phone camera or find an image online of a common daily scene. Upload it to Google’s Vision AI example at https://cloud.google.com/vision/. How many objects has it correctly classified? How many did it incorrectly classify? Try the same image with Microsoft’s Computer Vision API at https://azure.microsoft.com/en-gb/services/cognitive-services/computer-vision/. Does it do any better or worse than Google?
Ethics and Implications of Machine Learning
Ethics and Machine Learning
There are increasing worries about the ethics of using machine learning. In recent years we’ve seen a number of worrying problems from machine learning entering all kinds of aspects of daily life and the economy:
 The first death from an autonomous car which failed to brake for a pedestrian. [1]
 Highly targeted advertising based around social media and internet usage. [2]
 The outcomes of elections and referendums being influenced by highly targeted social media posts. This is compounded by the data being obtained without the users’ consent. [3]
 The mass deployment of facial recognition technologies. [4]
 The possible first use of autonomous military robots making a decision to kill in battle. [5]
Problems with bias
Machine learning systems are often presented as more impartial and consistent ways to make decisions, for example when sentencing criminals or deciding if somebody should be granted bail. There have been a number of examples recently where machine learning systems have been shown to be biased because the data they were trained on was already biased. This can occur when the training data is unrepresentative and under-represents certain groups. For example, if you were trying to automatically screen job candidates and used a sample of people the same company had previously decided to employ, then any biases in their past employment processes would be reflected in the machine learning.
Problems with explaining decisions
Many machine learning systems (e.g. neural networks) can’t really explain their decisions. Although the input and output are known, trying to explain why the training caused the network to behave in a certain way can be very difficult. If a decision is questioned by a human, it’s difficult to provide any rationale as to how that decision was arrived at.
Problems with accuracy
No machine learning system is ever 100% accurate. Getting into the high 90s is usually considered good. But when we’re evaluating millions of data items, this can translate into hundreds of thousands of misidentifications. If the implications of these incorrect decisions are serious then this will cause major problems, for instance if it results in somebody being imprisoned, or even just investigated for a crime, or being denied insurance or a credit card.
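The scale of the problem is easy to check with some arithmetic. The figures below are hypothetical and chosen only to illustrate the point: a system that is 97% accurate, applied to five million items.

```python
# Hypothetical figures, chosen only for illustration.
total_items = 5_000_000
accuracy_percent = 97

# Integer arithmetic avoids floating point rounding in the result.
misidentified = total_items * (100 - accuracy_percent) // 100
print(misidentified)
```

Under these assumptions, 150,000 items would be misidentified even though the accuracy sounds high.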
Energy Usage
Many machine learning systems (especially deep learning) need vast amounts of computational power, which in turn can consume vast amounts of energy. Depending on the source of that energy, this might account for significant amounts of fossil fuels being burned. It is not uncommon for a modern GPU-accelerated computer to use several kilowatts of power. Running this for one hour could easily use as much energy as a typical home would use in an entire day. This can be particularly bad when models are constantly being retrained or when “parameter sweeps” are done to find the best set of parameters to train with.
Ethics of machine learning in research
Not all research using machine learning will have major ethical implications. Many research projects don’t directly affect the lives of other people, but this isn’t always the case.
Some questions you might want to ask yourself (and which an ethics committee might also ask you):
 Will anything your machine learning system does make a decision that somehow affects a person’s life?
 Will anything your machine learning system does make a decision that somehow affects an animal’s life?
 Will you be using any people to create your training data? Will they have to look at any disturbing or traumatic material during the training process?
 Are there any inherent biases in the dataset(s) you’re using for training?
 How much energy will this computation use? Are there more efficient ways to get the same answer?
Exercise: Ethical implications of your own research
Split into pairs or groups of three. Think of a use case for machine learning in your research areas. What ethical implications (if any) might there be from using machine learning in your research? Write down your group’s answers in the etherpad.
Find out more
Other algorithms
There are many other machine learning algorithms that might be suitable for helping to answer your research questions.
The Scikit Learn webpage has a good overview of all the features available in the library.
Ensemble Learning
Ensemble Learning is a technique which combines multiple machine learning algorithms together to improve results. A popular ensemble technique is Random Forest, which trains a “forest” of decision trees, each on a random sample of the data and features, and then combines their predictions by voting or averaging. It’s a flexible algorithm that can work both as a regression and a classification system. See the article Random Forest Simple Explanation for more information.
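Scikit Learn includes a Random Forest implementation in its sklearn.ensemble module. The sketch below is a minimal example, assuming the small bundled digits dataset and a simple slice-based train/test split; the names digits, digit_labels and forest are illustrative.

```python
import sklearn.datasets as skl_data
import sklearn.ensemble as skl_ens

digits, digit_labels = skl_data.load_digits(return_X_y=True)

# A forest of 100 decision trees; random_state makes the run repeatable.
forest = skl_ens.RandomForestClassifier(n_estimators=100, random_state=1)
forest.fit(digits[0:1500], digit_labels[0:1500])

test_score = forest.score(digits[1500:], digit_labels[1500:])
print(len(forest.estimators_), test_score)
```

The fit and score calls follow the same pattern as the MLPClassifier used earlier, which makes it straightforward to swap one classifier for another.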
Genetic Algorithms
Genetic algorithms are a technique which tries to mimic biological evolution. They will learn to solve a problem through a gradual process of simulated evolution. Each generation is mutated slightly and then evaluated with a fitness function. The fittest “genes” will then be selected for the next generation. Sometimes this is combined with neural networks to change the network’s size and structure.
This video shows a genetic algorithm evolving neural networks to play a video game.
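The mutate, evaluate and select cycle can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not a production approach: the “genes” are lists of 0s and 1s, the hypothetical fitness function simply counts the 1s, and the population sizes and mutation rate are arbitrary.

```python
import random

random.seed(1)
GENES = 20

def fitness(genes):
    # Toy fitness function: count how many genes are set to 1.
    return sum(genes)

def mutate(genes, rate=0.05):
    # Flip each gene with a small probability.
    return [1 - g if random.random() < rate else g for g in genes]

population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(30)]
initial_best = max(fitness(p) for p in population)

for generation in range(50):
    # Keep the fittest ten unchanged, then refill with mutated copies of them.
    population.sort(key=fitness, reverse=True)
    survivors = population[:10]
    population = survivors + [mutate(random.choice(survivors)) for _ in range(20)]

final_best = max(fitness(p) for p in population)
print(initial_best, final_best)
```

Because the fittest individuals are carried over unchanged, the best fitness can never go down from one generation to the next.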
Useful Resources

Machine Learning for Everyone - A useful overview of many different machine learning techniques, all introduced in an easy to follow way.
Google machine learning crash course - A quick course from Google on how to use some of their machine learning products.
Facebook Field Guide to Machine Learning - A good introduction to machine learning concepts from Facebook.
Amazon Machine Learning guide - An introduction to the key concepts in machine learning from Amazon.
Azure AI - Microsoft’s cloud-based AI platform.