This lesson is being piloted (Beta version)
If you teach this lesson, please tell the authors and provide feedback by opening an issue in the source repository

# Artificial Intelligence (AI) and Machine Learning (ML) in a nutshell

## Overview

Teaching: 0 min
Exercises: 0 min
Questions
• What is a brief history of the field of Artificial Intelligence (AI) and Machine Learning (ML)?

• What do we mean by Artificial Intelligence and Machine Learning? How are they defined?

Objectives
• Understand Machine Learning as a subfield of AI and name two others

• Name four types of machine learning and describe the difference between supervised and unsupervised learning

• Describe the difference between a model of data, and a model trained on data

## Models and Algorithms

There is a lot of jargon and terminology in Machine Learning, and it can be confusing to a newcomer, especially when some terms have more than one meaning. The term model is one such example. In this episode we will present two definitions of model, and also introduce the term algorithm.

### Modelling the world

Whether the task is prediction or classification, the method supervised or unsupervised, the underlying principle is one of modelling the world through data.

We as humans are making models all the time. A quick look out the window and our in-built weather model, learned from past soakings, helps us make the umbrella decision. At school we learned some of the simplest models for numerical data - the average, the maximum and the minimum. If I were to guess the height of the next person to walk through the door, the average would be a good model to use. If I were building the door then that would not be a good model as everyone above average height would have to stoop to enter (unless perhaps I were a Medieval Monarch…), so the maximum height might be a better starting point.

If average is the conceptual model I’ve chosen then I need some data, say, the height of all 19 year olds in Finland. To apply this model to the data I need an algorithm. We will use a simple algorithm that you probably learned as a child: add up all the heights and then divide by the number of measurements. The value returned is 173.5cm (according to Wikipedia). That value is now my trained model. If you give me the name of any 19 year old Finnish person my model will tell you their height - 173.5cm.
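The two meanings of model, and the algorithm that links them, can be sketched in a few lines of Python. This is only an illustration: the heights are made-up values rather than the Finnish data mentioned above, and the function names are hypothetical.

```python
from statistics import mean

# Made-up example heights in cm (illustrative only, not real survey data).
heights_cm = [168.0, 181.5, 175.0, 162.5, 179.0, 171.0]

def train_average_model(data):
    """The algorithm: add up all the values and divide by how many there are."""
    return mean(data)

# The trained model is simply the number the algorithm returns.
trained_model = train_average_model(heights_cm)

def predict_height(name, model=trained_model):
    """The trained model ignores the name and always gives the same answer."""
    return model

print(f"Trained model: {trained_model:.1f} cm")
print(f"Predicted height for 'Aino': {predict_height('Aino'):.1f} cm")
```

Here "the average" is the conceptual model, `train_average_model` is the algorithm, and the single number stored in `trained_model` is the trained model.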

It will be wrong a lot of the time but there is a famous aphorism in the statistical and machine learning communities:

“All models are wrong, some are useful” - George Box (1976)

In machine learning we have lots of models to choose from, and choosing a model depends on the type of data, the amount of data, the purpose, and a certain amount of pragmatism.

Before getting to GLAM-specific scenarios and data, we will look at simple data to develop some intuition around modelling concepts. The following diagram shows four plots of data points with various linear models fit to the data. The data points were generated by a process that is unknown to the reader: a mathematical function that takes a number in and returns another number. The task of machine learning is to estimate what that function might be using example data generated by it (if we knew what the function was we wouldn’t need ML). To make the task harder for the algorithm, random noise (errors) has been added to the data to mask the true process. This replicates what we experience in the real world, where there is always something we can’t capture in our models, or some error in our measurements. Typical examples we may encounter in cultural heritage are errors in transcription, or poorly digitised images.

The image contains 4 plots, labelled Model A-D. The horizontal axis represents the input values and the vertical axis the outputs. Notice there are no labels on the axes. What the axes represent is not important to the algorithm; it is only mapping an input value to an output - two numbers.

Model A is specified so that it can have up to two bends in the line it draws. This is the conceptual model we have chosen to test on this data. The crosses represent 15 training data points for the model to fit itself to. To implement our conceptual model with this data we have chosen the Least Squares algorithm (first used by Carl Friedrich Gauss in around 1795) which is designed to find the line which minimises the total squared distance between the line and the crosses. The dotted red line is the trained model calculated by the algorithm and the dashed grey line is the true model, or the correct answer (which we only know because this is a simulation). Our model looks to have done pretty well. This trained model is defined as a mathematical formula which can now be used to make predictions for new data.
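To make the idea of least squares more concrete, here is a minimal sketch in plain Python. It assumes the "two bends" model is a cubic polynomial, and the true function, the noise level and all function names are invented for illustration; they are not the ones used to draw the plots.

```python
import random

def fit_polynomial(xs, ys, degree):
    """Least squares fit. Returns c such that y ~ c[0] + c[1]*x + ... + c[degree]*x**degree."""
    n = degree + 1
    # Normal equations A c = b, with A[i][j] = sum(x**(i+j)) and b[i] = sum(y * x**i).
    A = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = A[row][col] / A[col][col]
            for k in range(col, n):
                A[row][k] -= factor * A[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        known = sum(A[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - known) / A[i][i]
    return coeffs

def predict(coeffs, x):
    """Evaluate the polynomial formula for one input value."""
    return sum(c * x ** i for i, c in enumerate(coeffs))

def total_squared_error(coeffs, xs, ys):
    """The quantity least squares minimises: summed squared gaps between line and crosses."""
    return sum((predict(coeffs, x) - y) ** 2 for x, y in zip(xs, ys))

# Hypothetical "true" process with two bends: 0.5 - x + 2x^3 (an assumption for this sketch).
true_coeffs = [0.5, -1.0, 0.0, 2.0]

random.seed(0)
xs = [-1 + 2 * i / 14 for i in range(15)]                          # 15 training inputs
ys = [predict(true_coeffs, x) + random.gauss(0, 0.2) for x in xs]  # outputs with added noise

trained = fit_polynomial(xs, ys, degree=3)
print("trained coefficients:", [round(c, 3) for c in trained])
print("squared error, trained model:", round(total_squared_error(trained, xs, ys), 4))
print("squared error, true model:   ", round(total_squared_error(true_coeffs, xs, ys), 4))
```

On the training points the trained model can never have a larger squared error than the true curve, because least squares picks the best-fitting cubic and the true curve is just one of the candidates. How well the trained model predicts new data is a separate question.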

With fewer example data points we may have to compromise on the complexity of the conceptual model. Model B has been trained using only 3 data points, and so a simpler straight line conceptual model, no bends allowed, is the best we can use. The algorithm is still least squares and the line is very close to the training points, but we can see it is far away from the dashed line true model. It will not predict well but it may still be useful and satisfy our requirements, depending on what it is being used for. When the model is simpler than the true function, we call it underfitting.

Model C was trained on 35 example data points. With more data we can try out more complex conceptual models. The red dotted line represents a very complex model which is able to draw quite intricate curves, in fact it is allowed to bend 12 times. If we compare this to the smooth dashed line of Model A we can see that the red line is too wiggly. It is following the data points more closely because it has been given the freedom to do so. This is an example of a phenomenon known as overfitting. The model has fit itself to the training data but it has learned the wrong pattern and would not generalise well to data it has not seen before - it is learning as much about the added noise as the true function.
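Overfitting can be illustrated with an extreme case. The sketch below does not reproduce Model C: instead of a least squares fit it uses an interpolating polynomial (Lagrange interpolation), a model with so much freedom that it passes through every training point exactly. The true function, noise level and data sizes are assumptions made for this example.

```python
import random

def true_function(x):
    """Hypothetical smooth process with two bends, invented for this sketch."""
    return 0.5 - x + 2 * x ** 3

def interpolate(train_x, train_y, x):
    """Lagrange interpolation: the polynomial that passes through every training point."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(train_x, train_y)):
        weight = 1.0
        for k, xk in enumerate(train_x):
            if k != i:
                weight *= (x - xk) / (xi - xk)
        total += yi * weight
    return total

def mean_squared_error(train_x, train_y, xs, ys):
    """Average squared gap between the model's predictions and the given outputs."""
    errors = [(interpolate(train_x, train_y, x) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)

random.seed(1)
noise = 0.2  # assumed size of the random errors

# 10 evenly spaced training inputs, and 9 unseen inputs halfway between them.
train_x = [-1 + 2 * i / 9 for i in range(10)]
test_x = [-1 + 2 * (i + 0.5) / 9 for i in range(9)]
train_y = [true_function(x) + random.gauss(0, noise) for x in train_x]
test_y = [true_function(x) + random.gauss(0, noise) for x in test_x]

train_mse = mean_squared_error(train_x, train_y, train_x, train_y)
test_mse = mean_squared_error(train_x, train_y, test_x, test_y)
print(f"error on training data: {train_mse:.4f}")
print(f"error on unseen data:   {test_mse:.4f}")
```

The training error is zero by construction, because the model has memorised the noisy training points. The error on the unseen points is not zero, which is the gap between fitting the training data and generalising that overfitting describes.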

In machine learning we always work with two sets of data: the training data is used to train the model, and the testing data is used to verify whether the model will generalise to unseen data. It is important that the training data and testing data are both representative of the real-world data we will use our model against in an application, but they must not overlap with each other. The process of training a model is often called the Data Science Lifecycle. It is an iterative process of finding the best model using training data. The very end of this process is when the test data is used, and often in competitive situations (e.g. an academic competition, or private companies competing for a tender) the test data is not seen by the data scientists at all.
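A minimal sketch of splitting one dataset into non-overlapping training and testing sets is shown below. The example data, the 80/20 split and the function name `train_test_split` are choices made for this illustration; machine learning libraries provide their own, more capable versions.

```python
import random

# Hypothetical labelled examples as (input, output) pairs, invented for this sketch.
examples = [(x, 2 * x + 1) for x in range(20)]

def train_test_split(data, test_fraction=0.2, seed=42):
    """Shuffle a copy of the data, then cut it into two non-overlapping parts."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train, test = train_test_split(examples)
print(len(train), "training examples;", len(test), "testing examples")
print("overlap between the two sets:", set(train) & set(test))
```

Shuffling before cutting helps both sets to be representative of the whole, and setting the test portion aside at the start makes it harder to let it influence model choices by accident.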

Model D again shows the same conceptual model as Model C but this time it has been trained on 500 data points. Now it is following the smooth curve of the true model. With more data available the complex model has now been able to see through the noise and find the true pattern in the data.

The point of plots C & D, and to some extent B, is that we should try to choose the simplest model for our purposes. This is known as Occam’s Razor. A complex model can learn any pattern given enough data, but often the simpler one is enough or even better and needs less data (which we know is resource intensive to create).

### Interpreting Models

Another aspect of model selection is how easily a model’s workings can be explained. The mathematical formula for the true model consists of 4 parameters which are multiplied by combinations of the input variable, but model C has 13. Being able to explain a model becomes especially important as we move to models which use millions of parameters, such as those used for image recognition. This is a theme which will be addressed in a later episode.

### Modelling summary

• With small amounts of data available we need to simplify our model of the world
• An overly complex model can fit too closely to training data, a process called overfitting
• The opposite of overfitting is underfitting. This means the model is too simple, as in the case of the straight line model.
• A simple model may still be good enough for our purposes
• If we want to model the complexities of the world, we need more data

### Feature engineering

One of the trends seen in machine learning, especially in the field of deep learning, is that the more training data available, the better the results, and the more complex the problems that can be solved. While this is true, it is not helpful if you don’t have much training data available and need to work with simpler conceptual models but still want to solve complex problems.

One solution to this is to use expert domain knowledge of the data to augment it with additional features. This process is known as Feature engineering. As a worked example, we will consider a column of dates. To a computer a date (without the time portion) is a sequential number which represents the number of days since a certain fixed point in time (e.g. 1st January 1970). As a number it doesn’t contain any of the periodic information that a date in other formats does. As domain experts we can add some insight to the machine learning process by representing a date in different ways.

If my task was to predict visitor numbers to a museum I might convert a date to a weekday/weekend indicator, or match it to a school holiday calendar. The following diagram shows several different representations of the same date:
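As a rough sketch of what such representations might look like in code, the function below turns a single date into several features. The school holiday periods and the feature names are invented for this example; a real project would use the calendar that applies to its own visitors.

```python
from datetime import date

WEEKDAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# Hypothetical school holiday periods as (start, end), invented for this sketch.
SCHOOL_HOLIDAYS = [
    (date(2023, 7, 1), date(2023, 8, 31)),
    (date(2023, 12, 22), date(2024, 1, 5)),
]

def date_features(d):
    """Represent one date in several ways that a model could use as separate features."""
    return {
        "days_since_epoch": (d - date(1970, 1, 1)).days,  # the plain sequential number
        "day_of_week": WEEKDAY_NAMES[d.weekday()],        # weekday() gives Monday as 0
        "is_weekend": d.weekday() >= 5,                   # Saturday is 5, Sunday is 6
        "month": d.month,
        "is_school_holiday": any(start <= d <= end for start, end in SCHOOL_HOLIDAYS),
    }

features = date_features(date(2023, 7, 15))
for name, value in features.items():
    print(f"{name}: {value}")
```

The sequential number alone gives a model no hint that this day is a Saturday in the summer holidays, whereas the engineered features state it directly.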

### Embeddings

Machine Learning algorithms require numerical inputs, which often means we need to convert our data into numerical form. In the case of text this can be problematic. One option is to have each word as a feature, but then our model can only use words appearing in the training data. Also we lose the semantic nature of words - for example, the words car and automobile are treated as independent features. Word embeddings allow us to capture relationships between words by placing them in a large multi-dimensional (numerical) space, such that words with similar meanings are in some way closer to each other. The multi-dimensional nature allows for the multiple meanings of many words. For example, Date is a representation of time, a fruit, and an evening out with dinner.
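The limits of the "each word as a feature" option can be seen in a small sketch. The sentences and function names below are made up for illustration, and real text pipelines do considerably more (handling punctuation, for example).

```python
def build_vocabulary(documents):
    """Collect every distinct word seen in the training documents, in a fixed order."""
    return sorted({word for doc in documents for word in doc.lower().split()})

def to_features(document, vocabulary):
    """Count how often each vocabulary word occurs; words outside the vocabulary are dropped."""
    words = document.lower().split()
    return [words.count(term) for term in vocabulary]

# Made-up training sentences for illustration.
training_docs = ["the red car", "the old automobile"]
vocabulary = build_vocabulary(training_docs)

print(vocabulary)
print(to_features("the red car", vocabulary))
print(to_features("the old automobile", vocabulary))
print(to_features("the blue bicycle", vocabulary))  # 'blue' and 'bicycle' were never seen
```

In this representation car and automobile occupy separate positions with nothing to link them, and words that were not in the training data simply disappear. Word embeddings are one way of addressing both problems.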

To illustrate creating word embeddings, here is a simple two dimensional representation of different fruits. Each fruit is represented by two numbers: their ‘roundness’, and their colour on a ‘rainbow’ scale (Red=0.0, Violet=1.0). In reality we wouldn’t define the dimensions, and they wouldn’t be interpretable in this way. Now if we wanted to represent Kiwi numerically it would be [0.7,0.5]. We also see that Banana is most similar to Lemon, so perhaps we would need to add a ‘Citrus’ dimension for a better representation.

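| Colour (rows) / Roundness (columns) | 0.0 | 0.4 | 0.7 | 1.0 |
| --- | --- | --- | --- | --- |
| R (0.0) | | Strawberry | | |
| O (0.16) | | | | Orange |
| Y (0.33) | Banana | Lemon | | |
| G (0.5) | | | Kiwi | |
| B (0.66) | | | | Blueberry |
| I (0.83) | | | | |
| V (1.0) | | | Plum | |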
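Once words are represented as lists of numbers, "closeness" can be measured. The sketch below uses straight-line (Euclidean) distance on the fruit grid above. Kiwi’s values are the ones given in the text; the other coordinates are one reading of the grid and should be treated as assumptions, as should the choice of distance measure.

```python
import math

# [roundness, colour] for each fruit. Only Kiwi's values are stated in the text;
# the rest are read off the grid and are assumptions made for this sketch.
fruit_vectors = {
    "Strawberry": [0.4, 0.0],
    "Orange": [1.0, 0.16],
    "Banana": [0.0, 0.33],
    "Lemon": [0.4, 0.33],
    "Kiwi": [0.7, 0.5],
    "Blueberry": [1.0, 0.66],
    "Plum": [0.7, 1.0],
}

def distance(a, b):
    """Straight-line distance between two points in the embedding space."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def nearest(name):
    """Return the other fruit whose vector is closest to the named fruit."""
    others = (other for other in fruit_vectors if other != name)
    return min(others, key=lambda other: distance(fruit_vectors[name], fruit_vectors[other]))

print("Nearest to Banana:", nearest("Banana"))
for other in ("Lemon", "Strawberry", "Plum"):
    d = distance(fruit_vectors["Banana"], fruit_vectors[other])
    print(f"Banana to {other}: {d:.2f}")
```

With only two hand-picked dimensions, similarity reflects shape and colour and nothing else. Learned embeddings use many more dimensions, and those dimensions are not chosen or labelled by a person.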

## Key Points

• Machine Learning is a subfield of AI which identifies patterns in data

• Supervised learning algorithms learn by example

• Unsupervised learning algorithms put data into groups of similar objects or records