
Linear and Logistic Regression

Overview

Teaching: 35 min
Exercises: 15 min
Questions
  • How can a model make predictions?

  • How do we measure the performance of predictions?

Objectives
  • Define a linear regression model.

  • Define a logistic regression model.

  • Split data into training and testing sets.

Kyphosis Data

Make sure the rpart package is loaded, and examine the structure of the kyphosis data frame.

library(rpart)
str(kyphosis)
'data.frame':	81 obs. of  4 variables:
 $ Kyphosis: Factor w/ 2 levels "absent","present": 1 1 2 1 1 1 1 1 1 2 ...
 $ Age     : int  71 158 128 2 1 1 61 37 113 59 ...
 $ Number  : int  3 3 4 5 4 2 2 3 2 6 ...
 $ Start   : int  5 14 5 1 15 16 17 16 16 12 ...

Notice that there are 81 observations of four variables, so this is a rather small data set for machine learning techniques. In this episode, we will use this data set to illustrate the process of training and testing, where our models will be built using classical linear and logistic regression.

Make a training set and a test set

The first step in the process is to create a random train/test split of our data set. We will use the training set to build our model, without looking at the testing set. After our model is built, we will measure the accuracy of its predictions using the testing set.

There are various R packages that automate common tasks in machine learning, but it is instructive to use base R for now. The following commands will randomly select the row indexes of the training set (and therefore also of the testing set).

trainSize <- round(0.75 * nrow(kyphosis))  # 75% of the rows for training
set.seed(6789) # so we all get the same random sets
trainIndex <- sample(nrow(kyphosis), trainSize)  # random row indexes for the training set

Take a look at the trainIndex variable in the Environment tab of RStudio. Since we set a particular value for the random seed, we should all see the same sample of random numbers.

Next we form two data frames using these indexes. Recall that indexing with -trainIndex selects all rows whose indexes are not in the trainIndex vector.

trainDF <- kyphosis[trainIndex, ]
testDF <- kyphosis[-trainIndex, ]

We can View the train and test sets in RStudio to check that they form a random partition of the kyphosis data.

View(trainDF)
View(testDF)
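
As a quick check that the two data frames really partition the kyphosis data, we can also verify the row counts and confirm that no row appears in both sets:

nrow(trainDF) + nrow(testDF) == nrow(kyphosis)
[1] TRUE
length(intersect(rownames(trainDF), rownames(testDF)))
[1] 0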

Linear Regression as Supervised Learning

In the previous episode, we constructed a scatterplot of Start versus Number and observed a slight negative association. A model of this relationship is given by the least squares regression line, which we can consider as a simple example of supervised learning. To compute the slope and y-intercept of this line, we use the lm function in R.

In the following code block, the formula Start ~ Number specifies that Number is the explanatory (independent) variable, and Start is the response (dependent) variable. A helpful mnemonic is to read the ~ symbol as “as explained by.” To illustrate the process of supervised learning, we fit this regression line to the training set trainDF, saving the testing set for later.

model1 <- lm(Start ~ Number, data = trainDF)
summary(model1)

Call:
lm(formula = Start ~ Number, data = trainDF)

Residuals:
    Min      1Q  Median      3Q     Max 
-11.018  -1.610   1.615   3.186   5.798 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  16.4268     1.6393  10.021 2.38e-14 ***
Number       -1.2041     0.3721  -3.236  0.00199 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.698 on 59 degrees of freedom
Multiple R-squared:  0.1507,	Adjusted R-squared:  0.1364 
F-statistic: 10.47 on 1 and 59 DF,  p-value: 0.001988

The predicted Start is obtained by multiplying Number by the regression slope -1.2041 and adding the intercept 16.4268.
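
For example, a case with 10 involved vertebrae has a predicted starting vertebra of about 4.39 (the value 10 here is arbitrary, chosen to differ from the challenge below):

16.4268 + (-1.2041) * 10
[1] 4.3858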

Challenge: Make a prediction and plot the regression line

  1. Predict the starting vertebra when the number of vertebrae involved is 3.
  2. Adding the term geom_smooth(method = "lm") to a scatterplot will produce a plot of the regression line. Modify the ggplot code from the previous episode to create a scatterplot of the training data, along with a regression line.

Solution

Three times the slope plus the intercept: 3 × (-1.2041) + 16.4268 ≈ 12.81.

library(ggplot2) # don't need this line if tidyverse is already loaded
ggplot(trainDF, aes(x = Number, y = Start)) + 
  geom_point() + 
  geom_smooth(method = "lm")

[Figure: scatterplot of the training data with the fitted least squares regression line]

Try the Testing Data Set

In R there is a generic method called predict that will make predictions given models of various types. For example, we can compute the predicted starting vertebrae for all the cases in our testing set as follows.

predictedStart <- predict(model1, testDF)

Challenge: Check our prediction

Check that the result of the predict function agrees with the result of the previous challenge.

Solution

head(predictedStart)
      12       21       32       36       37       38 
12.81438 14.01851 14.01851 12.81438 12.81438 10.40611 
head(testDF)
   Kyphosis Age Number Start
12   absent 148      3    16
21   absent  22      2    16
32   absent 125      2    11
36   absent  93      3    16
37   absent   1      3     9
38  present  52      5     6

Notice that the first row of our testing set has a Number value of 3, and the first value of predictedStart agrees with our answer to the previous challenge.

In general, the value of Start predicted by the model will not equal the actual value of Start in the testing set. However, in a good model, we would hope that the predicted values will be close to the actual values. To assess how close our predictions are to reality, we compute a vector of errors: predicted values minus actual values.

actualStart <- testDF$Start
errors <- predictedStart - actualStart
cat(round(errors, 1))
-3.2 -2 3 -3.2 3.8 4.4 0.2 2.6 10.6 2.8 -3.4 0.4 -3 -0.2 -0.2 1.6 -1.4 0.6 -5.6 -3.4

Measuring the Prediction Error

There are several ways to summarize the overall error in a regression model. The mean of the errors is not a good choice, because errors will usually have positive and negative values, which will cancel when averaged. To avoid this cancellation effect, we can take the mean of the squares of the errors: the Mean Squared Error, or MSE.

mean(errors^2)
[1] 13.14172

For a measurement of overall error in the same units as the output variable, we take the square root of the MSE to obtain the Root Mean Squared Error, or RMSE.

sqrt(mean(errors^2))
[1] 3.625151

Challenge: Mean Absolute Error

The Mean Absolute Error (MAE) is the average of the absolute values of the errors. Compute the MAE for the above example.

Solution

mean(abs(errors))
[1] 2.77752

In upcoming episodes, we will compare different regression models using the RMSE of the prediction error on the testing set.
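
Since we will use this calculation repeatedly, it is convenient to wrap it in a small helper function. Here is a minimal sketch (the name rmse is our own choice, not a function from a package):

rmse <- function(predicted, actual) {
  sqrt(mean((predicted - actual)^2))  # root mean squared error
}
rmse(predictedStart, actualStart)
[1] 3.625151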

Logistic Regression

In the previous episode, we observed that in the context of the kyphosis data, it would be natural to try to predict whether a post-op kyphosis will be present, given the age of the patient, the number of vertebrae involved, and the number of the first vertebra operated on. In this situation, Kyphosis is our categorical response variable, and Age, Number, and Start are the explanatory variables. You can see the levels of the Kyphosis variable with the following command.

levels(kyphosis$Kyphosis)
[1] "absent"  "present"

Since our response variable is categorical, we need to employ a classification model. The following command uses the glm function to fit a multiple logistic regression model to our training data.

model2 <- glm(Kyphosis ~ Age + Number + Start, data = trainDF, family = "binomial")

Notice that we specified the formula Kyphosis ~ Age + Number + Start, because Kyphosis is the response variable and Age, Number, and Start are the explanatory variables. Since Age, Number, and Start make up all the remaining columns in our data frame, we could have used the equivalent formula Kyphosis ~ ., as follows.

model2 <- glm(Kyphosis ~ ., data = trainDF, family = "binomial")

The formula Kyphosis ~ . can be read as “Kyphosis as explained by everything else.”

The predict function for binomial glm models will return predicted probabilities of the response variable if we specify the option type = "response".

predict(model2, testDF, type = "response")
         12          21          32          36          37          38 
0.060991556 0.004837329 0.066676823 0.027771160 0.034862103 0.389677526 
         39          43          44          46          47          51 
0.287882930 0.988363673 0.531083119 0.183999574 0.122064478 0.244337950 
         52          57          60          66          68          69 
0.003170923 0.014410381 0.061131542 0.069057143 0.236889844 0.056456800 
         70          76 
0.035581184 0.206575349 
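
Under the hood, a binomial glm models the log-odds of the second factor level (here, present) as a linear function of the explanatory variables, and type = "response" applies the inverse-logit transformation to that linear predictor. As a quick sketch, we can reproduce the probabilities ourselves with plogis, the inverse-logit function in base R:

logOdds <- predict(model2, testDF)  # the default type = "link" returns the linear predictor
all.equal(plogis(logOdds), predict(model2, testDF, type = "response"))
[1] TRUE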

If we want to know whether the model predicts that a kyphosis is present, we need to convert these probabilities to the appropriate levels of the Kyphosis variable. The first level, absent, corresponds to probabilities near zero, and the second level, present, corresponds to probabilities near one. So we can create a vector of the predicted Kyphosis values of our testing set as follows.

predictedKyphosis <- ifelse(predict(model2, testDF, type = "response") < 0.5,
                            "absent", "present")
predictedKyphosis
       12        21        32        36        37        38        39        43 
 "absent"  "absent"  "absent"  "absent"  "absent"  "absent"  "absent" "present" 
       44        46        47        51        52        57        60        66 
"present"  "absent"  "absent"  "absent"  "absent"  "absent"  "absent"  "absent" 
       68        69        70        76 
 "absent"  "absent"  "absent"  "absent" 

The actual occurrences of kyphosis in our testing set are given by the vector testDF$Kyphosis, so the following command will tell us which predictions were correct.

testDF$Kyphosis == predictedKyphosis
   12    21    32    36    37    38    39    43    44    46    47    51    52 
 TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE 
   57    60    66    68    69    70    76 
 TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE 

One way to measure the performance of a classification model is to report the proportion of correct predictions, known as the accuracy of the model. Recall that the sum function, when applied to a logical vector, will return the number of TRUEs in the vector.

accuracy <- sum(testDF$Kyphosis == predictedKyphosis)/nrow(testDF)
cat("Proportion of correct predictions using testing data: ", accuracy, "\n")
Proportion of correct predictions using testing data:  0.8 
cat("Testing data error rate: ", 1-accuracy, "\n")
Testing data error rate:  0.2 
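
Equivalently, since the mean of a logical vector is the proportion of TRUE values, we could have computed the accuracy in a single step:

mean(testDF$Kyphosis == predictedKyphosis)
[1] 0.8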

Challenge: Try different train/test splits

Before we constructed the models in this episode, we divided the data into a training set and a testing set. Try experimenting with different train/test splits to see if it has any effect on the model accuracy. Specifically, re-do the construction of model2, the logistic regression model, with the following modifications.

  1. Try a different value of the random seed. Does the prediction accuracy change?

  2. We originally chose a training set size that was 75% of the size of the kyphosis data frame. Experiment with different percentages for the train/test split. Does changing the size of the training set affect the accuracy of the model?

Solution

Let’s define a function to avoid repeating code.

testAccuracy <- function(splitPercent, randomSeed) {
  tSize <- round(splitPercent * nrow(kyphosis))
  set.seed(randomSeed) 
  tIndex <- sample(nrow(kyphosis), tSize)
  trnDF <- kyphosis[tIndex, ]
  tstDF <- kyphosis[-tIndex, ]
  kmod <- glm(Kyphosis ~ ., data = trnDF, family = "binomial")
  predK <- ifelse(predict(kmod, tstDF, type = "response") < 0.5, "absent", "present")
  return(sum(tstDF$Kyphosis == predK)/nrow(tstDF))
}

Try different seeds:

testAccuracy(0.75, 6789) # our original settings
[1] 0.8
testAccuracy(0.75, 1234)
[1] 0.9

So the choice of seed makes a difference.

Try different split percentages:

testAccuracy(0.9, 6789)
[1] 0.75
testAccuracy(0.5, 6789)
[1] 0.7804878
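
We could also run the function over several seeds at once using sapply. The seeds below are arbitrary, and the resulting accuracies will vary from seed to seed:

sapply(c(1234, 3456, 5678), function(s) testAccuracy(0.75, s))  # output not shown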

Key Points

  • Classical linear and logistic regression models can be thought of as examples of regression and classification models in machine learning.

  • Testing sets can be used to measure the performance of a model.