This lesson is being piloted (Beta version)
If you teach this lesson, please tell the authors and provide feedback by opening an issue in the source repository

Making predictions from a logistic regression model

Overview

Teaching: 20 min
Exercises: 20 min
Questions
  • How can we calculate predictions from a logistic regression model manually?

  • How can we calculate predictions from a logistic regression model in R?

Objectives
  • Calculate predictions in terms of the log odds, the odds and the probability of success from a logistic regression model using parameter estimates given by the model output.

  • Use the make_predictions() function from the jtools package to generate predictions from a logistic regression model in terms of the log odds, the odds and the probability of success.

As with the linear regression models, the logistic regression model allows us to make predictions. First we will calculate predictions of the log odds, the odds and the probability of success using the model equations. Then we will see how R can calculate predictions for us using the make_predictions() function.

Calculating predictions manually

Let us use the SmokeNow_Age model from the previous episode. The equation for this model in terms of the log odds was:

\[\text{logit}(E(\text{SmokeNow})) = 2.60651 - 0.05423 \times \text{Age}.\]

Therefore, for a 30-year-old individual, the model predicts a log odds of

\[\text{logit}(E(\text{SmokeNow})) = 2.60651 - 0.05423 \times 30 = 0.97961.\]
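The same arithmetic can be sketched in R. The coefficients below are copied from the rounded equation above rather than extracted from the fitted model, and the helper name log_odds_smoke is hypothetical, introduced only for illustration.

```r
# Coefficients copied from the model equation above (rounded to five decimals)
intercept <- 2.60651
slope_age <- -0.05423

# Hypothetical helper: predicted log odds of SmokeNow = Yes at a given age
log_odds_smoke <- function(age) {
  intercept + slope_age * age
}

# Predicted log odds for a 30-year-old individual
log_odds_30 <- log_odds_smoke(30)
print(log_odds_30)
```

Because the helper is vectorised, a call such as log_odds_smoke(c(30, 50, 70)) would return the log odds for several ages at once.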

Since the odds are more interpretable than the log odds, we can convert our log odds prediction to the odds scale. We do so by exponentiating the log odds:

\[\frac{E(\text{SmokeNow})}{1-E(\text{SmokeNow})} = e^{0.97961} = 2.663.\]

Therefore, the model predicts that, at the age of 30, individuals who have been smokers are 2.663 times as likely to still be smokers as not.
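The conversion from log odds to odds can be checked with base R's exp(). This is a minimal sketch that again uses the rounded coefficients from the equation, so the result may differ slightly from one computed on the unrounded model estimates.

```r
# Log odds for a 30-year-old, from the rounded model equation
log_odds_30 <- 2.60651 - 0.05423 * 30

# Exponentiating moves from the log odds scale to the odds scale
odds_30 <- exp(log_odds_30)
print(round(odds_30, 3))

# Taking the natural log of the odds recovers the log odds we started from
print(log(odds_30))
```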

Recall that the model could also be expressed in terms of the probability of “success”:

\[\text{Pr}(\text{SmokeNow} = \text{Yes}) = \text{logit}^{-1}(2.60651 - 0.05423 \times \text{Age}).\]

In R, we can calculate the inverse logit using the inv.logit() function from the boot package. For a 30-year-old individual, the model predicts the following probability of $\text{SmokeNow} = \text{Yes}$:

library(boot)  # provides inv.logit()
inv.logit(2.60651 - 0.05423 * 30)
[1] 0.7270308

Or in mathematical notation: $\text{Pr}(\text{SmokeNow} = \text{Yes}) = \text{logit}^{-1}(2.60651 - 0.05423 \times 30) = 0.727.$
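As a cross-check that needs no extra package, the inverse logit can be written out as $1 / (1 + e^{-x})$, which is also what base R's plogis() computes. The sketch below compares three routes to the same probability. The function name inv_logit_manual is hypothetical, and the coefficients are the rounded ones from the equation.

```r
# Hypothetical helper: the inverse logit written out explicitly
inv_logit_manual <- function(x) {
  1 / (1 + exp(-x))
}

# Log odds for a 30-year-old, from the rounded model equation
log_odds_30 <- 2.60651 - 0.05423 * 30

# Route 1: explicit formula
p_manual <- inv_logit_manual(log_odds_30)

# Route 2: base R's logistic distribution function
p_plogis <- plogis(log_odds_30)

# Route 3: via the odds, since probability = odds / (1 + odds)
odds_30 <- exp(log_odds_30)
p_from_odds <- odds_30 / (1 + odds_30)

print(c(manual = p_manual, plogis = p_plogis, from_odds = p_from_odds))
```

All three values should agree with the inv.logit() result shown above, up to floating-point error.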

Exercise

Given the summ output from our PhysActive_FEV1 model, the model can be described as \(\text{logit}(E(\text{PhysActive})) = -1.18602 + 0.00046 \times \text{FEV1}.\)
A) Calculate the log odds of physical activity predicted by the model for an individual with an FEV1 of 3000.
B) Calculate the odds of physical activity predicted by the model for an individual with an FEV1 of 3000. How many times as likely is the individual to be physically active as not?
C) Using the inv.logit() function from the package boot, calculate the probability of an individual with an FEV1 of 3000 being physically active.

Solution

A) $\text{logit}(E(\text{PhysActive})) = -1.18602 + 0.00046 \times 3000 = 0.194.$
B) $e^{0.194} = 1.21$, so the individual is 1.21 times as likely to be physically active as not. This can be calculated in R as follows:

exp(0.194)
[1] 1.214096

C) $\text{Pr}(\text{PhysActive}) = \text{logit}^{-1}(-1.18602 + 0.00046 \times 3000) = 0.548.$ This can be calculated in R as follows:

inv.logit(-1.18602 + 0.00046 * 3000)
[1] 0.5483435

Calculating predictions using make_predictions()

Using the make_predictions() function brings two advantages. First, when calculating multiple predictions, we are saved the effort of inserting multiple values into our model manually and doing the calculations. Second, make_predictions() returns 95% confidence intervals around the predictions, giving us a sense of the uncertainty around the predictions.

To use make_predictions(), we need to create a tibble with the explanatory variable values for which we wish to have mean predictions from the model. We do this using the tibble() function. Note that the column name must correspond to the name of the explanatory variable in the model, i.e. Age. In the code below, we create a tibble with the values 30, 50 and 70. We then provide make_predictions() with this tibble, alongside the model from which we wish to have predictions.

Recall that we can calculate predictions on the log odds or the probability scale. To obtain predictions on the log odds scale, we include outcome.scale = "link" in our make_predictions() command. For example:

predictionDat <- tibble(Age = c(30, 50, 70)) #data for which we wish to predict

make_predictions(SmokeNow_Age, new_data = predictionDat,
                 outcome.scale = "link")
# A tibble: 3 × 4
    Age SmokeNow    ymax   ymin
  <dbl>    <dbl>   <dbl>  <dbl>
1    30    0.980  1.11    0.851
2    50   -0.105 -0.0257 -0.184
3    70   -1.19  -1.07   -1.31 

From the output we can see that the model predicts a log odds of -0.105 for a 50-year-old individual. The 95% confidence interval around this prediction is [-0.184, -0.0257].
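To give a feel for where an interval on the log odds scale can come from, the sketch below uses base R only. It rests on several assumptions: the data are simulated (they are not the NHANES data behind SmokeNow_Age), the names sim, sim_model and link_preds are hypothetical, and the interval is a Wald-type interval (estimate plus or minus about 1.96 standard errors). Whether make_predictions() computes its intervals in exactly this way is not confirmed here.

```r
# Simulated data for illustration only, NOT the data used in this lesson
set.seed(1)
n <- 500
sim <- data.frame(Age = runif(n, min = 20, max = 80))
sim$SmokeNow <- rbinom(n, size = 1, prob = plogis(2.6 - 0.054 * sim$Age))

# Fit a logistic regression model to the simulated data
sim_model <- glm(SmokeNow ~ Age, data = sim, family = binomial)

# Predict on the log odds (link) scale, keeping the standard errors
new_ages <- data.frame(Age = c(30, 50, 70))
pred <- predict(sim_model, newdata = new_ages, type = "link", se.fit = TRUE)

# Wald-type 95% interval: estimate +/- z * standard error
z <- qnorm(0.975)
link_preds <- data.frame(
  Age      = new_ages$Age,
  log_odds = pred$fit,
  ymax     = pred$fit + z * pred$se.fit,
  ymin     = pred$fit - z * pred$se.fit
)
print(link_preds)
```

The numbers printed will not match the lesson's output, since the data are simulated. The point is only the shape of the calculation.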

To calculate predictions on the probability scale, we include outcome.scale = "response" in our make_predictions() command:

make_predictions(SmokeNow_Age, new_data = predictionDat,
                 outcome.scale = "response")
# A tibble: 3 × 4
    Age SmokeNow  ymax  ymin
  <dbl>    <dbl> <dbl> <dbl>
1    30    0.727 0.752 0.701
2    50    0.474 0.494 0.454
3    70    0.233 0.256 0.212

From the output we can see that the model predicts a probability of still smoking of 0.474 for a 50-year-old individual. The 95% confidence interval around this prediction is [0.454, 0.494].
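The two outputs above are linked by the inverse logit. As a quick consistency check, the sketch below types in the rounded values from both tables and applies base R's plogis() to the log odds table. The data frame names are hypothetical, and because the inputs are rounded to three significant figures, agreement is only expected to within rounding.

```r
# Values copied by hand from the outcome.scale = "link" output above
link_scale <- data.frame(
  Age      = c(30, 50, 70),
  SmokeNow = c(0.980, -0.105, -1.19),
  ymax     = c(1.11, -0.0257, -1.07),
  ymin     = c(0.851, -0.184, -1.31)
)

# Values copied by hand from the outcome.scale = "response" output above
response_scale <- data.frame(
  Age      = c(30, 50, 70),
  SmokeNow = c(0.727, 0.474, 0.233),
  ymax     = c(0.752, 0.494, 0.256),
  ymin     = c(0.701, 0.454, 0.212)
)

# Apply the inverse logit to the prediction and both interval limits
cols <- c("SmokeNow", "ymax", "ymin")
back_transformed <- link_scale
back_transformed[cols] <- lapply(link_scale[cols], plogis)
print(round(back_transformed, 3))

# Largest absolute difference from the probability-scale table
max_diff <- max(abs(as.matrix(back_transformed[cols]) -
                    as.matrix(response_scale[cols])))
print(max_diff)
```

A small max_diff suggests that the probability-scale predictions and interval limits correspond to the log odds ones passed through the inverse logit.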

Exercise

Using the make_predictions() function and the PhysActive_FEV1 model:
A) Obtain the log odds of the expectation of physical activity for individuals with an FEV1 of 2000, 3000 or 4000. Ensure that your predictions include confidence intervals.
B) Exponentiate the log odds at an FEV1 of 4000. How many times as likely is an individual with an FEV1 of 4000 to be physically active as not?
C) Obtain the probabilities of individuals with an FEV1 of 2000, 3000 or 4000 being physically active. Ensure that your predictions include confidence intervals.

Solution

A) Including outcome.scale = "link" gives us predictions on the log odds scale:

predictionDat <- tibble(FEV1 = c(2000, 3000, 4000))

make_predictions(PhysActive_FEV1, new_data = predictionDat,
                 outcome.scale = "link")
# A tibble: 3 × 4
   FEV1 PhysActive   ymax   ymin
  <dbl>      <dbl>  <dbl>  <dbl>
1  2000     -0.261 -0.175 -0.347
2  3000      0.202  0.255  0.148
3  4000      0.664  0.740  0.589

B) $e^{0.664} = 1.94$, so an individual with an FEV1 of 4000 is 1.94 times as likely to be physically active as not.
C) Including outcome.scale = "response" gives us predictions on the probability scale:

make_predictions(PhysActive_FEV1, new_data = predictionDat,
                 outcome.scale = "response")
# A tibble: 3 × 4
   FEV1 PhysActive  ymax  ymin
  <dbl>      <dbl> <dbl> <dbl>
1  2000      0.435 0.456 0.414
2  3000      0.550 0.563 0.537
3  4000      0.660 0.677 0.643
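The outputs in this episode are on the log odds and probability scales. One way to obtain odds together with interval limits is to exponentiate the log odds predictions and their limits. Since exp() is monotonically increasing, the lower and upper limits stay in order. The sketch below does this for the rounded values from part A, typed in by hand. The data frame names are hypothetical.

```r
# Values copied by hand from the outcome.scale = "link" output in part A
link_preds <- data.frame(
  FEV1       = c(2000, 3000, 4000),
  PhysActive = c(-0.261, 0.202, 0.664),
  ymax       = c(-0.175, 0.255, 0.740),
  ymin       = c(-0.347, 0.148, 0.589)
)

# Exponentiate the prediction and both interval limits to get odds
cols <- c("PhysActive", "ymax", "ymin")
odds_preds <- link_preds
odds_preds[cols] <- lapply(link_preds[cols], exp)
print(round(odds_preds, 2))
```

The row for an FEV1 of 4000 should reproduce the odds calculated in part B, now with lower and upper limits alongside.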

Key Points

  • Predictions of the log odds, the odds and the probability of success can be manually calculated using the model’s equation.

  • Predictions of the log odds, the odds and the probability of success alongside 95% CIs can be obtained using the make_predictions() function.