Making predictions from a logistic regression model
Overview
Teaching: 20 min
Exercises: 20 min
Questions
How can we calculate predictions from a logistic regression model manually?
How can we calculate predictions from a logistic regression model in R?
Objectives
Calculate predictions in terms of the log odds, the odds and the probability of success from a logistic regression model using parameter estimates given by the model output.
Use the make_predictions() function from the jtools package to generate predictions from a logistic regression model in terms of the log odds, the odds and the probability of success.
As with the linear regression models, the logistic regression model allows us to make predictions. First we will calculate predictions of the log odds, the odds and the probability of success using the model equations. Then we will see how R can calculate predictions for us using the make_predictions() function.
Calculating predictions manually
Let us use the SmokeNow_Age model from the previous episode. The equation for this model in terms of the log odds was:
\[\text{logit}(E(\text{SmokeNow})) = 2.60651 - 0.05423 \times \text{Age}.\]
Therefore, for a 30-year old individual, the model predicts a log odds of
\[\text{logit}(E(\text{SmokeNow})) = 2.60651 - 0.05423 \times 30 = 0.97961.\]
Since the odds are more interpretable than the log odds, we can convert our log odds prediction to the odds scale. We do so by exponentiating the log odds:
\[\frac{E(\text{SmokeNow})}{1-E(\text{SmokeNow})} = e^{0.97961} = 2.663.\]
Therefore, the model predicts that, at the age of 30, individuals who have been smokers are 2.663 times as likely to still be smokers as not.
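The two steps above can also be reproduced in base R. The following is a minimal sketch: the coefficients are copied from the model equation, and the object names (intercept, slope, age, log_odds, odds) are chosen here purely for illustration.

```r
# Coefficients of the SmokeNow_Age model, copied from the equation above
intercept <- 2.60651
slope <- -0.05423

# Age for which we want a prediction
age <- 30

# Prediction on the log odds scale
log_odds <- intercept + slope * age

# Exponentiate the log odds to move to the odds scale
odds <- exp(log_odds)

round(log_odds, 5)
round(odds, 3)
```

Changing age to a different value, or to a vector such as c(30, 50, 70), returns the corresponding predictions without rewriting the arithmetic.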
Recall that the model could also be expressed in terms of the probability of “success”:
\[\text{Pr}(\text{SmokeNow} = \text{Yes}) = \text{logit}^{-1}(2.60651 - 0.05423 \times \text{Age}).\]
In R, we can calculate the inverse logit using the inv.logit() function from the boot package. Therefore, for a 30-year old individual, the model predicts the following probability of $\text{SmokeNow} = \text{Yes}$:
inv.logit(2.60651 - 0.05423 * 30)
[1] 0.7270308
Or in mathematical notation: $\text{Pr}(\text{SmokeNow} = \text{Yes}) = \text{logit}^{-1}(2.60651 − 0.05423 \times 30) = 0.727.$
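For reference, the inverse logit is the standard logistic function, so the same probability can be worked out by hand from the log odds, or equivalently from the odds of 2.663 calculated earlier:

\[\text{logit}^{-1}(x) = \frac{e^{x}}{1 + e^{x}}, \qquad \text{Pr}(\text{SmokeNow} = \text{Yes}) = \frac{e^{0.97961}}{1 + e^{0.97961}} = \frac{2.663}{1 + 2.663} = 0.727.\]

Base R's plogis() function computes the same quantity, should the boot package not be available.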
Exercise
Given the summ output from our PhysActive_FEV1 model, the model can be described as \(\text{logit}(E(\text{PhysActive})) = -1.18602 + 0.00046 \times \text{FEV1}.\)
A) Calculate the log odds of physical activity predicted by the model for an individual with an FEV1 of 3000.
B) Calculate the odds of physical activity predicted by the model for an individual with an FEV1 of 3000. How many times more likely is the individual to be physically active than not?
C) Using the inv.logit() function from the boot package, calculate the probability of an individual with an FEV1 of 3000 being physically active.
Solution
A) $\text{logit}(E(\text{PhysActive})) = -1.18602 + 0.00046 \times 3000 = 0.194.$
B) $e^{0.194} = 1.21$, so the individual is 1.21 times as likely to be physically active as not. This can be calculated in R as follows:
exp(0.194)
[1] 1.214096
C) $\text{Pr}(\text{PhysActive}) = \text{logit}^{-1}(-1.18602 + 0.00046 \times 3000) = 0.548.$ This can be calculated in R as follows:
inv.logit(-1.18602 + 0.00046 * 3000)
[1] 0.5483435
Calculating predictions using make_predictions()
Using the make_predictions() function brings two advantages. First, when calculating multiple predictions, we are saved the effort of inserting multiple values into our model manually and doing the calculations. Second, make_predictions() returns 95% confidence intervals around the predictions, giving us a sense of the uncertainty around the predictions.
To use make_predictions(), we need to create a tibble with the explanatory variable values for which we wish to have mean predictions from the model. We do this using the tibble() function. Note that the column name must correspond to the name of the explanatory variable in the model, i.e. Age. In the code below, we create a tibble with the values 30, 50 and 70. We then provide make_predictions() with this tibble, alongside the model from which we wish to have predictions.
Recall that we can calculate predictions on the log odds or the probability scale. To obtain predictions on the log odds scale, we include outcome.scale = "link" in our make_predictions() command. For example:
predictionDat <- tibble(Age = c(30, 50, 70)) # data for which we wish to predict
make_predictions(SmokeNow_Age, new_data = predictionDat,
                 outcome.scale = "link")
# A tibble: 3 × 4
    Age SmokeNow    ymax   ymin
  <dbl>    <dbl>   <dbl>  <dbl>
1    30    0.980  1.11    0.851
2    50   -0.105 -0.0257 -0.184
3    70   -1.19  -1.07   -1.31
From the output we can see that the model predicts a log odds of -0.105 for a 50-year old individual. The 95% confidence interval around this prediction is [-0.184, -0.0257].
To calculate predictions on the probability scale, we include outcome.scale = "response" in our make_predictions() command:
make_predictions(SmokeNow_Age, new_data = predictionDat,
                 outcome.scale = "response")
# A tibble: 3 × 4
    Age SmokeNow  ymax  ymin
  <dbl>    <dbl> <dbl> <dbl>
1    30    0.727 0.752 0.701
2    50    0.474 0.494 0.454
3    70    0.233 0.256 0.212
From the output we can see that the model predicts a probability of still smoking of 0.474 for a 50-year old individual. The 95% confidence interval around this prediction is [0.454, 0.494].
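The two outputs are connected by the inverse logit, and the odds can be obtained from the log odds by exponentiation. The sketch below illustrates this in base R under a few assumptions: the log odds predictions are typed in by hand from the rounded output above rather than taken from the model object, the object names (link_preds, scale_cols, odds_preds, prob_preds) are hypothetical, and plogis() is used as base R's inverse logit. The interval endpoints are transformed in the same way as the predictions. In this output the results agree with the printed probability-scale intervals, apart from small differences caused by starting from rounded values.

```r
# Log odds predictions and interval endpoints, typed in from the
# rounded outcome.scale = "link" output (not extracted from the model)
link_preds <- data.frame(
  Age      = c(30, 50, 70),
  SmokeNow = c(0.980, -0.105, -1.19),
  ymax     = c(1.11, -0.0257, -1.07),
  ymin     = c(0.851, -0.184, -1.31)
)

# Columns that are on the log odds scale
scale_cols <- c("SmokeNow", "ymax", "ymin")

# Odds scale: exponentiate the log odds
odds_preds <- link_preds
odds_preds[scale_cols] <- lapply(link_preds[scale_cols], exp)

# Probability scale: apply the inverse logit
prob_preds <- link_preds
prob_preds[scale_cols] <- lapply(link_preds[scale_cols], plogis)

round(odds_preds, 3)
round(prob_preds, 3)
```

For the 50-year old individual, for example, the inverse logit of -0.105 is approximately 0.474, which matches the probability reported on the response scale.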
Exercise
Using the make_predictions() function and the PhysActive_FEV1 model:
A) Obtain the log odds of the expectation of physical activity for individuals with an FEV1 of 2000, 3000 or 4000. Ensure that your predictions include confidence intervals.
B) Exponentiate the log odds at an FEV1 of 4000. How many times more likely is an individual with an FEV1 of 4000 to be physically active than not?
C) Obtain the probabilities of individuals with an FEV1 of 2000, 3000 or 4000 being physically active. Ensure that your predictions include confidence intervals.
Solution
A) Including outcome.scale = "link" gives us predictions on the log odds scale:
predictionDat <- tibble(FEV1 = c(2000, 3000, 4000))
make_predictions(PhysActive_FEV1, new_data = predictionDat,
                 outcome.scale = "link")
# A tibble: 3 × 4
   FEV1 PhysActive   ymax   ymin
  <dbl>      <dbl>  <dbl>  <dbl>
1  2000     -0.261 -0.175 -0.347
2  3000      0.202  0.255  0.148
3  4000      0.664  0.740  0.589
B) $e^{0.664} = 1.94$, so an individual with an FEV1 of 4000 is 1.94 times as likely to be physically active as not.
C) Including outcome.scale = "response" gives us predictions on the probability scale:
make_predictions(PhysActive_FEV1, new_data = predictionDat,
                 outcome.scale = "response")
# A tibble: 3 × 4
   FEV1 PhysActive  ymax  ymin
  <dbl>      <dbl> <dbl> <dbl>
1  2000      0.435 0.456 0.414
2  3000      0.550 0.563 0.537
3  4000      0.660 0.677 0.643
Key Points
Predictions of the log odds, the odds and the probability of success can be manually calculated using the model’s equation.
Predictions of the log odds, the odds and the probability of success alongside 95% CIs can be obtained using the make_predictions() function.