This lesson is being piloted (Beta version)
If you teach this lesson, please tell the authors and provide feedback by opening an issue in the source repository

An introduction to logistic regression

Overview

Teaching: 45 min
Exercises: 45 min
Questions
  • In what scenario is a logistic regression model useful?

  • How is the logistic regression model expressed in terms of the log odds?

  • How is the logistic regression model expressed in terms of the probability of success?

  • What is the effect of the explanatory variable in terms of the odds?

Objectives
  • Identify questions that can be addressed with a logistic regression model.

  • Formulate the model equation in terms of the log odds.

  • Formulate the model equation in terms of the probability of success.

  • Express the effect of an explanatory variable in terms of a multiplicative change in the odds.

Scenarios in which logistic regression may be useful

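Logistic regression is commonly used, but when is it appropriate to apply this method? Broadly speaking, logistic regression may be suitable when the following conditions hold:

  • The outcome (dependent) variable is binary.

  • There are one or more continuous or categorical explanatory variables.

  • The assumptions of the logistic regression model hold.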

Exercise

A colleague has started working with the NHANES data set. They approach you for advice on using logistic regression with these data. Assuming that the assumptions of the logistic regression model hold, which of the following questions could potentially be tackled with a logistic regression model? For each research question, think carefully about which variable is the outcome and which is the explanatory variable.

A) Does home ownership (whether a participant’s home is owned or rented) vary across income bracket in the general US population?
B) Is there an association between BMI and pulse rate in the general US population?
C) Do participants with diabetes on average have a higher weight than participants without diabetes?

Solution

A) The outcome variable is home ownership and the explanatory variable is income bracket. Since home ownership is a binary outcome variable, logistic regression could be a suitable way to investigate this question.
B) Whichever of BMI and pulse rate is treated as the outcome, the outcome variable is continuous, so logistic regression is not suitable for this question.
C) The outcome variable is weight and the explanatory variable is diabetes. Since the outcome variable is continuous and the explanatory variable is binary, this question is not suited for logistic regression. Note that an alternative question, with diabetes as the outcome variable and weight as the explanatory variable, could be investigated using logistic regression.

The logistic regression model equation in terms of the log odds

The logistic regression model can be described by the following equation:

\[\text{log}\left(\frac{E(y)}{1-E(y)}\right) = \beta_0 + \beta_1 \times x_1.\]

The right-hand side of the equation has the same form as that for simple linear regression. So we will first interpret the left-hand side of the equation. The outcome variable is denoted by $y$. Logistic regression models the log odds of $E(y)$, which we encountered in the previous episode.

The log odds can be denoted by $\text{logit()}$ for simplicity, giving us the following equation:

\[\text{logit}(E(y)) = \beta_0 + \beta_1 \times x_1.\]

As we learned in the previous episode, the expectation of $y$ is another way of referring to the probability of success. We also learned that the probability of success equals one minus the probability of failure. Therefore, the left-hand side of our equation can be denoted by:

\[\text{logit}(E(y)) = \text{log}\left(\frac{E(y)}{1-E(y)}\right) = \text{log}\left(\frac{\text{Pr}(y=1)}{1-\text{Pr}(y=1)}\right) = \text{log}\left(\frac{\text{Pr}(y=1)}{\text{Pr}(y=0)}\right).\]

This leads us to interpreting $\text{logit}(E(y))$ as the log odds of $y=1$ (or success).
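As a quick numeric check of this identity, the following R sketch computes the log odds for an illustrative probability of success in two ways: directly from the definition and with base R's `qlogis()`, which implements the logit function. The value 0.2 is an assumption chosen purely for this example.

```r
# Illustrative probability of success, Pr(y = 1); chosen only for this example
p <- 0.2

# Odds of success: Pr(y = 1) / Pr(y = 0)
odds <- p / (1 - p)

# Log odds computed from the definition
log_odds <- log(odds)

# qlogis() from the stats package (loaded by default) is the logit function
log_odds_qlogis <- qlogis(p)

# Both routes give the same value
all.equal(log_odds, log_odds_qlogis)
```

A negative log odds here reflects that success is less likely than failure, since the odds are below 1.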

In the logistic regression model equation, the log odds of success are a linear function of $x_1$, so the expectation of $y$ depends on $\beta_0$ and $\beta_1 \times x_1$. The intercept, $\beta_0$, gives the log odds when the explanatory variable, $x_1$, equals 0. The effect of our explanatory variable is denoted by $\beta_1$: for every one-unit increase in $x_1$, the log odds change by $\beta_1$.

Before fitting the model, we have $y$ and $x_1$ values for each observation in our data. For example, suppose we want to model the relationship between diabetes and BMI. $y$ would represent diabetes ($y=1$ if a participant has diabetes and $y=0$ otherwise). $x_1$ would represent BMI. After we fit the model, R will return to us values of $\beta_0$ and $\beta_1$ - these are estimated using our data.
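A minimal sketch of this workflow is given below. It uses simulated data rather than NHANES, so the variable names (`bmi`, `diabetes`, `sim_dat`) and the coefficient values used to generate the data are assumptions made only for illustration. The model is fitted with base R's `glm()` using `family = binomial`, and `plogis()` (the inverse of the logit function) is used to turn log odds into probabilities when simulating.

```r
set.seed(1)  # make the simulation reproducible

# Simulated (not NHANES) data: BMI values for n hypothetical participants
n <- 500
bmi <- rnorm(n, mean = 27, sd = 5)

# Hypothetical "true" coefficients, used only to generate the data
true_beta0 <- -6
true_beta1 <- 0.15

# Probability of diabetes for each participant, then a 0/1 outcome
pr_diabetes <- plogis(true_beta0 + true_beta1 * bmi)
diabetes <- rbinom(n, size = 1, prob = pr_diabetes)

sim_dat <- data.frame(diabetes = diabetes, bmi = bmi)

# Fit the logistic regression model: logit(E(y)) = beta_0 + beta_1 * x_1
fit <- glm(diabetes ~ bmi, data = sim_dat, family = binomial)

# Estimated beta_0 ("(Intercept)") and beta_1 ("bmi"), on the log odds scale
coef(fit)
```

The estimates will not equal the values used to simulate the data exactly, because they are estimated from a finite sample.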

Exercise

We are asked to study the association between BMI and diabetes. We are given the following equation of a logistic regression model to use:

\(\text{logit}(E(y)) = \beta_0 + \beta_1 \times x_1\).

Match the following components of this logistic regression model to their descriptions:

  1. $\text{logit}(E(y))$
  2. ${\beta}_0$
  3. $x_1$
  4. ${\beta}_1$

A) The log odds of having diabetes, for a particular value of BMI.
B) The expected change in the log odds of having diabetes with a one-unit increase in BMI.
C) The expected log odds of having diabetes when the BMI equals 0.
D) A specific value of BMI.

Solution

A) 1
B) 4
C) 2
D) 3

The logistic regression model equation in terms of the probability of success

Alternatively, the logistic regression model can be expressed in terms of probabilities of success. This formula is obtained by using the inverse function of $\text{logit}()$, denoted by $\text{logit}^{-1}()$. In general terms, an inverse function “reverses” the original function, returning the input value. This means that $\text{logit}^{-1}(\text{logit}(E(y))) = E(y)$. Taking the inverse logit on both sides of the logistic regression equation introduced above, we obtain:

\[\begin{align} \text{logit}^{-1}(\text{logit}(E(y))) & = \text{logit}^{-1}(\beta_0 + \beta_1 \times x_1) \\ E(y) & = \text{logit}^{-1}(\beta_0 + \beta_1 \times x_1). \end{align}\]

The advantage of this formulation is that our output is in terms of probabilities of success. We will encounter this formulation when plotting the results of our models.
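Written out, the inverse logit is $\text{logit}^{-1}(z) = \frac{e^z}{1+e^z}$. The sketch below evaluates it in R for a few values of $x_1$, both by hand and with base R's `plogis()`. The coefficient and $x_1$ values are hypothetical and chosen only for illustration.

```r
# Hypothetical coefficient values, chosen only for illustration
beta0 <- -6
beta1 <- 0.15
x1 <- c(20, 30, 40)  # three example values of the explanatory variable

# Linear predictor: the log odds of success
log_odds <- beta0 + beta1 * x1

# Inverse logit written out: exp(z) / (1 + exp(z))
prob_manual <- exp(log_odds) / (1 + exp(log_odds))

# plogis() is base R's inverse logit
prob <- plogis(log_odds)

all.equal(prob, prob_manual)       # both routes agree
all.equal(qlogis(prob), log_odds)  # the logit undoes the inverse logit
```

Whatever the value of the linear predictor, the result lies strictly between 0 and 1, which is what makes it interpretable as a probability.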

Exercise

We are asked to study the association between age and smoking status. We are given the following equation of a logistic regression model to use:

\(E(y) = \text{logit}^{-1}(\beta_0 + \beta_1 \times x_1)\).

Match the following components of this logistic regression model to their descriptions:

  1. $E(y)$
  2. ${\beta}_0$
  3. $x_1$
  4. ${\beta}_1$
  5. $\text{logit}^{-1}()$

A) A specific value of age.
B) The expected probability of being a smoker for a particular value of age.
C) The inverse logit function.
D) The expected log odds of being a smoker given an age of 0.
E) The expected change in the log odds of being a smoker with a one-unit difference in age.

Solution

A) 3
B) 1
C) 5
D) 2
E) 4

The multiplicative change in the odds

As we have seen above, the effect of an explanatory variable is expressed in terms of the log odds of success. For every unit increase in $x_1$, the log odds change by $\beta_1$. For a different interpretation, we can express the effect of an explanatory variable in terms of a multiplicative change in the odds of success.

Specifically, the odds of success, $\frac{\text{Pr}(y=1)}{\text{Pr}(y=0)}$, are multiplied by $e^{\beta_1}$ for every unit increase in $x_1$. For example, if $\frac{\text{Pr}(y=1)}{\text{Pr}(y=0)} = 0.2$ at $x_1=0$ and $\beta_1 = 2$, then at $x_1=1$ we have $\frac{\text{Pr}(y=1)}{\text{Pr}(y=0)} = 0.2 \times e^2 \approx 1.48$. In general terms, writing $x$ for $x_1$, this relationship is expressed as:

\[\frac{\text{Pr}(y=1|x=a+1)}{\text{Pr}(y=0|x=a+1)} = \frac{\text{Pr}(y=1|x=a)}{\text{Pr}(y=0|x=a)} \times e^{\beta_1},\]

where $\frac{\text{Pr}(y=1|x=a)}{\text{Pr}(y=0|x=a)}$ is read as the “odds of $y$ being $1$ given $x$ being $a$”. In this context, $a$ is any value that $x$ can take on.
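The relationship can be sketched in R using the numbers from the example above (odds of 0.2 at $x=0$ and $\beta_1 = 2$). The range of $x$ values is an assumption made only to show several successive steps.

```r
odds_at_0 <- 0.2  # odds of success at x = 0, from the example above
beta1 <- 2        # effect of the explanatory variable, from the example above

# Odds at x = 0, 1, 2, 3: each unit step multiplies the odds by exp(beta1)
x <- 0:3
odds <- odds_at_0 * exp(beta1 * x)

round(odds[2], 2)  # odds at x = 1, i.e. 0.2 * exp(2), which rounds to 1.48

# Ratio of successive odds: the same value, exp(beta1), at every step
odds[-1] / odds[-length(odds)]

# Difference between successive odds: not the same at every step
diff(odds)
```

The ratios are constant while the differences grow, which is the sense in which the effect is multiplicative rather than additive on the odds scale.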

Importantly, this means that the change in the odds of success is not linear. The absolute change depends on the current odds, $\frac{\text{Pr}(y=1|x=a)}{\text{Pr}(y=0|x=a)}$. We will exemplify this in the exercise below.

Exercise

We are given the following odds of success: $\frac{\text{Pr}(y=1|x=2)}{\text{Pr}(y=0|x=2)} = 0.4$ and $\frac{\text{Pr}(y=1|x=6)}{\text{Pr}(y=0|x=6)} = 0.9$. We are also given an estimate of the effect of an explanatory variable: $\beta_1 = 1.2$.

A) Calculate the expected multiplicative change in the odds of success. An exponential can be calculated in R using exp(), e.g. $e^2$ is obtained with exp(2).
B) Calculate $\frac{\text{Pr}(y=1|x=3)}{\text{Pr}(y=0|x=3)}$.
C) Calculate $\frac{\text{Pr}(y=1|x=7)}{\text{Pr}(y=0|x=7)}$.
D) By how much did the odds of success change when going from $x=2$ to $x=3$? And when going from $x=6$ to $x=7$?

Solution

A) $e^{1.2} = 3.32.$ We can obtain this in R as follows:

exp(1.2)
[1] 3.320117

B) $\frac{\text{Pr}(y=1|x=3)}{\text{Pr}(y=0|x=3)} = 0.4 \times 3.32 = 1.328$
C) $\frac{\text{Pr}(y=1|x=7)}{\text{Pr}(y=0|x=7)} = 0.9 \times 3.32 = 2.988$
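D) $1.328 - 0.4 = 0.928$ and $2.988 - 0.9 = 2.088$. So the second change is greater than the first change: the same multiplicative change produces a larger absolute change when the starting odds are larger.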

If you are interested in the reason why this multiplicative change exists, see the callout box below.

Why do the odds change by a factor of $e^{\beta_1}$?

To understand the reason behind the multiplicative relationship, we need to look at the ratio of two odds:

\[\frac{\frac{\text{Pr}(y=1|x=a+1)}{\text{Pr}(y=0|x=a+1)}}{\frac{\text{Pr}(y=1|x=a)}{\text{Pr}(y=0|x=a)}}.\]

Let’s call this ratio $A$. The numerator of $A$ is the odds of success when $x=a+1$. The denominator of $A$ is the odds of success when $x=a$. Therefore, if $A>1$, then the odds of success are greater when $x=a+1$. Alternatively, if $A<1$, then the odds of success are smaller when $x=a+1$.

Taking the exponential of a log returns the logged value, i.e. $e^{\text{log}(z)} = z$ for any positive $z$. Therefore, we can express the odds in terms of the exponentiated model equation:

\[\text{log}\left(\frac{\text{Pr}(y=1)}{\text{Pr}(y=0)}\right) = \beta_0 + \beta_1 \times x_1 \Leftrightarrow \frac{\text{Pr}(y=1)}{\text{Pr}(y=0)} = e^{\beta_0 + \beta_1 \times x_1}.\]

The ratio of two odds can thus be expressed in terms of the exponentiated model equations:

\[\frac{\frac{\text{Pr}(y=1|x=a+1)}{\text{Pr}(y=0|x=a+1)}}{\frac{\text{Pr}(y=1|x=a)}{\text{Pr}(y=0|x=a)}} = \frac{e^{\beta_0 + \beta_1 \times (a+1)}}{e^{\beta_0 + \beta_1\times a}}.\]

Since the exponential of a sum is the product of the exponentiated components:

\[\frac{\frac{\text{Pr}(y=1|x=a+1)}{\text{Pr}(y=0|x=a+1)}}{\frac{\text{Pr}(y=1|x=a)}{\text{Pr}(y=0|x=a)}} = \frac{e^{\beta_0} \times e^{(\beta_1 \times a)} \times e^{\beta_1}}{e^{\beta_0} \times e^{(\beta_1 \times a)}}.\]

This can then be simplified by cancelling the factors that appear in both the numerator and the denominator:

\[\frac{\frac{\text{Pr}(y=1|x=a+1)}{\text{Pr}(y=0|x=a+1)}}{\frac{\text{Pr}(y=1|x=a)}{\text{Pr}(y=0|x=a)}} = e^{\beta_1}.\]

Finally, bringing the denominator from the left-hand side to the right-hand side:

\[\frac{\text{Pr}(y=1|x=a+1)}{\text{Pr}(y=0|x=a+1)} = \frac{\text{Pr}(y=1|x=a)}{\text{Pr}(y=0|x=a)} \times e^{\beta_1}.\]
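The result of this derivation can be checked numerically. The sketch below uses hypothetical values of $\beta_0$, $\beta_1$ and $a$, chosen only for illustration, and base R's `plogis()` as the inverse logit to obtain the probabilities of success.

```r
# Hypothetical values, chosen only for illustration
beta0 <- -1
beta1 <- 0.5
a <- 3

# Probability of success at x = a and at x = a + 1, via the inverse logit
p_a  <- plogis(beta0 + beta1 * a)
p_a1 <- plogis(beta0 + beta1 * (a + 1))

# Odds of success at each value of x
odds_a  <- p_a / (1 - p_a)
odds_a1 <- p_a1 / (1 - p_a1)

# The ratio of the two odds matches exp(beta1)
all.equal(odds_a1 / odds_a, exp(beta1))
```

Changing `beta0` or `a` alters the two odds but leaves their ratio unchanged, in line with the cancellation step in the derivation.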

Key Points

  • Logistic regression requires one binary dependent variable and one or more continuous or categorical explanatory variables.

  • The model equation in terms of the log odds is $\text{logit}(E(y)) = \beta_0 + \beta_1 \times x_1$.

  • The model equation in terms of the probability of success is $E(y) = \text{logit}^{-1}(\beta_0 + \beta_1 \times x_1)$.

  • The odds are multiplied by $e^{\beta_1}$ for a one-unit increase in $x_1$.