This lesson is being piloted (Beta version)
If you teach this lesson, please tell the authors and provide feedback by opening an issue in the source repository

An introduction to binary response variables

Overview

Teaching: 40 min
Exercises: 20 min
Questions
  • How can we calculate probabilities of success and failure?

  • How do we interpret the expectation of a binary variable?

  • How can we calculate and interpret the odds?

  • How can we calculate and interpret the log odds?

Objectives
  • Calculate the probabilities of success and failure given binary data.

  • Interpret the expectation of a binary variable.

  • Calculate and interpret the odds of success given binary data.

  • Calculate and interpret the log odds of success given binary data.

In this lesson we will work with binary outcome variables. That is, variables which can take one of two possible values. For example, these could be $0$ or $1$, “success” or “failure” or “yes” or “no”.

Probabilities and expectation

By analysing binary data, we can estimate the probabilities of success and failure. For example, if we consider individuals between the ages of 55 and 66, we may be interested in the probability that individuals who have once smoked, are still smoking during the NHANES study.

The probability of success is estimated by the proportion of individuals who are still smoking. Similarly, the probability of failure is estimated by the proportion of individuals who are no longer smoking. In this context, we would consider an individual that still smokes a “success” and an individual that no longer smokes a “failure”.

We calculate these values in RStudio through four operations:

  1. Removing empty rows with drop_na();
  2. Subsetting individuals of the appropriate age using filter();
  3. Counting the number of individuals in each of the two levels of SmokeNow using count();
  4. Calculating proportions by dividing the counts by the total number of non-NA observations using mutate().
dat %>%
  drop_na(SmokeNow) %>% # there are no empty rows in Age
                        # so we only use drop_na on SmokeNow
  filter(between(Age, 55, 66)) %>%
  count(SmokeNow, name = "n") %>%
  mutate(prop = n/sum(n))
  SmokeNow   n      prop
1       No 359 0.6232639
2      Yes 217 0.3767361

We see that the probability of success is estimated as $0.38$ and the probability of failure is estimated as $0.62$. In mathematical notation:
$\text{Pr}(\text{SmokeNow} = \text{Yes}) = 0.38$ and $\text{Pr}(\text{SmokeNow} = \text{No}) = 0.62$.

You may have noticed that the probabilities of success and failure add to 1. This is true because there are only two possible outcomes for a binary response variable. Therefore, the probability of success equals 1 minus the probability of failure: $\text{Pr}(\text{Success}) = 1 - \text{Pr}(\text{Failure})$.

In the linear regression lessons, we modelled the expectation of the outcome variable, $E(y)$. In the case of binary variables, we will also work with the expectation of the outcome variable. When $y$ is a binary variable, $E(y)$ is equal to the probability of success. In our example above, $E(y) = \text{Pr}(\text{SmokeNow} = \text{Yes}) = 0.38$.

Exercise

You have been asked to study physical activity (PhysActive) in individuals with an FEV1 (FEV1) between 3750 and 4250 in the NHANES data.
A) Estimate the probabilitites that someone is or is not physically active for individuals with an FEV1 between 3750 and 4250.
B) What value is $E(\text{PhysActive})$ for individuals with an FEV1 between 3750 and 4250?

Solution

A) To obtain the probabilities:

dat %>%
  drop_na(PhysActive) %>%
  filter(between(FEV1, 3750, 4250)) %>%
  count(PhysActive) %>%
  mutate(prop = n/sum(n)) 
  PhysActive   n      prop
1         No 242 0.3159269
2        Yes 524 0.6840731

We therefore estimate the probability of physical activity to be $0.68$ and the probability of no physical activity to be $0.32$.

B) $E(\text{PhysActive}) = \text{Pr}(\text{PhysActive} = \text{Yes}) = 0.68$

Why does $E(y)$ equal the probability of success?

In general, the expectation of a variable equals its probability-weighted mean. This is calculated by taking the sum of all values that a variable can take on, each multiplied by the probability of that value occuring.

In mathematical notation, this is indicated by:

\[E(y) = \sum_i\Big(y_i \times \text{Pr}(y = y_i)\Big)\]

In the case of a binary variable, the variable can take one of two values: $0$ and $1$. Therefore, the expectation becomes:

\[E(y) = \sum_i\Big(y_i \times \text{Pr}(y = y_i)\Big) = 0 \times \text{Pr}(y = 0) + 1 \times \text{Pr}(y = 1) = \text{Pr}(y = 1)\]

Since “success” is considered $y=1$, the expectation of a binary variable equals the probability of success.

Odds and log odds

Besides probabilities, binary data is often interpreted through odds. The odds are defined as:

\[\frac{E(y)}{1-E(y)}.\]

Since the expectation of $y$ equals the probability of success, the odds can also be written as:

\[\frac{E(y)}{1-E(y)} = \frac{\text{Pr}(\text{Success})}{1-\text{Pr}(\text{Success})} = \frac{\text{Pr}(\text{Success})}{\text{Pr}(\text{Failure})}.\]

Therefore, an odds greater than $1$ indicates that the probability of success is greater than the probability of failure. For example, an odds of 1.5 indicates that success is 1.5 times as likely as failure. An odds less than $1$ indicates that the probability of failure is greater than the probability of success. For example, an odds of 0.75 indicates that success is 0.75 times as likely as failure.

Binary outcome variables can be modeled through the log odds. We can see the relationship between the log odds and the expectation in the plot below. As we can see in the plot, a log odds greater than zero is associated with a probability of success greater than 0.5. Likewise, a log odds smaller than 0 is associated with a probability of success less than 0.5.

plot of chunk plot log odds vs expectation

In mathematical notation, the log odds is defined as:

\[\text{log}\left(\frac{E(y)}{1-E(y)}\right).\]

The interpretation of the probabilities, odds and log odds is summarised in the table below:

Measure Turning point Interpretation
Probability 0.5 Proportion of observations that are successes
Odds 1.0 How many times more likely is success than failure?
Log odds 0 If log odds > 0, probability is > 0.5.

The odds and the log odds can be calculated in RStudio through an extension of the code that we used to calculate the probabilities. From our table of probabilities we isolate the row with the probability of success using filter(). We then calculate the odds and the log odds using the summarise() function.

dat %>%
  drop_na(SmokeNow) %>%
  filter(between(Age, 55, 66)) %>%
  count(SmokeNow) %>%
  mutate(prop = n/sum(n)) %>%
  filter(SmokeNow == "Yes") %>%
  summarise(odds = prop/(1 - prop),
    log_odds = log(prop/(1 - prop)))
       odds  log_odds
1 0.6044568 -0.503425

Exercise

You have been asked to study physical activity (PhysActive) in individuals with an FEV1 (FEV1) between 3750 and 4250 in the NHANES data. Calculate the odds and the log odds of physical activity for individuals with an FEV1 between 3750 and 4250. How is the odds interpreted here?

Solution

dat %>%
  drop_na(PhysActive) %>%
  filter(between(FEV1, 3750, 4250)) %>%
  count(PhysActive) %>%
  mutate(prop = n/sum(n)) %>%
  filter(PhysActive == "Yes") %>%
  summarise(odds = prop/(1 - prop),
    log_odds = log(prop/(1 - prop)))
      odds log_odds
1 2.165289 0.772554

Since the odds equal 2.17, we expect individuals with an FEV1 between 3750 and 4250 to be 2.17 times more likely to be physically active than not.

What does $\text{log}()$ do?

The $\text{log}()$ is a transformation used widely in statistics, including in the modelling of binary variables. In general, $\text{log}_a(b)$ tells us to what power we need to raise $a$ to obtain the value $b$.

For example, $2^3 = 2 \times 2 \times 2 = 8$. Therefore, $\text{log}_2(8)=3$, since we raise $2$ to the power of $3$ to obtain 8.

Similarly, $\text{log}_3(81)=4$, since $3^4=81$.

In logistic regression, we use $\text{log}_{e}()$, where $e$ is a mathematical constant. The constant $e$ approximately equals 2.718.

Rather than writing $\text{log}_{e}()$, we write $\text{log}()$ for simplicity.

In R, we can calculate the log using the log() function. For example, to calculate to what power we need to raise $e$ to obtain $10$:

log(10)
[1] 2.302585

Key Points

  • The probabilities of success and failure are estimated as the proportions of participants with a success and failure, respectively.

  • The expectation of a binary variable equals the probability of success.

  • The odds equal the ratio of the probability of success and one minus the probability of success. The odds quantify how many times more likely success is than failure.

  • The log odds are calculated by taking the log of the odds. When the log odds are greater than 0, the probability of success is greater than 0.5.