This lesson is being piloted (Beta version)
If you teach this lesson, please tell the authors and provide feedback by opening an issue in the source repository

# Linear regression including an interaction between one continuous and one categorical explanatory variable

## Overview

Teaching: 20 min
Exercises: 30 min
Questions
• When is it appropriate to add an interaction to a multiple linear regression model?

• How do we add an interaction term in the lm() command?

• How do the coefficient estimates given by summ() relate to the multiple linear regression model equation?

• How can we visualise the final model in R?

Objectives
• Recognise from an exploratory plot when an interaction between a continuous and a categorical explanatory variable is appropriate.

• Fit a linear regression model including an interaction between one continuous and one categorical explanatory variable using lm().

• Use the jtools package to interpret the model output.

• Use the interactions package to visualise the model output.

In the previous episode, we modeled the effect of the continuous explanatory variable as equal across the groups of the categorical explanatory variable. For example, the effect of Weight on BMI was estimated as $0.31$ for both females and males. In this episode we will expand our modelling approach to include an interaction between the continuous and categorical explanatory variables. As a result, not only the intercept but also the coefficient of the explanatory variable will differ across the levels of the categorical variable in our model. This is appropriate when the slope in our scatterplot differs between the levels of the categorical variable.

## Visually exploring the need for an interaction

We will use the variables BPSysAve, Age and Sex as an example. Looking at the scatterplot below, for which the data has been filtered to include participants over the age of 17, it seems that the slope for females is greater than for males.

dat %>%
filter(Age > 17) %>%
ggplot(aes(x = Age, y = BPSysAve, colour = Sex)) +
geom_point(alpha = 0.4) +
ylab("Average systolic blood pressure (mmHg)") +
xlab("Age (years)")


## Exercise

You have been asked to model the relationship between Hemoglobin and Age in the NHANES data, splitting observations by Sex. Use the ggplot2 package to create an exploratory plot, ensuring that:

1. Only participants aged 18 or over are included.
2. Hemoglobin (Hemoglobin) is on the y-axis and Age (Age) is on the x-axis.
3. These data are shown as a scatterplot, with points coloured by Sex and an opacity of 0.4.
4. The axes are labelled as “Hemoglobin (g/dL)” and “Age (years)”.

## Solution

dat %>%
filter(Age > 17) %>%
ggplot(aes(x = Age, y = Hemoglobin, colour = Sex)) +
geom_point(alpha = 0.4) +
ylab("Hemoglobin (g/dL)") +
xlab("Age (years)")


## Fitting and interpreting a multiple linear regression model with an interaction

The code for fitting our model is similar to the previous lm() commands. Notice however that instead of a + between the explanatory variables, we use a *. This is how we tell lm() that we want an interaction between the explanatory variables to be included in the model. The interaction allows the effect of Age to differ across the levels of Sex.

In the output from summ() we see two coefficients that relate to the baseline level of Sex: an intercept and the effect for Age. For the alternative level of Sex, we see two further coefficients: the difference in the intercept (Sexmale) and the difference in the slope (Age:Sexmale). The equation for this model therefore becomes:

$E(\text{Average systolic blood pressure}) = 93.39 + \left(0.55 \times \text{Age}\right) + \left(17.06 \times x_2\right) - \left(0.28 \times x_2 \times \text{Age}\right)$,

where $x_2 = 1$ for male participants and $0$ otherwise.

Since the difference in the intercepts is positive, we expect a greater intercept for males than for females. Furthermore, since the difference in the slopes is negative, we expect a smaller slope for males than for females.

BPSysAve_Age_Sex <- dat %>%
filter(Age > 17) %>%
lm(formula = BPSysAve ~ Age * Sex)

summ(BPSysAve_Age_Sex)

MODEL INFO:
Observations: 5996 (501 missing obs. deleted)
Dependent Variable: BPSysAve
Type: OLS linear regression

MODEL FIT:
F(3,5992) = 552.67, p = 0.00
R² = 0.22

Standard errors: OLS
------------------------------------------------
Est.   S.E.   t val.      p
----------------- ------- ------ -------- ------
(Intercept)         93.39   0.80   116.27   0.00
Age                  0.55   0.02    35.63   0.00
Sexmale             17.06   1.15    14.89   0.00
Age:Sexmale         -0.28   0.02   -12.68   0.00
------------------------------------------------


## Exercise

1. Using the lm() command, fit a multiple linear regression model of Hemoglobin (Hemoglobin) as a function of Age (Age), grouped by Sex, including an interaction between Age and Sex. Make sure to only include participants who were 18 years or older. Name the lm object Hemoglobin_Age_Sex.
2. Using the summ function from the jtools package, answer the following questions:

A) What concentration of Hemoglobin does the model predict, on average, for an individual, belonging to the baseline level of Sex, with an Age of 0?
B) By how much is Hemoglobin expected to differ, on average, for the alternative level of Sex, at an Age of 0?
C) By how much is Hemoglobin expected to differ, on average, for a one-unit difference in Age in the baseline level of Sex?
D) By how much is Hemoglobin expected to differ, on average, for a one-unit difference in Age in the alternative level of Sex?
E) Given these values and the names of the response and explanatory variables, how can the general equation $E(y) = \beta_0 + {\beta}_1 \times x_1 + {\beta}_2 \times x_2 + {\beta}_3 \times x_1 \times x_2$ be adapted to represent the model?

## Solution

Hemoglobin_Age_Sex <- dat %>%
filter(Age > 17) %>%
lm(formula = Hemoglobin ~ Age * Sex)

summ(Hemoglobin_Age_Sex)

MODEL INFO:
Observations: 5995 (502 missing obs. deleted)
Dependent Variable: Hemoglobin
Type: OLS linear regression

MODEL FIT:
F(3,5991) = 1026.31, p = 0.00
R² = 0.34

Standard errors: OLS
------------------------------------------------
Est.   S.E.   t val.      p
----------------- ------- ------ -------- ------
(Intercept)         13.29   0.06   219.70   0.00
Age                  0.00   0.00     0.68   0.49
Sexmale              2.76   0.09    31.85   0.00
Age:Sexmale         -0.02   0.00   -13.94   0.00
------------------------------------------------


A) 13.29 g/dL.
B) On average, individuals from the two categories are expected to differ by 2.76 g/dL at an Age of 0.
C) On average, female participants with a one-unit difference in Age are expected to differ by 0.00 in their Hemoglobin concentration. So, not at all.
D) On average, male participants with a one-unit difference in Age are expected to differ by 0.02 in their Hemoglobin concentration.
E) $E(\text{Hemoglobin}) = 13.29 + 0.00 \times \text{Age} + 2.76 \times x_2 - 0.02 \times \text{Age} \times x_2$, where $x_2 = 1$ for participants of the male Sex and $0$ otherwise.

## Visualising a multiple linear regression model with an interaction

We can visualise the model, as before, using the interact_plot() function.

interact_plot(BPSysAve_Age_Sex, pred = Age, modx = Sex,
plot.points = TRUE, point.alpha = 0.4) +
ylab("Average systolic blood pressure (mmHg)") +
xlab("Age (years)")


## Exercise

To help others interpret the Hemoglobin_Age_Sex model, produce a figure. Make this figure using the interactions package.

## Solution

interact_plot(Hemoglobin_Age_Sex, pred = Age, modx = Sex,
plot.points = TRUE, point.alpha = 0.4) +
ylab("Hemoglobin (g/dL)") +
xlab("Age (years)")


## Key Points

• It may be appropriate to include an interaction when the slopes appear to differ across levels of a categorical variable.

• Replace + by * in the lm() command to add an interaction.

• When an interaction is included, two coefficients relate to differences between the two levels of a categorical variable - one relates to a difference in the intercept, the other to a difference in the slope.

• The function interact_plot() can be used to visualise the model.