Hierarchical models
Last updated on 2024-11-19 | Edit this page
Estimated time: 14 minutes
Overview
Questions
- How does Bayesian modeling accommodate group structure?
Objectives
- Learn to construct and fit hierarchical models.
Hierarchical (or multi-level) models are a class of models suited for situations where the study population comprises distinct but interconnected groups. For example, analyzing student performance across different schools, income disparities among various regions, or studying animal behavior within different populations are scenarios where such models would be appropriate.
Incorporating group-wise parameters allows us to model each group separately, and a model becomes hierarchical when we treat the parameters of the prior distribution as unknown. These parameters, known as hyperparameters, are assigned their own prior distribution, referred to as a hyperprior, and are learned during the model fitting process. Similarly as priors, the hyperpriors should be set so they reflect ranges of possible hyperparameter values. Conceptually, these hyperparameters and hyperpriors operate at a higher level of hierarchy, hence the name.
For example, let’s consider the beta-binomial model discussed in Episode 1. We used it to estimate the prevalence of left-handedness based on a sample of 50 students. If we were to include additional information, such as the students’ majors, we could extend the model as follows:
\[X_g \sim \text{Bin}(N_g, \theta_g) \\ \theta_g \sim \text{Beta}(\alpha, \beta) \\ \alpha, \beta \sim \Gamma(2, 0.1).\]
Here, the subscript \(g\) indexes the groups based on majors. The group-specific prevalences for left-handedness \(\theta_g\) are assigned a beta prior with hyperparameters \(\alpha\) and \(\beta\) which are treated as random variables. The final line indicates the hyperprior \(\Gamma(2, 0.1)\) governing the prior beliefs about the hyperparameters.
In this hierarchical beta-binomial model, students are considered exchangeable within their majors but no longer across the entire population. However, an underlying assumption of similarity exists between the groups since they share a common prior, that is learned. This way the groups are not entirely independent but are not treated as equal either.
One of the key advantages of Bayesian hierarchical models is their capacity to leverage information across groups. By pooling information from various groups, these models can yield more robust estimates, particularly when data availability is limited.
Another distinction from non-hierarchical models is that the prior, or the population distribution, of the parameters is learned in the process. This population distribution can provide information into parameter variability on a broader scale, even for groups where data is scarce or completely missing. For instance, if we had data on the handedness of students majoring in natural sciences, the population distribution can give insights into students in humanities and social sciences as well.
In the following example, we will perform a hierarchical analysis of human heights across different countries.
Example: human height
Let’s examine the heights of adults in various countries. In [1], averages and standard errors of adult heights in centimeters across different countries and age groups were provided. We’ll utilize this dataset to generate a sample of hypothetical individual heights and then assess our ability to reproduce these measured height statistics.
This approach is commonly employed when building and testing models: we generate simulated data with known parameters and then compare the inferred results to these known parameters. In our case, the true parameters are derived from real-world data.
We’ll employ a normal model with unknown mean \(\mu\) and standard deviation \(\sigma\) as our generative model, and treat these parameters hierarchically.
First, let’s load the data and examine its structure.
R
set.seed(4985)
height <- read.csv("data/height_data.csv")
str(height)
OUTPUT
'data.frame': 210000 obs. of 8 variables:
$ Country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
$ Sex : chr "Boys" "Boys" "Boys" "Boys" ...
$ Year : int 1985 1985 1985 1985 1985 1985 1985 1985 1985 1985 ...
$ Age.group : int 5 6 7 8 9 10 11 12 13 14 ...
$ Mean.height : num 103 109 115 120 125 ...
$ Mean.height.lower.95..uncertainty.interval: num 92.9 99.9 106.3 112.2 117.9 ...
$ Mean.height.upper.95..uncertainty.interval: num 114 118 123 128 132 ...
$ Mean.height.standard.error : num 5.3 4.72 4.27 3.92 3.66 ...
Let’s subset this data to simplify the analysis and focus on the height of adult women measured in 2019.
R
height_women <- height %>%
filter(
Age.group == 19,
Sex == "Girls",
Year == 2019
) %>%
# Select variables of interest
select(Country, Sex, Mean.height, Mean.height.standard.error)
Let’s select 10 countries randomly
R
set.seed(5431)
# Select countries
N_countries <- 10
Countries <- sample(unique(height_women$Country),
size = N_countries,
replace = FALSE) %>% sort
height_women10 <- height_women %>% filter(Country %in% Countries)
height_women10
OUTPUT
Country Sex Mean.height Mean.height.standard.error
1 Bangladesh Girls 152.3776 0.4966124
2 Belize Girls 158.1201 1.4026324
3 Cameroon Girls 160.4112 0.6146315
4 Chad Girls 162.1242 0.8894219
5 Cote d'Ivoire Girls 158.6524 0.9254438
6 Ghana Girls 158.8551 0.8002225
7 Kenya Girls 159.4338 0.6680202
8 Luxembourg Girls 165.0690 1.3778094
9 Taiwan Girls 160.6953 0.7307839
10 Venezuela Girls 160.0370 1.0813162
Simulate data
Now, we can treat the values in the table above as ground truth and simulate some data based on them. Let’s generate \(N=10\) samples for each country from the normal model with \(\mu = \text{Mean.height}\) and \(\sigma = \text{Mean.height.standard.error}\).
R
# Sample size per group
N <- 10
# For each country, generate heights
height_sim <- lapply(1:N_countries, function(i) {
my_df <- height_women10[i, ]
data.frame(Country = my_df$Country,
# Random values from normal
Height = rnorm(N,
mean = my_df$Mean.height,
sd = my_df$Mean.height.standard.error))
}) %>%
do.call(rbind, .)
Let’s plot the data
R
# Plot
height_sim %>%
ggplot() +
geom_point(aes(x = Height, y = Country)) +
labs(title = "Simulated data")
Modeling
Let’s build a normal model that uses partial pooling for the country means and standard deviations. The model can be stated as follows:
\[\begin{align} X_{gi} &\sim \text{N}(\mu_g, \sigma_g) \\ \mu_g &\sim \text{N}(\mu_\mu, \sigma_\mu) \\ \sigma_g &\sim \Gamma(\alpha_\sigma, \beta_\sigma) \\ \mu_\mu &\sim \text{N}(0, 100)\\ \sigma_\mu &\sim \Gamma(2, 0.1) \\ \alpha_\sigma, \beta_\sigma &\sim \Gamma(2, 0.01). \end{align}\]
Above, \(X_{gi}\) denotes the height for individual \(i\) in country \(g\). The country specific parameters \(\mu_g\) and \(\sigma_g\) are given normal and gamma priors, respectively, with unknown hyperparameters that, in turn, are given hyperpriors on the last two lines. The hyperpriors are quite uninformative as they allow large hyperparameter ranges.
Below is the Stan program for this model. The data points are input
as a concatenated vector X
. The country-specific start and
end indices are computed in the transformed data block. This approach
accommodates uneven sample sizes between groups, although in our data
these are equal.
The parameters block contains the declarations of mean and standard deviation vectors, along with the hyperparameters. The hyperparameter subscripts denote the parameter they are assigned to so, for instance, \(\sigma_{\mu}\) is the standard deviation of the mean parameter \(\mu\). The generated quantities block generates samples from the population distributions of \(\mu\) and \(\sigma\) and a country-agnostic posterior predictive distribution \(\tilde{X}\).
STAN
data {
int<lower=0> G; // number of groups
int<lower=0> N[G]; // sample size within each group
vector[sum(N)] X; // concatenated observations
}
transformed data {
// get first and last index for each group in X
int start_i[G];
int end_i[G];
for(g in 1:G) {
if(g == 1) {
start_i[1] = 1;
} else {
start_i[g] = start_i[g-1] + N[g-1];
}
end_i[g] = start_i[g] + N[g]-1;
}
}
parameters {
// parameters
vector[G] mu;
vector<lower=0>[G] sigma;
// hyperparameters
real mu_mu;
real<lower=0> sigma_mu;
real<lower=0> alpha_sigma;
real<lower=0> beta_sigma;
}
model {
// Likelihood for each group
for(i in 1:G) {
X[start_i[i]:end_i[i]] ~ normal(mu[i], sigma[i]);
}
// Priors
mu ~ normal(mu_mu, sigma_mu);
sigma ~ gamma(alpha_sigma, beta_sigma);
// Hyperpriors
mu_mu ~ normal(0, 100);
sigma_mu ~ gamma(2, 0.1);
alpha_sigma ~ gamma(2, 0.01);
beta_sigma ~ gamma(2, 0.01);
}
generated quantities {
real mu_tilde;
real<lower=0> sigma_tilde;
real X_tilde;
// Population distributions
mu_tilde = normal_rng(mu_mu, sigma_mu);
sigma_tilde = gamma_rng(alpha_sigma, beta_sigma);
// Posterior predictive distribution
X_tilde = normal_rng(mu_tilde, sigma_tilde);
}
Now we can call Stan and fit the model. Hierarchical models can
encounter convergence issues and for this reason, we’ll use 10000
iterations and set adapt_delta = 0.99
. This controls the
acceptance probability in Stan’s adaptation period and will result in a
smaller step size in the Markov chain. Moreover, we’ll speed up the
inference by running 2 chains in parallel by setting
cores = 2
.
R
stan_data <- list(G = length(unique(height_sim$Country)),
N = rep(N, length(Countries)),
X = height_sim$Height)
normal_hier_fit <- rstan::sampling(normal_hier_model,
stan_data,
iter = 10000,
chains = 2,
# Use to avoid divergent transitions:
control = list(adapt_delta = 0.99),
cores = 2)
For comparison, we will also run an unpooled analysis of heights, which makes use of the following Stan file:
R
unpooled_summaries <- list()
for(i in 1:N_countries) {
my_country <- unique(height_sim$Country)[i]
my_df <- height_sim %>%
filter(Country == my_country)
my_stan_data <- list(N = my_df %>% nrow,
X = my_df$Height)
my_fit <- rstan::sampling(normal_pooled_model,
my_stan_data,
iter = 10000,
chains = 2,
# Use to get rid of divergent transitions:
control = list(adapt_delta = 0.99),
cores = 2,
refresh = 0
)
unpooled_summaries[[i]] <- rstan::summary(my_fit, c("mu", "sigma"))$summary %>%
data.frame() %>%
rownames_to_column(var = "par") %>%
mutate(country = my_country)
}
unpooled_summaries <- do.call(rbind, unpooled_summaries)
Country-specific estimates
Let’s first compare the marginal posteriors for the country-specific estimates from the hierarchical model (blue) and an unpooled model (brown) that treats the parameters separately.
R
par_summary <- rstan::summary(normal_hier_fit, c("mu", "sigma"))$summary %>%
data.frame() %>%
rownames_to_column(var = "par") %>%
separate(par, into = c("par", "country"), sep = "\\[") %>%
mutate(country = gsub("\\]", "", country)) %>%
mutate(country = Countries[as.integer(country)])
ggplot() +
geom_errorbar(data = par_summary, aes(x = country, ymin = X2.5., ymax = X97.5.),
color = posterior_color) +
geom_point(data = height_women10 %>%
rename_with(~ c('mu', 'sigma'), 3:4) %>%
gather(key = "par",
value = "value",
-c(Country, Sex)),
aes(x = Country, y = value)) +
geom_errorbar(data = unpooled_summaries,
aes(x = country, ymin = X2.5., ymax = X97.5.),
color = "brown") +
facet_wrap(~ par, scales = "free", ncol = 1) +
coord_flip()
Above, the black points represent the true values, and the intervals are the 95% CIs for a hierarchical and non-hierarchical models, respectively. As apparent, the CIs from the hierarchical model are more concentrated and better capture the true values.
Challenge
Experiment with the data and fit. Explore the effect of sample size, unequal sample sizes between countries, and the amount of countries, for example.
Hyperparameters
Let’s then plot the population distribution’s parameters, that is, the hyperparameters. The sample-based values are included in the plots of \(\mu_\mu\) and \(\sigma_\mu\) (why not for the other two hyperparameters?). It seems that the model has slightly underestimated the overall average and variance of the mean parameter \(\mu\), which is not suprising given the low number of data points.
R
## Population distributions:
population_samples_l <- rstan::extract(normal_hier_fit,
c("mu_mu", "sigma_mu", "alpha_sigma", "beta_sigma")) %>%
do.call(cbind, .) %>%
set_colnames(c("mu_mu", "sigma_mu", "alpha_sigma", "beta_sigma")) %>%
data.frame() %>%
mutate(sample = 1:nrow(.)) %>%
gather(key = "hyperpar", value = "value", -sample)
ggplot() +
geom_histogram(data = population_samples_l,
aes(x = value, y = after_stat(density)),
fill = posterior_color,
bins = 100) +
geom_vline(data = height_women %>%
rename_with(~ c('mu', 'sigma'), 3:4) %>%
filter(Sex == "Girls") %>%
summarise(mu_mu = mean(mu), sigma_mu = sd(mu)) %>%
gather(key = "hyperpar", value = "value"),
aes(xintercept = value)
)+
facet_wrap(~hyperpar, scales = "free")
Population distributions
Let’s then plot the population distributions and compare to the true sample means and standard errors.
R
population_l <- rstan::extract(normal_hier_fit, c("mu_tilde", "sigma_tilde")) %>%
do.call(cbind, .) %>%
data.frame() %>%
set_colnames( c("mu", "sigma")) %>%
mutate(sample = 1:nrow(.)) %>%
gather(key = "par", value = "value", -sample)
ggplot() +
geom_histogram(data = population_l,
aes(x = value, y = after_stat(density)),
bins = 100, fill = posterior_color) +
geom_histogram(data = height_women %>%
rename_with(~ c('mu', 'sigma'), 3:4) %>%
gather(key = "par", value = "value", -c(Country, Sex)) %>%
filter(Sex == "Girls"),
aes(x = value, y = after_stat(density)),
alpha = 0.75, bins = 30) +
facet_wrap(~par, scales = "free") +
labs(title = "Blue = posterior; black = sample")
We can see that the population distribution is able to capture the measured average heights and standard deviations relatively well, although the within-country variation is estimated to be too concentrated. However, remember that these estimates are based on a limited sample: 10 out of 200 countries with 10 individuals in each group.
Posterior predictive distribution
Finally, let’s plot the posterior predictive distribution for the whole population and overlay it with the simulated data based on all countries.
R
# For each country, generate some random girl's heights
height_all <- lapply(1:nrow(height_women), function(i) {
my_df <- height_women[i, ] %>%
rename_with(~ c('mu', 'sigma'), 3:4)
data.frame(Country = my_df$Country,
Sex = my_df$Sex,
# Random normal values based on sample mu and sd
Height = rnorm(N, my_df$mu, my_df$sigma))
}) %>%
do.call(rbind, .)
R
# Extract the posterior predictive distribution
PPD <- rstan::extract(normal_hier_fit, c("X_tilde")) %>%
data.frame() %>%
set_colnames( c("X_tilde")) %>%
mutate(sample = 1:nrow(.))
ggplot() +
geom_histogram(data = PPD,
aes(x = X_tilde, y = after_stat(density)),
bins = 100,
fill = posterior_color) +
geom_histogram(data = height_all,
aes(x = Height, y = after_stat(density)),
alpha = 0.75,
bins = 100)
Extensions
Here, we analyzed the height for women in a randomly chosen countries using a hierarchical model. The model could be extended further, for instance, by adding hierarchy between sexes, continents, developed/developing countries etc.
Key Points
- Hierarchical models are appropriate for scenarios where the study population naturally divides into subgroups.
- Hierarchical models borrow statistical strength across the population groups.
- Population distributions hold information about the variation of the model parameters over the whole population.
Reading
- Bayesian Data Analysis (3rd ed.): Ch. 5
- Statistical Rethinking (2nd ed.): Ch. 13
- Bayes Rules!: Ch. 15-19
References
[1] Height: Height and body-mass index trajectories of school-aged children and adolescents from 1985 to 2019 in 200 countries and territories: a pooled analysis of 2181 population-based studies with 65 million participants. Lancet 2020, 396:1511-1524