
Mortality and survival in Game of Thrones

Overview

Teaching: 30 min
Exercises: 120 min
Questions
  • How can I import data into R?

  • How can I check the quality of the data?

  • How can I summarise my data through graphics?

  • How can I perform survival analysis?

Objectives
  • Explore GoT mortality dataset.

  • Visualise GoT dataset graphically.

  • Get to know the basics of survival analyses.

Death is certain, the time is not

Let’s start by downloading the Game of Thrones characters’ mortality data, which was published here. Please save the following two files using the File - Save As dialog in your browser.

  1. Original characters data
  2. Additional data encoding table

In this episode, we will provide solutions based on both base R and the Tidyverse. To begin with, we will only load the dplyr package, which we will use the most. Note that we will call some functions from other Tidyverse packages using the package_name::function_name syntax, which is a common way of calling a function without loading the whole package.
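
For example, the following line calls a tidyr function without attaching the tidyr package (this assumes tidyr is installed, which it is if you have installed the Tidyverse):

## call tidyr's replace_na() without loading tidyr
tidyr::replace_na(c(1, NA, 3), replace = 0)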

library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

A visual guide for using the most common dplyr functions is provided in the Tidyverse section of this website.

Challenge 1

Save the two files in your data/ directory and change the working directory to it. Now read the data/character_data_S01-S08.csv and data/encoding.csv files into R.

Solution to Challenge 1

setwd('path/to/data')
got_dat <- read.csv(file = "character_data_S01-S08.csv", header = TRUE, stringsAsFactors = FALSE)
meta <- read.csv(file = "encoding.csv", header = TRUE, stringsAsFactors = FALSE)

Once data is loaded into R, let’s evaluate its quality.

Challenge 2

Does the table with the GoT characters’ mortality data look correct? Are there any missing entries? Tip - use base R functionality to explore the data frame.

Solution to Challenge 2

## make a summary for each column 
summary(got_dat)
       id            name                sex           religion    
 Min.   :100.0   Length:359         Min.   :1.000   Min.   :1.000  
 1st Qu.:189.5   Class :character   1st Qu.:1.000   1st Qu.:4.000  
 Median :279.0   Mode  :character   Median :1.000   Median :9.000  
 Mean   :279.0                      Mean   :1.292   Mean   :7.042  
 3rd Qu.:368.5                      3rd Qu.:2.000   3rd Qu.:9.000  
 Max.   :458.0                      Max.   :2.000   Max.   :9.000  
                                                                   
   occupation    social_status   allegiance_last allegiance_switched
 Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000      
 1st Qu.:1.000   1st Qu.:1.000   1st Qu.:3.000   1st Qu.:1.000      
 Median :2.000   Median :2.000   Median :7.000   Median :1.000      
 Mean   :2.552   Mean   :1.688   Mean   :5.588   Mean   :1.153      
 3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:8.000   3rd Qu.:1.000      
 Max.   :9.000   Max.   :2.000   Max.   :9.000   Max.   :2.000      
                                                                    
  intro_season   intro_episode   intro_time_sec   intro_time_hrs 
 Min.   :1.000   Min.   : 1.00   Min.   :     1   Min.   : 0.00  
 1st Qu.:1.000   1st Qu.: 9.00   1st Qu.: 25348   1st Qu.: 7.04  
 Median :3.000   Median :26.00   Median : 77694   Median :21.58  
 Mean   :3.487   Mean   :29.07   Mean   : 87279   Mean   :24.24  
 3rd Qu.:5.000   3rd Qu.:49.00   3rd Qu.:146960   3rd Qu.:40.82  
 Max.   :8.000   Max.   :73.00   Max.   :229649   Max.   :63.79  
                                                                 
    dth_flag        dth_season     dth_episode     dth_time_sec   
 Min.   :0.0000   Min.   :1.000   Min.   : 1.00   Min.   :   342  
 1st Qu.:0.0000   1st Qu.:2.750   1st Qu.:22.25   1st Qu.: 67520  
 Median :1.0000   Median :5.000   Median :43.50   Median :131499  
 Mean   :0.5905   Mean   :4.571   Mean   :41.55   Mean   :126015  
 3rd Qu.:1.0000   3rd Qu.:6.000   3rd Qu.:59.00   3rd Qu.:178775  
 Max.   :1.0000   Max.   :8.000   Max.   :73.00   Max.   :226284  
                  NA's   :147     NA's   :147     NA's   :147     
  dth_time_hrs   censor_time_sec  censor_time_hrs   exp_season   
 Min.   : 0.10   Min.   :   342   Min.   : 0.10   Min.   :1.000  
 1st Qu.:18.76   1st Qu.:115072   1st Qu.:31.96   1st Qu.:1.000  
 Median :36.53   Median :190484   Median :52.91   Median :3.000  
 Mean   :35.00   Mean   :168922   Mean   :46.92   Mean   :3.487  
 3rd Qu.:49.66   3rd Qu.:230800   3rd Qu.:64.11   3rd Qu.:6.000  
 Max.   :62.86   Max.   :230800   Max.   :64.11   Max.   :8.000  
 NA's   :147                                                     
  exp_episode     exp_time_sec     exp_time_hrs   featured_episode_count
 Min.   : 1.00   Min.   :     8   Min.   : 0.00   Min.   : 1.000        
 1st Qu.: 5.00   1st Qu.: 14496   1st Qu.: 4.03   1st Qu.: 1.000        
 Median :20.00   Median : 66551   Median :18.49   Median : 3.000        
 Mean   :26.35   Mean   : 81644   Mean   :22.68   Mean   : 7.805        
 3rd Qu.:45.00   3rd Qu.:144592   3rd Qu.:40.16   3rd Qu.: 8.000        
 Max.   :73.00   Max.   :230347   Max.   :63.99   Max.   :67.000        
                                                                        
   prominence     dth_description    icd10_dx_code      icd10_dx_text     
 Min.   :0.1111   Length:359         Length:359         Length:359        
 1st Qu.:0.3333   Class :character   Class :character   Class :character  
 Median :0.8750   Mode  :character   Mode  :character   Mode  :character  
 Mean   :1.1292                                                           
 3rd Qu.:1.1716                                                           
 Max.   :7.3425                                                           
                                                                          
 icd10_cause_code   icd10_cause_text   icd10_place_code   icd10_place_text  
 Length:359         Length:359         Length:359         Length:359        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
  top_location    geo_location    time_of_day       X             X.1         
 Min.   :1.000   Min.   :1.000   Min.   :1.000   Mode:logical   Mode:logical  
 1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000   NA's:359       NA's:359      
 Median :2.000   Median :1.000   Median :1.000                                
 Mean   :2.377   Mean   :1.175   Mean   :2.509                                
 3rd Qu.:2.000   3rd Qu.:1.000   3rd Qu.:2.000                                
 Max.   :9.000   Max.   :2.000   Max.   :9.000                                
 NA's   :147     NA's   :147     NA's   :147                                  
   X.2            X.3            X.4            X.5         
 Mode:logical   Mode:logical   Mode:logical   Mode:logical  
 NA's:359       NA's:359       NA's:359       NA's:359      
                                                            
                                                            
                                                            
                                                            
                                                            

The last six columns (X to X.5) have no entries at all, so they should be removed so that they do not interfere with the statistical analyses.

## remove columns that only contain NAs as entries
# (1) for each column, check whether all rows are NA
nas_all <- got_dat %>% 
  # summarise_all() applies the same function to every column;
  # the ~ creates an anonymous function (a purrr-style lambda)
  summarise_all(list(~ all(is.na(.))))
# (2) which columns contain only NAs?
to_remove <- which(nas_all == TRUE)
# (3) remove these columns 
got <- got_dat[, -to_remove]
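
Since this episode promises both flavours, here is a minimal base R sketch of the same cleanup; it should produce the same result as the dplyr version above.

## base R alternative: flag columns in which every entry is NA ...
all_na <- sapply(got_dat, function(x) all(is.na(x)))
## ... and keep only the remaining columns
got_base <- got_dat[, !all_na]
## check that both approaches agree
identical(got, got_base)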

Graphical data exploration

Before proceeding into any kind of statistical analysis, it is worth exploring the dataset of interest from different perspectives.

To make graphical data visualisations, we will be using the ggplot2 package.

library(ggplot2)

First, we will make plots to check the distribution of different variables:

Categorical:

Type of occupation was categorised as “silk collar” (e.g. clergy, merchants, politicians, and rulers) or “boiled leather collar” (e.g. warriors, farmers, and other occupations relying heavily on manual work).

Type of social status was categorised as “highborn” (lords, ladies, or legitimate offspring) or “lowborn” (all other characters).

Because some characters switched allegiance during the show, both their last known allegiance and whether or not they switched allegiance during the show were recorded.

Whether character died or not during the period provided in the dataset is flagged in column dth_flag.

Continuous:

A proxy measure for how prominently a character featured in the show was provided in the data. This prominence score was calculated by taking the number of episodes that a character appeared in and dividing that by the number of total episodes that the character could have appeared in (i.e. the number of episodes occurring from the character first being introduced until the point of death or censoring). This ratio was then multiplied by the number of seasons that the character had featured in.
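
To make the definition concrete, here is a small sketch of the calculation for a hypothetical character; the numbers are invented purely for illustration.

## hypothetical character: appeared in 12 of the 20 episodes aired between
## their introduction and their death (or censoring), spread over 3 seasons
episodes_appeared <- 12
episodes_possible <- 20
seasons_featured  <- 3
prominence <- episodes_appeared / episodes_possible * seasons_featured
prominence  # 1.8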

Quick question

Which of the other variables in the dataset are categorical and which are continuous?

Distribution

To begin with, let’s compare three categorical variables, e.g. occupation vs sex vs social status.

Challenge 3

Make a bar chart to show the distribution of three categorical variables of your choice, e.g. occupation vs sex vs social status. How can you ensure that all three variables are represented in a single figure? Tip - think about the aesthetics mapping in ggplot().

Solution to Challenge 3

ggplot(got) +
  geom_bar(aes(x = factor(occupation), fill = factor(social_status))) +
  facet_wrap(~sex) +
  scale_x_discrete(name = "occupation") +
  scale_fill_viridis_d(name = "social status")

[Plot: bar chart of occupation (numeric codes) coloured by social status, faceted by sex]

This is not a very informative graph, because all the categorical variables are encoded as numerical categories. Details of which number corresponds to which value are available in the data_dictionary.pdf file that you can download from the original data source link. For simplicity’s sake, we have saved them in the data/encoding.csv file, which you loaded as the meta object in Challenge 1.

Challenge 4

How can you find the values for each of the encoded categorical variables? Tip - use the tidyr::pivot_longer function to list all variables for each character in separate rows, then iterate over them using dplyr::rowwise and, finally, apply the tidyr::pivot_wider function to collect all observations for each character into a single row again.

Solution to Challenge 4

## (1) What are the unique categorical variables?
cols_cat <- unique(meta$variable)

## (2) Use tidyr::pivot_longer to pivot data to long format:
# for each character, put every categorical variable into a separate row
got_cat_long <- got %>%
  # use all_of() to force the selection of the column names listed in cols_cat
  select(all_of(cols_cat), id, name) %>% 
  tidyr::pivot_longer(-c(id, name),
                      names_to = "cat_variable",
                      values_to = "cat_code")

## (3) Extract the variables' values from the meta data.frame
got_cat_long <- got_cat_long %>%
  # process each character/variable combination separately
  rowwise() %>% 
  mutate(variable_value = ifelse(
    # if the code is missing, keep NA ...
    is.na(cat_code), NA,
    # ... otherwise look up the code's value in the meta data.frame
    meta %>% 
      filter(variable == cat_variable, code == cat_code) %>%
      select(value) %>%
      pull())) %>% 
  select(-cat_code) 
  
## (4) Use tidyr::pivot_wider to pivot data back to wide format:
got_cat <- got_cat_long %>% 
  tidyr::pivot_wider(names_from = cat_variable,
                     values_from = variable_value) %>% 
  # remove grouping by rows
  ungroup()
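
As a side note, for a single variable the same decoding could also be done with a join instead of the rowwise lookup. A minimal sketch for the occupation column, using the meta columns variable, code and value introduced above:

## build a small lookup table for one variable ...
occupation_codes <- meta %>%
  filter(variable == "occupation") %>%
  select(code, occupation_label = value)
## ... and join it onto the main table by the numeric code
got_occ_decoded <- got %>%
  left_join(occupation_codes, by = c("occupation" = "code"))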

Now that you have a data.frame with values for the categorical variables, re-run the distribution plot. Make sure that the x-axis is readable. Tip - rotate the labels.

ggplot(got_cat) +
  geom_bar(aes(x = factor(occupation), fill = factor(social_status))) +
  facet_wrap(~sex) +
  scale_x_discrete(name = "occupation") +
  scale_fill_viridis_d(name = "social status") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

[Plot: bar chart of occupation coloured by social status, faceted by sex, now with decoded category labels]

Let’s explore this dataset further by looking into how frequently new characters were introduced into the show. Which got data.frame column stores this information?

Challenge 5

Make two bar charts: one to show how many characters were introduced in each season and one to show how many characters died in each season.

Solution to Challenge 5

## make a bar chart to show how many characters were introduced in each season
ggplot(got) +
  geom_bar(aes(x = as.factor(intro_season))) +
  scale_x_discrete(name = "Season number") +
  ggtitle("How many new characters were introduced in each season")

[Plot: bar chart of the number of new characters introduced in each season]

Maybe this explains why season 7 is considered the worst of all?

Now, let’s plot how many characters died in each season. There are characters which have NAs in the corresponding data.frame columns. Can you add them to the plot with a more meaningful data label than NA?

## the second bar chart
ggplot(got %>%
  ## use dplyr mutate inside ggplot to quickly modify the column only for the plot
  mutate(dth_season = ifelse(is.na(dth_season), "Still alive", dth_season))) +
  geom_bar(aes(x = as.factor(dth_season))) +
  scale_x_discrete(name = "Season number") +
  ggtitle("How many characters died in each season")

[Plot: bar chart of the number of characters that died in each season, with surviving characters labelled "Still alive"]

Brief overview

It is worth performing some basic statistics before diving deep into the questions that really interest you.

For example, we can check whether men and women have the same distribution of occupations using a chi-squared test. The chi-squared test is a statistical hypothesis test whose null hypothesis is that the observed frequencies of a categorical variable match the expected frequencies.
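
To see where the expected frequencies come from, here is a small sketch on a made-up 2 x 2 table; the counts are invented for illustration only.

## a made-up contingency table: 2 groups x 2 categories
toy <- matrix(c(30, 10, 20, 40), nrow = 2,
              dimnames = list(group = c("A", "B"),
                              category = c("yes", "no")))
## expected counts under independence: row total * column total / grand total
chisq.test(toy, correct = FALSE)$expected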

Challenge 6

Calculate the chi-squared statistic between sex and occupation, or between categorical variables of your choice. Which of the variables are independent of the sex variable and which are dependent?

We will use the function chisq.test and set correct = FALSE to turn off Yates’ continuity correction.

Solution to Challenge 6

## look into the number of characters in each category
table(got_cat$sex, got_cat$occupation)
        
         Boiled leather collar Silk collar Unknown/Unclear
  Female                    44          24              37
  Male                     177          72               5
## run the test
chisq.test(got_cat$sex, got_cat$occupation, correct = FALSE)

	Pearson's Chi-squared test

data:  got_cat$sex and got_cat$occupation
X-squared = 80.436, df = 2, p-value < 2.2e-16

It seems as if the sex and occupation variables are dependent. However, the occupation of many characters is unknown, so perhaps those characters should be omitted from the test.

## remove characters for which occupation is not known
got_occup <- got_cat %>% 
  filter(occupation != "Unknown/Unclear")
## rerun the test
chisq.test(got_occup$sex, got_occup$occupation, correct = FALSE)

	Pearson's Chi-squared test

data:  got_occup$sex and got_occup$occupation
X-squared = 1.0293, df = 1, p-value = 0.3103

The cause of death is stored in the column icd10_cause_text in the original dataset. The value dth_flag == 1 indicates that a character died during the period described in the dataset.

head(got[got$dth_flag == 1, "icd10_cause_text"])
[1] "Assault by knife"                                                                  
[2] "Assault by knife"                                                                  
[3] "Legal execution"                                                                   
[4] "Assault by hanging, strangulation and suffocation"                                 
[5] "Assault by other specified sharp object"                                           
[6] "War operations involving firearm discharge and other forms of conventional warfare"

Challenge 7

Provide answers to the following questions:

  • What percentage of characters died by the end of the period included in the dataset?
  • What were the major causes of death?

Solution to Challenge 7

chars_died <- nrow(got[got$dth_flag == 1, ])
chars_total <- nrow(got)
## percentage of characters that died
chars_died / chars_total * 100
[1] 59.05292

To identify the most common causes of death, use the base R function table, which calculates the frequencies of entries.

causes <- table(got[got$dth_flag == 1, "icd10_cause_text"])
causes <- as.data.frame(causes[order(causes, decreasing = TRUE)])
causes$prop <- causes$Freq/chars_died * 100
cat(paste(causes$Var1,  "-", causes$prop, "\n", sep = " "))
Assault by knife - 27.3584905660377 
 War operations involving firearm discharge and other forms of conventional warfare - 24.5283018867925 
 Assault by smoke, fire and flames - 8.49056603773585 
 Assault by other specified sharp object - 5.66037735849057 
 Legal execution - 5.18867924528302 
 Assault by drugs, medicaments and biological substances - 3.30188679245283 
 War operations involving fires, conflagrations and hot substances - 3.30188679245283 
 Assault by bodily force - 2.83018867924528 
 Assault by unspecified means - 2.83018867924528 
 Assault by pushing from high place - 2.35849056603774 
 Bitten or struck by other mammal - 2.35849056603774 
 Assault by hanging, strangulation and suffocation - 1.88679245283019 
 Other maltreatment syndromes - 1.88679245283019 
  - 1.41509433962264 
 Bitten or struck by dog - 1.41509433962264 
 Assault by blunt object - 0.943396226415094 
 Intentional self-harm by jumping from a high place - 0.943396226415094 
 Intentional self-poisoning by and exposure to other unspecified drugs, medicaments and biological substances - 0.943396226415094 
 War operations, unspecified - 0.943396226415094 
 Assault by steam, hot vapours and hot objects - 0.471698113207547 
 Intentional self-harm by hanging - 0.471698113207547 
 Intentional self-harm by knife - 0.471698113207547 

Survival analysis

We will use Kaplan-Meier (KM) survival analysis with Cox proportional hazard regression modelling to quantify survival times and probabilities and to identify independent predictors of mortality, respectively.

A good introduction to the topic can be found at datacamp.

Kaplan-Meier model

The survival probability is the probability that an individual survives from the time origin (here, first appearance on the screen) to a specified future time (here, end of the period described in the dataset). The KM method is a non-parametric method used to estimate the survival probability from observed survival times. The KM survival curve provides a summary of the data and can be used to estimate e.g. median survival time.
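
For reference, the KM estimate of the survival function is a product over the observed event (death) times:

$$\hat{S}(t) = \prod_{t_i \le t} \left( 1 - \frac{d_i}{n_i} \right),$$

where $d_i$ is the number of deaths at time $t_i$ and $n_i$ is the number of characters still at risk (alive and not yet censored) just before $t_i$.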

Fit data to model

We will use the survival package to perform the model fitting and the survminer package to plot survival curves. Install and load the required packages.

install.packages(c("survival", "survminer"))
library(survival)
library(survminer)
Loading required package: ggpubr
Loading required package: magrittr

First, we will fit the mortality data to the KM model. The column exp_time_hrs stores each character’s survival time in the show (in hours), and the column dth_flag indicates whether the character died. Let’s add these columns to the got_cat data.frame, which contains the categorical variables’ values, so that all the necessary information is in one table.

## got and got_cat have the same order, therefore we can simply take the columns from got
got_cat$exp_time_hrs <- got$exp_time_hrs
got_cat$dth_flag <- got$dth_flag
surv_object <- with(got_cat, Surv(exp_time_hrs, dth_flag))
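
To see what the survival object looks like, print its first few entries; a time followed by a + marks a censored observation, i.e. a character who was still alive at the end of the observation period.

## peek at the survival object; "time+" indicates a censored character
head(surv_object)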

The function survfit will be used to compute the KM survival estimate. Its main arguments include:

  • formula - a formula with a Surv survival object on the left-hand side and the grouping variable(s) (or 1 for no grouping) on the right
  • data - the data frame containing the variables used in the formula

Let’s plot the survival probability against time in the show, and also add a line marking the median survival time.

## survival without grouping requires specifying 1 in the formula
surv_model <- survfit(Surv(exp_time_hrs, dth_flag) ~ 1, data = got_cat)
ggsurvplot(surv_model, data = got_cat, surv.median.line = "hv")

[Plot: overall KM survival curve with the median survival time marked]

Use the surv_model object to extract the probability of surviving at least 1 h in the show.

surv_sum <- summary(surv_model)
## survival probabilities at all event times before 1 hour
probs_1 <- surv_sum$surv[which(surv_sum$time < 1)]
## the last of these is the probability of surviving at least 1 hour
probs_1[length(probs_1)]
[1] 0.8462928
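
Alternatively, summary() of a survfit object accepts a times argument, which evaluates the curve directly at a chosen time point, so the same probability can be obtained in one line:

## survival estimate evaluated directly at 1 hour
summary(surv_model, times = 1)$surv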

Stratified survival

Let’s check whether the survival probability differs between various groups of characters. We will stratify individuals by:

  • sex
  • social status
  • whether allegiance was switched
  • prominence (as tertiles, in Challenge 9)

To compare two or more survival curves, the log-rank test is most commonly applied. Essentially, the log-rank test compares the observed number of events (i.e. deaths) in each group to what would be expected if the null hypothesis were true (i.e. if the survival curves were identical).

The function survdiff can be used to compute the log-rank test comparing two or more survival curves. The variable that stratifies individuals into groups has to be specified in the function’s formula.
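
For example, a log-rank test comparing survival between the sexes could be run like this (ggsurvplot with pval = TRUE reports the same test on the plot):

## log-rank test: does survival differ between sexes?
survdiff(Surv(exp_time_hrs, dth_flag) ~ sex, data = got_cat)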

Challenge 8

Fit the KM model for the three variables sex, social_status and allegiance_switched. You will need to specify these in the formula inside the survfit function. To add the p-value of the log-rank test to the plot, use the pval = TRUE argument of the ggsurvplot function. Don’t forget to use the data.frame with string values for the categorical variables so that the plots have clear labels.

Solution to Challenge 8

## stratify by sex
surv_model <- survfit(Surv(exp_time_hrs, dth_flag) ~ sex, data = got_cat)
ggsurvplot(surv_model, data = got_cat, pval = TRUE)

[Plot: KM survival curves stratified by sex, with log-rank p-value]

## stratify by social_status
surv_model <- survfit(Surv(exp_time_hrs, dth_flag) ~ social_status, data = got_cat)
ggsurvplot(surv_model, data = got_cat, pval = TRUE)

[Plot: KM survival curves stratified by social status, with log-rank p-value]

## stratify by allegiance_switched
surv_model <- survfit(Surv(exp_time_hrs, dth_flag) ~ allegiance_switched, data = got_cat)
ggsurvplot(surv_model, data = got_cat, pval = TRUE)

[Plot: KM survival curves stratified by allegiance switching, with log-rank p-value]

In order to model survival based on prominence, which is a continuous variable, we have to categorise characters into groups (i.e. turn it into a discrete variable).

Challenge 9

Divide characters into tertiles (i.e. high, medium and low) based on their prominence. Tip - one possible way of doing this is with the dplyr package. Make a KM survival curve plot for the prominence categories.

Solution to Challenge 9

prominence_cats <- c("Low", "Medium", "High")
## bin data into tertiles (n = 3)
got_cat$prominence_tertile <- ntile(got$prominence, n = 3)
got_cat$prominence <- prominence_cats[got_cat$prominence_tertile]
## stratify by prominence tertile
surv_model <- survfit(Surv(exp_time_hrs, dth_flag) ~ prominence, data = got_cat)
ggsurvplot(surv_model, data = got_cat, pval = TRUE)

[Plot: KM survival curves stratified by prominence tertile, with log-rank p-value]

Cox model

Cox proportional hazards regression analysis, which works for both quantitative and categorical predictor variables, extends survival analysis methods to assess the effect of multiple risk factors on survival time simultaneously.
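
For reference, the Cox model writes the hazard of an individual with covariates $x_1, \dots, x_p$ as a shared baseline hazard scaled by the covariate effects:

$$h(t \mid x) = h_0(t)\, \exp(\beta_1 x_1 + \dots + \beta_p x_p),$$

where $h_0(t)$ is the (unspecified) baseline hazard and $\exp(\beta_j)$ is the hazard ratio associated with covariate $x_j$.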

The function coxph can be used to compute the Cox proportional hazards regression model. Its main arguments include:

  • formula - a formula with a Surv survival object on the left-hand side and the predictor variables on the right
  • data - the data frame containing the variables used in the formula

Let’s fit a univariate Cox regression for the single variable sex.

coxph(Surv(exp_time_hrs, dth_flag) ~ sex, data = got_cat)
Call:
coxph(formula = Surv(exp_time_hrs, dth_flag) ~ sex, data = got_cat)

          coef exp(coef) se(coef)     z        p
sexMale 0.6264    1.8709   0.1697 3.691 0.000223

Likelihood ratio test=15.24  on 1 df, p=9.462e-05
n= 359, number of events= 212 

Multivariate Cox model

To perform multivariate Cox regression, all variables of interest must be listed in the formula. The obtained p-values indicate whether the relationship between survival and the given risk factor was significant. Which variables are significant in this Cox model?

cox_fit <- coxph(Surv(exp_time_hrs, dth_flag) ~ sex + social_status + allegiance_switched + prominence, data = got_cat)
print(cox_fit)
Call:
coxph(formula = Surv(exp_time_hrs, dth_flag) ~ sex + social_status + 
    allegiance_switched + prominence, data = got_cat)

                          coef exp(coef) se(coef)      z        p
sexMale                 0.2584    1.2949   0.1727  1.496   0.1345
social_statusLowborn    0.3402    1.4053   0.1571  2.165   0.0304
allegiance_switchedYes -0.8756    0.4166   0.2092 -4.186 2.84e-05
prominenceLow          -1.4369    0.2377   0.2360 -6.089 1.14e-09
prominenceMedium        0.7619    2.1424   0.1602  4.755 1.98e-06

Likelihood ratio test=156.9  on 5 df, p=< 2.2e-16
n= 359, number of events= 212 

Hazard ratios (HR) are derived from the multivariate Cox model. Briefly, an HR > 1 indicates an increased risk of death if the corresponding risk factor applies to an individual, while an HR < 1 indicates a decreased risk. Plot the obtained HRs using the function ggforest.

ggforest(cox_fit, data = got_cat)

[Plot: forest plot of hazard ratios from the multivariate Cox model]

Challenge 10

What kind of a character was more likely to die in Game of Thrones?

Solution to Challenge 10

A character who was more likely to die in Game of Thrones was:

  • Male, rather than female (but not statistically significant)
  • Lowborn, rather than highborn
  • Those who did not switch allegiance (disloyalty pays off?)
  • Characters who only featured moderately prominently (protection by the importance of the role?)

Key Points

  • Load data into R.

  • Practice using base R, dplyr and tidyverse functions for data manipulations.

  • Perform data visualisations using the ggplot2 package.

  • Perform survival analyses using the survival and survminer packages.