This lesson is still being designed and assembled (Pre-Alpha version)

Data Viz

Overview

Teaching: 10 min
Exercises: 2 min
Learning Objectives

By the end of this session, you should be able to:

Introduction

In this session, we will show you some examples of how to build a plot with the ggplot2 package. We will be looking at scatter plots, histograms, density plots, box-and-whisker plots, faceted plots and a map of the UK.

The ggplot2 package implements a “layered grammar of graphics” where data visualisation is divided into layers. This is principally based on Leland Wilkinson’s book The Grammar of Graphics. Many of the wider principles of good graphical design come from Edward Tufte, a statistician, professor emeritus at Yale University and all-round godfather of data visualisation.

The layers that you will need to “build up” to form a plot include:

  1. the data you want to visualise (e.g. heights and weights).
  2. an aesthetic mapping to represent variables (e.g. map height to the y-axis, map weight to the x-axis).
  3. a geometric way to represent the information (e.g. display with a point).
  4. a coordinate system (e.g. use standard Cartesian axes).
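
The four layers listed above can be sketched in a few lines. This is only a minimal illustration: the `people` data frame and its values are made up, and `coord_cartesian()` is written out explicitly even though it is what ggplot2 uses by default.

```r
library(ggplot2)

# Made-up heights and weights, purely for illustration.
people <- data.frame(
  weight = c(58, 64, 71, 80, 92),
  height = c(158, 165, 172, 180, 186)
)

p_layers <- ggplot(data = people) +   # 1. the data
  aes(x = weight, y = height) +       # 2. the aesthetic mapping
  geom_point() +                      # 3. the geometry
  coord_cartesian()                   # 4. the coordinate system (the default)

p_layers
```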

First, let’s load the packages that we will be using into memory.

library(tidyverse) # contains ggplot2
## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(viridis)
## Loading required package: viridisLite
library(maps)
## 
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
## 
##     map
library(RColorBrewer)

Viridis is a great package that provides colour maps that are perceptually uniform in both colour and black-and-white. They are also designed to be perceived by viewers with common forms of colour blindness.
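
As a quick sketch of what a viridis colour map looks like in practice, the snippet below samples a few colours from the default palette. It assumes the viridisLite package (loaded automatically with viridis, as the output above shows) is installed. The palette size of 5 is an arbitrary choice.

```r
library(viridisLite)

# Five colours sampled evenly along the default viridis colour map,
# returned as hexadecimal colour strings.
pal <- viridis(5)
print(pal)

# end = 0.8 stops short of the brightest yellow, which can be hard to see on white.
pal_trimmed <- viridis(5, end = 0.8)
print(pal_trimmed)
```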

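The maps package provides outline data for regions of the world, which we will use to draw a map of the UK. RColorBrewer provides further colour palettes.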

Import data

long_df <- read_csv("./data/longitudinal-data.csv")
demo_df <- read_csv("./data/demographic-data.csv")

We will also pull in some other data to help with the visualisations, and perform a join between our two datasets so that everything is in one place.

long_demo <- left_join(long_df, demo_df, by = "episode_id")
postcodes <- read_csv(file = "../../static/ggpostcodes.csv")
UK <- map_data(map = "world", region = "UK")
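
If you have not met `left_join()` before, the toy example below sketches what it does. The two small tables are hypothetical stand-ins for the lesson's data: only the `episode_id` key comes from the code above, and every other column and value is invented.

```r
library(dplyr)

# Several measurements per episode (hypothetical values).
long_toy <- data.frame(
  episode_id = c(1, 1, 2, 3),
  pao2_abg   = c(11.2, 10.8, 9.5, 12.1)
)

# One row per episode (hypothetical values); episode 3 is deliberately missing.
demo_toy <- data.frame(
  episode_id = c(1, 2),
  sex        = c("F", "M")
)

# Every row of long_toy is kept; demographic columns are filled in where the
# key matches, and set to NA where it does not.
joined <- left_join(long_toy, demo_toy, by = "episode_id")
joined
```

A left join never drops rows from the first table, so it is worth checking for unexpected NA values afterwards.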

Data fields

Check our available fields.

names(long_df)
names(demo_df)
names(long_demo) 

Scatter Plots

A scatter plot is a useful way to represent two continuous variables, and makes any correlation between them easy to see. Observe how the plot is built up layer by layer, starting with the data, then the aesthetics, and then the geoms.

ggplot(data = demo_df) +
  aes(x = weight) +
  aes(y = height) +
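  geom_jitter() +                           # jittered points; used instead of geom_point() so each patient is drawn once
  aes(colour = sex) +
  geom_smooth(method = "lm", se = FALSE) +  # one straight-line fit per sex, without a confidence band
  ggtitle("Height and Weight by Sex") +
  xlab("Weight (kg)") +
  ylab("Height (cm)") +
  scale_colour_viridis_d(end = 0.8, alpha = 0.5) +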
  theme_minimal() +
  theme(legend.title = element_blank()) 
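
Adding `aes()` one call at a time, as above, is a teaching device. The sketch below, using a made-up `toy` data frame, suggests that it ends up with the same mappings as declaring them all at once inside `ggplot()`.

```r
library(ggplot2)

# Column names mirror demo_df, but the values are invented.
toy <- data.frame(
  weight = c(60, 72, 85, 90),
  height = c(160, 170, 178, 185)
)

# Mappings added one step at a time...
p_steps <- ggplot(data = toy) +
  aes(x = weight) +
  aes(y = height) +
  geom_point()

# ...and the same mappings declared in a single call.
p_once <- ggplot(data = toy, mapping = aes(x = weight, y = height)) +
  geom_point()

# Both plots carry an x and a y mapping.
names(p_steps$mapping)
names(p_once$mapping)
```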

Histograms

Histograms are useful for displaying counts of continuous data. Here we use the argument binwidth = 1, which tells the histogram to count the number of PaO2 samples in “bins” that are 1 kPa wide, i.e. 0-1, 1-2, 2-3 kPa and so on.

The representation is very sensitive to the bin width. Try changing it and see what happens.

ggplot(data = long_df) +
  aes(x = pao2_abg) +
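  geom_histogram(binwidth = 1, boundary = 0,  # boundary = 0 makes the bins run 0-1, 1-2, ... rather than being centred on whole numbers
                 fill = "grey", colour = "black") +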
  ggtitle("Arterial PaO2") +
  xlab("PaO2") +
  theme_minimal() 
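
To see how much the bin width matters, here is a small sketch on simulated data (the `toy` data frame is random numbers, not real PaO2 values). It uses `layer_data()` to count how many bars each choice of `binwidth` produces.

```r
library(ggplot2)

set.seed(1)
# 200 simulated measurements; the mean and spread are arbitrary.
toy <- data.frame(pao2 = rnorm(200, mean = 12, sd = 3))

p_narrow <- ggplot(toy) + aes(x = pao2) + geom_histogram(binwidth = 0.5)
p_wide   <- ggplot(toy) + aes(x = pao2) + geom_histogram(binwidth = 5)

# layer_data() returns the computed bins, one row per bar.
bins_narrow <- nrow(layer_data(p_narrow))
bins_wide   <- nrow(layer_data(p_wide))

bins_narrow
bins_wide
```

Narrow bins show more detail but also more noise, while wide bins can hide real features of the distribution.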

Density Plots

Density plots (or probability density functions to give their grand title) are analogous to histograms. Rather than binning the data, everything is smoothed. Just like histograms, they are sensitive to how the data are smoothed. This can be adjusted with the bw argument (which stands for “bandwidth”).

ggplot(data = long_df) +
  aes(x = pao2_abg) +
  geom_density(fill ="grey",
               colour = "black") +
  ggtitle("Arterial PaO2") +
  xlab("PaO2") +
  theme_minimal()
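
The effect of the bandwidth can be sketched on simulated data. The values of `bw` below are arbitrary choices meant to exaggerate the difference, and `toy` is random numbers rather than real measurements.

```r
library(ggplot2)

set.seed(1)
toy <- data.frame(pao2 = rnorm(200, mean = 12, sd = 3))

# A small bandwidth follows every bump in the data.
p_wiggly <- ggplot(toy) + aes(x = pao2) + geom_density(bw = 0.3)

# A large bandwidth smooths almost everything away.
p_smooth <- ggplot(toy) + aes(x = pao2) + geom_density(bw = 3)

# Heavier smoothing spreads the curve out, so its peak is lower.
peak_wiggly <- max(layer_data(p_wiggly)$y)
peak_smooth <- max(layer_data(p_smooth)$y)
```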

Box-and-Whisker Plots

Box and whisker plots are useful for displaying continuous data that has a natural discrete grouping, like the heights of men and women. Let’s look at something more interesting, however: the neutrophil counts of those who are or are not on chemotherapy.

First, we need to inspect the chemotherapy variable, and convert it from a numeric value (0, 1) to a categorical value (a factor).

class(long_demo$chemo)
class(long_demo$neutrophil)

long_demo <- long_demo %>% 
  mutate(chemo = as.factor(chemo))
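
The conversion is easier to follow on a tiny vector. In this sketch the values are made up, and the labels passed to `factor()` are just one illustrative choice, not something the lesson's data defines.

```r
# Hypothetical 0/1 flags, stored as numbers.
chemo_numeric <- c(0, 1, 1, 0)
class(chemo_numeric)

# as.factor() turns each distinct value into a level.
chemo_factor <- as.factor(chemo_numeric)
class(chemo_factor)
levels(chemo_factor)

# factor() can also attach readable labels, which then appear on the plot axis.
chemo_labelled <- factor(chemo_numeric,
                         levels = c(0, 1),
                         labels = c("No chemotherapy", "Chemotherapy"))
chemo_labelled
```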

Now we can make the resulting plot.

ggplot(data = long_demo) +
  aes(x = chemo) +
  aes(y = neutrophil) +
  aes(colour = chemo) +
  geom_boxplot() +
  xlab("Chemotherapy") +
  ylab("Neutrophil") +
  ggtitle("Chemotherapy and Neutrophil Count") +
  theme_light() +
  theme(legend.position = "none")

Faceting Plots

Sometimes you might wish to spread information across multiple panels, with one panel for each value of a grouping variable. This is the role of faceting. Here we see an example with one panel per value of the steroids variable.

ggplot(data = long_demo) +
  aes(y = fluid_balance) +
  aes(x = pao2_abg) +
  geom_point(alpha = 0.5) +
  aes(colour = as.factor(steroids)) +
  facet_wrap(~as.factor(steroids)) +
  xlab("PaO2") +
  ylab("Daily Fluid balance (ml)") +
  ggtitle("PaO2, Steroids and Fluid Balance") +
  ylim(-5025, 5025) +
  xlim(0, 70) +
  theme_light() +
  theme(legend.position = "none")
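
If you want to experiment with faceting without the lesson data, a minimal sketch using the `mpg` dataset that ships with ggplot2 looks like this. The choice of `drv` as the faceting variable is ours, purely for illustration.

```r
library(ggplot2)

# drv (drive type) takes three values in mpg, so we get three panels.
p_facet <- ggplot(data = mpg) +
  aes(x = displ, y = hwy) +
  geom_point(alpha = 0.5) +
  facet_wrap(~drv)

p_facet

# Each plotted point is tagged with the panel it was drawn in.
n_panels <- length(unique(layer_data(p_facet)$PANEL))
n_panels
```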

Choropleth

To show you how versatile ggplot2 is, we want to show you a plot that would be very difficult to make in Excel or most other software that clinicians commonly use. Here we make a map of the UK and highlight patient postcodes (fakes!) in just a few lines of code.

There are multiple sources of data we need to use, so each layer is given its own data and the layers are built up in parallel.

ggplot() +
  geom_polygon(
    data = UK,
    mapping = aes(x = long, y = lat, group = group),
    fill = "#004529", alpha = 0.75) +
  geom_point(
    data = postcodes,
    # group only matters for polygons and lines, so it is not mapped for the points
    mapping = aes(x = longitude, y = latitude),
    colour = "#f7fcb9", size = 0.5) +
  coord_map() +
  theme_minimal() +
  theme(axis.text = element_blank(),
        axis.title = element_blank())

Notice how this time we are calling aes and data within the geom itself. This is perfectly acceptable, and makes the code easier to read when using multiple data sources. Also note how some aesthetics are set outside aes. This is useful when you don’t want an aesthetic to vary with the data, but want to fix it at a single value instead. Here, for example, we set colour to the same value for everything within that geom.
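
The difference between mapping an aesthetic and setting it can be sketched with the `mpg` dataset that ships with ggplot2. The dataset and the choice of `drv` are ours, for illustration only.

```r
library(ggplot2)

# Mapping: colour depends on a column, so it goes inside aes() and gets a legend.
p_mapped <- ggplot(mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, colour = drv))

# Setting: colour is one fixed value, so it goes outside aes() and gets no legend.
p_fixed <- ggplot(mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), colour = "#004529")

# Count the distinct colours that actually end up on each plot.
n_colours_mapped <- length(unique(layer_data(p_mapped)$colour))
n_colours_fixed  <- length(unique(layer_data(p_fixed)$colour))
```

A common slip is to write `aes(colour = "blue")`, which treats "blue" as a data value rather than as a colour name.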

Key Points