Data Viz
Overview
Teaching: 10 min
Exercises: 2 min
Learning Objectives
By the end of this session, you should be able to:
- Use the ggplot2 package to create your first plot
- Consider the most appropriate plot to display your data
- Understand how to use layers in ggplot2
- Know where to go to find out which aesthetic or geom to use
Introduction
In this session, we will show you some examples of how to build a plot with the ggplot2 package. We will be looking at:
- scatter plots
- histograms and density plots
- box plots
- faceted charts
- maps
The ggplot2 package implements a “layered grammar of graphics”, where data visualisation is divided into layers. This is principally based on Leland Wilkinson’s book The Grammar of Graphics, and sits comfortably alongside the design principles of Edward Tufte, professor emeritus at Yale University, statistician and all-round godfather of data visualisation.
The layers that you will need to “build up” to form a plot include:
- the data you want to visualise (e.g. heights and weights).
- an aesthetic mapping to represent variables (e.g. map height to the y-axis, map weight to the x-axis).
- a geometric way to represent the information (e.g. display with a point).
- a coordinate system (e.g. use a standard Cartesian axis).
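As a preview, the four layers above can be sketched in a few lines. This is a minimal illustration only: the `people` data frame and its values are made up, and `coord_cartesian()` is written out explicitly even though it is the default.

```r
library(ggplot2)

# Made-up heights and weights, purely for illustration
people <- data.frame(
  weight = c(62, 70, 81, 55, 90),
  height = c(165, 172, 180, 158, 185)
)

p <- ggplot(data = people) +      # the data we want to visualise
  aes(x = weight, y = height) +   # aesthetic mapping: variables to axes
  geom_point() +                  # geometric representation: one point per row
  coord_cartesian()               # coordinate system (Cartesian is the default)

p
```

Each `+` adds one more piece to the plot, which is why the rest of this session writes plots one line per layer.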
First, let’s load the packages that we will be using into memory.
library(tidyverse) # contains ggplot2
## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.2
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 1.0.0 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(viridis)
## Loading required package: viridisLite
library(maps)
##
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
##
## map
library(RColorBrewer)
Viridis is a great package that provides colour maps that are perceptually uniform in both colour and black-and-white. They are also designed to be perceived by viewers with common forms of colour blindness.
The maps package allows you to plot maps of regions of the world.
Import data
# Adjust the paths to wherever your copies of the data live
long_df <- read_csv("./data/longitudinal-data.csv")
demo_df <- read_csv("./data/demographic-data.csv")
We will also pull in some other data to help with the visualisations, and perform a join between our two datasets so that everything is in one place.
long_demo <- left_join(long_df, demo_df, by = "episode_id")
postcodes <- read_csv(file = "../../static/ggpostcodes.csv")
UK <- map_data(map = "world", region = "UK")
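To see what the join does, here is a small sketch on two hypothetical miniature tables. The column names mirror those used in this session, but the values are invented. `left_join()` keeps every row of the first table and fills in `NA` where the second table has no match.

```r
library(dplyr)

# Hypothetical miniature versions of the two tables (values are made up)
long_small <- tibble(
  episode_id = c(1, 1, 2, 3),
  pao2_abg   = c(11.2, 10.8, 9.5, 12.1)
)
demo_small <- tibble(
  episode_id = c(1, 2),
  sex        = c("F", "M")
)

# Every row of long_small is kept; episode 3 has no demographics, so sex is NA
joined <- left_join(long_small, demo_small, by = "episode_id")
joined
```

Because the longitudinal table has several rows per episode, the demographic values are repeated on each of those rows.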
Data fields
Check our available fields.
names(long_df)
names(demo_df)
names(long_demo)
Scatter Plots
A scatter plot is a useful way to represent two continuous variables, and makes any correlation between them easy to see. Observe how the plot is built up layer by layer, starting with the data, and moving to the aesthetics and then geoms.
ggplot(data = demo_df) +
  aes(x = weight) +
  aes(y = height) +
  # jittered points reduce overplotting; use this instead of geom_point(),
  # not as well as it, otherwise every observation is drawn twice
  geom_jitter() +
  aes(colour = sex) +
  geom_smooth(method = "lm", se = FALSE) +
  ggtitle("Height and Weight by Sex") +
  xlab("Weight (kg)") +
  ylab("Height (cm)") +
  scale_colour_viridis_d(end = 0.8, alpha = 0.5) +
  theme_minimal() +
  theme(legend.title = element_blank())
Histograms
Histograms are useful for displaying counts of continuous data. Here we use the argument binwidth = 1, which tells the histogram to count the number of PaO2 samples in “bins” of width 1, i.e. 0-1, 1-2, 2-3 kPa and so on.
The representation is very sensitive to the bin width. Try changing it and see what happens.
ggplot(data = long_df) +
  aes(x = pao2_abg) +
  geom_histogram(binwidth = 1, fill = "grey", colour = "black") +
  ggtitle("Arterial PaO2") +
  xlab("PaO2") +
  theme_minimal()
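The sensitivity to bin width is easy to see on simulated data. The sketch below is illustrative only: `sim` is a made-up sample (500 draws from a normal distribution), not the session data, and the two bin widths are arbitrary choices.

```r
library(ggplot2)

set.seed(42)
# Simulated values standing in for a continuous measurement
sim <- data.frame(value = rnorm(500, mean = 12, sd = 3))

# Narrow bins: lots of detail, but also lots of noise
narrow <- ggplot(data = sim) +
  aes(x = value) +
  geom_histogram(binwidth = 0.25, fill = "grey", colour = "black") +
  ggtitle("binwidth = 0.25")

# Wide bins: smooth, but the shape of the distribution is lost
wide <- ggplot(data = sim) +
  aes(x = value) +
  geom_histogram(binwidth = 4, fill = "grey", colour = "black") +
  ggtitle("binwidth = 4")

narrow
wide
```

Both plots show exactly the same data; only the bin width differs.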
Density Plots
Density plots (or estimates of the probability density function, to give them their grand title) are analogous to histograms. Rather than binning the data, everything is smoothed. Just like histograms, they are sensitive to how the data is smoothed. This can be adjusted with the bw argument (which stands for “bandwidth”).
ggplot(data = long_df) +
  aes(x = pao2_abg) +
  geom_density(fill = "grey",
               colour = "black") +
  ggtitle("Arterial PaO2") +
  xlab("PaO2") +
  theme_minimal()
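The effect of bw can be sketched in the same way as the bin width. Again this uses a simulated sample (`sim`, made up for illustration) and two arbitrary bandwidths passed to `geom_density()`.

```r
library(ggplot2)

set.seed(42)
# Simulated values standing in for a continuous measurement
sim <- data.frame(value = rnorm(500, mean = 12, sd = 3))

# Small bandwidth: the curve follows every bump in the sample
rough <- ggplot(data = sim) +
  aes(x = value) +
  geom_density(bw = 0.2, fill = "grey", colour = "black") +
  ggtitle("bw = 0.2")

# Large bandwidth: a much smoother, flatter curve
smooth <- ggplot(data = sim) +
  aes(x = value) +
  geom_density(bw = 2, fill = "grey", colour = "black") +
  ggtitle("bw = 2")

rough
smooth
```

If you leave bw out, ggplot2 picks a bandwidth for you, which is usually a sensible starting point.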
Box-and-Whisker Plots
Box and whisker plots are useful for displaying continuous data that has a natural discrete grouping, like the heights of men and women. Let’s look at something more interesting, however. What about the neutrophil counts of those who are or are not on chemotherapy?
Before plotting neutrophil count grouped by chemotherapy, we need to check the data format. First, we inspect the chemotherapy variable, and then convert it from a numeric value (0, 1) to a categorical value (a factor).
class(long_demo$chemo)
class(long_demo$neutrophil)
long_demo <- long_demo %>%
  mutate(chemo = as.factor(chemo))
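If you would like readable labels rather than 0 and 1 on the axis, `factor()` lets you name the levels as you convert. A small sketch on a hypothetical vector follows; the assumption that 0 means “no chemotherapy” and 1 means “chemotherapy” is ours, so check it against your own data dictionary.

```r
# Hypothetical 0/1 indicator (values are made up)
chemo <- c(0, 1, 1, 0)
class(chemo)  # "numeric"

# Convert to a factor and attach labels
# (assumption: 0 = no chemotherapy, 1 = chemotherapy)
chemo_factor <- factor(
  chemo,
  levels = c(0, 1),
  labels = c("No chemotherapy", "Chemotherapy")
)

class(chemo_factor)  # "factor"
levels(chemo_factor)
```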
Now we can make the resulting plot.
ggplot(data = long_demo) +
  aes(x = chemo) +
  aes(y = neutrophil) +
  aes(colour = chemo) +
  geom_boxplot() +
  xlab("Chemotherapy") +
  ylab("Neutrophil") +
  ggtitle("Chemotherapy and Neutrophil Count") +
  theme_light() +
  theme(legend.position = "none")
Faceting Plots
Sometimes you might wish to encode information across multiple plots. This is the role of faceting. Here we see an example using steroids, with one panel for each value of the steroids variable.
ggplot(data = long_demo) +
  aes(y = fluid_balance) +
  aes(x = pao2_abg) +
  geom_point(alpha = 0.5) +
  aes(colour = as.factor(steroids)) +
  facet_wrap(~as.factor(steroids)) +
  xlab("PaO2") +
  ylab("Daily Fluid balance (ml)") +
  ggtitle("PaO2, Steroids and Fluid Balance") +
  ylim(-5025, 5025) +
  xlim(0, 70) +
  theme_light() +
  theme(legend.position = "none")
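By default the panel strips show only the value of the faceting variable (here 0 or 1), which can be hard to interpret. One option is the `label_both` labeller, which prints the variable name as well. The sketch below uses simulated data (`sim`, made up for illustration) rather than the session data.

```r
library(ggplot2)

set.seed(1)
# Simulated data: 30 observations with steroids = 0 and 30 with steroids = 1
sim <- data.frame(
  x = runif(60, min = 0, max = 70),
  y = rnorm(60, mean = 0, sd = 1500),
  steroids = rep(c(0, 1), each = 30)
)

p <- ggplot(data = sim) +
  aes(x = x, y = y, colour = as.factor(steroids)) +
  geom_point(alpha = 0.5) +
  # one panel per value of steroids; strips read "steroids: 0" and "steroids: 1"
  facet_wrap(~steroids, labeller = label_both) +
  theme_light() +
  theme(legend.position = "none")

p
```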
Maps
To show you how versatile ggplot2 is, we want to show you a plot that would be very difficult to make in Excel or most other software that clinicians commonly use. Here we can make a map of the UK, and highlight patient postcodes (fake ones!) in just a handful of lines of code!
We need more than one source of data here, so each layer is given its own data and the layers are built up in parallel.
ggplot() +
  # outline of the UK, one polygon per group
  geom_polygon(
    data = UK,
    mapping = aes(x = long, y = lat, group = group),
    fill = "#004529", alpha = 0.75) +
  # one point per postcode; points do not need a group aesthetic
  geom_point(
    data = postcodes,
    mapping = aes(x = longitude, y = latitude),
    colour = "#f7fcb9", size = 0.5) +
  coord_map() + # map projection; requires the mapproj package to be installed
  theme_minimal() +
  theme(axis.text = element_blank(),
        axis.title = element_blank())
Notice how this time we are calling aes and data within the geom itself. This is perfectly acceptable, and makes the code easier to read when using multiple data sources. Also note how some aesthetics are set outside aes. This is useful when you don’t want to map data to an aesthetic, but rather set it to a single value that isn’t related to the data. Here, for example, we set colour to the same value for everything within that geom.
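The difference between mapping an aesthetic inside aes and setting it outside can be sketched with a tiny made-up data frame (`pts` and the colour `"steelblue"` are illustrative choices, not part of the session data).

```r
library(ggplot2)

# Made-up points in two groups
pts <- data.frame(
  x   = 1:4,
  y   = c(2, 4, 3, 5),
  grp = c("a", "a", "b", "b")
)

# Mapped: colour is inside aes(), so it varies with the data and gets a legend
mapped <- ggplot(data = pts) +
  geom_point(mapping = aes(x = x, y = y, colour = grp))

# Set: colour is outside aes(), so every point gets the same fixed colour
fixed <- ggplot(data = pts) +
  geom_point(mapping = aes(x = x, y = y), colour = "steelblue")

mapped
fixed
```

A common slip is to write `aes(colour = "steelblue")`, which treats the text as data and maps it to a default colour instead of using the colour you asked for.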
Key Points