Data Wrangling
Overview
Teaching: 45 min
Exercises: 2 min
Questions
What is data wrangling?
How do we wrangle data in R?
Objectives
Use the tidyverse package to build your first data pipeline
Perform basic data manipulation tasks
Inspect and summarise data
Data wrangling with dplyr
Data wrangling is the process of reshaping data from its raw format into a format that is suitable for analysis. A typical goal is to make sure our data is in a “tidy” format. To quote one of R’s current chief architects (Hadley Wickham), tidy data is shaped such that:
Each column is a variable. Each row is an observation.
The dplyr
package provides a useful grammar for simplifying data manipulation. It is not strictly needed, but it makes manipulation less verbose and more intuitive, and is highly recommended. Let’s load the package along with the rest of the tidyverse, if not already done, and read in the synthetic cohort.
library(tidyverse)
## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.2
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 1.0.0 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
df <- read_csv("./data/synthetic_data_clean.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## creatinine = col_character(),
## arrival_dttm = col_datetime(format = ""),
## discharge_dttm = col_datetime(format = ""),
## dob = col_date(format = ""),
## vital_status = col_character(),
## sex = col_character(),
## id = col_character()
## )
## See spec(...) for full column specifications.
We can inspect the data again.
glimpse(df)
## Observations: 5,000
## Variables: 32
## $ ph_abg <dbl> 7.440, 7.300, 7.480, 7.330, 7.210, 7.500, 7.29…
## $ hco3_abg <dbl> 28.0, 20.2, 24.6, 22.3, 20.0, 25.3, 19.5, 23.1…
## $ temp_c <dbl> 36.9, 36.6, 37.2, 35.4, 35.9, 36.1, 35.6, 35.9…
## $ temp_nc <dbl> 37.1, 36.4, 36.2, 35.8, 37.1, 35.6, 38.3, 36.1…
## $ urea <dbl> 3.0, 5.8, 4.5, 13.7, 7.9, 8.9, 12.5, 9.4, 12.9…
## $ creatinine <chr> "80 micromol/L", "93 micromol/L", "75 micromol…
## $ na <dbl> 135, 140, 140, 136, 140, 138, 138, 138, 134, 1…
## $ k <dbl> 4.2, 4.4, 3.8, 5.0, 4.9, 4.2, 5.0, 5.8, 6.5, 4…
## $ hb <dbl> 90.0, 118.0, 113.0, 117.0, 171.0, 110.0, 132.0…
## $ wbc <dbl> 9.60, 11.50, 12.40, 13.20, 20.00, 16.90, 13.20…
## $ neutrophil <dbl> 7.35, 9.25, 11.50, 8.90, 14.59, 15.60, 10.40, …
## $ platelets <dbl> 295, 145, 211, 159, 128, 271, 360, 265, 127, 2…
## $ crp <dbl> 240.3, 85.0, 12.0, 9.0, 8.9, 8.9, 30.0, 8.0, 1…
## $ chemo <dbl> 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0…
## $ chronic_rrt <dbl> 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0…
## $ metastatic <dbl> 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0…
## $ radiotx <dbl> 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0…
## $ apache <dbl> 6, 8, 15, 14, 15, 15, 20, 20, 18, 56, 22, 18, …
## $ medical <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0…
## $ system <dbl> 3, 2, 3, 7, 2, 10, 1, 3, 1, 4, 1, 3, 2, 7, 7, …
## $ height <dbl> 1.65, 1.80, 1.70, 1.50, 1.90, 1.60, 1.60, 1.75…
## $ weight <dbl> 65, 50, 70, 95, 75, 95, 60, 75, 60, 85, 70, 80…
## $ elective_surgical <dbl> 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1…
## $ arrival_dttm <dttm> 2014-01-01 13:06:16, 2014-01-01 00:35:32, 201…
## $ discharge_dttm <dttm> 2014-01-02 01:06:16, 2014-01-05 00:35:32, 201…
## $ dob <date> 1964-01-01, 1984-01-01, 1964-01-01, 1944-01-0…
## $ vital_status <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "…
## $ sex <chr> "F", "M", "F", "M", "M", "F", "F", "M", "M", "…
## $ id <chr> "00001", "00002", "00003", "00004", "00005", "…
## $ lactate_1hr <dbl> 1.2, 1.8, 1.4, 1.5, 0.7, 2.6, 3.0, 1.2, 1.6, 1…
## $ lactate_6hr <dbl> 0.3, 0.5, 1.7, 1.4, 0.8, 1.5, 0.8, 0.7, 1.8, 1…
## $ lactate_12hr <dbl> 0.1, 0.3, 1.2, 1.4, 0.6, 1.5, 0.8, 0.4, 1.8, 1…
Using our synthetic cohort, let’s filter the demographic data by row.
filter(df, height >= 1.8)
## # A tibble: 1,019 x 32
## ph_abg hco3_abg temp_c temp_nc urea creatinine na k hb wbc
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 7.3 20.2 36.6 36.4 5.8 93 microm… 140 4.4 118 11.5
## 2 7.21 20 35.9 37.1 7.9 186 micro… 140 4.9 171 20
## 3 7.46 25.2 37.1 36.8 19 163 micro… 135 2.4 91 14.3
## 4 7.44 22.7 36.5 36.7 1.7 56 microm… 134 4.3 102 11.8
## 5 7.49 38 37 36.1 10.1 66 microm… 142 3.7 75 10.8
## 6 7.45 30.3 36.4 37.3 7.8 89 microm… 152 4.9 99 16
## 7 7.27 26.2 36.2 38 9.2 268 micro… 136 4.5 154 7.5
## 8 7.34 20.2 36.3 35.7 9.3 49 microm… 110 4.6 158 8.1
## 9 7.3 21.5 36.7 37.4 10.1 158 micro… 134 4.5 171 21.8
## 10 7.38 23.2 35.2 33.6 9.7 136 micro… 131 5.5 128 6
## # … with 1,009 more rows, and 22 more variables: neutrophil <dbl>,
## # platelets <dbl>, crp <dbl>, chemo <dbl>, chronic_rrt <dbl>,
## # metastatic <dbl>, radiotx <dbl>, apache <dbl>, medical <dbl>,
## # system <dbl>, height <dbl>, weight <dbl>, elective_surgical <dbl>,
## # arrival_dttm <dttm>, discharge_dttm <dttm>, dob <date>,
## # vital_status <chr>, sex <chr>, id <chr>, lactate_1hr <dbl>,
## # lactate_6hr <dbl>, lactate_12hr <dbl>
This filters rows from the df
data frame where height
is greater than or equal to 1.8
m.
TIP: Comparisons in R: Most of these are obvious: > (greater than), >= (greater than or equal to), and similarly for < and <=. The != operator means ‘not equal to’. But when we want to check if something is equal to something else we must use ==. Why? Because although R prefers you to use <- when you name things, most programming languages use =, and even R expects you to use = when you pass values to functions. So, for a function such as mean(x) we are normally lazy and write mean(hrate). We should write mean(x = hrate), because inside the function all the work is done with the variable x. When we write mean(x = hrate) we are explicitly telling R that we want it to use hrate in place of x. This is a very long-winded way of saying that when you want to test whether one thing is equal to another you need a different way of writing it, hence ==.
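As a quick sketch of the difference, using the df, sex and height objects from our cohort:
# '==' asks a question; '=' names an argument (or assigns a value)
filter(df, sex == "F")             # keep the rows where sex is equal to "F"
mean(x = df$height, na.rm = TRUE)  # '=' passes df$height in as the argument x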
Filter always acts on rows, and will throw away any data that doesn’t meet your request.
include_graphics("./img/filter.png")
A similar function called select
acts over columns. Just want the sex
column?
select(df, sex)
## # A tibble: 5,000 x 1
## sex
## <chr>
## 1 F
## 2 M
## 3 F
## 4 M
## 5 M
## 6 F
## 7 F
## 8 M
## 9 M
## 10 M
## # … with 4,990 more rows
include_graphics("./img/select.png")
So filter
chooses rows, and select
chooses columns.
Now here comes the proper magic. What if you want to both filter and select?
df %>%
filter(height >= 1.8) %>%
select(sex)
## # A tibble: 1,019 x 1
## sex
## <chr>
## 1 M
## 2 M
## 3 M
## 4 F
## 5 M
## 6 F
## 7 M
## 8 M
## 9 M
## 10 M
## # … with 1,009 more rows
The %>%
operator is called a pipe, and it (surprise, surprise) pipes data from one command to the next. So in plain English, the above line takes the data frame df
and pipes it into the filter function, removing all patients under 1.8
m. It then pipes the filtered data into the select
function where only the sex
column is retained.
The benefit of writing code like this is that you start with your data at the top, and each subsequent line is a single function that performs an action. The data is processed one step at a time, and at the bottom you get your tidy data. This is the very start of a data pipeline.
TIP: The order matters in a pipe! If we were to use the select
function first, then the filter
function would not have a height
column to filter on (see the sketch below).
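A minimal sketch of the reversed order (not run here, since it would simply fail):
df %>%
  select(sex) %>%
  filter(height >= 1.8)  # errors: the height column has already been dropped by select()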
Next, you can continue to expand your data pipeline by piping your data into another function. We might want to summarise the data. By appending the pipeline with the group_by
and tally
functions we can tabulate the results.
df %>%
filter(height >= 1.8) %>%
select(sex) %>%
group_by(sex) %>%
tally()
## # A tibble: 2 x 2
## sex n
## <chr> <int>
## 1 F 58
## 2 M 961
There are a small number of ‘verbs’ (functions) in the dplyr package that, when combined, allow you to complete a large number of useful data manipulations. In addition to select and filter, you will want to learn:
- arrange - orders data using a named column
- group_by - sets a column to be a grouping variable
- summarise - summarises data over a grouping variable
- mutate - adds a new column to the data
Armed with only these six functions, you can accomplish a huge amount that would be tiresome and frustrating in Excel.
Arrange
df %>%
select(height, weight, sex) %>%
arrange(height)
## # A tibble: 5,000 x 3
## height weight sex
## <dbl> <dbl> <chr>
## 1 1.45 60 M
## 2 1.45 50 F
## 3 1.45 85 F
## 4 1.45 60 M
## 5 1.45 50 F
## 6 1.45 50 F
## 7 1.45 50 M
## 8 1.45 60 F
## 9 1.45 50 F
## 10 1.45 85 F
## # … with 4,990 more rows
include_graphics("./img/arrange.png")
Here we see an example where we select the height, weight and sex columns, and then arrange the data by the height column.
Group By and Summarise
We now see an example of group_by
and summarise
being used together. The result is to apply the function mean
to each group, and summarise the results. The na.rm = TRUE
has to be added if missing values are present in the data. We don’t have any here, but it’s good to remember for your own data.
df %>%
group_by(sex) %>%
summarise(average_height = mean(height, na.rm = TRUE))
## # A tibble: 2 x 2
## sex average_height
## <chr> <dbl>
## 1 F 1.62
## 2 M 1.74
The group_by function on its own doesn’t really achieve much. It creates a grouping so that a summary function can act over that grouping.
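As a quick sketch, calling group_by on its own returns the same data, just flagged as grouped:
df %>%
  group_by(sex)  # still 5,000 rows; the print header simply notes the grouping by sex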
Mutate
We might be interested in computing the BMI for a patient. We can do this easily by adding a new column with the mutate
verb.
df %>%
mutate(bmi = weight / (height)^2)
## # A tibble: 5,000 x 33
## ph_abg hco3_abg temp_c temp_nc urea creatinine na k hb wbc
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 7.44 28 36.9 37.1 3 80 microm… 135 4.2 90 9.6
## 2 7.3 20.2 36.6 36.4 5.8 93 microm… 140 4.4 118 11.5
## 3 7.48 24.6 37.2 36.2 4.5 75 microm… 140 3.8 113 12.4
## 4 7.33 22.3 35.4 35.8 13.7 173 micro… 136 5 117 13.2
## 5 7.21 20 35.9 37.1 7.9 186 micro… 140 4.9 171 20
## 6 7.5 25.3 36.1 35.6 8.9 81 microm… 138 4.2 110 16.9
## 7 7.29 19.5 35.6 38.3 12.5 55 microm… 138 5 132 13.2
## 8 7.4 23.1 35.9 36.1 9.4 251 micro… 138 5.8 99 15.7
## 9 7.36 24 35.9 36.6 12.9 390 micro… 134 6.5 9.7 8.85
## 10 7.43 22 36.6 37.3 5.1 56 microm… 139 4.6 141 13.5
## # … with 4,990 more rows, and 23 more variables: neutrophil <dbl>,
## # platelets <dbl>, crp <dbl>, chemo <dbl>, chronic_rrt <dbl>,
## # metastatic <dbl>, radiotx <dbl>, apache <dbl>, medical <dbl>,
## # system <dbl>, height <dbl>, weight <dbl>, elective_surgical <dbl>,
## # arrival_dttm <dttm>, discharge_dttm <dttm>, dob <date>,
## # vital_status <chr>, sex <chr>, id <chr>, lactate_1hr <dbl>,
## # lactate_6hr <dbl>, lactate_12hr <dbl>, bmi <dbl>
We now create a new column called bmi
.
include_graphics("./img/mutate.png")
Let’s combine a few of these functions to return only the data we want to see.
df %>%
mutate(bmi = weight / (height)^2) %>%
select(weight, height, bmi) %>%
arrange(desc(bmi))
## # A tibble: 5,000 x 3
## weight height bmi
## <dbl> <dbl> <dbl>
## 1 150 1.55 62.4
## 2 150 1.6 58.6
## 3 150 1.6 58.6
## 4 150 1.6 58.6
## 5 135 1.55 56.2
## 6 150 1.65 55.1
## 7 150 1.65 55.1
## 8 150 1.65 55.1
## 9 150 1.65 55.1
## 10 150 1.65 55.1
## # … with 4,990 more rows
There are many other functions in dplyr that can help with easy data manipulation. Have a look at the cheatsheet for more details.
Data Cleaning
We will now go through some common data cleaning tasks. It is better to do these in R rather than adjust the underlying data in Excel, where possible. Some tasks are going to have to be done in Excel, but the more you can do with written instructions (and hence have written documentation) the better.
names(df)
## [1] "ph_abg" "hco3_abg" "temp_c"
## [4] "temp_nc" "urea" "creatinine"
## [7] "na" "k" "hb"
## [10] "wbc" "neutrophil" "platelets"
## [13] "crp" "chemo" "chronic_rrt"
## [16] "metastatic" "radiotx" "apache"
## [19] "medical" "system" "height"
## [22] "weight" "elective_surgical" "arrival_dttm"
## [25] "discharge_dttm" "dob" "vital_status"
## [28] "sex" "id" "lactate_1hr"
## [31] "lactate_6hr" "lactate_12hr"
Some of these variable names are too vague for my liking.
- chemo: active or historical?
- na: too similar to the R concept of “missing” i.e. NA
- system: a little ambiguous
We’re using the rename
function from the dplyr package, and then ‘overwriting’ our existing data with the renamed data. We use the format of rename(new_name = old_name)
. Note that we are using single =
and not double ==
because we are assigning something, not checking for equality.
# the general pattern: new_name and old_name are placeholders, not real columns
df <- df %>%
rename(new_name = old_name)
df <- df %>%
rename(chemo_6_months = chemo,
sodium = na,
organ_system = system)
Exercise
Use the functions you have just learnt to:
- Find the mean apache score for survivors and non-survivors
- Create a new column with the highest temp from temp_c and temp_nc
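One possible solution sketch (vital_status, apache, temp_c and temp_nc are the column names seen in the glimpse above; pmax() is base R’s element-wise maximum):
df %>%
  group_by(vital_status) %>%
  summarise(mean_apache = mean(apache, na.rm = TRUE))

df <- df %>%
  mutate(temp_max = pmax(temp_c, temp_nc, na.rm = TRUE))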
Wrangling strings
Extract numbers
The parse_number()
function is useful when numbers have been embedded inside a character vector. We can see this in our data in the creatinine
column, where the units of measure (micromol/L) have crept into the column.
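A minimal sketch of stripping the units out, assuming we are happy to keep creatinine as a bare number in micromol/L:
df <- df %>%
  mutate(creatinine = parse_number(creatinine))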
Parsing Dates
Let’s create two new columns that show us the length of stay for a patient, and their age on admission.
Dates can be extremely fiddly, but fortunately, tidyverse comes to our rescue. We will need only the arrival_dttm
and discharge_dttm
columns to do what we want.
The interval() and days() helpers come from the lubridate package, which is not attached by library(tidyverse) here, so we load it explicitly first.
library(lubridate)

df <- df %>%
  mutate(los = interval(arrival_dttm, discharge_dttm) / days(1))

df %>%
  select(los)
Working with dates can be extremely difficult. Let’s practice a few simple and common tasks.
Exercise
Use the functions you have just learnt to:
- Find the mean los for survivors and non-survivors
- Calculate the age on arrival for these patients
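One possible solution sketch, assuming lubridate is loaded as above and using the vital_status, dob and arrival_dttm columns:
df %>%
  group_by(vital_status) %>%
  summarise(mean_los = mean(los, na.rm = TRUE))

df <- df %>%
  mutate(age_on_arrival = floor(interval(dob, arrival_dttm) / years(1)))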
Pivoting Data
Right now we have data arranged in 1 row per admission (1 row per patient since there are no repeat admissions). This is normally called the “wide” format. We might want this in a slightly different format (the “long” or “tall” format) to answer a different type of question. This process is known as pivoting.
In our data, the lactate was recorded at hours 1, 6 and 12. Really, this data needs to be stacked for easy analysis. We can achieve this with the pivot_longer()
function.
df_long <- df %>%
pivot_longer(
cols = c("lactate_1hr", "lactate_6hr", "lactate_12hr"),
names_to = "lactate_time",
values_to = "lactate_value") %>%
select(id, lactate_time, lactate_value)
To showcase why this was useful, we can draw a spaghetti plot and view how the lactate changes over time. Don’t worry about the plotting syntax for now; this will be covered in another talk.
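A minimal sketch of such a plot with ggplot2 (the time points are turned into a factor with explicit levels so they plot in the right order; this is an illustration, not necessarily the exact plot from the talk):
df_long %>%
  mutate(lactate_time = factor(lactate_time,
    levels = c("lactate_1hr", "lactate_6hr", "lactate_12hr"))) %>%
  ggplot(aes(x = lactate_time, y = lactate_value, group = id)) +
  geom_line(alpha = 0.2) +
  labs(x = "time point", y = "lactate")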
Key Points
Data wrangling is the process of transforming data from a raw format into a more usable one
This is done most easily using packages from the tidyverse