Data Wrangling
Overview
Teaching: 45 min
Exercises: 2 min
Questions
What is data wrangling?
How do we wrangle data in R?
Objectives
Use the tidyverse package to build your first data pipeline
Perform basic data manipulation tasks
Inspect and summarise data
Data wrangling with dplyr
Data wrangling is the process of reshaping data from its raw format into a format that is suitable for analysis. A typical goal is to make sure our data is in a “tidy” format. To quote one of R’s current chief architects (Hadley Wickham), tidy data is shaped such that:
Each column is a variable. Each row is an observation.
The dplyr
package provides a useful grammar for simplifying data manipulation. It is not strictly needed, but it makes manipulation less verbose and more intuitive, and is highly recommended. Let’s load the package along with the rest of the tidyverse, if not already done, and read in the synthetic cohort.
library(tidyverse)
## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.2
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 1.0.0 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
df <- read_csv("./data/synthetic_data_clean.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## creatinine = col_character(),
## arrival_dttm = col_datetime(format = ""),
## discharge_dttm = col_datetime(format = ""),
## dob = col_date(format = ""),
## vital_status = col_character(),
## sex = col_character(),
## id = col_character()
## )
## See spec(...) for full column specifications.
We can inspect the data again.
glimpse(df)
## Observations: 5,000
## Variables: 32
## $ ph_abg <dbl> 7.440, 7.300, 7.480, 7.330, 7.210, 7.500, 7.29…
## $ hco3_abg <dbl> 28.0, 20.2, 24.6, 22.3, 20.0, 25.3, 19.5, 23.1…
## $ temp_c <dbl> 36.9, 36.6, 37.2, 35.4, 35.9, 36.1, 35.6, 35.9…
## $ temp_nc <dbl> 37.1, 36.4, 36.2, 35.8, 37.1, 35.6, 38.3, 36.1…
## $ urea <dbl> 3.0, 5.8, 4.5, 13.7, 7.9, 8.9, 12.5, 9.4, 12.9…
## $ creatinine <chr> "80 micromol/L", "93 micromol/L", "75 micromol…
## $ na <dbl> 135, 140, 140, 136, 140, 138, 138, 138, 134, 1…
## $ k <dbl> 4.2, 4.4, 3.8, 5.0, 4.9, 4.2, 5.0, 5.8, 6.5, 4…
## $ hb <dbl> 90.0, 118.0, 113.0, 117.0, 171.0, 110.0, 132.0…
## $ wbc <dbl> 9.60, 11.50, 12.40, 13.20, 20.00, 16.90, 13.20…
## $ neutrophil <dbl> 7.35, 9.25, 11.50, 8.90, 14.59, 15.60, 10.40, …
## $ platelets <dbl> 295, 145, 211, 159, 128, 271, 360, 265, 127, 2…
## $ crp <dbl> 240.3, 85.0, 12.0, 9.0, 8.9, 8.9, 30.0, 8.0, 1…
## $ chemo <dbl> 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0…
## $ chronic_rrt <dbl> 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0…
## $ metastatic <dbl> 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0…
## $ radiotx <dbl> 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0…
## $ apache <dbl> 6, 8, 15, 14, 15, 15, 20, 20, 18, 56, 22, 18, …
## $ medical <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0…
## $ system <dbl> 3, 2, 3, 7, 2, 10, 1, 3, 1, 4, 1, 3, 2, 7, 7, …
## $ height <dbl> 1.65, 1.80, 1.70, 1.50, 1.90, 1.60, 1.60, 1.75…
## $ weight <dbl> 65, 50, 70, 95, 75, 95, 60, 75, 60, 85, 70, 80…
## $ elective_surgical <dbl> 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1…
## $ arrival_dttm <dttm> 2014-01-01 13:06:16, 2014-01-01 00:35:32, 201…
## $ discharge_dttm <dttm> 2014-01-02 01:06:16, 2014-01-05 00:35:32, 201…
## $ dob <date> 1964-01-01, 1984-01-01, 1964-01-01, 1944-01-0…
## $ vital_status <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "…
## $ sex <chr> "F", "M", "F", "M", "M", "F", "F", "M", "M", "…
## $ id <chr> "00001", "00002", "00003", "00004", "00005", "…
## $ lactate_1hr <dbl> 1.2, 1.8, 1.4, 1.5, 0.7, 2.6, 3.0, 1.2, 1.6, 1…
## $ lactate_6hr <dbl> 0.3, 0.5, 1.7, 1.4, 0.8, 1.5, 0.8, 0.7, 1.8, 1…
## $ lactate_12hr <dbl> 0.1, 0.3, 1.2, 1.4, 0.6, 1.5, 0.8, 0.4, 1.8, 1…
Using our synthetic cohort, let’s filter the demographic data by row.
filter(df, height >= 1.8)
## # A tibble: 1,019 x 32
## ph_abg hco3_abg temp_c temp_nc urea creatinine na k hb wbc
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 7.3 20.2 36.6 36.4 5.8 93 microm… 140 4.4 118 11.5
## 2 7.21 20 35.9 37.1 7.9 186 micro… 140 4.9 171 20
## 3 7.46 25.2 37.1 36.8 19 163 micro… 135 2.4 91 14.3
## 4 7.44 22.7 36.5 36.7 1.7 56 microm… 134 4.3 102 11.8
## 5 7.49 38 37 36.1 10.1 66 microm… 142 3.7 75 10.8
## 6 7.45 30.3 36.4 37.3 7.8 89 microm… 152 4.9 99 16
## 7 7.27 26.2 36.2 38 9.2 268 micro… 136 4.5 154 7.5
## 8 7.34 20.2 36.3 35.7 9.3 49 microm… 110 4.6 158 8.1
## 9 7.3 21.5 36.7 37.4 10.1 158 micro… 134 4.5 171 21.8
## 10 7.38 23.2 35.2 33.6 9.7 136 micro… 131 5.5 128 6
## # … with 1,009 more rows, and 22 more variables: neutrophil <dbl>,
## # platelets <dbl>, crp <dbl>, chemo <dbl>, chronic_rrt <dbl>,
## # metastatic <dbl>, radiotx <dbl>, apache <dbl>, medical <dbl>,
## # system <dbl>, height <dbl>, weight <dbl>, elective_surgical <dbl>,
## # arrival_dttm <dttm>, discharge_dttm <dttm>, dob <date>,
## # vital_status <chr>, sex <chr>, id <chr>, lactate_1hr <dbl>,
## # lactate_6hr <dbl>, lactate_12hr <dbl>
This filters rows from the df
data frame where height
is greater than or equal to 1.8
m.
TIP: Comparisons in R: Most of these are obvious: > (greater than), >= (greater than or equal to), and similarly for < and <=. The != operator means ‘not equal to’. But when we want to check if something is equal to something else we must use ==. Why? Because although R prefers you to use <- when you name things, most programming languages use =, and even R expects you to use = when you pass values to functions. So, for a function such as mean(x) we are normally lazy and write mean(hrate). We should write mean(x = hrate), because inside the function all the work is done with the variable x. When we write mean(x = hrate) we are explicitly telling R that we want it to use hrate in place of x. This is a very long-winded way of saying that when you want to test whether one thing is equal to another you need a different way of writing it, hence ==.
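As a quick sketch of the difference, using the df, sex and height objects from our cohort:
# '==' asks a question; '=' names an argument (or assigns a value)
filter(df, sex == "F")             # keep the rows where sex is equal to "F"
mean(x = df$height, na.rm = TRUE)  # '=' passes df$height in as the argument x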
Filter always acts on rows, and will throw away any data that doesn’t meet your request.
include_graphics("./img/filter.png")
A similar function called select
acts over columns. Just want the sex
column?
select(df, sex)
## # A tibble: 5,000 x 1
## sex
## <chr>
## 1 F
## 2 M
## 3 F
## 4 M
## 5 M
## 6 F
## 7 F
## 8 M
## 9 M
## 10 M
## # … with 4,990 more rows
include_graphics("./img/select.png")
So filter
chooses rows, and select
chooses columns.
Now here comes the proper magic. What if you want to both filter and select?
df %>%
filter(height >= 1.8) %>%
select(sex)
## # A tibble: 1,019 x 1
## sex
## <chr>
## 1 M
## 2 M
## 3 M
## 4 F
## 5 M
## 6 F
## 7 M
## 8 M
## 9 M
## 10 M
## # … with 1,009 more rows
The %>%
operator is called a pipe, and it (surprise, surprise) pipes data from one command to the next. So in plain English, the above line takes the data frame df
and pipes it into the filter function, removing all patients under 1.8
m. It then pipes the filtered data into the select
function where only the sex
column is retained.
The benefit of writing code like this is that you start with your data at the top, and each subsequent line is a single function that performs an action. The data is processed one step at a time, and at the bottom you get your tidy data. This is the very start of a data pipeline.
TIP: The order matters in a pipe! If we were to use the select
function first, then the filter
function would not have a height
column to filter on (see the sketch below).
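A minimal sketch of the reversed order (not run here, since it would simply fail):
df %>%
  select(sex) %>%
  filter(height >= 1.8)  # errors: the height column has already been dropped by select()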
Next, you can continue to expand your data pipeline by piping your data into another function. We might want to summarise the data. By appending the pipeline with the group_by
and tally
functions we can tabulate the results.
df %>%
filter(height >= 1.8) %>%
select(sex) %>%
group_by(sex) %>%
tally()
## # A tibble: 2 x 2
## sex n
## <chr> <int>
## 1 F 58
## 2 M 961
There are a small number of ‘verbs’ (functions) in the dplyr package that, when combined, allow you to complete a large number of useful data manipulations. In addition to select and filter, you will want to learn:
- arrange - orders data using a named column
- group_by - sets a column to be a grouping variable
- summarise - summarises data over a grouping variable
- mutate - adds a new column to the data
Armed with only these six functions, you can accomplish a huge amount that would be tiresome and frustrating in Excel.
Arrange
df %>%
select(height, weight, sex) %>%
arrange(height)
## # A tibble: 5,000 x 3
## height weight sex
## <dbl> <dbl> <chr>
## 1 1.45 60 M
## 2 1.45 50 F
## 3 1.45 85 F
## 4 1.45 60 M
## 5 1.45 50 F
## 6 1.45 50 F
## 7 1.45 50 M
## 8 1.45 60 F
## 9 1.45 50 F
## 10 1.45 85 F
## # … with 4,990 more rows
include_graphics("./img/arrange.png")
Here we see an example where we select the height, weight and sex columns, and then arrange the data by the height column.
Group By and Summarise
We now see an example of group_by
and summarise
being used together. The result is to apply the function mean
to each group, and summarise the results. The na.rm = TRUE
has to be added if missing values are present in the data. We don’t have any here, but it’s good to remember for your own data.
df %>%
group_by(sex) %>%
summarise(average_height = mean(height, na.rm = TRUE))
## # A tibble: 2 x 2
## sex average_height
## <chr> <dbl>
## 1 F 1.62
## 2 M 1.74
The group_by function on its own doesn’t really achieve much. It creates a grouping so that a summary function can act over that grouping.
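As a quick sketch, calling group_by on its own returns the same data, just flagged as grouped:
df %>%
  group_by(sex)  # still 5,000 rows; the print header simply notes the grouping by sex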
Mutate
We might be interested in computing the BMI for a patient. We can do this easily by adding a new column with the mutate
verb.
df %>%
mutate(bmi = weight / (height)^2)
## # A tibble: 5,000 x 33
## ph_abg hco3_abg temp_c temp_nc urea creatinine na k hb wbc
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 7.44 28 36.9 37.1 3 80 microm… 135 4.2 90 9.6
## 2 7.3 20.2 36.6 36.4 5.8 93 microm… 140 4.4 118 11.5
## 3 7.48 24.6 37.2 36.2 4.5 75 microm… 140 3.8 113 12.4
## 4 7.33 22.3 35.4 35.8 13.7 173 micro… 136 5 117 13.2
## 5 7.21 20 35.9 37.1 7.9 186 micro… 140 4.9 171 20
## 6 7.5 25.3 36.1 35.6 8.9 81 microm… 138 4.2 110 16.9
## 7 7.29 19.5 35.6 38.3 12.5 55 microm… 138 5 132 13.2
## 8 7.4 23.1 35.9 36.1 9.4 251 micro… 138 5.8 99 15.7
## 9 7.36 24 35.9 36.6 12.9 390 micro… 134 6.5 9.7 8.85
## 10 7.43 22 36.6 37.3 5.1 56 microm… 139 4.6 141 13.5
## # … with 4,990 more rows, and 23 more variables: neutrophil <dbl>,
## # platelets <dbl>, crp <dbl>, chemo <dbl>, chronic_rrt <dbl>,
## # metastatic <dbl>, radiotx <dbl>, apache <dbl>, medical <dbl>,
## # system <dbl>, height <dbl>, weight <dbl>, elective_surgical <dbl>,
## # arrival_dttm <dttm>, discharge_dttm <dttm>, dob <date>,
## # vital_status <chr>, sex <chr>, id <chr>, lactate_1hr <dbl>,
## # lactate_6hr <dbl>, lactate_12hr <dbl>, bmi <dbl>
We now create a new column called bmi
.
include_graphics("./img/mutate.png")
Let’s combine a few of these functions to return only the data we want to see.
df %>%
mutate(bmi = weight / (height)^2) %>%
select(weight, height, bmi) %>%
arrange(desc(bmi))
## # A tibble: 5,000 x 3
## weight height bmi
## <dbl> <dbl> <dbl>
## 1 150 1.55 62.4
## 2 150 1.6 58.6
## 3 150 1.6 58.6
## 4 150 1.6 58.6
## 5 135 1.55 56.2
## 6 150 1.65 55.1
## 7 150 1.65 55.1
## 8 150 1.65 55.1
## 9 150 1.65 55.1
## 10 150 1.65 55.1
## # … with 4,990 more rows
There are many other functions in dplyr that can help with easy data manipulation. Have a look at the cheatsheet for more details.
Data Cleaning
We will now go through some common data cleaning tasks. It is better to do these in R rather than adjust the underlying data in Excel, where possible. Some tasks are going to have to be done in Excel, but the more you can do with written instructions (and hence have written documentation) the better.
names(df)
## [1] "ph_abg" "hco3_abg" "temp_c"
## [4] "temp_nc" "urea" "creatinine"
## [7] "na" "k" "hb"
## [10] "wbc" "neutrophil" "platelets"
## [13] "crp" "chemo" "chronic_rrt"
## [16] "metastatic" "radiotx" "apache"
## [19] "medical" "system" "height"
## [22] "weight" "elective_surgical" "arrival_dttm"
## [25] "discharge_dttm" "dob" "vital_status"
## [28] "sex" "id" "lactate_1hr"
## [31] "lactate_6hr" "lactate_12hr"
Some of these variable names are too vague for my liking.
- chemo: active or historical?
- na: too similar to the R concept of “missing” i.e. NA
- system: a little ambiguous
We’re using the rename
function from the dplyr package, and then ‘overwriting’ our existing data with the renamed data. We use the format of rename(new_name = old_name)
. Note that we are using single =
and not double ==
because we are assigning something, not checking for equality.
# the general pattern: new_name and old_name are placeholders, not real columns
df <- df %>%
rename(new_name = old_name)
df <- df %>%
rename(chemo_6_months = chemo,
sodium = na,
organ_system = system)
Exercise
Use the functions you have just learnt to:
- Find the mean apache score for survivors and non-survivors
- Create a new column with the highest temp from temp_c and temp_nc
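One possible solution sketch (vital_status, apache, temp_c and temp_nc are the column names seen in the glimpse above; pmax() is base R’s element-wise maximum):
df %>%
  group_by(vital_status) %>%
  summarise(mean_apache = mean(apache, na.rm = TRUE))

df <- df %>%
  mutate(temp_max = pmax(temp_c, temp_nc, na.rm = TRUE))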
Wrangling strings
Extract numbers
The parse_number()
function is useful when numbers have been embedded inside a character vector. We can see this in our data in the creatinine
column, where the units of measure (micromol/L) have crept into the column.
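A minimal sketch of stripping the units out, assuming we are happy to keep creatinine as a bare number in micromol/L:
df <- df %>%
  mutate(creatinine = parse_number(creatinine))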
Parsing Dates
Let’s create two new columns that show us the length of stay for a patient, and their age on admission.
Dates can be extremely fiddly, but fortunately, tidyverse comes to our rescue. We will need only the arrival_dttm
and discharge_dttm
columns to do what we want.
The interval() and days() helpers come from the lubridate package, which is not attached by library(tidyverse) here, so we load it explicitly first.
library(lubridate)

df <- df %>%
  mutate(los = interval(arrival_dttm, discharge_dttm) / days(1))

df %>%
  select(los)
Working with dates can be extremely difficult. Let’s practice a few simple and common tasks.
Exercise
Use the functions you have just learnt to:
- Find the mean los for survivors and non-survivors
- Calculate the age on arrival for these patients
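One possible solution sketch, assuming lubridate is loaded as above and using the vital_status, dob and arrival_dttm columns:
df %>%
  group_by(vital_status) %>%
  summarise(mean_los = mean(los, na.rm = TRUE))

df <- df %>%
  mutate(age_on_arrival = floor(interval(dob, arrival_dttm) / years(1)))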
Pivoting Data
Right now we have data arranged in 1 row per admission (1 row per patient since there are no repeat admissions). This is normally called the “wide” format. We might want this in a slightly different format (the “long” or “tall” format) to answer a different type of question. This process is known as pivoting.
In our data, the lactate was recorded at hours 1, 6 and 12. Really, this data needs to be stacked for easy analysis. We can achieve this with the pivot_longer()
function.
df_long <- df %>%
pivot_longer(
cols = c("lactate_1hr", "lactate_6hr", "lactate_12hr"),
names_to = "lactate_time",
values_to = "lactate_value") %>%
select(id, lactate_time, lactate_value)
To showcase why this was useful, we can draw a spaghetti plot and view how the lactate changes over time. Don’t worry about the plotting syntax for now; this will be covered in another talk.
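A minimal sketch of such a plot with ggplot2 (the time points are turned into a factor with explicit levels so they plot in the right order; this is an illustration, not necessarily the exact plot from the talk):
df_long %>%
  mutate(lactate_time = factor(lactate_time,
    levels = c("lactate_1hr", "lactate_6hr", "lactate_12hr"))) %>%
  ggplot(aes(x = lactate_time, y = lactate_value, group = id)) +
  geom_line(alpha = 0.2) +
  labs(x = "time point", y = "lactate")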
Key Points
Data wrangling is the process of transforming data from a raw format into a more usable one
This is done most easily using packages from the tidyverse