This lesson is still being designed and assembled (Pre-Alpha version)

# Data Wrangling

## Overview

Teaching: 45 min
Exercises: 2 min
Questions
• What is data wrangling?

• How do we wrangle data in R?

Objectives
• Use the tidyverse package to build your first data pipeline

• Perform basic data manipulation tasks

• Inspect and summarise data

## Data wrangling with dplyr

Data wrangling is the process of reshaping data from its raw format, inta a format that is suitable for analysis. A typical goal is to make sure our data is in a “tidy” format. To quote one of R’s current chief architects (Hadley Wickham), tidy data is shaped such that:

Each column is a variable. Each row is an observation.

The `dplyr` package provides useful grammar for simplifying data manipulation. It is not needed but makes manipulation verbose and intuitive and is highly recommended. Let’s load of the package along with the rest of the tidyverse if not already done. And we can read in the synthetic cohort.

``````library(tidyverse)
``````
``````## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
``````
``````## ✔ ggplot2 3.2.1     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
``````
``````## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
``````
``````df <- read_csv("./data/synthetic_data_clean.csv")
``````
``````## Parsed with column specification:
## cols(
##   .default = col_double(),
##   creatinine = col_character(),
##   arrival_dttm = col_datetime(format = ""),
##   discharge_dttm = col_datetime(format = ""),
##   dob = col_date(format = ""),
##   vital_status = col_character(),
##   sex = col_character(),
##   id = col_character()
## )
``````
``````## See spec(...) for full column specifications.
``````

We can inspect the data again.

``````glimpse(df)
``````
``````## Observations: 5,000
## Variables: 32
## \$ ph_abg            <dbl> 7.440, 7.300, 7.480, 7.330, 7.210, 7.500, 7.29…
## \$ hco3_abg          <dbl> 28.0, 20.2, 24.6, 22.3, 20.0, 25.3, 19.5, 23.1…
## \$ temp_c            <dbl> 36.9, 36.6, 37.2, 35.4, 35.9, 36.1, 35.6, 35.9…
## \$ temp_nc           <dbl> 37.1, 36.4, 36.2, 35.8, 37.1, 35.6, 38.3, 36.1…
## \$ urea              <dbl> 3.0, 5.8, 4.5, 13.7, 7.9, 8.9, 12.5, 9.4, 12.9…
## \$ creatinine        <chr> "80 micromol/L", "93 micromol/L", "75 micromol…
## \$ na                <dbl> 135, 140, 140, 136, 140, 138, 138, 138, 134, 1…
## \$ k                 <dbl> 4.2, 4.4, 3.8, 5.0, 4.9, 4.2, 5.0, 5.8, 6.5, 4…
## \$ hb                <dbl> 90.0, 118.0, 113.0, 117.0, 171.0, 110.0, 132.0…
## \$ wbc               <dbl> 9.60, 11.50, 12.40, 13.20, 20.00, 16.90, 13.20…
## \$ neutrophil        <dbl> 7.35, 9.25, 11.50, 8.90, 14.59, 15.60, 10.40, …
## \$ platelets         <dbl> 295, 145, 211, 159, 128, 271, 360, 265, 127, 2…
## \$ crp               <dbl> 240.3, 85.0, 12.0, 9.0, 8.9, 8.9, 30.0, 8.0, 1…
## \$ chemo             <dbl> 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0…
## \$ chronic_rrt       <dbl> 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0…
## \$ metastatic        <dbl> 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0…
## \$ radiotx           <dbl> 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0…
## \$ apache            <dbl> 6, 8, 15, 14, 15, 15, 20, 20, 18, 56, 22, 18, …
## \$ medical           <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0…
## \$ system            <dbl> 3, 2, 3, 7, 2, 10, 1, 3, 1, 4, 1, 3, 2, 7, 7, …
## \$ height            <dbl> 1.65, 1.80, 1.70, 1.50, 1.90, 1.60, 1.60, 1.75…
## \$ weight            <dbl> 65, 50, 70, 95, 75, 95, 60, 75, 60, 85, 70, 80…
## \$ elective_surgical <dbl> 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1…
## \$ arrival_dttm      <dttm> 2014-01-01 13:06:16, 2014-01-01 00:35:32, 201…
## \$ discharge_dttm    <dttm> 2014-01-02 01:06:16, 2014-01-05 00:35:32, 201…
## \$ dob               <date> 1964-01-01, 1984-01-01, 1964-01-01, 1944-01-0…
## \$ vital_status      <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "…
## \$ sex               <chr> "F", "M", "F", "M", "M", "F", "F", "M", "M", "…
## \$ id                <chr> "00001", "00002", "00003", "00004", "00005", "…
## \$ lactate_1hr       <dbl> 1.2, 1.8, 1.4, 1.5, 0.7, 2.6, 3.0, 1.2, 1.6, 1…
## \$ lactate_6hr       <dbl> 0.3, 0.5, 1.7, 1.4, 0.8, 1.5, 0.8, 0.7, 1.8, 1…
## \$ lactate_12hr      <dbl> 0.1, 0.3, 1.2, 1.4, 0.6, 1.5, 0.8, 0.4, 1.8, 1…
``````

Using our synthetic cohort, let’s filter the demographic data by row.

``````filter(df, height >= 1.8)
``````
``````## # A tibble: 1,019 x 32
##    ph_abg hco3_abg temp_c temp_nc  urea creatinine    na     k    hb   wbc
##     <dbl>    <dbl>  <dbl>   <dbl> <dbl> <chr>      <dbl> <dbl> <dbl> <dbl>
##  1   7.3      20.2   36.6    36.4   5.8 93 microm…   140   4.4   118  11.5
##  2   7.21     20     35.9    37.1   7.9 186 micro…   140   4.9   171  20
##  3   7.46     25.2   37.1    36.8  19   163 micro…   135   2.4    91  14.3
##  4   7.44     22.7   36.5    36.7   1.7 56 microm…   134   4.3   102  11.8
##  5   7.49     38     37      36.1  10.1 66 microm…   142   3.7    75  10.8
##  6   7.45     30.3   36.4    37.3   7.8 89 microm…   152   4.9    99  16
##  7   7.27     26.2   36.2    38     9.2 268 micro…   136   4.5   154   7.5
##  8   7.34     20.2   36.3    35.7   9.3 49 microm…   110   4.6   158   8.1
##  9   7.3      21.5   36.7    37.4  10.1 158 micro…   134   4.5   171  21.8
## 10   7.38     23.2   35.2    33.6   9.7 136 micro…   131   5.5   128   6
## # … with 1,009 more rows, and 22 more variables: neutrophil <dbl>,
## #   platelets <dbl>, crp <dbl>, chemo <dbl>, chronic_rrt <dbl>,
## #   metastatic <dbl>, radiotx <dbl>, apache <dbl>, medical <dbl>,
## #   system <dbl>, height <dbl>, weight <dbl>, elective_surgical <dbl>,
## #   arrival_dttm <dttm>, discharge_dttm <dttm>, dob <date>,
## #   vital_status <chr>, sex <chr>, id <chr>, lactate_1hr <dbl>,
## #   lactate_6hr <dbl>, lactate_12hr <dbl>
``````

This filters rows from the `df` data frame where `height` is greater than or equal to `1.8` m.

TIP: Comparisons in R: Most of these are obvious `>` (greater than), `>=` (greater than or equal to), and similarly for `<` and `<=`. The `!=` operator means ‘not equal to’. But when we want to check if something is equal to something else we must use `==`. Why? Because although R prefers you to use `<-` when you name things, most programming languages use `=`, and even R expects you to use `=` when you pass values to functions. So, for a function such as `mean(x)` we are normally lazy when we write `mean(hrate)`. We should write `mean(x = hrate)`, because inside the function all the work is done with the variable `x`. When we write `mean(x = hrate)` we explicitly telling R that we want it to use `hrate` in place of `x`. This is a very long winded way of saying that when you want to test if one thing is equal to another then you need a different way of writing this, hence `==`.

Filter always acts on rows, and will throw away any data that doesn’t meet your request.

``````include_graphics("./img/filter.png")
``````

A similar function called `select` acts over columns. Just want the `sex` colum?

``````select(df, sex)
``````
``````## # A tibble: 5,000 x 1
##    sex
##    <chr>
##  1 F
##  2 M
##  3 F
##  4 M
##  5 M
##  6 F
##  7 F
##  8 M
##  9 M
## 10 M
## # … with 4,990 more rows
``````
``````include_graphics("./img/select.png")
``````

So `filter` chooses rows, and `select` chooses columns.

Now here comes the proper magic. What if you want to both filter and select?

``````df %>%
filter(height >= 1.8) %>%
select(sex)
``````
``````## # A tibble: 1,019 x 1
##    sex
##    <chr>
##  1 M
##  2 M
##  3 M
##  4 F
##  5 M
##  6 F
##  7 M
##  8 M
##  9 M
## 10 M
## # … with 1,009 more rows
``````

The `%>%` operator is called a pipe, and it (surprise, surprise) pipes data from one command to the next. So in plain English, the above line takes the data frame `df` and pipes it into the filter function, removing all patients under `1.8` m. It then pipes the filtered data into the `select` function where only the `sex` column is retained.

The benefit of writing code like this, is that you start with your data at the top, and then each line is a single function that performs an action. The data is processed one step at a time at the bottom you get your tidy data. This is the very start of a data pipeline.

TIP: The order matters in a pipe! If we were to use the `select` function first, then the `filter` function would not have a `height` column to filter on.).

Next, you can continue to expand your data pipeline by piping your data into another function. We might want to summarise the data. By appending the pipeline with the `group_by` and `tally` functions we can tabulate the results.

``````df %>%
filter(height >= 1.8) %>%
select(sex) %>%
group_by(sex) %>%
tally()
``````
``````## # A tibble: 2 x 2
##   sex       n
##   <chr> <int>
## 1 F        58
## 2 M       961
``````

There are a small number of ‘verbs’ (functions) in the dplyr package that when combined allow you to complete a large number of useful data maniputations. In addition to `select`, and `filter`, you will want you to learn:

1. `arrange` - orders data using a named column
2. `group_by`- sets a column to be a grouping variable
3. `summarise`- summarises data over a grouping variable
4. `mutate` - add a new column to the data

Armed with only these 6 functions, you can accomplish a huge amount that would be tiresome and frustrating in excel.

### Arrange

``````df %>%
select(height, weight, sex) %>%
arrange(height)
``````
``````## # A tibble: 5,000 x 3
##    height weight sex
##     <dbl>  <dbl> <chr>
##  1   1.45     60 M
##  2   1.45     50 F
##  3   1.45     85 F
##  4   1.45     60 M
##  5   1.45     50 F
##  6   1.45     50 F
##  7   1.45     50 M
##  8   1.45     60 F
##  9   1.45     50 F
## 10   1.45     85 F
## # … with 4,990 more rows
``````
``````include_graphics("./img/arrange.png")
``````

Here we see an example where we select the height, weight and sex columns, and then arrange the data by the height column.

### Group By and Summarise

We now see an example of `group_by` and `summarise` being used together. The result is to apply the function `mean` to each group, and summarise the results. The `na.rm = TRUE` has to be added if there are missing values are present in the data. We don’t have any here, but it’s good to remember for your own data.

``````df %>%
group_by(sex) %>%
summarise(average_height = mean(height, na.rm = TRUE))
``````
``````## # A tibble: 2 x 2
##   sex   average_height
##   <chr>          <dbl>
## 1 F               1.62
## 2 M               1.74
``````

the group by function on it’s own doesn’t really achieve much. It creates a grouping so that a summary function can act over that grouping.

### Mutate

We might be interested in computing the BMI for a patient. We can do this easily by adding a new column with the `mutate` verb.

``````df %>%
mutate(bmi = weight / (height/100)^2)
``````
``````## # A tibble: 5,000 x 33
##    ph_abg hco3_abg temp_c temp_nc  urea creatinine    na     k    hb   wbc
##     <dbl>    <dbl>  <dbl>   <dbl> <dbl> <chr>      <dbl> <dbl> <dbl> <dbl>
##  1   7.44     28     36.9    37.1   3   80 microm…   135   4.2  90    9.6
##  2   7.3      20.2   36.6    36.4   5.8 93 microm…   140   4.4 118   11.5
##  3   7.48     24.6   37.2    36.2   4.5 75 microm…   140   3.8 113   12.4
##  4   7.33     22.3   35.4    35.8  13.7 173 micro…   136   5   117   13.2
##  5   7.21     20     35.9    37.1   7.9 186 micro…   140   4.9 171   20
##  6   7.5      25.3   36.1    35.6   8.9 81 microm…   138   4.2 110   16.9
##  7   7.29     19.5   35.6    38.3  12.5 55 microm…   138   5   132   13.2
##  8   7.4      23.1   35.9    36.1   9.4 251 micro…   138   5.8  99   15.7
##  9   7.36     24     35.9    36.6  12.9 390 micro…   134   6.5   9.7  8.85
## 10   7.43     22     36.6    37.3   5.1 56 microm…   139   4.6 141   13.5
## # … with 4,990 more rows, and 23 more variables: neutrophil <dbl>,
## #   platelets <dbl>, crp <dbl>, chemo <dbl>, chronic_rrt <dbl>,
## #   metastatic <dbl>, radiotx <dbl>, apache <dbl>, medical <dbl>,
## #   system <dbl>, height <dbl>, weight <dbl>, elective_surgical <dbl>,
## #   arrival_dttm <dttm>, discharge_dttm <dttm>, dob <date>,
## #   vital_status <chr>, sex <chr>, id <chr>, lactate_1hr <dbl>,
## #   lactate_6hr <dbl>, lactate_12hr <dbl>, bmi <dbl>
``````

We now create a new column called `bmi`.

``````include_graphics("./img/mutate.png")
``````

Let’s combine a few of these functions to return only the data we want to see.

``````df %>%
mutate(bmi = weight / (height)^2) %>%
select(weight, height, bmi) %>%
arrange(desc(bmi))
``````
``````## # A tibble: 5,000 x 3
##    weight height   bmi
##     <dbl>  <dbl> <dbl>
##  1    150   1.55  62.4
##  2    150   1.6   58.6
##  3    150   1.6   58.6
##  4    150   1.6   58.6
##  5    135   1.55  56.2
##  6    150   1.65  55.1
##  7    150   1.65  55.1
##  8    150   1.65  55.1
##  9    150   1.65  55.1
## 10    150   1.65  55.1
## # … with 4,990 more rows
``````

There are many other functions in dplyr that can help with easy data maniputation. Have a look at the cheatsheet for more details.

## Data Cleaning

We will now go through some common data cleaning tasks. It is better to try and do these in R, rather than adjust the underlying data in excel if possible. Some tasks are going to have to be done in excel, but the more you can do this with written instructions (and hence have written documentation) the better.

``````names(df)
``````
``````##  [1] "ph_abg"            "hco3_abg"          "temp_c"
##  [4] "temp_nc"           "urea"              "creatinine"
##  [7] "na"                "k"                 "hb"
## [10] "wbc"               "neutrophil"        "platelets"
## [13] "crp"               "chemo"             "chronic_rrt"
## [19] "medical"           "system"            "height"
## [22] "weight"            "elective_surgical" "arrival_dttm"
## [25] "discharge_dttm"    "dob"               "vital_status"
## [28] "sex"               "id"                "lactate_1hr"
## [31] "lactate_6hr"       "lactate_12hr"
``````

Some of these variable names are too vague for my liking.

• chemo: active or historical?
• na: too similar to the R concept of “missing” i.e. NA
• system: a little ambiguous

We’re using the `rename` function from the dplyr package, and then ‘overwriting’ our existing data with the renamed data. We use the format of `rename(new_name = old_name)`. Note that we are using single `=` and not double `==` because we are assigning something, not checking for equality.

``````df <- df %>%
rename(new_name = old_name)
``````
``````df <- df %>%
rename(chemo_6_months = chemo,
sodium = na,
organ_system = system)
``````

### Exercise

Use the functions you have just learnt to:

• Find the mean apache score for survivors and non-survivors
• Create a new column with the highest temp from temp_c and temp_nc

## Wrangling strings

### Extract numbers

The `parse_number()` function is useful when numbers have been embedded inside a character vector. We can see this in our data in the `lactate_abg` column, where the units of measure (mmol/L) have crept into the column.

## Parsing Dates

Let’s create two new columns that show us the length of stay for a patient, and their age on admission.

Dates can be extremely fiddly, but fortunately, tidyverse comes to our rescue. We will need only the `arrival_dttm` and `discharge_dttm` columns to do what we want.

``````df <- df %>%
mutate(los = interval(arrival_dttm, discharge_dttm) / days(1))
``````
``````## Error in interval(arrival_dttm, discharge_dttm): could not find function "interval"
``````
``````df %>%
select(los)
``````
``````## Error in .f(.x[[i]], ...): object 'los' not found
``````

Working with dates can be extremely difficult. Let’s practice a few simple and common tasks.

### Exercise

Use the functions you have just learnt to:

• Find the mean los for survivors and non-survivors
• Calculate the age on arrival for these patients

## Pivoting Data

Right now we have data arranged in 1 row per admission (1 row per patient since there are no repeat admissions). This is normally called the “wide” format. We might want this in a slightly different format (the “long” or “tall” format) to answer a different type of question. This process is known as pivoting.

In our data, the lactate was recorded at hours 1, 6 and 12. Really, this data needs to be stacked for easy analysis. We can achieve this with the `pivot_longer()` function

``````df_long <- df %>%
pivot_longer(
cols = c("lactate_1hr", "lactate_6hr", "lactate_12hr"),
names_to = "lactate_time",
values_to = "lactate_value") %>%
select(id, lactate_time, lactate_value)
``````

To showcase why this was useful, we can draw a spagetti plot and view how the lactate changes over time. Don’t worry about the plotting syntax for now, this will be covered in another talk.

## Key Points

• The process of transforming data from a raw to more usable format

• Most easily using packages from the tidyverse