Data manipulation with dplyr

Last updated on 2024-07-02 | Edit this page

Overview

Questions

  • How can I add variables to my data?
  • How can I alter the variables already in my data?

Objectives

  • Use mutate() to add and alter variables
  • Use if_else() where appropriate
  • Use case_when() where appropriate
  • Understand basic consents of different data types

Motivation


Often, the data we have do not contain exactly what we need. We might need to change the order of factors, create new variables based on other columns in the data, or even variables conditional on specific values in other columns.

Adding new variables,

In {tidyverse}, when we add new variables, we use the mutate() function. Just like the other {tidyverse} functions, mutate work specifically with data sets, and provides a nice shorthand for working directly with the columns in the data set.

R

penguins |> 
  mutate(new_var = 1)

OUTPUT

# A tibble: 344 × 9
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 3 more variables: sex <fct>, year <int>, new_var <dbl>

The output of this can be hard to spot, depending on the size of the screen. Let us for convenience create a subsetted data set to work on so we can easily see what we are doing.

R

penguins_s <- penguins |> 
  select(1:3, starts_with("bill"))

Lets try our command again on this new data.

R

penguins_s |> 
  mutate(new_var = 1)

OUTPUT

# A tibble: 344 × 5
   species island    bill_length_mm bill_depth_mm new_var
   <fct>   <fct>              <dbl>         <dbl>   <dbl>
 1 Adelie  Torgersen           39.1          18.7       1
 2 Adelie  Torgersen           39.5          17.4       1
 3 Adelie  Torgersen           40.3          18         1
 4 Adelie  Torgersen           NA            NA         1
 5 Adelie  Torgersen           36.7          19.3       1
 6 Adelie  Torgersen           39.3          20.6       1
 7 Adelie  Torgersen           38.9          17.8       1
 8 Adelie  Torgersen           39.2          19.6       1
 9 Adelie  Torgersen           34.1          18.1       1
10 Adelie  Torgersen           42            20.2       1
# ℹ 334 more rows

There is now a new column in the data set called “new_var”, and it has the value 1 for all rows! This is what we told mutate() to do! We specified a new column by name, and gave it a specific value, 1.

This works because its easy to assigning a single value to all rows. What if we try to give it three values? What would we expect?

R

penguins_s |> 
  mutate(var = 1:3)

ERROR

Error in `mutate()`:
ℹ In argument: `var = 1:3`.
Caused by error:
! `var` must be size 344 or 1, not 3.

Here, it’s failing with a mysterious message. The error is telling us that input must be of size 344 or 1. 344 are the number of rows in the data set, so its telling us the input we gave it is not suitable because its neither of length 344 nor of length 1.

So now we know the premises for mutate, it takes inputs that are either of the same length as there are rows in the data set or length 1.

R

penguins_s |> 
  mutate(var = 1:344)

OUTPUT

# A tibble: 344 × 5
   species island    bill_length_mm bill_depth_mm   var
   <fct>   <fct>              <dbl>         <dbl> <int>
 1 Adelie  Torgersen           39.1          18.7     1
 2 Adelie  Torgersen           39.5          17.4     2
 3 Adelie  Torgersen           40.3          18       3
 4 Adelie  Torgersen           NA            NA       4
 5 Adelie  Torgersen           36.7          19.3     5
 6 Adelie  Torgersen           39.3          20.6     6
 7 Adelie  Torgersen           38.9          17.8     7
 8 Adelie  Torgersen           39.2          19.6     8
 9 Adelie  Torgersen           34.1          18.1     9
10 Adelie  Torgersen           42            20.2    10
# ℹ 334 more rows

But generally, we create new columns based on other data in the data set. So let’s do a more useful example. For instance, perhaps we want to use the ratio between the bill length and depth as a measurement for a model.

R

penguins_s |> 
  mutate(bill_ratio = bill_length_mm / bill_depth_mm)

OUTPUT

# A tibble: 344 × 5
   species island    bill_length_mm bill_depth_mm bill_ratio
   <fct>   <fct>              <dbl>         <dbl>      <dbl>
 1 Adelie  Torgersen           39.1          18.7       2.09
 2 Adelie  Torgersen           39.5          17.4       2.27
 3 Adelie  Torgersen           40.3          18         2.24
 4 Adelie  Torgersen           NA            NA        NA
 5 Adelie  Torgersen           36.7          19.3       1.90
 6 Adelie  Torgersen           39.3          20.6       1.91
 7 Adelie  Torgersen           38.9          17.8       2.19
 8 Adelie  Torgersen           39.2          19.6       2
 9 Adelie  Torgersen           34.1          18.1       1.88
10 Adelie  Torgersen           42            20.2       2.08
# ℹ 334 more rows

So, here we have asked for the ratio between bill length and depth to be calculated and stored in a column named bill_ratio. Then we selected just the bill columns to have a peak at the output more directly.

We can do almost anything within a mutate() to get the values as we want them, also use functions that exist in R to transform the data. For instance, perhaps we want to scale the variables of interest to have a mean of 0 and standard deviation of 1, which is quite common to improve statistical modelling. We can do that with the scale() function.

R

penguins_s |> 
  mutate(bill_ratio = bill_length_mm / bill_depth_mm,
         bill_length_mm_z = scale(bill_length_mm))

OUTPUT

# A tibble: 344 × 6
   species island   bill_length_mm bill_depth_mm bill_ratio bill_length_mm_z[,1]
   <fct>   <fct>             <dbl>         <dbl>      <dbl>                <dbl>
 1 Adelie  Torgers…           39.1          18.7       2.09               -0.883
 2 Adelie  Torgers…           39.5          17.4       2.27               -0.810
 3 Adelie  Torgers…           40.3          18         2.24               -0.663
 4 Adelie  Torgers…           NA            NA        NA                  NA
 5 Adelie  Torgers…           36.7          19.3       1.90               -1.32
 6 Adelie  Torgers…           39.3          20.6       1.91               -0.847
 7 Adelie  Torgers…           38.9          17.8       2.19               -0.920
 8 Adelie  Torgers…           39.2          19.6       2                  -0.865
 9 Adelie  Torgers…           34.1          18.1       1.88               -1.80
10 Adelie  Torgers…           42            20.2       2.08               -0.352
# ℹ 334 more rows

Challenge 1

Create a column where bill length is transformed to cm. To transform mm to cm, you must divide the mm value by 10. Name the column bill_length_cm.

R

penguins_s |>
  mutate(bill_length_cm = bill_length_mm / 10)

OUTPUT

# A tibble: 344 × 5
   species island    bill_length_mm bill_depth_mm bill_length_cm
   <fct>   <fct>              <dbl>         <dbl>          <dbl>
 1 Adelie  Torgersen           39.1          18.7           3.91
 2 Adelie  Torgersen           39.5          17.4           3.95
 3 Adelie  Torgersen           40.3          18             4.03
 4 Adelie  Torgersen           NA            NA            NA
 5 Adelie  Torgersen           36.7          19.3           3.67
 6 Adelie  Torgersen           39.3          20.6           3.93
 7 Adelie  Torgersen           38.9          17.8           3.89
 8 Adelie  Torgersen           39.2          19.6           3.92
 9 Adelie  Torgersen           34.1          18.1           3.41
10 Adelie  Torgersen           42            20.2           4.2
# ℹ 334 more rows

Challenge 2

Create a column for body mass in kilos, rather than grams, in the main penguins data set. Name the column body_mass_kg. To transform grams to kilograms, divide the grams by 1000.

R

penguins |>
  mutate(body_mass_kg = body_mass_g / 1000)

OUTPUT

# A tibble: 344 × 9
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 3 more variables: sex <fct>, year <int>, body_mass_kg <dbl>

Adding conditional variables

Sometimes, we want to assign certain data values based on other variables in the data set. For instance, maybe we want to classify all penguins with body mass above 4.5 kg as “large” while the rest are “normal”?

The if_else() function takes expressions, much like filter(). The first value after the expression is the value assigned if the expression is TRUE, while the second is if the expression is FALSE

R

penguin_weight <- penguins |> 
  select(year, body_mass_g)

penguin_weight |> 
  mutate(size = if_else(condition = body_mass_g > 4500, 
                        true = "large", 
                        false = "normal"))

OUTPUT

# A tibble: 344 × 3
    year body_mass_g size
   <int>       <int> <chr>
 1  2007        3750 normal
 2  2007        3800 normal
 3  2007        3250 normal
 4  2007          NA <NA>
 5  2007        3450 normal
 6  2007        3650 normal
 7  2007        3625 normal
 8  2007        4675 large
 9  2007        3475 normal
10  2007        4250 normal
# ℹ 334 more rows

Now we have a column with two values, large and normal based on whether the penguins are above or below 4.5 kilos.

We can for instance use that in a plot.

R

penguin_weight |> 
  mutate(size = if_else(condition = body_mass_g > 4500, 
                        true = "large", 
                        false = "normal")) |> 
  ggplot() +
  geom_jitter(mapping = aes(x = year, y = body_mass_g, colour = size))

WARNING

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

That shows us clearly that we have grouped the penguins based on their size. But there is this strange NA in the plot legend. what is that?

In R, missing values are usually given the value NA which stands for Not applicable, i.e., missing data. This is a very special name in R. Like TRUE and FALSE are capitalized, RStudio immediately recognizes the combination of capital letters and gives it another colour than all other values. In this case it means, there are some penguins we do not have the body mass of.

Now we know how to create new variables, and even how to make them if there are conditions on how to add the data.

But, we often want to add several columns of different types, and maybe even add new variables based on other new columns! Oh, it’s starting to sound complicated, but it does not have to be!

mutate() is so-called lazy-evaluated. This sounds weird, but it means that each new column you make is made in the sequence you make them. So as long as you think about the order of your mutate() creations, you can do that in a single mutate call.

R

penguins_s |> 
  mutate(
    bill_ratio = bill_depth_mm / bill_length_mm,
    bill_type = if_else(condition = bill_ratio < 0.5, 
                        true = "elongated", 
                        false = "stumped")
  )

OUTPUT

# A tibble: 344 × 6
   species island    bill_length_mm bill_depth_mm bill_ratio bill_type
   <fct>   <fct>              <dbl>         <dbl>      <dbl> <chr>
 1 Adelie  Torgersen           39.1          18.7      0.478 elongated
 2 Adelie  Torgersen           39.5          17.4      0.441 elongated
 3 Adelie  Torgersen           40.3          18        0.447 elongated
 4 Adelie  Torgersen           NA            NA       NA     <NA>
 5 Adelie  Torgersen           36.7          19.3      0.526 stumped
 6 Adelie  Torgersen           39.3          20.6      0.524 stumped
 7 Adelie  Torgersen           38.9          17.8      0.458 elongated
 8 Adelie  Torgersen           39.2          19.6      0.5   stumped
 9 Adelie  Torgersen           34.1          18.1      0.531 stumped
10 Adelie  Torgersen           42            20.2      0.481 elongated
# ℹ 334 more rows

Now you’ve created two variables. One for bill_ratio, and then another one conditional on the values of the bill_ratio.

If you switched the order of these two, R would produce an error, because there would be no bill ratio to create the other column.

R

penguins_s |> 
  mutate(
    bill_ratio = bill_depth_mm / bill_length_mm,
    bill_type = if_else(condition = bill_ratio < 0.5, 
                        true = "elongated", 
                        false = "stumped"),
    bill_ratio = bill_depth_mm / bill_length_mm
  )

OUTPUT

# A tibble: 344 × 6
   species island    bill_length_mm bill_depth_mm bill_ratio bill_type
   <fct>   <fct>              <dbl>         <dbl>      <dbl> <chr>
 1 Adelie  Torgersen           39.1          18.7      0.478 elongated
 2 Adelie  Torgersen           39.5          17.4      0.441 elongated
 3 Adelie  Torgersen           40.3          18        0.447 elongated
 4 Adelie  Torgersen           NA            NA       NA     <NA>
 5 Adelie  Torgersen           36.7          19.3      0.526 stumped
 6 Adelie  Torgersen           39.3          20.6      0.524 stumped
 7 Adelie  Torgersen           38.9          17.8      0.458 elongated
 8 Adelie  Torgersen           39.2          19.6      0.5   stumped
 9 Adelie  Torgersen           34.1          18.1      0.531 stumped
10 Adelie  Torgersen           42            20.2      0.481 elongated
# ℹ 334 more rows

But what if we want to categorize based on more than one condition? Nested if_else()?

R

penguins_s |> 
  mutate(
    bill_ratio = bill_depth_mm / bill_length_mm,
    bill_type = if_else(condition = bill_ratio < 0.35,
                        true =  "elongated", 
                        false = if_else(condition = bill_ratio < 0.45,
                                        true = "normal",
                                        false = "stumped")))

OUTPUT

# A tibble: 344 × 6
   species island    bill_length_mm bill_depth_mm bill_ratio bill_type
   <fct>   <fct>              <dbl>         <dbl>      <dbl> <chr>
 1 Adelie  Torgersen           39.1          18.7      0.478 stumped
 2 Adelie  Torgersen           39.5          17.4      0.441 normal
 3 Adelie  Torgersen           40.3          18        0.447 normal
 4 Adelie  Torgersen           NA            NA       NA     <NA>
 5 Adelie  Torgersen           36.7          19.3      0.526 stumped
 6 Adelie  Torgersen           39.3          20.6      0.524 stumped
 7 Adelie  Torgersen           38.9          17.8      0.458 stumped
 8 Adelie  Torgersen           39.2          19.6      0.5   stumped
 9 Adelie  Torgersen           34.1          18.1      0.531 stumped
10 Adelie  Torgersen           42            20.2      0.481 stumped
# ℹ 334 more rows

what if you have even more conditionals? It can get pretty messy pretty fast. Thankfully, {dplyr} has a smarter way of doing this, called case_when(). This function is similar to if_else(), but where you specify what each condition should be assigned. On the left you have the logical expression, and the on the right of the tilde (~) is the value to be assigned if that expression is TRUE

R

penguins_s |> 
  mutate(
    bill_ratio = bill_depth_mm / bill_length_mm,
    bill_type = case_when(
      bill_ratio < 0.35 ~ "elongated",
      bill_ratio < 0.45 ~ "normal",
      TRUE              ~ "stumped")
  ) |> 
  ggplot(mapping = aes(x = bill_length_mm,
                       y = bill_depth_mm,
                       colour = bill_type)) +
  geom_point()

WARNING

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

That looks almost the same. The NA’s are gone! That’s not right. We cannot categorize values that are missing. It’s our last statement that does this, which just says “make the remainder this value”. Which is not what we want. We need the NAs to stay NA’s.

case_when(), like the mutate(), evaluates the expressions in sequence. Which is why we can have two statements evaluating the same column with similar expressions (below 0.35 and then below 0.45). All values that are below 0.45 are also below 0.35. Since we first assign everything below 0.35, and then below 0.45, they do not collide. We can do the same for our last statement, saying that all values that are not NA should be given this category.

R

penguins |> 
  mutate(
    bill_ratio = bill_depth_mm / bill_length_mm,
    bill_type = case_when(
      bill_ratio < 0.35  ~ "elongated",
      bill_ratio < 0.45  ~ "normal",
      !is.na(bill_ratio) ~ "stumped")
  ) |> 
  ggplot(mapping = aes(x = bill_length_mm,
                       y = bill_depth_mm,
                       colour = bill_type)) +
  geom_point()

WARNING

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Here, we use the is.na(), which is a special function in R to detect NA values. But it also has an ! in front, what does that mean? In R’s logical expressions, the ! is a negation specifier. It means it flips the logical so the TRUE becomes FALSE, and vice versa. So here, it means the bill_ratio is not NA.

Challenge 3

Create a column named bill_ld_ratio_log that is the natural logarithm (using the log() function) of bill_length_mm divided by bill_depth_mm

R

penguins |>
  mutate(bill_ld_ratio_log = log(bill_length_mm / bill_depth_mm))

OUTPUT

# A tibble: 344 × 9
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 3 more variables: sex <fct>, year <int>, bill_ld_ratio_log <dbl>

Challenge 4

Create a new column called body_type, where animals below 3 kg are small, animals between 3 and 4.5 kg are normal, and animals larger than 4.5 kg are large. In the same command, create a new column named biscoe and its content should be TRUE if the island is Biscoe and FALSE for everything else.

R

penguins |>
  mutate(
    body_type = case_when(
       body_mass_g < 3000 ~ "small",
       body_mass_g >= 3000 & body_mass_g < 4500 ~ "normal",
       body_mass_g >= 4500 ~ "large"),
    biscoe = if_else(island == "Biscoe", 
                     true = TRUE,
                     false = FALSE)
  )

OUTPUT

# A tibble: 344 × 10
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 4 more variables: sex <fct>, year <int>, body_type <chr>, biscoe <lgl>

Wrap up

Now we’ve learned a little about adding and altering variables in data sets using {dplyr}’s mutate() function. You should be able to play around with the examples provided and learn more about how things work through trial and error.