Data manipulation across columns
Last updated on 2024-11-19 | Edit this page
Overview
Questions
- How can I calculate the mean of several columns for every row of data?
- How can I apply the same function across several related columns?
Objectives
Motivation
We have covered many topics so far, and changing (or mutating) variables has been a key concept. The need to create new columns of data, often based on information in other columns of data, is a type of operation that we need very often. But some times, you also need to calculate something per row for several solumns. For instance, you want the sum of all columns in a certain collection, or the mean of them, how can we do that?
One way, is to write it our entirely. Let just pretend there is a good reason to get the sum of bill length and bill depth. Let us also make a subsetted sample with just the bill measurements so we cab easily see what we are doing. We can do that in the following way.
R
penguins_s <- penguins |>
select(species, starts_with("bill"))
penguins_s |>
mutate(
bill_sum = bill_depth_mm + bill_length_mm
)
OUTPUT
# A tibble: 344 × 4
species bill_length_mm bill_depth_mm bill_sum
<fct> <dbl> <dbl> <dbl>
1 Adelie 39.1 18.7 57.8
2 Adelie 39.5 17.4 56.9
3 Adelie 40.3 18 58.3
4 Adelie NA NA NA
5 Adelie 36.7 19.3 56
6 Adelie 39.3 20.6 59.9
7 Adelie 38.9 17.8 56.7
8 Adelie 39.2 19.6 58.8
9 Adelie 34.1 18.1 52.2
10 Adelie 42 20.2 62.2
# ℹ 334 more rows
We’ve seen similar types of operations before. But what if you want
to sum 20 columns, you would need to type our all 20 column names!
Again, tedious. We have a special type of operations we can do to get
that easily. We will use the function sum
to calculate the
sum of several variables when using this pipeline.
R
penguins_s |>
mutate(bill_sum = sum(c_across(starts_with("bill"))))
OUTPUT
# A tibble: 344 × 4
species bill_length_mm bill_depth_mm bill_sum
<fct> <dbl> <dbl> <dbl>
1 Adelie 39.1 18.7 NA
2 Adelie 39.5 17.4 NA
3 Adelie 40.3 18 NA
4 Adelie NA NA NA
5 Adelie 36.7 19.3 NA
6 Adelie 39.3 20.6 NA
7 Adelie 38.9 17.8 NA
8 Adelie 39.2 19.6 NA
9 Adelie 34.1 18.1 NA
10 Adelie 42 20.2 NA
# ℹ 334 more rows
hm, that is not what we expected. I know why, but the reason is not
always easy to understand. By default, c_across
will
summarise all the rows for all the bill columns, and give a
single value for the entire data set. There are some
NA
s the entire data set, so it returns NA
. So
how can we force it to work in a row-wise fashion? We can apply a
function called rowwise()
which is a special type of
group_by
that groups your data by each row, so each row is
its own group. Then, c_across()
will calculate the mean of
the columns just for that group (i.e. row in this case).
R
penguins_s |>
rowwise() |>
mutate(bill_sum = sum(c_across(starts_with("bill"))))
OUTPUT
# A tibble: 344 × 4
# Rowwise:
species bill_length_mm bill_depth_mm bill_sum
<fct> <dbl> <dbl> <dbl>
1 Adelie 39.1 18.7 57.8
2 Adelie 39.5 17.4 56.9
3 Adelie 40.3 18 58.3
4 Adelie NA NA NA
5 Adelie 36.7 19.3 56
6 Adelie 39.3 20.6 59.9
7 Adelie 38.9 17.8 56.7
8 Adelie 39.2 19.6 58.8
9 Adelie 34.1 18.1 52.2
10 Adelie 42 20.2 62.2
# ℹ 334 more rows
Now we can see that we get the row sum of all the bill columns for
each row, and the tibble tells us it is “Rowwise”. To stop the data set
being rowwise, we can use the ungroup()
function we learned
before.
R
penguins_s |>
rowwise() |>
mutate(bill_sum = sum(c_across(starts_with("bill")))) |>
ungroup()
OUTPUT
# A tibble: 344 × 4
species bill_length_mm bill_depth_mm bill_sum
<fct> <dbl> <dbl> <dbl>
1 Adelie 39.1 18.7 57.8
2 Adelie 39.5 17.4 56.9
3 Adelie 40.3 18 58.3
4 Adelie NA NA NA
5 Adelie 36.7 19.3 56
6 Adelie 39.3 20.6 59.9
7 Adelie 38.9 17.8 56.7
8 Adelie 39.2 19.6 58.8
9 Adelie 34.1 18.1 52.2
10 Adelie 42 20.2 62.2
# ℹ 334 more rows
Challenge 1
Calculate the mean of all the columns with millimeter measurements,
an call it mm_mean
, for each row of data.
R
penguins |>
rowwise() |>
mutate(
mm_mean = mean(c_across(ends_with("mm")))
)
OUTPUT
# A tibble: 344 × 9
# Rowwise:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 3 more variables: sex <fct>, year <int>, mm_mean <dbl>
Challenge 2
Calculate the mean of all the columns with millimeter measurements,
an call it mm_mean
, for each row of data. Then, group the
data by species, and calculate the mean of the mm_mean
within each species and add it as a column named
mm_mean_species
. Ignore NA
s in the last
calculation
R
penguins |>
rowwise() |>
mutate(
mm_mean = mean(c_across(ends_with("mm"))),
) |>
group_by(species) |>
mutate(mm_mean_species = mean(mm_mean, na.rm = TRUE))
OUTPUT
# A tibble: 344 × 10
# Groups: species [3]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 4 more variables: sex <fct>, year <int>, mm_mean <dbl>,
# mm_mean_species <dbl>
Mutating several columns in one go
So far, we’ve been looking at adding or summarising variables one by one, or in a pivoted fashion. This is of course something we do all the time, but some times we need to do the same change to multiple columns at once. Imagine you have a data set with 20 column and you want to scale them all to the same scale. Writing the same command with different columns 20 times is very tedious.
In our case, let us say we want to scale the three columns with
millimeter measurements so that they have a mean of 0 and standard
deviation of 1. We’ve already used the scale()
function
once before, so we will do it again.
In this simple example we might have done so:
R
penguins |>
mutate(
bill_depth_sc = scale(bill_depth_mm),
bill_length_sc = scale(bill_length_mm),
flipper_length_sc = scale(flipper_length_mm)
)
OUTPUT
# A tibble: 344 × 11
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 5 more variables: sex <fct>, year <int>, bill_depth_sc <dbl[,1]>,
# bill_length_sc <dbl[,1]>, flipper_length_sc <dbl[,1]>
Its just three columns, we can do that. But let us imagine we have 20
of these, typing all that out is tedious and error prone. You might
forget to alter the name or keep the same type of naming convention. We
are only human, we easily make mistakes. With {dplyr}’s
across()
we can combine our knowledge of tidy-selectors and
mutate to create the entire transformation for these columns at
once.
R
penguins |>
mutate(across(.cols = ends_with("mm"),
.fns = scale))
OUTPUT
# A tibble: 344 × 8
species island bill_length_mm[,1] bill_depth_mm[,1] flipper_length_mm[,1]
<fct> <fct> <dbl> <dbl> <dbl>
1 Adelie Torgersen -0.883 0.784 -1.42
2 Adelie Torgersen -0.810 0.126 -1.06
3 Adelie Torgersen -0.663 0.430 -0.421
4 Adelie Torgersen NA NA NA
5 Adelie Torgersen -1.32 1.09 -0.563
6 Adelie Torgersen -0.847 1.75 -0.776
7 Adelie Torgersen -0.920 0.329 -1.42
8 Adelie Torgersen -0.865 1.24 -0.421
9 Adelie Torgersen -1.80 0.480 -0.563
10 Adelie Torgersen -0.352 1.54 -0.776
# ℹ 334 more rows
# ℹ 3 more variables: body_mass_g <int>, sex <fct>, year <int>
Whoa! So fast! Now the three columns are scaled. .col
argument takes a tidy-selection of columns, and .fns
it
where you let it know which function to apply.
But oh no! The columns have been overwritten. Rather than creating new ones, we replaced the old ones. This might be your intention in some instances, or maybe you will just create a new data set with the scaled variables.
R
penguins_mm_sc <- penguins |>
mutate(across(.cols = ends_with("mm"),
.fns = scale))
but often, we’d like to keep the original but add the new variants. We can do that to within the across!
R
penguins |>
mutate(across(.cols = ends_with("mm"),
.fns = scale,
.names = "{.col}_sc")) |>
select(contains("mm"))
OUTPUT
# A tibble: 344 × 6
bill_length_mm bill_depth_mm flipper_length_mm bill_length_mm_sc[,1]
<dbl> <dbl> <int> <dbl>
1 39.1 18.7 181 -0.883
2 39.5 17.4 186 -0.810
3 40.3 18 195 -0.663
4 NA NA NA NA
5 36.7 19.3 193 -1.32
6 39.3 20.6 190 -0.847
7 38.9 17.8 181 -0.920
8 39.2 19.6 195 -0.865
9 34.1 18.1 193 -1.80
10 42 20.2 190 -0.352
# ℹ 334 more rows
# ℹ 2 more variables: bill_depth_mm_sc <dbl[,1]>,
# flipper_length_mm_sc <dbl[,1]>
Now they are all there! neat! But that .names
argument
is a little weird. What does it really mean?
Internally, across()
stores the column names in a vector
it calls .col
. We can use this knowledge to tell the across
function what to name our new columns. In this case, we want to append
the column name with _sc
.
Challenge 3
Transform all the colmns with an underscore in their name so they are
scaled, and add the prefix sc_
to the columns
names.
R
penguins |>
mutate(across(.cols = contains("_"),
.fns = scale,
.names = "sc_{.col}"))
OUTPUT
# A tibble: 344 × 12
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 6 more variables: sex <fct>, year <int>, sc_bill_length_mm <dbl[,1]>,
# sc_bill_depth_mm <dbl[,1]>, sc_flipper_length_mm <dbl[,1]>,
# sc_body_mass_g <dbl[,1]>
Challenge 4
Transform all the colmns with an underscore in their name so they are
scaled, and add the prefix sc_
to the columns
names. Add another standard change of the body mass column to
kilograms
You can add a standard mutate within the same mutate as across
R
penguins |>
mutate(
across(.cols = contains("_"),
.fns = scale,
.names = "sc_{.col}"),
body_mass_kg = body_mass_g / 1000
)
OUTPUT
# A tibble: 344 × 13
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 7 more variables: sex <fct>, year <int>, sc_bill_length_mm <dbl[,1]>,
# sc_bill_depth_mm <dbl[,1]>, sc_flipper_length_mm <dbl[,1]>,
# sc_body_mass_g <dbl[,1]>, body_mass_kg <dbl>
Wrap-up
We hope these sessions have given your a leg-up in starting you R and tidyverse journey. Remember that learning to code is like learning a new language, the best way to learn is to keep trying. We promise, your efforts will not be in vain as you uncover the power of R and the tidyverse.
Learning more
As and end to this workshop, we would like to provide you with some learning materials that might aid you in further pursuits of learning R.
The tidyverse webpage offers lots of resources on learning the tidyverse way of working, and information about what great things you can do with this collection of packages. There is an R for Datascience learning community that is an excellent and welcoming community of other learners navigating the tidyverse. We wholeheartedly recommend joining this community! The Rstudio community is also a great place to ask questions or look for solutions for questions you may have, and so is stackoverflow.