Data summaries with dplyr
Last updated on 2024-10-08
Estimated time: 68 minutes
Overview
Questions
- How can I create summary tables of my data?
- How can I create different types of summaries based on groups in my data?
Objectives
- Use summarise() to create data summaries
- Use group_by() to create summaries of groups
- Use tally()/count() to create a quick frequency table
Motivation
Next to visualizing data, creating summary tables is a quick way to get an idea of what type of data you have at hand. It might help you spot incorrect data or extreme values, or show whether specific analysis approaches are needed. To summarize data efficiently with the {tidyverse}, we need the tools we have learned in the previous days, like adding new variables, tidy-selections, pivots and grouping data. All these tools combine amazingly when we start making summaries.
Let us start from the beginning with summaries, and work our way up to the more complex variations as we go.
First, we must again prepare our workspace with our packages and data.
R
library(tidyverse)
penguins <- palmerpenguins::penguins
We should start to feel quite familiar with our penguins by now. Let us start by finding the mean of the bill length.
R
penguins |>
summarise(bill_length_mean = mean(bill_length_mm))
OUTPUT
# A tibble: 1 × 1
bill_length_mean
<dbl>
1 NA
The result is NA. As we remember, there are some NA values in our data. R is very strict about calculations that involve an NA. If there is an NA, i.e. a value we do not know, R cannot produce a correct calculation, so it returns NA again. This is a nice way of quickly seeing that you have missing values in your data. Right now, we will ignore those. We can omit them by adding the na.rm = TRUE argument, which removes all NAs before calculating the mean.
R
penguins |>
summarise(bill_length_mean = mean(bill_length_mm, na.rm = TRUE))
OUTPUT
# A tibble: 1 × 1
bill_length_mean
<dbl>
1 43.9
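The NA behaviour is easy to see on a tiny vector. The sketch below uses made-up numbers (not the penguins data) and only base R functions; the variable names are purely illustrative.

```r
# A small made-up vector with one missing value (illustrative only)
x <- c(40.1, NA, 45.3)

# Any missing value makes the mean unknown, so the result is NA
mean_with_na <- mean(x)

# na.rm = TRUE drops the NA before the mean is calculated
mean_without_na <- mean(x, na.rm = TRUE)

# Counting the missing values shows how much data is ignored
n_missing <- sum(is.na(x))
```

Checking n_missing alongside the mean is a cheap habit that tells you how many values the summary is silently leaving out.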
An alternative way to remove missing values from a column is to pass
the column to {tidyr}’s drop_na()
function.
R
penguins |>
drop_na(bill_length_mm) |>
summarise(bill_length_mean = mean(bill_length_mm))
OUTPUT
# A tibble: 1 × 1
bill_length_mean
<dbl>
1 43.9
R
penguins |>
drop_na(bill_length_mm) |>
summarise(bill_length_mean = mean(bill_length_mm),
bill_length_min = min(bill_length_mm),
bill_length_max = max(bill_length_mm))
OUTPUT
# A tibble: 1 × 3
bill_length_mean bill_length_min bill_length_max
<dbl> <dbl> <dbl>
1 43.9 32.1 59.6
Challenge 1
First start by trying to summarise a single column, body_mass_g, by calculating its mean in kilograms.
R
penguins |>
drop_na(body_mass_g) |>
summarise(body_mass_kg_mean = mean(body_mass_g / 1000))
OUTPUT
# A tibble: 1 × 1
body_mass_kg_mean
<dbl>
1 4.20
Challenge 2
Add a column with the standard deviation of body_mass_g on the kilogram scale.
R
penguins |>
drop_na(body_mass_g) |>
summarise(
body_mass_kg_mean = mean(body_mass_g / 1000),
body_mass_kg_sd = sd(body_mass_g / 1000)
)
OUTPUT
# A tibble: 1 × 2
body_mass_kg_mean body_mass_kg_sd
<dbl> <dbl>
1 4.20 0.802
Challenge 3
Now add the same two metrics for flipper_length_mm on the centimeter scale and give the columns clear names. Why could the drop_na() step give us wrong results?
R
penguins |>
drop_na(body_mass_g, flipper_length_mm) |>
summarise(
body_mass_kg_mean = mean(body_mass_g / 1000),
body_mass_kg_sd = sd(body_mass_g / 1000),
flipper_length_cm_mean = mean(flipper_length_mm / 10),
flipper_length_cm_sd = sd(flipper_length_mm / 10)
)
OUTPUT
# A tibble: 1 × 4
body_mass_kg_mean body_mass_kg_sd flipper_length_cm_mean flipper_length_cm_sd
<dbl> <dbl> <dbl> <dbl>
1 4.20 0.802 20.1 1.41
When we use drop_na() on multiple columns, it drops the entire row of data wherever there is an NA in any of the columns we specify. This means that we might be dropping valid body mass data because flipper length is missing, and vice versa.
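The effect described here can be sketched on a small made-up table where the missing values sit in different rows. The data and the object names (toy, dropped, per_column) are assumptions for illustration only; the penguins data happens to behave more gently.

```r
library(dplyr)
library(tidyr)

# Made-up data: one row lacks body mass, another lacks flipper length
toy <- tibble(
  body_mass_g       = c(3000, NA, 5000, 4000),
  flipper_length_mm = c(180, 190, NA, 210)
)

# drop_na() on both columns keeps only the fully complete rows (2 of 4)
dropped <- toy |>
  drop_na(body_mass_g, flipper_length_mm) |>
  summarise(
    body_mass_mean = mean(body_mass_g),
    flipper_mean   = mean(flipper_length_mm)
  )

# na.rm = TRUE handles each column on its own, so each mean uses 3 values
per_column <- toy |>
  summarise(
    body_mass_mean = mean(body_mass_g, na.rm = TRUE),
    flipper_mean   = mean(flipper_length_mm, na.rm = TRUE)
  )
```

The two approaches give different means here because drop_na() throws away a valid body mass and a valid flipper length along with the incomplete rows.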
Summarising grouped data
In all the examples so far, we have summarized the entire data set. But most of the time, we want to look at groups in our data and summarize based on these groups. How can we summarize while preserving grouping information?
We’ve already worked a little with the group_by()
function, and we will use it again! Because, once we know how to
summarize data, summarizing data by groups is as simple as adding one
more line to our code.
Let us start with our first example of getting the mean of a single column.
R
penguins |>
drop_na(body_mass_g) |>
summarise(body_mass_g_mean = mean(body_mass_g))
OUTPUT
# A tibble: 1 × 1
body_mass_g_mean
<dbl>
1 4202.
Here, we are getting a single mean for the entire data set. To get, for instance, the mean of each species, we can group the data set by species before we summarize.
R
penguins |>
drop_na(body_mass_g) |>
group_by(species) |>
summarise(body_mass_kg_mean = mean(body_mass_g / 1000))
OUTPUT
# A tibble: 3 × 2
species body_mass_kg_mean
<fct> <dbl>
1 Adelie 3.70
2 Chinstrap 3.73
3 Gentoo 5.08
And now we suddenly have three means! And they are tidily collected, each in its own row. We can keep adding to this as we did before.
R
penguins |>
drop_na(body_mass_g) |>
group_by(species) |>
summarise(
body_mass_kg_mean = mean(body_mass_g / 1000),
body_mass_kg_min = min(body_mass_g / 1000),
body_mass_kg_max = max(body_mass_g / 1000)
)
OUTPUT
# A tibble: 3 × 4
species body_mass_kg_mean body_mass_kg_min body_mass_kg_max
<fct> <dbl> <dbl> <dbl>
1 Adelie 3.70 2.85 4.78
2 Chinstrap 3.73 2.7 4.8
3 Gentoo 5.08 3.95 6.3
Now we are suddenly able to easily compare groups within our data, since they are so neatly summarized here.
Simple frequency tables
So far, we have created custom summary tables with means, standard deviations and so on. But what if you want a really quick count of all the records in different groups, that is, a frequency table?
One way would be to use the summarise() function together with the n() function, which counts the number of rows in each group.
R
penguins |>
group_by(species) |>
summarise(n = n())
OUTPUT
# A tibble: 3 × 2
species n
<fct> <int>
1 Adelie 152
2 Chinstrap 68
3 Gentoo 124
This is super nice, and n() is a nice function to remember when you are making your own custom tables. But if all you want is the frequency table, we would suggest using the functions count() or tally(). They give the same counts, so you can choose the one that feels more appropriate. One difference is visible in the outputs below: after group_by(), count() keeps the grouping, while tally() drops it.
R
penguins |>
group_by(species) |>
tally()
OUTPUT
# A tibble: 3 × 2
species n
<fct> <int>
1 Adelie 152
2 Chinstrap 68
3 Gentoo 124
R
penguins |>
group_by(species) |>
count()
OUTPUT
# A tibble: 3 × 2
# Groups: species [3]
species n
<fct> <int>
1 Adelie 152
2 Chinstrap 68
3 Gentoo 124
These are two really nice convenience functions for getting a quick frequency table of your data.
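As a side note, count() can also take the grouping column directly, which saves the separate group_by() step. The sketch below uses a small made-up table rather than the penguins data; the object names are illustrative.

```r
library(dplyr)

# Made-up species labels (illustrative only)
toy <- tibble(species = c("Adelie", "Adelie", "Gentoo", "Adelie", "Gentoo"))

# group_by() followed by tally()
via_tally <- toy |>
  group_by(species) |>
  tally()

# count() with the column named directly, no group_by() needed
via_count <- toy |>
  count(species)
```

Both tables hold the same counts, and because the input to count(species) was never grouped, its result is not grouped either.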
Challenge 4
Create a table that gives the mean and standard deviation of bill length, grouped by island.
R
penguins |>
drop_na(bill_length_mm) |>
group_by(island) |>
summarise(
bill_length_mm_mean = mean(bill_length_mm),
bill_length_mm_sd = sd(bill_length_mm)
)
OUTPUT
# A tibble: 3 × 3
island bill_length_mm_mean bill_length_mm_sd
<fct> <dbl> <dbl>
1 Biscoe 45.3 4.77
2 Dream 44.2 5.95
3 Torgersen 39.0 3.03
Challenge 5
Create a table that gives the mean and standard deviation of bill length, grouped by island and sex.
R
penguins |>
drop_na(bill_length_mm) |>
group_by(island, sex) |>
summarise(
bill_length_mm_mean = mean(bill_length_mm),
bill_length_mm_sd = sd(bill_length_mm)
)
OUTPUT
`summarise()` has grouped output by 'island'. You can override using the
`.groups` argument.
OUTPUT
# A tibble: 9 × 4
# Groups: island [3]
island sex bill_length_mm_mean bill_length_mm_sd
<fct> <fct> <dbl> <dbl>
1 Biscoe female 43.3 4.18
2 Biscoe male 47.1 4.69
3 Biscoe <NA> 45.6 1.37
4 Dream female 42.3 5.53
5 Dream male 46.1 5.77
6 Dream <NA> 37.5 NA
7 Torgersen female 37.6 2.21
8 Torgersen male 40.6 3.03
9 Torgersen <NA> 37.9 3.23
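The message printed above mentions the .groups argument of summarise(). A minimal sketch of what it does, on a made-up table (the data and object names are assumptions for illustration):

```r
library(dplyr)

# Made-up data with two grouping columns (illustrative only)
toy <- tibble(
  island         = c("Biscoe", "Biscoe", "Dream", "Dream"),
  sex            = c("female", "male", "female", "male"),
  bill_length_mm = c(43, 47, 42, 46)
)

# Default: the last grouping variable (sex) is dropped, island remains
still_grouped <- toy |>
  group_by(island, sex) |>
  summarise(bill_length_mean = mean(bill_length_mm))

# .groups = "drop" returns an ungrouped table and silences the message
ungrouped <- toy |>
  group_by(island, sex) |>
  summarise(bill_length_mean = mean(bill_length_mm), .groups = "drop")
```

Setting .groups = "drop" is one way to make the grouping of the result explicit instead of relying on the default.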
Ungrouping for future control
We have been grouping a lot without ungrouping. That might seem fine for now, because we have not done anything more after the summary. But in many cases we continue on our merry data handling way and do lots more, and then the preserved grouping can give us some unexpected results. Let us explore that a little.
R
penguins |>
group_by(species) |>
count()
OUTPUT
# A tibble: 3 × 2
# Groups: species [3]
species n
<fct> <int>
1 Adelie 152
2 Chinstrap 68
3 Gentoo 124
When we group by a single column and use summarise(), the output data is no longer grouped. In a way, summarise() uses up one grouping level while summarizing, since, based on species, the data cannot be condensed any further than this. count() behaves differently: as the "# Groups: species [3]" line in the output shows, it keeps the grouping it was given. When we group by two columns, count() has the same behavior.
R
penguins |>
group_by(species, island) |>
count()
OUTPUT
# A tibble: 5 × 3
# Groups: species, island [5]
species island n
<fct> <fct> <int>
1 Adelie Biscoe 44
2 Adelie Dream 56
3 Adelie Torgersen 52
4 Chinstrap Dream 68
5 Gentoo Biscoe 124
Here both "species" and "island" are still grouping variables, as the "# Groups" line shows. Let's say we now want a column that counts the total number of penguin observations. That would be the sum of the "n" column.
R
penguins |>
group_by(species, island) |>
count() |>
mutate(total = sum(n))
OUTPUT
# A tibble: 5 × 4
# Groups: species, island [5]
species island n total
<fct> <fct> <int> <int>
1 Adelie Biscoe 44 44
2 Adelie Dream 56 56
3 Adelie Torgersen 52 52
4 Chinstrap Dream 68 68
5 Gentoo Biscoe 124 124
But that is not what we were expecting! Why? Because the data is still grouped by species and island, the sum is taken within each species and island combination, so total simply repeats n instead of giving the sum for the whole data set. To get the overall total, we first need to ungroup(), and then try again.
R
penguins |>
group_by(species, island) |>
count() |>
ungroup() |>
mutate(total = sum(n))
OUTPUT
# A tibble: 5 × 4
species island n total
<fct> <fct> <int> <int>
1 Adelie Biscoe 44 344
2 Adelie Dream 56 344
3 Adelie Torgersen 52 344
4 Chinstrap Dream 68 344
5 Gentoo Biscoe 124 344
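A quick way to check which grouping a table still carries is dplyr's group_vars(). The sketch below, on a small made-up table with illustrative object names, compares what is left after summarise(), count() and ungroup().

```r
library(dplyr)

# Made-up data grouped by two columns (illustrative only)
toy <- tibble(
  species = c("Adelie", "Adelie", "Gentoo"),
  island  = c("Biscoe", "Dream", "Biscoe")
)
grouped <- toy |> group_by(species, island)

# summarise() peels off the last grouping variable
after_summarise <- grouped |> summarise(n = n()) |> group_vars()

# count() on grouped data keeps the grouping it was given
after_count <- grouped |> count() |> group_vars()

# ungroup() removes all grouping
after_ungroup <- grouped |> count() |> ungroup() |> group_vars()
```

Inspecting group_vars() before a mutate() can save you from a total that was quietly computed per group.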
Challenge 6
Create a table that gives the mean and standard deviation of bill length, grouped by island and sex, then add another column that has the mean for all the data.
R
penguins |>
drop_na(bill_length_mm) |>
group_by(island, sex) |>
summarise(
bill_length_mm_mean = mean(bill_length_mm),
bill_length_mm_sd = sd(bill_length_mm)
) |>
ungroup() |>
mutate(mean = mean(bill_length_mm_mean))
OUTPUT
`summarise()` has grouped output by 'island'. You can override using the
`.groups` argument.
OUTPUT
# A tibble: 9 × 5
island sex bill_length_mm_mean bill_length_mm_sd mean
<fct> <fct> <dbl> <dbl> <dbl>
1 Biscoe female 43.3 4.18 42.0
2 Biscoe male 47.1 4.69 42.0
3 Biscoe <NA> 45.6 1.37 42.0
4 Dream female 42.3 5.53 42.0
5 Dream male 46.1 5.77 42.0
6 Dream <NA> 37.5 NA 42.0
7 Torgersen female 37.6 2.21 42.0
8 Torgersen male 40.6 3.03 42.0
9 Torgersen <NA> 37.9 3.23 42.0
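One thing worth noticing: the mean column above is the mean of the nine group means (42.0), which is not the same number as the mean over all individual penguins (43.9, as calculated at the start of this episode). The two differ whenever the groups have different sizes. A minimal sketch with made-up numbers and illustrative object names:

```r
library(dplyr)

# Made-up data with unequal group sizes (illustrative only)
toy <- tibble(
  island         = c("A", "A", "A", "B"),
  bill_length_mm = c(40, 40, 40, 50)
)

# Mean of the two group means: (40 + 50) / 2
mean_of_means <- toy |>
  group_by(island) |>
  summarise(group_mean = mean(bill_length_mm)) |>
  summarise(value = mean(group_mean)) |>
  pull(value)

# Mean over all four rows: (40 + 40 + 40 + 50) / 4
overall_mean <- toy |>
  summarise(value = mean(bill_length_mm)) |>
  pull(value)
```

Which of the two you want depends on the question: the mean of group means weighs every group equally, while the overall mean weighs every observation equally.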
Grouped data manipulation
You might have noticed that we managed to do some data manipulation (i.e. mutate) while the data were still grouped, which in our example before produced unwanted results. But often, grouping before data manipulation can unlock great new possibilities for working with our data.
Let us use the data we made where we summarised the body mass of penguins in kilograms, and let us group by species and sex.
R
penguins |>
drop_na(body_mass_g) |>
group_by(species, sex) |>
summarise(
body_mass_kg_mean = mean(body_mass_g / 1000),
body_mass_kg_min = min(body_mass_g / 1000),
body_mass_kg_max = max(body_mass_g / 1000)
)
OUTPUT
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
OUTPUT
# A tibble: 8 × 5
# Groups: species [3]
species sex body_mass_kg_mean body_mass_kg_min body_mass_kg_max
<fct> <fct> <dbl> <dbl> <dbl>
1 Adelie female 3.37 2.85 3.9
2 Adelie male 4.04 3.32 4.78
3 Adelie <NA> 3.54 2.98 4.25
4 Chinstrap female 3.53 2.7 4.15
5 Chinstrap male 3.94 3.25 4.8
6 Gentoo female 4.68 3.95 5.2
7 Gentoo male 5.48 4.75 6.3
8 Gentoo <NA> 4.59 4.1 4.88
The data we get out is still grouped by species. Let us say that we want to know how the mean body mass of each sex relates to the species mean. We would need the species mean in addition to the means for each sex within a species. Because the data is still grouped by species, we can add this with a mutate().
R
penguins |>
drop_na(body_mass_g) |>
group_by(species, sex) |>
summarise(
body_mass_kg_mean = mean(body_mass_g / 1000),
body_mass_kg_min = min(body_mass_g / 1000),
body_mass_kg_max = max(body_mass_g / 1000)
) |>
mutate(
species_mean = mean(body_mass_kg_mean)
)
OUTPUT
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
OUTPUT
# A tibble: 8 × 6
# Groups: species [3]
species sex body_mass_kg_mean body_mass_kg_min body_mass_kg_max species_mean
<fct> <fct> <dbl> <dbl> <dbl> <dbl>
1 Adelie fema… 3.37 2.85 3.9 3.65
2 Adelie male 4.04 3.32 4.78 3.65
3 Adelie <NA> 3.54 2.98 4.25 3.65
4 Chinst… fema… 3.53 2.7 4.15 3.73
5 Chinst… male 3.94 3.25 4.8 3.73
6 Gentoo fema… 4.68 3.95 5.2 4.92
7 Gentoo male 5.48 4.75 6.3 4.92
8 Gentoo <NA> 4.59 4.1 4.88 4.92
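Notice that the species_mean column now holds the same value for every row of a species. This means our calculation worked! Keep in mind that species_mean is the mean of the per-sex means within each species (including the group with unknown sex), so it can differ from the mean over all individual penguins: 3.65 versus 3.70 for Adelie. So, in the same data set, we have everything we need to calculate the difference between the species mean of body mass and each of the sexes.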
R
penguins |>
drop_na(body_mass_g) |>
group_by(species, sex) |>
summarise(
body_mass_kg_mean = mean(body_mass_g / 1000),
body_mass_kg_min = min(body_mass_g / 1000),
body_mass_kg_max = max(body_mass_g / 1000)
) |>
mutate(
species_mean = mean(body_mass_kg_mean),
rel_species = species_mean - body_mass_kg_mean
)
OUTPUT
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
OUTPUT
# A tibble: 8 × 7
# Groups: species [3]
species sex body_mass_kg_mean body_mass_kg_min body_mass_kg_max species_mean
<fct> <fct> <dbl> <dbl> <dbl> <dbl>
1 Adelie fema… 3.37 2.85 3.9 3.65
2 Adelie male 4.04 3.32 4.78 3.65
3 Adelie <NA> 3.54 2.98 4.25 3.65
4 Chinst… fema… 3.53 2.7 4.15 3.73
5 Chinst… male 3.94 3.25 4.8 3.73
6 Gentoo fema… 4.68 3.95 5.2 4.92
7 Gentoo male 5.48 4.75 6.3 4.92
8 Gentoo <NA> 4.59 4.1 4.88 4.92
# ℹ 1 more variable: rel_species <dbl>
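Now we can see how far the mean body mass of each sex lies from its species mean. Because we subtract the sex mean from the species mean, a negative value means that sex is heavier than the species mean.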
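The same grouped mutate() idea also works on individual rows, not only on summary tables. As a hedged sketch on made-up data (the column and object names are illustrative), each value is compared with the mean of its own group:

```r
library(dplyr)

# Made-up body masses in kilograms (illustrative only)
toy <- tibble(
  species      = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  body_mass_kg = c(3.5, 4.0, 5.0, 5.5)
)

centered <- toy |>
  group_by(species) |>
  mutate(
    # mean() is evaluated within each species because of group_by()
    species_mean = mean(body_mass_kg),
    # Difference between each row and its own species mean
    diff_from_species = body_mass_kg - species_mean
  ) |>
  ungroup()
```

Unlike summarise(), mutate() keeps every row, so the group mean is repeated on each row of its group, and the final ungroup() leaves the table ready for further steps.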
Challenge 7
Calculate the difference in flipper length between the different species of penguin.
R
penguins |>
drop_na(flipper_length_mm) |>
group_by(species) |>
summarise(
flipper_mean = mean(flipper_length_mm)
) |>
mutate(
species_mean = mean(flipper_mean),
flipper_species_diff = species_mean - flipper_mean
)
OUTPUT
# A tibble: 3 × 4
species flipper_mean species_mean flipper_species_diff
<fct> <dbl> <dbl> <dbl>
1 Adelie 190. 201. 11.0
2 Chinstrap 196. 201. 5.16
3 Gentoo 217. 201. -16.2
Challenge 8
Calculate the difference in flipper length between the different species of penguin, split by the penguins' sex.
R
penguins |>
drop_na(flipper_length_mm) |>
group_by(species, sex) |>
summarise(
flipper_mean = mean(flipper_length_mm)
) |>
mutate(
species_mean = mean(flipper_mean),
flipper_species_diff = species_mean - flipper_mean
)
OUTPUT
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
OUTPUT
# A tibble: 8 × 5
# Groups: species [3]
species sex flipper_mean species_mean flipper_species_diff
<fct> <fct> <dbl> <dbl> <dbl>
1 Adelie female 188. 189. 0.807
2 Adelie male 192. 189. -3.81
3 Adelie <NA> 186. 189. 3.00
4 Chinstrap female 192. 196. 4.09
5 Chinstrap male 200. 196. -4.09
6 Gentoo female 213. 217. 3.96
7 Gentoo male 222. 217. -4.88
8 Gentoo <NA> 216. 217. 0.916