Working with data
Last updated on 2022-09-13 | Edit this page
Overview
Questions
- How do you manipulate tabular data in R?
Objectives
- Import CSV data into R.
- Understand the difference between base R and
tidyverse
approaches. - Subset rows and columns of data.frames.
- Use pipes to link steps together into pipelines.
- Create new data.frame columns using existing columns.
- Utilize the concept of split-apply-combine data analysis.
- Reshape data between wide and long formats.
- Export data to a CSV file.
R
library(tidyverse)
Importing data
Up until this point, we have been working with the complete_old
dataframe contained in the ratdat
package. However, you typically won’t access data from an R package; it is much more common to access data files stored somewhere on your computer. We are going to download a CSV file containing the surveys data to our computer, which we will then read into R.
Click this link to download the file: https://www.michaelc-m.com/Rewrite-R-ecology-lesson/data/cleaned/surveys_complete_77_89.csv.
You will be prompted to save the file on your computer somewhere. Save it inside the cleaned
data folder, which is in the data
folder in your R-Ecology-Workshop
folder. Once it’s inside our project, we will be able to point R towards it.
File paths
When we reference other files from an R script, we need to give R precise instructions on where those files are. We do that using something called a file path. It looks something like this: "Documents/Manuscripts/Chapter_2.txt"
. This path would tell your computer how to get from whatever folder contains the Documents
folder all the way to the .txt
file.
There are two kinds of paths: absolute and relative. Absolute paths are specific to a particular computer, whereas relative paths are relative to a certain folder. Because we are keeping all of our work in the R-Ecology-Workshop
folder, all of our paths can be relative to this folder.
Now, let’s read our CSV file into R and store it in an object named surveys
. We will use the read_csv
function from the tidyverse
’s readr
package, and the argument we give will be the relative path to the CSV file.
R
surveys <- read_csv("data/cleaned/surveys_complete_77_89.csv")
OUTPUT
Rows: 16878 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): species_id, sex, genus, species, taxa, plot_type
dbl (7): record_id, month, day, year, plot_id, hindfoot_length, weight
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Callout
Typing out paths can be error prone, so we can utilize a keyboard shortcut. Inside the parentheses of read_csv()
, type out a pair of quotes and put your cursor between them. Then hit Tab. A small menu showing your folders and files should show up. You can use the ↑ and ↓ keys to move through the options, or start typing to narrow them down. You can hit Enter to select a file or folder, and hit Tab again to continue building the file path. This might take a bit of getting used to, but once you get the hang of it, it will speed up writing file paths and reduce the number of mistakes you make.
You may have noticed a bit of feedback from R when you ran the last line of code. We got some useful information about the CSV file we read in. We can see:
- the number of rows and columns
- the delimiter of the file, which is how values are separated, a comma
","
- a set of columns that were parsed as various vector types
- the file has 6 character columns and 7 numeric columns
- we can see the names of the columns for each type
When working with the output of a new function, it’s often a good idea to check the class()
:
R
class(surveys)
OUTPUT
[1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
Whoa! What is this thing? It has multiple classes? Well, it’s called a tibble
, and it is the tidyverse
version of a data.frame. It is a data.frame, but with some added perks. It prints out a little more nicely, it highlights NA
values and negative values in red, and it will generally communicate with you more (in terms of warnings and errors, which is a good thing).
Callout
tidyverse
vs. base R
As we begin to delve more deeply into the tidyverse
, we should briefly pause to mention some of the reasons for focusing on the tidyverse
set of tools. In R, there are often many ways to get a job done, and there are other approaches that can accomplish tasks similar to the tidyverse
.
The phrase base R is used to refer to approaches that utilize functions contained in R’s default packages. We have already used some base R functions, such as str()
, head()
, and mean()
, and we will be using more scattered throughout this lesson. However, there are some key base R approaches we will not be teaching. These include square bracket subsetting and base plotting. You may come across code written by other people that looks like surveys[1:10, 2]
or plot(surveys$weight, surveys$hindfoot_length)
, which are base R commands. If you’re interested in learning more about these approaches, you can check out other Carpentries lessons like the Software Carpentry Programming with R lesson.
We choose to teach the tidyverse
set of packages because they share a similar syntax and philosophy, making them consistent and producing highly readable code. They are also very flexible and powerful, with a growing number of packages designed according to similar principles and to work well with the rest of the packages. The tidyverse
packages tend to have very clear documentation and wide array of learning materials that tend to be written with novice users in mind. Finally, the tidyverse
has only continued to grow, and has strong support from RStudio, which implies that these approaches will be relevant into the future.
Manipulating data
One of the most important skills for working with data in R is the ability to manipulate, modify, and reshape data. The dplyr
and tidyr
packages in the tidyverse
provide a series of powerful functions for many common data manipulation tasks.
We’ll start off with two of the most commonly used dplyr
functions: select()
, which selects certain columns of a data.frame, and filter()
, which filters out rows according to certain criteria.
select()
To use the select()
function, the first argument is the name of the data.frame, and the rest of the arguments are unquoted names of the columns you want:
R
select(surveys, plot_id, species_id, hindfoot_length)
OUTPUT
# A tibble: 16,878 × 3
plot_id species_id hindfoot_length
<dbl> <chr> <dbl>
1 2 NL 32
2 3 NL 33
3 2 DM 37
4 7 DM 36
5 3 DM 35
6 1 PF 14
7 2 PE NA
8 1 DM 37
9 1 DM 34
10 6 PF 20
# … with 16,868 more rows
The columns are arranged in the order we specified inside select()
.
To select all columns except specific columns, put a -
in front of the column you want to exclude:
R
select(surveys, -record_id, -year)
OUTPUT
# A tibble: 16,878 × 11
month day plot_id specie…¹ sex hindf…² weight genus species taxa plot_…³
<dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
1 7 16 2 NL M 32 NA Neot… albigu… Rode… Control
2 7 16 3 NL M 33 NA Neot… albigu… Rode… Long-t…
3 7 16 2 DM F 37 NA Dipo… merria… Rode… Control
4 7 16 7 DM M 36 NA Dipo… merria… Rode… Rodent…
5 7 16 3 DM M 35 NA Dipo… merria… Rode… Long-t…
6 7 16 1 PF M 14 NA Pero… flavus Rode… Specta…
7 7 16 2 PE F NA NA Pero… eremic… Rode… Control
8 7 16 1 DM M 37 NA Dipo… merria… Rode… Specta…
9 7 16 1 DM F 34 NA Dipo… merria… Rode… Specta…
10 7 16 6 PF F 20 NA Pero… flavus Rode… Short-…
# … with 16,868 more rows, and abbreviated variable names ¹species_id,
# ²hindfoot_length, ³plot_type
select()
also works with numeric vectors for the order of the columns. To select the 3rd, 4th, 5th, and 10th columns, we could run the following code:
R
select(surveys, c(3:5, 10))
OUTPUT
# A tibble: 16,878 × 4
day year plot_id genus
<dbl> <dbl> <dbl> <chr>
1 16 1977 2 Neotoma
2 16 1977 3 Neotoma
3 16 1977 2 Dipodomys
4 16 1977 7 Dipodomys
5 16 1977 3 Dipodomys
6 16 1977 1 Perognathus
7 16 1977 2 Peromyscus
8 16 1977 1 Dipodomys
9 16 1977 1 Dipodomys
10 16 1977 6 Perognathus
# … with 16,868 more rows
You should be careful when using this method, since you are being less explicit about which columns you want. However, it can be useful if you have a data.frame with many columns and you don’t want to type out too many names.
Finally, you can select columns based on whether they match a certain criteria by using the where()
function. If we want all numeric columns, we can ask to select
all the columns where
the class is numeric
:
R
select(surveys, where(is.numeric))
OUTPUT
# A tibble: 16,878 × 7
record_id month day year plot_id hindfoot_length weight
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 7 16 1977 2 32 NA
2 2 7 16 1977 3 33 NA
3 3 7 16 1977 2 37 NA
4 4 7 16 1977 7 36 NA
5 5 7 16 1977 3 35 NA
6 6 7 16 1977 1 14 NA
7 7 7 16 1977 2 NA NA
8 8 7 16 1977 1 37 NA
9 9 7 16 1977 1 34 NA
10 10 7 16 1977 6 20 NA
# … with 16,868 more rows
Instead of giving names or positions of columns, we instead pass the where()
function with the name of another function inside it, in this case is.numeric()
, and we get all the columns for which that function returns TRUE
.
We can use this to select any columns that have any NA
values in them:
R
select(surveys, where(anyNA))
OUTPUT
# A tibble: 16,878 × 7
species_id sex hindfoot_length weight genus species taxa
<chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
1 NL M 32 NA Neotoma albigula Rodent
2 NL M 33 NA Neotoma albigula Rodent
3 DM F 37 NA Dipodomys merriami Rodent
4 DM M 36 NA Dipodomys merriami Rodent
5 DM M 35 NA Dipodomys merriami Rodent
6 PF M 14 NA Perognathus flavus Rodent
7 PE F NA NA Peromyscus eremicus Rodent
8 DM M 37 NA Dipodomys merriami Rodent
9 DM F 34 NA Dipodomys merriami Rodent
10 PF F 20 NA Perognathus flavus Rodent
# … with 16,868 more rows
filter()
The filter()
function is used to select rows that meet certain criteria. To get all the rows where the value of year
is equal to 1985, we would run the following:
R
filter(surveys, year == 1985)
OUTPUT
# A tibble: 1,438 × 13
record…¹ month day year plot_id speci…² sex hindf…³ weight genus species
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr>
1 9790 1 19 1985 16 RM F 16 4 Reit… megalo…
2 9791 1 19 1985 17 OT F 20 16 Onyc… torrid…
3 9792 1 19 1985 6 DO M 35 48 Dipo… ordii
4 9793 1 19 1985 12 DO F 35 40 Dipo… ordii
5 9794 1 19 1985 24 RM M 16 4 Reit… megalo…
6 9795 1 19 1985 12 DO M 34 48 Dipo… ordii
7 9796 1 19 1985 6 DM F 37 35 Dipo… merria…
8 9797 1 19 1985 14 DM M 36 45 Dipo… merria…
9 9798 1 19 1985 6 DM F 36 38 Dipo… merria…
10 9799 1 19 1985 19 RM M 16 4 Reit… megalo…
# … with 1,428 more rows, 2 more variables: taxa <chr>, plot_type <chr>, and
# abbreviated variable names ¹record_id, ²species_id, ³hindfoot_length
The ==
sign means “is equal to”. There are several other operators we can use: >, >=, <, <=, and != (not equal to). Another useful operator is %in%
, which asks if the value on the lefthand side is found anywhere in the vector on the righthand side. For example, to get rows with specific species_id
values, we could run:
R
filter(surveys, species_id %in% c("RM", "DO"))
OUTPUT
# A tibble: 2,835 × 13
record…¹ month day year plot_id speci…² sex hindf…³ weight genus species
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr>
1 68 8 19 1977 8 DO F 32 52 Dipo… ordii
2 292 10 17 1977 3 DO F 36 33 Dipo… ordii
3 294 10 17 1977 3 DO F 37 50 Dipo… ordii
4 311 10 17 1977 19 RM M 18 13 Reit… megalo…
5 317 10 17 1977 17 DO F 32 48 Dipo… ordii
6 323 10 17 1977 17 DO F 33 31 Dipo… ordii
7 337 10 18 1977 8 DO F 35 41 Dipo… ordii
8 356 11 12 1977 1 DO F 32 44 Dipo… ordii
9 378 11 12 1977 1 DO M 33 48 Dipo… ordii
10 397 11 13 1977 17 RM F 16 7 Reit… megalo…
# … with 2,825 more rows, 2 more variables: taxa <chr>, plot_type <chr>, and
# abbreviated variable names ¹record_id, ²species_id, ³hindfoot_length
We can also use multiple conditions in one filter()
statement. Here we will get rows with a year less than or equal to 1988 and whose hindfoot length values are not NA
. The !
before the is.na()
function means “not”.
R
filter(surveys, year <= 1988 & !is.na(hindfoot_length))
OUTPUT
# A tibble: 12,779 × 13
record…¹ month day year plot_id speci…² sex hindf…³ weight genus species
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr>
1 1 7 16 1977 2 NL M 32 NA Neot… albigu…
2 2 7 16 1977 3 NL M 33 NA Neot… albigu…
3 3 7 16 1977 2 DM F 37 NA Dipo… merria…
4 4 7 16 1977 7 DM M 36 NA Dipo… merria…
5 5 7 16 1977 3 DM M 35 NA Dipo… merria…
6 6 7 16 1977 1 PF M 14 NA Pero… flavus
7 8 7 16 1977 1 DM M 37 NA Dipo… merria…
8 9 7 16 1977 1 DM F 34 NA Dipo… merria…
9 10 7 16 1977 6 PF F 20 NA Pero… flavus
10 11 7 16 1977 5 DS F 53 NA Dipo… specta…
# … with 12,769 more rows, 2 more variables: taxa <chr>, plot_type <chr>, and
# abbreviated variable names ¹record_id, ²species_id, ³hindfoot_length
R
surveys_filtered <- filter(surveys, year >= 1980 & year <= 1985)
R
surveys_selected <- select(surveys, year, month, species_id, plot_id)
The pipe: %>%
What happens if we want to both select()
and filter()
our data? We have a couple options. First, we could use nested functions:
R
filter(select(surveys, -day), month >= 7)
OUTPUT
# A tibble: 8,244 × 12
record…¹ month year plot_id speci…² sex hindf…³ weight genus species taxa
<dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
1 1 7 1977 2 NL M 32 NA Neot… albigu… Rode…
2 2 7 1977 3 NL M 33 NA Neot… albigu… Rode…
3 3 7 1977 2 DM F 37 NA Dipo… merria… Rode…
4 4 7 1977 7 DM M 36 NA Dipo… merria… Rode…
5 5 7 1977 3 DM M 35 NA Dipo… merria… Rode…
6 6 7 1977 1 PF M 14 NA Pero… flavus Rode…
7 7 7 1977 2 PE F NA NA Pero… eremic… Rode…
8 8 7 1977 1 DM M 37 NA Dipo… merria… Rode…
9 9 7 1977 1 DM F 34 NA Dipo… merria… Rode…
10 10 7 1977 6 PF F 20 NA Pero… flavus Rode…
# … with 8,234 more rows, 1 more variable: plot_type <chr>, and abbreviated
# variable names ¹record_id, ²species_id, ³hindfoot_length
R will evaluate statements from the inside out. First, select()
will operate on the surveys
data.frame, removing the column day
. The resulting data.frame is then used as the first argument for filter()
, which selects rows with a month greater than or equal to 7.
Nested functions can be very difficult to read with only a few functions, and nearly impossible when many functions are done at once. An alternative approach is to create intermediate objects:
R
surveys_noday <- select(surveys, -day)
filter(surveys_noday, month >= 7)
OUTPUT
# A tibble: 8,244 × 12
record…¹ month year plot_id speci…² sex hindf…³ weight genus species taxa
<dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
1 1 7 1977 2 NL M 32 NA Neot… albigu… Rode…
2 2 7 1977 3 NL M 33 NA Neot… albigu… Rode…
3 3 7 1977 2 DM F 37 NA Dipo… merria… Rode…
4 4 7 1977 7 DM M 36 NA Dipo… merria… Rode…
5 5 7 1977 3 DM M 35 NA Dipo… merria… Rode…
6 6 7 1977 1 PF M 14 NA Pero… flavus Rode…
7 7 7 1977 2 PE F NA NA Pero… eremic… Rode…
8 8 7 1977 1 DM M 37 NA Dipo… merria… Rode…
9 9 7 1977 1 DM F 34 NA Dipo… merria… Rode…
10 10 7 1977 6 PF F 20 NA Pero… flavus Rode…
# … with 8,234 more rows, 1 more variable: plot_type <chr>, and abbreviated
# variable names ¹record_id, ²species_id, ³hindfoot_length
This approach is easier to read, since we can see the steps in order, but after enough steps, we are left with a cluttered mess of intermediate objects, often with confusing names.
An elegant solution to this problem is an operator called the pipe, which looks like %>%
. You can insert it by using the keyboard shortcut Shift+Cmd+M (Mac) or Shift+Ctrl+M (Windows). Here’s how you could use a pipe to select and filter in one step:
R
surveys %>%
select(-day) %>%
filter(month >= 7)
OUTPUT
# A tibble: 8,244 × 12
record…¹ month year plot_id speci…² sex hindf…³ weight genus species taxa
<dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
1 1 7 1977 2 NL M 32 NA Neot… albigu… Rode…
2 2 7 1977 3 NL M 33 NA Neot… albigu… Rode…
3 3 7 1977 2 DM F 37 NA Dipo… merria… Rode…
4 4 7 1977 7 DM M 36 NA Dipo… merria… Rode…
5 5 7 1977 3 DM M 35 NA Dipo… merria… Rode…
6 6 7 1977 1 PF M 14 NA Pero… flavus Rode…
7 7 7 1977 2 PE F NA NA Pero… eremic… Rode…
8 8 7 1977 1 DM M 37 NA Dipo… merria… Rode…
9 9 7 1977 1 DM F 34 NA Dipo… merria… Rode…
10 10 7 1977 6 PF F 20 NA Pero… flavus Rode…
# … with 8,234 more rows, 1 more variable: plot_type <chr>, and abbreviated
# variable names ¹record_id, ²species_id, ³hindfoot_length
What it does is take the thing on the lefthand side and insert it as the first argument of the function on the righthand side. By putting each of our functions onto a new line, we can build a nice, readable pipeline. It can be useful to think of this as a little assembly line for our data. It starts at the top and gets piped into a select()
function, and it comes out modified somewhat. It then gets sent into the filter()
function, where it is further modified, and then the final product gets printed out to our console. It can also be helpful to think of %>%
as meaning “and then”. Since many tidyverse
functions have verbs for names, a pipeline can be read like a sentence.
If we want to store this final product as an object, we use an assignment arrow at the start:
R
surveys_sub <- surveys %>%
select(-day) %>%
filter(month >= 7)
A good approach is to build a pipeline step by step prior to assignment. You add functions to the pipeline as you go, with the results printing in the console for you to view. Once you’re satisfied with your final result, go back and add the assignment arrow statement at the start. This approach is very interactive, allowing you to see the results of each step as you build the pipeline, and produces nicely readable code.
R
surveys_1988 <- surveys %>%
filter(year == 1988) %>%
select(record_id, month, species_id)
Make sure to filter()
before you select()
. You need to use the year
column for filtering rows, but it is discarded in the select()
step. You also need to make sure to use ==
instead of =
when you are filtering rows where year
is equal to 1988.
Making new columns with mutate()
Another common task is creating a new column based on values in existing columns. For example, we could add a new column that has the weight in kilograms instead of grams:
R
surveys %>%
mutate(weight_kg = weight / 1000)
OUTPUT
# A tibble: 16,878 × 14
record…¹ month day year plot_id speci…² sex hindf…³ weight genus species
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr>
1 1 7 16 1977 2 NL M 32 NA Neot… albigu…
2 2 7 16 1977 3 NL M 33 NA Neot… albigu…
3 3 7 16 1977 2 DM F 37 NA Dipo… merria…
4 4 7 16 1977 7 DM M 36 NA Dipo… merria…
5 5 7 16 1977 3 DM M 35 NA Dipo… merria…
6 6 7 16 1977 1 PF M 14 NA Pero… flavus
7 7 7 16 1977 2 PE F NA NA Pero… eremic…
8 8 7 16 1977 1 DM M 37 NA Dipo… merria…
9 9 7 16 1977 1 DM F 34 NA Dipo… merria…
10 10 7 16 1977 6 PF F 20 NA Pero… flavus
# … with 16,868 more rows, 3 more variables: taxa <chr>, plot_type <chr>,
# weight_kg <dbl>, and abbreviated variable names ¹record_id, ²species_id,
# ³hindfoot_length
You can create multiple columns in one mutate()
call, and they will get created in the order you write them. This means you can even reference the first new column in the second new column:
R
surveys %>%
mutate(weight_kg = weight / 1000,
weight_lbs = weight_kg * 2.2)
OUTPUT
# A tibble: 16,878 × 15
record…¹ month day year plot_id speci…² sex hindf…³ weight genus species
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr>
1 1 7 16 1977 2 NL M 32 NA Neot… albigu…
2 2 7 16 1977 3 NL M 33 NA Neot… albigu…
3 3 7 16 1977 2 DM F 37 NA Dipo… merria…
4 4 7 16 1977 7 DM M 36 NA Dipo… merria…
5 5 7 16 1977 3 DM M 35 NA Dipo… merria…
6 6 7 16 1977 1 PF M 14 NA Pero… flavus
7 7 7 16 1977 2 PE F NA NA Pero… eremic…
8 8 7 16 1977 1 DM M 37 NA Dipo… merria…
9 9 7 16 1977 1 DM F 34 NA Dipo… merria…
10 10 7 16 1977 6 PF F 20 NA Pero… flavus
# … with 16,868 more rows, 4 more variables: taxa <chr>, plot_type <chr>,
# weight_kg <dbl>, weight_lbs <dbl>, and abbreviated variable names
# ¹record_id, ²species_id, ³hindfoot_length
We can also use multiple columns to create a single column. For example, it’s often good practice to keep the components of a date in separate columns until necessary, as we’ve done here. This is because programs like Excel can do automatic things with dates in a way that is not reproducible and sometimes hard to notice. However, now that we are working in R, we can safely put together a date column.
To put together the columns into something that looks like a date, we can use the paste()
function, which takes arguments of the items to paste together, as well as the argument sep
, which is the character used to separate the items.
R
surveys %>%
mutate(date = paste(year, month, day, sep = "-"))
OUTPUT
# A tibble: 16,878 × 14
record…¹ month day year plot_id speci…² sex hindf…³ weight genus species
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr>
1 1 7 16 1977 2 NL M 32 NA Neot… albigu…
2 2 7 16 1977 3 NL M 33 NA Neot… albigu…
3 3 7 16 1977 2 DM F 37 NA Dipo… merria…
4 4 7 16 1977 7 DM M 36 NA Dipo… merria…
5 5 7 16 1977 3 DM M 35 NA Dipo… merria…
6 6 7 16 1977 1 PF M 14 NA Pero… flavus
7 7 7 16 1977 2 PE F NA NA Pero… eremic…
8 8 7 16 1977 1 DM M 37 NA Dipo… merria…
9 9 7 16 1977 1 DM F 34 NA Dipo… merria…
10 10 7 16 1977 6 PF F 20 NA Pero… flavus
# … with 16,868 more rows, 3 more variables: taxa <chr>, plot_type <chr>,
# date <chr>, and abbreviated variable names ¹record_id, ²species_id,
# ³hindfoot_length
Since our new column gets moved all the way to the end, it doesn’t end up printing out. We can use the relocate()
function to put it after our year
column:
R
surveys %>%
mutate(date = paste(year, month, day, sep = "-")) %>%
relocate(date, .after = year)
OUTPUT
# A tibble: 16,878 × 14
record_id month day year date plot_id speci…¹ sex hindf…² weight genus
<dbl> <dbl> <dbl> <dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl> <chr>
1 1 7 16 1977 1977-… 2 NL M 32 NA Neot…
2 2 7 16 1977 1977-… 3 NL M 33 NA Neot…
3 3 7 16 1977 1977-… 2 DM F 37 NA Dipo…
4 4 7 16 1977 1977-… 7 DM M 36 NA Dipo…
5 5 7 16 1977 1977-… 3 DM M 35 NA Dipo…
6 6 7 16 1977 1977-… 1 PF M 14 NA Pero…
7 7 7 16 1977 1977-… 2 PE F NA NA Pero…
8 8 7 16 1977 1977-… 1 DM M 37 NA Dipo…
9 9 7 16 1977 1977-… 1 DM F 34 NA Dipo…
10 10 7 16 1977 1977-… 6 PF F 20 NA Pero…
# … with 16,868 more rows, 3 more variables: species <chr>, taxa <chr>,
# plot_type <chr>, and abbreviated variable names ¹species_id,
# ²hindfoot_length
Now we can see that we have a character column that contains our date string. However, it’s not truly a date column. Dates are a type of numeric variable with a defined, ordered scale. To turn this column into a proper date, we will use a function from the tidyverse
’s lubridate
package, which has lots of useful functions for working with dates. The function ymd()
will parse a date string that has the order year-month-day. Let’s load the package and use ymd()
.
R
library(lubridate)
OUTPUT
Attaching package: 'lubridate'
OUTPUT
The following objects are masked from 'package:base':
date, intersect, setdiff, union
R
surveys %>%
mutate(date = paste(year, month, day, sep = "-"),
date = ymd(date)) %>%
relocate(date, .after = year)
OUTPUT
# A tibble: 16,878 × 14
record_id month day year date plot_id species…¹ sex hindf…² weight
<dbl> <dbl> <dbl> <dbl> <date> <dbl> <chr> <chr> <dbl> <dbl>
1 1 7 16 1977 1977-07-16 2 NL M 32 NA
2 2 7 16 1977 1977-07-16 3 NL M 33 NA
3 3 7 16 1977 1977-07-16 2 DM F 37 NA
4 4 7 16 1977 1977-07-16 7 DM M 36 NA
5 5 7 16 1977 1977-07-16 3 DM M 35 NA
6 6 7 16 1977 1977-07-16 1 PF M 14 NA
7 7 7 16 1977 1977-07-16 2 PE F NA NA
8 8 7 16 1977 1977-07-16 1 DM M 37 NA
9 9 7 16 1977 1977-07-16 1 DM F 34 NA
10 10 7 16 1977 1977-07-16 6 PF F 20 NA
# … with 16,868 more rows, 4 more variables: genus <chr>, species <chr>,
# taxa <chr>, plot_type <chr>, and abbreviated variable names ¹species_id,
# ²hindfoot_length
Now we can see that our date
column has the type date
as well. In this example, we created our column with two separate lines in mutate()
, but we can combine them into one:
R
# using nested functions
surveys %>%
mutate(date = ymd(paste(year, month, day, sep = "-"))) %>%
relocate(date, .after = year)
OUTPUT
# A tibble: 16,878 × 14
record_id month day year date plot_id species…¹ sex hindf…² weight
<dbl> <dbl> <dbl> <dbl> <date> <dbl> <chr> <chr> <dbl> <dbl>
1 1 7 16 1977 1977-07-16 2 NL M 32 NA
2 2 7 16 1977 1977-07-16 3 NL M 33 NA
3 3 7 16 1977 1977-07-16 2 DM F 37 NA
4 4 7 16 1977 1977-07-16 7 DM M 36 NA
5 5 7 16 1977 1977-07-16 3 DM M 35 NA
6 6 7 16 1977 1977-07-16 1 PF M 14 NA
7 7 7 16 1977 1977-07-16 2 PE F NA NA
8 8 7 16 1977 1977-07-16 1 DM M 37 NA
9 9 7 16 1977 1977-07-16 1 DM F 34 NA
10 10 7 16 1977 1977-07-16 6 PF F 20 NA
# … with 16,868 more rows, 4 more variables: genus <chr>, species <chr>,
# taxa <chr>, plot_type <chr>, and abbreviated variable names ¹species_id,
# ²hindfoot_length
R
# using a pipe *inside* mutate()
surveys %>%
mutate(date = paste(year, month, day,
sep = "-") %>% ymd()) %>%
relocate(date, .after = year)
OUTPUT
# A tibble: 16,878 × 14
record_id month day year date plot_id species…¹ sex hindf…² weight
<dbl> <dbl> <dbl> <dbl> <date> <dbl> <chr> <chr> <dbl> <dbl>
1 1 7 16 1977 1977-07-16 2 NL M 32 NA
2 2 7 16 1977 1977-07-16 3 NL M 33 NA
3 3 7 16 1977 1977-07-16 2 DM F 37 NA
4 4 7 16 1977 1977-07-16 7 DM M 36 NA
5 5 7 16 1977 1977-07-16 3 DM M 35 NA
6 6 7 16 1977 1977-07-16 1 PF M 14 NA
7 7 7 16 1977 1977-07-16 2 PE F NA NA
8 8 7 16 1977 1977-07-16 1 DM M 37 NA
9 9 7 16 1977 1977-07-16 1 DM F 34 NA
10 10 7 16 1977 1977-07-16 6 PF F 20 NA
# … with 16,868 more rows, 4 more variables: genus <chr>, species <chr>,
# taxa <chr>, plot_type <chr>, and abbreviated variable names ¹species_id,
# ²hindfoot_length
R
surveys %>%
mutate(date = ymd(paste(year, month, day, sep = "-"))) %>%
ggplot(aes(x = date, y = weight)) +
geom_jitter(alpha = 0.1)
WARNING
Warning: Removed 1692 rows containing missing values (geom_point).
This isn’t necessarily the most useful plot, but we will learn some techniques that will help produce nice time series plots
The split-apply-combine approach
Many data analysis tasks can be achieved using the split-apply-combine approach: you split the data into groups, apply some analysis to each group, and combine the results in some way. dplyr
has a few convenient functions to enable this approach, the main two being group_by()
and summarize()
.
group_by()
takes a data.frame and the name of one or more columns with categorical values that define the groups. summarize()
then collapses each group into a one-row summary of the group, giving you back a data.frame with one row per group. The syntax for summarize()
is similar to mutate()
, where you define new columns based on values of other columns. Let’s try calculating the mean weight of all our animals by sex.
R
surveys %>%
group_by(sex) %>%
summarize(mean_weight = mean(weight, na.rm = T))
OUTPUT
# A tibble: 3 × 2
sex mean_weight
<chr> <dbl>
1 F 53.1
2 M 53.2
3 <NA> 74.0
You can see that the mean weight for males is slightly higher than for females, but that animals whose sex is unknown have much higher weights. This is probably due to small sample size, but we should check to be sure. Like mutate()
, we can define multiple columns in one summarize()
call. The function n()
will count the number of rows in each group.
R
surveys %>%
group_by(sex) %>%
summarize(mean_weight = mean(weight, na.rm = T),
n = n())
OUTPUT
# A tibble: 3 × 3
sex mean_weight n
<chr> <dbl> <int>
1 F 53.1 7318
2 M 53.2 8260
3 <NA> 74.0 1300
You will often want to create groups based on multiple columns. For example, we might be interested in the mean weight of every species + sex combination. All we have to do is add another column to our group_by()
call.
R
surveys %>%
group_by(species_id, sex) %>%
summarize(mean_weight = mean(weight, na.rm = T),
n = n())
OUTPUT
`summarise()` has grouped output by 'species_id'. You can override using the
`.groups` argument.
OUTPUT
# A tibble: 67 × 4
# Groups: species_id [36]
species_id sex mean_weight n
<chr> <chr> <dbl> <int>
1 AB <NA> NaN 223
2 AH <NA> NaN 136
3 BA M 7 3
4 CB <NA> NaN 23
5 CM <NA> NaN 13
6 CQ <NA> NaN 16
7 CS <NA> NaN 1
8 CV <NA> NaN 1
9 DM F 40.7 2522
10 DM M 44.0 3108
# … with 57 more rows
Our resulting data.frame is much larger, since we have a greater number of groups. We also see a strange value showing up in our mean_weight
column: NaN
. This stands for “Not a Number”, and it often results from trying to do an operation a vector with zero entries. How can a vector have zero entries? Well, if a particular group (like the AB species ID + NA
sex group) has only NA
values for weight, then the na.rm = T
argument in mean()
will remove all the values prior to calculating the mean. The result will be a value of NaN
. Since we are not particularly interested in these values, let’s add a step to our pipeline to remove rows where weight is NA
before doing any other steps. This means that any groups with only NA
values will disappear from our data.frame before we formally create the groups with group_by()
.
R
surveys %>%
filter(!is.na(weight)) %>%
group_by(species_id, sex) %>%
summarize(mean_weight = mean(weight),
n = n())
OUTPUT
`summarise()` has grouped output by 'species_id'. You can override using the
`.groups` argument.
OUTPUT
# A tibble: 46 × 4
# Groups: species_id [18]
species_id sex mean_weight n
<chr> <chr> <dbl> <int>
1 BA M 7 3
2 DM F 40.7 2460
3 DM M 44.0 3013
4 DM <NA> 37 8
5 DO F 48.4 679
6 DO M 49.3 748
7 DO <NA> 44 1
8 DS F 118. 1055
9 DS M 123. 1184
10 DS <NA> 121. 16
# … with 36 more rows
That looks better! It’s often useful to take a look at the results in some order, like the lowest mean weight to highest. We can use the arrange()
function for that:
R
surveys %>%
filter(!is.na(weight)) %>%
group_by(species_id, sex) %>%
summarize(mean_weight = mean(weight),
n = n()) %>%
arrange(mean_weight)
OUTPUT
`summarise()` has grouped output by 'species_id'. You can override using the
`.groups` argument.
OUTPUT
# A tibble: 46 × 4
# Groups: species_id [18]
species_id sex mean_weight n
<chr> <chr> <dbl> <int>
1 PF <NA> 6 2
2 BA M 7 3
3 PF F 7.09 215
4 PF M 7.10 296
5 RM M 9.92 678
6 RM <NA> 10.4 7
7 RM F 10.7 629
8 RF M 12.4 16
9 RF F 13.7 46
10 PP <NA> 15 2
# … with 36 more rows
If we want to reverse the order, we can wrap the column name in desc()
:
R
surveys %>%
filter(!is.na(weight)) %>%
group_by(species_id, sex) %>%
summarize(mean_weight = mean(weight),
n = n()) %>%
arrange(desc(mean_weight))
OUTPUT
`summarise()` has grouped output by 'species_id'. You can override using the
`.groups` argument.
OUTPUT
# A tibble: 46 × 4
# Groups: species_id [18]
species_id sex mean_weight n
<chr> <chr> <dbl> <int>
1 NL M 168. 355
2 NL <NA> 164. 9
3 NL F 151. 460
4 SS M 130 1
5 DS M 123. 1184
6 DS <NA> 121. 16
7 DS F 118. 1055
8 SH F 79.2 61
9 SH M 67.6 34
10 SF F 58.3 3
# … with 36 more rows
You may have seen several messages saying summarise() has grouped output by 'species_id'. You can override using the .groups argument.
These are warning you that your resulting data.frame has retained some group structure, which means any subsequent operations on that data.frame will happen at the group level. If you look at the resulting data.frame printed out in your console, you will see these lines:
# A tibble: 46 × 4
# Groups: species_id [18]
They tell us we have a data.frame with 46 rows, 4 columns, and a group variable species_id
, for which there are 18 groups. We will see something similar if we use group_by()
alone:
R
surveys %>%
group_by(species_id, sex)
OUTPUT
# A tibble: 16,878 × 13
# Groups: species_id, sex [67]
record…¹ month day year plot_id speci…² sex hindf…³ weight genus species
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr>
1 1 7 16 1977 2 NL M 32 NA Neot… albigu…
2 2 7 16 1977 3 NL M 33 NA Neot… albigu…
3 3 7 16 1977 2 DM F 37 NA Dipo… merria…
4 4 7 16 1977 7 DM M 36 NA Dipo… merria…
5 5 7 16 1977 3 DM M 35 NA Dipo… merria…
6 6 7 16 1977 1 PF M 14 NA Pero… flavus
7 7 7 16 1977 2 PE F NA NA Pero… eremic…
8 8 7 16 1977 1 DM M 37 NA Dipo… merria…
9 9 7 16 1977 1 DM F 34 NA Dipo… merria…
10 10 7 16 1977 6 PF F 20 NA Pero… flavus
# … with 16,868 more rows, 2 more variables: taxa <chr>, plot_type <chr>, and
# abbreviated variable names ¹record_id, ²species_id, ³hindfoot_length
What we get back is the entire surveys
data.frame, but with the grouping variables added: 67 groups of species_id
+ sex
combinations. Groups are often maintained throughout a pipeline, and if you assign the resulting data.frame to a new object, it will also have those groups. This can lead to confusing results if you forget about the grouping and want to carry out operations on the whole data.frame, not by group. Therefore, it is a good habit to remove the groups at the end of a pipeline containing group_by()
:
R
surveys %>%
filter(!is.na(weight)) %>%
group_by(species_id, sex) %>%
summarize(mean_weight = mean(weight),
n = n()) %>%
arrange(desc(mean_weight)) %>%
ungroup()
OUTPUT
`summarise()` has grouped output by 'species_id'. You can override using the
`.groups` argument.
OUTPUT
# A tibble: 46 × 4
species_id sex mean_weight n
<chr> <chr> <dbl> <int>
1 NL M 168. 355
2 NL <NA> 164. 9
3 NL F 151. 460
4 SS M 130 1
5 DS M 123. 1184
6 DS <NA> 121. 16
7 DS F 118. 1055
8 SH F 79.2 61
9 SH M 67.6 34
10 SF F 58.3 3
# … with 36 more rows
Now our data.frame just says # A tibble: 46 × 4
at the top, with no groups.
While it is common that you will want to get the one-row-per-group summary that summarise()
provides, there are times where you want to calculate a per-group value but keep all the rows in your data.frame. For example, we might want to know the mean weight for each species ID + sex combination, and then we might want to know how far from that mean value each observation in the group is. For this, we can use group_by()
and mutate()
together:
R
surveys %>%
filter(!is.na(weight)) %>%
group_by(species_id, sex) %>%
mutate(mean_weight = mean(weight),
weight_diff = weight - mean_weight)
OUTPUT
# A tibble: 15,186 × 15
# Groups: species_id, sex [46]
record…¹ month day year plot_id speci…² sex hindf…³ weight genus species
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr>
1 63 8 19 1977 3 DM M 35 40 Dipo… merria…
2 64 8 19 1977 7 DM M 37 48 Dipo… merria…
3 65 8 19 1977 4 DM F 34 29 Dipo… merria…
4 66 8 19 1977 4 DM F 35 46 Dipo… merria…
5 67 8 19 1977 7 DM M 35 36 Dipo… merria…
6 68 8 19 1977 8 DO F 32 52 Dipo… ordii
7 69 8 19 1977 2 PF M 15 8 Pero… flavus
8 70 8 19 1977 3 OX F 21 22 Onyc… sp.
9 71 8 19 1977 7 DM F 36 35 Dipo… merria…
10 74 8 19 1977 8 PF M 12 7 Pero… flavus
# … with 15,176 more rows, 4 more variables: taxa <chr>, plot_type <chr>,
# mean_weight <dbl>, weight_diff <dbl>, and abbreviated variable names
# ¹record_id, ²species_id, ³hindfoot_length
Since we get all our columns back, the new columns are at the very end and don’t print out in the console. Let’s use select()
to just look at the columns of interest. Inside select()
we can use the contains()
function to get any column containing the word “weight” in the name:
R
surveys %>%
filter(!is.na(weight)) %>%
group_by(species_id, sex) %>%
mutate(mean_weight = mean(weight),
weight_diff = weight - mean_weight) %>%
select(species_id, sex, contains("weight"))
OUTPUT
# A tibble: 15,186 × 5
# Groups: species_id, sex [46]
species_id sex weight mean_weight weight_diff
<chr> <chr> <dbl> <dbl> <dbl>
1 DM M 40 44.0 -4.00
2 DM M 48 44.0 4.00
3 DM F 29 40.7 -11.7
4 DM F 46 40.7 5.28
5 DM M 36 44.0 -8.00
6 DO F 52 48.4 3.63
7 PF M 8 7.10 0.902
8 OX F 22 21 1
9 DM F 35 40.7 -5.72
10 PF M 7 7.10 -0.0980
# … with 15,176 more rows
What happens with the group_by()
+ mutate()
combination is similar to using summarize()
: for each group, the mean weight is calculated. However, instead of reporting only one row per group, the mean weight for each group is added to each row in that group. For each row in a group (like DM species ID + M sex), you will see the same value in mean_weight
.
R
surveys_daily_counts <- surveys %>%
mutate(date = ymd(paste(year, month, day, sep = "-"))) %>%
group_by(date, sex) %>%
summarize(n = n())
OUTPUT
`summarise()` has grouped output by 'date'. You can override using the
`.groups` argument.
R
# shorter approach using count()
surveys_daily_counts <- surveys %>%
mutate(date = ymd(paste(year, month, day, sep = "-"))) %>%
count(date, sex)
Challenge 4: Making a time series (continued)
- Now use the data.frame you just made to plot the daily number of animals of each sex caught over time. It’s up to you what
geom
to use, but aline
plot might be a good choice. You should also think about how to differentiate which data corresponds to which sex.
R
surveys_daily_counts %>%
ggplot(aes(x = date, y = n, color = sex)) +
geom_line()
Reshaping data with tidyr
Let’s say we are interested in comparing the mean weights of each species across our different plots. We can begin this process using the group_by()
+ summarize()
approach:
R
sp_by_plot <- surveys %>%
filter(!is.na(weight)) %>%
group_by(species_id, plot_id) %>%
summarise(mean_weight = mean(weight)) %>%
arrange(species_id, plot_id)
OUTPUT
`summarise()` has grouped output by 'species_id'. You can override using the
`.groups` argument.
R
sp_by_plot
OUTPUT
# A tibble: 300 × 3
# Groups: species_id [18]
species_id plot_id mean_weight
<chr> <dbl> <dbl>
1 BA 3 8
2 BA 21 6.5
3 DM 1 42.7
4 DM 2 42.6
5 DM 3 41.2
6 DM 4 41.9
7 DM 5 42.6
8 DM 6 42.1
9 DM 7 43.2
10 DM 8 43.4
# … with 290 more rows
That looks great, but it is a bit difficult to compare values across plots. It would be nice if we could reshape this data.frame to make those comparisons easier. Well, the tidyr
package from the tidyverse
has a pair of functions that allow you to reshape data by pivoting it: pivot_wider()
and pivot_longer()
. pivot_wider()
will make the data wider, which means increasing the number of columns and reducing the number of rows. pivot_longer()
will do the opposite, reducing the number of columns and increasing the number of rows.
In this case, it might be nice to create a data.frame where each species has its own row, and each plot has its own column containing the mean weight for a given species. We will use pivot_wider()
to reshape our data in this way. It takes 3 arguments:
- the name of the data.frame
-
names_from
: which column should be used to generate the names of the new columns? -
values_from
: which column should be used to fill in the values of the new columns?
Any columns not used for names_from
or values_from
will not be pivoted.
In our case, we want the new columns to be named from our plot_id
column, with the values coming from the mean_weight
column. We can pipe our data.frame right into pivot_wider()
and add those two arguments:
R
sp_by_plot_wide <- sp_by_plot %>%
pivot_wider(names_from = plot_id,
values_from = mean_weight)
sp_by_plot_wide
OUTPUT
# A tibble: 18 × 25
# Groups: species_id [18]
species…¹ `3` `21` `1` `2` `4` `5` `6` `7` `8` `9`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 BA 8 6.5 NA NA NA NA NA NA NA NA
2 DM 41.2 41.5 42.7 42.6 41.9 42.6 42.1 43.2 43.4 43.1
3 DO 42.7 NA 50.1 50.3 46.8 50.4 49.0 52 49.2 49.0
4 DS 128. NA 129. 125. 118. 111. 114. 126. 128. 115.
5 NL 171. 136. 154. 171. 164. 192. 176. 170. 134. 165
6 OL 32.1 28.6 35.5 34 33.0 32.6 31.8 NA 30.3 32.0
7 OT 24.1 24.1 23.7 24.9 26.5 23.6 23.5 22 24.1 23.1
8 OX 22 NA NA 22 NA 20 NA NA NA NA
9 PE 22.7 19.6 21.6 22.0 NA 21 21.6 22.8 19.4 22.1
10 PF 7.12 7.23 6.57 6.89 6.75 7.5 7.54 7 6.78 7.29
11 PH 28 31 NA NA NA 29 NA NA NA NA
12 PM 20.1 23.6 23.7 23.9 NA 23.7 22.3 23.4 23 21
13 PP 17.1 13.6 14.3 16.4 14.8 19.8 16.8 NA 13.9 14.8
14 RF 14.8 17 NA 16 NA 14 12.1 13 NA NA
15 RM 10.3 9.89 10.9 10.6 10.4 10.8 10.6 10.7 9 9.6
16 SF NA 49 NA NA NA NA NA NA NA NA
17 SH 76.0 79.9 NA 88 NA 82.7 NA NA NA NA
18 SS NA NA NA NA NA NA NA NA NA NA
# … with 14 more variables: `10` <dbl>, `11` <dbl>, `12` <dbl>, `13` <dbl>,
# `14` <dbl>, `15` <dbl>, `16` <dbl>, `17` <dbl>, `18` <dbl>, `19` <dbl>,
# `20` <dbl>, `22` <dbl>, `23` <dbl>, `24` <dbl>, and abbreviated variable
# name ¹species_id
Now we’ve got our reshaped data.frame. There are a few things to notice. First, we have a new column for each plot_id
value. There is one old column left in the data.frame: species_id
. It wasn’t used in pivot_wider()
, so it stays, and now contains a single entry for each unique species_id
value.
Finally, a lot of NA
s have appeared. Some species aren’t found in every plot, but because a data.frame has to have a value in every row and every column, an NA
is inserted. We can double-check this to verify what is going on.
Looking in our new pivoted data.frame, we can see that there is an NA
value for the species BA
in plot 1
. Let’s take our sp_by_plot
data.frame and look for the mean_weight
of that species + plot combination.
R
sp_by_plot %>%
filter(species_id == "BA" & plot_id == 1)
OUTPUT
# A tibble: 0 × 3
# Groups: species_id [0]
# … with 3 variables: species_id <chr>, plot_id <dbl>, mean_weight <dbl>
We get back 0 rows. There is no mean_weight
for the species BA
in plot 1
. This either happened because no BA
were ever caught in plot 1
, or because every BA
caught in plot 1
had an NA
weight value and all the rows got removed when we used filter(!is.na(weight))
in the process of making sp_by_plot
. Because there are no rows with that species + plot combination, in our pivoted data.frame, the value gets filled with NA
.
There is another pivot_
function that does the opposite, moving data from a wide to long format, called pivot_longer()
. It takes 3 arguments: cols
for the columns you want to pivot, names_to
for the name of the new column which will contain the old column names, and values_to
for the name of the new column which will contain the old values.
We can pivot our new wide data.frame to a long format using pivot_longer()
. We want to pivot all the columns except species_id
, and we will use PLOT
for the new column of plot IDs, and MEAN_WT
for the new column of mean weight values.
R
sp_by_plot_wide %>%
pivot_longer(cols = -species_id, names_to = "PLOT", values_to = "MEAN_WT")
OUTPUT
# A tibble: 432 × 3
# Groups: species_id [18]
species_id PLOT MEAN_WT
<chr> <chr> <dbl>
1 BA 3 8
2 BA 21 6.5
3 BA 1 NA
4 BA 2 NA
5 BA 4 NA
6 BA 5 NA
7 BA 6 NA
8 BA 7 NA
9 BA 8 NA
10 BA 9 NA
# … with 422 more rows
One thing you will notice is that all those NA
values that got generated when we pivoted wider. However, we can filter those out, which gets us back to the same data as sp_by_plot
, before we pivoted it wider.
R
sp_by_plot_wide %>%
pivot_longer(cols = -species_id, names_to = "PLOT", values_to = "MEAN_WT") %>%
filter(!is.na(MEAN_WT))
OUTPUT
# A tibble: 300 × 3
# Groups: species_id [18]
species_id PLOT MEAN_WT
<chr> <chr> <dbl>
1 BA 3 8
2 BA 21 6.5
3 DM 3 41.2
4 DM 21 41.5
5 DM 1 42.7
6 DM 2 42.6
7 DM 4 41.9
8 DM 5 42.6
9 DM 6 42.1
10 DM 7 43.2
# … with 290 more rows
Data are often recorded in spreadsheets in a wider format, but lots of tidyverse
tools, especially ggplot2
, like data in a longer format, so pivot_longer()
is often very useful.
Exporting data
Let’s say we want to send the wide version of our sb_by_plot
data.frame to a colleague who doesn’t use R. In this case, we might want to save it as a CSV file.
First, we might want to modify the names of the columns, since right now they are bare numbers, which aren’t very informative. Luckily, pivot_wider()
has an argument names_prefix
which will allow us to add “plot_” to the start of each column.
R
sp_by_plot %>%
pivot_wider(names_from = plot_id, values_from = mean_weight,
names_prefix = "plot_")
OUTPUT
# A tibble: 18 × 25
# Groups: species_id [18]
species_id plot_3 plot_21 plot_1 plot_2 plot_4 plot_5 plot_6 plot_7 plot_8
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 BA 8 6.5 NA NA NA NA NA NA NA
2 DM 41.2 41.5 42.7 42.6 41.9 42.6 42.1 43.2 43.4
3 DO 42.7 NA 50.1 50.3 46.8 50.4 49.0 52 49.2
4 DS 128. NA 129. 125. 118. 111. 114. 126. 128.
5 NL 171. 136. 154. 171. 164. 192. 176. 170. 134.
6 OL 32.1 28.6 35.5 34 33.0 32.6 31.8 NA 30.3
7 OT 24.1 24.1 23.7 24.9 26.5 23.6 23.5 22 24.1
8 OX 22 NA NA 22 NA 20 NA NA NA
9 PE 22.7 19.6 21.6 22.0 NA 21 21.6 22.8 19.4
10 PF 7.12 7.23 6.57 6.89 6.75 7.5 7.54 7 6.78
11 PH 28 31 NA NA NA 29 NA NA NA
12 PM 20.1 23.6 23.7 23.9 NA 23.7 22.3 23.4 23
13 PP 17.1 13.6 14.3 16.4 14.8 19.8 16.8 NA 13.9
14 RF 14.8 17 NA 16 NA 14 12.1 13 NA
15 RM 10.3 9.89 10.9 10.6 10.4 10.8 10.6 10.7 9
16 SF NA 49 NA NA NA NA NA NA NA
17 SH 76.0 79.9 NA 88 NA 82.7 NA NA NA
18 SS NA NA NA NA NA NA NA NA NA
# … with 15 more variables: plot_9 <dbl>, plot_10 <dbl>, plot_11 <dbl>,
# plot_12 <dbl>, plot_13 <dbl>, plot_14 <dbl>, plot_15 <dbl>, plot_16 <dbl>,
# plot_17 <dbl>, plot_18 <dbl>, plot_19 <dbl>, plot_20 <dbl>, plot_22 <dbl>,
# plot_23 <dbl>, plot_24 <dbl>
That looks better! Let’s save this data.frame as a new object.
R
surveys_sp <- sp_by_plot %>%
pivot_wider(names_from = plot_id, values_from = mean_weight,
names_prefix = "plot_")
surveys_sp
OUTPUT
# A tibble: 18 × 25
# Groups: species_id [18]
species_id plot_3 plot_21 plot_1 plot_2 plot_4 plot_5 plot_6 plot_7 plot_8
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 BA 8 6.5 NA NA NA NA NA NA NA
2 DM 41.2 41.5 42.7 42.6 41.9 42.6 42.1 43.2 43.4
3 DO 42.7 NA 50.1 50.3 46.8 50.4 49.0 52 49.2
4 DS 128. NA 129. 125. 118. 111. 114. 126. 128.
5 NL 171. 136. 154. 171. 164. 192. 176. 170. 134.
6 OL 32.1 28.6 35.5 34 33.0 32.6 31.8 NA 30.3
7 OT 24.1 24.1 23.7 24.9 26.5 23.6 23.5 22 24.1
8 OX 22 NA NA 22 NA 20 NA NA NA
9 PE 22.7 19.6 21.6 22.0 NA 21 21.6 22.8 19.4
10 PF 7.12 7.23 6.57 6.89 6.75 7.5 7.54 7 6.78
11 PH 28 31 NA NA NA 29 NA NA NA
12 PM 20.1 23.6 23.7 23.9 NA 23.7 22.3 23.4 23
13 PP 17.1 13.6 14.3 16.4 14.8 19.8 16.8 NA 13.9
14 RF 14.8 17 NA 16 NA 14 12.1 13 NA
15 RM 10.3 9.89 10.9 10.6 10.4 10.8 10.6 10.7 9
16 SF NA 49 NA NA NA NA NA NA NA
17 SH 76.0 79.9 NA 88 NA 82.7 NA NA NA
18 SS NA NA NA NA NA NA NA NA NA
# … with 15 more variables: plot_9 <dbl>, plot_10 <dbl>, plot_11 <dbl>,
# plot_12 <dbl>, plot_13 <dbl>, plot_14 <dbl>, plot_15 <dbl>, plot_16 <dbl>,
# plot_17 <dbl>, plot_18 <dbl>, plot_19 <dbl>, plot_20 <dbl>, plot_22 <dbl>,
# plot_23 <dbl>, plot_24 <dbl>
Now we can save this data.frame to a CSV using the write_csv()
function from the readr
package. The first argument is the name of the data.frame, and the second is the path to the new file we want to create, including the file extension .csv
.
R
write_csv(surveys_sp, "data/cleaned/surveys_meanweight_species_plot.csv")
If we go look into our data/cleaned_data
folder, we will see this new CSV file.
Keypoints
- use
filter()
to subset rows andselect()
to subset columns - build up pipelines one step at a time before assigning the result
- it is often best to keep components of dates separate until needed, then use
mutate()
to make a date column -
group_by()
can be used withsummarize()
to collapse rows ormutate()
to keep the same number of rows -
pivot_wider()
andpivot_longer()
are powerful for reshaping data, but you should plan out how to use them thoughtfully