Content from Lessson introduction
Last updated on 2022-10-04 | Edit this page
Overview
Questions
- How should I use this lesson?
Objectives
- Understand how to use this lesson as a follow-up to the Data Carpentry R for Ecology lesson
- Understand how to use this lesson as a standalone lesson for intermediate R users
About this lesson
This lesson was originally the final episode of the updated R for Ecology lesson, and was designed as an optional follow-up to the prior episodes. The R for Ecology lesson uses a cleaned dataset from the earlier years of the Portal rodent data, which is suitable for new R learners. This data for this lesson comes from more recent years of the Portal data, spread across 3 data files, and in an “uncleaned” form. It introduces only a handful of new tidyverse
functions and focuses primarily on putting previously learned skills to use in real-world scenarios, combining them to build pipelines, and extending them to new use cases.
Use as a follow-up lesson
- recommended for intermediate skill level
- may cause burnout if you try to cram it in
- best used if learners are speeding through the R Ecology lesson
Used standalone
- probably need to do a little bit of introduction of the dataset, like first part of R Ecology ggplot lesson
R
library(ratdat)
?complete_old
str(complete_old)
OUTPUT
'data.frame': 16878 obs. of 13 variables:
$ record_id : int 1 2 3 4 5 6 7 8 9 10 ...
$ month : int 7 7 7 7 7 7 7 7 7 7 ...
$ day : int 16 16 16 16 16 16 16 16 16 16 ...
$ year : int 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 ...
$ plot_id : int 2 3 2 7 3 1 2 1 1 6 ...
$ species_id : chr "NL" "NL" "DM" "DM" ...
$ sex : chr "M" "M" "F" "M" ...
$ hindfoot_length: int 32 33 37 36 35 14 NA 37 34 20 ...
$ weight : int NA NA NA NA NA NA NA NA NA NA ...
$ genus : chr "Neotoma" "Neotoma" "Dipodomys" "Dipodomys" ...
$ species : chr "albigula" "albigula" "merriami" "merriami" ...
$ taxa : chr "Rodent" "Rodent" "Rodent" "Rodent" ...
$ plot_type : chr "Control" "Long-term Krat Exclosure" "Control" "Rodent Exclosure" ...
- everything else in the lesson is standalone, but it assumes learners are familiar with working with dataframes, the basics of
ggplot2
,filter()
,select()
,group_by()
,mutate()
, andsummarize()
Content from Putting it together
Last updated on 2022-10-04 | Edit this page
Overview
Questions
- How do you apply data manipulation skills to multiple new files?
Objectives
- Read in messy data and find issues.
- Replace incorrect values.
- Read data from multiple file formats.
- Utilize
pivot_
functions to reshape untidy data. - Combine multiple datasets.
- Understand the process of formatting new data similarly to existing data.
R
library(tidyverse)
So far we have been working with surveys data from 1977 to 1989, and our data have been pretty neat and tidy. There are some NA
values, but for the most part, the data have been formatted nicely. However, as many of us know, we do not always receive data in such nice shape. It’s pretty common to get data with all sorts of formatting issues, maybe some strange file formats, and possibly spread across several different sources.
Well, it turns out we have just that situation! We have received a newer batch of surveys data, from 1990 to 2002, and we want to add it to our older dataset so we can work with them together. Unfortunately, the data are not formatted quite as nicely as our old data. Our collaborators have told us to “look them over” for any errors, but have not given us very much specific information. We will have to explore the new data to make sure we understand it and verify that there aren’t any errors.
You can download a .zip
file containing three new data files here: https://www.michaelc-m.com/Rewrite-R-ecology-lesson/data/new_data.zip. When prompted, save the file to your data/raw/
folder. A .zip
file is a type of compressed file that contains one or more files or directories. We will use the unzip()
command to extract the data files from the .zip
file. The first argument is the path to the .zip
file, the next argument is the directory we want to put the extracted files into, and the last argument tells unzip()
to not create an additional directory for the new files. Since this is an action we only want to perform once, we will run it directly in the Console instead of putting it into a script.
R
unzip("data/raw/new_data.zip", exdir = "data/raw/", junkpaths = TRUE)
Use the Files pane in the lower right to navigate to the data/raw/
folder and you should find 3 new files: plots_new.csv
, species_new.txt
, and surveys_new.csv
.
Reading the new surveys data
Let’s start off with the new surveys data. First we will read it into R:
R
surveys_new <- read_csv("data/raw/surveys_new.csv")
WARNING
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
OUTPUT
Rows: 18676 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): date (mm/dd/yyyy), species_id, sex
dbl (4): record_id, plot_id, hindfoot_length, weight
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
You will notice it contains a lot of columns from our previous surveys
data, but not all of the columns. Some of them are only found in our other plots_new.csv
and species_new.txt
files.
First thing we want to do with surveys_new
is fix that date
column name with spaces in it. R can handle them, but they are often very annoying. We can use the rename()
function to change the column name.
R
surveys_new <- surveys_new %>%
rename(date = `date (mm/dd/yyyy)`)
Let’s take a look at a summary of our data using summary()
.
R
summary(surveys_new)
OUTPUT
record_id date plot_id species_id
Min. :16879 Length:18676 Min. : 1.00 Length:18676
1st Qu.:21545 Class :character 1st Qu.: 5.00 Class :character
Median :26214 Mode :character Median :12.00 Mode :character
Mean :26214 Mean :11.33
3rd Qu.:30881 3rd Qu.:17.00
Max. :35549 Max. :24.00
sex hindfoot_length weight
Length:18676 Min. : 2.00 Min. : 4.0
Class :character 1st Qu.:21.00 1st Qu.: 19.0
Mode :character Median :26.00 Median : 32.0
Mean :27.08 Mean : 873.7
3rd Qu.:36.00 3rd Qu.: 47.0
Max. :64.00 Max. :9999.0
NA's :1380
The summary()
function is often useful for detecting outliers or clearly incorrect values, since we get a Min.
and Max.
value for each numeric column. For example, we see that month
goes from 1 to 12 and day
goes from 1 to 31, so no issues there. However, we do notice that weight
has a max value of 9999. Sometimes people will use extreme and impossible values to denote a missing value. It is worth checking with our collaborators to make sure this is the case, but we will assume that’s what happened.
Finally, we actually got a warning message about a parsing issue. This message actually comes from read_csv()
, even though it only showed up now. Parsing is what read_csv()
does when it tries to guess what type of vector each CSV column should be. Sometimes it will warn us about issues that occurred, which we can then investigate with the problems()
function.
R
problems(surveys_new)
OUTPUT
# A tibble: 1 × 5
row col expected actual file
<int> <int> <chr> <chr> <chr>
1 19 6 a double 19' /home/runner/work/R-ecology-lesson-intermediate/R…
The output shows that in the 19th row and 8th column of the CSV, read_csv()
expected a double, or numeric, value. Instead, what it got was 19'
. That stray quotation mark was unexpected, so read_csv()
notified us. Let’s go see what value is actually there for surveys_new
. It was in the 19th row of the CSV, which includes the header row containing column names, so we should look at the 18th row of our data.frame. The 8th column is hindfoot_length
. We can use the head()
function to look at the first 20 rows.
R
surveys_new %>%
head(n=20)
OUTPUT
# A tibble: 20 × 7
record_id date plot_id species_id sex hindfoot_length weight
<dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl>
1 16879 1/6/1990 1 DM F 37 35
2 16880 1/6/1990 1 OL M 21 28
3 16881 1/6/1990 6 PF M 16 7
4 16882 1/6/1990 23 RM F 17 9
5 16883 1/6/1990 12 RM M 17 10
6 16884 1/6/1990 24 RM M 17 9
7 16885 1/6/1990 12 SF M 25 35
8 16886 1/6/1990 24 SH F 30 73
9 16887 1/6/1990 12 SF M 28 44
10 16888 1/6/1990 17 DO M 36 55
11 16889 1/6/1990 21 SF M 29 55
12 16890 1/6/1990 12 OT M 22 23
13 16891 1/6/1990 12 DO F 36 53
14 16892 1/6/1990 21 AB <NA> NA 9999
15 16893 1/6/1990 12 OT F 21 24
16 16894 1/6/1990 1 OT F 21 20
17 16895 1/6/1990 12 SF F 27 75
18 16896 1/6/1990 12 RM M NA 11
19 16897 1/6/1990 21 SF F 29 46
20 16898 1/6/1990 23 RM M 18 11
Because read_csv()
didn’t know what to do with the value 19'
, there is an NA
for hindfoot_length
in row 18. It is likely that the true value was 19
and the stray quotation mark was simply a typo. If we want to change that value, we can do it by referring to the record_id
, since it is a unique identifier for each row. We will use the function if_else()
to actually replace the value. This function takes a logical statement as its first argument, then a value to return if that statement is TRUE
, and a value to return if it is FALSE
. Take a look at this example:
R
x <- 1:10
ifelse(x > 6, "bigger than 6", "not bigger than 6")
OUTPUT
[1] "not bigger than 6" "not bigger than 6" "not bigger than 6"
[4] "not bigger than 6" "not bigger than 6" "not bigger than 6"
[7] "bigger than 6" "bigger than 6" "bigger than 6"
[10] "bigger than 6"
What we will do is take surveys_new
and mutate the hindfoot_length
column. It will be equal to the result of an ifelse()
statement. If the record_id
is 16896
, the row we are trying to change, then hindfoot_length
will be set to 19. If the record_id
is not 16896
, then it will stay as the current hindfoot_length
value.
R
surveys_new <- surveys_new %>%
mutate(hindfoot_length = ifelse(record_id == 16896, 19, hindfoot_length))
surveys_new %>%
head(n=20)
OUTPUT
# A tibble: 20 × 7
record_id date plot_id species_id sex hindfoot_length weight
<dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl>
1 16879 1/6/1990 1 DM F 37 35
2 16880 1/6/1990 1 OL M 21 28
3 16881 1/6/1990 6 PF M 16 7
4 16882 1/6/1990 23 RM F 17 9
5 16883 1/6/1990 12 RM M 17 10
6 16884 1/6/1990 24 RM M 17 9
7 16885 1/6/1990 12 SF M 25 35
8 16886 1/6/1990 24 SH F 30 73
9 16887 1/6/1990 12 SF M 28 44
10 16888 1/6/1990 17 DO M 36 55
11 16889 1/6/1990 21 SF M 29 55
12 16890 1/6/1990 12 OT M 22 23
13 16891 1/6/1990 12 DO F 36 53
14 16892 1/6/1990 21 AB <NA> NA 9999
15 16893 1/6/1990 12 OT F 21 24
16 16894 1/6/1990 1 OT F 21 20
17 16895 1/6/1990 12 SF F 27 75
18 16896 1/6/1990 12 RM M 19 11
19 16897 1/6/1990 21 SF F 29 46
20 16898 1/6/1990 23 RM M 18 11
We can actually use ifelse()
to fix the values of 9999
in the weight
column as well.
R
surveys_new <- surveys_new %>%
mutate(weight = ifelse(weight == 9999, NA, weight))
surveys_new %>%
head(n=20)
OUTPUT
# A tibble: 20 × 7
record_id date plot_id species_id sex hindfoot_length weight
<dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl>
1 16879 1/6/1990 1 DM F 37 35
2 16880 1/6/1990 1 OL M 21 28
3 16881 1/6/1990 6 PF M 16 7
4 16882 1/6/1990 23 RM F 17 9
5 16883 1/6/1990 12 RM M 17 10
6 16884 1/6/1990 24 RM M 17 9
7 16885 1/6/1990 12 SF M 25 35
8 16886 1/6/1990 24 SH F 30 73
9 16887 1/6/1990 12 SF M 28 44
10 16888 1/6/1990 17 DO M 36 55
11 16889 1/6/1990 21 SF M 29 55
12 16890 1/6/1990 12 OT M 22 23
13 16891 1/6/1990 12 DO F 36 53
14 16892 1/6/1990 21 AB <NA> NA NA
15 16893 1/6/1990 12 OT F 21 24
16 16894 1/6/1990 1 OT F 21 20
17 16895 1/6/1990 12 SF F 27 75
18 16896 1/6/1990 12 RM M 19 11
19 16897 1/6/1990 21 SF F 29 46
20 16898 1/6/1990 23 RM M 18 11
Challenge 1: Find a specialized function
The tidyverse
often has specialized functions for common data manipulation tasks, such as replacing a certain values with NA
. There is a tidyverse
function to replace a value in a vector with NA
. Put your Googling skills to work and see if you can find the correct function.
For an extra challenge, write out code that could use this function to replace weight
values of 9999 with NA
.
The dplyr
function na_if()
will replace specific values in a vector to NA
. To find this function, you can Google “tidyverse replace value with NA”. One of the first results is the dplyr
documentation page for the na_if()
function.
If you scroll down to the bottom section of the documentation, you will find several examples, including how to use the function inside mutate()
.
R
surveys_new %>%
mutate(weight = na_if(weight, 9999))
The last thing we have to do is deal with our date column. It’s currently a character column, but our old surveys
data had separate columns for year
, month
, and day
. Another thing we should do is check for any errors in our dates, since they are an error-prone data type.
There are a few ways we could approach this problem, which is a common theme in R: there are often many ways to accomplish the same task. It is often useful to plan your approach ahead of time, so we will describe two possible methods:
Turn the current column into a date column, validate the dates, then use
lubridate
functions to extract the year, month, and day into their own columns.Use the
separate()
function to split our current date column into 3 new character columns, containing the month, day and year. Then turn those columns into numeric columns. Then it will match our oldsurveys
data, and we can later make a date column to validate our dates.
It is often useful to plan out your approach, or several approaches, before you start writing code. It can be in the form of plain English like above, or in “pseudo-code”, which is laid out like code, but doesn’t have explicit, functioning code.
We will go ahead and use the first approach. First we will load lubridate
and use the mdy()
function to turn our date
column into a date instead of character column.
R
library(lubridate)
OUTPUT
Attaching package: 'lubridate'
OUTPUT
The following objects are masked from 'package:base':
date, intersect, setdiff, union
R
surveys_new <- surveys_new %>%
mutate(date = mdy(date))
WARNING
Warning: 6 failed to parse.
We got a warning message about 6 dates failing to parse. This means that the mdy()
function encountered 6 dates that it wasn’t able to identify correctly. When lubridate
functions fail to parse dates, they will return an NA
value instead. To find the rows where this happened, we can use filter()
:
R
surveys_new %>%
filter(is.na(date))
OUTPUT
# A tibble: 6 × 7
record_id date plot_id species_id sex hindfoot_length weight
<dbl> <date> <dbl> <chr> <chr> <dbl> <dbl>
1 22258 NA 8 AH <NA> NA NA
2 22261 NA 9 DM F 37 45
3 30595 NA 18 PB F 25 34
4 30610 NA 2 PB F 25 31
5 30638 NA 20 PP F 22 20
6 31394 NA 12 OT F 20 29
Challenge 2: Find the bad dates
We have now located the rows with NA
dates, but we probably want to know what the original date character strings looked like. Figure out what those dates were and why they might have been wrong.
Hint: you will have to look at a previous version of the data, before we modified the date
column.
There are two basic approaches you could take. First, you could look directly at the old CSV and find the rows with bad dates based on their record_id
.
You could also read the data back into R and use filter()
to pick out those specific rows via record_id
:
R
read_csv("data/raw/surveys_new.csv") %>%
filter(record_id %in% c(22258, 22261, 30595, 30610, 30638, 31394))
WARNING
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
OUTPUT
# A tibble: 6 × 7
record_id `date (mm/dd/yyyy)` plot_id species_id sex hindfoot_length weight
<dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl>
1 22258 4/31/1995 8 AH <NA> NA 9999
2 22261 4/31/1995 9 DM F 37 45
3 30595 4/31/2000 18 PB F 25 34
4 30610 4/31/2000 2 PB F 25 31
5 30638 4/31/2000 20 PP F 22 20
6 31394 9/31/2000 12 OT F 20 29
The dates are wrong because they are the 31st day in a month that only has 30 days, like April or September. lubridate
doesn’t recognize these as valid dates. The same thing can happen with things like dates in February during non-leap years.
The last thing to do is extract the year, month, and day values from our date
column. lubridate
has functions to extract each component of a date. We will then get rid of the date
column, since it doesn’t appear in our original surveys
data, and we can always remake it from the component columns.
R
surveys_new <- surveys_new %>%
mutate(year = year(date),
month = month(date),
day = day(date)) %>%
select(-date)
surveys_new
OUTPUT
# A tibble: 18,676 × 9
record_id plot_id species_id sex hindfoot_length weight year month day
<dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <int>
1 16879 1 DM F 37 35 1990 1 6
2 16880 1 OL M 21 28 1990 1 6
3 16881 6 PF M 16 7 1990 1 6
4 16882 23 RM F 17 9 1990 1 6
5 16883 12 RM M 17 10 1990 1 6
6 16884 24 RM M 17 9 1990 1 6
7 16885 12 SF M 25 35 1990 1 6
8 16886 24 SH F 30 73 1990 1 6
9 16887 12 SF M 28 44 1990 1 6
10 16888 17 DO M 36 55 1990 1 6
# … with 18,666 more rows
Reading the new species data
Our surveys_new
data look good at this point, so let’s move on to the species data. You may have noticed that our species data came in a different file format, species_new.txt
. So far we have been working with CSV files, in which values are separated by commas. However, R is capable of reading many different file types. The .txt
extension means it is a plain-text file, which means the data could be formatted in quite a few different ways. Let’s take a look at the file directly to see how it is structured.
Click on the species_new.txt
file in the Files pane to open it in RStudio. We see that our data are still structured in columns and rows, with column names in the header row. Each value is wrapped in quotes, values are separated by spaces, and each row ends with a new line.
This is a generic data structure called “delimited” data. A CSV is a form of delimited data, where values are “delimited”, or separated, by commas. Luckily, the readr
package has a function for dealing with more generic delimited data, called read_delim()
.
We have to give read_delim()
three arguments. First is the file path, just like read_csv()
. The second argument is what character string is used to delimit each item in the file. In our case, it is a space, so we make a character string that is just a space. Finally, we need to identify what is used to quote each entry in our file. Our values are wrapped in double-quotes, so we need to type a double quote. However, we can’t just type 3 double-quotes, or R will get upset with us (give it a try if you want). Luckily, R recognizes both single- and double-quotes for creating character strings. So we can use single-quotes to make our character string, and put one double-quote character inside it.
R
species_new <- read_delim("data/raw/species_new.txt", delim = " ", quote = '"')
OUTPUT
Rows: 54 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
chr (3): species_id, species_name, taxa
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
R
species_new
OUTPUT
# A tibble: 54 × 3
species_id species_name taxa
<chr> <chr> <chr>
1 AB Amphispiza bilineata Bird
2 AH Ammospermophilus harrisi Rodent
3 AS Ammodramus savannarum Bird
4 BA Baiomys taylori Rodent
5 CB Campylorhynchus brunneicapillus Bird
6 CM Calamospiza melanocorys Bird
7 CQ Callipepla squamata Bird
8 CS Crotalus scutalatus Reptile
9 CT Cnemidophorus tigris Reptile
10 CU Cnemidophorus uniparens Reptile
# … with 44 more rows
What we get back is a tibble, formatted just like it would have been if our data were in a CSV.
One thing we might notice is that our species and genus are combined into one column called species_name
, whereas in our old data, we had separate columns for genus
and species
. It is fairly common to have data in one column that could be separated into two or more columns. Luckily, tidyr
has a convenient function for solving this problem, called separate()
.
We pipe species_new
into the separate()
function, then give it several other arguments. First, the name of the column to be separated, species_name
. Next, we give the argument into
a character vector of the new columns we want. Finally, we give a string for what is currently separating each of the new values in the current column. In species_name
, the genus and species are separated by a space.
R
species_new <- species_new %>%
separate(species_name, into = c("genus", "species"), sep = " ")
species_new
OUTPUT
# A tibble: 54 × 4
species_id genus species taxa
<chr> <chr> <chr> <chr>
1 AB Amphispiza bilineata Bird
2 AH Ammospermophilus harrisi Rodent
3 AS Ammodramus savannarum Bird
4 BA Baiomys taylori Rodent
5 CB Campylorhynchus brunneicapillus Bird
6 CM Calamospiza melanocorys Bird
7 CQ Callipepla squamata Bird
8 CS Crotalus scutalatus Reptile
9 CT Cnemidophorus tigris Reptile
10 CU Cnemidophorus uniparens Reptile
# … with 44 more rows
There we go, now species_new
is formatted like the similar columns in the older surveys
data.
The separate()
function also has an argument called convert
, which will automatically convert the types of your new columns. For example, if you had a column called range
that had character strings like "1990-1995"
, and you wanted to separate it into start
and end
columns, you would end up with character columns if you used separate()
like we did above. However, if you use convert = T
, the new columns will be converted to integers. Check out this short example below:
R
d <- tibble(years = c("1990-1995", "2000-2002"))
d %>%
separate(years, into = c("start", "end"), sep = "-")
OUTPUT
# A tibble: 2 × 2
start end
<chr> <chr>
1 1990 1995
2 2000 2002
R
d %>%
separate(years, into = c("start", "end"), sep = "-", convert = T)
OUTPUT
# A tibble: 2 × 2
start end
<int> <int>
1 1990 1995
2 2000 2002
Reading the new plots data
Finally, we can move on to the new plots
data, in the plots_new.csv
file. We can go back to read_csv()
to get it into R.
R
plots_new <- read_csv("data/raw/plots_new.csv")
OUTPUT
Rows: 1 Columns: 24
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (24): Plot 1, Plot 2, Plot 3, Plot 4, Plot 5, Plot 6, Plot 7, Plot 8, Pl...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
R
plots_new
OUTPUT
# A tibble: 1 × 24
`Plot 1` Plot …¹ Plot …² Plot …³ Plot …⁴ Plot …⁵ Plot …⁶ Plot …⁷ Plot …⁸
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Spectab exclo… Control Long-t… Control Rodent… Short-… Rodent… Control Specta…
# … with 15 more variables: `Plot 10` <chr>, `Plot 11` <chr>, `Plot 12` <chr>,
# `Plot 13` <chr>, `Plot 14` <chr>, `Plot 15` <chr>, `Plot 16` <chr>,
# `Plot 17` <chr>, `Plot 18` <chr>, `Plot 19` <chr>, `Plot 20` <chr>,
# `Plot 21` <chr>, `Plot 22` <chr>, `Plot 23` <chr>, `Plot 24` <chr>, and
# abbreviated variable names ¹`Plot 2`, ²`Plot 3`, ³`Plot 4`, ⁴`Plot 5`,
# ⁵`Plot 6`, ⁶`Plot 7`, ⁷`Plot 8`, ⁸`Plot 9`
It looks like our data are in a bit of a strange format. We have a column for each plot, and then a single row of data containing the plot type. If you look at our old surveys
data, we had a single row for plot_id
and a single row for plot_type
. surveys
contained this data in a long format, whereas plots_new
has a wide format.
R
plots_new <- plots_new %>%
pivot_longer(cols = everything(), names_to = "plot_id", values_to = "plot_type")
plots_new
OUTPUT
# A tibble: 24 × 2
plot_id plot_type
<chr> <chr>
1 Plot 1 Spectab exclosure
2 Plot 2 Control
3 Plot 3 Long-term Krat Exclosure
4 Plot 4 Control
5 Plot 5 Rodent Exclosure
6 Plot 6 Short-term Krat Exclosure
7 Plot 7 Rodent Exclosure
8 Plot 8 Control
9 Plot 9 Spectab exclosure
10 Plot 10 Rodent Exclosure
# … with 14 more rows
Our old surveys
data had plot_id
as a numeric variable, but ours is a character string with "Plot "
in front of the number. This is a pretty common issue, but we can use a function from the stringr
package to fix it.
We will use mutate()
to modify the plot_id
column, and we will replace it with the results of the str_replace()
function. The first argument to str_replace()
is the character vector we want to modify, which is the current plot_id
column. Next is the string of characters that we want to replace, which is "Plot "
, including the space at the end. Finally, we have the replacement string. Since we want to remove "Plot "
, we replace it with a blank string ""
.
R
plots_new <- plots_new %>%
mutate(plot_id = str_replace(plot_id, "Plot ", ""))
plots_new
OUTPUT
# A tibble: 24 × 2
plot_id plot_type
<chr> <chr>
1 1 Spectab exclosure
2 2 Control
3 3 Long-term Krat Exclosure
4 4 Control
5 5 Rodent Exclosure
6 6 Short-term Krat Exclosure
7 7 Rodent Exclosure
8 8 Control
9 9 Spectab exclosure
10 10 Rodent Exclosure
# … with 14 more rows
We successfully removed "Plot "
from our plot_id
column entries, so we are left with just the numbers. However, it is still a character
column. The last step is to convert it to a numeric column.
R
plots_new <- plots_new %>%
mutate(plot_id = as.numeric(plot_id))
plots_new
OUTPUT
# A tibble: 24 × 2
plot_id plot_type
<dbl> <chr>
1 1 Spectab exclosure
2 2 Control
3 3 Long-term Krat Exclosure
4 4 Control
5 5 Rodent Exclosure
6 6 Short-term Krat Exclosure
7 7 Rodent Exclosure
8 8 Control
9 9 Spectab exclosure
10 10 Rodent Exclosure
# … with 14 more rows
Joining the new data
Now that we have each individual data.frame formatted nicely, we would like to be able to combine them. Our surveys
data has all of the data combined into one data.frame. However, our data.frames are different sizes. surveys_new
has 18676 rows, and it contains the individual data for each animal. This is the same basic size of the old surveys
data. However, our plots_new
and species_new
data are much smaller. They only contain data on specific plots and species.
If we look at the column names for surveys_new
and plots_new
, we see that they share a plot_id
column. What we want to do now is take the data of our actual observations, surveys_new
, and add the data for each associated plot. If a row in surveys_new
has a plot_id
of 2, we want to associate the plot_type
of that plot with that row. We can accomplish this using a join.
There are several types of joins in the dplyr
package, which you can read more about here. We will use a function called left_join()
, which takes two dataframes and adds the columns from the second dataframe to the first dataframe, matching rows based on the column name supplied to the by
argument.
R
left_join(surveys_new, plots_new, by = "plot_id")
OUTPUT
# A tibble: 18,676 × 10
record_id plot_id species_id sex hindfoo…¹ weight year month day plot_…²
<dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <int> <chr>
1 16879 1 DM F 37 35 1990 1 6 Specta…
2 16880 1 OL M 21 28 1990 1 6 Specta…
3 16881 6 PF M 16 7 1990 1 6 Short-…
4 16882 23 RM F 17 9 1990 1 6 Rodent…
5 16883 12 RM M 17 10 1990 1 6 Control
6 16884 24 RM M 17 9 1990 1 6 Rodent…
7 16885 12 SF M 25 35 1990 1 6 Control
8 16886 24 SH F 30 73 1990 1 6 Rodent…
9 16887 12 SF M 28 44 1990 1 6 Control
10 16888 17 DO M 36 55 1990 1 6 Control
# … with 18,666 more rows, and abbreviated variable names ¹hindfoot_length,
# ²plot_type
Now we have our surveys_new
dataframe, still with 18676 rows, but now each row has a value for plot_type
, corresponding to its entry in plot_id
. We can assign this back to surveys_new
, so that it now contains the information from both dataframes.
R
surveys_new <- left_join(surveys_new, plots_new, by = "plot_id")
We can repeat this process to get the information from species_new
. surveys_new
and species_new
both have a species_id
column, but we would like to add the genus
, species
, and taxa
information to surveys_new
.
R
surveys_new <- left_join(surveys_new, species_new, by = "species_id")
surveys_new
OUTPUT
# A tibble: 18,676 × 13
record…¹ plot_id speci…² sex hindf…³ weight year month day plot_…⁴ genus
<dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <int> <chr> <chr>
1 16879 1 DM F 37 35 1990 1 6 Specta… Dipo…
2 16880 1 OL M 21 28 1990 1 6 Specta… Onyc…
3 16881 6 PF M 16 7 1990 1 6 Short-… Pero…
4 16882 23 RM F 17 9 1990 1 6 Rodent… Reit…
5 16883 12 RM M 17 10 1990 1 6 Control Reit…
6 16884 24 RM M 17 9 1990 1 6 Rodent… Reit…
7 16885 12 SF M 25 35 1990 1 6 Control Sigm…
8 16886 24 SH F 30 73 1990 1 6 Rodent… Sigm…
9 16887 12 SF M 28 44 1990 1 6 Control Sigm…
10 16888 17 DO M 36 55 1990 1 6 Control Dipo…
# … with 18,666 more rows, 2 more variables: species <chr>, taxa <chr>, and
# abbreviated variable names ¹record_id, ²species_id, ³hindfoot_length,
# ⁴plot_type
Now our surveys_new
dataframe has all the information from our 3 files, and the same number of columns as our original surveys
data.
Adding to the old data
Now that our old surveys
data and surveys_new
data are formatted in the same way, we can bind them together so we have data from all years in one data.frame. First let’s read our `surveys’ data back in.
R
surveys <- read_csv("data/cleaned/surveys_complete_77_89.csv")
ERROR
Error: 'data/cleaned/surveys_complete_77_89.csv' does not exist in current working directory ('/home/runner/work/R-ecology-lesson-intermediate/R-ecology-lesson-intermediate/site/built').
Now we can use the bind_rows()
function to bind the rows of our two data.frames together. The fact that our columns are not in the same order doesn’t matter, bind_rows()
will detect thatt the column names are the same, and will rearrange them to match the first data.frame.
R
surveys_complete <- bind_rows(surveys, surveys_new)
ERROR
Error in list2(...): object 'surveys' not found
R
surveys_complete
ERROR
Error in eval(expr, envir, enclos): object 'surveys_complete' not found
We might be interested in indicating which rows of our data came from which source: the old data or the new. We can name the data.frames inside bind_rows()
, and then give a new argument .id
. This will give us a new column called source
that contains a value of "old"
for rows that came from surveys
, and a value of "new"
for rows that came from surveys_new
.
R
surveys_complete <- bind_rows(old = surveys, new = surveys_new, .id = "source")
ERROR
Error in list2(...): object 'surveys' not found
R
surveys_complete
ERROR
Error in eval(expr, envir, enclos): object 'surveys_complete' not found
We have now successfully cleaned our new data and reshaped it to match our old data so they could be arranged into one data.frame covering all the years.
Back to ggplot2
position_dodge()
-
coord_
? patchwork
-
label_wrap_gen()
? theme_set()
R
surveys_complete %>%
count(year) %>%
ggplot(aes(x = year, y = n)) +
geom_line()
ERROR
Error in count(., year): object 'surveys_complete' not found
R
surveys_complete %>%
count(plot_type, sex) %>%
ggplot(aes(x = plot_type, y = n, fill = sex)) +
geom_col(position = position_dodge()) +
scale_x_discrete(labels = label_wrap_gen(10))
ERROR
Error in count(., plot_type, sex): object 'surveys_complete' not found
R
surveys_complete %>%
filter(!is.na(weight), !is.na(sex)) %>%
group_by(genus, year, sex) %>%
summarise(mean_weight = mean(weight)) %>%
ggplot(aes(x = year, y = mean_weight, color = genus)) +
geom_line() +
facet_wrap(vars(sex))
ERROR
Error in filter(., !is.na(weight), !is.na(sex)): object 'surveys_complete' not found
Setting limits with scale_
or xlim()
/ylim()
will remove data, so the slope of the line changes:
R
surveys_complete %>%
ggplot(aes(x = weight, y = hindfoot_length)) +
geom_point() +
geom_smooth(method = "lm") +
scale_x_continuous(limits = c(0,100))
ERROR
Error in ggplot(., aes(x = weight, y = hindfoot_length)): object 'surveys_complete' not found
If you want to zoom in on the plot without removing data outside the limits, set the limits inside coord_cartestian()
:
R
surveys_complete %>%
ggplot(aes(x = weight, y = hindfoot_length)) +
geom_point() +
geom_smooth(method = "lm") +
coord_cartesian(xlim = c(0,100))
ERROR
Error in ggplot(., aes(x = weight, y = hindfoot_length)): object 'surveys_complete' not found
There are other coord_
functions if you need to plot using polar coordinates, map coordinates, or fix the aspect ratio of a plot.
Final outputs
Let’s go ahead and write our data to a CSV file so we can share it with others.
R
surveys_complete %>%
write_csv("data/cleaned/surveys_complete.csv")
Now we might be interested in looking at all of our data together. Try making some plots of your own to look at the whole dataset!
R
surveys_complete %>%
ggplot(aes(x = weight, y = hindfoot_length)) +
geom_point(alpha = 0.05) +
facet_wrap(vars(source))
ERROR
Error in ggplot(., aes(x = weight, y = hindfoot_length)): object 'surveys_complete' not found
Keypoints
- it is always good to do preliminary investigations of new data
- there are often many ways to achieve the same goal, describing them with plain English or pseudocode can help you choose an approach
- the
read_delimited()
function can read tabular data from multiple file formats - joins are powerful ways to combine multiple datasets
- it is a good idea to plan out the steps of your data cleaning and combining