Organizing Whiskerbook
Overview
Teaching: 60 min
Exercises: 30 minQuestions
How can we properly organize the data for batch import into Whiskerbook?
Objectives
Create a xlsx file in the format suitable for upload to the Whikerbook interface
First, set the working directory to the tutorial directory.
Then, read in the camera trap data compiled in the previous lesson into this session.
SnowLeopardBook<-read.csv("SnowLeopard_CameraTrap.csv")
Load the Whiskerbook template downloaded from the website to assist in preparing batch import files. This file contains the header column names that are necessary for the program to import the data and fields successfully.
Whiskerbook_template<-read.csv("WildbookStandardFormat.csv")
We will expand the empty dataframe with NA values to populate the dataframe with as many rows as there are data from our camera trap records.
#first make sure there are enough rows in the template
Whiskerbook_template[1:nrow(SnowLeopardBook),]<-NA
Challenge: Whiskerbook
Answer the following questions:
- What are some of the original fields in this Whiskerbook template that may be useful?
Using our camera trap data, subset the data by columns that will be used for the Whiskerbook upload. There are some obvious things in the template and some not so obvious. We will go through which ones are needed below, first lets make a subset of the camera trap data and metadata to use.
SnowLeopardBook<-SnowLeopardBook[,c(2,3,4,5,6,12,15,16,17,27,28)]
Now, with the template formatted correctly, we can simply add the data from the camera trap dataframe to the template.
#then add data from the camera trap dataframe to the template
Whiskerbook_template$Encounter.locationID<-SnowLeopardBook$Station
Whiskerbook_template$Encounter.mediaAssetX<-SnowLeopardBook$FileName
Whiskerbook_template$Encounter.decimalLatitude<-SnowLeopardBook$Lat
Whiskerbook_template$Encounter.decimalLongitude<-SnowLeopardBook$Long
Whiskerbook_template$Encounter.year<-SnowLeopardBook$Year
Whiskerbook_template$Encounter.genus<-SnowLeopardBook$Genus
Whiskerbook_template$Encounter.specificEpithet<-SnowLeopardBook$Species
Whiskerbook_template$Encounter.submitterID<-"YOUR_WHISKERBOOK_USERNAME"
Whiskerbook_template$Encounter.country<-"Afghanistan"
Whiskerbook_template$Encounter.submitterOrganization<-"WCS Afghanistan"
Since the dates in our camera trap dataset are not formatted properly for the Whiskerbook template, then we need to fix it a bit.
We will pull out the information for month and day from the date objects and fill in new columns withg the name Encounter.month and Encounter.day.
#fix dates
SnowLeopardBook$Date<-as.Date(SnowLeopardBook$Date)
Whiskerbook_template$Encounter.month<- format(SnowLeopardBook$Date, "%m")
Whiskerbook_template$Encounter.day<- format(SnowLeopardBook$Date, "%d")
#we can simply use the substr function we learned about earlier to pull out the first two characters in the time string to get the hours only.
Whiskerbook_template$Encounter.hour<-substr(SnowLeopardBook$Time, 1,2)
The Whiskerbook template requires that the data are put into a format with the image names of each encounter in one row.
- An occurrence is a set of images that are normally associated to one time span, like an hour where the same animals were passing in front of the camera.
- An encounter is a sighting of one animal within that time span. It is possible to encounter more than one individual over the course of an hour, although that hour is still called an occurrence and the single animals within are called the encounter.
For animal encounter data, it depends on the length of time you would like to subset the data. For the purposes of this lesson, we will group the data into occurences by hour. Each row of our whiskerbook dataframe will represent an encounter. Although at this early stage, we likely do not know if there are multiple individuals within the hour. To start, we will break up our data into hourly subsets and work with that. If possible, going through the data carefully to understand where there are multiple individuals and sorting it out early can be an advantage. It will keep the data clean and avoid problems later. Although, in the Whiskerbook program it is fairly easy to delete images from one encounter and create a new encounter with images from the second individual. It is up to you how you will deal with this issue. However time consuming it may be to sort the data by individual early on may save time later.
Here we use the dplyr functions for group_by to group the camera trap photos by location ID of the camera station, and the year, month, day, and hour. Then, after they are grouped, we simply sequentially number each individual photo and assign it to which group it is in.
The mutate function within dplyr allows us to create a new row of data based on some function or command, in this case, we will use the cur_group_id() to ID each of the rows according to which group they are in. The group will then denote the hourly intervals, which are our occurrences.
Whiskerbook_template<-Whiskerbook_template%>%
group_by(Encounter.locationID, Encounter.year, Encounter.month, Encounter.day,Encounter.hour)%>%
mutate(Encounter.occurrenceID = cur_group_id())
Challenge: group_by function
Answer the following questions:
- What happens when you group by different columns? Experiment with grouping by fewer or more columns to see the result. What happens to the group_ids?
Next, we have to sequentially number each of the images within that group or “occurrence” to create the Encounter.mediaAsset information that Whiskerbook needs to have. Now, within each group we are numbering each photo within that group. There may be 10 photos in an occurrence so they would be numbered 1-10, or there may be 40 photos within the occurrence, so we name those 1-40. Thankfully, dplyr has all of the necessary functions to allow us to name the photos within each group.
Whiskerbook_template<-Whiskerbook_template%>%
group_by(Encounter.occurrenceID)%>%
mutate(Encounter.mediaAsset = 1:n())
The image numbers we just created are actually going to become column names, and so we need to add the characters “Encounter.mediaAsset” before these numbers. To do this, we can use the paste function to paste together our character string and the number we generated.
Whiskerbook_template$Encounter.mediaAsset<-paste("Encounter.mediaAsset", Whiskerbook_template$Encounter.mediaAsset, sep="_")
Now, we will cast the Encounter.mediaAsset column out. Which means we will take one column of data, and generate numerous columns. Check to see the result of this if you are unsure what just happened.
We are calling this new template Whiskerbook_template2
library(reshape2)
Whiskerbook_template2<-dcast(Whiskerbook_template,Encounter.occurrenceID~Encounter.mediaAsset, value.var ="Encounter.mediaAssetX")
As you can see, the columns are not sorted sequentially, so we need to sort them by ascending order. The str_sort function in the stringr can sort the columns.
library(stringr)
Whiskerbook_template2_cols<-str_sort(colnames(Whiskerbook_template2), numeric = TRUE)
Whiskerbook_template2<-Whiskerbook_template2[,Whiskerbook_template2_cols]
Next, we have to actually rename all of the Encounter.mediaAsset columns starting with 0. They start with 1 now because the dply package requires we number starting with 1 not 0. It’s a bit of a glitch for us, but we can fix this.
First, we can create a vector of numbers for the number of occurrences that we have. Then, we add the characters “Encounter.mediaAsset” to these numbers. Then we add one more column name for the “Encounter.occurrenceID” that is already in our dataframe. Now we have a vector of character strings that will be our new column names.
Now, we can simply rename our dataframe columns using the new names we have created.
#the columns have to be renamed from 0 so we subtract one from the length
#the final column is the Encounter.occurrenceID column so we subtract one
col_vec<-0:(length(Whiskerbook_template2)-2)
col_vec<-paste("Encounter.mediaAsset",col_vec, sep="")
Media_assets<-c(col_vec, "Encounter.occurrenceID")
names(Whiskerbook_template2)<-Media_assets
The next thing we need to do is clean up our original Whiskerbook template so that we can merge these new cast Encounter.mediaAsset data.
In the original template, now, we have all the filenames in an Encounter.mediaAssetX column, and we need to remove that. That was there originally before we went through the trouble of reorganizing it.
Finally, we can remove the Encounter.mediaAsset column, which contained the numbers assigned to the individual images, we will also remove the final column for the original numbered images within the hourly subsets.
Whiskerbook_template<-Whiskerbook_template[,-1]
Whiskerbook_template <-Whiskerbook_template[,-ncol(Whiskerbook_template)]
Then, we will take only the unique records within this template.
Whiskerbook_template<-unique(Whiskerbook_template)
Then, remove all of the columns which are filled with only NA values. We do not need these columns if they have no information and the file can still be uploaded.
Whiskerbook_template<-Whiskerbook_template[,colSums(is.na(Whiskerbook_template))<nrow(Whiskerbook_template)]
Now our original Whiskerbook template is formatted so we can merge the Whiskerbook_template2 with our cast Encounter.mediaAsset filenames to it. To do this, we can merge the templates together using the merge function, and then select only the unique rows.
Whiskerbook<-merge(Whiskerbook_template,Whiskerbook_template2, by="Encounter.occurrenceID", all.x=FALSE, all.y=TRUE)
Whiskerbook<-unique(Whiskerbook)
Finally, we are left with a template with our occurrences with the filenames cast into rows. We will write this to file for batch import into Whiskerbook.
write.csv(Whiskerbook, "Whiskerbook.csv")
Key Points
Whiskerbook takes a specific format for data upload
Casting the file names into long format for Encounter.mediaAsset