This lesson is being piloted (Beta version)

Geospatial data and boundaries

Overview

Teaching: min
Exercises: min
Questions
  • How do I read shape files into R?

  • What is a coordinate reference system (CRS)?

  • What is my shape file’s coordinate reference system?

Objectives
  • Import geospatial files into your R environment

  • Identify the coordinate system for a dataset

  • Talk about when data don’t have a projection defined (missing .prj file)

  • Determine UTM zone of a dataset

  • Reproject the dataset into UTM

  • Visualize geospatial data with R

Introducing Spatial Data

Spatial data can be stored in many different ways, and an important part of using your farm’s data will involve understanding what format your data is already in and what format another program needs it to be in. During the course of this lesson, we’ll learn:

What is a CRS?

Geospatial data has a coordinate reference system (CRS) that reports how the map is projected and what point is used as a reference. A projection is a way of making the earth’s curved surface fit into something you can represent on a flat computer screen. The point used for reference during projection is called a datum.

Importance of Projections

When you cut an orange, it’s impossible to get the peel to lay flat (without really smashing it). The Earth, of course, behaves similarly, leading to distortion in the map. The kind of distortion can be chosen by using a different projection, or mathematical scheme for converting the round Earth to a flat map.

To understand how much projection matters, take a look at the difference between the Mercator projection of the world and the Boggs eumorphic projection.

Mercator Boggs eumorphic projection

In the Mercator projection, space that doesn’t exist is created to make a “flat” map. Greenland and Antarctica wind up disproportionately huge. In the Boggs projection, strategic slices are cut out of the ocean so that the sizes appear a bit closer to true, but Canada and Russia get pinched and Greenland gets bisected. There will always be some compromises made in a projection system that converts curved surfaces to flat ones for the same reason that it’s difficult to make an orange peel lie flat. So the method you select will have an effect on your outcome.

Understanding file types and CRS types

Geospatial data files have several potential variations, including:

Any of this information can come in any combination – you can have vector or raster data in any CRS in either a geopackage file or a collection of shape files.

The image on the left below is a vector-based representation of GIS data, with lines and curves; the image on the right is a raster-based representation, with individual values assigned to particular locations in a grid.

In order to make data from different sources work together, you’ll need to be able to identify what type of data in what CRS you have and how to convert them to the same type of thing—making sure you’re comparing apples to apples.

For example:

Loading Libraries

In the beginning of this workshop, we had you all install a bunch of packages that have a collection of functions we are going to use to plot our data. We want to tell R to load in all of these functions, and we can do that by using the library function:

library("rgdal")
library("plyr")
library("dplyr")
library("sp")
library("sf")
library("gstat")
library("tmap")
library("measurements")
library("daymetr")
library("FedData")
library("lubridate")
library("raster")
library("data.table")
library("broom")
library("ggplot2")

# load functions for this workshop
source('https://raw.githubusercontent.com/data-carpentry-for-agriculture/trial-lesson/gh-pages/_episodes_rmd/functions.R')

# you need your specific working directory here with something like:
#setwd(YOUR DIR)

This should generate a lot of output, and some of it will be warnings (in red), but that is to be expected.

Reading in the Boundary File

Before we can look at a CRS in R, we need to have a geospatial file in the R environment. We will bring in the field boundary. Use the function read_sf() to bring the dataset into your R environment. Because we have already set the working directory for this file, we don’t need to give the whole path, just the folder data that the gpkg file is stored within.

boundary <- read_sf("data/boundary.gpkg")

There are many functions for reading files into the environment, but read_sf() creates an object of class sf (simple feature). This class makes accessing spatial data much easier. Much like a data frame, you can access variables within an sf object using the $ operator. For this and other reasons like the number of spatial calculations available for sf objects, this class is perferred in most situations.

Check the coordinate reference system

The function for retreiving the CRS of a simple feature is st_crs(). Generally it is good practice to know the CRS of your files, but it becomes critical when combining files and performing operations on geospatial data. Some commands will not work if the data is in the wrong CRS or if two dataframes are in different CRSs.

st_crs(boundary)
Coordinate Reference System:
  EPSG: 4326 
  proj4string: "+proj=longlat +datum=WGS84 +no_defs"

The boundary file is projected in longitude and latitude using the WGS84 datum. This will be the CRS of most of the data you see.

Lost .prj files

Sometimes when looking at a shapefile, the .prj file may be lost. Then st_crs() will return empty, but sf objects still contain a geometry column geom. We can see the geometric points for the vertices of each polygon or the points in the data.

head(boundary$geom)
Geometry set for 1 feature 
geometry type:  POLYGON
dimension:      XY
bbox:           xmin: -82.87853 ymin: 40.83945 xmax: -82.87306 ymax: 40.8466
epsg (SRID):    4326
proj4string:    +proj=longlat +datum=WGS84 +no_defs
POLYGON ((-82.87319 40.84574, -82.87306 40.8398...

UTM Zones

Some coordinate reference systems, such as UTM zones, are measured in meters. Latitude and longitude represent a different type of CRS, defined in terms of angles across a sphere. If we want to create measures of distance, we need the trial design in UTM. But there are many UTM zones, so we must determine the zone of the trial area.

The UTM system divides the surface of Earth between 80°S and 84°N latitude into 60 zones, each 6° of longitude in width. Zone 1 covers longitude 180° to 174° W; zone numbering increases eastward to zone 60 that covers longitude 174 to 180 East.

st_transform and ESPG Codes

For reprojecting spatial data, the function st_transform() uses an ESPG code to transform a simple feature to the new CRS. The ESPG Geodetic Parameter Dataset is a public registry of spatial reference systems, Earth ellipsoids, coordinate transformations and related units of measurement. The ESPG is one way to assign or transform the CRS in R.

The ESPG for UTM always begins with “326” and the last numbers are the number of the zone. The ESPG for WGS84 is 4326. This is the projection your equipment reads, so any trial design files will need to be transformed back into WGS84 before you implement the trial. Also, all files from your machinery, such as yield, as-applied, and as-planted, will be reported in latitude and longitude with WGS84.

Transforming

The function st_transform_utm() transforms a simple feature that is in lat/long into UTM. This function is in the functions.R script, and is described there in more detail. Make sure that you have run source("functions.R") or you will not have the function in your global environment.

boundaryutm <- st_transform_utm(boundary)
st_crs(boundaryutm)

Joint Exercise: Exploring Geospatial Files

Instructors: ask the students how to do this first, then type along with explanation.

  1. Bring the file called "asplanted.gpkg" (from the data subdirectory of your WorkingDir) in your environment. Name the object planting. This file contains the planting information for 2017.
  2. Identify the CRS of the object.
  3. Look at the geometry features. What kind of geometric features are in this dataset?
  4. Transform the file to UTM or Lat/Long, depending on the current CRS.

Solution

planting <- read_sf("data/asplanted.gpkg")
st_crs(planting)
Coordinate Reference System:
  EPSG: 4326 
  proj4string: "+proj=longlat +datum=WGS84 +no_defs"
planting$geom
Geometry set for 6382 features 
geometry type:  POINT
dimension:      XY
bbox:           xmin: -82.87843 ymin: 40.83952 xmax: -82.87315 ymax: 40.84653
epsg (SRID):    4326
proj4string:    +proj=longlat +datum=WGS84 +no_defs
First 5 geometries:
POINT (-82.87829 40.83953)
POINT (-82.87828 40.83953)
POINT (-82.87828 40.83953)
POINT (-82.87827 40.83953)
POINT (-82.87825 40.83953)
plantingutm <- st_transform_utm(planting)
st_crs(plantingutm)
Coordinate Reference System:
  EPSG: 32617 
  proj4string: "+proj=utm +zone=17 +datum=WGS84 +units=m +no_defs"

Exercise Discussion

The cleaned planting file was in WGS84 initially. When we look at the geometry features, they are 6382 points defined in xand y coordinates. Using st_transform_utm() we create a new file called plantingutm with the CRS of UTM zone 17.

Save the file

Use st_write() to save an sf object. If you do not specify a directory, the working directory will be used. We include the object we are saving boundaryutm and the name we would like to give the saved file "boundary_utm.gpkg". Additionally, we specify the layer_options and update values to enable overwriting an existing file with the same name.

st_write(boundaryutm, "boundary_utm.gpkg", layer_options = 'OVERWRITE=YES')
Error in st_write(boundaryutm, "boundary_utm.gpkg", layer_options = "OVERWRITE=YES"): object 'boundaryutm' not found

The new .gpkg file will be visible in your working directory. (Check it out: Browse to your working directory in Windows File Explorer or Mac Finder and see the date and time on your new file.)

.gpkg vs. .shp files

You can save the file as a .gpkg or .shp file.

The advantage of a .gpkg file is that you only save one file rather than four files in a shapefile. Because shapefiles contain multiple files, they can be corrupted if one piece is missing. One example is a .prj file. If the .prj file is missing, the shapefile will have no CRS, and you will need to determine the CRS of the object. You will often need to transform a file from UTM to lat/long and save the new file during trial design, so this is an important step. One common problem with these files is that when you try to open a .gpkg file for the first time in R, it might not work if you haven’t opened it in QGIS before.

Visualizing the data

There are several ways to visualize spatial data. One way is to use plot() to look at the basic shape of the data:

plot(boundary$geom)

plot of chunk plotGeomFirst

Data Types

There are many different geospatial data file types in use, which can make processing a bit complicated. We have stuck to using gpkg in our work here because we feel there are fewer ways for the data to get corrupted (as by a lost shp file, for instance). Some other types you may see include:

Key Points

  • It is preferred to use sf for data analysis, making it easier to access the dataframe.

  • Projecting your data in utm is necessary for many of the geometric operations you perform (e.g. making trial grids and splitting plots into subplot data)

  • Different data formats that you are likely to encounter include gpkg, shp (cpg, dbf, prj, sbn, sbx), geojson, and tif.