This lesson is in the early stages of development (Alpha version)

# Introduction to Raster Data

## Overview

Teaching: 15 min
Exercises: 10 min
Questions
• What format should I use to represent my data?

• What are the main data types used for representing geospatial data?

• What are the main attributes of raster data?

Objectives
• Describe the difference between raster and vector data.

• Describe the strengths and weaknesses of storing data in raster format.

• Distinguish between continuous and categorical raster data and identify types of datasets that would be stored in each format.

This episode introduces the two primary types of geospatial data: rasters and vectors. After briefly introducing these data types, this episode focuses on raster data, describing some major features and types of raster data.

## Data Structures: Raster and Vector

The two primary types of geospatial data are raster and vector data. Raster data is stored as a grid of values which are rendered on a map as pixels. Each pixel value represents an area on the Earth’s surface. Vector data structures represent specific features on the Earth’s surface, and assign attributes to those features. Vector data structures will be discussed in more detail in the next episode.

The Python for Raster and Vector Data lesson will focus on how to work with both raster and vector data sets; it is therefore essential that we understand the basic structures of these types of data and the types of data that they can be used to represent.

## About Raster Data

Raster data is any pixelated (or gridded) data where each pixel is associated with a specific geographic location. The value of a pixel can be continuous (e.g. elevation) or categorical (e.g. land use). If this sounds familiar, it is because this data structure is very common: it’s how we represent any digital image. A geospatial raster is only different from a digital photo in that it is accompanied by spatial information that connects the data to a particular location. This includes the raster’s extent and cell size, the number of rows and columns, and its coordinate reference system (or CRS). Source: National Ecological Observatory Network (NEON)
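The idea of a raster as a grid of values plus spatial information can be sketched with a plain NumPy array and a few metadata fields. This is only an illustration: the cell values, extent, and cell size below are invented, not taken from a real dataset.

```python
import numpy as np

# A tiny 3 x 4 "raster": each cell holds a value (e.g. elevation in meters)
raster = np.array([[310.0, 312.5, 315.0, 316.2],
                   [308.1, 311.0, 314.3, 317.8],
                   [305.0, 309.9, 313.1, 318.4]])

# The spatial information that makes it *geospatial* (illustrative values)
metadata = {
    "extent": (-72.18, 42.53, -72.17, 42.54),  # (xmin, ymin, xmax, ymax)
    "cell_size": (0.0025, 0.0033),             # width, height of one pixel
    "rows": raster.shape[0],
    "cols": raster.shape[1],
    "crs": "EPSG:4326",                        # coordinate reference system
}

print(metadata["rows"], metadata["cols"])  # 3 4
```

A digital photo would have only the array; the `metadata` dictionary is what ties the grid to a location on the Earth's surface.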

Some examples of continuous rasters include:

1. Precipitation maps.
2. Maps of tree height derived from LiDAR data.
3. Elevation values for a region.

A map of elevation for Harvard Forest derived from the NEON AOP LiDAR sensor is below. Elevation is represented as a continuous numeric variable in this map. The legend shows the continuous range of values in the data from around 300 to 420 meters. Some rasters contain categorical data where each pixel represents a discrete class such as a landcover type (e.g., “forest” or “grassland”) rather than a continuous value such as elevation or temperature. Some examples of classified maps include:

1. Landcover / land-use maps.
2. Tree height maps classified as short, medium, and tall trees.
3. Elevation maps classified as low, medium, and high elevation.

The map above shows the contiguous United States with landcover as categorical data. Each color is a different landcover category. (Source: Homer, C.G., et al., 2015, Completion of the 2011 National Land Cover Database for the conterminous United States - Representing a decade of land cover change information. Photogrammetric Engineering and Remote Sensing, v. 81, no. 5, p. 345-354)

The map above shows elevation data for the NEON Harvard Forest field site. We will be working with data from this site later in the workshop. In this map, the elevation data (a continuous variable) has been divided into categories to yield a categorical raster.
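Turning a continuous raster into a categorical one is just a reclassification of cell values. A minimal sketch with NumPy, where the elevation values and the break points are invented for illustration:

```python
import numpy as np

# Continuous elevation values in meters (made up for illustration)
elevation = np.array([305.0, 330.0, 360.0, 395.0, 418.0])

# Classify into low / medium / high using two (made-up) breakpoints
breaks = [340.0, 380.0]
categories = np.digitize(elevation, breaks)  # 0 = low, 1 = medium, 2 = high

labels = np.array(["low", "medium", "high"])
print(labels[categories])  # ['low' 'low' 'medium' 'high' 'high']
```

The same `np.digitize` call works unchanged on a 2-D array, so classifying a whole raster is one line once the breaks are chosen.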

## Advantages and Disadvantages

With your neighbor, brainstorm potential advantages and disadvantages of storing data in raster format. Add your ideas to the Etherpad. The Instructor will discuss and add any points that weren’t brought up in the small group discussions.

## Solution

Raster data has some important advantages:

• representation of continuous surfaces
• potentially very high levels of detail
• data is ‘unweighted’ across its extent - the geometry doesn’t implicitly highlight features
• cell-by-cell calculations can be very fast and efficient

The downsides of raster data are:

• very large file sizes as cell size gets smaller
• currently popular formats don’t embed metadata well (more on this later!)
• can be difficult to represent complex information

## Important Attributes of Raster Data

### Extent

The spatial extent is the geographic area that the raster data covers. The spatial extent of an object represents the geographic edge or location that is the furthest north, south, east and west. In other words, extent represents the overall geographic coverage of the spatial object. (Image Source: National Ecological Observatory Network (NEON))
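The extent is simply the bounding box of the coordinates. For a set of point features it can be computed directly; the coordinates below are invented for illustration:

```python
# x, y coordinates of some hypothetical point features
points = [(2, 3), (5, 8), (9, 1), (4, 6)]

xs = [x for x, y in points]
ys = [y for x, y in points]

# Extent as (xmin, ymin, xmax, ymax): the furthest west, south, east, north
extent = (min(xs), min(ys), max(xs), max(ys))
print(extent)  # (2, 1, 9, 8)
```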

## Extent Challenge

In the image above, the dashed boxes around each set of objects seem to imply that the three objects have the same extent. Is this accurate? If not, which object(s) have a different extent?

## Solution

The lines and polygon objects have the same extent. The extent for the points object is smaller in the vertical direction than the other two because there are no points on the line at y = 8.

### Resolution

The resolution of a raster represents the area on the ground that each pixel of the raster covers. The image below illustrates the effect of changes in resolution. (Source: National Ecological Observatory Network (NEON))
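Resolution ties the pixel grid to the extent: divide the width and height of the extent by the number of columns and rows. A sketch with invented numbers:

```python
# Illustrative extent in meters (xmin, ymin, xmax, ymax) and grid shape
xmin, ymin, xmax, ymax = 731000.0, 4712000.0, 732000.0, 4713000.0
rows, cols = 1000, 1000

x_res = (xmax - xmin) / cols  # ground width of one pixel
y_res = (ymax - ymin) / rows  # ground height of one pixel
print(x_res, y_res)  # 1.0 1.0 -> each pixel covers 1 m x 1 m
```

Halving the cell size quadruples the number of cells needed to cover the same extent, which is why file size grows so quickly at fine resolutions.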

## Raster Data Format for this Workshop

Raster data can come in many different formats. For this workshop, we will use the GeoTIFF format which has the extension `.tif`. A `.tif` file stores metadata or attributes about the file as embedded `tif tags`. For instance, your camera might store a tag that describes the make and model of the camera or the date the photo was taken when it saves a `.tif`. A GeoTIFF is a standard `.tif` image format with additional spatial (georeferencing) information embedded in the file as tags. These tags should include the following raster metadata:

1. Extent
2. Resolution
3. Coordinate Reference System (CRS) - we will introduce this concept in a later episode
4. Values that represent missing data (`NoDataValue`) - we will introduce this concept in a later episode.

We will discuss these attributes in more detail in a later episode. In that episode, we will also learn how to use Python to extract raster attributes from a GeoTIFF file.

## Multi-band Raster Data

A raster can contain one or more bands. One type of multi-band raster dataset that is familiar to many of us is a color image. A basic color image consists of three bands: red, green, and blue. Each band represents light reflected from the red, green, or blue portion of the electromagnetic spectrum. The pixel brightness values for each band, when composited, create the colors that we see in an image. (Source: National Ecological Observatory Network (NEON).)

We can plot each band of a multi-band image individually.

Or we can composite all three bands together to make a color image.

In a multi-band dataset, the rasters will always have the same extent, resolution, and CRS.
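Conceptually, a three-band color image is just three aligned 2-D arrays stacked along a new axis. A NumPy sketch with invented pixel values:

```python
import numpy as np

# Three single-band rasters with identical shape (2 x 2), values 0-255
red   = np.array([[255, 0], [0, 30]], dtype=np.uint8)
green = np.array([[0, 255], [0, 30]], dtype=np.uint8)
blue  = np.array([[0, 0], [255, 30]], dtype=np.uint8)

# Composite into one (rows, cols, bands) array, the layout most
# plotting libraries expect for an RGB image
rgb = np.dstack([red, green, blue])
print(rgb.shape)   # (2, 2, 3)
print(rgb[0, 0])   # [255 0 0] -> a pure red pixel
```

The stacking only makes sense because the bands share the same extent, resolution, and CRS, which is exactly why multi-band datasets guarantee this.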

## Other Types of Multi-band Raster Data

Multi-band raster data might also contain:

1. Time series: the same variable, over the same area, over time. We will be working with time series data in the Plot Raster Data in Python episode.
2. Multi- or hyperspectral imagery: image rasters that have 4 or more (multispectral) or more than 10-15 (hyperspectral) bands. We won’t be working with this type of data in this workshop, but you can check out the NEON Data Skills Imaging Spectroscopy HDF5 in R tutorial if you’re interested in working with hyperspectral data cubes.

## Key Points

• Raster data is pixelated data where each pixel is associated with a specific location.

• Raster data always has an extent and a resolution.

• The extent is the geographical area covered by a raster.

• The resolution is the area covered by each pixel of a raster.

# Introduction to Vector Data

## Overview

Teaching: 10 min
Exercises: 5 min
Questions
• What are the main attributes of vector data?

Objectives
• Describe the strengths and weaknesses of storing data in vector format.

• Describe the three types of vectors and identify types of data that would be stored in each.

## About Vector Data

Vector data structures represent specific features on the Earth’s surface, and assign attributes to those features. Vectors are composed of discrete geometric locations (x, y values) known as vertices that define the shape of the spatial object. The organization of the vertices determines the type of vector that we are working with: point, line or polygon. Image Source: National Ecological Observatory Network (NEON)

• Points: Each point is defined by a single x, y coordinate. There can be many points in a vector point file. Examples of point data include: sampling locations, the location of individual trees, or the location of survey plots.

• Lines: Lines are composed of many (at least 2) points that are connected. For instance, a road or a stream may be represented by a line. This line is composed of a series of segments, each “bend” in the road or stream represents a vertex that has a defined x, y location.

• Polygons: A polygon consists of 3 or more vertices that are connected and closed. The outlines of survey plot boundaries, lakes, oceans, and states or countries are often represented by polygons.

## Data Tip

Sometimes boundary layers, such as states and countries, are stored as lines rather than polygons. However, these boundaries, when represented as a line, will not create a closed object with a defined area that can be filled.

## Identify Vector Types

The plot below includes examples of two of the three types of vector objects. Use the definitions above to identify which features are represented by which vector type.

## Solution

State boundaries are polygons. The Fisher Tower location is a point. There are no line features shown.

Vector data has some important advantages:

• The geometry itself contains information about what the dataset creator thought was important
• The geometry structures hold information in themselves - why choose point over polygon, for instance?
• Each geometry feature can carry multiple attributes instead of just one, e.g. a database of cities can have attributes for name, country, population, etc
• Data storage can be very efficient compared to rasters

The downsides of vector data include:

• Potential loss of detail compared to raster
• Potential bias in datasets - what didn’t get recorded?
• Calculations involving multiple vector layers need to do math on the geometry as well as the attributes, so can be slow compared to raster math.

Vector datasets are in use in many industries besides geospatial fields. For instance, computer graphics are largely vector-based, although the data structures in use tend to join points using arcs and complex curves rather than straight lines. Computer-aided design (CAD) is also vector- based. The difference is that geospatial datasets are accompanied by information tying their features to real-world locations.

## Vector Data Format for this Workshop

Like raster data, vector data can also come in many different formats. For this workshop, we will use the Shapefile format. The Shapefile format consists of multiple files in the same directory, of which the `.shp`, `.shx`, and `.dbf` files are mandatory. Other non-mandatory but very important files are the `.prj` and `.shp.xml` files.

• The `.shp` file stores the feature geometry itself
• `.shx` is a positional index of the feature geometry that allows quick forward and backward searching through the geographic coordinates of each vertex in the vector
• `.dbf` contains the tabular attributes for each shape.
• `.prj` file indicates the Coordinate reference system (CRS)
• `.shp.xml` contains the Shapefile metadata.

Together, the Shapefile includes the following information:

• Extent - the spatial extent of the shapefile (i.e. geographic area that the shapefile covers). The spatial extent for a shapefile represents the combined extent for all spatial objects in the shapefile.
• Object type - whether the shapefile includes points, lines, or polygons.
• Coordinate reference system (CRS)
• Other attributes - for example, a line shapefile that contains the locations of streams, might contain the name of each stream.

Because the structure of points, lines, and polygons are different, each individual shapefile can only contain one vector type (all points, all lines or all polygons). You will not find a mixture of point, line and polygon objects in a single shapefile.

## More Resources on Shapefiles

More about shapefiles can be found on Wikipedia. Shapefiles are often publicly available from government services, such as this page from the US Census Bureau or this one from Australia’s Data.gov.au website.

## Why not both?

Very few formats can contain both raster and vector data - in fact, most are even more restrictive than that. Vector datasets are usually locked to one geometry type, e.g. points only. Raster datasets can usually only encode one data type, for example you can’t have a multiband GeoTIFF where one layer is integer data and another is floating-point. There are sound reasons for this - format standards are easier to define and maintain, and so is metadata. The effects of particular data manipulations are more predictable if you are confident that all of your input data has the same characteristics.

## Key Points

• Vector data structures represent specific features on the Earth’s surface along with attributes of those features.

• Vector objects are either points, lines, or polygons.

# Coordinate Reference Systems

## Overview

Teaching: 15 min
Exercises: 10 min
Questions
• What is a coordinate reference system and how do I interpret one?

Objectives
• Name some common schemes for describing coordinate reference systems.

• Interpret a PROJ4 coordinate reference system description.

## Coordinate Reference Systems

A data structure cannot be considered geospatial unless it is accompanied by coordinate reference system (CRS) information, in a format that geospatial applications can use to display and manipulate the data correctly. CRS information connects data to the Earth’s surface using a mathematical model.

## CRS vs SRS

CRS (coordinate reference system) and SRS (spatial reference system) are synonyms and are commonly interchanged. We will use only CRS throughout this workshop.

The CRS associated with a dataset tells your mapping software (for example Python) where the raster is located in geographic space. It also tells the mapping software what method should be used to flatten or project the raster in geographic space. The above image shows maps of the United States in different projections. Notice the differences in shape associated with each projection. These differences are a direct result of the calculations used to flatten the data onto a 2-dimensional map. (Source: opennews.org)

There are lots of great resources that describe coordinate reference systems and projections in greater detail. For the purposes of this workshop, what is important to understand is that data from the same location but saved in different projections will not line up in any GIS or other program. Thus, it’s important when working with spatial data to identify the coordinate reference system applied to the data and retain it throughout data processing and analysis.

## Components of a CRS

CRS information has three components:

• Datum: A model of the shape of the earth. It has angular units (i.e. degrees) and defines the starting point (i.e. where is [0,0]?) so the angles reference a meaningful spot on the earth. Common global datums are WGS84 and NAD83. Datums can also be local - fit to a particular area of the globe, but ill-fitting outside the area of intended use. In this workshop, we will use the WGS84 datum.

• Projection: A mathematical transformation of the angular measurements on a round earth to a flat surface (i.e. paper or a computer screen). The units associated with a given projection are usually linear (feet, meters, etc.). In this workshop, we will see data in two different projections.

• Additional Parameters: Additional parameters are often necessary to create the full coordinate reference system. One common additional parameter is a definition of the center of the map. The number of required additional parameters depends on what is needed by each specific projection.

## Orange Peel Analogy

A common analogy employed to teach projections is the orange peel analogy. If you imagine that the Earth is an orange, how you peel it and then flatten the peel is similar to how projections get made.

• A datum is the choice of fruit to use. Is the Earth an orange, a lemon, a lime, a grapefruit?

• A projection is how you peel your orange and then flatten the peel.

• An additional parameter could include a definition of the location of the stem of the fruit. What other parameters could be included in this analogy?

## Which projection should I use?

To decide if a projection is right for your data, answer these questions:

• What is the area of minimal distortion?
• What aspect of the data does it preserve?

Peter Dana from the University of Colorado at Boulder and the Department of Geo-Information Processing has a good discussion of these aspects of projections. Online tools like Projection Wizard can also help you discover projections that might be a good fit for your data.

## Data Tip

Take the time to identify a projection that is suited for your project. You don’t have to stick to the ones that are popular.

## Describing Coordinate Reference Systems

There are several common systems in use for storing and transmitting CRS information, as well as translating among different CRSs. These systems generally comply with ISO 19111. Common systems for describing CRSs include EPSG, OGC WKT, and PROJ strings.

### EPSG

The EPSG system is a database of CRS information maintained by the International Association of Oil and Gas Producers. The dataset contains both CRS definitions and information on how to safely convert data from one CRS to another. Using EPSG is easy as every CRS has an integer identifier, e.g. WGS84 is EPSG:4326. The downside is that you can only use the CRSs defined by EPSG and cannot customise them (some datasets do not have EPSG codes). epsg.io is an excellent website for finding suitable projections by location or for finding information about a particular EPSG code.

### Well-Known Text

The Open Geospatial Consortium WKT standard is used by a number of important geospatial apps and software libraries. WKT is a nested list of geodetic parameters. The structure of the information is defined on their website. WKT is valuable because the CRS information is more transparent than an EPSG code, but it can be harder to read and compare than a PROJ string because it is designed to represent more complex CRS information. Additionally, the WKT standard is implemented inconsistently across various software platforms, and the spec itself has some known issues.

### PROJ

PROJ is an open-source library for storing, representing and transforming CRS information. PROJ strings continue to be used, but the format is deprecated by the PROJ C maintainers due to inaccuracies when converting to the WKT format. The data and Python libraries we will be working with in this workshop use different underlying representations of CRSs under the hood for reprojecting. CRS information can still be represented with EPSG, WKT, or PROJ strings without consequence, but it is best to use PROJ strings only as a format for viewing CRS information, not for reprojecting data.

PROJ represents CRS information as a text string of key-value pairs, which makes it easy to read and interpret.

A PROJ4 string includes the following information:

• proj: the projection of the data
• zone: the zone of the data (this is specific to the UTM projection)
• datum: the datum used
• units: the units for the coordinates of the data
• ellps: the ellipsoid (how the earth’s roundness is calculated) for the data

Note that the zone is unique to the UTM projection. Not all CRSs will have a zone. Image source: Chrismurf at English Wikipedia, via Wikimedia Commons (CC-BY).
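Because a PROJ string is just a series of `+key=value` pairs separated by spaces, it can be picked apart with a few lines of plain Python. The string below is illustrative; real workflows should use a library such as `pyproj` rather than hand parsing:

```python
proj_string = "+proj=utm +zone=18 +datum=WGS84 +units=m +no_defs +ellps=WGS84"

params = {}
for token in proj_string.split():
    key, _, value = token.lstrip("+").partition("=")
    params[key] = value  # flags like no_defs get an empty value

print(params["proj"], params["zone"], params["units"])  # utm 18 m
```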

## Reading a PROJ4 String

Here is a PROJ4 string for one of the datasets we will use in this workshop:

`+proj=utm +zone=18 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0`

• What projection, zone, datum, and ellipsoid are used for this data?
• What are the units of the data?
• Using the map above, what part of the United States was this data collected from?

## Solution

• Projection is UTM, zone 18, datum is WGS84, ellipsoid is WGS84.
• The data is in meters.
• The data comes from the eastern US seaboard.

## Format interoperability

Many existing file formats were invented by GIS software developers, often in a closed-source environment. This led to the large number of formats on offer today, and considerable problems transferring data between software environments. The Geospatial Data Abstraction Library (GDAL) is an open-source answer to this issue.

GDAL is a set of software tools that translate between almost any geospatial format in common use today (and some not so common ones). GDAL also contains tools for editing and manipulating both raster and vector files, including reprojecting data to different CRSs. GDAL can be used as a standalone command-line tool, or built in to other GIS software. Several open-source GIS programs use GDAL for all file import/export operations.

## Metadata

Spatial data is useless without metadata. Essential metadata includes the CRS information, but proper spatial metadata encompasses more than that. History and provenance of a dataset (how it was made), who is in charge of maintaining it, and appropriate (and inappropriate!) use cases should also be documented in metadata. This information should accompany a spatial dataset wherever it goes. In practice this can be difficult, as many spatial data formats don’t have a built-in place to hold this kind of information. Metadata often has to be stored in a companion file, and generated and maintained manually.

## Key Points

• All geospatial datasets (raster and vector) are associated with a specific coordinate reference system.

• A coordinate reference system includes datum, projection, and additional parameters specific to the dataset.

# The Geospatial Landscape

## Overview

Teaching: 10 min
Exercises: 0 min
Questions
• What programs and applications are available for working with geospatial data?

Objectives
• Describe the difference between various approaches to geospatial computing, and their relative strengths and weaknesses.

• Name some commonly used GIS applications.

• Name some commonly used Python packages that can access and process spatial data.

• Describe pros and cons for working with geospatial data using a command-line versus a graphical user interface.

## Standalone Software Packages

Most traditional GIS work is carried out in standalone applications that aim to provide end-to-end geospatial solutions. These applications are available under a wide range of licenses and price points. Some of the most common are listed below.

### Open-source software

The Open Source Geospatial Foundation (OSGEO) supports several actively managed GIS platforms:

• QGIS is a professional GIS application that is built on top of, and proud to be itself, Free and Open Source Software (FOSS). QGIS is written in C++ and Python, has a Python console interface, and has several interfaces written in R, including RQGIS.
• GRASS GIS, commonly referred to as GRASS (Geographic Resources Analysis Support System), is a FOSS-GIS software suite used for geospatial data management and analysis, image processing, graphics and maps production, spatial modeling, and visualization. GRASS GIS is currently used in academic and commercial settings around the world, as well as by many governmental agencies and environmental consulting companies. It is a founding member of the Open Source Geospatial Foundation (OSGeo). GRASS GIS can be installed along with and made accessible within QGIS 3.
• GDAL is a multiplatform set of tools for translating between geospatial data formats. It can also handle reprojection and a variety of geoprocessing tasks. GDAL is built in to many applications both FOSS and commercial, including GRASS and QGIS.
• SAGA-GIS, or System for Automated Geoscientific Analyses, is a FOSS-GIS application developed by a small team of researchers from the Dept. of Physical Geography, Göttingen, and the Dept. of Physical Geography, Hamburg. SAGA has been designed for an easy and effective implementation of spatial algorithms, offers a comprehensive, growing set of geoscientific methods, provides an easily approachable user interface with many visualisation options, and runs under Windows and Linux operating systems. Like GRASS GIS, it can also be installed and made accessible in QGIS3.
• PostGIS is a geospatial extension to the PostgreSQL relational database.

### Online + Cloud computing

• PANGEO is a community organization dedicated to open and reproducible data science with Python. They focus on the Pangeo software ecosystem for working with big data in the geosciences.
• Google has created Google Earth Engine which combines a multi-petabyte catalog of satellite imagery and geospatial datasets with planetary-scale analysis capabilities and makes it available for scientists, researchers, and developers to detect changes, map trends, and quantify differences on the Earth’s surface. Earth Engine API runs in both Python and JavaScript.
• ArcGIS Online provides access to thousands of maps and base layers.

Private companies have released SDK platforms for large scale GIS analysis:

• Kepler.gl is Uber’s toolkit for handling large datasets (i.e. Uber’s data archive).
• Boundless Geospatial is built upon OSGEO software for enterprise solutions.

Publicly funded open-source platforms for large scale GIS analysis:

• PANGEO for the Earth Sciences. This community organization also supports python libraries like xarray, iris, dask, jupyter, and many other packages.
• Sepal.io, by FAO Open Foris, uses EOS satellite imagery and cloud resources for global forest monitoring.

## GUI vs CLI

The earliest computer systems operated without a graphical user interface (GUI), relying only on the command-line interface (CLI). Since mapping and spatial analysis are strongly visual tasks, GIS applications benefited greatly from the emergence of GUIs and quickly came to rely heavily on them. Most modern GIS applications have very complex GUIs, with all common tools and procedures accessed via buttons and menus.

Benefits of using a GUI include:

• Tools are all laid out in front of you
• Complex commands are easy to build
• Don’t need to learn a coding language
• Cartography and visualisation is more intuitive and flexible

Downsides of using a GUI include:

• Low reproducibility - you can’t record your actions and replay
• Most are not designed for batch-processing files
• Limited ability to customise functions or write your own
• Intimidating interface for new users - so many buttons!

In scientific computing, the lack of reproducibility in point-and-click software has come to be viewed as a critical weakness. As such, scripted CLI-style workflows are again becoming popular, which leads us to another approach to doing GIS — via a programming language. This is the approach we will be using throughout this workshop.

## GIS in programming languages

A number of powerful geospatial processing libraries exist for general-purpose programming languages like Java and C++. However, the learning curve for these languages is steep and the effort required is excessive for users who only need a subset of their functionality.

Higher-level scripting languages like Python and R are easier to learn and use. Both now have their own packages that wrap up those geospatial processing libraries and make them easy to access and use safely. A key example is the Java Topology Suite (JTS), which is implemented in C++ as GEOS. GEOS is accessible in Python via the `shapely` package (and `geopandas`, which makes use of `shapely`) and in R via `sf`. R and Python also have interface packages for GDAL, and for specific GIS apps.

This last point is a huge advantage for GIS-by-programming; these interface packages give you the ability to access functions unique to particular programs, but have your entire workflow recorded in a central document - a document that can be re-run at will. Below are lists of some of the key spatial packages for Python, which we will be using in the remainder of this workshop.

• `geopandas` and `geocube` for working with vector data
• `rasterio` and `rioxarray` for working with raster data

These packages, along with the `matplotlib` package, are all we need for spatial data visualisation. Python also has many fundamental scientific packages that are relevant in the geospatial domain. `numpy`, `scipy`, and `scikit-image` are all excellent options for working with rasters as arrays.
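As an illustration of treating rasters as arrays, cell-by-cell math such as a normalized difference between two bands is a single vectorized expression in NumPy. The band values below are invented for illustration:

```python
import numpy as np

# Two aligned "bands" of reflectance values (made up for illustration)
nir = np.array([[0.6, 0.7], [0.8, 0.5]])
red = np.array([[0.2, 0.1], [0.2, 0.3]])

# A normalized difference index, computed for every cell at once
ndvi = (nir - red) / (nir + red)
print(ndvi)  # cell-by-cell results: 0.5, 0.75, 0.6, 0.25
```

Because the whole grid is processed in one array operation, this style of raster math is typically much faster than looping over cells.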

An overview of these and other Python spatial packages can be accessed here.

As a programming language, Python can be a CLI tool. However, using Python together with an Integrated Development Environment (IDE) application allows some GUI features to become part of your workflow. IDEs allow the best of both worlds. They provide a place to visually examine data and other software objects, interact with your file system, and draw plots and maps, but your activities are still command-driven: recordable and reproducible. There are several IDEs available for Python. JupyterLab is well-developed and the most widely used option for data science in Python. VSCode and Spyder are other popular options for data science.

Traditional GIS apps are also moving back towards providing a scripting environment for users, further blurring the CLI/GUI divide. ESRI have adopted Python into their software, and QGIS is both Python and R-friendly.

## GIS File Types

There are a variety of file types that are used in GIS analysis. Depending on the program you choose, some file types can be used while others cannot be read. Below is a brief table describing some of the most common vector and raster file types.

### Vector

| File Type | Extensions | Description |
| --- | --- | --- |
| Esri Shapefile | `.SHP` `.DBF` `.SHX` | The most common geospatial file type; it has become the industry standard. The three required files are: `.SHP`, the feature geometry; `.SHX`, the shape index position; and `.DBF`, the attribute data. |
| Geographic JavaScript Object Notation (GeoJSON) | `.GEOJSON` `.JSON` | Used for web-based mapping; stores the coordinates as text in JavaScript Object Notation. |
| Google Keyhole Markup Language (KML) | `.KML` `.KMZ` | KML stands for Keyhole Markup Language. This GIS format is XML-based and is primarily used for Google Earth. |
| OpenStreetMap | `.OSM` | OSM files are the native file format of OpenStreetMap, which has become the largest crowdsourced GIS data project in the world. These files are a collection of vector features from crowd-sourced contributions from the open community. |

### Raster

| File Type | Extensions | Description |
| --- | --- | --- |
| ERDAS Imagine | `.IMG` | A proprietary file format developed by Hexagon Geospatial. IMG files are commonly used to store single and multiple bands of satellite data. Each raster layer in an IMG file contains information about its data values, for example the projection, statistics, attributes, pyramids, and whether it is a continuous or discrete type of raster. |
| GeoTIFF | `.TIF` `.TIFF` `.OVR` | The GeoTIFF has become an industry-standard image file for GIS and satellite remote sensing applications. GeoTIFFs may be accompanied by other files: a `.TFW` world file that gives the raster its geolocation; optional `.XML` files that hold the metadata; `.AUX` auxiliary files that store projections and other information; and `.OVR` pyramid files that improve performance for raster display. |

## Key Points

• Many software packages exist for working with geospatial data.

• Command-line programs allow you to automate and reproduce your work.

• JupyterLab provides a user-friendly interface for working with Python.

# Intro to Raster Data in Python

## Overview

Teaching: 40 min
Exercises: 20 min
Questions
• What is a raster dataset?

• How do I work with and plot raster data in Python?

• How can I handle missing or bad data values for a raster?

Objectives
• Describe the fundamental attributes of a raster dataset.

• Explore raster attributes and metadata using Python.

• Read rasters into Python using the `rioxarray` package.

## Things You’ll Need To Complete This Episode

See the lesson homepage for detailed information about the software, data, and other prerequisites you will need to work through the examples in this episode.

In this episode, we will introduce the fundamental principles, packages and metadata/raster attributes that are needed to work with raster data in Python. We will discuss some of the core metadata elements that we need to understand to work with rasters, including Coordinate Reference Systems, no data values, and resolution. We will also explore missing and bad data values as stored in a raster and how Python handles these elements.

We will use one package in this episode to work with raster data: `rioxarray`, which is based on the popular `rasterio` package for working with rasters and `xarray` for working with multi-dimensional arrays. Make sure that you have `rioxarray` installed and imported.

``````import rioxarray
``````

## Introduce the Data

A brief introduction to the datasets can be found on the Geospatial workshop setup page.

For more detailed information about the datasets, check out the Geospatial workshop data page.

## Open a Raster and View Raster File Attributes

We will be working with a series of GeoTIFF files in this lesson. The GeoTIFF format contains a set of embedded tags with metadata about the raster data. We can use the function `rioxarray.open_rasterio()` to read the GeoTIFF file and then inspect this metadata. By calling the variable name in the Jupyter notebook we can get a quick look at the shape and attributes of the data.

``````surface_HARV = rioxarray.open_rasterio("data/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_dsmCrop.tif")
surface_HARV
``````
``````<xarray.DataArray (band: 1, y: 1367, x: 1697)>
[2319799 values with dtype=float64]
Coordinates:
* band         (band) int64 1
* y            (y) float64 4.714e+06 4.714e+06 ... 4.712e+06 4.712e+06
* x            (x) float64 7.315e+05 7.315e+05 ... 7.331e+05 7.331e+05
spatial_ref  int64 0
Attributes:
transform:     (1.0, 0.0, 731453.0, 0.0, -1.0, 4713838.0)
_FillValue:    -9999.0
scales:        (1.0,)
offsets:       (0.0,)
grid_mapping:  spatial_ref
``````

The first call to `rioxarray.open_rasterio()` opens the file and returns an `xarray.DataArray` that we store in a variable, `surface_HARV`. Reading in the data with `xarray` instead of `rioxarray` also returns an `xarray.DataArray`, but the output will not contain the geospatial metadata (such as projection information). You can use an `xarray.DataArray` in calculations just like a numpy array. Calling the variable name of the `DataArray` also prints out all of its metadata information.

The output tells us that we are looking at an `xarray.DataArray`, with `1` band, `1367` rows, and `1697` columns. We can also see the number of pixel values in the `DataArray`, and the type of those pixel values, which is floating point (`float64`). The `DataArray` also stores different values for its coordinates. When using `rioxarray`, the term coordinates refers to spatial coordinates like `x` and `y`, but also to the `band` coordinate. Each of these sequences of values has its own data type, like `float64` for the spatial coordinates and `int64` for the `band` coordinate. The `transform` represents the conversion between array coordinates (non-spatial) and spatial coordinates.

This `DataArray` object also has several attributes, accessed through the `.rio` accessor as `.rio.crs`, `.rio.nodata`, and `.rio.bounds()`, which contain the metadata for the file we opened. Note that most of the metadata are accessed as attributes without `()`, but `bounds()` is a method and needs parentheses.

``````print(surface_HARV.rio.crs)
print(surface_HARV.rio.nodata)
print(surface_HARV.rio.bounds())
print(surface_HARV.rio.width)
print(surface_HARV.rio.height)
``````
``````EPSG:32618
-9999.0
(731453.0, 4712471.0, 733150.0, 4713838.0)
1697
1367
``````

The Coordinate Reference System, or `surface_HARV.rio.crs`, is reported as the string `EPSG:32618`. The `nodata` value is encoded as -9999.0, and the bounding box corners of our raster are represented by the output of `.bounds()` as a `tuple` (like a list, but you can't edit it). The height and width match what we saw when we printed the `DataArray`, but by using `.rio.width` and `.rio.height` we can access these values if we need them in calculations.
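As a quick sanity check, the pixel size can be derived from the bounds and the width/height reported above; a minimal sketch using the numbers from this output:

```python
# Values taken from the .rio.bounds(), .rio.width, and .rio.height output above.
xmin, ymin, xmax, ymax = (731453.0, 4712471.0, 733150.0, 4713838.0)
width, height = 1697, 1367

pixel_width = (xmax - xmin) / width    # metres per pixel in x
pixel_height = (ymax - ymin) / height  # metres per pixel in y
print(pixel_width, pixel_height)       # 1.0 1.0
```

So each pixel of this raster covers a 1 m by 1 m area on the ground, which agrees with the `transform` values printed earlier.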

We will be exploring this data throughout this episode. By the end of this episode, you will be able to understand and explain the metadata output.

## Data Tip - Object names

To improve code readability, use file and object names that make it clear what is in the file. The data for this episode were collected from Harvard Forest, so we'll use a naming convention of `datatype_HARV`.

After viewing the attributes of our raster, we can examine the raw values of the array with `.values`:

``````surface_HARV.values
``````
``````array([[[408.76998901, 408.22998047, 406.52999878, ..., 345.05999756,
345.13998413, 344.97000122],
[407.04998779, 406.61999512, 404.97998047, ..., 345.20999146,
344.97000122, 345.13998413],
[407.05999756, 406.02999878, 403.54998779, ..., 345.07000732,
345.08999634, 345.17999268],
...,
[367.91000366, 370.19000244, 370.58999634, ..., 311.38998413,
310.44998169, 309.38998413],
[370.75997925, 371.50997925, 363.41000366, ..., 314.70999146,
309.25      , 312.01998901],
[369.95999146, 372.6000061 , 372.42999268, ..., 316.38998413,
309.86999512, 311.20999146]]])
``````

This can give us a quick view of the values of our array, but only at the corners. Since our raster is loaded in Python as a `DataArray` type, we can plot this in one line, similar to a pandas `DataFrame`, with `DataArray.plot()`.

``````surface_HARV.plot()
``````

Nice plot! Notice that `rioxarray` helpfully allows us to plot this raster with spatial coordinates on the x and y axes (this is not the default in many cases with other functions or libraries).

This map shows the elevation of our study site in Harvard Forest. From the legend, we can see that the maximum elevation is ~400, but we can’t tell whether this is 400 feet or 400 meters because the legend doesn’t show us the units. We can look at the metadata of our object to see what the units are. Much of the metadata that we’re interested in is part of the CRS, and it can be accessed with `.rio.crs`. We introduced the concept of a CRS in an earlier episode.

Now we will see how features of the CRS appear in our data file and what meanings they have.

### View Raster Coordinate Reference System (CRS) in Python

We can view the CRS string associated with our Python object using the `crs` attribute.

``````print(surface_HARV.rio.crs)
``````
``````EPSG:32618
``````

To just print the EPSG code number as an `int`, we use the `.to_epsg()` method:

``````print(surface_HARV.rio.crs.to_epsg())
``````
``````32618
``````

EPSG codes are great for succinctly representing a particular coordinate reference system. But what if we want to see information about the CRS? For that, we can use `pyproj`, a library for representing and working with coordinate reference systems.

``````from pyproj import CRS
epsg = surface_HARV.rio.crs.to_epsg()
crs = CRS(epsg)
crs
``````
``````<Projected CRS: EPSG:32618>
Name: WGS 84 / UTM zone 18N
Axis Info [cartesian]:
- E[east]: Easting (metre)
- N[north]: Northing (metre)
Area of Use:
- name: World - N hemisphere - 78°W to 72°W - by country
- bounds: (-78.0, 0.0, -72.0, 84.0)
Coordinate Operation:
- name: UTM zone 18N
- method: Transverse Mercator
Datum: World Geodetic System 1984
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
``````

The `CRS` class from the `pyproj` library allows us to create a `CRS` object with methods and attributes for accessing specific information about a CRS, or the detailed summary shown above.

A particularly useful attribute is `area_of_use`, which shows the geographic bounds within which the CRS is intended to be used.

``````crs.area_of_use
``````
``````AreaOfUse(name=World - N hemisphere - 78°W to 72°W - by country, west=-78.0, south=0.0, east=-72.0, north=84.0)
``````

## Challenge

What units are our data in? See if you can find a method to examine this information using `help(crs)` or `dir(crs)`.

## Answers

`crs.axis_info` tells us that the CRS for our raster has two axes, both in meters. We could also get this information from the attribute `surface_HARV.rio.crs.linear_units`.

## Understanding pyproj CRS Summary

Let's break down the pieces of the `pyproj` CRS summary. The summary contains all of the individual CRS elements that Python or another GIS might need, separated into distinct sections (name, axis info, area of use, coordinate operation, and datum).

### UTM pyproj summary

Our UTM projection is summarized as follows:

``````<Projected CRS: EPSG:32618>
Name: WGS 84 / UTM zone 18N
Axis Info [cartesian]:
- E[east]: Easting (metre)
- N[north]: Northing (metre)
Area of Use:
- name: World - N hemisphere - 78°W to 72°W - by country
- bounds: (-78.0, 0.0, -72.0, 84.0)
Coordinate Operation:
- name: UTM zone 18N
- method: Transverse Mercator
Datum: World Geodetic System 1984
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
``````
• The name of the projection is UTM zone 18N (UTM has 60 zones, each 6 degrees of longitude wide). The underlying datum is WGS 84.
• Axis Info: the CRS shows a Cartesian system with two axes, an easting and a northing, in meter units.
• Area of Use: the projection is used for a particular range of longitudes, `-78°W to 72°W`, in the northern hemisphere (`0.0°N to 84.0°N`).
• Coordinate Operation: the operation that projects the coordinates (if the CRS is projected) onto a Cartesian (x, y) plane. The Transverse Mercator projection is accurate for areas with longitudinal widths of a few degrees, hence the distinct UTM zones.
• Datum: details about the datum, or the reference point for coordinates. `WGS 84` and `NAD 1983` are common datums. `NAD 1983` is set to be replaced in 2022.
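The UTM zone for a given longitude can be sketched with the standard formula (this ignores the Norway/Svalbard exceptions); the longitude below is an approximate value for Harvard Forest, assumed for illustration:

```python
def utm_zone(longitude):
    """Standard UTM zone number for a longitude in decimal degrees.

    Zones are 6 degrees wide, numbered 1-60 starting at 180°W.
    (Ignores the special-case zones around Norway and Svalbard.)
    """
    return int((longitude + 180) // 6) + 1

# Harvard Forest is at roughly 72.17 degrees west:
print(utm_zone(-72.17))  # 18
```

This matches the `UTM zone 18N` reported in the CRS summary above (the N/S suffix comes from the hemisphere, not from this formula).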

Note that the zone is unique to the UTM projection. Not all CRSs will have a zone. Image source: Chrismurf at English Wikipedia, via Wikimedia Commons (CC-BY).

## Calculate Raster Min and Max Values

It is useful to know the minimum or maximum values of a raster dataset. In this case, given we are working with elevation data, these values represent the min/max elevation range at our site.

We can compute these and other descriptive statistics with `min`, `max`, `mean`, and `std`.

``````print(surface_HARV.min())
print(surface_HARV.max())
print(surface_HARV.mean())
print(surface_HARV.std())
``````
``````<xarray.DataArray ()>
array(305.07000732)
Coordinates:
spatial_ref  int64 0
<xarray.DataArray ()>
array(416.06997681)
Coordinates:
spatial_ref  int64 0
<xarray.DataArray ()>
array(359.85311803)
Coordinates:
spatial_ref  int64 0
<xarray.DataArray ()>
array(17.83168952)
Coordinates:
spatial_ref  int64 0
``````

The information above includes a report of the min, max, mean, and standard deviation values, along with the data type.

If we want to see specific quantiles, we can use xarray's `.quantile()` method. For example, for the 0.25 and 0.75 quantiles (the first and third quartiles):

``````print(surface_HARV.quantile([0.25, 0.75]))
``````
``````<xarray.DataArray (quantile: 2)>
array([345.58999634, 374.27999878])
Coordinates:
* quantile  (quantile) float64 0.25 0.75
``````

## Data Tip - NumPy methods

You could also get each of these values one by one using `numpy`.

``````import numpy
print(numpy.percentile(surface_HARV, 25))
print(numpy.percentile(surface_HARV, 75))
``````
``````345.5899963378906
374.2799987792969
``````

You may notice that `surface_HARV.quantile` and `numpy.percentile` didn't require an argument specifying the axis or dimension along which to compute the quantile. This is because `axis=None` is the default for most numpy functions, and therefore `dim=None` is the default for most xarray methods. It's always good to check out the docs on a function to see what the default arguments are, particularly when working with multi-dimensional image data. To do so, we can use `help(surface_HARV.quantile)`, or `surface_HARV.quantile?` if you are using a Jupyter notebook or JupyterLab.
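To see what `axis=None` means in practice, here is a small NumPy sketch comparing the default with an explicit axis:

```python
import numpy

arr = numpy.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])

# axis=None (the default) flattens the array before computing:
print(numpy.percentile(arr, 50))          # 3.5
# axis=0 computes one value per column instead:
print(numpy.percentile(arr, 50, axis=0))  # [2.5 3.5 4.5]
```

For a single-band raster the flattened default is usually what we want; for multi-band data, an explicit axis (or `dim` in xarray) lets us compute per-band statistics.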

We can see that the elevation at our site ranges from 305.0700073 m to 416.0699768 m.

## Raster Bands

The Digital Surface Model that we've been working with is a single-band raster. This means that there is only one dataset stored in the raster: surface elevation in meters for one time period. However, a raster dataset can contain one or more bands. We can view the number of bands in a raster by looking at the `.shape` attribute of the `DataArray`. The band number comes first when GeoTIFFs are read with the `rioxarray.open_rasterio()` function.

``````rgb_HARV = rioxarray.open_rasterio("data/NEON-DS-Airborne-Remote-Sensing/HARV/RGB_Imagery/HARV_RGB_Ortho.tif")
rgb_HARV.shape
``````
``````(3, 2317, 3073)
``````

It’s always a good idea to examine the shape of the raster array you are working with and make sure it’s what you expect. Many functions, especially ones that plot images, expect a raster array to have a particular shape.
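For example, a band-first array can be reordered to the band-last layout many plotting functions expect; a minimal NumPy sketch with a small dummy array standing in for a raster:

```python
import numpy

# A small dummy array standing in for a 3-band raster in (band, y, x)
# order, which is the order rioxarray uses when reading GeoTIFFs.
raster = numpy.zeros((3, 4, 5))

# Many image-display functions expect (y, x, band) instead:
image = numpy.transpose(raster, (1, 2, 0))
print(raster.shape, image.shape)  # (3, 4, 5) (4, 5, 3)
```

Checking (and, if needed, reordering) the axes this way avoids the common "image looks scrambled" symptom of passing a band-first array to a band-last plotting function.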

Jump to a later episode in this series for information on working with multi-band rasters: Work with Multi-Band Rasters in Python.

## Dealing with Missing Data

Raster data often has a “no data value” associated with it and for raster datasets read in by `rioxarray` this value is referred to as `nodata`. This is a value assigned to pixels where data is missing or no data were collected. However, there can be different cases that cause missing data, and it’s common for other values in a raster to represent different cases. The most common example is missing data at the edges of rasters.

By default the shape of a raster is always rectangular. So if we have a dataset that has a shape that isn’t rectangular, some pixels at the edge of the raster will have no data values. This often happens when the data were collected by a sensor which only flew over some part of a defined region.

In the RGB image below, the pixels that are black have no data values. The sensor did not collect data in these areas. When the dataset is read, `rioxarray` assigns the missing-data number from the file's own metadata to the `.rio.nodata` attribute. This GeoTIFF's `nodata` attribute is set to the value `-1.7e+308`, and in order to run calculations on this image that ignore these edge values, or to plot the image without the no data values appearing on the color scale, `rioxarray` masks them out.

``````# The imshow() function in the pyplot module of the matplotlib library is used to display data as an image.
rgb_HARV.plot.imshow()
``````

From this plot we see something interesting: while our no data values were masked along the edges, the color channels' no data values don't all line up. The colored pixels at the edges between white and black result from there being no data in one or two bands at a given pixel. `0` could conceivably represent a valid value for reflectance (the units of our pixel values), so it's good to make sure we are masking values at the edges and not valid data values within the image.

While this plot tells us where we have no data values, the color scale looks strange because our plotting function expects image values to be normalized to a certain range (0-1 or 0-255). By using `rgb_HARV.plot.imshow` with the `robust=True` argument, we can display values between the 2nd and 98th percentile, providing better color contrast.

``````rgb_HARV.plot.imshow(robust=True)
``````

The value that is conventionally used to mark missing data (the no data value) varies by raster data type. For floating-point rasters, the value `-3.4e+38` is a common default, and for integers, `-9999` is common. Some disciplines have specific conventions that vary from these common values.

In some cases, other `nodata` values may be more appropriate. A `nodata` value should be (a) outside the range of valid values, and (b) a value that fits the data type in use. For instance, if your data range continuously from -20 to 100, 0 is not an acceptable `nodata` value! Or, for categories numbered 1-15, 0 might be fine for `nodata`, but using -.000003 will force you to save the GeoTIFF on disk as a floating-point raster, resulting in a bigger file.

## Key Points

• The GeoTIFF file format includes metadata about the raster data.

• `rioxarray` stores CRS information as a CRS object that can be converted to an EPSG code or PROJ4 string.

• The GeoTIFF file may or may not store the correct no data value(s).

• We can find the correct value(s) in the raster’s external metadata or by plotting the raster.

• `rioxarray` and `xarray` are for working with multidimensional arrays like pandas is for working with tabular data with many columns.

# Reproject Raster Data with Rioxarray

## Overview

Teaching: 60 min
Exercises: 20 min
Questions
• How do I call a DataArray to print out its metadata information?

• How do I work with raster data sets that are in different projections?

Objectives
• Reproject a raster in Python using rasterio.

• Accomplish the same task with rioxarray and xarray.

## Things You’ll Need To Complete This Episode

See the lesson homepage for detailed information about the software, data, and other prerequisites you will need to work through the examples in this episode.

Sometimes we encounter raster datasets that do not “line up” when plotted or analyzed. Rasters that don’t line up are most often in different Coordinate Reference Systems (CRS), otherwise known as “projections”. This episode explains how to line up rasters in different, known CRSs.

## Raster Projection in Python

If you loaded two rasters with different projections in QGIS or ArcGIS, you'd see that they align, since these programs reproject "on-the-fly". But in R or Python, you'll need to reproject your data yourself in order to plot or use these rasters together in calculations.

For this episode, we will be working with the Harvard Forest Digital Terrain Model (DTM). This differs from the surface model data we’ve been working with so far in that the digital surface model (DSM) includes the tops of trees, while the digital terrain model (DTM) shows the ground level beneath the tree canopy.

Our goal is to get these data into the same projection with the `rioxarray.reproject_match()` function so that we can use both rasters to calculate tree canopy height, also called a Canopy Height Model (CHM).

First, we need to read in the DSM and DTM rasters.

``````import rioxarray

surface_HARV = rioxarray.open_rasterio("data/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_dsmCrop.tif")
terrain_HARV = rioxarray.open_rasterio("data/NEON-DS-Airborne-Remote-Sensing/HARV/DTM/HARV_dtmCrop_WGS84.tif")

surface_HARV
``````
``````xarray.DataArray band: 1, y: 1367, x: 1697
[2319799 values with dtype=float64]
Coordinates:
band       (band)   int64    1
y          (y)    float64   4.714e+06 4.714e+06 ... 4.712e+06
x          (x)    float64   7.315e+05 7.315e+05 ... 7.331e+05
spatial_ref ()      int64    0
Attributes:
STATISTICS_MAXIMUM: 416.06997680664
STATISTICS_MEAN: 359.85311802914
STATISTICS_MINIMUM: 305.07000732422
STATISTICS_STDDEV: 17.83169335933
_FillValue: -9999.0
scale_factor: 1.0
add_offset: 0.0
grid_mapping: spatial_ref
``````

To read the spatial reference in the output, click the "Show/Hide attributes" icon on the right side of the `spatial_ref` row. You can also print the well-known text (WKT) projection string.

``````surface_HARV.rio.crs.wkt
``````
``````'PROJCS["WGS 84 / UTM zone 18N",GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4326"]],PROJECTION["Transverse_Mercator"],PARAMETER["latitude_of_origin",0],PARAMETER["central_meridian",-75],PARAMETER["scale_factor",0.9996],PARAMETER["false_easting",500000],PARAMETER["false_northing",0],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH],AUTHORITY["EPSG","32618"]]'
``````

We can see the projection and datum are UTM zone 18N and WGS 84, respectively. UTM zone 18N is a regional projection with an associated coordinate system to more accurately capture distance, shape and/or area around the Harvard Forest.

``````terrain_HARV.rio.crs.wkt
``````
``````'GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AXIS["Latitude",NORTH],AXIS["Longitude",EAST],AUTHORITY["EPSG","4326"]]'
``````

We see the DTM is in an unprojected geographic coordinate system, using WGS84 as the datum and a coordinate system that spans the entire planet (i.e. latitude and longitude). This means that every location on the planet is defined using the SAME coordinate system and the same units. Geographic coordinate reference systems are best for global analysis but not for capturing distance, shape and/or area on a local scale.
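To see the difference between the two systems concretely, `pyproj` can convert a longitude/latitude pair into UTM zone 18N easting/northing; the coordinates below are an approximate, assumed location near Harvard Forest:

```python
from pyproj import Transformer

# Build a transformer from geographic WGS 84 (EPSG:4326) to
# UTM zone 18N (EPSG:32618). always_xy=True keeps (lon, lat) order.
transformer = Transformer.from_crs("EPSG:4326", "EPSG:32618", always_xy=True)

# Approximate Harvard Forest coordinates (an assumption for illustration):
easting, northing = transformer.transform(-72.17, 42.54)
print(round(easting), round(northing))
```

The result should land in the same ballpark as the `x` and `y` coordinates of `surface_HARV` (eastings around 7.3e+05, northings around 4.71e+06), since both describe the same place in the projected system.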

We can use the CRS of one of our datasets to reproject the other dataset so that they are both in the same projection. The only required argument of `reproject_match` is the raster whose CRS and extent we want to match.

``````terrain_HARV_UTM18 = terrain_HARV.rio.reproject_match(surface_HARV)

terrain_HARV_UTM18
``````
``````xarray.DataArray band: 1, y: 1492, x: 1801
array([[[-9999., -9999., -9999., ..., -9999., -9999., -9999.],
[-9999., -9999., -9999., ..., -9999., -9999., -9999.],
[-9999., -9999., -9999., ..., -9999., -9999., -9999.],
...,
[-9999., -9999., -9999., ..., -9999., -9999., -9999.],
[-9999., -9999., -9999., ..., -9999., -9999., -9999.],
[-9999., -9999., -9999., ..., -9999., -9999., -9999.]]])
Coordinates:
x            (x)       float64       7.314e+05 7.314e+05 ... 7.332e+05
y            (y)       float64       4.714e+06 4.714e+06 ... 4.712e+06
band       (band)       int64        1
spatial_ref  ()         int64        0
Attributes:
scale_factor: 1.0
add_offset: 0.0
grid_mapping: spatial_ref
_FillValue: -9999.0
``````

In one line `reproject_match` does a lot of helpful things:

1. It reprojects `terrain_HARV` from WGS 84 to UTM zone 18N.
2. Where `terrain_HARV` has data values and `surface_HARV` does not, the result `terrain_HARV_UTM18` is clipped. Where `surface_HARV` has data values and `terrain_HARV` does not, the result `terrain_HARV_UTM18` is padded with no data values to match the extent.
3. It sets the no data value of `terrain_HARV` to the no data value of `surface_HARV`.

## Code Tip

There also exists a method called `reproject()`, which only reprojects one raster to another projection. If you want more control over how rasters are resampled, clipped, and/or reprojected, you can use the `reproject()` method and other `rioxarray` methods individually.

We can also save our DataArray that we created with `rioxarray` to a file.

``````reprojected_path = "data/NEON-DS-Airborne-Remote-Sensing/HARV/DTM/HARV_dtmCrop_UTM18.tif"
terrain_HARV_UTM18.rio.to_raster(reprojected_path)
``````

## Exercise

Inspect the metadata for `terrain_HARV_UTM18` and `surface_HARV`. Are the projections the same? Are the extents the same? Are the no data values the same? How might projections, extents, and no data values affect calculations we make between arrays?

## Solution

``````# view crs for DTM
print(terrain_HARV_UTM18.rio.crs)

# view crs for DSM
print(surface_HARV.rio.crs)
``````
``````EPSG:32618
EPSG:32618
``````

Good, the CRSs are the same. But, before we plot or calculate both of these DataArrays together, we should make sure they have the same nodata value.

``````# view nodata value for DTM
print(terrain_HARV_UTM18.rio.nodata)

# view nodata value for DSM
print(surface_HARV.rio.nodata)
``````
``````-9999.0
-9999.0
``````

Furthermore, let’s make sure both of these DataArrays have the same shape (i.e. extent).

``````# view shape for DTM
print(terrain_HARV_UTM18.shape)

# view shape for DSM
print(surface_HARV.shape)
``````
``````(1, 1367, 1697)
(1, 1367, 1697)
``````

The shapes and projections are the same which means these data cover the same locations. The no data values are also the same. This means we can run calculations on these two DataArrays.

Let’s plot our handiwork so far! We can use the `xarray.DataArray.plot` function to show the DTM. But if we run the following code, something doesn’t look right …

``````terrain_HARV_UTM18.plot(cmap="viridis")
``````

## Challenge

Whoops! What did we forget to do to the DTM DataArray before plotting?

## Answers

Our array has a `nodata` value, `-9999.0`, which causes the color of our plot to be stretched over too wide a range. We’d like to only display valid values, so before plotting we can filter out the nodata values using the `where()` function and the `.rio.nodata` attribute of our DataArray.

``````terrain_HARV_UTM18_valid = terrain_HARV_UTM18.where(
terrain_HARV_UTM18 != terrain_HARV_UTM18.rio.nodata)
terrain_HARV_UTM18_valid.plot(cmap="viridis")
``````

If we had saved `terrain_HARV_UTM18` to a file and then read it in with `open_rasterio`'s `masked=True` argument, the raster's `nodata` value would be masked and we would not need to use the `where()` function to do the masking before plotting.

## Plotting Tip

There are many ways to improve this plot. Matplotlib offers lots of different functions to change the position and appearance of plot elements. To plot with Matplotlib, you need to import the `pyplot` module. Something that would really improve our figure is adding a title. This can be done with the `plt.title()` function.

Try importing Matplotlib and adding a title to the figure.

## Importing `pyplot` and adding a title

Here’s how we can use `pyplot` functions to modify elements in our graph.

``````import matplotlib.pyplot as plt
terrain_HARV_UTM18_valid.plot()
plt.title("Harvard Forest Digital Terrain Model")
``````

Because `xarray` uses Matplotlib under the hood, we don't need to modify our original plotting method.

## Customizing plots with Matplotlib

Now that you’ve added a title to your plot, look for other ways to customize your plot with Matplotlib. One possible way to quickly customize a plot is with the `plt.style.use()` function. You can check available styles with `plt.style.available`.

Another useful function for the plots we are making is `plt.ticklabel_format(style="plain")`. This will ensure that our ticks are not truncated, making our plot nicer.

Try customizing your plot with the functions above or any other `pyplot` parameter.

## Styles and formatting

Here is the result of using a ggplot-like style for our digital terrain plot.

``````plt.style.use("ggplot")
terrain_HARV_UTM18_valid.plot()
plt.title("Harvard Forest Digital Terrain Model")
plt.ticklabel_format(style="plain")
``````

Notice that `plt.style.use()` comes before the `.plot()` call, while `plt.title()` and `plt.ticklabel_format()` come after it. This is because `plt.style.use()` is a `pyplot`-wide setting, while the latter two functions apply only to the current figure.

Quick tip: use the `plt.title` and `plt.ticklabel_format` functions for all of the remaining plots in this lesson.

## Challenge: Reproject, then Plot a Digital Terrain Model

Create two maps in a UTM projection of the San Joaquin Experimental Range field site, using the `SJER_dtmCrop.tif` and `SJER_dsmCrop_WGS84.tif` files. Use `rioxarray` and `matplotlib.pyplot` (to add a title). Reproject the data as necessary to make sure each map is in the same UTM projection, and save the reprojected file with the file name "data/NEON-DS-Airborne-Remote-Sensing/SJER/DSM/SJER_dsmCrop_WGS84.tif".

## Answers

If we read in these files with the argument `masked=True`, then the nodata values will be masked automatically and set to `numpy.nan`, or Not a Number. This can make plotting easier since only valid raster values will be shown. However, it’s important to remember that `numpy.nan` values still take up space in our raster just like `nodata` values, and thus they still affect the shape of the raster. Rasters need to be the same shape for raster math to work in Python. In the next lesson, we will examine how to prepare rasters of different shapes for calculations.
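A small NumPy sketch of this behavior: `nan` cells keep their place in the grid, plain statistics propagate them, and nan-aware functions skip them:

```python
import numpy

# A 2x2 grid with one missing cell, standing in for a masked raster.
grid = numpy.array([[numpy.nan, 2.0],
                    [3.0, 4.0]])

# The nan cell still occupies a grid position, so the shape is unchanged:
print(grid.shape)           # (2, 2)

# Plain statistics propagate nan, while nan-aware ones skip it:
print(numpy.mean(grid))     # nan
print(numpy.nanmean(grid))  # 3.0
```

This is why two masked rasters must still have matching shapes before raster math: masking changes the values, not the grid.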

``````terrain_SJER = rioxarray.open_rasterio("data/NEON-DS-Airborne-Remote-Sensing/SJER/DTM/SJER_dtmCrop.tif", masked=True)
surface_SJER = rioxarray.open_rasterio("data/NEON-DS-Airborne-Remote-Sensing/SJER/DSM/SJER_dsmCrop_WGS84.tif", masked=True)
reprojected_surface_model = surface_SJER.rio.reproject(dst_crs=terrain_SJER.rio.crs)
plt.figure()
reprojected_surface_model.plot()
plt.title("SJER Reprojected Surface Model")
reprojected_surface_model.rio.to_raster("data/NEON-DS-Airborne-Remote-Sensing/SJER/DSM/SJER_dsmCrop_WGS84.tif")
plt.figure()
terrain_SJER.plot()
plt.title("SJER Terrain Model")
``````

## Key Points

• In order to plot or do calculations with two raster data sets, they must be in the same CRS.

• rioxarray and xarray provide simple syntax for accomplishing fundamental geospatial operations.

• rioxarray is built on top of rasterio, and you can use rasterio directly to accomplish fundamental geospatial operations.

# Raster Calculations in Python

## Overview

Teaching: 40 min
Exercises: 20 min
Questions
• How do I subtract one raster from another and extract pixel values for defined locations?

Objectives
• Perform a subtraction between two rasters using Python’s built-in math operators to generate a Canopy Height Model (CHM).

• Reclassify a continuous raster to a categorical raster using the CHM values.

## Things You’ll Need To Complete This Episode

See the lesson homepage for detailed information about the software, data, and other prerequisites you will need to work through the examples in this episode.

We often want to combine values of and perform calculations on rasters to create a new output raster. This episode covers how to subtract one raster from another using basic raster math. It also covers how to extract pixel values from a set of locations - for example, a buffer region around locations at a field site.

## Raster Calculations in Python & Canopy Height Models

We often want to perform calculations on two or more rasters to create a new output raster. For example, suppose we are interested in mapping the heights of trees across an entire field site. In that case, we might want to calculate the difference between the Digital Surface Model (DSM, tops of trees) and the Digital Terrain Model (DTM, ground level). The resulting dataset is referred to as a Canopy Height Model (CHM) and represents the actual height of trees, buildings, etc. with the influence of ground elevation removed.

### Load the Data

For this episode, we will use the DTM and DSM from the NEON Harvard Forest Field site and San Joaquin Experimental Range, which we already have loaded from previous episodes. Let’s load them again with `open_rasterio` using the argument `masked=True`.

``````import rioxarray

surface_HARV = rioxarray.open_rasterio("data/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_dsmCrop.tif", masked=True)
terrain_HARV_UTM18 = rioxarray.open_rasterio("data/NEON-DS-Airborne-Remote-Sensing/HARV/DTM/HARV_dtmCrop_UTM18.tif", masked=True)
``````

## Raster Math

We can perform raster calculations by subtracting (or adding, multiplying, etc.) two rasters. In the geospatial world, we call this “raster math”, and typically it refers to operations on rasters that have the same width and height (including `nodata` pixels). We saw from the last episode’s Challenge this is not the case with our DTM and DSM. Even though the `reproject` function gets our rasters into the same CRS, they have slightly different extents. We can now use the `reproject_match` function, which both reprojects and clips a raster to the CRS and extent of another raster.

``````terrain_HARV_matched = terrain_HARV_UTM18.rio.reproject_match(surface_HARV)
``````

We could have used `reproject_match` on the original DTM, “HARV_dtmCrop_WGS84.tif”. Doing so would have meant one fewer interpolation of the DTM during reprojection, though this has a negligible impact on the data for our purposes.

Let’s subtract the DTM from the DSM to create a Canopy Height Model (CHM). We’ll use `rioxarray` so that we can easily plot our result and keep track of the metadata for our CHM.

``````canopy_HARV = surface_HARV - terrain_HARV_matched
canopy_HARV.compute()
``````
``````xarray.DataArray band: 1, y: 1367, x: 1697
array([[[ 1.93699951e+01,  1.86799927e+01,  1.70500183e+01, ...,
nan,  1.69982910e-01, -1.60003662e-01],
[ 1.76499939e+01,  1.71700134e+01,  1.56299744e+01, ...,
0.00000000e+00,  0.00000000e+00,  9.97924805e-03],
[ 1.81000061e+01,  1.68399963e+01,  1.43200073e+01, ...,
0.00000000e+00,  1.00006104e-01,  7.99865723e-02],
...,
[ 2.34400024e+01,  2.56800232e+01,  2.60599976e+01, ...,
1.98999023e+00,  1.09997559e+00,  2.09991455e-01],
[ 2.63299866e+01,  2.70399780e+01,  1.88900146e+01, ...,
5.44000244e+00,  0.00000000e+00,  2.72000122e+00],
[ 2.55499878e+01,  2.81300049e+01,  2.78999939e+01, ...,
nan,             nan,  1.85998535e+00]]])
Coordinates:
band         (band)      int64       1
y            (y)         float64     4.714e+06 4.714e+06 ... 4.712e+06
x            (x)         float64     7.315e+05 7.315e+05 ... 7.331e+05
spatial_ref  ()          int64       0
``````

We can now plot the output CHM. If we use the argument `robust=True`, our plot’s color values are stretched between the 2nd and 98th percentiles of the data, which results in clearer distinctions between forested and non-forested areas.

``````import matplotlib.pyplot as plt # in case it has not been imported recently
canopy_HARV.plot(cmap="viridis")
plt.title("Canopy Height Model for Harvard Forest, Z Units: Meters")
plt.ticklabel_format(style="plain") # use this if the title overlaps the scientific notation of the y axis
``````

Notice that the range of values for the output CHM is between 0 and 30 meters. Does this make sense for trees in Harvard Forest?

Maps are great, but it can also be informative to plot histograms of values to better understand their distribution. We can accomplish this using the xarray plotting interface we have already been using, via `plot.hist()`.

``````plt.figure()
plt.style.use('ggplot') # adds a style to improve the aesthetics
canopy_HARV.plot.hist()
plt.title("Histogram of Canopy Height in Meters")
``````

## Challenge: Explore CHM Raster Values

It’s often a good idea to explore the range of values in a raster dataset just like we might explore a dataset that we collected in the field. The histogram we just made is a good start but there’s more we can do to improve our understanding of the data.

1. What is the min and maximum value for the Harvard Forest Canopy Height Model (`canopy_HARV`) that we just created?
2. Plot a histogram with 50 bins instead of 8. What do you notice that wasn’t clear before?
3. Plot the `canopy_HARV` raster using breaks that make sense for the data. Include an appropriate color palette for the data, plot title and no axes ticks / labels.

## Answers

1) Recall, if there were nodata values in our raster like `-9999.0`, we would need to filter them out with `.where()`.

``````canopy_HARV.min().values
canopy_HARV.max().values
``````
``````array(-1.)
array(38.16998291)
``````

2) Increasing the number of bins gives us a much clearer view of the distribution.

``````canopy_HARV.plot.hist(bins=50)
``````

## Classifying Continuous Rasters in Python

Now that we have a sense of the distribution of our canopy height raster, we can reduce the complexity of our map by classifying it. Classification involves sorting raster values into unique classes, and in Python, we can accomplish this using the `numpy.digitize` function.

``````import numpy as np

# Defines the bins for pixel values
class_bins = [canopy_HARV.min().values, 2, 10, 20, np.inf]

# Classifies the original canopy height model array
canopy_height_classes = np.digitize(canopy_HARV, class_bins)
type(canopy_height_classes)
``````
``````numpy.ndarray
``````

The result is a `numpy.ndarray`, but we can put this into a DataArray along with the spatial metadata from our `canopy_HARV`, so that our resulting plot shows the spatial coordinates.

``````import xarray
from matplotlib.colors import ListedColormap
import earthpy.plot as ep

# Define color map of the map legend
height_colors = ["gray", "y", "yellowgreen", "g", "darkgreen"]
height_cmap = ListedColormap(height_colors)

# Define class names for the legend
category_names = [
"No Vegetation",
"Bare Area",
"Low Canopy",
"Medium Canopy",
"Tall Canopy",
]

# we need to know in what order the legend items should be arranged
category_indices = list(range(len(category_names)))

# we put the numpy array in a xarray DataArray so that the plot is made with coordinates
canopy_height_classified = xarray.DataArray(canopy_height_classes, coords = canopy_HARV.coords)

#Making the plot
plt.style.use("default")
plt.figure()
im = canopy_height_classified.plot(cmap=height_cmap, add_colorbar=False)
ep.draw_legend(im_ax=im, classes = category_indices, titles=category_names) # earthpy helps us by drawing a legend given an existing image plot and legend items, plus indices
plt.title("Classified Canopy Height Model - NEON Harvard Forest Field Site")
plt.ticklabel_format(style="plain")
``````

## Reassigning Geospatial Metadata and Exporting a GeoTiff

When we computed the CHM, the output no longer contains a reference to a nodata value, such as `-9999.0`, which was associated with the DTM and DSM. Some operations, like `numpy.digitize`, can remove all geospatial metadata. Of the metadata that can be lost, the CRS and the nodata value are particularly important to keep track of. Before we export the product of our calculation to a GeoTiff with the `to_raster` function, we need to reassign this metadata.

``````canopy_HARV.rio.write_crs(surface_HARV.rio.crs, inplace=True)
canopy_HARV.rio.set_nodata(-9999.0, inplace=True)
``````

When we write this raster object to a GeoTiff file we’ll name it `CHM_HARV.tif`. This name allows us to quickly remember both what the data contains (CHM data) and for where (HARVard Forest). The `to_raster()` function by default writes the output file to your working directory unless you specify a full file path.

``````import os
os.makedirs("./data/outputs/", exist_ok=True)
canopy_HARV.rio.to_raster("./data/outputs/CHM_HARV.tif")
``````

## Challenge: Explore the NEON San Joaquin Experimental Range Field Site

Data are often more interesting and powerful when we compare them across various locations. Let’s compare some data collected over Harvard Forest to data collected in Southern California. The NEON San Joaquin Experimental Range (SJER) field site located in Southern California has a very different ecosystem and climate than the NEON Harvard Forest Field Site in Massachusetts.

Import the SJER DSM and DTM raster files and create a Canopy Height Model. Then compare the two sites. Be sure to name your Python objects and outputs carefully, as follows: objectType_SJER (e.g. `surface_SJER`). This will help you keep track of data from different sites!

1. You should have the DSM and DTM data for the SJER site already loaded from the Reproject Raster Data with Rioxarray episode. Don’t forget to check the CRSs and units of the data.
2. Create a CHM from the two raster layers and check to make sure the data are what you expect.
3. Plot the CHM from SJER.
4. Export the SJER CHM as a GeoTiff.
5. Compare the vegetation structure of the Harvard Forest and San Joaquin Experimental Range.

## Answers

1) Read in the data again if you haven’t already with `masked=True`.

``````surface_SJER = rioxarray.open_rasterio("data/NEON-DS-Airborne-Remote-Sensing/SJER/DSM/SJER_dsmCrop.tif", masked=True)
terrain_SJER_UTM18 = rioxarray.open_rasterio("data/NEON-DS-Airborne-Remote-Sensing/SJER/DTM/SJER_dtmCrop_WGS84.tif", masked=True)
print(terrain_SJER_UTM18.shape)
print(surface_SJER.shape)
``````

2) Reproject and clip one raster to the extent of the smaller raster using `reproject_match`. Your output raster may have nodata values at the border. Nodata values are fine and can be removed for later calculations if needed. The lines of code below assign a variable to the reprojected terrain raster and calculate a CHM for SJER.

``````terrain_SJER_UTM18_matched = terrain_SJER_UTM18.rio.reproject_match(surface_SJER)
canopy_SJER = surface_SJER - terrain_SJER_UTM18_matched
``````

3) Plot the CHM with the same color map as HARV and save the CHM to the `outputs` folder.

``````plt.figure()
canopy_SJER.plot(robust=True, cmap="viridis")
plt.ticklabel_format(style="plain")
plt.title("Canopy Height Model for San Joaquin Experimental Range, Z Units: Meters")
os.makedirs("./data/outputs/", exist_ok=True)
canopy_SJER.rio.to_raster("./data/outputs/CHM_SJER.tif")
``````

4) Compare the SJER and HARV CHMs. Tree heights are much shorter at SJER. You can confirm this by comparing the histograms of the two CHMs.

``````fig, ax = plt.subplots(figsize=(9,6))
canopy_HARV.plot.hist(ax = ax, bins=50, color = "green")
canopy_SJER.plot.hist(ax = ax, bins=50, color = "blue")
``````

## Key Points

• Python’s built-in math operators are fast and simple options for raster math.

• `numpy.digitize` can be used to classify raster values in order to generate a less complicated map.

• DataArrays can be created from scratch from numpy arrays as well as read in from existing files.

# Open and Plot Shapefiles in Python

## Overview

Teaching: 20 min
Exercises: 10 min
Questions
• How can I distinguish between and visualize point, line and polygon vector data?

Objectives
• Know the difference between point, line, and polygon vector elements.

• Load point, line, and polygon shapefiles with `geopandas`.

• Access the attributes of a spatial object with `geopandas`.

## Things You’ll Need To Complete This Episode

See the lesson homepage for detailed information about the software, data, and other prerequisites you will need to work through the examples in this episode.

Starting with this episode, we will be moving from working with raster data to working with vector data. In this episode, we will open and plot point, line and polygon vector data stored in shapefile format in Python. These data refer to the NEON Harvard Forest field site, which we have been working with in previous episodes. In later episodes, we will learn how to work with raster and vector data together and combine them into a single plot.

## Import Shapefiles

We will use the `geopandas` package to work with vector data in Python. We will also use the `rioxarray` package.

``````import geopandas as gpd
``````

The shapefiles that we will import are:

• A polygon shapefile representing our field site boundary
• A line shapefile representing roads and trails
• A point shapefile representing the location of the flux tower

The first shapefile that we will open contains the boundary of our study area (or our Area Of Interest [AOI], hence the name `aoi_boundary`). To import shapefiles we use the `geopandas` function `read_file()`.

Let’s import our AOI:

``````aoi_boundary_HARV = gpd.read_file(
"data/NEON-DS-Site-Layout-Files/HARV/HarClip_UTMZ18.shp")
``````

## Shapefile Metadata & Attributes

When we import the `HarClip_UTMZ18` shapefile layer into Python (as our `aoi_boundary_HARV` object) it comes in as a DataFrame, specifically a `GeoDataFrame`. `read_file()` also automatically stores geospatial information about the data. We are particularly interested in describing the format, CRS, extent, and other components of the vector data, and the attributes which describe properties associated with each individual vector object.

## Data Tip

The Explore and Plot by Shapefile Attributes episode provides more information on both metadata and attributes and using attributes to subset and plot data.

## Spatial Metadata

Key metadata for all shapefiles include:

1. Object Type: the class of the imported object.
2. Coordinate Reference System (CRS): the projection of the data.
3. Extent: the spatial extent (i.e. geographic area that the shapefile covers) of the shapefile. Note that the spatial extent for a shapefile represents the combined extent for all spatial objects in the shapefile.

Each `GeoDataFrame` has a `"geometry"` column that contains geometries. In the case of our `aoi_boundary_HARV`, this geometry is represented by a `shapely.geometry.Polygon` object. `geopandas` uses the `shapely` library to represent polygons, lines, and points, so the types are inherited from `shapely`.
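
For example, the geometry of a one-row `GeoDataFrame` like `aoi_boundary_HARV` can be pulled out and inspected as a shapely object (a synthetic unit square stands in for the real boundary here):

``````python
import geopandas as gpd
from shapely.geometry import Polygon

# A one-row GeoDataFrame with a unit square, standing in for aoi_boundary_HARV
square = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
gdf = gpd.GeoDataFrame(geometry=[square], crs="EPSG:32618")

geom = gdf.geometry[0]      # a shapely.geometry.Polygon
print(type(geom).__name__)  # Polygon
print(geom.area)            # 1.0
``````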

We can view shapefile metadata using the `.crs`, `.bounds` and `.type` attributes. First, let’s view the geometry type for our AOI shapefile by accessing the `.type` attribute of the `GeoDataFrame`, `aoi_boundary_HARV`.

``````aoi_boundary_HARV.type
``````
``````0    Polygon
dtype: object
``````

To view the CRS metadata:

``````aoi_boundary_HARV.crs
``````
``````<Projected CRS: EPSG:32618>
Name: WGS 84 / UTM zone 18N
Axis Info [cartesian]:
- E[east]: Easting (metre)
- N[north]: Northing (metre)
Area of Use:
- name: World - N hemisphere - 78°W to 72°W - by country
- bounds: (-78.0, 0.0, -72.0, 84.0)
Coordinate Operation:
- name: UTM zone 18N
- method: Transverse Mercator
Datum: World Geodetic System 1984
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
``````

Our data is in the CRS UTM zone 18N. The CRS is critical to interpreting the object’s extent values as it specifies units. To find the extent of our AOI in the projected coordinates, we can use the `.bounds` attribute:

``````aoi_boundary_HARV.bounds
``````
``````            minx          miny           maxx          maxy
0  732128.016925  4.713209e+06  732251.102892  4.713359e+06
``````

The spatial extent of a shapefile or `shapely` spatial object represents the geographic “edge” or location that is the furthest north, south, east, and west. Thus, it represents the overall geographic coverage of the spatial object. Image Source: National Ecological Observatory Network (NEON). We can convert these coordinates to a bounding box or acquire the index of the dataframe to access the geometry. Either of these polygons can be used to clip rasters (more on that later).
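
For example, shapely’s `box` helper turns four extent values into a polygon (shown with the rounded numbers from the `.bounds` output above; in practice `aoi_boundary_HARV.total_bounds` supplies them directly):

``````python
from shapely.geometry import box

# Extent values (rounded) taken from the .bounds output above
minx, miny, maxx, maxy = 732128.0, 4713209.0, 732251.1, 4713359.0

# box() builds a rectangular shapely Polygon from the four extent values
extent_polygon = box(minx, miny, maxx, maxy)
print(extent_polygon.bounds)
``````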

## Reading a Shapefile from a csv

So far we have been loading file formats that were specifically built to hold spatial information. But often, point data is stored in table format, with a column for the x coordinates and a column for the y coordinates. The easiest way to get this type of data into a GeoDataFrame is with the `geopandas` function `geopandas.points_from_xy`, which takes list-like sequences of x and y coordinates. In this case, we can get these list-like sequences from columns of a pandas `DataFrame` that we get from `read_csv`.

``````# we get the projection of the point data from our Canopy Height Model,
# after examining the pandas DataFrame and seeing that the CRSs are the same
import pandas as pd
import rioxarray

CHM_HARV = rioxarray.open_rasterio(
    "data/NEON-DS-Airborne-Remote-Sensing/HARV/CHM/HARV_chmCrop.tif")

# build point geometries from the easting/northing columns, using CHM_HARV's CRS
plot_locations_HARV = pd.read_csv(
    "data/NEON-DS-Site-Layout-Files/HARV/HARV_PlotLocations.csv")
plot_locations_HARV = gpd.GeoDataFrame(
    plot_locations_HARV,
    geometry=gpd.points_from_xy(plot_locations_HARV.easting, plot_locations_HARV.northing),
    crs=CHM_HARV.rio.crs)
``````

## Plotting a Shapefile

Any `GeoDataFrame` can be plotted in CRS units to view the shape of the object with `.plot()`.

``````aoi_boundary_HARV.plot()
``````

We can customize our boundary plot by setting the `figsize`, `edgecolor`, and `color`. Making some polygons transparent will come in handy when we need to add multiple spatial datasets to a single plot.

``````aoi_boundary_HARV.plot(figsize=(5,5), edgecolor="purple", facecolor="None")
``````

Under the hood, `geopandas` is using `matplotlib` to generate this plot. In the next episode we will see how we can add `DataArrays` and other shapefiles to this plot to start building an informative map of our area of interest.
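
Since `geopandas` draws with matplotlib, layering works by passing one shared `Axes` to each `.plot()` call via the `ax` argument. A minimal sketch with synthetic geometries standing in for the AOI polygon and tower point (the `aoi_demo`/`tower_demo` names are illustrative):

``````python
import geopandas as gpd
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed in a notebook
import matplotlib.pyplot as plt
from shapely.geometry import Point, box

# Synthetic stand-ins for aoi_boundary_HARV and point_HARV
aoi_demo = gpd.GeoDataFrame(geometry=[box(0, 0, 10, 10)], crs="EPSG:32618")
tower_demo = gpd.GeoDataFrame(geometry=[Point(5, 5)], crs="EPSG:32618")

# Each .plot() call draws onto the same Axes, stacking the layers
fig, ax = plt.subplots(figsize=(5, 5))
aoi_demo.plot(ax=ax, edgecolor="purple", facecolor="none")  # polygon layer
tower_demo.plot(ax=ax, color="black")                       # point layer on top
``````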

## Spatial Data Attributes

We introduced the idea of spatial data attributes in an earlier lesson. Now we will explore how to use spatial data attributes stored in our data to plot different features.

## Challenge: Import Line and Point Shapefiles

Using the steps above, import the HARV_roads and HARVtower_UTM18N layers into Python using `geopandas`. Name the HARV_roads shapefile as the variable `lines_HARV` and the HARVtower_UTM18N shapefile `point_HARV`.

Answer the following questions:

1. What type of Python spatial object is created when you import each layer?

2. What is the CRS and extent (bounds) for each object?

3. Do the files contain points, lines, or polygons?

4. How many spatial objects are in each file?

## Answers

First we import the data:

``````lines_HARV = gpd.read_file("data/NEON-DS-Site-Layout-Files/HARV/HARV_roads.shp")
point_HARV = gpd.read_file("data/NEON-DS-Site-Layout-Files/HARV/HARVtower_UTM18N.shp")
``````

Then we check the types:

``````lines_HARV.type
``````
``````point_HARV.type
``````

We also check the CRS and extent of each object:

``````print(lines_HARV.crs)
print(lines_HARV.bounds)
print(point_HARV.crs)
print(point_HARV.bounds)
``````

To see the number of objects in each file, we can look at the output printed when we display the results in a Jupyter notebook, or call `len()` on a `GeoDataFrame`. `lines_HARV` contains 13 features (all lines) and `point_HARV` contains only one point.
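
As a sketch, the count works the same on any `GeoDataFrame`, since `len()` simply returns the number of rows (synthetic stand-ins shown here):

``````python
import geopandas as gpd
from shapely.geometry import LineString, Point

# Synthetic stand-ins: 13 line features and a single point feature
lines_demo = gpd.GeoDataFrame(
    geometry=[LineString([(i, 0), (i, 1)]) for i in range(13)])
point_demo = gpd.GeoDataFrame(geometry=[Point(0, 0)])

print(len(lines_demo))  # 13
print(len(point_demo))  # 1
``````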

## Key Points

• Shapefile metadata include geometry type, CRS, and extent.

• Load spatial objects into Python with the `geopandas.read_file()` method.

• Spatial objects can be plotted directly with `geopandas.GeoDataFrame.plot()`.

# Plot Multiple Shapefiles with Geopandas FIXME

## Overview

Teaching: 40 min
Exercises: 20 min
Questions
• How can I create map compositions with custom legends using geopandas?

• How can I plot raster and vector data together?

Objectives
• Plot multiple shapefiles in the same plot.

• Apply custom symbols to spatial objects in a plot.

• Create a multi-layered plot with raster and vector data.

## Things You’ll Need To Complete This Episode

See the lesson homepage for detailed information about the software, data, and other prerequisites you will need to work through the examples in this episode.

This episode explains how to crop a raster using the extent of a vector shapefile. We will also cover how to extract values from a raster that occur within a set of polygons, or in a buffer (surrounding) region around a set of points.

``````# Learners will have these data and libraries loaded from earlier episodes

import rioxarray
import geopandas as gpd
import matplotlib.pyplot as plt
import pandas as pd

# shapefiles
point_HARV = gpd.read_file("data/NEON-DS-Site-Layout-Files/HARV/HARVtower_UTM18N.shp")
lines_HARV = gpd.read_file("data/NEON-DS-Site-Layout-Files/HARV/HARV_roads.shp")
aoi_boundary_HARV = gpd.read_file("data/NEON-DS-Site-Layout-Files/HARV/HarClip_UTMZ18.shp")

# CHM
CHM_HARV = rioxarray.open_rasterio(
    "data/NEON-DS-Airborne-Remote-Sensing/HARV/CHM/HARV_chmCrop.tif")

# plot locations as a GeoDataFrame in CHM_HARV's CRS
plot_locations_HARV = pd.read_csv(
    "data/NEON-DS-Site-Layout-Files/HARV/HARV_PlotLocations.csv")
plot_locations_HARV = gpd.GeoDataFrame(
    plot_locations_HARV,
    geometry=gpd.points_from_xy(plot_locations_HARV.easting, plot_locations_HARV.northing),
    crs=CHM_HARV.rio.crs)
``````

## Crop a Raster to Vector Extent

We often work with spatial layers that have different spatial extents. The spatial extent of a shapefile or other spatial object represents the geographic “edge” or location that is the furthest north, south, east, and west. Thus, it represents the overall geographic coverage of the spatial object. Image Source: National Ecological Observatory Network (NEON)

The graphic below illustrates the extent of several of the spatial layers that we have worked with in this workshop:

• Area of interest (AOI) – blue
• Roads and trails – purple
• Vegetation plot locations (marked with white dots) – black
• A canopy height model (CHM) in GeoTIFF format – green
``````# code not shown, for demonstration purposes only

``````

Frequent use cases of cropping a raster file include reducing file size and creating maps. Sometimes we have a raster file that is much larger than our study area or area of interest. It is often more efficient to crop the raster to the extent of our study area to reduce file sizes as we process our data. Cropping a raster can also be useful when creating pretty maps so that the raster layer matches the extent of the desired vector layers.

## Crop a Raster Using Vector Extent

We can use the `crop()` function to crop a raster to the extent of another spatial object. To do this, we need to specify the raster to be cropped and the spatial object that will be used to crop the raster. R will use the `extent` of the spatial object as the cropping boundary.

To illustrate this, we will crop the Canopy Height Model (CHM) to only include the area of interest (AOI). Let’s start by plotting the full extent of the CHM data and overlay where the AOI falls within it. The boundaries of the AOI will be colored blue, and we use `fill = NA` to make the area transparent.

```{r crop-by-vector-extent}
ggplot() +
  geom_raster(data = CHM_HARV_df, aes(x = x, y = y, fill = HARV_chmCrop)) +
  scale_fill_gradientn(name = "Canopy Height", colors = terrain.colors(10)) +
  geom_sf(data = aoi_boundary_HARV, color = "blue", fill = NA) +
  coord_sf()
``````

Now that we have visualized the area of the CHM we want to subset, we can
perform the cropping operation. We are going to create a new object with only
the portion of the CHM data that falls within the boundaries of the AOI. The function `crop()` is from the raster package and doesn't know how to deal with `sf` objects. Therefore, we first need to convert `aoi_boundary_HARV` from an `sf` object to a "Spatial" object.

```{r}
CHM_HARV_Cropped <- crop(x = CHM_HARV, y = as(aoi_boundary_HARV, "Spatial"))
``````

Now we can plot the cropped CHM data, along with a bounding box showing the full CHM extent. However, remember, since this is raster data, we need to convert it to a data frame in order to plot using `ggplot`. To get the bounding box of the CHM, `st_bbox()` will extract the 4 corners of the rectangle that encompasses all the features contained in this object. `st_as_sfc()` converts these 4 coordinates into a polygon that we can plot:

```{r show-cropped-area}
CHM_HARV_Cropped_df <- as.data.frame(CHM_HARV_Cropped, xy = TRUE)

ggplot() +
  geom_sf(data = st_as_sfc(st_bbox(CHM_HARV)), fill = "green", color = "green", alpha = .2) +
  geom_raster(data = CHM_HARV_Cropped_df, aes(x = x, y = y, fill = HARV_chmCrop)) +
  scale_fill_gradientn(name = "Canopy Height", colors = terrain.colors(10)) +
  coord_sf()
``````

The plot above shows that the full CHM extent (plotted in green) is much larger
than the resulting cropped raster. Our new cropped CHM now has the same extent
as the `aoi_boundary_HARV` object that was used as a crop extent (blue border
below).

```{r view-crop-extent}
ggplot() +
geom_raster(data = CHM_HARV_Cropped_df,
aes(x = x, y = y, fill = HARV_chmCrop)) +
geom_sf(data = aoi_boundary_HARV, color = "blue", fill = NA) +
scale_fill_gradientn(name = "Canopy Height", colors = terrain.colors(10)) +
coord_sf()
``````

We can look at the extent of all of our other objects for this field site.

```{r view-extent}
st_bbox(CHM_HARV)
st_bbox(CHM_HARV_Cropped)
st_bbox(aoi_boundary_HARV)
st_bbox(plot_locations_sp_HARV)
``````

Our plot location extent is not the largest but is larger than the AOI Boundary.
It would be nice to see our vegetation plot locations plotted on top of the
Canopy Height Model information.

## Challenge: Crop to Vector Points Extent

1. Crop the Canopy Height Model to the extent of the study plot locations.
2. Plot the vegetation plot location points on top of the Canopy Height Model.

## Answers

```{r challenge-code-crop-raster-points}
CHM_plots_HARVcrop <- crop(x = CHM_HARV, y = as(plot_locations_sp_HARV, "Spatial"))

CHM_plots_HARVcrop_df <- as.data.frame(CHM_plots_HARVcrop, xy = TRUE)

ggplot() +
  geom_raster(data = CHM_plots_HARVcrop_df, aes(x = x, y = y, fill = HARV_chmCrop)) +
  scale_fill_gradientn(name = "Canopy Height", colors = terrain.colors(10)) +
  geom_sf(data = plot_locations_sp_HARV) +
  coord_sf()
``````

In the plot above, created in the challenge, all the vegetation plot locations
(black dots) appear on the Canopy Height Model raster layer except for one. One is
situated on the blank space to the left of the map. Why?

A modification of the first figure in this episode is below, showing the
relative extents of all the spatial objects. Notice that the extent for our
vegetation plot layer (black) extends further west than the extent of our CHM
raster (bright green). The `crop()` function will make a raster extent smaller; it
will not expand the extent in areas where there are no data. Thus, the extent of our
vegetation plot layer will still extend further west than the extent of our
(cropped) raster data (dark green).

```{r, echo = FALSE}
# code not shown, demonstration only
# create CHM_plots_HARVcrop as a shape file
CHM_plots_HARVcrop_sp <- st_as_sf(CHM_plots_HARVcrop_df, coords = c("x", "y"), crs = utm18nCRS)
# approximate the boundary box with random sample of raster points
CHM_plots_HARVcrop_sp_rand_sample = sample_n(CHM_plots_HARVcrop_sp, 10000)
``````

```{r repeat-compare-data-extents, ref.label="compare-data-extents", echo = FALSE}
``````

## Define an Extent

So far, we have used a shapefile to crop the extent of a raster dataset.
Alternatively, we can also use the `extent()` function to define an extent to be
used as a cropping boundary. This creates a new object of class extent. Here we
will provide the `extent()` function our xmin, xmax, ymin, and ymax (in that
order).

```{r}
new_extent <- extent(732161.2, 732238.7, 4713249, 4713333)
class(new_extent)
``````

## Data Tip

The extent can be created from a numeric vector (as shown above), a matrix, or a list. For more details see the `extent()` function help file (`?raster::extent`).

Once we have defined our new extent, we can use the `crop()` function to crop our raster to this extent object.

```{r crop-using-drawn-extent}
CHM_HARV_manual_cropped <- crop(x = CHM_HARV, y = new_extent)
``````

To plot this data using `ggplot()` we need to convert it to a dataframe.

```{r}
CHM_HARV_manual_cropped_df <- as.data.frame(CHM_HARV_manual_cropped, xy = TRUE)
``````

Now we can plot this cropped data. We will show the AOI boundary on the same plot for scale.

```{r show-manual-crop-area}
ggplot() +
  geom_sf(data = aoi_boundary_HARV, color = "blue", fill = NA) +
  geom_raster(data = CHM_HARV_manual_cropped_df, aes(x = x, y = y, fill = HARV_chmCrop)) +
  scale_fill_gradientn(name = "Canopy Height", colors = terrain.colors(10)) +
  coord_sf()
``````

## Extract Raster Pixel Values Using Vector Polygons

Often we want to extract values from a raster layer for particular locations -
for example, plot locations that we are sampling on the ground. We can extract all pixel values within 20m of our x,y point of interest. These can then be summarized into some value of interest (e.g. mean, maximum, total).

![Extract raster information using a polygon boundary. From https://www.neonscience.org/sites/default/files/images/spatialData/BufferSquare.png](../images/BufferSquare.png)

To do this in R, we use the `extract()` function. The `extract()` function
requires:

* The raster that we wish to extract values from,
* The vector layer containing the polygons that we wish to use as a boundary or
boundaries,
* Optionally, `df = TRUE` to tell it to store the output values in a data
frame (the default is to return a list, not a data frame).

We will begin by extracting all canopy height pixel values located within our
`aoi_boundary_HARV` polygon which surrounds the tower located at the NEON Harvard
Forest field site.

```{r extract-from-raster}
tree_height <- extract(x = CHM_HARV,
y = as(aoi_boundary_HARV, "Spatial"),
df = TRUE)

str(tree_height)
``````

When we use the `extract()` function, R extracts the value for each pixel located within the boundary of the polygon being used to perform the extraction - in this case the `aoi_boundary_HARV` object (a single polygon). Here, the function extracted values from 18,450 pixels.

We can create a histogram of tree height values within the boundary to better understand the structure or height distribution of trees at our site. We will use the column `layer` from our data frame as our x values, as this column represents the tree heights for each pixel.

```{r view-extract-histogram}
ggplot() +
  geom_histogram(data = tree_height, aes(x = HARV_chmCrop)) +
  ggtitle("Histogram of CHM Height Values (m)") +
  xlab("Tree Height") +
  ylab("Frequency of Pixels")
``````

We can also use the
`summary()` function to view descriptive statistics including min, max, and mean
height values. These values help us better understand vegetation at our field
site.

```{r}
summary(tree_height$HARV_chmCrop)
``````

## Summarize Extracted Raster Values

We often want to extract summary values from a raster. We can tell R the type of summary statistic we are interested in using the `fun =` argument. Let’s extract a mean height value for our AOI. Because we are extracting only a single number, we will not use the `df = TRUE` argument.

```{r summarize-extract}
mean_tree_height_AOI <- extract(x = CHM_HARV,
                                y = as(aoi_boundary_HARV, "Spatial"),
                                fun = mean)

mean_tree_height_AOI
``````

It appears that the mean height value, extracted from our LiDAR-derived canopy height model, is 22.43 meters.

## Extract Data using x,y Locations

We can also extract pixel values from a raster by defining a buffer or area
surrounding individual point locations using the `extract()` function. To do this
we define the summary argument (`fun = mean`) and the buffer distance (`buffer = 20`)
which represents the radius of a circular region around each point. By default, the units of the
buffer are the same units as the data's CRS. All pixels that are touched by the buffer region are included in the extract.

![Extract raster information using a buffer region. From: https://www.neonscience.org/sites/default/files/images/spatialData/BufferCircular.png](../images/BufferCircular.png)

Source: National Ecological Observatory Network (NEON).

Let's put this into practice by figuring out the mean tree height in the
20m around the tower location (`point_HARV`). Because we are extracting only a single number, we
will not use the `df = TRUE` argument.

```{r extract-point-to-buffer }
mean_tree_height_tower <- extract(x = CHM_HARV,
y = as(point_HARV, "Spatial"),
buffer = 20,
fun = mean)

mean_tree_height_tower
``````

## Challenge: Extract Raster Height Values For Plot Locations

1) Use the plot locations object (`plot_locations_sp_HARV`) to extract an average tree height for the area within 20m of each vegetation plot location in the study area. Because there are multiple plot locations, there will be multiple averages returned, so the `df = TRUE` argument should be used.

2) Create a plot showing the mean tree height of each area.

## Answers

```{r hist-tree-height-veg-plot}
# extract data at each plot location
mean_tree_height_plots_HARV <- extract(x = CHM_HARV,
                                       y = as(plot_locations_sp_HARV, "Spatial"),
                                       buffer = 20,
                                       fun = mean,
                                       df = TRUE)

# view data
mean_tree_height_plots_HARV

# plot data
ggplot(data = mean_tree_height_plots_HARV, aes(ID, HARV_chmCrop)) +
  geom_col() +
  ggtitle("Mean Tree Height at each Plot") +
  xlab("Plot ID") +
  ylab("Tree Height (m)")
```

## Key Points

• Use a `matplotlib` `Axes` object to add multiple layers to a plot.

• Multi-layered plots can combine raster and vector datasets.

# Convert from .csv to a Shapefile in Python

## Overview

Teaching: 40 min
Exercises: 20 min
Questions
• How can I import CSV files as shapefiles in Python?

Objectives
• Import .csv files containing x,y coordinate locations as a GeoDataFrame.

• Export a spatial object to a .shp file.

```
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
```

```
# Learners will have this data loaded from earlier episodes
lines_HARV = gpd.read_file("data/NEON-DS-Site-Layout-Files/HARV/HARV_roads.shp")
aoi_boundary_HARV = gpd.read_file("data/NEON-DS-Site-Layout-Files/HARV/HarClip_UTMZ18.shp")
country_boundary_US = gpd.read_file("data/NEON-DS-Site-Layout-Files/US-Boundary-Layers/US-Boundary-Dissolved-States.shp")
point_HARV = gpd.read_file("data/NEON-DS-Site-Layout-Files/HARV/HARVtower_UTM18N.shp")
```

## Things You’ll Need To Complete This Episode

See the lesson homepage for detailed information about the software, data, and other prerequisites you will need to work through the examples in this episode.

This episode will review how to import spatial points stored in `.csv` (Comma Separated Value) format into Python as a `geopandas` `GeoDataFrame`. We will also reproject data imported from a shapefile format, export this data as a shapefile, and plot raster and vector data as layers in the same plot.

## Spatial Data in Text Format

The `HARV_PlotLocations.csv` file contains `x, y` (point) locations for study plots where NEON collects data on vegetation and other ecological metrics. We would like to:

• Create a map of these plot locations.
• Export the data in a `shapefile` format to share with our colleagues. This shapefile can be imported into any GIS software.
• Create a map showing vegetation height with plot locations layered on top.

Spatial data are sometimes stored in a text file format (`.txt` or `.csv`). If the text file has an associated `x` and `y` or `lat` and `lon` location column, then we can convert the tabular text file into a `geopandas.GeoDataFrame`. This will have a column containing the point geometry. The `GeoDataFrame` allows us to store both the `x,y` values that represent the coordinate location of each point and the associated attribute data - or columns describing each feature in the spatial object.
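As a minimal sketch of that conversion, using made-up coordinates rather than the NEON data (the column names and EPSG code here are illustrative):

```python
import pandas as pd
import geopandas as gpd

# Hypothetical plot table with UTM easting/northing columns (made-up coordinates)
df = pd.DataFrame({
    "plot_id": ["A", "B"],
    "easting": [732183.2, 732307.0],
    "northing": [4713123.8, 4713002.5],
})

# Build point geometry from the coordinate columns; EPSG:32618 is UTM zone 18N (WGS84)
gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df["easting"], df["northing"]),
    crs="EPSG:32618",
)
print(gdf.geometry.iloc[0].x, gdf.geometry.iloc[0].y)
```

The resulting object keeps the original attribute columns and adds a `geometry` column holding one point per row.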

We will continue using the `pandas` and `geopandas` packages in this episode, along with `matplotlib` for plotting.

## Import .csv

To begin let’s import a `.csv` file that contains plot coordinate `x, y` locations at the NEON Harvard Forest Field Site (`HARV_PlotLocations.csv`) and look at the structure of that new object:

```
plot_locations_HARV = pd.read_csv("data/NEON-DS-Site-Layout-Files/HARV/HARV_PlotLocations.csv")

plot_locations_HARV.info()
```

We now have a data frame that contains 21 locations (rows) and 16 variables (attributes). Next, let’s explore the `DataFrame` to determine whether it contains columns with coordinate values. If we are lucky, our `.csv` will contain columns labeled:

• “X” and “Y” OR
• Latitude and Longitude OR
• easting and northing (UTM coordinates)

Let’s check out the column names of our `DataFrame`.

```
plot_locations_HARV.columns
```

## Identify X,Y Location Columns

Our column names include several fields that might contain spatial information. The `plot_locations_HARV["easting"]` and `plot_locations_HARV["northing"]` columns contain coordinate values. We can confirm this by looking at the first five rows of our data.

```
print(plot_locations_HARV["easting"].head())

print(plot_locations_HARV["northing"].head())
```

We have coordinate values in our data frame. In order to convert our `DataFrame` to a `GeoDataFrame`, we also need to know the CRS associated with those coordinate values.

There are several ways to figure out the CRS of spatial data in text format.

1. We can infer from the range of numbers in the coordinate column if the values represent latitude/longitude (in which case, we can use a WGS84 projection) or another coordinate system.
2. If the values are not in latitude/longitude, we can check the file metadata, which may be in a separate file or listed somewhere in the text file itself. The file header or separate data columns are possible locations for storing CRS-related metadata.
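The first heuristic above can be sketched in a few lines of plain Python. The values shown are synthetic; a real check would read the coordinate columns of the `.csv`:

```python
def looks_like_lonlat(xs, ys):
    """Heuristic: values within +/-180 (x) and +/-90 (y) may be lon/lat degrees."""
    return all(-180 <= x <= 180 for x in xs) and all(-90 <= y <= 90 for y in ys)

# UTM-style coordinates fall far outside the lon/lat degree ranges
print(looks_like_lonlat([732183.2, 732307.0], [4713123.8, 4713002.5]))  # False
print(looks_like_lonlat([-72.17, -72.15], [42.53, 42.54]))              # True
```

Keep in mind this is only a hint: small projected coordinates can also fall inside these ranges, so metadata remains the authoritative source.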

Following the `easting` and `northing` columns, there is a `geodeticDa` and a `utmZone` column. These appear to contain CRS information (`datum` and `projection`). Let’s view those next.

```
print(plot_locations_HARV["geodeticDa"].head())
print(plot_locations_HARV["utmZone"].head())
```

The `geodeticDa` and `utmZone` columns contain the information that helps us determine the CRS:

• `geodeticDa`: WGS84, meaning the geodetic datum is WGS84
• `utmZone`: 18, meaning the data are in UTM zone 18

In When Vector Data Don’t Line Up - Handling Spatial Projection & CRS in Python we learned about the components of a `proj4` string. We have everything we need to assign a CRS to our data frame.

To create the `proj4` associated with UTM Zone 18 WGS84 we can look up the projection on the Spatial Reference website, which contains a list of CRS formats for each projection. From here, we can extract the proj4 string for UTM Zone 18N WGS84.

However, if we have other data in the UTM Zone 18N projection, it’s much easier to use the `.crs` method to extract the CRS in `proj4` format from that GeoDataFrame and assign it to our new GeoDataFrame. We’ve seen this CRS before with our Harvard Forest study site (`point_HARV`).

```
point_HARV.crs
```

The output above shows that the points shapefile is in UTM zone 18N. We can thus use the CRS from that spatial object to convert our non-spatial `DataFrame` into a `GeoDataFrame` with point geometry.

Next, let’s create a `crs` object that we can use to define the CRS of our `GeoDataFrame` when we create it.

```
utm18nCRS = point_HARV.crs
utm18nCRS
```

## .csv to GeoDataFrame

Next, let’s convert our `DataFrame` into a `GeoDataFrame`. To do this, we need to specify:

1. The columns containing X (`easting`) and Y (`northing`) coordinate values
2. The CRS that the coordinate columns represent (units are included in the CRS), stored in our `utm18nCRS` object.

We will use the `points_from_xy()` function to perform the conversion.

```
plot_locations_HARV_gdf = gpd.GeoDataFrame(plot_locations_HARV,
                                           geometry=gpd.points_from_xy(plot_locations_HARV.easting, plot_locations_HARV.northing),
                                           crs=utm18nCRS)
```

We should double check the CRS to make sure it is correct.

```
plot_locations_HARV_gdf.crs
```

## Plot Spatial Object

Now that we have a `GeoDataFrame`, we can plot our newly created spatial object.

```
fig, ax = plt.subplots()
plot_locations_HARV_gdf.plot(ax=ax)
plt.title("Map of Plot Locations")
plt.show()
```

## Plot Extent

In Open and Plot Shapefiles in Python we learned about `GeoDataFrame` extent. When we plot several spatial layers in Python using `matplotlib`, all of the layers of the plot are considered in setting the boundaries of the plot. To show this, let’s plot our `aoi_boundary_HARV` object with our vegetation plots.

```
fig, ax = plt.subplots()
plot_locations_HARV_gdf.plot(ax=ax)
aoi_boundary_HARV.plot(ax=ax, facecolor="None", edgecolor="orange")
plt.title("AOI Boundary Plot")
plt.show()
```

## Challenge - Import & Plot Additional Points

We want to add two phenology plots to our existing map of vegetation plot locations.

Import the .csv: `data/NEON-DS-Site-Layout-Files/HARV/HARV_2NewPhenPlots.csv` and do the following:

1. Find the X and Y coordinate locations. Which value is X and which value is Y?
2. These data were collected in a geographic coordinate system (WGS84). Convert the dataframe into a `geopandas.GeoDataFrame`.
3. Plot the new points with the plot location points from above. Be sure to add a legend. Use a different symbol for the 2 new points!

## Answers

First we read in the phenology data as a DataFrame then display some relevant metadata.

```
newplot_locations_HARV = pd.read_csv("data/NEON-DS-Site-Layout-Files/HARV/HARV_2NewPhenPlots.csv")
newplot_locations_HARV.info()
```

We see that `decimalLon` and `decimalLat` are the relevant X and Y coordinate locations, respectively. Next, we create a variable to store the coordinate reference information for the phenology points (the same CRS as the `country_boundary_US` `GeoDataFrame`).

```
geogCRS = country_boundary_US.crs
```

Then we convert from the DataFrame to a GeoDataFrame by using the `points_from_xy` method and specifying the `crs` of the point data.

```
newplot_locations_HARV_gdf = gpd.GeoDataFrame(newplot_locations_HARV,
                                              geometry=gpd.points_from_xy(newplot_locations_HARV.decimalLon, newplot_locations_HARV.decimalLat),
                                              crs=geogCRS)
```

Finally, we project the phenology data to match the coordinate reference system of `plot_locations_HARV_gdf` using the `to_crs` method, and display both layers on the same matplotlib figure.

```
fig, ax = plt.subplots()
plot_locations_HARV_gdf.plot(ax=ax)
newplot_locations_HARV_gdf.to_crs(plot_locations_HARV_gdf.crs).plot(ax=ax, color="orange")
plt.title("Map of All Plot Locations")
plt.show()
```

## Key Points

• Know the projection (if any) of your point data prior to converting to a spatial object.

• This projection information can be used to convert a text file with spatial columns into a shapefile (or GeoJSON) with geopandas.

# Calculating Zonal Statistics on Rasters

## Overview

Teaching: 40 min
Exercises: 20 min
Questions
Objectives

## Things You’ll Need To Complete This Episode

See the lesson homepage for detailed information about the software, data, and other prerequisites you will need to work through the examples in this episode.

So far we have chained together two geospatial operations, reprojection and raster math, to produce our Canopy Height Model (CHM) and learned how to read vector data with `geopandas`.

```
from xrspatial import zonal_stats
import xarray as xr
import geopandas as gpd
import rasterio
import rasterio.features
import rioxarray

surface_HARV = rioxarray.open_rasterio("data/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_dsmCrop.tif", masked=True)
terrain_HARV_UTM18 = rioxarray.open_rasterio("data/NEON-DS-Airborne-Remote-Sensing/HARV/DTM/HARV_dtmCrop_UTM18.tif", masked=True)
terrain_HARV_matched = terrain_HARV_UTM18.rio.reproject_match(surface_HARV)
canopy_HARV = surface_HARV - terrain_HARV_matched

roads = gpd.read_file("data/NEON-DS-Site-Layout-Files/HARV/HARV_roads.shp")
```

Now that we have our CHM, we can compute statistics with it to understand how canopy height varies across our study area. If we want to focus on specific areas, or zones, of interest when calculating these statistics, we can use zonal statistics. To do this, we’ll import the `zonal_stats` function from the package `xrspatial`.

## Zonal Statistics with xarray-spatial

We often want to perform calculations for specific zones in a raster. These zones can be delineated by points, lines, or polygons (vectors). In the case of our Harvard Forest Dataset, we have a shapefile that contains lines representing walkways, footpaths, and roads. A single function call, `xrspatial.zonal_stats` can calculate the minimum, maximum, mean, median, and standard deviation for each line zone in our CHM.
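Conceptually, zonal statistics groups the value raster's pixels by zone label and summarizes each group. A toy NumPy sketch of a per-zone mean (synthetic arrays, not the HARV data):

```python
import numpy as np

# A tiny value raster (e.g. canopy heights) and a matching zone raster
values = np.array([[10.0, 12.0, 30.0],
                   [11.0, 29.0, 31.0]])
zones = np.array([[1, 1, 2],
                  [1, 2, 2]])

# Mean of the value pixels that fall inside each zone label
for zone_id in np.unique(zones):
    print(zone_id, values[zones == zone_id].mean())
```

`xrspatial.zonal_stats` does this same grouping for all of its statistics at once, over full-size rasters.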

In order to accomplish this, we first need to rasterize our `roads` geodataframe with the `rasterio.features.rasterize` function. This produces a grid of numbers, one value per type of line (walkway, footpath, road, etc.). The grid’s values then define our zones for the `xrspatial.zonal_stats` function, with each pixel in the zone grid overlapping a corresponding pixel in our CHM raster.

Before rasterizing, we need to do a little work to make a variable, `shapes`, that associates a line with a unique number to represent that line. This variable will later be used as the first argument to `rasterio.features.rasterize`.

```
shapes = roads[['geometry', 'RULEID']].values.tolist()
```

The `shapes` variable contains a list of pairs: each pair holds the shapely geometry from the `geometry` column of the `roads` geodataframe and the zone ID from its `RULEID` column.

```
[[<shapely.geometry.multilinestring.MultiLineString at 0x173463ac0>, 5],
 [<shapely.geometry.linestring.LineString at 0x169b957c0>, 6],
 [<shapely.geometry.linestring.LineString at 0x173475280>, 6],
 [<shapely.geometry.linestring.LineString at 0x1734751c0>, 1],
 [<shapely.geometry.linestring.LineString at 0x1734751f0>, 1],
 [<shapely.geometry.linestring.LineString at 0x173475250>, 1],
 [<shapely.geometry.linestring.LineString at 0x173475370>, 1],
 [<shapely.geometry.linestring.LineString at 0x173475220>, 1],
 [<shapely.geometry.linestring.LineString at 0x1734753d0>, 1],
 [<shapely.geometry.multilinestring.MultiLineString at 0x1734754c0>, 2],
 [<shapely.geometry.linestring.LineString at 0x1734754f0>, 5],
 [<shapely.geometry.linestring.LineString at 0x173475580>, 5],
 [<shapely.geometry.linestring.LineString at 0x1734755b0>, 5]]
```

Below, the argument `out_shape` specifies the shape of the output grid in pixel units, while `transform` represents the projection from pixel space to the projected coordinate space. We also need to specify the fill value for pixels that do not intersect a line in our shapefile, which we do with `fill = 7`. It’s important to pick a fill value that does not collide with any value in `shapes`/`roads['RULEID']`, or else we won’t be able to distinguish the background zone from a real one. We also need to pick a fill value that is not `0`, since `xrspatial.zonal_stats` does not calculate statistics for pixels with `0` as a zone value.
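A quick sanity check of those two fill-value rules, with made-up zone IDs standing in for the unique values of `roads['RULEID']`:

```python
# Hypothetical zone IDs, as might appear in roads['RULEID']
rule_ids = [1, 2, 5, 6]
fill = 7

# The fill value must not collide with a real zone ID, and must not be 0
assert fill not in rule_ids
assert fill != 0
print("fill value ok")
```

In real code you would check against `set(roads['RULEID'])` before calling `rasterize`.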

```
zones_arr_out_shape = canopy_HARV.shape[1:]
canopy_HARV_transform = canopy_HARV.rio.transform()
road_zones_arr = rasterio.features.rasterize(shapes,
                                             fill=7,
                                             out_shape=zones_arr_out_shape,
                                             transform=canopy_HARV_transform)
```

After we have defined `road_zones_arr`, we convert it to an `xarray.DataArray` and select the single band of our CHM, so that both arrays have only an x and a y dimension, which is what the `zonal_stats` function expects.

```
road_canopy_zones_xarr = xr.DataArray(road_zones_arr)
canopy_HARV_b1 = canopy_HARV.sel(band=1)
```

Then we call the `zonal_stats` function with the zones as the first argument and the raster with our values of interest as the second argument.

```
zstats_df = zonal_stats(road_canopy_zones_xarr, canopy_HARV_b1)
zstats_df
```

This produces a neat table describing statistics for each of our zones.

```
   mean     max    min         std      var      count
1  18.0276  26.75   0          6.33709  40.1588  734
2  18.8825  23.98   0          5.83523  34.0499  59
5  14.2802  26.75  -0.399994   7.45486  55.5749  2419
6  15.7706  26.81  -0.0799866  6.59865  43.5422  719
7  14.9545  38.17  -0.809998   7.10642  50.5012  2.31557e+06
```

It’d be nice to associate the zone names with each row. To do this, we can use the `roads["TYPE"]` column, which contains the unique zone names for each line. We’ll make a new dataframe with two columns, one for the zone ID (numeric) and one for the zone type (a string), and then join this with our stats dataframe.

```
import pandas as pd

zoneid_zonetype = roads[['RULEID', 'TYPE']].drop_duplicates()
non_road = pd.DataFrame([{"RULEID": 7, "TYPE": "non-road"}])
zoneid_zonetype = pd.concat([zoneid_zonetype, non_road], ignore_index=True)
zoneid_zonetype = zoneid_zonetype.set_index("RULEID")
zstats_df = zoneid_zonetype.join(zstats_df)
```

This results in a labeled table that is easier to interpret.

```
RULEID  TYPE        mean     max    min         std      var      count
5       woods road  14.2802  26.75  -0.399994   7.45486  55.5749  2419
6       footpath    15.7706  26.81  -0.0799866  6.59865  43.5422  719
1       stone wall  18.0276  26.75   0          6.33709  40.1588  734
2       boardwalk   18.8825  23.98   0          5.83523  34.0499  59
7       non-road    14.9545  38.17  -0.809998   7.10642  50.5012  2.31557e+06
```

## Challenge: Calculate the Canopy Height Statistics for a Meteorological Tower Site

Let’s calculate zonal statistics for an area where we are monitoring meteorological variables above the canopy of HARV.

Import the `NEON-DS-Site-Layout-Files/HARV/AOPClip_UTMz18N.shp` shapefile with `geopandas`. Then, calculate zonal statistics using our CHM. Inspect the resulting table.

## Answers

1) Read in the data.

```
tower_aoi_HARV = gpd.read_file("data/NEON-DS-Site-Layout-Files/HARV/HarClip_UTMZ18.shp")
```

2) Create a zone array for this polygon.

```
shapes = tower_aoi_HARV[['geometry', 'id']].values.tolist()
zones_arr_out_shape = canopy_HARV.shape[1:]
canopy_HARV_transform = canopy_HARV.rio.transform()
tower_zone_arr = rasterio.features.rasterize(shapes,
                                             fill=7,
                                             out_shape=zones_arr_out_shape,
                                             transform=canopy_HARV_transform)
```

3) Create and display the zonal statistics table.

```
tower_zone_xarr = xr.DataArray(tower_zone_arr)
canopy_HARV_b1 = canopy_HARV.sel(band=1)
zstats_df = zonal_stats(tower_zone_xarr, canopy_HARV_b1)
zstats_df
```

# Intro to Raster Data in Python FIXME

## Overview

Teaching: 40 min
Exercises: 20 min
Questions
• What do I do when vector data don’t line up?

Objectives
• Plot vector objects with different CRSs in the same plot.

FIXME

## Key Points

• In order to plot two vector data sets together, they must be in the same CRS.

• Use the `GeoDataFrame.to_crs()` method to convert between CRSs.

# Manipulate Raster Data in Python FIXME

## Overview

Teaching: 40 min
Exercises: 20 min
Questions
• How can I crop raster objects to vector objects, and extract the summary of raster pixels?

Objectives
• Crop a raster to the extent of a vector layer with `earthpy`.

• Extract values from a raster that correspond to a vector file overlay with `rasterstats`.

FIXME

# Work With Multi-Band Rasters in Python

## Overview

Teaching: 40 min
Exercises: 20 min
Questions
• How can I visualize individual and multiple bands in a raster object?

Objectives
• Identify a single vs. a multi-band raster file.

• Import multi-band rasters into Python using the `rioxarray` package.

• Plot multi-band color image rasters using the `rioxarray` package.

## Things You’ll Need To Complete This Episode

See the lesson homepage for detailed information about the software, data, and other prerequisites you will need to work through the examples in this episode.

We introduced multi-band raster data in an earlier lesson. This episode explores how to import and plot a multi-band raster in Python.

## Getting Started with Multi-Band Data in Python

In this episode, the multi-band data that we are working with is imagery collected using the NEON Airborne Observation Platform high resolution camera over the NEON Harvard Forest field site. Each RGB image is a 3-band raster. The same steps would apply to working with a multi-spectral image with 4 or more bands - like Landsat imagery.

We will use one package in this episode to work with raster data: `rioxarray`, which is based on the popular `rasterio` package for working with rasters and `xarray` for working with multi-dimensional arrays. Make sure that you have `rioxarray` installed and imported.

```
import rioxarray
```

The `open_rasterio` function in the `rioxarray` package can read a multi-band raster, also referred to as a “stack”, into Python.

```
rgb_stack_HARV = rioxarray.open_rasterio("data/NEON-DS-Airborne-Remote-Sensing/HARV/RGB_Imagery/HARV_RGB_Ortho.tif")
```

As usual, we can use the `print()` function to inspect `rgb_stack_HARV`.

```
print(rgb_stack_HARV)
```
```
<xarray.DataArray (band: 3, y: 2317, x: 3073)>
array([[[  0.,   2., ...,   0.,   0.],
        [  0., 112., ...,   0.,   0.],
        ...,
        [  0.,   0., ...,   0.,   0.],
        [  0.,   0., ...,   0.,   0.]],

       [[  1.,   0., ...,   0.,   0.],
        [  0., 130., ...,   0.,   0.],
        ...,
        [  0.,   0., ...,   0.,   0.],
        [  0.,   0., ...,   0.,   0.]],

       [[  0.,  10., ...,   0.,   0.],
        [  1.,  99., ...,   0.,   0.],
        ...,
        [  0.,   0., ...,   0.,   0.],
        [  0.,   0., ...,   0.,   0.]]])
Coordinates:
  * band         (band) int64 1 2 3
  * y            (y) float64 4.714e+06 4.714e+06 ... 4.713e+06 4.713e+06
  * x            (x) float64 7.32e+05 7.32e+05 7.32e+05 ... 7.328e+05 7.328e+05
    spatial_ref  int64 0
Attributes:
    STATISTICS_MAXIMUM:  255
    STATISTICS_MEAN:     nan
    STATISTICS_MINIMUM:  0
    STATISTICS_STDDEV:   nan
    transform:           (0.25, 0.0, 731998.5, 0.0, -0.25, 4713535.5)
    _FillValue:          -1.7e+308
    scale_factor:        1.0
    add_offset:          0.0
    grid_mapping:        spatial_ref
```

On the first line we see that `rgb_stack_HARV` is an `xarray.DataArray` object, just like the object created by reading a single-band raster in an earlier episode. The shape, `(band: 3, y: 2317, x: 3073)`, indicates that this raster has three bands.

## Package Delivery

You may have expected `rgb_stack_HARV` to be a `rioxarray` object. That is the Python package we used to read in the file, after all! What’s really happening here?

As we mentioned in an earlier episode, `rioxarray` is based on two other packages: `rasterio` and `xarray`.

`xarray` is a great tool for manipulating and analyzing data in labeled multi-dimensional arrays, but it cannot compute common geospatial operations like clipping and reprojecting. It also cannot write GeoTIFF (i.e., `.tif`) files, such as those used in this lesson. The `rasterio` package can do common geospatial operations and write GeoTIFF files, but it provides limited functionality for data analysis, visualization, and parallel computation on arrays compared to `xarray`. This is where `rioxarray` comes in.

The `open_rasterio()` function provided by `rioxarray` uses the `rasterio` package to read GeoTIFF files directly into `xarray.DataArray` objects that have an additional `.rio` attribute for accessing geospatial operations like `.rio.reproject()`. Building on top of existing packages in this way allows packages like `rioxarray` to avoid “reinventing the wheel.”

As with a single-band raster, each of the object attributes can be accessed individually as well.

```
print(type(rgb_stack_HARV))
print(rgb_stack_HARV.shape)
print(rgb_stack_HARV.values)
```
```
<class 'xarray.core.dataarray.DataArray'>
(3, 2317, 3073)
[[[  0.   2.   6. ...   0.   0.   0.]
  [  0. 112. 104. ...   0.   0.   0.]
  [ 19.  98. 137. ...   0.   0.   0.]
  ...
  [  0.   0.   0. ...   0.   0.   0.]
  [  0.   0.   0. ...   0.   0.   0.]
  [  0.   0.   0. ...   0.   0.   0.]]

 [[  1.   0.   9. ...   0.   0.   0.]
  [  0. 130. 130. ...   0.   0.   0.]
  [ 21. 117. 142. ...   0.   0.   0.]
  ...
  [  0.   0.   0. ...   0.   0.   0.]
  [  0.   0.   0. ...   0.   0.   0.]
  [  0.   0.   0. ...   0.   0.   0.]]

 [[  0.  10.   1. ...   0.   0.   0.]
  [  1.  99. 113. ...   0.   0.   0.]
  [ 12. 104. 114. ...   0.   0.   0.]
  ...
  [  0.   0.   0. ...   0.   0.   0.]
  [  0.   0.   0. ...   0.   0.   0.]
  [  0.   0.   0. ...   0.   0.   0.]]]
```

View the raster’s coordinates by accessing the `.coords` attribute.

```
print(rgb_stack_HARV.coords)
```
```
Coordinates:
  * band         (band) int64 1 2 3
  * y            (y) float64 4.714e+06 4.714e+06 ... 4.713e+06 4.713e+06
  * x            (x) float64 7.32e+05 7.32e+05 7.32e+05 ... 7.328e+05 7.328e+05
    spatial_ref  int64 0
```

Similarly, view the raster’s attributes field by accessing the `.attrs` attribute.

```
print(rgb_stack_HARV.attrs)
```
```
{'STATISTICS_MAXIMUM': 255, 'STATISTICS_MEAN': nan, 'STATISTICS_MINIMUM': 0, 'STATISTICS_STDDEV': nan, 'transform': (0.25, 0.0, 731998.5, 0.0, -0.25, 4713535.5), '_FillValue': -1.7e+308, 'scale_factor': 1.0, 'add_offset': 0.0, 'grid_mapping': 'spatial_ref'}
```

We can also directly access raster metadata values just as we did with single-band rasters.

```
print(rgb_stack_HARV.rio.crs)
print(rgb_stack_HARV.rio.nodata)
print(rgb_stack_HARV.rio.bounds())
print(rgb_stack_HARV.rio.width)
print(rgb_stack_HARV.rio.height)
```
```
EPSG:32618
-1.7e+308
(731998.5, 4712956.25, 732766.75, 4713535.5)
3073
2317
```

## Image Raster Data Values

As we saw in the previous exercise, this raster contains values between 0 and 255. These values represent degrees of brightness associated with the image band. In the case of an RGB image (red, green and blue), band 1 is the red band. When we plot the red band, larger numbers (towards 255) represent pixels with more red in them (a strong red reflection). Smaller numbers (towards 0) represent pixels with less red in them (less red was reflected). To plot an RGB image, we mix red + green + blue values into one single color to create a full color image - similar to the color image a digital camera creates.
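The mixing described above can be sketched with NumPy. This uses synthetic 2x2 pixels, not the HARV image; plotting libraries such as matplotlib expect the bands stacked along a trailing channel axis and scaled to 0-1:

```python
import numpy as np

# One 2x2 "image" per band, with 0-255 brightness values
red   = np.array([[255, 0], [128, 0]])
green = np.array([[0, 255], [128, 0]])
blue  = np.array([[0, 0], [128, 255]])

# Stack to (y, x, band) and rescale to 0-1
rgb = np.dstack([red, green, blue]) / 255.0
print(rgb.shape)   # (2, 2, 3)
print(rgb[1, 0])   # equal parts red, green, blue: a mid gray
```

Pixel (1, 0) mixes equal red, green, and blue brightness, so it renders as gray; pixel (0, 0) is pure red.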

## Select A Specific Band

We can use the `sel()` function to select specific bands from our raster object by specifying which band we want with `band=<value>` (where `<value>` is the number of the band we want to work with). To select the red band, we would pass the argument `band=1` to the `sel()` function.

```
rgb_band1_HARV = rgb_stack_HARV.sel(band=1)
```

To select the green band, we would pass `band=2`.

```
rgb_band2_HARV = rgb_stack_HARV.sel(band=2)
```

We can confirm that these objects only contain a single band by checking the shape.

```
rgb_band1_HARV.shape
```
```
(2317, 3073)
```

This shape output contains only two dimensions, confirming that `rgb_band1_HARV` contains only a single band.

We can then plot these bands using the `plot.imshow()` function. First, plot band 1 (red).

```
rgb_band1_HARV.plot.imshow(figsize=(9,7), cmap="Greys")
```

Next, plot band 2 (green).

```
rgb_band2_HARV.plot.imshow(figsize=(9,7), cmap="Greys")
```

## Plot Arguments

You probably noticed the arguments `figsize` and `cmap` were passed to the plotting function. The default figure size is a bit small so we used the `figsize` argument to increase the size of the plot. We also elected to use a grayscale color map to allow you to more easily compare values contained in the red and green bands.

## Challenge: Making Sense of Single Band Images

Compare the plots of band 1 (red) and band 2 (green). Is the forested area darker or lighter in band 2 (the green band) compared to band 1 (the red band)? Why?

## Answers

The forested area appears darker in band 2 (green) compared to band 1 (red) because the leaves on healthy plants typically appear green–meaning they reflect more green light than red light. Remember, we’re dealing with RGB values in this raster where larger values represent a greater contribution of a color in the final mix of red, green, and blue. Revisit the Image Raster Data Values section above for a deeper explanation.

We don’t always have to create a new variable to explore or plot each band in a raster. We can select a band using the `sel()` function described earlier and then call another function on its output in the same line, a practice called “method chaining”. For example, let’s use method chaining to create a histogram of band 1.

```
rgb_stack_HARV.sel(band=1).plot.hist()
```

We can use the `bins` argument to increase the number of bins for the histogram.

```
rgb_stack_HARV.sel(band=1).plot.hist(bins=30)
```

## Create A Three Band Image

To render a final, three band, colored image in Python, we again turn to the `plot.imshow()` function.

```
rgb_stack_HARV.plot.imshow(figsize=(9,7))
```

From this plot we see something interesting: while our no data values were masked along the edges, the color channels' no data values don't all line up. The colored pixels at the edges between white and black result from there being no data in one or two channels at a given pixel. 0 could conceivably represent a valid value for reflectance (the units of our pixel values), so it's good to make sure we are masking values at the edges and not valid data values within the image.

While this plot tells us where we have no data values, the color scale looks strange, because our plotting function expects image values to be normalized within a certain range (0-1 or 0-255). By passing the `robust=True` argument to `rgb_stack_HARV.plot.imshow`, we can display values between the 2nd and 98th percentile, providing better color contrast, just like in episode 5.

```
rgb_stack_HARV.plot.imshow(figsize=(9,7), robust=True)
```

When the range of pixel brightness values is closer to 0, a darker image is rendered by default; when it is closer to 255, a lighter image is rendered. We can stretch the values to cover the full 0-255 range of potential values to increase the visual contrast of the image. It is possible to perform a custom stretch when plotting multi-band rasters with `imshow()` by using the keyword arguments `vmin` and `vmax`. For example, here's how we can use `vmin` and `vmax` to recreate the output of `imshow(robust=True)`:

```
rgb_stack_HARV.plot.imshow(figsize=(9,7),
                           vmin=rgb_stack_HARV.quantile(0.02),
                           vmax=rgb_stack_HARV.quantile(0.98))
```

In the code above we use the `quantile()` function to calculate the 2nd and 98th quantiles of the data values in `rgb_stack_HARV` and pass those values to `vmin` and `vmax`, respectively.

## Challenge: NoData Values

Let’s explore what happens with NoData values when plotting multi-band rasters. We will use the `HARV_Ortho_wNA.tif` GeoTIFF file in the `NEON-DS-Airborne-Remote-Sensing/HARV/RGB_Imagery/` directory.

1. Load the multi-band raster into Python and view the file’s attributes. Are there `NoData` values assigned for this file? (Hint: this value is sometimes called `_FillValue`)
2. If so, what is the `NoData` value?
3. How many bands does this raster have?
4. Plot the raster as a true-color image, using the `robust` argument.
5. Why does the plot show incorrect color stretching even though we used the `robust` argument? Hint: Look at the 2nd percentile of the data array with `np.percentile`.
6. What does this tell us about the differences between `HARV_Ortho_wNA.tif` and `HARV_RGB_Ortho.tif`. How can you check?
7. Plot the figure correctly by masking `NoData` values using the `.where()` method from episode 6 and `robust=True`.

## Answers

1) Load the raster into Python using `rioxarray.open_rasterio()` and inspect the object’s attributes with the `print()` function:

```
rgb_stack_HARV_wNA = rioxarray.open_rasterio("data/NEON-DS-Airborne-Remote-Sensing/HARV/RGB_Imagery/HARV_Ortho_wNA.tif")
print(rgb_stack_HARV_wNA)
```
```
<xarray.DataArray (band: 3, y: 2317, x: 3073)>
[21360423 values with dtype=float64]
Coordinates:
  * band         (band) int64 1 2 3
  * y            (y) float64 4.714e+06 4.714e+06 ... 4.713e+06 4.713e+06
  * x            (x) float64 7.32e+05 7.32e+05 7.32e+05 ... 7.328e+05 7.328e+05
    spatial_ref  int64 0
Attributes:
    STATISTICS_MAXIMUM:  255
    STATISTICS_MEAN:     107.83651227531
    STATISTICS_MINIMUM:  0
    STATISTICS_STDDEV:   30.019177549096
    transform:           (0.25, 0.0, 731998.5, 0.0, -0.25, 4713535.5)
    _FillValue:          -9999.0
    scale_factor:        1.0
    add_offset:          0.0
    grid_mapping:        spatial_ref
```

2) The `NoData` value of this raster is -9999 (see `_FillValue` in output above). We can confirm this by accessing the `NoData` value directly:

```
print(rgb_stack_HARV_wNA.rio.nodata)
```
```
-9999.0
```

3) The raster has 3 bands (see first line in output of answer 1).

4) Plot the figure:

```
rgb_stack_HARV_wNA.plot.imshow(robust=True)
```

5) The figure looks wrong because the color stretch is still incorrect, even though we used `robust=True`. This is not a bug! 6) Besides having a different `NoData` value, `HARV_Ortho_wNA.tif` contains many more `NoData` values than `HARV_RGB_Ortho.tif`. There are so many that `-9999.0` is itself the 2nd percentile of the data, so the robust stretch spans from the fill value up to the 98th percentile. We can check what fraction of the values in the array are `NoData` values:

``````# keep only the cells equal to the fill value (others become NaN), then count
number_of_nodata_values = rgb_stack_HARV_wNA.where(rgb_stack_HARV_wNA == -9999.0).count()
number_of_values = rgb_stack_HARV_wNA.size
print(float(number_of_nodata_values / number_of_values))
``````
``````0.0525384258542071
``````

About 5% of all values are `NoData` in `HARV_Ortho_wNA.tif`.
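The effect can be sketched with a small NumPy example (synthetic values, not the actual raster): when more than 2% of an array holds a very negative fill value, that fill value *is* the 2nd percentile, so a robust stretch still starts at `-9999`.

```python
import numpy as np

# Synthetic stand-in for the raster: values in 0-255 plus 5% fill values.
rng = np.random.default_rng(42)
values = rng.integers(0, 256, size=10_000).astype(float)
values[:500] = -9999.0  # 5% of cells hold the NoData fill value

# With more than 2% fill values, the 2nd percentile IS the fill value,
# so robust=True still stretches colors from -9999 upward.
print(np.percentile(values, 2))  # -9999.0

# After masking the fill values, the 2nd percentile is a sensible low end.
masked = values[values != -9999.0]
print(np.percentile(masked, 2))
```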

7) Before plotting, we use the `.where()` method to mask all values equal to `-9999.0`. This plots the figure correctly:

``````rgb_stack_HARV_wNA.where(rgb_stack_HARV_wNA != -9999.0).plot.imshow(robust=True)
``````

`robust=True` is still required to select the 2nd and 98th percentiles for color stretching; without it, the plot would use the full minimum-to-maximum range of the data and the color stretch would again be incorrect.

## Challenge: What Functions Can Be Used on a Python Object of a Particular Class?

1. What methods can be used on the `rgb_stack_HARV` object? Use Python’s built-in `dir` function to find out.
2. What methods can be used on a single band within `rgb_stack_HARV`?
3. Is there a difference? Why?

## Answers

1) We can see a list of all methods (and accessible attributes) for `rgb_stack_HARV` by typing the variable name followed by a period and then hitting the TAB key, or by using the built-in `dir()` function:

``````dir(rgb_stack_HARV)
``````
``````['STATISTICS_MAXIMUM', 'STATISTICS_MEAN', 'STATISTICS_MINIMUM', 'STATISTICS_STDDEV', ... , 'where', 'x', 'y']
``````

2) Use the same approach as above, in combination with the `sel()` method, to view the methods and attributes of the object containing a single band:

``````dir(rgb_stack_HARV.sel(band=1))
``````
``````['STATISTICS_MAXIMUM', 'STATISTICS_MEAN', 'STATISTICS_MINIMUM', 'STATISTICS_STDDEV', ... , 'where', 'x', 'y']
``````

3) Compare the output programmatically:

``````dir(rgb_stack_HARV) == dir(rgb_stack_HARV.sel(band=1))
``````
``````True
``````

The methods and attributes of each are the same because they are the same type of object, `xarray.DataArray`.
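The underlying rule is a general Python one, which we can check without any raster data: `dir()` reports the attributes of an object's class, so two objects of the same class produce the same listing.

```python
# dir() output is determined by the object's class, not its contents:
# selecting a subset of a list still yields a list, so the listings match.
full = [10, 20, 30]
subset = full[0:1]

print(type(full) == type(subset))  # True
print(dir(full) == dir(subset))    # True

# Filter out dunder names to see only the "public" methods.
public = [name for name in dir(full) if not name.startswith("_")]
print(public)
```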

## Key Points

• A single raster file can contain multiple bands or layers.

• Individual bands within a multi-band `DataArray` can be accessed, analyzed, and visualized using the same functions as single-band rasters.

# Derive Values from Raster Time Series FIXME

## Overview

Teaching: 40 min
Exercises: 20 min
Questions
• How can I calculate, extract, and export summarized raster pixel data?

Objectives
• Extract summary pixel values from a raster.

• Save summary values to a .csv file.

• Plot summary pixel values using `pandas.plot()`.

• Compare NDVI values between two different sites.

FIXME

# Raster Time Series Data in Python FIXME

## Overview

Teaching: 40 min
Exercises: 20 min
Questions
• How can I view and plot data for different times of the year?

Objectives
• Understand the format of a time series raster dataset.

• Work with time series rasters.

• Import a set of rasters stored in a single directory.

• Create a multi-paneled plot.

• Convert character data to `datetime` format.

FIXME

# Explore and Plot by Shapefile Attributes

## Overview

Teaching: 40 min
Exercises: 20 min
Questions
• How can I compute on the attributes of a spatial object?

Objectives
• Query attributes of a spatial object.

• Subset spatial objects using specific attribute values.

• Plot a shapefile, colored by unique attribute values.

``````# learners will have this data loaded from previous episodes
point_HARV = gpd.read_file("data/NEON-DS-Site-Layout-Files/HARV/HARVtower_UTM18N.shp")
lines_HARV = gpd.read_file("data/NEON-DS-Site-Layout-Files/HARV/HARV_roads.shp")
aoi_boundary_HARV = gpd.read_file(
"data/NEON-DS-Site-Layout-Files/HARV/HarClip_UTMZ18.shp")
``````

## Things You’ll Need To Complete This Episode

See the lesson homepage for detailed information about the software, data, and other prerequisites you will need to work through the examples in this episode.

This episode continues our discussion of shapefile attributes and covers how to work with shapefile attributes in Python. It covers how to identify and query shapefile attributes, as well as how to subset shapefiles by specific attribute values. Finally, we will learn how to plot a shapefile according to a set of attribute values.

## Load the Data

We will continue using the `geopandas`, `rioxarray`, and `matplotlib.pyplot` packages in this episode. Make sure that you have these packages loaded. We will continue to work with the three shapefiles that we loaded in the Open and Plot Shapefiles in Python episode.

## Query Shapefile Metadata

As we discussed in the Open and Plot Shapefiles in Python episode, we can view metadata associated with a `GeoDataFrame` using:

• `.type` - The type of vector data stored in the object.
• `len()` - The number of features in the object.
• `.bounds` - The spatial extent (the geographic area covered) of the object.
• `.crs` - The CRS (spatial projection) of the data.

We started to explore our `point_HARV` object in the previous episode. We can view the object with `point_HARV` or print a summary of the object itself to the console.

``````point_HARV
``````

We can count the number of attributes associated with a spatial object by viewing the columns of `lines_HARV` with `.columns`. Note that the geometry is stored as just another column, so it counts toward the total.

``````lines_HARV.columns
``````
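As a sketch, a plain `pandas` DataFrame can stand in for the GeoDataFrame here (the column names are made up for illustration): the geometry behaves like any other column when counting attributes.

```python
import pandas as pd

# A plain DataFrame stands in for a GeoDataFrame: in geopandas the
# geometry is stored as just another column, so it counts toward the total.
lines = pd.DataFrame({
    "TYPE": ["footpath", "woods road"],
    "geometry": ["LINESTRING (...)", "LINESTRING (...)"],
})
print(list(lines.columns))  # ['TYPE', 'geometry']
print(len(lines.columns))   # 2 attributes, geometry included
```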

## Challenge: Attributes for Different Spatial Classes

Explore the attributes associated with the `point_HARV` and `aoi_boundary_HARV` spatial objects.

1. How many attributes does each have?
2. Who owns the site in the `point_HARV` data object?
3. Which of the following is NOT an attribute of the `point_HARV` data object?

A) Latitude B) County C) Country

## Answers

1) To find the number of attributes, we use the `len()` function on the `.columns` attribute:

``````print(len(point_HARV.columns))
print(len(aoi_boundary_HARV.columns))
``````

2) Ownership information is in a column named `Ownership`:

``````point_HARV.Ownership
``````

3) To see a list of all of the attributes, we can use the `.columns` attribute:

``````point_HARV.columns
``````

“Country” is not an attribute of this object.

## Explore Values within One Attribute

We can explore individual values stored within a particular attribute. If we think of attributes as the columns of a spreadsheet or data frame, this is similar to exploring the values in a column. We did this with the `gapminder` dataframe in an earlier lesson. For `GeoDataFrames`, we can use the same syntax: `GeoDataFrame.attributeName` or `GeoDataFrame["attributeName"]`.
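A quick sketch with a plain `pandas` DataFrame shows that both access styles are equivalent (the same syntax applies to a `GeoDataFrame`, since it subclasses `DataFrame`; the values below are made up):

```python
import pandas as pd

df = pd.DataFrame({"TYPE": ["footpath", "boardwalk", "footpath"]})

# Both access styles return the same pandas Series; bracket notation also
# works for column names that are not valid Python identifiers.
print(df.TYPE.equals(df["TYPE"]))  # True
print(list(df["TYPE"]))
```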

We can see the contents of the `TYPE` field of our lines shapefile:

``````lines_HARV.TYPE
``````

To see only unique values within the `TYPE` field, we can use the `np.unique()` function for extracting the possible values of a categorical (or numerical) variable.

``````np.unique(lines_HARV.TYPE)
``````
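`np.unique()` works on any array-like, so its behavior can be sketched with a plain list of strings (hypothetical values, not the actual shapefile contents):

```python
import numpy as np

# np.unique returns the sorted distinct values, which is a convenient way
# to list the levels of a categorical column.
types = ["footpath", "woods road", "footpath", "boardwalk", "footpath"]
print(np.unique(types))  # ['boardwalk' 'footpath' 'woods road']
```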

### Subset Shapefiles

We can select a subset of features from a spatial object in Python using boolean indexing, just as we did with `pandas` data frames in an earlier lesson.

For example, we might be interested only in features that are of `TYPE` “footpath”. Once we subset out this data, we can use it as input to other code so that code only operates on the footpath lines.

``````footpath_HARV = lines_HARV[lines_HARV.TYPE == "footpath"]
len(footpath_HARV)
``````
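The same boolean-mask pattern can be sketched with a plain `pandas` DataFrame (synthetic rows standing in for the road features):

```python
import pandas as pd

lines = pd.DataFrame({"TYPE": ["footpath", "woods road", "footpath", "boardwalk"]})

# The comparison produces a True/False mask, one entry per row;
# indexing with the mask keeps only the matching features.
mask = lines.TYPE == "footpath"
footpaths = lines[mask]
print(len(footpaths))  # 2
```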

Our subsetting operation reduces the feature count to 2. This means that only two line features in our spatial object have the attribute `TYPE == "footpath"`. We can plot only the footpath lines:

``````footpath_HARV.plot()
``````

There are two features in our footpaths subset. Why does the plot look like there is only one feature? Let’s adjust the colors used in our plot. If we have 2 features in our vector object, we can plot each using a unique color by assigning a color map, or `cmap`, to the geometries/rows in our `GeoDataFrame`. We can also alter the default line thickness by using the `linewidth=` parameter, as the default value can be hard to see.

``````footpath_HARV.plot(cmap="viridis", linewidth=4)
``````

Now, we see that there are in fact two features in our plot!

## Challenge: Subset Spatial Line Objects Part 1

Subset out all `woods road` lines from the lines layer and plot them. There are many more color maps to use, so if you’d like, do a web search to find a matplotlib `cmap` that works better for this plot than `viridis`.

## Answers

First we will save an object with only the woods road lines:

``````woods_road_HARV = lines_HARV[lines_HARV.TYPE == "woods road"]
``````

Let’s check how many features there are in this subset:

``````len(woods_road_HARV)
``````

Now let’s plot that data:

``````woods_road_HARV.plot(cmap="viridis", linewidth=3)
``````

### Plot Lines by Attribute Value

We adjusted the line color earlier by applying an arbitrary color map. If we want a unique line color for each attribute category in our `GeoDataFrame`, we can use the `column` argument, together with some style arguments to improve the visuals.

We already know that we have four different `TYPE` levels in the `lines_HARV` object, so four different line colors will be used.

``````import matplotlib.pyplot as plt
plt.style.use("ggplot")
lines_HARV.plot(column="TYPE", linewidth=3, legend=True, figsize=(16,10))
``````

Our map is starting to come together! In the next episode we will add the Canopy Height Model that we calculated earlier.

## Challenge: Plot Polygon by Attribute

1. Create a map of the state boundaries in the United States using the data located in your downloaded data folder: `NEON-DS-Site-Layout-Files/US-Boundary-Layers/US-State-Boundaries-Census-2014`. Apply a fill color to each state using its `region` value. Add a legend.

## Answers

First we read in the data and check how many levels there are in the `region` column:

``````state_boundary_US = gpd.read_file("data/NEON-DS-Site-Layout-Files/US-Boundary-Layers/US-State-Boundaries-Census-2014.shp")

np.unique(state_boundary_US.region)
``````

Now we can create our plot:

``````state_boundary_US.plot(column = "region", linewidth = 2, legend = True, figsize=(20,5))
``````

## Key Points

• A `GeoDataFrame` in `geopandas` is similar to standard `pandas` data frames and can be manipulated using the same functions.

• Almost any feature of a plot can be customized using the various functions and options in the `matplotlib` package.