Content from Introduction to Raster Data
Last updated on 2023-08-14 | Edit this page
Overview
Questions
- What format should I use to represent my data?
- What are the main data types used for representing geospatial data?
- What are the main attributes of raster data?
Objectives
- Describe the difference between raster and vector data.
- Describe the strengths and weaknesses of storing data in raster format.
- Distinguish between continuous and categorical raster data and identify types of datasets that would be stored in each format.
Introduction
This episode introduces the two primary types of geospatial data: rasters and vectors. After briefly introducing these data types, this episode focuses on raster data, describing some major features and types of raster data.
Data Structures: Raster and Vector
The two primary types of geospatial data are raster and vector data. Raster data is stored as a grid of values which are rendered on a map as pixels. Each pixel value represents an area on the Earth’s surface. Vector data structures represent specific features on the Earth’s surface, and assign attributes to those features. Vector data structures will be discussed in more detail in the next episode.
This workshop will focus on how to work with both raster and vector data sets, therefore it is essential that we understand the basic structures of these types of data and the types of data that they can be used to represent.
About Raster Data
Raster data is any pixelated (or gridded) data where each pixel is associated with a specific geographic location. The value of a pixel can be continuous (e.g. elevation) or categorical (e.g. land use). If this sounds familiar, it is because this data structure is very common: it’s how we represent any digital image. A geospatial raster is only different from a digital photo in that it is accompanied by spatial information that connects the data to a particular location. This includes the raster’s extent and cell size, the number of rows and columns, and its coordinate reference system (or CRS).
Some examples of continuous rasters include:
- Precipitation maps.
- Maps of tree height derived from LiDAR data.
- Elevation values for a region.
A map of elevation for Harvard Forest derived from the NEON AOP LiDAR sensor is below. Elevation is represented as a continuous numeric variable in this map. The legend shows the continuous range of values in the data from around 300 to 420 meters.
Some rasters contain categorical data where each pixel represents a discrete class such as a landcover type (e.g., “forest” or “grassland”) rather than a continuous value such as elevation or temperature. Some examples of classified maps include:
- Landcover / land-use maps.
- Tree height maps classified as short, medium, and tall trees.
- Elevation maps classified as low, medium, and high elevation.
The map above shows the contiguous United States with landcover as categorical data. Each color is a different landcover category. (Source: Homer, C.G., et al., 2015, Completion of the 2011 National Land Cover Database for the conterminous United States-Representing a decade of land cover change information. Photogrammetric Engineering and Remote Sensing, v. 81, no. 5, p. 345-354)
Advantages and Disadvantages
With your neighbor, brainstorm potential advantages and disadvantages of storing data in raster format. Add your ideas to the Etherpad. The Instructor will discuss and add any points that weren’t brought up in the small group discussions.
Raster data has some important advantages:
- representation of continuous surfaces
- potentially very high levels of detail
- data is ‘unweighted’ across its extent - the geometry doesn’t implicitly highlight features
- cell-by-cell calculations can be very fast and efficient
The downsides of raster data are:
- very large file sizes as cell size gets smaller
- currently popular formats don’t embed metadata well (more on this later!)
- can be difficult to represent complex information
Important Attributes of Raster Data
Extent
The spatial extent is the geographic area that the raster data covers. The spatial extent of an object represents the geographic edge or location that is the furthest north, south, east and west. In other words, extent represents the overall geographic coverage of the spatial object.
Extent Challenge
In the image above, the dashed boxes around each set of objects seems to imply that the three objects have the same extent. Is this accurate? If not, which object(s) have a different extent?
The lines and polygon objects have the same extent. The extent for the points object is smaller in the vertical direction than the other two because there are no points on the line at y = 8.
Raster Data Format for this Workshop
Raster data can come in many different formats. For this workshop, we
will use the GeoTIFF format which has the extension .tif
. A
.tif
file stores metadata or attributes about the file as
embedded tif tags
. For instance, your camera might store a
tag that describes the make and model of the camera or the date the
photo was taken when it saves a .tif
. A GeoTIFF is a
standard .tif
image format with additional spatial
(georeferencing) information embedded in the file as tags. These tags
should include the following raster metadata:
- Extent
- Resolution
- Coordinate Reference System (CRS) - we will introduce this concept in a later episode
- Values that represent missing data (
NoDataValue
) - we will introduce this concept in a later episode.
We will discuss these attributes in more detail in a later episode. In that episode, we will also learn how to use Python to extract raster attributes from a GeoTIFF file.
More Resources on the .tif
format
Multi-band Raster Data
A raster can contain one or more bands. One type of multi-band raster dataset that is familiar to many of us is a color image. A basic color image consists of three bands: red, green, and blue. Each band represents light reflected from the red, green or blue portions of the electromagnetic spectrum. The pixel brightness for each band, when composited creates the colors that we see in an image.
We can plot each band of a multi-band image individually.
Or we can composite all three bands together to make a color image.
In a multi-band dataset, the rasters will always have the same extent, resolution, and CRS.
Other Types of Multi-band Raster Data
Multi-band raster data might also contain: 1. Time series: the same variable, over the same area, over time. 2. Multi or hyperspectral imagery: image rasters that have 4 or more (multi-spectral) or more than 10-15 (hyperspectral) bands. We won’t be working with this type of data in this workshop, but you can check out the NEON Data Skills Imaging Spectroscopy HDF5 in R tutorial if you’re interested in working with hyperspectral data cubes.
Key Points
- Raster data is pixelated data where each pixel is associated with a specific location.
- Raster data always has an extent and a resolution.
- The extent is the geographical area covered by a raster.
- The resolution is the area covered by each pixel of a raster.
Content from Introduction to Vector Data
Last updated on 2023-08-14 | Edit this page
Overview
Questions
- What are the main attributes of vector data?
Objectives
- Describe the strengths and weaknesses of storing data in vector format.
- Describe the three types of vectors and identify types of data that would be stored in each.
About Vector Data
Vector data structures represent specific features on the Earth’s surface, and assign attributes to those features. Vectors are composed of discrete geometric locations (x, y values) known as vertices that define the shape of the spatial object. The organization of the vertices determines the type of vector that we are working with: point, line or polygon.
Points: Each point is defined by a single x, y coordinate. There can be many points in a vector point file. Examples of point data include: sampling locations, the location of individual trees, or the location of survey plots.
Lines: Lines are composed of many (at least 2) points that are connected. For instance, a road or a stream may be represented by a line. This line is composed of a series of segments, each “bend” in the road or stream represents a vertex that has a defined x, y location.
Polygons: A polygon consists of 3 or more vertices that are connected and closed. The outlines of survey plot boundaries, lakes, oceans, and states or countries are often represented by polygons.
Data Tip
Sometimes, boundary layers such as states and countries, are stored as lines rather than polygons. However, these boundaries, when represented as a line, will not create a closed object with a defined area that can be filled.
Identify Vector Types
The plot below includes examples of two of the three types of vector objects. Use the definitions above to identify which features are represented by which vector type.
State boundaries are polygons. The Fisher Tower location is a point. There are no line features shown.
Vector data has some important advantages:
- The geometry itself contains information about what the dataset creator thought was important
- The geometry structures hold information in themselves - why choose point over polygon, for instance?
- Each geometry feature can carry multiple attributes instead of just one, e.g. a database of cities can have attributes for name, country, population, etc
- Data storage can be very efficient compared to rasters
The downsides of vector data include:
- Potential loss of detail compared to raster
- Potential bias in datasets - what didn’t get recorded?
- Calculations involving multiple vector layers need to do math on the geometry as well as the attributes, so can be slow compared to raster math.
Vector datasets are in use in many industries besides geospatial fields. For instance, computer graphics are largely vector-based, although the data structures in use tend to join points using arcs and complex curves rather than straight lines. Computer-aided design (CAD) is also vector- based. The difference is that geospatial datasets are accompanied by information tying their features to real-world locations.
Vector Data Format for this Workshop
Like raster data, vector data can also come in many different
formats. For this workshop, we will use the Shapefile format. A
Shapefile format consists of multiple files in the same directory, of
which .shp
, .shx
, and .dbf
files
are mandatory. Other non-mandatory but very important files are
.prj
and shp.xml
files.
- The
.shp
file stores the feature geometry itself -
.shx
is a positional index of the feature geometry to allow quickly searching forwards and backwards the geographic coordinates of each vertex in the vector -
.dbf
contains the tabular attributes for each shape. -
.prj
file indicates the Coordinate reference system (CRS) -
.shp.xml
contains the Shapefile metadata.
Together, the Shapefile includes the following information:
- Extent - the spatial extent of the shapefile (i.e. geographic area that the shapefile covers). The spatial extent for a shapefile represents the combined extent for all spatial objects in the shapefile.
- Object type - whether the shapefile includes points, lines, or polygons.
- Coordinate reference system (CRS)
- Other attributes - for example, a line shapefile that contains the locations of streams, might contain the name of each stream.
Because the structure of points, lines, and polygons are different, each individual shapefile can only contain one vector type (all points, all lines or all polygons). You will not find a mixture of point, line and polygon objects in a single shapefile.
More Resources on Shapefiles
More about shapefiles can be found on Wikipedia. Shapefiles are often publicly available from government services, such as this page from the US Census Bureau or this one from Australia’s Data.gov.au website.
Why not both?
Very few formats can contain both raster and vector data - in fact, most are even more restrictive than that. Vector datasets are usually locked to one geometry type, e.g. points only. Raster datasets can usually only encode one data type, for example you can’t have a multiband GeoTIFF where one layer is integer data and another is floating-point. There are sound reasons for this - format standards are easier to define and maintain, and so is metadata. The effects of particular data manipulations are more predictable if you are confident that all of your input data has the same characteristics.
Key Points
- Vector data structures represent specific features on the Earth’s surface along with attributes of those features.
- Vector objects are either points, lines, or polygons.
Content from Coordinate Reference Systems
Last updated on 2023-08-14 | Edit this page
Overview
Questions
- What is a coordinate reference system and how do I interpret one?
Objectives
- Name some common schemes for describing coordinate reference systems.
- Interpret a PROJ4 coordinate reference system description.
Coordinate Reference Systems
A data structure cannot be considered geospatial unless it is accompanied by coordinate reference system (CRS) information, in a format that geospatial applications can use to display and manipulate the data correctly. CRS information connects data to the Earth’s surface using a mathematical model.
CRS vs SRS
CRS (coordinate reference system) and SRS (spatial reference system) are synonyms and are commonly interchanged. We will use only CRS throughout this workshop.
The CRS associated with a dataset tells your mapping software (for example Python) where the raster is located in geographic space. It also tells the mapping software what method should be used to flatten or project the raster in geographic space.
The above image shows maps of the United States in different projections. Notice the differences in shape associated with each projection. These differences are a direct result of the calculations used to flatten the data onto a 2-dimensional map.
There are lots of great resources that describe coordinate reference systems and projections in greater detail. For the purposes of this workshop, what is important to understand is that data from the same location but saved in different projections will not line up in any GIS or other program. Thus, it’s important when working with spatial data to identify the coordinate reference system applied to the data and retain it throughout data processing and analysis.
Components of a CRS
CRS information has three components:
Datum: A model of the shape of the earth. It has angular units (i.e. degrees) and defines the starting point (i.e. where is [0,0]?) so the angles reference a meaningful spot on the earth. Common global datums are WGS84 and NAD83. Datums can also be local - fit to a particular area of the globe, but ill-fitting outside the area of intended use. In this workshop, we will use the WGS84 datum.
Projection: A mathematical transformation of the angular measurements on a round earth to a flat surface (i.e. paper or a computer screen). The units associated with a given projection are usually linear (feet, meters, etc.). In this workshop, we will see data in two different projections.
Additional Parameters: Additional parameters are often necessary to create the full coordinate reference system. One common additional parameter is a definition of the center of the map. The number of required additional parameters depends on what is needed by each specific projection.
Orange Peel Analogy
A common analogy employed to teach projections is the orange peel analogy. If you imagine that the Earth is an orange, how you peel it and then flatten the peel is similar to how projections get made.
- A datum is the choice of fruit to use. Is the Earth an orange, a lemon, a lime, a grapefruit?
A projection is how you peel your orange and then flatten the peel.
- An additional parameter could include a definition of the location of the stem of the fruit. What other parameters could be included in this analogy?
Which projection should I use?
To decide if a projection is right for your data, answer these questions:
- What is the area of minimal distortion?
- What aspect of the data does it preserve?
Peter Dana from the University of Colorado at Boulder and the Department of Geo-Information Processing have a good discussion of these aspects of projections. Online tools like Projection Wizard can also help you discover projections that might be a good fit for your data.
Data Tip
Take the time to identify a projection that is suited for your project. You don’t have to stick to the ones that are popular.
Describing Coordinate Reference Systems
There are several common systems in use for storing and transmitting CRS information, as well as translating among different CRSs. These systems generally comply with ISO 19111. Common systems for describing CRSs include EPSG, OGC WKT, and PROJ strings.
EPSG
The EPSG system is a database of CRS information maintained by the International Association of Oil and Gas Producers. The dataset contains both CRS definitions and information on how to safely convert data from one CRS to another. Using EPSG is easy as every CRS has an integer identifier, e.g. WGS84 is EPSG:4326. The downside is that you can only use the CRSs defined by EPSG and cannot customise them (some datasets do not have EPSG codes). epsg.io is an excellent website for finding suitable projections by location or for finding information about a particular EPSG code.
Well-Known Text
The Open Geospatial Consortium WKT standard is used by a number of important geospatial apps and software libraries. WKT is a nested list of geodetic parameters. The structure of the information is defined on their website. WKT is valuable in that the CRS information is more transparent than in EPSG, but can be more difficult to read and compare than PROJ since it is meant to necessarily represent more complex CRS information. Additionally, the WKT standard is implemented inconsistently across various software platforms, and the spec itself has some known issues.
PROJ
PROJ is an open-source library for storing, representing and transforming CRS information. PROJ strings continue to be used, but the format is deprecated by the PROJ C maintainers due to inaccuracies when converting to the WKT format. The data and python libraries we will be working with in this workshop use different underlying representations of CRSs under the hood for reprojecting. CRS information can still be represented with EPSG, WKT, or PROJ strings without consequence, but it is best to only use PROJ strings as a format for viewing CRS information, not for reprojecting data.
PROJ represents CRS information as a text string of key-value pairs, which makes it easy to read and interpret.
A PROJ4 string includes the following information:
- proj: the projection of the data
- zone: the zone of the data (this is specific to the UTM projection)
- datum: the datum used
- units: the units for the coordinates of the data
- ellps: the ellipsoid (how the earth’s roundness is calculated) for the data
Note that the zone is unique to the UTM projection. Not all CRSs will have a zone.
Reading a PROJ4 String
Here is a PROJ4 string for one of the datasets we will use in this workshop:
+proj=utm +zone=18 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0
- What projection, zone, datum, and ellipsoid are used for this data?
- What are the units of the data?
- Using the map above, what part of the United States was this data collected from?
- Projection is UTM, zone 18, datum is WGS84, ellipsoid is WGS84.
- The data is in meters.
- The data comes from the eastern US seaboard.
Format interoperability
Many existing file formats were invented by GIS software developers, often in a closed-source environment. This led to the large number of formats on offer today, and considerable problems transferring data between software environments. The Geospatial Data Abstraction Library (GDAL) is an open-source answer to this issue.
GDAL is a set of software tools that translate between almost any geospatial format in common use today (and some not so common ones). GDAL also contains tools for editing and manipulating both raster and vector files, including reprojecting data to different CRSs. GDAL can be used as a standalone command-line tool, or built in to other GIS software. Several open-source GIS programs use GDAL for all file import/export operations.
Metadata
Spatial data is useless without metadata. Essential metadata includes the CRS information, but proper spatial metadata encompasses more than that. History and provenance of a dataset (how it was made), who is in charge of maintaining it, and appropriate (and inappropriate!) use cases should also be documented in metadata. This information should accompany a spatial dataset wherever it goes. In practice this can be difficult, as many spatial data formats don’t have a built-in place to hold this kind of information. Metadata often has to be stored in a companion file, and generated and maintained manually.
More Resources on CRS
- spatialreference.org - A comprehensive online library of CRS information.
- QGIS Documentation - CRS Overview.
- Choosing the Right Map Projection.
- Video highlighting how map projections can make continents seems proportionally larger or smaller than they actually are.
Key Points
- All geospatial datasets (raster and vector) are associated with a specific coordinate reference system.
- A coordinate reference system includes datum, projection, and additional parameters specific to the dataset.
Content from The Geospatial Landscape
Last updated on 2023-08-14 | Edit this page
Overview
Questions
- What programs and applications are available for working with geospatial data?
Objectives
- Describe the difference between various approaches to geospatial computing, and their relative strengths and weaknesses.
- Name some commonly used GIS applications.
- Name some commonly used Python packages that can access and process spatial data.
- Describe pros and cons for working with geospatial data using a command-line versus a graphical user interface.
Standalone Software Packages
Most traditional GIS work is carried out in standalone applications that aim to provide end-to-end geospatial solutions. These applications are available under a wide range of licenses and price points. Some of the most common are listed below.
Open-source software
The Open Source Geospatial Foundation (OSGEO) supports several actively managed GIS platforms:
- QGIS is a professional GIS application that is built on top of and proud to be itself Free and Open Source Software (FOSS). QGIS is written in Python, has a python console interface, and has several interfaces written in R including RQGIS.
- GRASS GIS, commonly referred to as GRASS (Geographic Resources Analysis Support System), is a FOSS-GIS software suite used for geospatial data management and analysis, image processing, graphics and maps production, spatial modeling, and visualization. GRASS GIS is currently used in academic and commercial settings around the world, as well as by many governmental agencies and environmental consulting companies. It is a founding member of the Open Source Geospatial Foundation (OSGeo). GRASS GIS can be installed along with and made accessible within QGIS 3.
- GDAL is a multiplatform set of tools for translating between geospatial data formats. It can also handle reprojection and a variety of geoprocessing tasks. GDAL is built in to many applications both FOSS and commercial, including GRASS and QGIS.
- SAGA-GIS, or System for Automated Geoscientific Analyses, is a FOSS-GIS application developed by a small team of researchers from the Dept. of Physical Geography, Göttingen, and the Dept. of Physical Geography, Hamburg. SAGA has been designed for an easy and effective implementation of spatial algorithms, offers a comprehensive, growing set of geoscientific methods, provides an easily approachable user interface with many visualisation options, and runs under Windows and Linux operating systems. Like GRASS GIS, it can also be installed and made accessible in QGIS3.
- PostGIS is a geospatial extension to the PostGreSQL relational database.
Commercial software
- ESRI (Environmental Systems Research Institute) is an international supplier of geographic information system (GIS) software, web GIS and geodatabase management applications. ESRI provides several licenced platforms for performing GIS, including ArcGIS, ArcGIS Online, and Portal for ArcGIS a standalone version of ArGIS Online which you host locally. ESRI welcomes development on their platforms through their DevLabs. ArcGIS software can be installed using Chef Cookbooks from Github.
- Pitney Bowes produce MapInfo Professional, which was one of the earliest desktop GIS programs on the market.
- Hexagon Geospatial Power Portfolio includes many geospatial tools including ERDAS Imagine, powerful software for remote sensing.
- Manifold is a desktop GIS that emphasizes speed through the use of parallel and GPU processing.
Online + Cloud computing
- PANGEO is a community organization dedicated to open and reproducible data science with python. They focus on the Pangeo software ecosystem for working with big data in the geosciences.
- Google has created Google Earth Engine which combines a multi-petabyte catalog of satellite imagery and geospatial datasets with planetary-scale analysis capabilities and makes it available for scientists, researchers, and developers to detect changes, map trends, and quantify differences on the Earth’s surface. Earth Engine API runs in both Python and JavaScript.
- ArcGIS Online provides access to thousands of maps and base layers.
Private companies have released SDK platforms for large scale GIS analysis:
- Kepler.gl is Uber’s toolkit for handling large datasets (i.e. Uber’s data archive).
- Boundless Geospatial is built upon OSGEO software for enterprise solutions.
Publicly funded open-source platforms for large scale GIS analysis:
- PANGEO for the Earth Sciences. This community organization also supports python libraries like xarray, iris, dask, jupyter, and many other packages.
- Sepal.io by FAO Open Foris utilizing EOS satellite imagery and cloud resources for global forest monitoring.
GUI vs CLI
The earliest computer systems operated without a graphical user interface (GUI), relying only on the command-line interface (CLI). Since mapping and spatial analysis are strongly visual tasks, GIS applications benefited greatly from the emergence of GUIs and quickly came to rely heavily on them. Most modern GIS applications have very complex GUIs, with all common tools and procedures accessed via buttons and menus.
Benefits of using a GUI include:
- Tools are all laid out in front of you
- Complex commands are easy to build
- Don’t need to learn a coding language
- Cartography and visualisation is more intuitive and flexible
Downsides of using a GUI include:
- Low reproducibility - you can’t record your actions and replay
- Most are not designed for batch-processing files
- Limited ability to customise functions or write your own
- Intimidating interface for new users - so many buttons!
In scientific computing, the lack of reproducibility in point-and-click software has come to be viewed as a critical weakness. As such, scripted CLI-style workflows are again becoming popular, which leads us to another approach to doing GIS — via a programming language. This is the approach we will be using throughout this workshop.
GIS in programming languages
A number of powerful geospatial processing libraries exist for general-purpose programming languages like Java and C++. However, the learning curve for these languages is steep and the effort required is excessive for users who only need a subset of their functionality.
Higher-level scripting languages like Python and R are easier to
learn and use. Both now have their own packages that wrap up those
geospatial processing libraries and make them easy to access and use
safely. A key example is the Java Topology Suite (JTS), which is
implemented in C++ as GEOS. GEOS is accessible in Python via the
shapely
package (and geopandas
, which makes
use of shapely
) and in R via sf
. R and Python
also have interface packages for GDAL, and for specific GIS apps.
This last point is a huge advantage for GIS-by-programming; these interface packages give you the ability to access functions unique to particular programs, but have your entire workflow recorded in a central document - a document that can be re-run at will. Below are lists of some of the key spatial packages for Python, which we will be using in the remainder of this workshop.
-
geopandas
andgeocube
for working with vector data -
rasterio
andrioxarray
for working with raster data
These packages along with the matplotlib
package are all
we need for spatial data visualisation. Python also has many fundamental
scientific packages that are relevant in the geospatial domain. Below is
a list of particularly fundamental packages. numpy
,
scipy
, and scikit-image
are all excellent
options for working with rasters, as arrays.
An overview of these and other Python spatial packages can be accessed here.
As a programming language, Python can be a CLI tool. However, using Python together with an Integrated Development Environment (IDE) application allows some GUI features to become part of your workflow. IDEs allow the best of both worlds. They provide a place to visually examine data and other software objects, interact with your file system, and draw plots and maps, but your activities are still command-driven: recordable and reproducible. There are several IDEs available for Python. JupyterLab is well-developed and the most widely used option for data science in Python. VSCode and Spyder are other popular options for data science.
Traditional GIS apps are also moving back towards providing a scripting environment for users, further blurring the CLI/GUI divide. ESRI have adopted Python into their software, and QGIS is both Python and R-friendly.
GIS File Types
There are a variety of file types that are used in GIS analysis. Depending on the program you choose to use some file types can be used while others are not readable. Below is a brief table describing some of the most common vector and raster file types.
Vector
File Type | Extensions | Description |
---|---|---|
Esri Shapefile | .SHP .DBF .SHX | The most common geospatial file type. This has become the industry standard. The three required files are: SHP is the feature geometry. SHX is the shape index position. DBF is the attribute data. |
Geographic JavaScript Object Notation (GeoJSON) | .GEOJSON .JSON | Used for web-based mapping and uses JavaScript Object Notation to store the coordinates as text. |
Google Keyhole Markup Language (KML) | .KML .KMZ | KML stands for Keyhole Markup Language. This GIS format is XML-based and is primarily used for Google Earth. |
OpenStreetMap | .OSM | OSM files are the native file for OpenStreetMap which had become the largest crowdsourcing GIS data project in the world. These files are a collection of vector features from crowd-sourced contributions from the open community. |
Raster
File Type | Extensions | Description |
---|---|---|
ERDAS Imagine | .IMG | ERDAS Imagine IMG files is a proprietary file format developed by Hexagon Geospatial. IMG files are commonly used for raster data to store single and multiple bands of satellite data.Each raster layer as part of an IMG file contains information about its data values. For example, this includes projection, statistics, attributes, pyramids and whether or not it’s a continuous or discrete type of raster. |
GeoTIFF | .TIF .TIFF .OVR | The GeoTIFF has become an industry image standard file for GIS and satellite remote sensing applications. GeoTIFFs may be accompanied by other files:TFW is the world file that is required to give your raster geolocation.XML optionally accompany GeoTIFFs and are your metadata.AUX auxiliary files store projections and other information.OVR pyramid files improves performance for raster display. |
Cloud Optimized GeoTIFF (COG) | .TIF .TIFF | Based on the GeoTIFF standard, COGs incorporate tiling and overviews to support HTTP range requests where users can query and load subsets of the image without having to transfer the entire file. |
Key Points
- Many software packages exist for working with geospatial data.
- Command-line programs allow you to automate and reproduce your work.
- JupyterLab provides a user-friendly interface for working with Python.
Content from Access satellite imagery using Python
Last updated on 2023-08-29 | Edit this page
Overview
Questions
- Where can I find open-access satellite data?
- How do I search for satellite imagery with the STAC API?
- How do I fetch remote raster datasets using Python?
Objectives
- Search public STAC repositories of satellite imagery using Python.
- Inspect search result’s metadata.
- Download (a subset of) the assets available for a satellite scene.
- Open satellite imagery as raster data and save it to disk.
Introduction
A number of satellites take snapshots of the Earth’s surface from space. The images recorded by these remote sensors represent a very precious data source for any activity that involves monitoring changes on Earth. Satellite imagery is typically provided in the form of geospatial raster data, with the measurements in each grid cell (“pixel”) being associated to accurate geographic coordinate information.
In this episode we will explore how to access open satellite data using Python. In particular, we will consider the Sentinel-2 data collection that is hosted on AWS. This dataset consists of multi-band optical images acquired by the two satellites of the Sentinel-2 mission and it is continuously updated with new images.
Search for satellite imagery
The SpatioTemporal Asset Catalog (STAC) specification
Current sensor resolutions and satellite revisit periods are such that terabytes of data products are added daily to the corresponding collections. Such datasets cannot be made accessible to users via full-catalog download. Space agencies and other data providers often offer access to their data catalogs through interactive Graphical User Interfaces (GUIs), see for instance the Copernicus Open Access Hub portal for the Sentinel missions. Accessing data via a GUI is a nice way to explore a catalog and get familiar with its content, but it represents a heavy and error-prone task that should be avoided if carried out systematically to retrieve data.
A service that offers programmatic access to the data enables users to reach the desired data in a more reliable, scalable and reproducible manner. An important element in the software interface exposed to the users, which is generally called the Application Programming Interface (API), is the use of standards. Standards, in fact, can significantly facilitate the reusability of tools and scripts across datasets and applications.
The SpatioTemporal Asset Catalog (STAC) specification is an emerging standard for describing geospatial data. By organizing metadata in a form that adheres to the STAC specifications, data providers make it possible for users to access data from different missions, instruments and collections using the same set of tools.
More Resources on STAC
Search a STAC catalog
The STAC browser is a good starting point to discover available datasets, as it provides an up-to-date list of existing STAC catalogs. From the list, let’s click on the “Earth Search” catalog, i.e. the access point to search the archive of Sentinel-2 images hosted on AWS.
Exercise: Discover a STAC catalog
Let’s take a moment to explore the Earth Search STAC catalog, which is the catalog indexing the Sentinel-2 collection that is hosted on AWS. We can interactively browse this catalog using the STAC browser at this link.
- Open the link in your web browser. Which (sub-)catalogs are available?
- Open the Sentinel-2 Level 2A collection, and select one item from the list. Each item corresponds to a satellite “scene”, i.e. a portion of the footage recorded by the satellite at a given time. Have a look at the metadata fields and the list of assets. What kind of data do the assets represent?
- 7 subcatalogs are available, including a catalog for Landsat Collection 2, Level-2 and Sentinel-2 Level 2A (see left screenshot in the figure above).
- When you select the Sentinel-2 Level 2A collection, and randomly choose one of the items from the list, you should find yourself on a page similar to the right screenshot in the figure above. On the left side you will find a list of the available assets: overview images (thumbnail and true color images), metadata files and the “real” satellite images, one for each band captured by the Multispectral Instrument on board Sentinel-2.
When opening a catalog with the STAC browser, you can access the API URL by clicking on the “Source” button on the top right of the page. By using this URL, we have access to the catalog content and, if supported by the catalog, to the functionality of searching its items. For the Earth Search STAC catalog the API URL is:
You can query a STAC API endpoint from Python using the
pystac_client
library:
In the following, we ask for scenes belonging to the
sentinel-2-l2a
collection. This dataset includes Sentinel-2
data products pre-processed at level 2A (bottom-of-atmosphere
reflectance) and saved in Cloud Optimized GeoTIFF (COG) format:
Cloud Optimized GeoTIFFs
Cloud Optimized GeoTIFFs (COGs) are regular GeoTIFF files with some additional features that make them ideal to be employed in the context of cloud computing and other web-based services. This format builds on the widely-employed GeoTIFF format, already introduced in Episode 1: Introduction to Raster Data. In essence, COGs are regular GeoTIFF files with a special internal structure. One of the features of COGs is that data is organized in “blocks” that can be accessed remotely via independent HTTP requests. Data users can thus access the only blocks of a GeoTIFF that are relevant for their analysis, without having to download the full file. In addition, COGs typically include multiple lower-resolution versions of the original image, called “overviews”, which can also be accessed independently. By providing this “pyramidal” structure, users that are not interested in the details provided by a high-resolution raster can directly access the lower-resolution versions of the same image, significantly saving on the downloading time. More information on the COG format can be found here.
We also ask for scenes intersecting a geometry defined using the
shapely
library (in this case, a point):
Note: at this stage, we are only dealing with metadata, so no image is going to be downloaded yet. But even metadata can be quite bulky if a large number of scenes match our search! For this reason, we limit the search result to 10 items:
We submit the query and find out how many scenes match our search criteria (please note that this output can be different as more data is added to the catalog):
OUTPUT
840
Finally, we retrieve the metadata of the search results:
The variable items
is an ItemCollection
object. We can check its size by:
OUTPUT
10
which is consistent with the maximum number of items that we have set in the search criteria. We can iterate over the returned items and print these to show their IDs:
OUTPUT
<Item id=S2A_31UFU_20230701_0_L2A>
<Item id=S2B_31UFU_20230629_0_L2A>
<Item id=S2B_31UFU_20230626_0_L2A>
<Item id=S2A_31UFU_20230624_0_L2A>
<Item id=S2A_31UFU_20230621_0_L2A>
<Item id=S2B_31UFU_20230616_0_L2A>
<Item id=S2A_31UFU_20230614_0_L2A>
<Item id=S2A_31UFU_20230611_0_L2A>
<Item id=S2B_31UFU_20230609_0_L2A>
<Item id=S2B_31UFU_20230606_0_L2A>
Each of the items contains information about the scene geometry, its
acquisition time, and other metadata that can be accessed as a
dictionary from the properties
attribute.
Let’s inspect the metadata associated with the first item of the search results:
OUTPUT
2023-07-01 10:46:30.262000+00:00
{'type': 'Polygon', 'coordinates': [[[5.233744523520149, 53.228684673408296], [6.141754296879459, 53.20819279121764], [6.071664488869862, 52.22257539160585], [4.80943323800081, 52.2486879358387], [5.233744523520149, 53.228684673408296]]]}
{'created': '2023-07-02T01:49:17.191Z', 'platform': 'sentinel-2a', 'constellation': 'sentinel-2', 'instruments': ['msi'], 'eo:cloud_cover': 99.952936, 'proj:epsg': 32631, 'mgrs:utm_zone': 31, 'mgrs:latitude_band': 'U', 'mgrs:grid_square': 'FU', 'grid:code': 'MGRS-31UFU', 'view:sun_azimuth': 154.716674921261, 'view:sun_elevation': 58.4960054056685, 's2:degraded_msi_data_percentage': 0.0346, 's2:nodata_pixel_percentage': 33.00232, 's2:saturated_defective_pixel_percentage': 0, 's2:dark_features_percentage': 0, 's2:cloud_shadow_percentage': 0.030847, 's2:vegetation_percentage': 0, 's2:not_vegetated_percentage': 0.004947, 's2:water_percentage': 0.011271, 's2:unclassified_percentage': 0, 's2:medium_proba_clouds_percentage': 5.838514, 's2:high_proba_clouds_percentage': 94.035202, 's2:thin_cirrus_percentage': 0.07922, 's2:snow_ice_percentage': 0, 's2:product_type': 'S2MSI2A', 's2:processing_baseline': '05.09', 's2:product_uri': 'S2A_MSIL2A_20230701T103631_N0509_R008_T31UFU_20230701T200058.SAFE', 's2:generation_time': '2023-07-01T20:00:58.000000Z', 's2:datatake_id': 'GS2A_20230701T103631_041904_N05.09', 's2:datatake_type': 'INS-NOBS', 's2:datastrip_id': 'S2A_OPER_MSI_L2A_DS_2APS_20230701T200058_S20230701T104159_N05.09', 's2:granule_id': 'S2A_OPER_MSI_L2A_TL_2APS_20230701T200058_A041904_T31UFU_N05.09', 's2:reflectance_conversion_factor': 0.967641353116838, 'datetime': '2023-07-01T10:46:30.262000Z', 's2:sequence': '0', 'earthsearch:s3_path': 's3://sentinel-cogs/sentinel-s2-l2a-cogs/31/U/FU/2023/7/S2A_31UFU_20230701_0_L2A', 'earthsearch:payload_id': 'roda-sentinel2/workflow-sentinel2-to-stac/7b1a81ed3fb8d763a0cecf8d9edd4d4a', 'earthsearch:boa_offset_applied': True, 'processing:software': {'sentinel2-to-stac': '0.1.0'}, 'updated': '2023-07-02T01:49:17.191Z'}
Exercise: Search satellite scenes using metadata filters
Search for all the available Sentinel-2 scenes in the
sentinel-2-l2a
collection that satisfy the following
criteria: - intersect a provided bounding box (use ±0.01 deg in lat/lon
from the previously defined point); - have been recorded between 20
March 2020 and 30 March 2020; - have a cloud coverage smaller than 10%
(hint: use the query
input argument of
client.search
).
How many scenes are available? Save the search results in GeoJSON format.
Access the assets
So far we have only discussed metadata - but how can one get to the
actual images of a satellite scene (the “assets” in the STAC
nomenclature)? These can be reached via links that are made available
through the item’s attribute assets
.
OUTPUT
dict_keys(['aot', 'blue', 'coastal', 'granule_metadata', 'green', 'nir', 'nir08', 'nir09', 'red', 'rededge1', 'rededge2', 'rededge3', 'scl', 'swir16', 'swir22', 'thumbnail', 'tileinfo_metadata', 'visual', 'wvp', 'aot-jp2', 'blue-jp2', 'coastal-jp2', 'green-jp2', 'nir-jp2', 'nir08-jp2', 'nir09-jp2', 'red-jp2', 'rededge1-jp2', 'rededge2-jp2', 'rededge3-jp2', 'scl-jp2', 'swir16-jp2', 'swir22-jp2', 'visual-jp2', 'wvp-jp2'])
We can print a minimal description of the available assets:
OUTPUT
aot: Aerosol optical thickness (AOT)
blue: Blue (band 2) - 10m
coastal: Coastal aerosol (band 1) - 60m
granule_metadata: None
green: Green (band 3) - 10m
nir: NIR 1 (band 8) - 10m
nir08: NIR 2 (band 8A) - 20m
nir09: NIR 3 (band 9) - 60m
red: Red (band 4) - 10m
rededge1: Red edge 1 (band 5) - 20m
rededge2: Red edge 2 (band 6) - 20m
rededge3: Red edge 3 (band 7) - 20m
scl: Scene classification map (SCL)
swir16: SWIR 1 (band 11) - 20m
swir22: SWIR 2 (band 12) - 20m
thumbnail: Thumbnail image
tileinfo_metadata: None
visual: True color image
wvp: Water vapour (WVP)
aot-jp2: Aerosol optical thickness (AOT)
blue-jp2: Blue (band 2) - 10m
coastal-jp2: Coastal aerosol (band 1) - 60m
green-jp2: Green (band 3) - 10m
nir-jp2: NIR 1 (band 8) - 10m
nir08-jp2: NIR 2 (band 8A) - 20m
nir09-jp2: NIR 3 (band 9) - 60m
red-jp2: Red (band 4) - 10m
rededge1-jp2: Red edge 1 (band 5) - 20m
rededge2-jp2: Red edge 2 (band 6) - 20m
rededge3-jp2: Red edge 3 (band 7) - 20m
scl-jp2: Scene classification map (SCL)
swir16-jp2: SWIR 1 (band 11) - 20m
swir22-jp2: SWIR 2 (band 12) - 20m
visual-jp2: True color image
wvp-jp2: Water vapour (WVP)
Among the others, assets include multiple raster data files (one per optical band, as acquired by the multi-spectral instrument), a thumbnail, a true-color image (“visual”), instrument metadata and scene-classification information (“SCL”). Let’s get the URL links to the actual asset:
OUTPUT
https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/31/U/FU/2020/3/S2A_31UFU_20200328_0_L2A/thumbnail.jpg
This can be used to download the corresponding file:
Remote raster data can be directly opened via the
rioxarray
library. We will learn more about this library in
the next episodes.
PYTHON
import rioxarray
nir_href = assets["nir"].href
nir = rioxarray.open_rasterio(nir_href)
print(nir)
OUTPUT
<xarray.DataArray (band: 1, y: 10980, x: 10980)>
[120560400 values with dtype=uint16]
Coordinates:
* band (band) int64 1
* x (x) float64 6e+05 6e+05 6e+05 ... 7.098e+05 7.098e+05 7.098e+05
* y (y) float64 5.9e+06 5.9e+06 5.9e+06 ... 5.79e+06 5.79e+06
spatial_ref int64 0
Attributes:
AREA_OR_POINT: Area
OVR_RESAMPLING_ALG: AVERAGE
_FillValue: 0
scale_factor: 1.0
add_offset: 0.0
We can then save the data to disk:
Since that might take a while, given there are over 10000 x 10000 = a hundred million pixels in the 10 meter NIR band, you can take a smaller subset before downloading it. Becuase the raster is a COG, we can download just what we need!
Here, we specify that we want to download the first (and only) band in the tif file, and a slice of the width and height dimensions.
PYTHON
# save portion of an image to disk
nir[0,1500:2200,1500:2200].rio.to_raster("nir_subset.tif")
The difference is 155 Megabytes for the large image vs about 1 Megabyte for the subset.
Exercise: Downloading Landsat 8 Assets
In this exercise we put in practice all the skills we have learned in
this episode to retrieve images from a different mission: Landsat 8. In
particular, we browse images from the Harmonized Landsat
Sentinel-2 (HLS) project, which provides images from NASA’s Landsat
8 and ESA’s Sentinel-2 that have been made consistent with each other.
The HLS catalog is indexed in the NASA Common Metadata Repository (CMR)
and it can be accessed from the STAC API endpoint at the following URL:
https://cmr.earthdata.nasa.gov/stac/LPCLOUD
.
- Using
pystac_client
, search for all assets of the Landsat 8 collection (HLSL30.v2.0
) from February to March 2021, intersecting the point with longitude/latitute coordinates (-73.97, 40.78) deg. - Visualize an item’s thumbnail (asset key
browse
).
PYTHON
# connect to the STAC endpoint
cmr_api_url = "https://cmr.earthdata.nasa.gov/stac/LPCLOUD"
client = Client.open(cmr_api_url)
# setup search
search = client.search(
collections=["HLSL30.v2.0"],
intersects=Point(-73.97, 40.78),
datetime="2021-02-01/2021-03-30",
) # nasa cmr cloud cover filtering is currently broken: https://github.com/nasa/cmr-stac/issues/239
# retrieve search results
items = search.item_collection()
print(len(items))
OUTPUT
5
PYTHON
items_sorted = sorted(items, key=lambda x: x.properties["eo:cloud_cover"]) # sorting and then selecting by cloud cover
item = items_sorted[0]
print(item)
OUTPUT
<Item id=HLS.L30.T18TWL.2021039T153324.v2.0>
OUTPUT
'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-public/HLSL30.020/HLS.L30.T18TWL.2021039T153324.v2.0/HLS.L30.T18TWL.039T153324.v2.0.jpg'
Public catalogs, protected data
Publicly accessible catalogs and STAC endpoints do not necessarily imply publicly accessible data. Data providers, in fact, may limit data access to specific infrastructures and/or require authentication. For instance, the NASA CMR STAC endpoint considered in the last exercise offers publicly accessible metadata for the HLS collection, but most of the linked assets are available only for registered users (the thumbnail is publicly accessible).
The authentication procedure for dataset with restricted access might differ depending on the data provider. For the NASA CMR, follow these steps in order to access data using Python:
- Create a NASA Earthdata login account here;
- Set up a netrc file with your credentials, e.g. by using this script;
- Define the following environment variables:
Key Points
- Accessing satellite images via the providers’ API enables a more reliable and scalable data retrieval.
- STAC catalogs can be browsed and searched using the same tools and scripts.
-
rioxarray
allows you to open and download remote raster files.
Content from Read and visualize raster data
Last updated on 2023-08-23 | Edit this page
Overview
Questions
- How is a raster represented by rioxarray?
- How do I read and plot raster data in Python?
- How can I handle missing data?
Objectives
- Describe the fundamental attributes of a raster dataset.
- Explore raster attributes and metadata using Python.
- Read rasters into Python using the
rioxarray
package. - Visualize single/multi-band raster data.
Raster datasets have been introduced in Episode 1: Introduction to Raster Data. Here, we introduce the fundamental principles, packages and metadata/raster attributes for working with raster data in Python. We will also explore how Python handles missing and bad data values.
rioxarray
is the Python package we will use throughout this lesson to work with
raster data. It is based on the popular rasterio
package for working with rasters and xarray
for
working with multi-dimensional arrays. rioxarray
extends
xarray
by providing top-level functions (e.g. the
open_rasterio
function to open raster datasets) and by
adding a set of methods to the main objects of the xarray
package (the Dataset
and the DataArray
). These
additional methods are made available via the rio
accessor
and become available from xarray
objects after importing
rioxarray
.
We will also use the pystac
package to load rasters from the search results we created in the
previous episode.
Introduce the Raster Data
We’ll continue from the results of the satellite image search that we
have carried out in an exercise from a
previous episode. We will load data starting from the
search.json
file, using one scene from the search results
as an example to demonstrate data loading and visualization.
If you would like to work with the data for this lesson without
downloading data on-the-fly, you can download the raster data ahead of
time using this link. Save
the geospatial-python-raster-dataset.tar.gz
file in your
current working directory, and extract the archive file by
double-clicking on it or by running the following command in your
terminal tar -zxvf geospatial-python-raster-dataset.tar.gz
.
Use the file geospatial-python-raster-dataset/search.json
(instead of search.json
) to get started with this
lesson.
This can be useful if you need to download the data ahead of time to work through the lesson offline, or if you want to work with the data in a different GIS.
Load a Raster and View Attributes
In the previous episode, we searched for Sentinel-2 images, and then
saved the search results to a file: search.json
. This
contains the information on where and how to access the target images
from a remote repository. We can use the function
pystac.ItemCollection.from_file()
to load the search
results as an Item
list.
In the search results, we have 2 Item
type objects,
corresponding to 4 Sentinel-2 scenes from March 26th and 28th in 2020.
We will focus on the first scene S2A_31UFU_20200328_0_L2A
,
and load band nir09
(central wavelength 945 nm). We can
load this band using the function
rioxarray.open_rasterio()
, via the Hypertext Reference
href
(commonly referred to as a URL):
By calling the variable name in the jupyter notebook we can get a quick look at the shape and attributes of the data.
OUTPUT
<xarray.DataArray (band: 1, y: 1830, x: 1830)>
[3348900 values with dtype=uint16]
Coordinates:
* band (band) int64 1
* x (x) float64 6e+05 6.001e+05 6.002e+05 ... 7.097e+05 7.098e+05
* y (y) float64 5.9e+06 5.9e+06 5.9e+06 ... 5.79e+06 5.79e+06
spatial_ref int64 0
Attributes:
_FillValue: 0.0
scale_factor: 1.0
add_offset: 0.0
The first call to rioxarray.open_rasterio()
opens the
file from remote or local storage, and then returns a
xarray.DataArray
object. The object is stored in a
variable, i.e. raster_ams_b9
. Reading in the data with
xarray
instead of rioxarray
also returns a
xarray.DataArray
, but the output will not contain the
geospatial metadata (such as projection information). You can use numpy
functions or built-in Python math operators on a
xarray.DataArray
just like a numpy array. Calling the
variable name of the DataArray
also prints out all of its
metadata information.
The output tells us that we are looking at an
xarray.DataArray
, with 1
band,
1830
rows, and 1830
columns. We can also see
the number of pixel values in the DataArray
, and the type
of those pixel values, which is unsigned integer (or
uint16
). The DataArray
also stores different
values for the coordinates of the DataArray
. When using
rioxarray
, the term coordinates refers to spatial
coordinates like x
and y
but also the
band
coordinate. Each of these sequences of values has its
own data type, like float64
for the spatial coordinates and
int64
for the band
coordinate.
This DataArray
object also has a couple of attributes
that are accessed like .rio.crs
, .rio.nodata
,
and .rio.bounds()
, which contain the metadata for the file
we opened. Note that many of the metadata are accessed as attributes
without ()
, but bounds()
is a method (i.e. a
function in an object) and needs parentheses.
PYTHON
print(raster_ams_b9.rio.crs)
print(raster_ams_b9.rio.nodata)
print(raster_ams_b9.rio.bounds())
print(raster_ams_b9.rio.width)
print(raster_ams_b9.rio.height)
OUTPUT
EPSG:32631
0
(600000.0, 5790240.0, 709800.0, 5900040.0)
1830
1830
The Coordinate Reference System, or
raster_ams_b9.rio.crs
, is reported as the string
EPSG:32631
. The nodata
value is encoded as 0
and the bounding box corners of our raster are represented by the output
of .bounds()
as a tuple
(like a list but you
can’t edit it). The height and width match what we saw when we printed
the DataArray
, but by using .rio.width
and
.rio.height
we can access these values if we need them in
calculations.
We will be exploring this data throughout this episode. By the end of this episode, you will be able to understand and explain the metadata output.
Tip - Variable names
To improve code readability, file and object names should be used
that make it clear what is in the file. The data for this episode covers
Amsterdam, and is from Band 9, so we’ll use a naming convention of
raster_ams_b9
.
Visualize a Raster
After viewing the attributes of our raster, we can examine the raw
values of the array with .values
:
OUTPUT
array([[[ 0, 0, 0, ..., 8888, 9075, 8139],
[ 0, 0, 0, ..., 10444, 10358, 8669],
[ 0, 0, 0, ..., 10346, 10659, 9168],
...,
[ 0, 0, 0, ..., 4295, 4289, 4320],
[ 0, 0, 0, ..., 4291, 4269, 4179],
[ 0, 0, 0, ..., 3944, 3503, 3862]]], dtype=uint16)
This can give us a quick view of the values of our array, but only at
the corners. Since our raster is loaded in python as a
DataArray
type, we can plot this in one line similar to a
pandas DataFrame
with DataArray.plot()
.
Nice plot! Notice that rioxarray
helpfully allows us to
plot this raster with spatial coordinates on the x and y axis (this is
not the default in many cases with other functions or libraries).
This plot shows the satellite measurement of the spectral band
nir09
for an area that covers part of the Netherlands.
According to the Sentinel-2
documentaion, this is a band with the central wavelength of 945nm,
which is sensitive to water vapor. It has a spatial resolution of 60m.
Note that the band=1
in the image title refers to the
ordering of all the bands in the DataArray
, not the
Sentinel-2 band number 09
that we saw in the pystac search
results.
With a quick view of the image, we notice that half of the image is
blank, no data is captured. We also see that the cloudy pixels at the
top have high reflectance values, while the contrast of everything else
is quite low. This is expected because this band is sensitive to the
water vapor. However if one would like to have a better color contrast,
one can add the option robust=True
, which displays values
between the 2nd and 98th percentile:
Now the color limit is set in a way fitting most of the values in the image. We have a better view of the ground pixels.
Tool Tip
The option robust=True
always forces displaying values
between the 2nd and 98th percentile. Of course, this will not work for
every case. For a customized displaying range, you can also manually
specifying the keywords vmin
and vmax
. For
example ploting between 100
and 7000
:
View Raster Coordinate Reference System (CRS) in Python
Another information that we’re interested in is the CRS, and it can
be accessed with .rio.crs
. We introduced the concept of a
CRS in an earlier episode. Now we will see how
features of the CRS appear in our data file and what meanings they have.
We can view the CRS string associated with our DataArray’s
rio
object using the crs
attribute.
OUTPUT
EPSG:32631
To print the EPSG code number as an int
, we use the
.to_epsg()
method:
OUTPUT
32631
EPSG codes are great for succinctly representing a particular
coordinate reference system. But what if we want to see more details
about the CRS, like the units? For that, we can use pyproj
,
a library for representing and working with coordinate reference
systems.
OUTPUT
<Derived Projected CRS: EPSG:32631>
Name: WGS 84 / UTM zone 31N
Axis Info [cartesian]:
- E[east]: Easting (metre)
- N[north]: Northing (metre)
Area of Use:
- name: Between 0°E and 6°E, northern hemisphere between equator and 84°N, onshore and offshore. Algeria. Andorra. Belgium. Benin. Burkina Faso. Denmark - North Sea. France. Germany - North Sea. Ghana. Luxembourg. Mali. Netherlands. Niger. Nigeria. Norway. Spain. Togo. United Kingdom (UK) - North Sea.
- bounds: (0.0, 0.0, 6.0, 84.0)
Coordinate Operation:
- name: UTM zone 31N
- method: Transverse Mercator
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
The CRS
class from the pyproj
library
allows us to create a CRS
object with methods and
attributes for accessing specific information about a CRS, or the
detailed summary shown above.
A particularly useful attribute is area_of_use
, which
shows the geographic bounds that the CRS is intended to be used.
OUTPUT
AreaOfUse(west=0.0, south=0.0, east=6.0, north=84.0, name='Between 0°E and 6°E, northern hemisphere between equator and 84°N, onshore and offshore. Algeria. Andorra. Belgium. Benin. Burkina Faso. Denmark - North Sea. France. Germany - North Sea. Ghana. Luxembourg. Mali. Netherlands. Niger. Nigeria. Norway. Spain. Togo. United Kingdom (UK) - North Sea.')
Exercise: find the axes units of the CRS
What units are our data in? See if you can find a method to examine
this information using help(crs)
or
dir(crs)
crs.axis_info
tells us that the CRS for our raster has
two axis and both are in meters. We could also get this information from
the attribute raster_ams_b9.rio.crs.linear_units
.
Understanding pyproj CRS Summary
Let’s break down the pieces of the pyproj
CRS summary.
The string contains all of the individual CRS elements that Python or
another GIS might need, separated into distinct sections, and datum.
OUTPUT
<Derived Projected CRS: EPSG:32631>
Name: WGS 84 / UTM zone 31N
Axis Info [cartesian]:
- E[east]: Easting (metre)
- N[north]: Northing (metre)
Area of Use:
- name: Between 0°E and 6°E, northern hemisphere between equator and 84°N, onshore and offshore. Algeria. Andorra. Belgium. Benin. Burkina Faso. Denmark - North Sea. France. Germany - North Sea. Ghana. Luxembourg. Mali. Netherlands. Niger. Nigeria. Norway. Spain. Togo. United Kingdom (UK) - North Sea.
- bounds: (0.0, 0.0, 6.0, 84.0)
Coordinate Operation:
- name: UTM zone 31N
- method: Transverse Mercator
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
- Name of the projection is UTM zone 31N (UTM has 60 zones, each 6-degrees of longitude in width). The underlying datum is WGS84.
- Axis Info: the CRS shows a Cartesian system with two axes, easting and northing, in meter units.
-
Area of Use: the projection is used for a
particular range of longitudes
0°E to 6°E
in the northern hemisphere (0.0°N to 84.0°N
) - Coordinate Operation: the operation to project the coordinates (if it is projected) onto a cartesian (x, y) plane. Transverse Mercator is accurate for areas with longitudinal widths of a few degrees, hence the distinct UTM zones.
-
Datum: Details about the datum, or the reference
point for coordinates.
WGS 84
andNAD 1983
are common datums.NAD 1983
is set to be replaced in 2022.
Note that the zone is unique to the UTM projection. Not all CRSs will have a zone. Below is a simplified view of US UTM zones.
Calculate Raster Statistics
It is useful to know the minimum or maximum values of a raster
dataset. We can compute these and other descriptive statistics with
min
, max
, mean
, and
std
.
PYTHON
print(raster_ams_b9.min())
print(raster_ams_b9.max())
print(raster_ams_b9.mean())
print(raster_ams_b9.std())
OUTPUT
<xarray.DataArray ()>
array(0, dtype=uint16)
Coordinates:
spatial_ref int64 0
<xarray.DataArray ()>
array(15497, dtype=uint16)
Coordinates:
spatial_ref int64 0
<xarray.DataArray ()>
array(1652.44009944)
Coordinates:
spatial_ref int64 0
<xarray.DataArray ()>
array(2049.16447495)
Coordinates:
spatial_ref int64 0
The information above includes a report of the min, max, mean, and
standard deviation values, along with the data type. If we want to see
specific quantiles, we can use xarray’s .quantile()
method.
For example for the 25% and 75% quantiles:
OUTPUT
<xarray.DataArray (quantile: 2)>
array([ 0., 2911.])
Coordinates:
* quantile (quantile) float64 0.25 0.75
Data Tip - NumPy methods
You could also get each of these values one by one using
numpy
.
PYTHON
import numpy
print(numpy.percentile(raster_ams_b9, 25))
print(numpy.percentile(raster_ams_b9, 75))
OUTPUT
0.0
2911.0
You may notice that raster_ams_b9.quantile
and
numpy.percentile
didn’t require an argument specifying the
axis or dimension along which to compute the quantile. This is because
axis=None
is the default for most numpy functions, and
therefore dim=None
is the default for most xarray methods.
It’s always good to check out the docs on a function to see what the
default arguments are, particularly when working with multi-dimensional
image data. To do so, we can
usehelp(raster_ams_b9.quantile)
or
?raster_ams_b9.percentile
if you are using jupyter notebook
or jupyter lab.
Dealing with Missing Data
So far, we have visualized a band of a Sentinel-2 scene and
calculated its statistics. However, we need to take missing data into
account. Raster data often has a “no data value” associated with it and
for raster datasets read in by rioxarray
. This value is
referred to as nodata
. This is a value assigned to pixels
where data is missing or no data were collected. There can be different
cases that cause missing data, and it’s common for other values in a
raster to represent different cases. The most common example is missing
data at the edges of rasters.
By default the shape of a raster is always rectangular. So if we have a dataset that has a shape that isn’t rectangular, some pixels at the edge of the raster will have no data values. This often happens when the data were collected by a sensor which only flew over some part of a defined region.
As we have seen above, the nodata
value of this dataset
(raster_ams_b9.rio.nodata
) is 0. When we have plotted the
band data, or calculated statistics, the missing value was not
distinguished from other values. Missing data may cause some unexpected
results. For example, the 25th percentile we just calculated was 0,
probably reflecting the presence of a lot of missing data in the
raster.
To distinguish missing data from real data, one possible way is to
use nan
to represent them. This can be done by specifying
masked=True
when loading the raster:
One can also use the where
function to select all the
pixels which are different from the nodata
value of the
raster:
Either way will change the nodata
value from 0 to
nan
. Now if we compute the statistics again, the missing
data will not be considered:
print(raster_ams_b9.min())
print(raster_ams_b9.max())
print(raster_ams_b9.mean())
print(raster_ams_b9.std())
```python
```output
<xarray.DataArray ()>
array(8., dtype=float32)
Coordinates:
spatial_ref int64 0
<xarray.DataArray ()>
array(15497., dtype=float32)
Coordinates:
spatial_ref int64 0
<xarray.DataArray ()>
array(2477.405, dtype=float32)
Coordinates:
spatial_ref int64 0
<xarray.DataArray ()>
array(2061.9539, dtype=float32)
Coordinates:
spatial_ref int64 0
And if we plot the image, the nodata
pixels are not
shown because they are not 0 anymore:
One should notice that there is a side effect of using
nan
instead of 0
to represent the missing
data: the data type of the DataArray
was changed from
integers to float. This need to be taken into consideration when the
data type matters in your application.
Raster Bands
So far we looked into a single band raster, i.e. the
nir09
band of a Sentinel-2 scene. However, to get a
smaller, non georeferenced version of the scene, one may also want to
visualize the true-color overview of the region. This is provided as a
multi-band raster – a raster dataset that contains more than one
band.
The overview
asset in the Sentinel-2 scene is a
multiband asset. Similar to nir09
, we can load it by:
PYTHON
raster_ams_overview = rioxarray.open_rasterio(items[0].assets['visual'].href, overview_level=3)
raster_ams_overview
OUTPUT
<xarray.DataArray (band: 3, y: 687, x: 687)>
[1415907 values with dtype=uint8]
Coordinates:
* band (band) int64 1 2 3
* x (x) float64 6.001e+05 6.002e+05 ... 7.096e+05 7.097e+05
* y (y) float64 5.9e+06 5.9e+06 5.9e+06 ... 5.79e+06 5.79e+06
spatial_ref int64 0
Attributes:
AREA_OR_POINT: Area
OVR_RESAMPLING_ALG: AVERAGE
_FillValue: 0
scale_factor: 1.0
add_offset: 0.0
The band number comes first when GeoTiffs are read with the
.open_rasterio()
function. As we can see in the
xarray.DataArray
object, the shape is now
(band: 3, y: 687, x: 687)
, with three bands in the
band
dimension. It’s always a good idea to examine the
shape of the raster array you are working with and make sure it’s what
you expect. Many functions, especially the ones that plot images, expect
a raster array to have a particular shape. One can also check the shape
using the .shape
attribute:
OUTPUT
(3, 687, 687)
One can visualize the multi-band data with the
DataArray.plot.imshow()
function:
Note that the DataArray.plot.imshow()
function makes
assumptions about the shape of the input DataArray, that since it has
three channels, the correct colormap for these channels is RGB. It does
not work directly on image arrays with more than 3 channels. One can
replace one of the RGB channels with another band, to make a false-color
image.
Exercise: set the plotting aspect ratio
As seen in the figure above, the true-color image is stretched. Let’s
visualize it with the right aspect ratio. You can use the documentation
of DataArray.plot.imshow()
.
Since we know the height/width ratio is 1:1 (check the
rio.height
and rio.width
attributes), we can
set the aspect ratio to be 1. For example, we can choose the size to be
5 inches, and set aspect=1
. Note that according to the documentation
of DataArray.plot.imshow()
, when specifying the
aspect
argument, size
also needs to be
provided.
Key Points
-
rioxarray
andxarray
are for working with multidimensional arrays like pandas is for working with tabular data. -
rioxarray
stores CRS information as a CRS object that can be converted to an EPSG code or PROJ4 string. - Missing raster data are filled with nodata values, which should be handled with care for statistics and visualization.
Content from Vector data in Python
Last updated on 2023-08-14 | Edit this page
Overview
Questions
- How can I read, inspect, and process spatial objects, such as points, lines, and polygons?
Objectives
- Load spatial objects.
- Select the spatial objects within a bounding box.
- Perform a CRS conversion of spatial objects.
- Select features of spatial objects.
- Match objects in two datasets based on their spatial relationships.
Introduction
As discussed in Episode 2: Introduction to Vector Data, vector data represents specific features on the Earth’s surface using points, lines, and polygons. These geographic elements can then have one or more attributes assigned to them, such as ‘name’ and ‘population’ for a city, or crop type for a field. Vector data can be much smaller in (file) size than raster data, while being very rich in terms of the information captured.
In this episode, we will be moving from working with raster data to
working with vector data. We will use Python to open and plot point,
line, and polygon vector data. In particular, we will make use of the geopandas
package to open, manipulate and write vector datasets.
geopandas
extends the popular pandas
library for data analysis to geospatial applications. The main
pandas
objects (the Series
and the
DataFrame
) are expanded to geopandas
objects
(GeoSeries
and GeoDataFrame
). This extension
is implemented by including geometric types, represented in Python using
the shapely
library, and by providing dedicated methods for
spatial operations (union, intersection, etc.). The relationship between
Series
, DataFrame
, GeoSeries
and
GeoDataFrame
can be briefly explained as follow:
- A
Series
is a one-dimensional array with axis, holding any data type (integers, strings, floating-point numbers, Python objects, etc.) - A
DataFrame
is a two-dimensional labeled data structure with columns of potentially different types1. - A
GeoSeries
is aSeries
object designed to store shapely geometry objects. - A
GeoDataFrame
is an extenedpandas.DataFrame
, which has a column with geometry objects, and this column is aGeoSeries
.
In later episodes, we will learn how to work with raster and vector data together and combine them into a single plot.
Introduce the Vector Data
In this episode, we will use the downloaded vector data in the
data
directory. Please refer to the setup page on how to download the data.
Import Vector Datasets
We will use the geopandas
package to load the crop field
vector data we downloaded at:
data/brpgewaspercelen_definitief_2020_small.gpkg
.
The data are read into the variable fields
as a
GeoDataFrame
. This is an extened data format of
pandas.DataFrame
, with an extra column
geometry
.
This file contains a relatively large number of crop field parcels. Directly loading a large file to memory can be slow. If the Area of Interest (AoI) is small, we can define a bounding box of the AoI, and only read the data within the extent of the bounding box.
PYTHON
# Define bounding box
xmin, xmax = (110_000, 140_000)
ymin, ymax = (470_000, 510_000)
bbox = (xmin, ymin, xmax, ymax)
Using the bbox
input argument, we can load only the
spatial features intersecting the provided bounding box.
PYTHON
# Partially load data within the bounding box
fields = gpd.read_file("data/brpgewaspercelen_definitief_2020_small.gpkg", bbox=bbox)
How should I define my bounding box?
For simplicity, here we assume the Coordinate Reference System (CRS) and extent of the vector file are known (for instance they are provided in the dataset documentation).
You can also define your bounding box with online coordinates visualization tools. For example, we can use the “Draw Rectangular Polygon” tool in geojson.io.
Some Python tools, e.g. fiona
(which
is also the backend of geopandas
), provide the file
inspection functionality without the need to read the full data set into
memory. An example can be found in the
documentation of fiona.
And we can plot the overview by:
Vector Metadata & Attributes
When we read the vector dataset with Python (as our
fields
variable) it is loaded as a
GeoDataFrame
object. The read_file()
function
also automatically stores geospatial information about the data. We are
particularly interested in describing the format, CRS, extent, and other
components of the vector data, and the attributes which describe
properties associated with each vector object.
We will explore
- Object Type: the class of the imported object.
- Coordinate Reference System (CRS): the projection of the data.
- Extent: the spatial extent (i.e. geographic area that the data covers). Note that the spatial extent for a vector dataset represents the combined extent for all spatial objects in the dataset.
Each GeoDataFrame
has a "geometry"
column
that contains geometries. In the case of our fields
object,
this geometry is represented by a shapely.geometry.Polygon
object. geopandas
uses the shapely
library to
represent polygons, lines, and points, so the types are inherited from
shapely
.
We can view the metadata using the .crs
,
.bounds
and .type
attributes. First, let’s
view the geometry type for our crop field dataset. To view the geometry
type, we use the pandas
method .type
on the
GeoDataFrame
object, fields
.
OUTPUT
0 Polygon
1 Polygon
2 Polygon
3 Polygon
4 Polygon
...
22026 Polygon
22027 Polygon
22028 Polygon
22029 Polygon
22030 Polygon
Length: 22031, dtype: object
To view the CRS metadata:
OUTPUT
<Derived Projected CRS: EPSG:28992>
Name: Amersfoort / RD New
Axis Info [cartesian]:
- X[east]: Easting (metre)
- Y[north]: Northing (metre)
Area of Use:
- name: Netherlands - onshore, including Waddenzee, Dutch Wadden Islands and 12-mile offshore coastal zone.
- bounds: (3.2, 50.75, 7.22, 53.7)
Coordinate Operation:
- name: RD New
- method: Oblique Stereographic
Datum: Amersfoort
- Ellipsoid: Bessel 1841
- Prime Meridian: Greenwich
Our data is in the CRS RD New. The CRS is critical
to interpreting the object’s extent values as it specifies units. To
find the extent of our dataset in the projected coordinates, we can use
the .total_bounds
attribute:
OUTPUT
array([109222.03325 , 469461.512625, 140295.122125, 510939.997875])
This array contains, in order, the values for minx, miny, maxx and maxy, for the overall dataset. The spatial extent of a GeoDataFrame represents the geographic “edge” or location that is the furthest north, south, east, and west. Thus, it represents the overall geographic coverage of the spatial object.
We can convert these coordinates to a bounding box or acquire the index of the Dataframe to access the geometry. Either of these polygons can be used to clip rasters (more on that later).
Further crop the dataset
We might realize that the loaded dataset is still too large. If we
want to refine our area of interest to an even smaller extent, without
reloading the data, we can use the cx
indexer:
Export data to file
We will save the cropped results to a shapefile (.shp
)
and use it later. The to_file
function can be used for
exportation:
This will write it to disk (in this case, in ‘shapefile’ format),
containing only the data from our cropped area. It can be read again at
a later time using the read_file()
method we have been
using above. Note that this actually writes multiple files to disk
(fields_cropped.cpg
, fields_cropped.dbf
,
fields_cropped.prj
, fields_cropped.shp
,
fields_cropped.shx
). All these files should ideally be
present in order to re-read the dataset later, although only the
.shp
, .shx
, and .dbf
files are
mandatory (see the Introduction to
Vector Data lesson for more information.)
Selecting spatial features
From now on, we will take in a point dataset
brogmwvolledigeset.zip
, which is the underground water
monitoring wells. We will perform vector processing on this dataset,
together with the cropped field polygons
fields_cropped.shp
.
Let’s read the two datasets.
PYTHON
fields = gpd.read_file("fields_cropped.shp")
wells = gpd.read_file("data/brogmwvolledigeset.zip")
And take a look at the wells:
The points represents all the wells over the Netherlands. Since the wells are in the lat/lon coordinates. To make it comparable with fields, we need to first transfer the CRS to the “RD New” projection:
Now we would like to compare the wells with the cropped fields. We
can select the wells within the fields using the .clip
function:
OUTPUT
bro_id delivery_accountable_party quality_regime ...
40744 GMW000000043703 27364178 IMBRO/A ...
38728 GMW000000045818 27364178 IMBRO/A ...
... ... ... ... ...
40174 GMW000000043963 27364178 IMBRO/A ...
19445 GMW000000024992 50200097 IMBRO/A ...
[79 rows x 40 columns]
After this selection, all the wells outside the fields are dropped. This takes a while to execute, because we are clipping a relatively large number of points with many polygons.
If we do not want a precise clipping, but rather have the points in the neighborhood of the fields, we will need to create another polygon, which is slightly bigger than the coverage of the field. To do this, we can increase the size of the field polygons, to make them overlap with each other, and then merge the overlapping polygons together.
We will first use the buffer
function to increase field
size with a given distance
. The unit of the
distance
argument is the same as the CRS. Here we use a
50-meter buffer. Also notice that the .buffer
function
produces a GeoSeries
, so to keep the other columns, we
assign it to the GeoDataFrame
as a geometry column.
PYTHON
buffer = fields.buffer(50)
fields_buffer = fields.copy()
fields_buffer['geometry'] = buffer
fields_buffer.plot()
To further simplify them, we can use the dissolve
function to dissolve the buffers into one:
All the fields will be dissolved into one multi-polygon, which can be
used to clip
the wells.
In this way, we selected all wells within the 50m range of the
fields. It is also significantly faster than the previous
clip
operation, since the number of polygons is much
smaller after dissolve
.
Exercise: clip fields within 500m from the wells
This time, we will do a selection the other way around. Can you clip the field polygons (stored in fields_cropped.shp) with the 500m buffer of the wells (stored in brogmwvolledigeset.zip)? Please visualize the results.
Hint 1: The file
brogmwvolledigeset.zip
is in CRS 4326. Don’t forget the CRS conversion.Hint 2:
brogmwvolledigeset.zip
contains all the wells in the Netherlands, which means it might be too big for the.buffer()
function. To improve the performance, first crop it with the bounding box of the fields.
PYTHON
# Read in data
fields = gpd.read_file("fields_cropped.shp")
wells = gpd.read_file("data/brogmwvolledigeset.zip")
# Crop points with bounding box
xmin, ymin, xmax, ymax = fields.total_bounds
wells = wells.to_crs(28992)
wells_cx = wells.cx[xmin-500:xmax+500, ymin-500:ymax+500]
# Create buffer
wells_cx_500mbuffer = wells_cx.copy()
wells_cx_500mbuffer['geometry'] = wells_cx.buffer(500)
# Clip
fields_clip_buffer = fields.clip(wells_cx_500mbuffer)
fields_clip_buffer.plot()
Spatially join the features
In the exercise, we clipped the fields polygons with the 500m buffers
of wells. The results from this clipping changed the shape of the
polygons. If we would like to keep the original shape of the fields, one
way is to use the sjoin
function, which join two
GeoDataFrame
’s on the basis of their spatial
relationship:
PYTHON
# Join fields and wells_cx_500mbuffer
fields_wells_buffer = fields.sjoin(wells_cx_500mbuffer)
print(fields_wells_buffer.shape)
OUTPUT
(11420, 46)
This will result in a GeodataFrame
of all possible
combinations of polygons and well buffers intersecting each other. Since
a polygon can fall into multiple buffers, there will be duplicated field
indexes in the results. To select the fields which intersects the well
buffers, we can first get the unique indexes, and use the
iloc
indexer to select:
Modify the geometry of a GeoDataFrame
Exercise: Investigate the waterway lines
Now we will take a deeper look at the Dutch waterway lines:
waterways_nl
. Let’s load the file
status_vaarweg.zip
, and visualize it with the
plot()
function. Can you tell what is wrong with this
vector file?
Axis ordering
According to the standards, the axis ordering for a CRS should follow the definition provided by the competent authority. For the commonly used EPSG:4326 geographic coordinate system, the EPSG defines the ordering as first latitude then longitude. However, in the GIS world, it is custom to work with coordinate tuples where the first component is aligned with the east/west direction and the second component is aligned with the north/south direction. Multiple software packages thus implement this convention also when dealing with EPSG:4326. As a result, one can encounter vector files that implement either convention - keep this in mind and always check your datasets!
Sometimes we need to modify the geometry
of a
GeoDataFrame
. For example, as we have seen in the previous
exercise Investigate the waterway lines, the latitude
and longitude are flipped in the vector data waterways_nl
.
This error needs to be fixed before performing further analysis.
Let’s first take a look on what makes up the geometry
column of waterways_nl
:
OUTPUT
0 LINESTRING (52.41810 4.84060, 52.42070 4.84090...
1 LINESTRING (52.11910 4.67450, 52.11930 4.67340...
2 LINESTRING (52.10090 4.25730, 52.10390 4.25530...
3 LINESTRING (53.47250 6.84550, 53.47740 6.83840...
4 LINESTRING (52.32270 5.14300, 52.32100 5.14640...
...
86 LINESTRING (51.49270 5.39100, 51.48050 5.39160...
87 LINESTRING (52.15900 5.38510, 52.16010 5.38340...
88 LINESTRING (51.97340 4.12420, 51.97110 4.12220...
89 LINESTRING (52.11910 4.67450, 52.11850 4.67430...
90 LINESTRING (51.88940 4.61900, 51.89040 4.61350...
Name: geometry, Length: 91, dtype: geometry
Each row is a LINESTRING
object. We can further zoom
into one of the rows, for example, the third row:
OUTPUT
LINESTRING (52.100900002 4.25730000099998, 52.1039 4.25529999999998, 52.111299999 4.24929999900002, 52.1274 4.23449999799999)
<class 'shapely.geometry.linestring.LineString'>
As we can see in the output, the LINESTRING
object
contains a list of coordinates of the vertices. In our situation, we
would like to find a way to flip the x and y of every coordinates set. A
good way to look for the solution is to use the documentation
of the shapely
package, since we are seeking to modify the
LINESTRING
object. Here we are going to use the shapely.ops.transform
function, which applies a self-defined function to all coordinates of a
geometry.
PYTHON
import shapely
# Define a function flipping the x and y coordinate values
def flip(geometry):
return shapely.ops.transform(lambda x, y: (y, x), geometry)
# Apply this function to all coordinates and all lines
geom_corrected = waterways_nl['geometry'].apply(flip)
Then we can update the geometry
column with the
corrected geometry geom_corrected
, and visualize it to
check:
PYTHON
# Update geometry
waterways_nl['geometry'] = geom_corrected
# Visualization
waterways_nl.plot()
Now the waterways look good! We can save the vector data for later usage:
Key Points
- Load spatial objects into Python with
geopandas.read_file()
function. - Spatial objects can be plotted directly with
GeoDataFrame
’s.plot()
method. - Crop spatial objects with
.cx[]
indexer. - Convert CRS of spatial objects with
.to_crs()
. - Select spatial features with
.clip()
. - Create a buffer of spatial objects with
.buffer()
. - Merge overlapping spatial objects with
.dissolve()
. - Join spatial features spatially with
.sjoin()
.
Content from Crop raster data with rioxarray and geopandas
Last updated on 2023-08-14 | Edit this page
Overview
Questions
- How can I crop my raster data to the area of interest?
Objectives
- Crop raster data with a bounding box.
- Crop raster data with a polygon.
- Match two raster datasets in different CRS.
It is quite common that the raster data you have in hand is too large to process, or not all the pixels are relevant to your area of interest (AoI). In both situations, you should consider cropping your raster data before performing data analysis.
In this episode, we will introduce how to crop raster data into the desired area. We will use one Sentinel-2 image over Amsterdam as the example raster data, and introduce how to crop your data to different types of AoIs.
Introduce the Data
We will use the results of the satellite image search:
search.json
, which is generated in an exercise from Episode 5: Access satellite imagery using
Python.
If you would like to work with the data for this lesson without
downloading data on-the-fly, you can download the raster data using this
link. Save
the geospatial-python-raster-dataset.tar.gz
file in your
current working directory, and extract the archive file by
double-clicking on it or by running the following command in your
terminal tar -zxvf geospatial-python-raster-dataset.tar.gz
.
Use the file geospatial-python-raster-dataset/search.json
(instead of search.json
) to get started with this
lesson.
We also use the cropped fields polygons
fields_cropped.shp
, which was generated in an exercise from
Episode 7: Vector data in
python.
Align the CRS of the raster and the vector data
We load a true color image using pystac
and
rioxarray
and check the shape of the raster:
PYTHON
import pystac
import rioxarray
# Load image and inspect the shape
items = pystac.ItemCollection.from_file("search.json")
raster = rioxarray.open_rasterio(items[1].assets["visual"].href) # Select a true color image
print(raster.shape)
OUTPUT
(3, 10980, 10980)
This will perform a “lazy” loading of the image, i.e. the image will not be loaded into the memory until necessary, but we can still access some attributes, e.g. the shape of the image.
The large size of the raster data makes it time and memory consuming to visualize in its entirety. Instead, we can fetch and plot the overviews of the raster. “Overviews” are precomputed lower resolution representations of a raster, stored in the same COG that contains the original raster.
PYTHON
# Get the overview asset
raster_overview = rioxarray.open_rasterio(items[1].assets["visual"].href, overview_level=3)
print(raster_overview.shape)
# Visualize it
raster_overview.plot.imshow(figsize=(8,8))
As we can see, the overview image is much smaller compared to the original true color image.
To align the raster and vector data, we first check each coordinate
system. For raster data, we use pyproj.CRS
:
OUTPUT
<Derived Projected CRS: EPSG:32631>
Name: WGS 84 / UTM zone 31N
Axis Info [cartesian]:
- [east]: Easting (metre)
- [north]: Northing (metre)
Area of Use:
- undefined
Coordinate Operation:
- name: UTM zone 31N
- method: Transverse Mercator
Datum: World Geodetic System 1984
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
To open and check the coordinate system of vector data, we use
geopandas
:
PYTHON
import geopandas as gpd
# Load the polygons of the crop fields
fields = gpd.read_file("fields_cropped.shp")
# Check the coordinate system
fields.crs
OUTPUT
<Derived Projected CRS: EPSG:28992>
Name: Amersfoort / RD New
Axis Info [cartesian]:
- X[east]: Easting (metre)
- Y[north]: Northing (metre)
Area of Use:
- name: Netherlands - onshore, including Waddenzee, Dutch Wadden Islands and 12-mile offshore coastal zone.
- bounds: (3.2, 50.75, 7.22, 53.7)
Coordinate Operation:
- name: RD New
- method: Oblique Stereographic
Datum: Amersfoort
- Ellipsoid: Bessel 1841
- Prime Meridian: Greenwich
As seen, the coordinate systems differ. To crop the raster using the
shapefile, we first need to reproject one dataset to the other’s CRS.
Since raster
is large, we will convert the CRS of
fields
to the CRS of raster
to avoid loading
the entire image:
Crop raster data with a bounding box
The clip_box
function allows one to crop a raster by the
min/max of the x and y coordinates. Note that we are cropping the
original image raster
now, and not the overview image
raster_overview
.
PYTHON
# Crop the raster with the bounding box
raster_clip_box = raster.rio.clip_box(*fields.total_bounds)
print(raster_clip_box.shape)
OUTPUT
(3, 1565, 1565)
We successfully cropped the raster to a much smaller piece. We can visualize it now:
This cropped image can be saved for later usage:
Crop raster data with polygons
We have a cropped image around the fields. To further analyze the
fields, one may want to crop the image to the exact field boundaries.
This can be done with the clip
function:
And we can visualize the results:
Exercise: crop raster data with a specific code
In the column “gewascode” (translated as “crop code”) of
fields
, you can find the code representing the types of
plants grown in each field. Can you:
- Select the fields with “gewascode” equal to
257
; - Crop the raster
raster_clip_box
with the selected fields; - Visualize the cropped image.
Crop raster data using reproject_match()
function
So far we have learned how to crop raster images with vector data. We
can also crop a raster with another raster data. In this section, we
will demonstrate how to crop the raster_clip_box
image
using the raster_clip_fields_gwascode
image. We will use
the reproject_match
function. As indicated by its name, it
performs reprojection and clipping in one go.
To demonstrate the reprojection, we will first reproject
raster_clip_fields_gwascode
to the RD CRS system, so it
will be in a different CRS from raster_clip_box
:
PYTHON
# Reproject to RD to make the CRS different from the "raster"
raster_clip_fields_gwascode = raster_clip_fields_gwascode.rio.reproject("EPSG:28992")
CRS(raster_clip_fields_gwascode.rio.crs)
OUTPUT
<Derived Projected CRS: EPSG:28992>
Name: Amersfoort / RD New
Axis Info [cartesian]:
- [east]: Easting (metre)
- [north]: Northing (metre)
Area of Use:
- undefined
Coordinate Operation:
- name: unnamed
- method: Oblique Stereographic
Datum: Amersfoort
- Ellipsoid: Bessel 1841
- Prime Meridian: Greenwich
And let’s check again the CRS of raster_clip_box
:
OUTPUT
<Derived Projected CRS: EPSG:32631>
Name: WGS 84 / UTM zone 31N
Axis Info [cartesian]:
- [east]: Easting (metre)
- [north]: Northing (metre)
Area of Use:
- undefined
Coordinate Operation:
- name: UTM zone 31N
- method: Transverse Mercator
Datum: World Geodetic System 1984
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
Now the two images are in different coordinate systems. We can use
rioxarray.reproject_match()
function to crop
raster_clip_box
image.
PYTHON
raster_reproject_match = raster_clip_box.rio.reproject_match(raster_clip_fields_gwascode)
raster_reproject_match.plot.imshow(figsize=(8,8))
We can also use it to expand raster_clip_fields_gwascode
to the extent of raster_clip_box
:
PYTHON
raster_reproject_match = raster_clip_fields_gwascode.rio.reproject_match(raster_clip_box)
raster_reproject_match.plot.imshow(figsize=(8,8))
In one line reproject_match
does a lot of helpful
things:
- It reprojects.
- It matches the extent using
nodata
values or by clipping the data. - It sets
nodata
values. This means we can run calculations on those two images.
Code Tip
As we saw before, there also exists a method called
reproject()
, which only reprojects one raster to another
projection. If you want more control over how rasters are resampled,
clipped, and/or reprojected, you can use the reproject()
method and other rioxarray
methods individually.
Key Points
- Use
clip_box
to crop a raster with a bounding box. - Use
clip
to crop a raster with a given polygon. - Use
reproject_match
to match two raster datasets.
Content from Raster Calculations in Python
Last updated on 2023-08-14 | Edit this page
Overview
Questions
- How do I perform calculations on rasters and extract pixel values for defined locations?
Objectives
- Carry out operations with two rasters using Python’s built-in math operators.
- Reclassify a continuous raster to a categorical raster.
Introduction
We often want to combine values of and perform calculations on rasters to create a new output raster. This episode covers how to perform basic math operations using raster datasets. It also illustrates how to match rasters with different resolutions so that they can be used in the same calculation. As an example, we will calculate a vegetation index over one of the satellite scenes.
Normalized Difference Vegetation Index (NDVI)
Suppose we are interested in monitoring vegetation fluctuations using satellite remote sensors. Scientists have defined a vegetation index to quantify the amount of green leaf vegetation using the light reflected in different wavelengths. This index, named Normalized Difference Vegetation Index (NDVI), exploits the fact that healthy green leaves strongly absorb red visible light while they mostly reflect light in the near infrared (NIR). The NDVI is computed as:
\[ NDVI = \frac{NIR - red}{NIR + red} \]
where \(NIR\) and \(red\) label the reflectance values of the corresponding wavelengths. NDVI values range from -1 to +1. Values close to one indicate high density of green leaves. Poorly vegetated areas typically have NDVI values close to zero. Negative NDVI values often indicate cloud and water bodies.
More Resources
Check out more on NDVI in the NASA Earth Observatory portal: Measuring Vegetation.
Load and crop the Data
For this episode, we will use one of the Sentinel-2 scenes that we have already employed in the previous episodes.
Introduce the Data
We’ll continue from the results of the satellite image search that we
have carried out in an exercise from a
previous episode. We will load data starting from the
search.json
file.
If you would like to work with the data for this lesson without
downloading data on-the-fly, you can download the raster data using this
link. Save
the geospatial-python-raster-dataset.tar.gz
file in your
current working directory, and extract the archive file by
double-clicking on it or by running the following command in your
terminal tar -zxvf geospatial-python-raster-dataset.tar.gz
.
Use the file geospatial-python-raster-dataset/search.json
(instead of search.json
) to get started with this
lesson.
Let’s load the results of our initial imagery search using
pystac
:
We then select the second item, and extract the URIs of the red and NIR bands (“red” and “nir08”, respectively):
Let’s load the rasters with open_rasterio
using the
argument masked=True
.
PYTHON
import rioxarray
red = rioxarray.open_rasterio(red_uri, masked=True)
nir = rioxarray.open_rasterio(nir_uri, masked=True)
Let’s also restrict our analysis to the same crop field area defined in the previous episode by clipping the rasters using a bounding box:
PYTHON
bbox = (629_000, 5_804_000, 639_000, 5_814_000)
red_clip = red.rio.clip_box(*bbox)
nir_clip = nir.rio.clip_box(*bbox)
We can now plot the two rasters. Using robust=True
color
values are stretched between the 2nd and 98th percentiles of the data,
which results in clearer distinctions between high and low
reflectances:
It is immediately evident how crop fields (rectangular shapes in the central part of the two figures) appear as dark and bright spots in the red-visible and NIR wavelengths, respectively, suggesting the presence of leafy crop at the time of observation (end of March). The same fields would instead appear as dark spots in the off season.
Raster Math
We can perform raster calculations by subtracting (or adding,
multiplying, etc.) two rasters. In the geospatial world, we call this
“raster math”, and typically it refers to operations on rasters that
have the same width and height (including nodata
pixels).
We can check the shapes of the two rasters in the following way:
OUTPUT
(1, 1000, 1000) (1, 500, 500)
Both rasters include a single band, but their width and height do not
match. We can now use the reproject_match
function, which
both reprojects and clips a raster to the CRS and extent of another
raster.
OUTPUT
(1, 500, 500)
Let’s now compute the NDVI as a new raster using the formula
presented above. We’ll use rioxarray
objects so that we can
easily plot our result and keep track of the metadata.
OUTPUT
<xarray.DataArray (band: 1, y: 500, x: 500)>
array([[[ 0.7379576 , 0.77153456, 0.54531944, ..., 0.39254385,
0.49227372, 0.4465174 ],
[ 0.7024894 , 0.7074668 , 0.3903298 , ..., 0.423283 ,
0.4706971 , 0.45964912],
[ 0.6557818 , 0.5610572 , 0.46742022, ..., 0.4510345 ,
0.43815723, 0.6005133 ],
...,
[ 0.02391171, 0.21843003, 0.02479339, ..., -0.50923485,
-0.53367877, -0.4955414 ],
[ 0.11376493, 0.17681159, -0.1673566 , ..., -0.5221932 ,
-0.5271318 , -0.4852753 ],
[ 0.45398772, -0.00518135, 0.03346133, ..., -0.5019455 ,
-0.4987013 , -0.49081364]]], dtype=float32)
Coordinates:
* band (band) int64 1
* x (x) float64 6.29e+05 6.29e+05 6.29e+05 ... 6.39e+05 6.39e+05
* y (y) float64 5.814e+06 5.814e+06 ... 5.804e+06 5.804e+06
spatial_ref int64 0
We can now plot the output NDVI:
Notice that the range of values for the output NDVI is between -1 and 1. Does this make sense for the selected region?
Maps are great, but it can also be informative to plot histograms of
values to better understand the distribution. We can accomplish this
using a built-in xarray method we have already been using:
plot
Exercise: Explore NDVI Raster Values
It’s often a good idea to explore the range of values in a raster dataset just like we might explore a dataset that we collected in the field. The histogram we just made is a good start but there’s more we can do to improve our understanding of the data.
- What is the min and maximum value for the NDVI raster
(
ndvi
) that we just created? Are there missing values? - Plot a histogram with 50 bins instead of 8. What do you notice that wasn’t clear before?
- Plot the
ndvi
raster using breaks that make sense for the data.
- Recall, if there were nodata values in our raster, we would need to
filter them out with
.where()
. Since we have loaded the rasters withmasked=True
, missing values are already encoded asnp.nan
’s. Thendvi
array actually includes a single missing value.
OUTPUT
-0.99864775
0.9995788
1
- Increasing the number of bins gives us a much clearer view of the distribution. Also, there seem to be very few NDVI values larger than ~0.9.
- We can discretize the color bar by specifying the intervals via the
levels
argument toplot()
. Suppose we want to bin our data in the following intervals:
- \(-1 \le NDVI \lt 0\) for water;
- \(0 \le NDVI \lt 0.2\) for no vegetation;
- \(0.2 \le NDVI \lt 0.7\) for sparse vegetation;
- \(0.7 \le NDVI \lt 1\) for dense vegetation.
Missing values can be interpolated from the values of neighbouring
grid cells using the .interpolate_na
method. We then save
ndvi
as a GeoTiff file:
Classifying Continuous Rasters in Python
Now that we have a sense of the distribution of our NDVI raster, we
can reduce the complexity of our map by classifying it. Classification
involves assigning each pixel in the raster to a class based on its
value. In Python, we can accomplish this using the
numpy.digitize
function.
First, we define NDVI classes based on a list of values, as defined
in the last exercise: [-1, 0., 0.2, 0.7, 1]
. When bins are
ordered from low to high, as here, numpy.digitize
assigns
classes like so:
Note that, by default, each class includes the left but not the right bound. This is not an issue here, since the computed range of NDVI values is fully contained in the open interval (-1; 1) (see exercise above).
PYTHON
import numpy as np
import xarray
# Defines the bins for pixel values
class_bins = (-1, 0., 0.2, 0.7, 1)
# The numpy.digitize function returns an unlabeled array, in this case, a
# classified array without any metadata. That doesn't work--we need the
# coordinates and other spatial metadata. We can get around this using
# xarray.apply_ufunc, which can run the function across the data array while
# preserving metadata.
ndvi_classified = xarray.apply_ufunc(
np.digitize,
ndvi_nonan,
class_bins
)
Let’s now visualize the classified NDVI, customizing the plot with proper title and legend. We then export the figure in PNG format:
PYTHON
import earthpy.plot as ep
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
# Define color map of the map legend
ndvi_colors = ["blue", "gray", "green", "darkgreen"]
ndvi_cmap = ListedColormap(ndvi_colors)
# Define class names for the legend
category_names = [
"Water",
"No Vegetation",
"Sparse Vegetation",
"Dense Vegetation"
]
# We need to know in what order the legend items should be arranged
category_indices = list(range(len(category_names)))
# Make the plot
im = ndvi_classified.plot(cmap=ndvi_cmap, add_colorbar=False)
plt.title("Classified NDVI")
# earthpy helps us by drawing a legend given an existing image plot and legend items, plus indices
ep.draw_legend(im_ax=im, classes=category_indices, titles=category_names)
# Save the figure
plt.savefig("NDVI_classified.png", bbox_inches="tight", dpi=300)
We can finally export the classified NDVI raster object to a GeoTiff
file. The to_raster()
function by default writes the output
file to your working directory unless you specify a full file path.
Exercise: Compute the NDVI for the Texel island
Data are often more interesting and powerful when we compare them across various locations. Let’s compare the computed NDVI map with the one of another region in the same Sentinel-2 scene: the Texel island, located in the North Sea.
- You should have the red- and the NIR-band rasters already loaded
(
red
andnir
variables, respectively). - Crop the two rasters using the following bounding box:
(610000, 5870000, 630000, 5900000)
. Don’t forget to check the shape of the data, and make sure the cropped areas have the same CRSs, heights and widths. - Compute the NDVI from the two raster layers and check the max/min values to make sure the data is what you expect.
- Plot the NDVI map and export the NDVI as a GeoTiff.
- Compare the distributions of NDVI values for the two regions investigated.
- We crop the area of interest using
clip_box
:
PYTHON
bbox_texel = (610000, 5870000, 630000, 5900000)
nir_texel = nir.rio.clip_box(*bbox_texel)
red_texel = red.rio.clip_box(*bbox_texel)
- Reproject and clip one raster to the extent of the smaller raster
using
reproject_match
. The lines of code below assign a variable to the reprojected raster and calculate the NDVI.
PYTHON
red_texel_matched = red_texel.rio.reproject_match(nir_texel)
ndvi_texel = (nir_texel - red_texel_matched)/ (nir_texel + red_texel_matched)
- Plot the NDVI and save the raster data as a GeoTIFF file.
- Compute the NDVI histogram and compare it with the region that we have previously investigated. Many more grid cells have negative NDVI values, since the area of interest includes much more water. Also, NDVI values close to zero are more abundant, indicating the presence of bare ground (sand) regions.
Key Points
- Python’s built-in math operators are fast and simple options for raster math.
- numpy.digitize can be used to classify raster values in order to generate a less complicated map.
Content from Calculating Zonal Statistics on Rasters
Last updated on 2023-08-14 | Edit this page
Overview
Questions
- How to compute raster statistics on different zones delineated by vector data?
Objectives
- Extract zones from the vector dataset
- Convert vector data to raster
- Calculate raster statistics over zones
Introduction
Statistics on predefined zones of the raster data are commonly used for analysis and to better understand the data. These zones are often provided within a single vector dataset, identified by certain vector attributes. For example, in the previous episodes, we used the crop field polygon dataset. The fields with the same crop type can be identified as a “zone”, resulting in multiple zones in one vector dataset. One may be interested in performing statistical analysis over these crop zones.
In this episode, we will explore how to calculate zonal statistics
based on the types of crops in fields_cropped.shp
. To do
this, we will first identify zones from the vector data, then rasterize
these vector zones. Finally the zonal statistics for ndvi
will be calculated over the rasterized zones.
Making vector and raster data compatible
First, let’s load the NDVI.tif
file saved in the
previous episode to obtained our calculated raster ndvi
data. We also use the squeeze()
function in order to reduce
our raster data ndvi
dimension to 2D by removing the
singular band
dimension - this is necessary for use with
the rasterize
and zonal_stats
functions:
Let’s also read the crop fields vector data from our saved
fields_cropped.shp
file.
In order to use the vector data as a classifier for our raster, we
need to convert the vector data to the appropriate CRS. We can perform
the CRS conversion from the vector CRS (EPSG:28992) to our raster
ndvi
CRS (EPSG:32631) with:
Rasterizing the vector data
Before calculating zonal statistics, we first need to rasterize our
fields_utm
vector geodataframe with the
rasterio.features.rasterize
function. With this function,
we aim to produce a grid with numerical values representing the types of
crops as defined by the column gewascode
from
field_cropped
- gewascode
stands for the crop
codes as defined by the Netherlands Enterprise Agency (RVO) for
different types of crop or gewas
(Grassland, permanent;
Grassland, temporary; corn fields; etc.). This grid of values thus
defines the zones for the xrspatial.zonal_stats
function,
where each pixel in the zone grid overlaps with a corresponding pixel in
our NDVI raster.
We can generate the geometry, gewascode
pairs for each
vector feature to be used as the first argument to
rasterio.features.rasterize
as:
OUTPUT
[[<shapely.geometry.polygon.Polygon at 0x7ff88666f670>, 265],
[<shapely.geometry.polygon.Polygon at 0x7ff86bf39280>, 265],
[<shapely.geometry.polygon.Polygon at 0x7ff86ba1db80>, 265],
[<shapely.geometry.polygon.Polygon at 0x7ff86ba1d730>, 265],
[<shapely.geometry.polygon.Polygon at 0x7ff86ba1d400>, 265],
[<shapely.geometry.polygon.Polygon at 0x7ff86ba1d130>, 265],
...
[<shapely.geometry.polygon.Polygon at 0x7ff88685c970>, 265],
[<shapely.geometry.polygon.Polygon at 0x7ff88685c9a0>, 265],
[<shapely.geometry.polygon.Polygon at 0x7ff88685c9d0>, 265],
[<shapely.geometry.polygon.Polygon at 0x7ff88685ca00>, 331],
...]
This generates a list of the shapely geometries from the
geometry
column, and the unique field ID from the
gewascode
column in the fields_utm
geodataframe.
We can now rasterize our vector data using
rasterio.features.rasterize
:
PYTHON
from rasterio import features
fields_rasterized = features.rasterize(geom, out_shape=ndvi.shape, transform=ndvi.rio.transform())
The argument out_shape
specifies the shape of the output
grid in pixel units, while transform
represents the
projection from pixel space to the projected coordinate space. By
default, the pixels that are not contained within a polygon in our
shapefile will be filled with 0. It’s important to pick a fill value
that is not the same as any value already defined in
gewascode
or else we won’t distinguish between this zone
and the background.
Let’s inspect the results of rasterization:
OUTPUT
(500, 500)
[ 0 259 265 266 331 332 335 863]
The output fields_rasterized
is an
np.ndarray
with the same shape as ndvi
. It
contains gewascode
values at the location of fields, and 0
outside the fields. Let’s visualize it:
We will convert the output to xarray.DataArray
which
will be used further. To do this, we will “borrow” the coordinates from
ndvi
, and fill in the rasterization data:
Calculate zonal statistics
In order to calculate the statistics for each crop zone, we call the
function, xrspatial.zonal_stats
. The
xrspatial.zonal_stats
function takes as input
zones
, a 2D xarray.DataArray
, that defines
different zones, and values
, a 2D
xarray.DataArray
providing input values for calculating
statistics.
We call the zonal_stats
function with
fields_rasterized_xarr
as our classifier and the 2D raster
with our values of interest ndvi
to obtain the NDVI
statistics for each crop type:
OUTPUT
zone mean max min sum std var count
0 0 0.266531 0.999579 -0.998648 38887.648438 0.409970 0.168075 145903.0
1 259 0.520282 0.885242 0.289196 449.003052 0.111205 0.012366 863.0
2 265 0.775609 0.925955 0.060755 66478.976562 0.091089 0.008297 85712.0
3 266 0.794128 0.918048 0.544686 1037.925781 0.074009 0.005477 1307.0
4 331 0.703056 0.905304 0.142226 10725.819336 0.102255 0.010456 15256.0
5 332 0.681699 0.849158 0.178113 321.080261 0.123633 0.015285 471.0
6 335 0.648063 0.865804 0.239661 313.662598 0.146582 0.021486 484.0
7 863 0.388575 0.510572 0.185987 1.165724 0.144245 0.020807 3.0
The zonal_stats
function calculates the minimum,
maximum, and sum for each zone along with statistical measures such as
the mean, variance and standard deviation for each rasterized vector
zone. In our raster dataset zone = 0
, corresponding to
non-crop areas, has the highest count followed by
zone = 265
which corresponds to ‘Grasland, blijvend’ or
‘Grassland, permanent’. The highest mean NDVI is observed for
zone = 266
for ‘Grasslands, temporary’ with the lowest
mean, aside from non-crop area, going to zone = 863
representing ‘Forest without replanting obligation’. Thus, the
zonal_stats
function can be used to analyze and understand
different sections of our raster data. The definition of the zones can
be derived from vector data or from classified raster data as presented
in the challenge below:
Exercise: Calculate zonal statistics for zones
defined by ndvi_classified
Let’s calculate NDVI zonal statistics for the different zones as
classified by ndvi_classified
in the previous episode.
Load both raster datasets: NDVI.tif
and
NDVI_classified.tif
. Then, calculate zonal statistics for
each class_bins
. Inspect the output of the
zonal_stats
function.
- Load and convert raster data into suitable inputs for
zonal_stats
:
PYTHON
ndvi = rioxarray.open_rasterio("NDVI.tif").squeeze()
ndvi_classified = rioxarray.open_rasterio("NDVI_classified.tif").squeeze()
- Create and display the zonal statistics table.
OUTPUT
zone mean max min sum std var count
0 1 -0.355660 -0.000257 -0.998648 -12838.253906 0.145916 0.021291 36097.0
1 2 0.110731 0.199839 0.000000 1754.752441 0.055864 0.003121 15847.0
2 3 0.507998 0.700000 0.200000 50410.167969 0.140193 0.019654 99233.0
3 4 0.798281 0.999579 0.700025 78888.523438 0.051730 0.002676 98823.0
Key Points
- Zones can be extracted by attribute columns of a vector dataset
- Zones can be rasterized using
rasterio.features.rasterize
- Calculate zonal statistics with
xrspatial.zonal_stats
over the rasterized zones.
Content from Parallel raster computations using Dask
Last updated on 2023-08-14 | Edit this page
Overview
Questions
- How can I parallelize computations on rasters with Dask?
- How can I determine if parallelization improves calculation speed?
- What are good practices in applying parallelization to my raster calculations?
Objectives
- Profile the timing of the raster calculations.
- Open raster data as a chunked array.
- Recognize good practices in selecting proper chunk sizes.
- Setup raster calculations that take advantage of parallelization.
Introduction
Very often raster computations involve applying the same operation to different pieces of data. Think, for instance, to the “pixel”-wise sum of two raster datasets, where the same sum operation is applied to all the matching grid-cells of the two rasters. This class of tasks can benefit from chunking the input raster(s) into smaller pieces: operations on different pieces can be run in parallel using multiple computing units (e.g., multi-core CPUs), thus potentially speeding up calculations. In addition, working on chunked data can lead to smaller memory footprints, since one may bypass the need to store the full dataset in memory by processing it chunk by chunk.
In this episode, we will introduce the use of Dask in the context of
raster calculations. Dask is a Python library for parallel and
distributed computing. It provides a framework to work with different
data structures, including chunked arrays (Dask Arrays). Dask is well
integrated with (rio
)xarray
, which can use
Dask arrays as underlying data structures.
Dask
This episode shows how Dask can be used to parallelize operations on local CPUs. However, the same library can be configured to run tasks on large compute clusters.
More resources on Dask: - Dask and Dask Array. - Xarray with Dask.
It is important to realize, however, that many details determine the extent to which using Dask’s chunked arrays instead of regular Numpy arrays leads to faster calculations (and lower memory requirements). The actual operations to carry out, the size of the dataset, and parameters such as the chunks’ shape and size, all affects the performance of our computations. Depending on the specifics of the calculations, serial calculations might actually turn out to be faster! Being able to profile the computational time is thus essential, and we will see how to do that in a Jupyter environment in the next section.
The example that we consider here is the application of a median filter to a satellite image. Median filtering is a common noise removal technique which replaces a pixel’s value with the median value computed from its surrounding pixels.
Time profiling in Jupyter
Introduce the Data
We’ll continue from the results of the satellite image search that we
have carried out in an exercise from a
previous episode. We will load data starting from the
search.json
file.
If you would like to work with the data for this lesson without
downloading data on-the-fly, you can download the raster data using this
link. Save
the geospatial-python-raster-dataset.tar.gz
file in your
current working directory, and extract the archive file by
double-clicking on it or by running the following command in your
terminal tar -zxvf geospatial-python-raster-dataset.tar.gz
.
Use the file geospatial-python-raster-dataset/search.json
(instead of search.json
) to get started with this
lesson.
Let’s set up a raster calculation using assets from our previous
search of satellite scenes. We first load the item collection using the
pystac
library:
We select the last scene, and extract the URL of the true-color image (“visual”):
PYTHON
assets = items[-1].assets # last item's assets
visual_href = assets["visual"].href # true color image
The true-color image is available as a raster file with 10 m
resolution and 3 bands (you can verify this by opening the file with
rioxarray
), which makes it a relatively large file (few
hundreds MBs). In order to keep calculations “manageable” (reasonable
execution time and memory usage) we select here a lower resolution
version of the image, taking advantage of the so-called “pyramidal”
structure of cloud-optimized GeoTIFFs (COGs). COGs, in fact, typically
include multiple lower-resolution versions of the original image, called
“overviews”, in the same file. This allows us to avoid downloading
high-resolution images when only quick previews are required.
Overviews are often computed using powers of 2 as down-sampling (or zoom) factors. So, typically, the first level overview (index 0) corresponds to a zoom factor of 2, the second level overview (index 1) corresponds to a zoom factor of 4, and so on. Here, we open the third level overview (zoom factor 8) and check that the resolution is about 80 m:
PYTHON
import rioxarray
visual = rioxarray.open_rasterio(visual_href, overview_level=2)
print(visual.rio.resolution())
OUTPUT
(79.97086671522214, -79.97086671522214)
Let’s make sure the data has been loaded into memory before
proceeding to time profile our raster calculation. Calling the
.load()
method of a DataArray object triggers data
loading:
Note that by default data is loaded using Numpy arrays as underlying data structure. We can visualize the raster:
Let’s now apply a median filter to the image while keeping track of
the execution time of this task. The filter is carried out in two steps:
first, we define the size and centering of the region around each pixel
that will be considered for the median calculation (the so-called
“windows”), using the .rolling()
method. We choose here
windows that are 7 pixel wide in both x and y dimensions, and, for each
window, consider the central pixel as the window target. We then call
the .median()
method, which initiates the construction of
the windows and the actual calculation.
For the time profiling, we make use of the Jupyter magic
%%time
, which returns the time required to run the content
of a cell (note that commands starting with %%
needs to be
on the first line of the cell!):
OUTPUT
CPU times: user 15.6 s, sys: 3.2 s, total: 18.8 s
Wall time: 19.6 s
Let’s note down the calculation’s “Wall time” (actual time to perform the task). We can inspect the image resulting after the application of median filtering:
Handling edges
By looking closely, you might notice a tiny white edge at the plot boundaries. These are the pixels that are less than 3 pixels away from the border of the image. These pixels cannot be surrounded by a 7 pixel wide window. The default behaviour is to assign these with nodata values.
Finally, let’s write the data to disk:
In the following section we will see how to parallelize these raster calculations, and we will compare timings to the serial calculations that we have just run.
Dask-powered rasters
Chunked arrays
As we have mentioned, rioxarray
supports the use of
Dask’s chunked arrays as underlying data structure. When opening a
raster file with open_rasterio
and providing the
chunks
argument, Dask arrays are employed instead of
regular Numpy arrays. chunks
describes the shape of the
blocks which the data will be split in. As an example, we open the blue
band raster using a chunk shape of (1, 4000, 4000)
(block
size of 1
in the first dimension and of 4000
in the second and third dimensions):
PYTHON
blue_band_href = assets["blue"].href
blue_band = rioxarray.open_rasterio(blue_band_href, chunks=(1, 4000, 4000))
Xarray and Dask also provide a graphical representation of the raster data array and of its blocked structure.
Exercise: Chunk sizes matter
We have already seen how COGs are regular GeoTIFF files with a special internal structure. other feature of COGs is that data is organized in “blocks” that can be accessed remotely via independent HTTP requests, abling partial file readings. This is useful if you want to access only a portion of your raster file, but it also lows for efficient parallel reading. You can check the blocksize employed in a COG file with the following code ippet:
PYTHON
import rasterio
with rasterio.open(visual_href) as r:
if r.is_tiled:
print(f"Chunk size: {r.block_shapes}")
In order to optimally access COGs it is best to align the blocksize
of the file with the chunks ployed when loading the file. Open the
blue-band asset (the “blue” asset) of a Sentinel-2 scene as a chunked
DataArray
object using a suitable chunk size. Which
elements do you think should be considered when choosing the chunk
size?
PYTHON
import rasterio
with rasterio.open(blue_band_href) as r:
if r.is_tiled:
print(f"Chunk size: {r.block_shapes}")
OUTPUT
Chunk size: [(1024, 1024)]
Ideal chunk size values for this raster are thus multiples of 1024.
An element to consider is number of resulting chunks and their size.
While the optimal chunk size strongly depends on the specific
application, chunks should in general not be too big nor too small
(i.e. too many). As a rule of thumb, chunk sizes of 100 MB typically
work well with Dask (see, e.g., this blog
post). Also, the shape might be relevant, depending on the
application! Here, we might select a chunks shape of
(1, 6144, 6144)
::
which leads to chunks 72 MB large: ((1 x 6144 x 6144) x 2 bytes /
2^20 = 72 MB). Also, we can t
ioxarrayand Dask figure out appropriate chunk shapes by setting
chunks=“auto”`:
which leads to (1, 8192, 8192)
chunks (128 MB).
Parallel computations
Operations performed on a DataArray
that has been opened
as a chunked Dask array are executed using Dask. Dask coordinates how
the operations should be executed on the individual chunks of data, and
runs these tasks in parallel as much as possible.
Let’s now repeat the raster calculations that we have carried out in the previous section, but running calculations in parallel over a multi-core CPU. We first open the relevant rasters as chunked arrays:
PYTHON
visual_dask = rioxarray.open_rasterio(visual_href, overview_level=2, lock=False, chunks=(3, 500, 500))
Setting lock=False
tells rioxarray
that the
individual data chunks can be loaded simultaneously from the source by
the Dask workers.
As the next step, we trigger the download of the data using the
.persist()
method, see below. This makes sure that the
downloaded chunks are stored in the form of a chunked Dask array
(calling .load()
would instead merge the chunks in a single
Numpy array).
We explicitly tell Dask to parallelize the required workload over 4 threads:
Let’s now continue to the actual calculation. Note how the same syntax as for its serial version is employed for applying the median filter. Don’t forget to add the Jupyter magic to record the timing!
OUTPUT
CPU times: user 20.6 ms, sys: 3.71 ms, total: 24.3 ms
Wall time: 25.2 ms
Did we just observe a 700x speed-up when comparing to the serial calculation (19.6 s vs 25.2 ms)? Actually, no calculation has run yet. This is because operations performed on Dask arrays are executed “lazily”, i.e. they are not immediately run.
Dask graph
The sequence of operations to carry out is stored in a task graph, which can be visualized with:
The task graph gives Dask the complete “overview” of the calculation, thus enabling a better nagement of tasks and resources when dispatching calculations to be run in parallel.
Most methods of DataArray
’s run operations lazily when
Dask arrays are employed. In order to trigger calculations, we can use
either .persist()
or .compute()
. The former
keeps data in the form of chunked Dask arrays, and it should thus be
used to run intermediate steps that will be followed by additional
calculations. The latter merges instead the chunks in a single Numpy
array, and it should be used at the very end of a sequence of
calculations. Both methods accept the same parameters (here, we again
explicitly tell Dask to run tasks on 4 threads). Let’s again time the
cell execution:
OUTPUT
CPU times: user 19.1 s, sys: 3.2 s, total: 22.3 s
Wall time: 6.61 s
The timing that we have recorded makes much more sense now. When running the task on a 4-core CPU laptop, we observe a x3 speed-up when comparing to the analogous serial calculation (19.6 s vs 6.61 s).
Once again, we stress that one does not always obtain similar performance gains by exploiting the Dask-based parallelization. Even if the algorithm employed is well suited for parallelization, Dask introduces some overhead time to manage the tasks in the Dask graph. This overhead, which is typically of the order of few milliseconds per task, can be larger than the parallelization gain. This is the typical situation with calculations with many small chunks.
Finally, let’s have a look at how Dask can be used to save raster
files. When calling .to_raster()
, we provide the following
additional arguments: * tiled=True
: write raster as a
chunked GeoTIFF. * lock=threading.Lock()
: the threads which
are splitting the workload must “synchronise” when writing to the same
file (they might otherwise overwrite each other’s output).
PYTHON
from threading import Lock
median_dask.rio.to_raster("visual_median-filter_dask.tif", tiled=True, lock=Lock())
Note that .to_raster()
is among the methods that trigger
immediate calculations (one can change this behaviour by specifying
compute=False
)
Key Points
- The
%%time
Jupyter magic command can be used to profile calculations. - Data ‘chunks’ are the unit of parallelization in raster calculations.
- (
rio
)xarray
can open raster files as chunked arrays. - The chunk shape and size can significantly affect the calculation performance.
- Cloud-optimized GeoTIFFs have an internal structure that enables performant parallel read.