This lesson is still being designed and assembled (Pre-Alpha version)

Frictionless Tables

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • What is a Frictionless Table Schema?

  • How can I create and edit a Frictionless Table Schema?

Objectives
  • Learn the Frictionless Table Schema and how it is used to describe a tabular dataset.

  • Import a table and infer metadata about it using Python.

  • Understand the Field Descriptor options for describing table fields.

  • Use Python to edit the Field Descriptors for fields in our dataset.

In this lesson we will be working with the Frictionless Python module and CSV (comma separated) data files. Each CSV file represents a table in our dataset. For each file, the first row is the header row and provides the names of the table fields. All following rows are data.

Introducing the Frictionless Table Schema

The Frictionless Table Schema is a simple format for describing a table using JSON. The schema has properties for describing information about the table and an array property listing the tables fields.

Why do we need the Table Schema?

While CSV is a simple and effective way for providing tabular data, it only tells us the name of a field. To use the table we might need to know other information, or metadata, about the fields such as the fields data type, i.e. is it text, an integer, a decimal or a date. The Frictionless Table Schema allows us to provide this metadata for the tables fields.

Providing additional metadata for our tables means we can improve them in three important ways:

  1. We can provide validation rules for quality checking the data.
  2. We can provide additional information to describe the fields and data. This makes it easier to re-use the data.
  3. We can add semantic annotations to fields to improve their interoperability with other datasets.

FAIR Data Principle

Providing additional metadata to describe our tables helps us to meet FAIR data principles for reuse of data.

Discussion

What other information could you provide to describe fields in a table?

Describing our first table

We are going to use the Frictionless Python module to describe our first CSV table.

Start Python or a new Jupyter Notebook and import the Frictionless module’s describe function. As we’ll be working with JSON and Python dictionary data structures we will also import the PrettyPrinter module to return more readable dictionary data.

from frictionless import describe
import pprint

pp = pprint.PrettyPrinter(depth=4)

Next we’ll describe the yields.csv file using the describe function to generate a Frictionless table schema and print the results.

yields_schema = describe("data/yields.csv")
pp.pprint(yields_schema)

This should output the following JSON table schema.

{'encoding': 'utf-8',
 'format': 'csv',
 'hashing': 'md5',
 'name': 'yields',
 'path': 'data/yields.csv',
 'profile': 'tabular-data-resource',
 'schema': {'fields': [{'name': 'plot_no', 'type': 'integer'},
                       {'name': 'expt_id', 'type': 'string'},
                       {'name': 'h_Date', 'type': 'string'},
                       {'name': 'col_y', 'type': 'integer'},
                       {'name': 'col_x', 'type': 'integer'},
                       {'name': 'variety', 'type': 'string'},
                       {'name': 'grain_weight', 'type': 'number'}]},
 'scheme': 'file'}

If we look at the output we can see the Frictionless describe function has automatically inferred basic table and field metadata, such as the name and relative path of the file and all table fields. The describe function also samples the data rows to infer the data type for each field.

Exercise

Challenge: Describe the varieties and experiments files

Using the code for describing the yields.csv to describe varieties.csv and experiments.csv. Assign the resulting table schemas to variables called varieties_schema and experiments_schema respectively.

Solution

varieties_schema = describe("data/varieties.csv")
experiments_schema = describe("data/experiments.csv")

Improving field descriptions

We have seen the Frictionless describe function generates a basic definition for each of our fields by assigning a name and data type. This is a good start for describing our tables, but it doesn’t provide enough information to make the data usable. For example what are the units for grain_weight and what do col_x and row_y mean?

We can use Python to edit the table schema to improve our metadata. We will do this by adding extra field descriptors to the table schema.

Field Descriptors

The Frictionless Table Schema uses Field Descriptors to provide additional information for a field.

Descriptor How to use it Example
name The name must match a field name in the data table hrv_date
title A human readable title for the field. Harvest date
description A more detailed description of the field. The date on which the crop was harvested.
type The data type for the field. date
format The format for the field data YYYY-MM-DD
rdfType This is rich type or semantic type. It should be a URI for a term from a controlled vocabulary http://purl.obolibrary.org/obo/TO_0000396
constraints This is used to constrain the values in a field and is used for validation required

In the table schema, using the hrv_date field example above would give the following JSON definition:

{
    'name': 'hrv_date',
    'title': 'Harvest date',
    'description': 'The date on which the crop was harvested.',
    'type': 'date',
    'format': 'YYYY-MM-DD'
    'rdfType': 'http://purl.obolibrary.org/obo/TO_0000396',
    'constraints': {'required': True}
}

Read the Frictionless Field Descriptors documentation for an in-depth description of the field descriptors.

FAIR Data Principle

Using the rdfType helps to improve interoperability of our dataset by annotating the field using a term or concept from a community vocabulary. In the above example we have used the Trait Ontology term for Harvest Date. This means we can more confidently link the data to other datasets that are similarly annotated, evern if the fields have different names.

Adding field descriptors to the table schema

In python we will start adding a title and description

yields_schema.schema.get_field("plot_no").title = "Plot Number"
yields_schema.schema.get_field("plot_no").description = "A unique identifer for the plot"

yields_schema.schema.get_field("expt_id").title = "Experiment Code"
yields_schema.schema.get_field("expt_id").description = "Institute standard code for a field experiment"

yields_schema.schema.get_field("h_date").title = "Harvest Date"
yields_schema.schema.get_field("h_date").description = "Date on which the plot was harvested"

pp.pprint(yields_schema)
{'encoding': 'utf-8',
 'format': 'csv',
 'hashing': 'md5',
 'name': 'yields',
 'path': 'data/yields.csv',
 'profile': 'tabular-data-resource',
 'schema': {'fields': [{'description': 'A unique identifer for the plot.',
                        'name': 'plot_no',
                        'title': 'Plot Number',
                        'type': 'integer'},
                       {'description': 'Institute standard code for a field experiment.',
                        'name': 'expt_id',
                        'title': 'Experiment Code',
                        'type': 'string'},
                       {'description': 'Date on which the plot was harvested.',
                        'name': 'h_date',
                        'title': 'Harvest Date',
                        'type': 'string'},
                       {'name': 'col_y', 'type': 'integer'},
                       {'name': 'col_x', 'type': 'integer'},
                       {'name': 'variety', 'type': 'string'},
                       {'name': 'grain_weight', 'type': 'number'}]},
 'scheme': 'file'}

Exercise

Challenge: Add field descriptors for the varieties and experiments tables

Using the code for adding field descriptors to the yields table as an example, use the information below to add field descriptors to the experiments table schema.

experiments

field name title description
expt_code Experiment Code A unique institute standard code for a field experiment.
harvest_machine Harvest machine Type of machine used to harvest plots.
harvest_width Harvest Width Width of the area harvested in metres.
harvest_length Harvest Length Length of the area harvested in metres.

Solution

experiments_schema.schema.get_field("expt_code").title = "Experiment Code"
experiments_schema.schema.get_field("expt_code").description = "A unique Institute standard code for a field experiment."

experiments_schema.schema.get_field("harvest_machine").title = "Harvest machine"
experiments_schema.schema.get_field("harvest_machine").description = "Type of machine used to harvest plots."

experiments_schema.schema.get_field("harvest_width").title = "Harvest Width"
experiments_schema.schema.get_field("harvest_width").description = "Width of the area harvested in metres."

experiments_schema.schema.get_field("harvest_length").title = "Harvest Length"
experiments_schema.schema.get_field("harvest_length").description = "Length of the area harvested in metres."

pp.pprint(experiments_schema)
{
'encoding': 'utf-8',
'format': 'csv',
'hashing': 'md5',
'name': 'experiments',
'path': 'data/experiments.csv',
'profile': 'tabular-data-resource',
'schema': {'fields': [{'description': 'A unique Institute standard code for a field experiment.',
                       'name': 'expt_code',
                       'title': 'Experiment Code',
                       'type': 'string'},
                      {'description': 'Type of machine used to harvest plots.',
                       'name': 'harvest_machine',
                       'title': 'Harvest machine',
                       'type': 'string'},
                      {'description': 'Width of the area harvested in metres.',
                       'name': 'harvest_width',
                       'title': 'Harvest Width',
                       'type': 'number'},
                      {'description': 'Length of the area harvested in metres.',
                       'name': 'harvest_length',
                       'title': 'Harvest Length',
                       'type': 'integer'},
                      {'name': 'expt_name', 
                       'type': 'string'},
                      {'name': 'site',
                       'type': 'string'}]},
'scheme': 'file'}

Adding constraint field descriptors to the table schema

We can also add constraint field descriptors to our table schema. Constraints are used to validate and quality check the data, for example, by checking numeric fields are within a certain range.

Frictionless define the following field constraints

Constraint name type usage
required True or False Indicates the field must have a value.
unique True or False Indicates all values in the field must be unique and not repeated.
minLength integer A number indicating the minimum length for text.
maxLength integer A number indicating the maximum length for text.
minimum integer A number indicating the minimum value for a number or date.
maximum integer A number indicating the maximum value for a number or date.
pattern string A regular expression defining the format of allowed values. For example experiment codes must follow a specified institute format.
enum array A list of allowed values. All values in a field must be from this list.

Constraint properties can be added to fields in the same way that we have just edited the title and description properties for experiments table schema.

Exercise

Challenge: Complete the code to add additional constraints

Complete the code so that:

  1. Experiment code is unique.
  2. Site must be from the list ‘Brooms Barn’, ‘Rothamsted’, ‘Woburn’.
  3. Harvest width must be between 1 and 2 m.
  4. Harvest length must be between 1 and 10 m.
experiments_schema.schema.get_field("expt_code").constraints["unique"] = _________
 
experiments_schema.schema.get_field("site").constraints["_________"] = ["Brooms Barn","Rothamsted",_________]

experiments_schema.schema.get_field("harvest_width").constraints["minimum"] = _________
experiments_schema.schema.get_field("harvest_width").constraints[""] = 2

_________.schema.get_field(_________).constraints[_________] = _________
_________.schema.get_field(_________).constraints[_________] = _________

pp.pprint(_________)

Solution

experiments_schema.schema.get_field("expt_code").constraints["unique"] = True
 
experiments_schema.schema.get_field("site").constraints["enum"] = ["Brooms Barn","Rothamsted","Woburn"]

experiments_schema.schema.get_field("harvest_width").constraints["minimum"] = 1
experiments_schema.schema.get_field("harvest_width").constraints["maximum"] = 2

experiments_schema.schema.get_field("harvest_length").constraints["minimum"] = 1
experiments_schema.schema.get_field("harvest_length").constraints["maximum"] = 10

pp.pprint(experiments_schema)
{
'encoding': 'utf-8',
'format': 'csv',
'hashing': 'md5',
'name': 'experiments',
'path': 'data/experiments.csv',
'profile': 'tabular-data-resource',
'schema': {'fields': [{'constraints': {'unique': True},
                       'description': 'A unique Institute standard code for a field experiment.',
                       'name': 'expt_code',
                       'title': 'Experiment Code',
                       'type': 'string'},
                      {'description': 'Type of machine used to harvest plots.',
                       'name': 'harvest_machine',
                       'title': 'Harvest machine',
                       'type': 'string'},
                      {'constraints': {'maximum': 2, 'minimum': 1},
                       'description': 'Width of the area harvested in metres.',
                       'name': 'harvest_width',
                       'title': 'Harvest Width',
                       'type': 'number'},
                      {'constraints': {'maximum': 10, 'minimum': 1},
                       'description': 'Length of the area harvested in metres.',
                       'name': 'harvest_length',
                       'title': 'Harvest Length',
                       'type': 'integer'},
                      {'name': 'expt_name', 'type': 'string'},
                      {'constraints': {'enum': ['Brooms Barn',
                                                'Rothamsted',
                                                'Woburn']},
                       'name': 'site',
                       'type': 'string'}]},
'scheme': 'file'}

Handling missing values

If our dataset has missing values we can use the Frictionless Table Schema to define how missing values are represented in the data. For example in the yields table missing yield data is represented in the grain_weight field by an *. However, the grain_weight field is defined in the schema as a number, therefore to prevent the Frictionless validator throwing an error, because * is not a number, we need to indicate the special meaning of *.

We can provide multiple missing value codes, so missing values are added to schema as an array. For example, the following code sets zero-length strings and * as allowed missing values for our yields table.

yields_schema.schema.missing_values = ["","*"]
pp.pprint(yields_schema)
{'encoding': 'utf-8',
 'format': 'csv',
 'hashing': 'md5',
 'name': 'yields',
 'path': 'data/yields.csv',
 'profile': 'tabular-data-resource',
 'schema': {'fields': [{'description': 'A unique identifer for the plot',
                        'name': 'plot_no',
                        'title': 'Plot Number',
                        'type': 'integer'},
                       {'description': 'Institute standard code for a field '
                                       'experiment',
                        'name': 'expt_id',
                        'title': 'Experiment Code',
                        'type': 'string'},
                       {'description': 'Date on which the plot was harvested',
                        'name': 'h_date',
                        'title': 'Harvest Date',
                        'type': 'string'},
                       {'name': 'col_y', 'type': 'integer'},
                       {'name': 'col_x', 'type': 'integer'},
                       {'name': 'variety', 'type': 'string'},
                       {'name': 'grain_weight', 'type': 'number'}],
            'foreignKeys': [{'fields': 'variety',
                             'reference': {'fields': 'variety',
                                           'resource': 'varieties'}},
                            {'fields': 'expt_code',
                             'reference': {'fields': 'expt_code',
                                           'resource': 'experiments'}}],
            'missingValues': ['', '*'],
            'primaryKey': 'plot_no'},
 'scheme': 'file'}

Adding a table description

We have added descriptions to our table fields using the Field Descriptor, but we haven’t added a description for the table. We can do this using the Tabular Data Resource description property.

For example adding a description to the yields table:

yields_schema.description = "The yields table contains raw plot yields for each experiment plot"
pp.pprint(yields_schema)

Key Points

  • The Frictionless Table Schema allows us to describe metadata for a table using JSON.

  • The Frictionless python module describe function is used to import a file as a table and infer information about it.

  • Using the Frictionless python module we can edit the Table Schema’s Field Descriptors.