Dataset README and Data Dictionary
Last updated on 2024-11-19 | Edit this page
Overview
Questions
- What is a dataset-level README file?
- How is this different from a project-level README file?
- What is a data dictionary and how is it different from a data-level README file?
Objectives
- Be able to write a data-level README
- Be able to write out a data dictionary
Introduction
A data-level README (DLR) describes the dataset, the file structure, and the file naming scheme. Unlike a DLR, a data dictionary describes a single data file. Think of a data dictionary as a translator for a foreign language. It explains the meaning of each word (variable) in a dataset, making it understandable to others.
Let’s illustrate the differences between a DLR and a data dictionary
by using an example project. The following is a simple file structure
for our project with folder1
being a single dataset within
our project.
project
| README.txt
| file001.txt
|
|___folder1
| | README.txt
| | file011.txt
| | file011_dictionary.txt
| | file012.csv
| | file012_dictionary.txt
| | ...
| |____subfolder1
| | file111.csv
| | file111_dictionary.txt
| | file112.csv
| | file112_dictionary.txt
We see from the example project that there is a PLR under the folder
project
and a DLR in the top level of the first dataset,
signified by folder1
.
Dataset-level README (DLR)
A DLR describes the data in each folder. Think of a researcher that
might pick up your data repository for the first time. They are looking
for a certain dataset that is described by the files under
subfolder1
. The researcher will have two options here: they
can open data or data dictionary in each folder to see what type of data
is housed within, or they can look at the DLR that describes which data
is in each folder. For an example dataset, you can look at this one on
figshare: example
dataset.
Challenge
For the above folder structure, the dataset starts with
folder1
and all folders within. The data concerns
performance metrics of autonomous underwater vehicles (AUVs). Assume
that subfolder1
contains reference data from the following
reference:
Michael Coe, Stefanie Gutschmidt, Cost of Transport is not the whole story — A review, Ocean Engineering, Volume 313, Part 1, 2024, https://doi.org/10.1016/j.oceaneng.2024.119332.
file011.txt
is text responses for a survey pertaining to
people’s views of AUVs.
file012.csv
is comma-separated values (csv) of energy
requirements for AUVs.
file111.csv
is data pertaining to the speed of different
AUV platforms.
file112.csv
is data pertaining to the energy consumption
of different AUV platforms.
This example solution is written for a text (.txt) format, but could easily be used for a markdown (.md). Both of these formats can be open universally. Note that your solution can be different and this is just an example.
folder1 includes data for all figures included in this publication
file011.txt contains text responses of people view of AUVs used to generate figure 1
file012.csv contains new energy requirements for AUVs used to generate figure 2
subfolder2 houses all the reference data from the following reference:
Michael Coe, Stefanie Gutschmidt, Cost of Transport is not the whole story — A review, Ocean Engineering, Volume 313, Part 1, 2024, https://doi.org/10.1016/j.oceaneng.2024.119332.
file111.csv is data from the reference on the speed of different AUV platforms used in the generation of figure 3.
file112.csv is data from the reference on energy consumption of different AUV platforms, also included in figure 2.
Data Dictionary
Ideally, a spreadsheet is formatted with a row of variable names at the top, followed by rows of data going down. This makes easy for data to be used in any data analysis software (interoperability is a good thing) but makes it impossible to document a spreadsheet within the file itself. For this reason, it’s useful to create a data dictionary to describe the spreadsheet so that others can interpret the data. A data dictionary describes the variables in a spreadsheet, aiding in data interpretation. It should include:
- Variable name
- Variable description
- Variable units
- Relationships to other variables
- Coding values and their meanings
- Data issues
- Additional relevant information
You can download a blank example dictionary. Note: that this can be formatted in a spreadsheet for multiple variables or formatted for multiple variables in the text file.
The following callout is an example of a data dictionary.
Callout
Question | Example |
---|---|
Variable name | site |
Variable description | Two-letter abbreviation describing the name of the overall site where the sample was collected. |
Variable units | N/A |
Relationship to other variables | Partner to variable “sampleNum”, which together define the sample ID (site name + sample number at that site). Related to variables “latitude” and “longitude”, which record exact coordinate location and are more specific than the larger site code. |
Variable coding values and meanings | Coding values and meanings: BL = Badlands NP; DV = Death Valley NP; GT = Grand Teton NP; JT = Joshua Tree NP; ZN = Zion NP |
Known issues with the data | Some Badlands samples were collected outsideof the park boundaries; see latitude and longitude variables for specific locations. |
Anything else to know about the data? | Older data (pre-2013) used one-letter abbreviations for site code but this was updated for clarity and ease of identification. |
Challenge
Let’s assume that we have a csv where the first row is all the variable names and each row after is a data point. In keeping with our previous challenge, the first three variables are t, L, and U which correspond to time, length, and speed. How would you write a data dictionary for these three variables?
Question | t | L | U |
---|---|---|---|
Variable name | time | Length | Velocity |
Variable description | timestep | length of the AUV | Velocity of the AUV |
Variable units | second | meter | meter per second |
Relationship to other variables | N/A | Partner variable to time | Partner variable to time |
Variable coding values and meanings | N/A | N/A | N/A |
Known issues with the data | N/A | Some lengths had to be estimated from figures | N/A |
Anything else to know about the data? | N/A | N/A | N/A |
Key Points
- Data-level README can help a researcher quickly understand which data is which and where that data is stored in your dataset.
- Data-level README is a quick description of how your dataset is structured.
- Data dictionaries are a decoding of the variables used in a single file of your dataset.