Data Management: Meta-data
Overview
Teaching: 20 min
Exercises: 3 minQuestions
What is meta-data?
Why is it useful?
Objectives
Understand the concept and importance of meta-data
What is Meta-data?
Meta-data is data about data. It is structured information that describes, explains, locates, or otherwise represents something else. Its purpose is to maximise data interpretation and understanding, both by humans and by computers. In other words, its aim is to avoid the details of what the data is from being lost, to prevent rendering the data useless as time passes (sometimes referred to as data entropy or data rot).
The simplest approach to meta-data is to create a simple text file (named, for example, ‘README.txt’) that resides in the same directory as the data. Don’t worry about the file-name of this file in terms of the formal file-naming rules, as crucially, this file will never be distributed on its own. Think of it more as an informal collection of notes.
This file is particularly important for projects that involve multiple files, and it should cover various aspects, such as,
- Project level: What is the study? Name, instruments, methodologies, etc
- File or database level: What are all the files? How do they relate to one another?
- Variable level: Fully explain each variable in the spreadsheets, etc
- Processing: What, in general terms, has been done with the data?
Think of this README.txt file as an email to your future self. What will you need to know about these files in, say, 18 months?
In all but the most complex of datasets, this file should take <10 minutes to create. It does not have to be formal and detailed, although of course it can be if that is deemed necessary. An example of a relatively simple README.txt file,
README.txt file created by Rob Harrand on 2020-04-29
========
Principle researcher: Joe Bloggs
Email: joe.bloggs@research.institute.com
Secondary researcher: Jim Smith
Email: joe.bloggs@research.institute.com
Files in this folder related to project PROJ_00123
See report proj_00123_plan_20190315_JB_v01.docx for general project details
All data collected week commencing 2nd Dec 2019 by Jim Smith (lab B, building 631)
For protocols and instrument detail see corresponding ELISA details in experimental_protocols_20180319_TR_v03.docx
Raw data is in Data/Raw/
Analysed data is in Data/Analysed/
Analysis scripts are in Code/R scripts/
Both the raw and analysed data are commercially sensitive. See NDA-000113
Adjusted biomarker levels in all .xls files are adjusted by subtracking the mean and dividing by the standard deviation
Missing data are indicated using NA throughout all analysed files
Data dictionaries
A more formal version of the simple README.txt is something called a data dictionary. These are repositories of meta-data, containing detailed aspects of the associated data, such as the following,
- Explanations of all files or tables (in a database)
- Field/column definitions and explanations
- Example values, default values, allowed value ranges, and units
- Information on the sources of the datasets
- Information of data types, such as numeric or text
- Any relationships between different fields/columns
- The chosen placeholder for missing values, such as ‘NA’
Data dictionaries are particularly useful when sharing data with others.
Exercise: Data dictionary
Take a look at the data dictionary at the Zika Data Repository
Internal file annotation
In some cases, file types may offer the ability to include meta-data within the main data file. In these cases, a separate meta-data file may be unnecessary, especially if the file is standalone and not part of a group of project files. The simplest example are Word and Excel files. When creating such files, any extra information that will aid in the data’s future use should be included.
Such aspects to keep in mind include,
- With Word files, include the author’s name, date of creation, and project description in the header
- With Excel files, ensure columns and tables are labeled. If the file contains plots, label the axes and give the plot a title. If the file contains multiple tabs, label them appropriately.
Also, consider a separate, initial tab containing a data dictionary and meta-data explaining the rest of the file (e.g. where the data is from, who collected it, units, what each tab is for, etc.).
There are no strict rules on any of the above, and attempting to create such rules for everyone to adhere to may do more harm than good (for example, if the process ends up overly complex and bureaucratic). Instead, guiding principles should be used, with the key aspect being kept in mind, namely, that your aim is to ensure the long-term usefulness of the data.
Key Points
Files and folders with meta-data are far more useable than without