Structured Metadata: From Markup to JSON
Last updated on 2023-11-16
Overview
Questions
- What is semi-structured metadata?
- How do you extract semi-structured metadata from natural language?
- What is the JSON syntax?
Objectives
- Explain the importance of semi-structured metadata for machine readability.
- Understand, read and write basic Markdown / HTML / XML / JSON.
Markup?
Markup is not part of the natural language or content of a text but tells something about it.1, 2 By “marking up” a text document, additional information on the structure, formatting and relationships within the document can be given. A familiar example is the annotations a teacher leaves in a student’s essay with a red pen.2 For markup to work, it must follow defined rules that are understood by both the entity marking up the document and the interpreter. Markup languages in the digital context establish sets of rules that allow the machine to interpret the building blocks of the document. Some categories of markup are:
Punctuational markup places periods, question marks, or similar punctuation marks at the end of sentences. It gives clues about intonation.
This is a question?
Presentational markup is mainly about style.
Descriptive or declarative markup declares what an element is, e.g. a member of a particular type or class such as a headline.
If design rules for headlines change, the document structure remains intact and is still in line with the authors’ original intention.
Referential markup refers to entities external to the document and may be replaced by those entities during processing.
The World Wide Web markup language HTML (HyperText Markup Language), for example, uses the anchor <a> tag for hypertext references (hyperlinks) or <img> for images.
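Referential markup can be read back out of a document by a program. The following is a minimal sketch using Python's standard library `html.parser`; the HTML fragment, the URLs and the class name `LinkCollector` are invented for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the targets of <a href=...> and <img src=...> tags."""

    def __init__(self):
        super().__init__()
        self.references = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        attributes = dict(attrs)
        if tag == "a" and "href" in attributes:
            self.references.append(("a", attributes["href"]))
        elif tag == "img" and "src" in attributes:
            self.references.append(("img", attributes["src"]))


# Hypothetical HTML fragment; the URLs are placeholders.
html_text = (
    '<p>See the <a href="https://example.org/docs">documentation</a>.</p>'
    '<img src="figure.png" alt="A figure">'
)

collector = LinkCollector()
collector.feed(html_text)
print(collector.references)
```

The parser only reports what the markup declares: the link text and the paragraph content are ignored, while the external entities the tags point to are collected.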
Callout
Rigorous markup can make text (character strings) more accessible for computer analysis.
SGML (Standard Generalized Markup Language), a meta-language for generalized, descriptive markup languages, was one of the first industry standards for electronic publishing and was first accepted as an ISO standard in 1986. Both HTML (1989) and XML (1998) are based on SGML.3
HTML (HyperText Markup Language) is the standard markup language for web pages.
XML
The main purpose of XML (eXtensible Markup Language) is the transfer and storage of arbitrary data on the World Wide Web. XML is software- and hardware-independent. It is considered human-readable and allows for hierarchical (tree-like) structures. Data elements are wrapped in start <...> and end </...> “tags”. XML tags can be customized by the author of the document; its markup is therefore not limited to a predefined set of tags but extensible.4
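To illustrate the tree structure and the author-defined tags, here is a small sketch that parses an XML text with Python's standard library `xml.etree.ElementTree`. The tag names (`dataset`, `fileName`, `creator`) and the values are made up for this example and are not part of any standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML record; the tag names are invented for this example.
xml_text = """
<dataset>
  <fileName>measurements.csv</fileName>
  <creator role="collector">Bruce Wayne</creator>
  <creator role="collector">Selina Kyle</creator>
</dataset>
""".strip()

root = ET.fromstring(xml_text)

# Child elements are reached by tag name; the text sits between
# the start tag and the end tag.
file_name = root.find("fileName").text
creators = [element.text for element in root.findall("creator")]

print(root.tag)
print(file_name)
print(creators)
```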
JSON
JSON (JavaScript Object Notation) is not a markup language. It is a lightweight, human-readable, hierarchical format to store and transport data.5 JSON syntax is inspired by JavaScript object notation.6 Like XML, JSON is software- and hardware-independent.
- (meta)data elements are defined in key/value pairs
- keys are of data type string (in quotes)
- values must be of data type string, number, boolean, null, array or object
- elements are separated by commas
- curly braces hold objects
- square brackets hold arrays
- in-line commenting is not supported
JSON
{
  "key": "value",
  "aString": "string",
  "anInteger": 5,
  "aFloat": 0.5,
  "aBoolean": true,
  "anArray": ["item1", "item2", "item3"],
  "anObject": {
    "key1": "value1",
    "key2": "value2",
    "key3": "value3"
  }
}
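As a quick check that such an object is machine-readable, the sketch below loads the example above with Python's standard `json` module and prints the Python type each value is mapped to. The variable names are arbitrary.

```python
import json

# The example object from above, held as JSON text in a Python string.
json_text = """
{
  "key": "value",
  "aString": "string",
  "anInteger": 5,
  "aFloat": 0.5,
  "aBoolean": true,
  "anArray": ["item1", "item2", "item3"],
  "anObject": {"key1": "value1", "key2": "value2", "key3": "value3"}
}
"""

data = json.loads(json_text)

# Each JSON data type is mapped to a corresponding Python type.
for key, value in data.items():
    print(f"{key}: {type(value).__name__}")

# Arrays are indexed by position, objects by key.
print(data["anArray"][0])
print(data["anObject"]["key2"])
```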
Callout
Data exchange formats such as XML or JSON can be read and processed not only by humans but also by computers. Structured (meta)data is key to enabling machine readability.
Challenge 2: Identify metadata in README.txt
You cannot make sense of the data you got from your collaborators. You ask them for supplemental information and they send you the following README file (see below).
- Read the README carefully.
- In the group, discuss, decide and prioritize which pieces of information in the text are relevant experimental metadata.
- Mark up the relevant information. In Markdown flavors that support highlighting, you can mark up the respective words with “==”. Example: ==This text will be highlighted==
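Once the relevant words are marked up, a program can pick them out again. The following is a minimal sketch using a regular expression; the highlighted sentence is taken from the README below, and the choice of which words to highlight is only an example.

```python
import re

# A README sentence with hand-added "==" highlights (illustrative choice).
marked_up = (
    "The data was collected by ==Bruce Wayne== and ==Selina Kyle== "
    "on ==2022-02-28== in ==Gotham City, New Jersey==."
)

# Non-greedy match of everything between a pair of "==" markers.
highlighted = re.findall(r"==(.+?)==", marked_up)
print(highlighted)
```

The extracted strings are still unlabeled values; deciding on suitable keys for them is the step that turns marked-up text into structured metadata.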
You can download the README as TXT file here: README_exampleDataObject.txt
README_exampleDataObject.txt
This README file describes the data in trainingObject.csv
The data describes the biomechanical acceleration and screams detected of a test person during the ride of the roller coaster "Flight of the Bat" in Gotham City.
The data was collected by Bruce Wayne and Selina Kyle (Institute for Vigilance and Nightly Motion - Justice League) on 2022-02-28 in Gotham City, New Jersey.
The test person (male) is 5'11'' tall and weighs 187 lbs.
The test person strapped the recording device (iPhone X) with a running armband to the left upper arm and activated the biomechanical acceleration and scream detection of the application "Physics Toolbox Suite" by Vieyra Software. During the ride, the test person was instructed to firmly hold on to the safety handles in order to avoid excessive movement of the arm. The test person was seated in row 10 on the outer left (seat 37).
- "t" describes the ride time at which measurements were taken upon activating the recording in seconds.
- "ax" describes the biomechanical acceleration of the test person on the x axis in m/s².
- "ay" describes the biomechanical acceleration of the test person on the y axis in m/s².
- "az" describes the biomechanical acceleration of the test person on the z axis in m/s².
- "scr" is a boolean indicating a detected scream of the test person.
Challenge 3: Write JSON metadata record
You have manually marked up the relevant information in the README. However, your project requires you to provide metadata in the form of a machine-readable JSON metadata record. The project provides you with a simple example JSON object (remember that curly braces hold objects).
- Based on the information identified in the README, write a structured, descriptive JSON object.
- Collaboratively, find suitable keys to your values.
- You may want to use some JSON formatter web service to check and beautify (lint) your output.
Keep in mind that values in JSON must be one of the following data types:
- a string: ""
- a number: 42
- a boolean: true
- null: null
- an array: []
- an object: {}
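Instead of a web service, a record can also be checked and beautified locally. The sketch below uses Python's standard `json` module; the keys and the helper `is_valid_json` are hypothetical names chosen for this example. Note that Python writes `True` and `None`, whereas JSON requires lowercase `true` and `null`.

```python
import json

# One Python value per JSON data type (the keys are illustrative only).
record = {
    "aString": "text",
    "aNumber": 42,
    "aBoolean": True,
    "aNull": None,
    "anArray": [],
    "anObject": {},
}

# Serialize to JSON text; indent=2 pretty-prints the result.
json_text = json.dumps(record, indent=2)
print(json_text)


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


print(is_valid_json(json_text))
# A capitalized boolean is Python syntax, not JSON, and is rejected.
print(is_valid_json('{"aBoolean": True}'))
```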
Example:
This is just one of many valid solutions.
JSON
{
  "fileName": "trainingObject.csv",
  "abstract": "The data describes the biomechanical acceleration and screams detected of a test person during the ride of the roller coaster \"Flight of the Bat\" in Gotham City.",
  "format": "text/csv",
  "date": "2022-02-28",
  "creator": [
    {
      "creatorName": "Bruce Wayne",
      "creatorAffiliation": "Institute for Vigilance and Nightly Motion - Justice League"
    },
    {
      "creatorName": "Selina Kyle",
      "creatorAffiliation": "Institute for Vigilance and Nightly Motion - Justice League"
    }
  ],
  "experimentalParameters": {
    "testRide": {
      "rideName": "Flight of the Bat",
      "location": "Gotham City, New Jersey",
      "rideType": "roller coaster"
    },
    "testPerson": {
      "sex": "male",
      "height": 180
    },
    "recording": {
      "testDevice": "iPhone X",
      "testDeviceFixture": "left upper arm",
      "testApp": "Physics Toolbox Suite by Vieyra Software"
    }
  },
  "columns": [
    {
      "columnName": "t",
      "columnDescription": "ride time at which measurements were taken upon activating the recording in seconds",
      "dataType": "number",
      "columnUnit": "sec"
    },
    {
      "columnName": "ax",
      "columnDescription": "biomechanical acceleration of the test person on the x axis in m/s²",
      "dataType": "number",
      "columnUnit": "m/s²"
    },
    {
      "columnName": "ay",
      "columnDescription": "biomechanical acceleration of the test person on the y axis in m/s²",
      "dataType": "number",
      "columnUnit": "m/s²"
    },
    {
      "columnName": "az",
      "columnDescription": "biomechanical acceleration of the test person on the z axis in m/s²",
      "dataType": "number",
      "columnUnit": "m/s²"
    },
    {
      "columnName": "scr",
      "columnDescription": "boolean indicating a detected scream of the test person",
      "dataType": "boolean",
      "columnUnit": "1"
    }
  ]
}
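The example record stores "height": 180 without naming a unit. Presumably this is the README's 5'11'' converted to centimetres. The sketch below reproduces that conversion under this assumption and also converts the weight, which the example record leaves out. The function names and the keys `heightCm` and `weightKg` are hypothetical; putting the unit into the key is one possible design choice, not the lesson's prescribed solution.

```python
# Conversion factors (both are exact by definition).
CM_PER_INCH = 2.54
KG_PER_POUND = 0.45359237


def feet_inches_to_cm(feet, inches):
    """Convert a height given in feet and inches to centimetres."""
    return (feet * 12 + inches) * CM_PER_INCH


def pounds_to_kg(pounds):
    """Convert a weight given in pounds to kilograms."""
    return pounds * KG_PER_POUND


height_cm = feet_inches_to_cm(5, 11)   # README: 5'11''
weight_kg = pounds_to_kg(187)          # README: 187 lbs

# Hypothetical variant of "testPerson" that names the unit in each key.
test_person = {
    "sex": "male",
    "heightCm": round(height_cm),
    "weightKg": round(weight_kg, 1),
}
print(test_person)
```

Whether units go into the key, into a separate field, or into a documented convention is exactly the kind of decision groups tend to make differently.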
Plenary result discussion
- What was easy while generating the structured metadata record?
- What was hard? Which points were intensely discussed in the group?
- Which differences do you see between the different JSON metadata records?
- How do you feel after comparing the results?
Key Points
- Markup languages add information to a text that is separated from the content.
- XML and JSON are lightweight, hierarchical file formats to store and transfer data.
- XML and JSON are human-readable and software- and hardware-independent.
James H. Coombs et al. (November 1987). Markup Systems and the Future of Scholarly Text Processing. Communications of the ACM 30. http://xml.coverpages.org/coombs.html#Note1
Cynthia Zender (2005). Markup 101: Markup Basics. SAS Institute. https://www.lexjansen.com/pharmasug/2005/Tutorials/tu12.pdf
XML Tutorial. (C) 1999-2022. Refsnes Data, W3Schools. https://www.w3schools.com/xml/
ECMA-404 - ECMA International. (2021, February 4). Ecma International. https://www.ecma-international.org/publications-and-standards/standards/ecma-404/
JSON Introduction. (C) 1999-2022. Refsnes Data, W3Schools. https://www.w3schools.com/js/js_json_intro.asp