Structured Metadata: From Markup to JSON

Last updated on 2023-11-16 | Edit this page

Quote by Monya Baker in Nature from 2016, saying 'More than 70% of researchers have tried and failed to reproduce another scientist's experiments. More than half have failed to reproduce their own experiments.'
Baker, M. (2016) “1,500 scientists lift the lid on reproducibility.”, Nature 533, 452 - 454.https://doi.org/10.1038/533452a

Overview

Questions

  • What is semi-structured metadata?
  • How do you extract semi-structured metadata from natural language.
  • What is the JSON syntax?

Objectives

  • Explain the importance of semi-structured metadata for machine readability.
  • Understand, read and write basic Markdown / HTML / XML / JSON.

Markup?


Markup is not part of the natural language or content of the text but tells something about it 1, 2. By “marking up” a text document additional information on the structure, formatting and relationships within the document can be given. A familiar example are the annotations left by a teacher in a student’s assay with a red pen.2 In order to make Markup work, it is essential, that it follows determined rules, that are understood by the entity marking up the document and the interpreter alike. Markup languages in the digital context establish sets of rules that allow the machine to interpret the building blocks of the document. Some categories of markup are:

Punctuational markup is placing periods, question marks, or similar punctuations at the end of sentences. It gives clues about intonations.

This is a question?

Presentational markup is mainly about style.

MARKDOWN

Simple markdown will interpret the following word **bold**.

Descriptive or declarative markup declares what an element is; e. g. a member of a particular type or class like a:

MARKDOWN

## headline

If design rules for headlines change, the document structure remains intact and is still in line with the authors’ original intention.

Referential markup refers to entities external to the document and may be replaced by those entities during processing. The World Wide Web markup language HTML (HyperText Markup Language) e.g. uses the anchor <a> tag for hypertext references (hyperlinks) or <img> for images.

HTML

<a href="url">link text displayed to reader</a>

Callout

Rigorous markup can make text (character strings) more accessible for computer analysis.

SGML (Standard Generalized Markup Language) was one of the first industry standards for electronic publishing – a meta-language for generalized, descriptive markup languages – first accepted as an ISO standard in 1986. Both, HTML (1989) and XML (1998) are based on SGML.3

HTML (HyperText Markup Language) is the standard markup language for web pages.

Quote by Cynthia Zender, SAS Institue, saying 'To make markup work, the writer and the interpreter of the marked up content have to agree on the interpretation of the markup symbols.'
* Zender, C. (2005) Cynthia Zender (2005). Markup 101: Markup Basics. SAS Institute. https://www.lexjansen.com/pharmasug/2005/Tutorials/tu12.pdf

XML


The main purpose of XML (eXtensible Markup Language) is the transfer and storage of arbitrary data on the World Wide Web. XML is software- and hardware-independent. It is considered human-readable and allows for hierarchical (tree-like) structures. Data elements are wrapped in start <...> and end </...> “tags”. XML tags can be customized by the author of the document, its markup is therefore not limited to a set of rules but extensible.4

XML

<example>
    <title>This is the example title</title>
    <description>A simple XML example</description>
    <wordCount>1</wordCount>
</example>

JSON


JSON (JavaScript Object Notation) is not a markup language. It is a lightweight, human-readable, hierarchical format to store and transport data.5 JSON syntax is inspired by JavaScript object notation.6 Like XML, JSON is software- and hardware-independent.

  • (meta)data elements are defined in key/value pairs
  • keys are of data type string (in quotes)
  • values must be of data type string, number, boolean, array or object
  • elements are separated by commas
  • curly braces hold objects
  • square brackets hold arrays
  • in-line commenting is not supported

JSON

{
    "key":"value",
    "aString":"string",
    "anInteger":5,
    "aFloat":0.5,
    "aBoolean":true,
    "anArray": ["item1", "item2", "item3"],
    "anObject": {
        "key1":"value1",
        "key2":"value2",
        "key3":"value3"
    }
}

Callout

Data exchange formats such as XML or JSON can be read and processed not only by humans but also by computers. Structured (meta)data is key to enable machine-readability.

Challenge 2: Identify metadata in README.txt

You cannot make sense of the data you got from your collaborators. You ask them for supplemental information and they send you the following README file (see below).

  • Read the README carefully.
  • In the group, discuss, decide and prioritize which information in the text are relevant experimental metadata.
  • Mark up the relevant information. In markdown you can mark up the respective words with “==”.
    Example: ==This text will be highlighted==

You can download the README as TXT file here: README_exampleDataObject.txt

README_exampleDataObject.txt
This README file describes the data in trainingObject.csv

The data describes the biomechanical acceleration and screams detected of a test person during the ride of the roller coaster "Flight of the Bat" in Gotham City. 

The data was collected by Bruce Wayne and Selina Kyle (Institute for Vigilance and Nightly Motion -- Justice League) on 2022-02-28 in Gotham City, New Jersey.

The test person (male) is 5'11 tall and weighs 187 lbs.

The test person strapped the recording device (iPhone X) with a running armband to their left upper arm and activated the biomechanical acceleration and scream detection of the application "Physics Toolbox Suite" by Vieyra Software.

- "t" describes the ride time at which measurements were taken upon activating the recording in seconds.
- "ax" describes the biomechanical acceleration of the test person on the x axis in m/s².
- "ay" describes the biomechanical acceleration of the test person on the y axis in m/s².
- "az" describes the biomechanical acceleration of the test person on the z axis in m/s².
- "scr" is a boolean indicating a detected scream of the test person.

This README describes the data in trainingObject.csv

The data describes the biomechanical acceleration and screams detected of a test person during the ride of the roller coaster “Flight of the Bat” in Gotham City.

The data was collected by Bruce Wayne and Selina Kyle (Institute for Vigilance and Nightly Motion - Justice League) on 2022-02-28 in Gotham City, New Jersey.

The test person (male) is 5’11’’ tall and weighs 187 lbs.

The test person strapped the recording device (iPhone X) with a running armband to the left upper arm and activated the biomechanical acceleration and scream detection of the application Physics Toolbox Suite by Vieyra Software. During the ride, the test person was instructed to firmly hold on to the safety handles in order to avoid excessive movement of the arm. The test person was seated in row 10 on the outer left (seat 37).

t” describes the ride time at which measurements were taken upon activating the recording in seconds. “ax” describes the biomechanical acceleration of the test person on the x axis in m/s². “ay” describes the biomechanical acceleration of the test person on the y axis in m/s². “az” describes the biomechanical acceleration of the test person on the z axis in m/s². “scr” is a boolean indicating a detected scream of the test person.

Challenge 3: Write JSON metadata record

You have manually marked up the relevant information in the README. However, your project requires you to provide metadata in the form of a machine-readable JSON metadata record. The project provides you with a simple example JSON object (remember, that curly braces hold objects):

JSON

{
  "key": "value",
  "key": "value"
}
  • Based on the information identified in the README, write a structured, descriptive JSON object.
  • Collaboratively, find suitable keys to your values.
  • You may want to use some JSON formatter web service to check and beautify (lint) your output.

Keep in mind, that values in JSON must be one of the following data types:

  • a string ""
  • a number 42
  • a boolean True
  • null null
  • an array []
  • an object {}

Example:

JSON

{
    "key": "value",
    "aString": "string",
    "anInteger": 5,
    "aFloat": 0.5,
    "aBoolean": true,
    "anArray": ["item1", "item2", "item3"],
    "anObject": {
        "key1": "value1",
        "key2": "value2",
        "key3": "value3"
    }
}

This is just one (of many) valid solutions.

JSON

{
  "fileName": "trainingObject.csv",
  "abstract": "The data describes the biomechanical acceleration and screams detected of a test person during the ride of the roller coaster \"Flight of the Bat\" in Gotham City.",
  "format": "text/csv",
  "date": "2022-02-28",
  "creator": [
    {
      "creatorName": "Bruce Wayne",
      "creatorAffiliation": "Institute for Vigilance and Nightly Motion - Justice League"
    },
    {
      "creatorName": "Selina Kyle",
      "creatorAffiliation": "Institute for Vigilance and Nightly Motion - Justice League"
    }
  ],
  "experimentalParameters": {
    "testRide": {
      "rideName": "Flight of the Bat",
      "location": "Gotham City, New Jersey",
      "rideType": "roller coaster"
    },
    "testPerson": {
      "sex": "male",
      "height": 180
    },
    "recording": {
      "testDevice": "iPhone X",
      "testDeviceFixture": "left upper arm",
      "testApp": "Physics Toolbox Suite by Vieyra Software"
    }
  },
  "columns": [
    {
      "columnName": "t",
      "columnDescription": "ride time at which measurements were taken upon activating the recording in seconds",
      "dataType": "number",
      "columnUnit": "sec"
    },
    {
      "columnName": "ax",
      "columnDescription": "biomechanical acceleration of the test person on the x axis in m/s²",
      "dataType": "number",
      "columnUnit": "m/s²"
    },
    {
      "columnName": "ay",
      "columnDescription": "biomechanical acceleration of the test person on the y axis in m/s²",
      "dataType": "number",
      "columnUnit": "m/s²"
    },
    {
      "columnName": "az",
      "columnDescription": "biomechanical acceleration of the test person on the z axis in m/s²",
      "dataType": "number",
      "columnUnit": "m/s²"
    },
    {
      "columnName": "scr",
      "columnDescription": "boolean indicating a detected scream of the test person",
      "dataType": "boolean",
      "columnUnit": "1"
    }
  ]
}

Plenary result discussion

  • What was easy while generating the structured metadata record?
  • What was hard? Which points were intensely discussed in the group?
  • Which differences do you see between the different JSON metadata records?
  • How do you feel after comparing the results?

Key Points

  • Markup languages add information to a text that is separated from the content.
  • XML and JSON are lightweight, hierarchical file formats to store and transfer data.
  • XML and JSON are human readable, software- and hardware-independent

  1. James H. Coombs et al. (November 1987). Markup Systems and the Future of Scholarly Text Processing. Communications of the ACM 30. http://xml.coverpages.org/coombs.html#Note1↩︎

  2. Cynthia Zender (2005). Markup 101: Markup Basics. SAS Institute. https://www.lexjansen.com/pharmasug/2005/Tutorials/tu12.pdf↩︎

  3. https://www.iso.org/standard/16387.html↩︎

  4. XML Tutorial. (C) 1999-2022. Refsnes Data, W3Schools. https://www.w3schools.com/xml/↩︎

  5. ECMA-404 - ECMA International. (2021, February 4). Ecma International. https://www.ecma-international.org/publications-and-standards/standards/ecma-404/↩︎

  6. JSON Introduction. (C) 1999-2022. Refsnes Data, W3Schools. https://www.w3schools.com/js/js_json_intro.asp↩︎