This lesson is being piloted (Beta version)
If you teach this lesson, please tell the authors and provide feedback by opening an issue in the source repository

Being precise

Overview

Teaching: 25 min
Exercises: 10 min
Questions
  • How to make my metadata interoperable?

  • How to avoid disambiguation?

Objectives
  • Using public identifiers

  • Understand difference between close vocabulary and ontology

  • Finding ontology terms

(16 min teaching)

Being precise

If the metadata purpose is to help understand the data, it has to be done in a precise and “understandable” way i.e. it has to be interoperable. To be interoperable metadata should use a formal, accessible, shared, and broadly applicable terms/language for knowledge representation.

One of the easiest examples is the problem of author disambiguation.

Why we need ORCID

After Libarary Carpentry FAIR Data

Open Researcher and Contributor ID (ORCID)

Have you ever searched yourself in pubmed and found that you have a doppelganger? So how can you uniquely associate something you created to just yourself and no other researcher with the same name?

ORCID is a free, unique, persistent identifier that you own and control—forever. It distinguishes you from every other researcher across disciplines, borders, and time.

ORCIDs of authors of this episode are:

You can connect your iD with your professional information—affiliations, grants, publications, peer review, and more. You can use your iD to share your information with other systems, ensuring you get recognition for all your contributions, saving you time and hassle, and reducing the risk of errors.

If you do not have an ORCID, you should register to get one!

Exercise 1: Public ID in action (3 min)

The Wellcome Open Research journal uses ORCID to identify authors.

  • Open one of our papers doi.org/10.12688/wellcomeopenres.15341.2 and have a look how public IDs such as ORCID can be used to interlink information.

  • If you have not done so yet, register yourself at ORCID*

Solution

ORCID is used to link to authors profiles which list their other publications.

ORCID provides the registry of researchers, so they can be precisely identified. Similarly, there are other registries that can be used to identify many of biological concepts and entities:

BioPortal or NCBI
are good places to start searching for a registry or a term.

Exercise 2: Public ID in action 2 (3 min)

  • The second metadata example (the Excel table): contains two other types of public IDs.
    Metadata in data table example Figure credits: Tomasz Zielinski and Andrés Romanowski

    • Can you find the public IDs?
    • Can you find the meaning behind those Ids?

Solution

The metadata example contains genes IDs from The Arabidopsis Information Resource TAIR and metabolites IDs from KEGG

Disambiguation (7 min teaching)

In academic disciplines we quickly run into problems of naming standards e.g.:

In order to prevent such ambiguities applications provide interface that constraints users to pre-defined options, it controls the available vocabulary.

UI with controlled vocabulary Example of graphical user interfaces with controlled vocabularies

Controlled Vocabulary

Definition: Any closed prescribed list of terms

Key Features:

  • Terms are not usually defined
  • Relationships between the terms are not usually defined
  • the simplest form is a list

Example:

  • E. coli
  • Drosophila melanogaster
  • Homo sapiens
  • Mus musculus
  • Salmonella

Use of controlled vocabulary (a list) can be organised hierarchically into a taxonomy, a system we know mostly from our species taxonomy.

Taxonomy

Definition: Any controlled vocabulary that is arranged in a hierarchy

Key Features:

  • Terms are not usually defined
  • Relationships between the terms are not usually defined
  • Terms are arranged in a hierarchy

Example:

  • Bacteria
    • E. coli
    • Salmonella
  • Eucariota
    • Mammalia
      • Homo sapiens
      • Mus musculus
  • Insecta
    • Drosophila melanogaster

Ontologies add a further dimension to controlled vocabularies and taxonomy. They allow us to conceptualise relationships between the established hierarchy which helps with more sophisticated data queries and metadata searches.

Ontology

Definition: A formal conceptualisation of a specified domain

Key Features:

  • Terms are DEFINED
  • Relationships between the terms are DEFINED, allowing logical inference and sophisticated data queries
  • Terms are arranged in a hierarchy
  • expressed in a knowledge representation language such as RDFS, OBO, or OWL

Example:

  • Bacteria
    • E. coli
    • Salmonella
  • Eucariota — has_part —> nucleas
    • Mammalia — has_part —> placenta
      • Homo sapiens
      • Mus musculus
  • Insecta
    • Drosophila melanogaster

Ontologies represent a standardised, formal naming system and define categories, properties and relationships between data. Ontologies allow to describe properties of a subject area and how they are related (e.g. taxonomy).

ontology-example

Ontologies allows automatic reasoning about the data, using the relations between the described terms. For example, in the picture above we can “deduce” that hippocampal astrocyte:

We could follow the term axon regeneration to find its definition and what it means, to have a better understanding of hippocampal astrocyte.

Ontologies are crucial for aggregation of information and finding suitable data. Imagine you are interested in gene expersions in glial cells. Using a free text search you would need to hope that data from astrocytes will be also tagged as glial cell, or you would need to specify all suptypes of glial cells like oligodendrocytes astrocytes. Ontologies permits to automatically expands such searches that they contain suitable domain.

The above figure presents also another aspect of using ontologies. The terms are often referred to as:
PREFIX_ID e.g. CL_0002604
Prefix (CL, UB, UBERON, GO) will identify the ontology which defined the term (Cell Ontology for CL).
ID refers to particular term in the ontology.

Ontologies are encoded in standard, interoperable formats, which permitted creation of re-usable interfaces to access the described terms or browsing different ontologies. For example: https://bioportal.bioontology.org/](https://bioportal.bioontology.org/) permits searching for suitable terms definitions in hundreds of ontologies.

Obo foundry is another useful tool that lists recommended ontologies.

Exercise 3: Exploring ontologies (with instructor)

Check the example of ontology records:

Check what type of information is available, explore the terms hierarchy, the visualization, the interlinked terms, mapping between different ontologies.

Exercise 4: Ontology tests

  1. The prefix CL stands for:
    • a) Class ontology:
    • b) Cell ontology:
    • c) Cell line ontology
  2. The recommended ontology for chemical compounds is:
    • a) cheminf
    • b) chmo
    • c) chebi
  3. Which terms captures both Alzheimer’s and Huntington’s diseases
    • a) DOID_680
    • b) DOID_1289
    • c) DOID_0060090

Solution

3 b (hint try to see the tree in bioportal)

Attribution

Content of this episode was adapted from:

Ed_DaSH

Key Points

  • Public identifiers and ontologies are key to knowledge discovery

  • Automatic data aggregations needs standardised metadata formats and values