Public repositories

Last updated on 2024-05-01

Estimated time: 40 minutes

Overview

Questions

Where can I deposit datasets?
What are general data repositories?
How to find a repository?

Objectives

See the benefits of using research data repositories.
Differentiate between general and specific repositories.
Find a suitable repository.

What are research data repositories? (4 min teaching)

Research data repositories are online repositories that enable the preservation, curation and publication of research ‘products’. These repositories are mainly used to deposit research ‘data’. However, the scope of the repositories is broader as we can also deposit/publish ‘code’ or ‘protocols’ (as we saw with protocols.io).

There are general “data agnostic” repositories, for example:

Or domain specific, for example:

UniProt for protein data,
GenBank for sequence data,
MetaboLights for metabolomics data,
GitHub for code.

Research outputs should be submitted to discipline/domain-specific repositories whenever it is possible. When such a resource does not exist, data should be submitted to a ‘general’ repository. Research data repositories are a key resource to help in data FAIRification as they assure Findability and Accessibility.

Discussion

Exercise 1: Public general record (5 min)

Have a look at the following record for a data set in the Zenodo repository:

We have discussed which elements of a Zenodo record make it FAIR.

Now, skim through the data set description (hint: there is also a README), try to judge the following, and indicate your evaluation using marks from 0 to 5 (5 best) as to whether:

It is clear what the content of the data set is:
It is clear why the data could be used (i.e., what for):
It is well described:
How confident you would be working with this data set:
How easy it is to access the data set content:
Your team datasets are equally well described (or better):

(4 min discussion)

Challenge

Exercise 2: Datasets discovery (5 min)

Try to find:

data sets related to neuromuscular junction in Zenodo

Judge the following, indicating your assessment using marks from 0 to 5 (5 best):

How easy it is to find similar or interesting data sets:
It is clear what the content of the other data sets is:
It is clear why the data could be used (i.e., what for):
They are well described:

Solution

Zenodo is a good place to keep your data separate from the paper. It gives access to all files, allowing you to cite the data as well as (or instead of) the paper.

However, it is not too good for discovery and does not enforce most metadata!
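The search from Exercise 2 can also be scripted instead of run through the web interface. The following is a minimal sketch, assuming Zenodo's public REST endpoint `https://zenodo.org/api/records` accepts `q` and `size` query parameters and returns JSON with records under `hits.hits`. The function names are hypothetical, and the response layout should be checked against the current Zenodo API documentation before relying on it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed public search endpoint; verify against the Zenodo API documentation.
ZENODO_API = "https://zenodo.org/api/records"


def build_search_url(query, size=5):
    """Build a Zenodo search URL for a free-text query."""
    params = {"q": query, "size": size}
    return f"{ZENODO_API}?{urlencode(params)}"


def summarise_hits(response):
    """Extract (title, DOI) pairs from a decoded search response.

    Assumes each record carries a top-level "doi" and a "metadata.title";
    missing fields are returned as empty strings.
    """
    hits = response.get("hits", {}).get("hits", [])
    return [
        (hit.get("metadata", {}).get("title", ""), hit.get("doi", ""))
        for hit in hits
    ]


def search_zenodo(query, size=5):
    """Query Zenodo over the network and return (title, DOI) pairs."""
    with urlopen(build_search_url(query, size)) as reply:
        return summarise_hits(json.load(reply))
```

Calling `search_zenodo("neuromuscular junction")` would return the first few matching records. How useful the results are still depends on the metadata that depositors chose to provide.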

(3 min teaching)

Callout

Minimal data set

A minimal data set consists of the data required to replicate all study findings reported in the article, as well as related metadata and methods.

The values behind the means, standard deviations and other measures reported;
The values used to build graphs;
The points extracted from images for analysis.

(no need for raw data if the standard in the field is to share data that have been processed)

PLOS
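As an illustration of "the values behind the means", the sketch below stores every individual measurement in a CSV table rather than only the summary statistics, so that a reader can recompute the reported mean and standard deviation. The group names and numbers are invented for this example, and the column layout is only one possible convention.

```python
import csv
import io
import statistics


def summarise(values):
    """Return the mean and sample standard deviation of the values."""
    return statistics.mean(values), statistics.stdev(values)


def write_values_csv(groups, handle):
    """Write one row per measurement so each summary can be recomputed."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["group", "replicate", "value"])
    for group, values in groups.items():
        for replicate, value in enumerate(values, start=1):
            writer.writerow([group, replicate, value])


# Hypothetical measurements, invented purely for illustration.
groups = {"control": [4.0, 5.0, 6.0], "treated": [7.0, 9.0, 11.0]}

# An in-memory buffer stands in for the file that would be deposited.
buffer = io.StringIO()
write_values_csv(groups, buffer)
```

Depositing the long-format table alongside the article lets anyone rerun `summarise` on each group and check the figures reported in the paper.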

Challenge

Exercise 3. Domain specific repositories

(5 min + 5 min discussion).

Select one of the following repositories based on your expertise/interests:

Have a look at mRNAseq accession ‘E-MTAB-7933’ in ArrayExpress
Have a look at microscopy ‘project-1101’ in IDR
Have a look at the synthetic part record ‘SubtilinReceiver_spaRK_separated’ within the ‘bsu’ collection in SynBioHub
Have a look at the proteomics record ‘PXD013039’ in PRIDE
Have a look at the metabolomics record ‘MTBLS2289’ in MetaboLights
Have a look at the scripts deposit ‘RNA-Seq-validation’ in GitHub

Report to the group what advantages you can see in using a specific repository over a generalist repository like Zenodo.

Solution

Some advantages are:

The repository is more relevant to your discipline than a generalist one.
Higher exposure (people looking for those specific types of data will usually first look at the specific repository).
Higher number of citations (see above).

How do we choose a research data repository?

(3 min teaching) As a general rule, your research should be deposited in a discipline- or data-specific repository. If no specific repository can be found, you can use a generalist repository. That said, there are many data repositories to choose from, and choosing one can be time-consuming and challenging. So how do you go about finding a repository?

Check the publisher’s or funder’s recommended list of repositories, some of which can be found below:
Check FAIRsharing recommendations
Or check the Registry of Research Data Repositories (re3data)

Challenge

Exercise 4: Finding a repository (5 min + 2 min discussion).

Using FAIRsharing (https://fairsharing.org/), find a repository for flow cytometry data.
Once done, search for a repository for genomics data.

Note to instructor: FAIRsharing gives a few options, so people may give different answers. Follow up on why they selected particular ones.

Solution

FlowRepository
GEO/SRA and ENA/ArrayExpress are good examples. Interestingly these repositories do not issue a DOI.

(6 min teaching)

A list of UoE BioRDM’s recommended data repositories can be found here.

What comes first: the repository or the metadata?

Finding a repository first may help in deciding what metadata to collect and how!

Extra features

It is also worth considering that some repositories offer extra features, such as running simulations or providing visualisation. For example, FAIRDOMhub can run model simulations and has project structures. Do not forget to take this into account when choosing your repository. Extra features might come in handy.

Can GitHub be cited?

As GitHub is a tool for version control and live code, the current state of a repository may not always match the version of the code that was described in a publication.

To solve this issue, you should create a snapshot of your repository and deposit it in a general repository to obtain a DOI.

For example, Zenodo can automatically integrate with GitHub and create a new deposit for each of your releases.
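When the Zenodo and GitHub integration is used, the metadata of each deposit can be supplied through a `.zenodo.json` file in the root of the repository. The sketch below generates such a file; it assumes the field names `title`, `description`, `upload_type`, `creators`, `license` and `keywords` from Zenodo's deposit metadata, which should be checked against the current Zenodo documentation. The helper names, project title, author and affiliation are placeholders.

```python
import json


def zenodo_metadata(title, description, creators, license_id, keywords):
    """Assemble a dictionary intended for a `.zenodo.json` file.

    `creators` is a list of (name, affiliation) pairs.
    """
    return {
        "title": title,
        "description": description,
        "upload_type": "software",
        "creators": [
            {"name": name, "affiliation": affiliation}
            for name, affiliation in creators
        ],
        "license": license_id,
        "keywords": list(keywords),
    }


def write_metadata(path, metadata):
    """Write the metadata as indented JSON to the given path."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(metadata, handle, indent=2)
        handle.write("\n")


# Placeholder values, invented purely for illustration.
metadata = zenodo_metadata(
    title="Example analysis scripts",
    description="Scripts used to produce the figures in the paper.",
    creators=[("Doe, Jane", "Example University")],
    license_id="MIT",
    keywords=["RNA-Seq", "analysis"],
)
```

Running `write_metadata(".zenodo.json", metadata)` in the repository root and committing the file before creating a release should give the resulting deposit a citable description rather than one derived only from the repository name.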

Evaluating a research data repository

You can evaluate repositories using these criteria:

who is behind it and how is it funded?
quality of interaction: is the interaction for purposes of data deposit or reuse efficient, effective and satisfactory for you?
take-up and impact: what can I put in it? Is anyone else using it? Will others be able to find stuff deposited in it? Is the repository linked to other data repositories so I don’t have to search there as well? Can anyone reuse the data? Can others cite the data, and will depositing boost citations to related papers?
policy and process: does it help you meet community standards of good practice and comply with policies stipulating data deposit?

An interesting take can be found in Peter Murray-Rust’s blog post Criteria for successful repositories.

Discussion

Exercise 5: Wrap up discussion (5 min).

Discuss the following questions:

Why is choosing a domain-specific repository over Zenodo more FAIR?
How can selecting a repository for your data as soon as you do an experiment (or even before!) benefit your research and help your data become FAIR?
What is your favourite research data repository? Why?

SH

## Attribution
Content of this episode was adapted from or inspired by:
- [FAIR principles](https://www.go-fair.org/fair-principles/)
- [BioRDM suggested data repositories](https://www.wiki.ed.ac.uk/display/RDMS/Suggested+data+repositories)
- [DCC - How can we evaluate data repositories?](https://www.dcc.ac.uk/news/how-can-we-evaluate-data-repositories-pointers-dryaduk)
- [Criteria for successful repositories](https://blogs.ch.cam.ac.uk/pmr/2011/08/19/criteria-for-successful-repositories/)

Key Points

Repositories are the main means for sharing research data.
You should use a data-type-specific repository whenever possible.
Repositories are the key players in data reuse.