Instructor Notes

The content of this course can be delivered on different schedules, in parts, or in a different order. Here we list some possibilities.

Teaching schedules

Six one-hour sessions.

  1. Introduction: what is this course, what is the motivation for including data science practices in research, and what do we mean by data science practices.
  2. Reproducibility, provenance and version control.
  3. Setting up a project and its management tools.
  4. Research data management in a computational project.
  5. Code quality control.
  6. Publication and open science.

One-day workshop (8 hours)

Following the blocks above, but adding ice-breaker and review exercises.

Advertising the course

Content from the introduction (chapters 1-3) may be reused to give a 10-minute overview of the objectives.

  • It might be best to start a workshop with chapter 2: motivation.

Project management, data science


Course content and motivation


Instructor Note

This chapter about motivation can be used to convince people to sign up for the course. It can also be used as a short starter for each lesson.



Ensuring reproducibility


Managing project start and collaborations


Instructor Note

To introduce Kanban boards, one can use different tools. We used Miro both to collect feedback and to introduce Kanban boards. We then had a practical session mixing demo and hands-on work using https://next.forgejo.org/. We chose that tool because it is a lesser-known open-source alternative.



Managing Data


Managing code


Instructor Note

Data wrangling—also called data cleaning, data remediation, or data munging—refers to a variety of processes designed to transform raw data into more readily used formats. The exact methods differ from project to project depending on the data you’re leveraging and the goal you’re trying to achieve.

Some examples of data wrangling include:

  • Merging multiple data sources into a single dataset for analysis,
  • Identifying gaps in data (for example, empty cells in a spreadsheet) and either filling or deleting them,
  • Deleting data that’s either unnecessary or irrelevant to the project you’re working on,
  • Identifying extreme outliers in data and either explaining the discrepancies or removing them so that analysis can take place.
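The steps above can be sketched with pandas. This is a minimal illustration only: the dataframes, column names, and the 0-100 validity range are all made up for the example.

```python
import pandas as pd

# Hypothetical data sources: subject metadata and raw measurements.
subjects = pd.DataFrame({"id": [1, 2, 3],
                         "group": ["control", "treatment", "control"]})
measurements = pd.DataFrame({"id": [1, 2, 3, 3],
                             "value": [10.2, None, 9.8, 250.0],
                             "notes": ["", "sensor glitch", "", ""]})

# 1. Merge multiple data sources into a single dataset.
data = measurements.merge(subjects, on="id")

# 2. Identify gaps in the data (empty cells) and delete those rows.
data = data.dropna(subset=["value"])

# 3. Delete a column that is irrelevant to the analysis.
data = data.drop(columns=["notes"])

# 4. Flag extreme outliers; here we assume (for this made-up
#    instrument) that values outside 0-100 are implausible.
clean = data[data["value"].between(0, 100)]
print(clean)
```

In a real project each of these choices (how to fill gaps, which outlier rule to use) should be documented, since they affect the analysis results.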


Instructor Note

Case Study

A postdoc wrote a helpful series of functions for data analysis with neurophysiology recordings. The postdoc wrote them to be reusable, so two PhD students copied and pasted these blocks of code into their own code and used them to analyse the data for their projects.

The postdoc later discovers a better way of writing the functions. One PhD student also wants to change to the new method and so has to search through his files to replace the code. The other PhD student wants the old method in some files and the new method in others, and so does not change all of them. It is therefore difficult to track the differences in methods across the projects, and the process is very prone to typos and copy-and-paste errors.

Instead, the postdoc could have saved the functions in the lab’s private repository, which becomes the master copy that the students pull from. With the functions saved in a library, the PhD students can import them into their scripts. Now when the postdoc changes the functions and saves them to the repository, the PhD students can choose to update their version of the functions. The students should document which version they have used.
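A runnable sketch of this shared-library approach: the lab’s functions live in one module that every script imports, instead of each student keeping a pasted copy, and each script records which version it used. The module name labtools, its function, and the version number are all invented for illustration (here the module is written to disk just so the example is self-contained; in practice it would be installed from the lab repository).

```python
import importlib
import pathlib
import sys

# Hypothetical shared module maintained by the postdoc in the lab repo.
shared = pathlib.Path("labtools.py")
shared.write_text(
    '__version__ = "1.2.0"\n'
    "def mean_rate(spike_counts, duration_s):\n"
    '    """Mean firing rate in spikes per second."""\n'
    "    return sum(spike_counts) / duration_s\n"
)

# Each student's script imports the one shared copy...
sys.path.insert(0, str(shared.parent))
labtools = importlib.import_module("labtools")

# ...and documents which version was used for the analysis.
print("analysed with labtools", labtools.__version__)
print(labtools.mean_rate([5, 7, 6], duration_s=3.0))
```

When the postdoc improves mean_rate and tags a new version, each student updates (or deliberately pins) their copy in one place instead of hunting through pasted blocks.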



Managing publication


Instructor Note

Some of these servers are Zenodo, FigShare, Data Dryad (for data), Open Grants (for grant proposals) and Open Science Framework (OSF) (for different components of an open research project). They allow you to show connections between different parts of your research, as well as to cite different objects from your work independently.

When working on GitHub, for instance, you can connect the project repository with Zenodo to get a DOI for your repository. The Citation File Format then lets you provide citation metadata for software or datasets in plain-text files that are easy for both humans and machines to read.
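For example, a minimal CITATION.cff file placed at the root of the repository could look like the following (the title, author name, version, date, and DOI are placeholders, not real records):

```yaml
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: "Example analysis toolbox"
authors:
  - family-names: "Doe"
    given-names: "Jane"
version: 1.0.0
date-released: "2024-01-15"
doi: 10.5281/zenodo.1234567
```

GitHub detects this file and offers a "Cite this repository" button, and tools can convert it to BibTeX or other citation formats.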



Zenodo

Example of a Zenodo entry synchronised with GitHub.

Zenodo is a general-purpose open-access repository developed under the European OpenAIRE program and operated by CERN. It allows researchers to deposit research papers, data sets, research software, reports, and any other research related digital artefacts.

Uploads to Zenodo are:

  • Safe — your research is stored safely for the future in CERN’s Data Centre for as long as CERN exists.
  • Trusted — built and operated by CERN and OpenAIRE to ensure that everyone can join in Open Science.
  • Citeable — every upload is assigned a Digital Object Identifier (DOI), to make it citable and trackable.
  • No waiting time — uploads are made available online as soon as you hit publish, and your DOI is registered within seconds.
  • Open or closed — Share e.g. anonymized clinical trial data with only medical professionals via our restricted access mode.
  • Versioning — Easily update your dataset with our versioning feature.
  • GitHub integration — Easily preserve your GitHub repository in Zenodo.
  • Usage statistics — all uploads display standards-compliant usage statistics.

The following goes into more detail about collaborative code; it may be useful if collaboration is the main topic of your workshop.

Collaborative Open Code


Downloading code and data files from Zenodo or other open-access repositories can be useful when someone wants to review the final outcome of your computational work. However, with an open GitHub repository, sharing code becomes much more collaborative and happens in real time.


Uploading work-in-progress code to an open GitHub repository is one of the most widely used methods for programming collaboration.

As you develop a tool or methodology, users have the ability to use your code while it is a work in progress and others can contribute or add features.


When using R specifically, you could release R packages on CRAN, where anyone can then download and use your code.



Conclusion and feedback