Project management, data science


  • Project management includes resource, team, data, communication and risk management.
  • Data science involves large and diverse teams, often working remotely, that use and reuse code to achieve reproducible analyses.

Course content and motivation


  • This course will help you use digital tools to manage teamwork that involves coding.
  • One objective is to produce reproducible analyses and more transparent research.
  • The course introduces senior biomedical researchers to Managing Open and Reproducible Computational Projects, equipping them with tools and techniques to create, document and manage complex, reproducible computational projects.

  • Without going into underlying technical details, the contents provide a general overview and present selected relevant biomedical case studies.

Ensuring reproducibility


  • Provenance information and software dependency management are needed to reproduce an analysis in the medium term (3 years).
  • Information about the version of the code and of the data used to create an analysis should be captured.
  • Git is software developed to facilitate version control of text files.

  • Version controlled repositories help record different contributions and contributor information openly.
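
As an illustration of capturing the code and data versions behind an analysis, the following is a minimal sketch rather than a prescribed method. It assumes the analysis is run from inside a Git repository and that data files can be identified by a SHA-256 checksum. The function names (`file_sha256`, `current_commit`, `provenance_record`, `write_provenance`) are hypothetical and chosen only for this example.

```python
import hashlib
import json
import platform
import subprocess
from pathlib import Path


def file_sha256(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def current_commit():
    """Return the current Git commit hash, or None outside a repository."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # Git is not installed, or this directory is not a repository.
        return None
    return result.stdout.strip()


def provenance_record(data_paths):
    """Collect the code version, data checksums and interpreter version."""
    return {
        "git_commit": current_commit(),
        "python_version": platform.python_version(),
        "data_sha256": {str(p): file_sha256(p) for p in data_paths},
    }


def write_provenance(record, out_path):
    """Save the record as JSON next to the analysis outputs."""
    Path(out_path).write_text(json.dumps(record, indent=2), encoding="utf-8")
```

Calling `write_provenance(provenance_record(["data/raw/measurements.tsv"]), "provenance.json")` at the end of an analysis script would store the record alongside the results. Both file names here are placeholders. Recording package versions (for example through a dependency lock file) would complement this sketch.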

Managing project start and collaborations


  • A shared repository with well-structured and organised files is crucial for starting a project.
  • Documentation is as important as data and code to understand the different aspects of the project and communicate about the research.
  • Licensing and open science practices allow proper use and reuse of all research objects, and should therefore be applied in computational research from the start.
  • Make group leaders familiar with practices that are crucial for their teams to develop reproducible code.
  • Encourage researchers to think about code reproducibility through quality checks, testing, and sharing their code as well as their research environment.
  • Introduce Continuous Integration for automating the testing process.
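
To make the idea of automated testing more concrete, here is a small sketch of the kind of test a Continuous Integration service could run on every change. The `normalise` function is a hypothetical stand-in for a piece of analysis code. The tests use plain `assert` statements, which a test runner such as pytest can collect when they live in functions whose names start with `test_`.

```python
def normalise(values):
    """Scale a list of numbers so that they sum to 1."""
    total = sum(values)
    if total == 0:
        raise ValueError("cannot normalise values that sum to zero")
    return [v / total for v in values]


def test_normalise_sums_to_one():
    # A known input with an exactly representable expected output.
    assert normalise([1, 1, 2]) == [0.25, 0.25, 0.5]


def test_normalise_rejects_zero_total():
    # The function should fail loudly instead of dividing by zero.
    try:
        normalise([0, 0])
    except ValueError:
        return
    raise AssertionError("expected ValueError for a zero total")
```

Once tests like these exist, the CI configuration only needs to install the project and run the test command. The value lies in the tests running automatically, not in any particular CI provider.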

Managing Data


  • The data management plan is produced by the whole team.
  • All information is present and digitised
  • Raw data is kept raw (stored unmodified)
  • Spreadsheets are tidy, validated and in text format (.tsv)
  • Data is safe (backed up)
  • Data is FAIR (findable, accessible, interoperable and reusable)
  • Data can be opened and analysed in a programming language
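
The points about validated, text-format spreadsheets that can be opened in a programming language can be sketched as follows. This is one possible check, not a complete validation scheme. It assumes a tab-separated file with a header row, and the function name `validate_tsv` and the column names in the usage note are hypothetical.

```python
import csv


def validate_tsv(path, required_columns):
    """Return a list of human-readable problems found in a TSV file."""
    problems = []
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader, None)
        if header is None:
            return ["file is empty"]
        # Every required column must appear in the header row.
        for name in required_columns:
            if name not in header:
                problems.append(f"missing column: {name}")
        # Every data row must have one non-empty cell per header column.
        for row in reader:
            line = reader.line_num
            if len(row) != len(header):
                problems.append(
                    f"line {line}: expected {len(header)} fields, found {len(row)}"
                )
            elif any(cell.strip() == "" for cell in row):
                problems.append(f"line {line}: empty cell")
    return problems
```

For example, `validate_tsv("samples.tsv", ["sample_id", "weight_g"])` returns an empty list when the file passes these checks. Only the standard library is used, so the check can run anywhere Python is installed.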

Managing code


  • Can you follow the analysis workflow without reading the code?
  • Is the code well commented and structured?
  • Are the figures created accessible?
  • Is the statistical analysis well founded?
  • Code review has many benefits and should be built into research team culture as early and as frequently as possible.
  • Synchronous code review creates opportunities for researchers to get feedback and learn from others in real-time.
  • Asynchronous code review is a good practice when working with busy researchers or collaborators in different time zones.

Managing publication


  • Online persistent identifiers (PIDs), such as Digital Object Identifiers (DOIs), are useful for releasing and citing different versions of research objects.
  • Planning for publication will change your workflow; making these decisions early and allocating time to implement them is key.
  • Cite code and data

Conclusion and feedback