Data Management
Last updated on 2024-11-19
Overview
Questions
- What are the benefits of writing a living Data Management Plan?
- Why is it a good idea to think of data stewardship?
Objectives
- Demonstrate writing a living DMP using a word processor or DMPTool.org.
- Understand the importance of talking about data stewardship.
Introduction
While this whole training covers data management activities, this section introduces two useful areas: a living Data Management Plan, and the roles and responsibilities around data management among research collaborators, known as Data Stewardship.
Living Data Management Plan (DMP)
Many researchers are aware of the two-page data management plan (DMP) for a grant application, but you may not be aware of the more useful type of DMP: a living DMP. This document describes how data will be actively managed during a project and may be updated whenever necessary to reflect current data practices. A living DMP is a useful touchstone for understanding where data lives, how it’s labelled, how it moves through the research process, and who will oversee the data management.
For DMPs, UC has partnered with DMPTool, a free service for researchers that you can log in to with your UC credentials through single sign-on (SSO). Many online resources and videos explain how to use DMPTool, including a 13-minute video by Carnegie Mellon University Libraries that demonstrates DMPTool and how to fill out its forms.
Challenge
Pick a project and answer the following questions to build your living DMP. This DMP may be changed at any time to improve practices. If you are doing collaborative research, work through this exercise with your collaborators to agree on shared conventions. Feel free to use the DMPTool service; on its initial screen you can indicate that the DMP you are creating is a test.
Write a short summary of the project this DMP is for:
Example: This project uses mass spectrometry to identify isotopic composition of soil samples.
Where will data be stored? How will data be backed up? (See Exercise 4.1: Pick Storage and Backup Systems.)
Example: The data is generated on the mass spectrometer then copied to a shared lab server. The server is backed up by departmental IT.
How will you document your research? Where will your research notes be stored?
Example: Data collection and analysis is primarily documented in a laboratory notebook, organized by date. README.txt files add documentation to the digital files as needed.
How will your data be organized? (See Exercise 3.1: Set Up a File Organization System.)
Example: Each researcher has their own folder on the shared server. Data within my folder is organized in folders by sample site with subfolders labeled by sample ID. Sample ID consists of: two-letter sample site code, three-digit sample number, and date of sample collection formatted as YYYYMMDD (e.g. “MA006-20230901” and “CB012-20100512”).
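If you want to check sample IDs automatically, the convention in this example can be sketched in a few lines of Python. This is only an illustration, not part of the exercise: the function name `parse_sample_id` and the regular expression are assumptions based on the two example IDs above, and your own convention may differ.

```python
import re
from datetime import datetime

# Assumed pattern, inferred from the example IDs: two uppercase letters
# (site code), three digits (sample number), a hyphen, and eight digits
# (collection date as YYYYMMDD).
SAMPLE_ID_PATTERN = re.compile(r"([A-Z]{2})([0-9]{3})-([0-9]{8})")


def parse_sample_id(sample_id):
    """Split a sample ID into site code, sample number, and collection date.

    Raises ValueError if the ID does not follow the assumed convention.
    """
    match = SAMPLE_ID_PATTERN.fullmatch(sample_id)
    if match is None:
        raise ValueError(f"Sample ID does not follow the convention: {sample_id!r}")
    site, number, date_text = match.groups()
    # strptime raises ValueError for impossible dates such as 20231345.
    collected = datetime.strptime(date_text, "%Y%m%d").date()
    return {"site": site, "number": int(number), "collected": collected}


print(parse_sample_id("MA006-20230901"))
```

A check like this catches small slips, such as a stray space or a misformatted date, before they spread through folder names.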
What naming convention(s) will you use for your data? (See Exercise 3.2: Create a File Naming Convention.)
Example: Files will be named with the sample ID, type of measurement, and stage in the analysis process; these pieces of information will be separated by underscores. Examples: “MA006-20230901_TIMS_raw” and “CB012-20100512_SIMS_analyzed”.
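A naming convention like this one is easy to apply consistently with a small helper. The sketch below is illustrative only: the function names `build_file_name` and `parse_file_name` are hypothetical, and the rule that no part may contain an underscore is an assumption that keeps the names unambiguous to split.

```python
def build_file_name(sample_id, measurement, stage):
    """Join sample ID, measurement type, and analysis stage with underscores."""
    parts = [sample_id, measurement, stage]
    for part in parts:
        # Assumption: underscores are reserved as separators, so no part
        # may be empty or contain one.
        if not part or "_" in part:
            raise ValueError(f"Invalid name part: {part!r}")
    return "_".join(parts)


def parse_file_name(file_name):
    """Split a file name (with or without an extension) back into its parts."""
    stem = file_name.split(".", 1)[0]
    parts = stem.split("_")
    if len(parts) != 3:
        raise ValueError(f"File name does not follow the convention: {file_name!r}")
    sample_id, measurement, stage = parts
    return {"sample_id": sample_id, "measurement": measurement, "stage": stage}


print(build_file_name("MA006-20230901", "TIMS", "raw"))
# prints MA006-20230901_TIMS_raw
print(parse_file_name("CB012-20100512_SIMS_analyzed"))
```

Because the sample ID uses a hyphen internally and the file name uses underscores between parts, the two conventions can be combined without ambiguity.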
Do you need to do any version control on your files? How will that be done?
Example: Version control will be very simple through file naming, appending analysis information onto the end of file names to keep track of which version of the file it is.
How will data move through the collection and analysis pipelines?
Example: Once data is collected on the mass spectrometer, I will copy it to the correct folder on the shared server for analysis. Data will stay in its sample ID-labeled folder as it gets analyzed, with different file names to annotate analysis stage. Data that will be published will be copied into separate folders, organized by article.
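The copy step in this example could be scripted so that files always land in the right folder. The following is a minimal sketch under several assumptions: the folder layout (researcher, then site code, then sample ID) is inferred from the organization example earlier in this exercise, the site code is taken to be the first two characters of the sample ID, and the names `destination_folder`, `copy_to_server`, and the example server path are hypothetical.

```python
import shutil
from pathlib import Path


def destination_folder(server_root, researcher, sample_id):
    """Build the folder path: server root / researcher / site code / sample ID."""
    # Assumption: the site folder is named after the two-letter site code
    # at the start of the sample ID.
    site_code = sample_id[:2]
    return Path(server_root) / researcher / site_code / sample_id


def copy_to_server(source_file, server_root, researcher, sample_id):
    """Copy one data file into its sample ID folder, creating folders as needed."""
    folder = destination_folder(server_root, researcher, sample_id)
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / Path(source_file).name
    # Refuse to overwrite, so that raw data on the server is never replaced.
    if target.exists():
        raise FileExistsError(f"File already exists on the server: {target}")
    shutil.copy2(source_file, target)  # copy2 also preserves file timestamps
    return target


# Hypothetical server path and researcher folder, for illustration only.
print(destination_folder("/srv/lab_server", "researcher_a", "MA006-20230901"))
```

Refusing to overwrite existing files is a design choice here: it matches the idea that each analysis stage gets its own file name rather than replacing an earlier version.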
Record any project roles and responsibilities around data management:
Example: It is each researcher’s responsibility to ensure that data moves through the analysis pipeline and is labeled correctly. The lab manager will ensure that the shared server stays organized and will periodically check that backups are working.
Record any other details on how data will be managed:
Example: Copies of this DMP will live in my top-level folder on the lab server so that others can find and use my data as needed.
Determine Data Stewardship
It is often helpful to be up front about requirements and permissions around research data. This exercise encourages you to discuss these issues with supervisors and peers to make sure that there are no misunderstandings about who has what rights to use, retain, and share data. Many of these will be addressed in the living DMP, but we will separate them into their own category here.
Challenge
Determine which research data should be discussed. Bring together the Principal Investigator, the researcher collecting the data, and anyone else who works with that data. As a group, answer the questions in the exercise, making sure that everyone agrees on the final decisions. Record the results of the discussion and save them with the project files.
Who is participating in this discussion?
Example: This discussion includes the graduate student who collected the data, the project Principal Investigator (PI), and the laboratory manager.
What data is being discussed?
Example: This discussion covers all of the data collected by the graduate student during their time at the university.
Are there security or privacy restrictions on the data and, if so, what are they?
Example: Some of the research data includes human subjects data. This data must be held securely with limited data sharing, as outlined in the IRB protocols.
Are there intellectual property limitations on the data and, if so, what are they?
Example: There are no intellectual property concerns for the data.
Are there any requirements to publicly share the data and, if so, what are they?
Example: This research was funded by the NIH, which requires data sharing. The laboratory plans to share all data reproducing published results with the exception of the human subjects data.
Who will store the copy of record of the data and for how long?
Example: The project PI will retain the copy of record of the data for at least 3 years after the end of the grant award, with an ideal 10-year retention period.
Who is allowed to keep a copy of the data after the project ends? Which data?
Example: The graduate student may keep a copy of all data except the human subjects data after they leave the university.
Who is allowed to reuse the data after the project ends? Which data? Are there any requirements for reuse, such as co-authorship?
Example: The graduate student may reuse and publish with the data collected during their time at the university but must offer co-authorship of any papers using the data to the project PI and any relevant lab members.
Who keeps any physical research notebooks after the project ends?
Example: The PI will keep all physical laboratory notebooks but the graduate student may make copies to retain for their personal records.
Key Points
- A living DMP is useful for understanding where data lives, how it’s labelled, how it moves through the research process, and who will oversee the data management.
- Data stewardship lays out who has what rights to use, retain, and share data.