Data Readiness: Data Format and Integrity
Last updated on 2024-05-14 | Edit this page
Overview
Questions
- What format do I require my data in to build a supervised machine learning classifier?
- What are the potential data format and data integrity issues I will encounter with RNA-Seq datasets, including those downloaded from public repositories, that I’ll need to address before beginning any analysis?
Objectives
- Describe the format of a dataset required as input to a simple supervised machine learning analysis
- Recall a checklist of common formatting and data integrity issues encountered with sample information (metadata) in RNA-Seq datasets
- Recall a checklist of common formatting and data integrity issues encountered with counts matrix RNA-Seq datasets
- Recall why formatting and data integrity issues present a problem for downstream ML analysis
- Apply the checklist to an unseen dataset downloaded from either the Array Express or GEO platform to identify potential issues
- Construct a reusable R code pipeline implement required changes to a new dataset
Required format for machine learning libraries
Data must be correctly formatted for use in a machine learning / AI pipeline. The garbage in garbage out principle applies; a machine learning model is only as good as the input data. This episode first discusses the kind of data required as input to a machine learing classification model. We’ll then illustrates the process of formatting an RNA-Seq dataset downloaded from a public repository. On the way, we’ll define a checklist of data integrity issues to watch out for.
For a supervised machine learning classification task, we require:
A matrix of values of all of the predictor variables to be included in a machine learning model for each of the samples. Example predictor variables would be gene abundance in the form of read counts. Additional predictor variables may be obtained from the sample information including demographic data (e.g.
sex
,age
) and clinical and laboratory measures.A vector of values for the target variable that we are trying to predict for each sample. Example target variables for a classification model would be disease state, for example ‘infected with tuburculosis’ compared with ‘healthy control’. The analogy for a regression task would be a continuous variable such as a disease progression or severity score.
We will follow a series of steps to reformat and clean up the sample information and counts matrix file for the IBD dataset collected form ArrayExpress in Episode 2 by applying a checklist of data clean up items. If required, the data can be downloaded again by running this code.
R
download.file(url = "https://zenodo.org/record/8125141/files/E-MTAB-11349.sdrf.txt",
destfile = "data/E-MTAB-11349.sdrf.txt")
download.file(url = "https://zenodo.org/record/8125141/files/E-MTAB-11349.counts.matrix.csv",
destfile = "data/E-MTAB-11349.counts.matrix.csv")
We’ll start by reading in the dataset from our data
subfolder.
R
samp.info.ibd <- read.table(file="./data/E-MTAB-11349.sdrf.txt", sep="\t", header=T, fill=T, check.names=F)
raw.counts.ibd <- read.table(file="./data/E-MTAB-11349.counts.matrix.csv", sep="," , header=T, fill=T, check.names=F)
IDB dataset: Sample Information Clean Up
In real RNA-Seq experiments, sample information (metadata) has likely been collected from a range of sources, for example medical records, clinical and laboratory readings and then manually compiled in a matrix format, often a spreadsheet, before being submitted to a public respository. Each of these steps has the potential to introduce errors and inconsistency in the data that must be investigated prior to any downstream machine learning analysis. Whether you are working with sample information received directly from co-workers or downloaded from a public repsository, the following guidelines will help identify and resolve issues that may impact downstream analysis.
Sample Information Readiness Checklist
Item | Check For… | Rationale |
---|---|---|
1 | Unique identifier matching between counts matrix and sample information | Target variables must be matched to predictor variables for each sample. Target variables are the thing we are trying to predict such as disease state. Predictor variables are the things we are going to predict based on, such as the expression levels of particular genes. |
2 | Appropriate target variable present in the data | The sample information needs to contain the target variable you are trying to predict with the machine learning model |
3 | Unique variable names, without spaces or special characters | Predictor variables may be drawn from the sample information. ML algorithms require unique variables. Spaces etc. may cause errors with bioinformatics and machine learning libraries used downstream |
4 | Machine readable and consistent encoding of categorical variables | Inconsistent spellings between samples will be treated as separate values. Application specific formatting such as colour coding and notes in spreadsheets will be lost when converting to plain text formats. |
5 | Numerical class values formatted as a factor (for classification) | Some machine learning algorithms/libraries (e.g. SVM) require classes defined by a number (e.g. -1 and +1 ) |
6 | Balance of classes in the target variable (for classification) | Imbalanced datasets may negatively impact results of machine learning algorithms |
We’ll now apply these steps sequentially to the sample information
for the IBD dataset, contained in our variable
samp.info.ibd
. First let’s take a look at the data:
R
dplyr::glimpse(samp.info.ibd)
OUTPUT
Rows: 590
Columns: 32
$ `Source Name` <chr> "Sample 1", "Sample 2", "Sam…
$ `Characteristics[organism]` <chr> "Homo sapiens", "Homo sapien…
$ `Characteristics[age]` <int> 34, 49, 27, 9, 34, 60, 54, 2…
$ `Unit[time unit]` <chr> "year", "year", "year", "yea…
$ `Term Source REF` <chr> "EFO", "EFO", "EFO", "EFO", …
$ `Term Accession Number` <chr> "UO_0000036", "UO_0000036", …
$ `Characteristics[developmental stage]` <chr> "adult", "adult", "adult", "…
$ `Characteristics[sex]` <chr> "male", "female", "male", "m…
$ `Characteristics[individual]` <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 1…
$ `Characteristics[disease]` <chr> "Crohns disease\tblood\torga…
$ `Characteristics[organism part]` <chr> "", "blood", "blood", "blood…
$ `Material Type` <chr> "", "organism part", "organi…
$ `Protocol REF` <chr> "", "P-MTAB-117946", "P-MTAB…
$ `Protocol REF` <chr> "", "P-MTAB-117947", "P-MTAB…
$ `Protocol REF` <chr> "", "P-MTAB-117948", "P-MTAB…
$ `Extract Name` <chr> "", "Sample 2", "Sample 3", …
$ `Material Type` <chr> "", "RNA", "RNA", "RNA", "RN…
$ `Comment[LIBRARY_LAYOUT]` <chr> "", "PAIRED", "PAIRED", "PAI…
$ `Comment[LIBRARY_SELECTION]` <chr> "", "other", "other", "other…
$ `Comment[LIBRARY_SOURCE]` <chr> "", "TRANSCRIPTOMIC", "TRANS…
$ `Comment[LIBRARY_STRATEGY]` <chr> "", "Targeted-Capture", "Tar…
$ `Protocol REF` <chr> "", "P-MTAB-117950", "P-MTAB…
$ Performer <chr> "", "Wellcome Trust Clinical…
$ `Assay Name` <chr> "", "Sample 2", "Sample 3", …
$ `Technology Type` <chr> "", "sequencing assay", "seq…
$ `Protocol REF` <chr> "", "P-MTAB-117949", "P-MTAB…
$ `Derived Array Data File` <chr> "", "ArrayExpress-normalized…
$ `Comment [Derived ArrayExpress FTP file]` <chr> "", "ftp://ftp.ebi.ac.uk/pub…
$ `Protocol REF` <chr> "", "P-MTAB-117949", "P-MTAB…
$ `Derived Array Data File` <chr> "", "ArrayExpress-raw.csv", …
$ `Comment [Derived ArrayExpress FTP file]` <chr> "", "ftp://ftp.ebi.ac.uk/pub…
$ `Factor Value[disease]` <chr> "", "normal", "normal", "nor…
All of the checklist items apply in this case. Let’s go through them…
- Let’s verify that there is a unique identifier that matches between
the sample information and counts matrix, and identify the name of the
column in the sample information. We manually find the names of the
samples in
raw.counts.ibd
(the counts matrix), and set to a new variableibd.samp.names
. We then write afor loop
that evaluates which columns in the sample information match the sample names. There is one match, the column namedSource Name
. Note that there are other columns such asAssay Name
in this dataset that contain identifiers for some but not all samples. This is a good illustration of why it is important to check carefully to ensure you have a complete set of unique identifiers.
R
ibd.samp.names <- colnames(raw.counts.ibd)[3:ncol(raw.counts.ibd)]
lst.colnames <- c()
for(i in seq_along(1:ncol(samp.info.ibd))){
lst.colnames[i] <- all(samp.info.ibd[,i] == ibd.samp.names)
}
sprintf("The unique IDs that match the counts matrix are in column: %s", colnames(samp.info.ibd)[which(lst.colnames)])
OUTPUT
[1] "The unique IDs that match the counts matrix are in column: Source Name"
- We next extract this unique identifier along with the condition of
interest, (in this case the
disease
), and any other variables that may be potential predictor variables or used to assess confounding factors in the subsequent analysis of the data (in this case patientsex
andage
). We will use functions from thedplyr
package, part of thetidyverse
collection of packages throughout this episode.
R
samp.info.ibd.sel <- dplyr::select(samp.info.ibd,
'Source Name',
'Characteristics[age]',
'Characteristics[sex]',
'Characteristics[disease]'
)
- Let’s then rename the variables to something more easy to interpret
for humans, avoiding spaces and special characters. We’ll save these
selected columns to a new variable name
samp.info.ibd.sel
.
R
samp.info.ibd.sel <- dplyr::rename(samp.info.ibd.sel,
'sampleID' = 'Source Name',
'age' = 'Characteristics[age]',
'sex' = 'Characteristics[sex]',
'condition' = 'Characteristics[disease]'
)
- Now check how each of the variables are encoded in our reduced sample information data to identify any errors, data gaps and inconsistent coding of categorical variables. Firstly let’s take a look at the data.
R
dplyr::glimpse(samp.info.ibd.sel)
OUTPUT
Rows: 590
Columns: 4
$ sampleID <chr> "Sample 1", "Sample 2", "Sample 3", "Sample 4", "Sample 5", …
$ age <int> 34, 49, 27, 9, 34, 60, 54, 24, 39, 41, 38, 46, 7, 46, 21, 35…
$ sex <chr> "male", "female", "male", "male", "female", "female", "femal…
$ condition <chr> "Crohns disease\tblood\torganism part\tP-MTAB-117946\tP-MTAB…
- There is clearly an issue with the coding of the
condition
Crohn’s disease. We’ll fix that first, and then check the consistency of the coding of categorical variablessex
andcondition
.
R
samp.info.ibd.sel$condition[agrep("Crohns", samp.info.ibd.sel$condition)] <- "crohns_disease"
unique(c(samp.info.ibd.sel$sex, samp.info.ibd.sel$condition))
OUTPUT
[1] "male" "female" "crohns_disease"
[4] "normal" "ulcerative colitis"
- Both
sex
andcondition
are consistently encoded over all samples. We should however remove the spaces in the values forsampleID
and for thecondition
value ulcerative colitis. Note: since we are changing our unique identifier, we’ll need to make the same update later in the counts matrix.
Before we do this, since we’ll be using the pipe operator from the tidyverse libraries, we will to define the pipe operator from the magrittr package in the environment.
R
`%>%` <- magrittr::`%>%`
samp.info.ibd.sel <- samp.info.ibd.sel %>%
dplyr::mutate(sampleID = gsub(" ", "_", sampleID),
condition = gsub(" ", "_", condition))
- As some machine learning classification algorithms such as support
vector machines require a numerical input for each class, we’ll add in a
new numerical column to represent our target variable called
class
where all patients denoted asnormal
are given the value -1, and all patients denoted as eithercrohns_disease
orulcerative_colitis
are given the vale +1. We’ll also now convert the type of categorical columns to factors as this is preferred by downstream libraries used for data normalisation such as DESeq2.
R
samp.info.ibd.sel <- samp.info.ibd.sel %>%
dplyr::mutate(class = dplyr::if_else(condition == 'normal', -1, 1))
R
samp.info.ibd.sel[c('sex', 'condition', 'class')] <- lapply(samp.info.ibd.sel[c('sex', 'condition', 'class')], factor)
- Finally, let’s check the distribution of the classes by creating a
table
from theclass
column.
R
table(samp.info.ibd.sel$class)
OUTPUT
-1 1
267 323
The two classes are approximately equally represented, so let’s check everything one last time.
R
dplyr::glimpse(samp.info.ibd.sel)
OUTPUT
Rows: 590
Columns: 5
$ sampleID <chr> "Sample_1", "Sample_2", "Sample_3", "Sample_4", "Sample_5", …
$ age <int> 34, 49, 27, 9, 34, 60, 54, 24, 39, 41, 38, 46, 7, 46, 21, 35…
$ sex <fct> male, female, male, male, female, female, female, male, male…
$ condition <fct> crohns_disease, normal, normal, normal, normal, ulcerative_c…
$ class <fct> 1, -1, -1, -1, -1, 1, 1, -1, 1, 1, -1, 1, -1, -1, -1, 1, -1,…
And save the output as a file.
R
write.table(samp.info.ibd.sel, file="./data/ibd.sample.info.txt", sep = '\t', quote = FALSE, row.names = TRUE)
IDB dataset: Counts Matrix Clean Up
Count Matrix Readiness Checklist
Item | Check For… | Rationale |
---|---|---|
1 | Extraneous information that you don’t want to include as a predictor variable | Blank rows, alternative gene/ transcript IDs and metadata about the genes/transcripts will be treated as a predictor variable unless you explicitly remove them |
2 | Unique variable names, without spaces or special characters | ML algorithms require unique variables. Spaces etc. may cause errors with bioinformatics and ml libraries used downstream (and they must match the sample information) |
3 | Duplicate data | Duplicate variables may bias results of analysis and should be removed unless intentional |
4 | Missing values | Missing values may cause errors and bias results. Missing values must be identified and appropriate treatment determined (e.g. drop vs impute) |
- Take a look at a sample of columns and rows to see what the downloaded file looks like
R
raw.counts.ibd[1:10,1:8]
OUTPUT
read Sample 1 Sample 2 Sample 3 Sample 4 Sample 5 Sample 6
1 1 * 13961 16595 20722 17696 25703 20848
2 2 ERCC-00002 0 0 0 0 0 0
3 3 ERCC-00003 0 0 0 0 0 0
4 4 ERCC-00004 0 0 0 0 3 0
5 5 ERCC-00009 0 0 0 0 0 0
6 6 ERCC-00012 0 0 0 0 0 0
7 7 ERCC-00013 0 0 0 0 0 0
8 8 ERCC-00014 0 0 0 0 0 0
9 9 ERCC-00016 0 0 0 0 0 0
10 10 ERCC-00017 2 0 0 0 1 0
- Remove the first row, which is in this case totals of counts for each sample
- Remove the first columns, which just numbers the transcript IDs
- Move the column named
read
that contains to the transcript IDs to the row names
R
counts.mat.ibd <- raw.counts.ibd[-1,-1]
rownames(counts.mat.ibd) <- NULL
counts.mat.ibd <- counts.mat.ibd %>% tibble::column_to_rownames('read')
counts.mat.ibd[1:10,1:6]
OUTPUT
Sample 1 Sample 2 Sample 3 Sample 4 Sample 5 Sample 6
ERCC-00002 0 0 0 0 0 0
ERCC-00003 0 0 0 0 0 0
ERCC-00004 0 0 0 0 3 0
ERCC-00009 0 0 0 0 0 0
ERCC-00012 0 0 0 0 0 0
ERCC-00013 0 0 0 0 0 0
ERCC-00014 0 0 0 0 0 0
ERCC-00016 0 0 0 0 0 0
ERCC-00017 2 0 0 0 1 0
ERCC-00019 0 0 0 0 0 0
- Let’s check for duplicate sampleIDs and transcript IDs. Provided
there are no duplicate row or column names, the following should return
interger(0)
.
R
which(duplicated(rownames(counts.mat.ibd)))
OUTPUT
integer(0)
R
which(duplicated(colnames(counts.mat.ibd)))
OUTPUT
integer(0)
- We’ll double check the sampleIDs match the sample information file.
R
if(!identical(colnames(counts.mat.ibd), samp.info.ibd.sel$sampleID)){stop()}
ERROR
Error in eval(expr, envir, enclos):
We renamed the samples in the sample information to remove spaces, so we need to do the same here.
R
colnames(counts.mat.ibd) <- gsub(x = colnames(counts.mat.ibd), pattern = "\ ", replacement = "_")
- Finally, we’ll check that there are no missing values in the counts
matrix. Note, there will be many zeros, but we are ensuring that there
are no
NA
values or black records. The following should returnFALSE
.
R
allMissValues <- function(x){all(is.na(x) | x == "")}
allMissValues(counts.mat.ibd)
OUTPUT
[1] FALSE
Take a final look at the cleaned up matrix.
R
counts.mat.ibd[1:10,1:6]
OUTPUT
Sample_1 Sample_2 Sample_3 Sample_4 Sample_5 Sample_6
ERCC-00002 0 0 0 0 0 0
ERCC-00003 0 0 0 0 0 0
ERCC-00004 0 0 0 0 3 0
ERCC-00009 0 0 0 0 0 0
ERCC-00012 0 0 0 0 0 0
ERCC-00013 0 0 0 0 0 0
ERCC-00014 0 0 0 0 0 0
ERCC-00016 0 0 0 0 0 0
ERCC-00017 2 0 0 0 1 0
ERCC-00019 0 0 0 0 0 0
R
sprintf("There are %i rows, corresponding to the transcript IDs", dim(counts.mat.ibd)[1])
OUTPUT
[1] "There are 22750 rows, corresponding to the transcript IDs"
R
sprintf("There are %i columns, corresponding to the samples", dim(counts.mat.ibd)[2])
OUTPUT
[1] "There are 590 columns, corresponding to the samples"
And save the output as a file.
R
write.table(counts.mat.ibd, file="./data/counts.mat.ibd.txt", sep = '\t', quote = FALSE, row.names = TRUE)
Document your reformatting steps
It is crucial that you document your data reformatting so that it is reproducible by other researchers. Create a project folder for your data processing and include:
- Raw input data files collected from ArrayExpress, GEO or similar
- The output counts matrix and target variable vector, stored in a common format such at .txt format
- The full R scripts used to perform all the data clean up steps. Store the script with the input and output data.
- A readme file that explains the steps taken, and how to run the script on the inputs to generate the outputs. This should also include details of all software versions used (e.g. R version, RStudio or Jupyter Notebooks version, and the versions of any code libraries used)
The TB Dataset: Another Example
Discussion
Take a look at The
Tuburculosis (TB) Dataset - E-MTAB-6845. You can download the sdrf
file from zenodo and save it to your data
directory by
running the following code.
R
download.file(url = "https://zenodo.org/record/8125141/files/E-MTAB-6845.sdrf.txt",
destfile = "data/E-MTAB-6845.sdrf.txt")
Read the sdrf file into R and take a look at the data. Make a list of the potential issues with this data based on the checklist above? What steps would need to be taken to get them ready to train a machine learning classifier?
As a help, the code to read the file in from your data directory is:
R
samp.info.tb <- read.table(file="./data/E-MTAB-6845.sdrf.txt", sep="\t", header=T, fill=T, check.names=F)
Challenge Extension
Can you correctly reorder the following code snippets to create a
pipeline to reformat the data in line with the sample information
checklist? Use the
Characteristics[progressor status (median follow-up 1.9 years)]
as the target variable.
R
dplyr::rename(
'sampleID' = 'Comment[sample id]',
'prog_status' = 'Factor Value[progressor status (median follow-up 1.9 years)]') %>%
dplyr::select(
'Comment[sample id]',
'Factor Value[progressor status (median follow-up 1.9 years)]') %>%
dplyr::mutate(prog_status = factor(prog_status, levels = c("non_progressor","tb_progressor")),
class = factor(class, levels = c(-1, 1))) %>%
dplyr::mutate(prog_status = dplyr::recode(prog_status,
"non-progressor" = "non_progressor",
"TB progressor" = "tb_progressor"),
class = dplyr::if_else(prog_status == 'non_progressor', -1, 1)
) %>%
dplyr::distinct() %>%
samp.info.tb %>%
dplyr::glimpse()
R
samp.info.tb %>%
dplyr::select(
'Comment[sample id]',
'Factor Value[progressor status (median follow-up 1.9 years)]') %>% # select columns
dplyr::rename(
'sampleID' = 'Comment[sample id]',
'prog_status' = 'Factor Value[progressor status (median follow-up 1.9 years)]') %>% # rename them
dplyr::distinct() %>% # remove duplicate rows
dplyr::mutate(prog_status = dplyr::recode(prog_status, # recode target variable
"non-progressor" = "non_progressor",
"TB progressor" = "tb_progressor"),
class = dplyr::if_else(prog_status == 'non_progressor', -1, 1) # create numerical class
) %>%
dplyr::mutate(prog_status = factor(prog_status, levels = c("non_progressor","tb_progressor")),
class = factor(class, levels = c(-1, 1))) %>% # set variables to factors
dplyr::glimpse() # view output
OUTPUT
Rows: 360
Columns: 3
$ sampleID <chr> "PR123_S19", "PR096_S13", "PR146_S14", "PR158_S12", "PR095…
$ prog_status <fct> non_progressor, non_progressor, non_progressor, non_progre…
$ class <fct> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1…
Key Points
- Machine learning algorithms require specific data input formats, and for data to be consistently formatted across variables and samples. A classification task for example requires a matrix of the value of each input variable for each sample, and the class label for each sample.
- Clinical sample information is often messy, with inconsistent formatting as a result of how it is collected. This applies to data downloaded from public repositories.
- You must carefully check all data for formatting and data integrity issues that may negatively impact your downstream ml analysis.
- Document your data reformatting and store the code pipeline used along with the raw and reformatted data to ensure your procedure is reproducible