Instructor Notes
Introduction: Machine Learning Ready RNA-Seq Data
Data Collection: ArrayExpress
Data Collection: GEO
Data Readiness: Data Format and Integrity
Instructor Note
Here are a number of issues that need addressing in the TB dataset:
- One important reformatting point in this dataset is that the sdrf file contains two lines per sample, one for each read of the paired-end reads. The only difference between the two lines is the file name given for the fastq file. To solve this, we need to select the distinct rows for the variables of interest. Check for the issue with this line of code:
which(duplicated(samp.info.tb$`Comment[sample id]`))
- There are missing values for some variables that may be useful as predictor variables. For example, in Characteristics[sex] missing values are represented by double spaces. It would be better to recode them as NA, as this is universally understood to represent a missing value. This is easily seen by running this:
unique(samp.info.tb$`Characteristics[sex]`)
- A number of variables need renaming, with special characters and spaces removed. For example, the variable name progressor status and the value TB progressor of this variable use a space that could be replaced with an underscore.
- And perhaps most importantly of all, the dataset is highly unbalanced, with 9 TB progressors and 351 non-progressors when you account for duplicate data. This dataset is very unlikely to perform well as a training dataset for a machine learning classifier!
table(samp.info.tb$`Factor Value[progressor status (median follow-up 1.9 years)]`)
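The fixes listed above can be sketched in base R as follows. This is a minimal illustration rather than the lesson's own solution. It runs on a small made-up stand-in for samp.info.tb so that it can be executed on its own. The fastq_file column, the sample IDs, the value non-progressor and the new names sample_id, sex and progressor_status are all assumptions made for the example. The real sdrf file may name or code these differently.

```r
# Toy stand-in for samp.info.tb. All values are invented for illustration;
# `fastq_file` is a hypothetical column name for the per-read fastq file.
samp.info.tb <- data.frame(
  `Comment[sample id]` = c("S1", "S1", "S2", "S2", "S3", "S3"),
  `Characteristics[sex]` = c("male", "male", "  ", "  ", "female", "female"),
  `Factor Value[progressor status (median follow-up 1.9 years)]` =
    c("TB progressor", "TB progressor",
      "non-progressor", "non-progressor",   # assumed coding of the other class
      "non-progressor", "non-progressor"),
  fastq_file = c("S1_1.fastq.gz", "S1_2.fastq.gz", "S2_1.fastq.gz",
                 "S2_2.fastq.gz", "S3_1.fastq.gz", "S3_2.fastq.gz"),
  check.names = FALSE,        # keep the sdrf-style names with spaces and brackets
  stringsAsFactors = FALSE
)

# 1. Keep only the variables of interest, then drop the duplicated rows,
#    leaving one row per sample (the fastq file name is what differed).
vars <- c("Comment[sample id]",
          "Characteristics[sex]",
          "Factor Value[progressor status (median follow-up 1.9 years)]")
samp.clean <- unique(samp.info.tb[, vars])

# 2. Recode blank (whitespace-only) entries as NA.
sex <- samp.clean[["Characteristics[sex]"]]
samp.clean[["Characteristics[sex]"]][trimws(sex) == ""] <- NA

# 3. Rename the variables (hypothetical new names) and replace spaces
#    in the class labels with underscores.
names(samp.clean) <- c("sample_id", "sex", "progressor_status")
samp.clean$progressor_status <- gsub(" ", "_", samp.clean$progressor_status,
                                     fixed = TRUE)

# 4. Re-check the class balance on the de-duplicated data.
class.counts <- table(samp.clean$progressor_status)
print(class.counts)
```

Running the same steps on the real samp.info.tb should leave one row per sample. The final table is where the imbalance between progressors and non-progressors becomes visible.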