Instructor Notes


Introduction: Machine Learning Ready RNA-Seq Data


Data Collection: ArrayExpress


Data Collection: GEO


Data Readiness: Data Format and Integrity


Instructor Note

Here are a number of issues that need addressing in the TB dataset:

  • One important reformatting point in this dataset is that the sdrf file contains two lines per sample, one for each read of the paired-end reads. The only difference between the two lines is the file name given for the fastq file. To solve this we need to select the distinct rows for the variables of interest. Check the issue with this line of code:

which(duplicated(samp.info.tb$`Comment[sample id]`))

  • There are missing values for some variables that may be useful as predictor variables (e.g. in Characteristics[sex], missing values are represented by double spaces). It would be better to recode them as NA, as this is universally understood to represent a missing value. This is easily seen by running this:

unique(samp.info.tb$`Characteristics[sex]`)

  • A number of variables need renaming, and special characters and spaces need removing. For example, the variable name progressor status and its value TB progressor both contain a space that could be replaced with an underscore.

  • And perhaps most importantly of all, the dataset is highly unbalanced, with 9 TB progressors and 351 non-progressors when you account for duplicate data. This dataset is very unlikely to perform well as a training dataset for a machine learning classifier!

table(samp.info.tb$`Factor Value[progressor status (median follow-up 1.9 years)]`)
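
The fixes listed above can be sketched in base R as follows. This is a minimal illustration rather than the lesson's reference solution: it runs on a small invented data frame, samp.info.toy, instead of the real samp.info.tb; the file name column Comment[FASTQ_URI], the new variable names and all the values are assumptions made for the example; and it assumes the sdrf column names were kept verbatim (for example by reading the file with check.names = FALSE).

```r
# Toy stand-in for samp.info.tb: two rows per sample, one per paired-end
# read file. All values are invented for illustration only.
samp.info.toy <- data.frame(
  `Comment[sample id]` = c("S1", "S1", "S2", "S2", "S3", "S3"),
  `Characteristics[sex]` = c("male", "male", "  ", "  ", "female", "female"),
  `Factor Value[progressor status (median follow-up 1.9 years)]` =
    c("non-progressor", "non-progressor", "TB progressor", "TB progressor",
      "non-progressor", "non-progressor"),
  # Hypothetical file name column: the only field that differs within a pair.
  `Comment[FASTQ_URI]` = paste0("file_", 1:6, ".fastq.gz"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# 1. Keep only the variables of interest, then select the distinct rows.
vars <- c("Comment[sample id]",
          "Characteristics[sex]",
          "Factor Value[progressor status (median follow-up 1.9 years)]")
samp.clean <- unique(samp.info.toy[, vars])

# 2. Rename the variables so they contain no spaces or special characters.
names(samp.clean) <- c("sample_id", "sex", "progressor_status")

# 3. Recode blank (whitespace-only) entries as NA.
samp.clean$sex[trimws(samp.clean$sex) %in% ""] <- NA

# 4. Replace spaces in the values with underscores.
samp.clean$progressor_status <- gsub(" ", "_", samp.clean$progressor_status,
                                     fixed = TRUE)

# 5. Check the class balance after removing the duplicated rows.
class.counts <- table(samp.clean$progressor_status, useNA = "ifany")
print(class.counts)
```

Selecting the columns before calling unique() matters here: while the file name column is present, no two rows are identical, so nothing would be dropped.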



Data Readiness: Technical Noise


Data Readiness: Distribution and Scale