Introduction: Machine Learning Ready RNA-Seq Data


  • The two main repositories for sourcing RNA-Seq datasets are ArrayExpress and GEO
  • Processed data, in the form of raw counts or further processed counts, is the starting point for machine learning analysis
  • Sourcing an appropriate RNA-Seq dataset and preparing it for a machine learning analysis requires you to consider the dataset's quality, size, format, noise content, data distribution, and scale

Data Collection: ArrayExpress


  • ArrayExpress stores two standard files with information about each experiment: 1. Sample and Data Relationship Format (SDRF) and 2. Investigation Description Format (IDF).
  • ArrayExpress provides raw and processed data for RNA-Seq datasets, typically stored as csv, tsv, or txt files.
  • The filters on the ArrayExpress website allow you to select and download datasets that suit your task.
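
A processed counts file from a repository is typically a tab-separated genes-by-samples table. As a minimal sketch, it can be loaded with pandas; the file contents below are inlined and hypothetical, and in practice you would point `read_csv` at the downloaded csv/tsv/txt file:

```python
import io

import pandas as pd

# Hypothetical tab-separated counts file: genes as rows, samples as
# columns (inlined here so the example is self-contained).
counts_tsv = io.StringIO(
    "gene_id\tsample_1\tsample_2\n"
    "ENSG00000000003\t120\t98\n"
    "ENSG00000000005\t0\t3\n"
)

# Read into a genes x samples matrix indexed by gene identifier.
counts = pd.read_csv(counts_tsv, sep="\t", index_col="gene_id")
print(counts.shape)  # (2, 2): 2 genes x 2 samples
```

The same call works for the SDRF file, which is also tab-separated; only the index column differs.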

Data Collection: GEO


  • Similar to ArrayExpress, GEO stores sample information and counts matrices separately. Sample information is typically stored in SOFT format and as a .txt file.
  • Counts data may be stored as raw counts or in some further processed form, typically as supplementary files. Always review the documentation to determine what “processed data” refers to for a particular dataset.
  • GEOquery provides a convenient way to download files directly from GEO into R.
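
GEOquery is the R route described above. As a language-agnostic illustration of what it does under the hood, GEO's public FTP layout groups series into buckets whose names mask the last three digits of the accession (e.g. GSE123456 sits under GSE123nnn), so a supplementary-files URL can be assembled from the accession alone. The helper below is a hypothetical Python sketch of that layout, not part of GEOquery:

```python
def geo_series_suppl_url(accession: str) -> str:
    """Build the GEO FTP directory holding a series' supplementary files.

    Assumes GEO's public FTP structure, which buckets series by
    masking the last three digits of the accession with 'nnn'.
    """
    stub = accession[:-3] + "nnn"  # e.g. GSE123456 -> GSE123nnn
    return (
        "https://ftp.ncbi.nlm.nih.gov/geo/series/"
        f"{stub}/{accession}/suppl/"
    )

print(geo_series_suppl_url("GSE123456"))
# https://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123456/suppl/
```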

Data Readiness: Data Format and Integrity


  • Machine learning algorithms require specific data input formats, and for data to be consistently formatted across variables and samples. A classification task, for example, requires a matrix of the value of each input variable for each sample, and the class label for each sample.
  • Clinical sample information is often messy, with inconsistent formatting as a result of how it is collected. This applies to data downloaded from public repositories.
  • You must carefully check all data for formatting and data integrity issues that may negatively impact your downstream machine learning analysis.
  • Document your data reformatting and store the code pipeline used along with the raw and reformatted data to ensure your procedure is reproducible.
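
A minimal Python/pandas sketch of such integrity checks, using a hypothetical toy counts matrix and clinical table (the column names and label values are invented for illustration): it harmonises inconsistently formatted class labels, then keeps only samples that appear in both tables with a non-missing label.

```python
import pandas as pd

# Toy counts matrix (genes x samples) and clinical table showing the
# messy formatting often seen in public repositories.
counts = pd.DataFrame(
    {"S1": [10, 0], "S2": [5, 2], "S3": [7, 1]},
    index=["geneA", "geneB"],
)
clinical = pd.DataFrame(
    {"sample": ["S1", "S2", "S3", "S4"],
     "label": ["Tumour", " tumour", "NORMAL", None]},
).set_index("sample")

# Harmonise label formatting (whitespace, case).
clinical["label"] = clinical["label"].str.strip().str.lower()

# Keep only samples present in both tables and with a class label,
# and put the counts columns in the same order as the clinical rows.
shared = counts.columns.intersection(clinical.index)
clinical = clinical.loc[shared].dropna(subset=["label"])
counts = counts[clinical.index]

print(sorted(clinical["label"].unique()))  # ['normal', 'tumour']
```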

Data Readiness: Technical Noise


  • RNA-Seq read counts contain two main sources of technical ‘noise’ that are unlikely to represent true biological information, and may be artefacts of the experimental processes required to generate the data: low read counts and influential outlier read counts.
  • Filtering out low count and influential outlier genes removes potentially biasing variables without negatively impacting the performance of downstream machine learning analysis. Gene filtering is therefore an important step in preparing an RNA-Seq dataset to be machine learning ready.
  • Multiple approaches exist to identify the specific genes with uninformative low count and influential outlier read counts in RNA-Seq data; however, they all aim to find the boundary between true biological signal and technical and biological artefacts.
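
As one illustrative approach (not a definitive method), the numpy sketch below flags low-count genes with a simple raw-count rule and influential outliers with a median-based rule. Real pipelines typically use more principled criteria (for example CPM thresholds for low counts, or Cook's distance for outliers), and every threshold below is arbitrary:

```python
import numpy as np

# Toy counts matrix: genes x samples, with one low-count gene and one
# gene containing a single extreme outlier value.
counts = np.array([
    [100, 120,  90,  110],   # well-expressed gene: keep
    [  1,   0,   2,    0],   # low-count gene: filter out
    [ 80,  90,  85, 5000],   # influential outlier gene: flag
])

# Low-count rule: keep genes with counts above a cutoff (here 5) in at
# least half of the samples.
keep_low = (counts > 5).sum(axis=1) >= counts.shape[1] // 2

# Crude outlier rule: flag genes where any sample exceeds 10x the
# gene's median count.
medians = np.median(counts, axis=1, keepdims=True)
outlier = (counts > 10 * np.maximum(medians, 1)).any(axis=1)

keep = keep_low & ~outlier
print(keep)  # [ True False False]
```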

Data Readiness: Distribution and Scale


  • RNA-Seq read count data is heavily skewed with a large percentage of zero values. The distribution is heteroskedastic, meaning the variance depends on the mean.
  • Standard transformations such as the variance stabilising transformation and rlog transformation are designed to make the distribution of RNA-Seq data more Gaussian and homoskedastic, which will improve the performance of some machine learning approaches. These transformations improve on a simple log2 transformation, particularly for low count values.
  • Many machine learning algorithms require predictor variables to be on the same scale. Even after vst or rlog transformation, genes in an RNA-Seq dataset will be on very different scales.
  • Standardisation (z-score) and min-max scaling are two common techniques to rescale variables.
  • Data should be scaled before use with machine learning models that use distance-based measures, that include regularisation penalties based on the absolute values of model coefficients, or that are optimised via gradient descent. Naive Bayes and tree-based models are insensitive to scaling, so their variables may be left unscaled.
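
Both rescaling techniques can be sketched in a few lines of numpy on a hypothetical samples-by-genes matrix of already-transformed values:

```python
import numpy as np

# Toy log-scale expression matrix: samples x genes. Even after vst or
# rlog, the two genes sit on very different scales.
X = np.array([
    [2.0, 10.0],
    [4.0, 30.0],
    [6.0, 50.0],
])

# Standardisation (z-score): zero mean and unit variance per gene.
z = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max scaling: map each gene onto the range [0, 1].
mm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(z.mean(axis=0))                   # ~[0, 0]
print(mm.min(axis=0), mm.max(axis=0))   # [0, 0] [1, 1]
```

In a real analysis the scaling parameters (means, standard deviations, minima, maxima) should be computed on the training data only and then applied to the test data, to avoid information leaking from the test set.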