Summary and Schedule

This is a new lesson built with The Carpentries Workbench.

Setup Instructions: Download files required for the lesson

Duration: 00h 00m
1. Introduction: Machine Learning Ready RNA-Seq Data
Where can I find a publicly available RNA-Seq dataset suitable for a machine learning analysis?
What format is RNA-Seq data stored in on public repositories?
What characteristics of a dataset do I need to consider to make it ‘ready’ for a machine learning/AI modelling analysis?

Duration: 00h 11m
2. Data Collection: ArrayExpress
In what format are processed RNA-Seq datasets stored on ArrayExpress?
How do I search for a dataset that meets my requirements on ArrayExpress?
Which files do I need to download?

Duration: 00h 26m
3. Data Collection: GEO
In what format are processed RNA-Seq datasets stored on GEO?
How do I search for a dataset that meets my requirements on GEO?
How do I use the R package GEOquery to download datasets from GEO into R?

Duration: 00h 46m
4. Data Readiness: Data Format and Integrity
What format do I require my data in to build a supervised machine learning classifier?
What are the potential data format and data integrity issues I will encounter with RNA-Seq datasets, including those downloaded from public repositories, that I’ll need to address before beginning any analysis?

Duration: 01h 36m
5. Data Readiness: Technical Noise
How do technical artefacts in RNA-Seq data impact machine learning algorithms?
How can technical artefacts such as low count genes and outlier read counts be effectively removed from RNA-Seq data prior to analysis?

Duration: 01h 48m
6. Data Readiness: Distribution and Scale
Do I need to transform or rescale RNA-Seq data before inputting it into a machine learning algorithm?
What are the most appropriate transformations and how do these depend on the particular machine learning algorithm being employed?

Duration: 02h 00m
Finish

The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.

Summary

This lesson provides a practical guide to sourcing and pre-processing a bulk RNA-Seq dataset for use in a machine learning classification task. It explains the characteristics a dataset needs for this type of analysis and how to search for and download a dataset from each of the main public functional genomics repositories, then gives guidelines, with detailed examples, on pre-processing a dataset to make it machine learning ready. Finally, the lesson explains some of the additional data filtering and transformation steps that can improve the performance of machine learning algorithms on RNA-Seq count data.

The lesson is written in the context of a supervised machine learning classification task, where the goal is to construct a model that can distinguish between two disease states (e.g. disease vs. healthy control) based on the gene expression profile.
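To make the target of the pre-processing concrete, the following is a minimal sketch on an invented toy dataset. It shows a count matrix being filtered for low count genes, log-transformed, and arranged with samples in rows alongside one class label per sample. The gene names, counts, filtering thresholds, and the choice of a log2 transformation are all assumptions made for illustration and are not necessarily the values or methods used in the episodes.

```r
# Toy count matrix: 4 genes (rows) x 6 samples (columns).
# All values are invented for illustration only.
counts <- matrix(
  c(120,  98, 143, 310, 287, 344,   # geneA
      0,   1,   0,   2,   0,   1,   # geneB: very low counts
     15,  22,  18,  19,  25,  17,   # geneC
    900, 850, 990, 400, 380, 450),  # geneD
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", LETTERS[1:4]), paste0("sample", 1:6))
)

# One class label per sample (the target variable for a classifier)
condition <- factor(c("control", "control", "control",
                      "disease", "disease", "disease"))

# Filter: keep genes with at least 10 counts in at least 3 samples
# (both thresholds are arbitrary assumptions for this sketch)
keep <- rowSums(counts >= 10) >= 3
filtered <- counts[keep, , drop = FALSE]

# Transform: log2 with a pseudocount of 1 to compress the dynamic range
log_counts <- log2(filtered + 1)

# Most classifiers expect samples in rows and features (genes) in columns
x <- t(log_counts)
dim(x)
```

Here x holds one row per sample and one column per retained gene, and condition supplies the matching class labels in the same sample order.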

This work was funded by the ELIXIR-UK: FAIR Data Stewardship training UKRI award (MR/V038966/1).

Prerequisite

This lesson assumes a working knowledge of programming in R. For learners who aren’t familiar with R or feel they need a refresher, the Programming with R lesson provides a good introduction to both R and working with RStudio.

Data Sets

This lesson uses a number of RNA-Seq datasets downloaded from public functional genomics repositories.

Before the start of the lesson, create a new RStudio project where you will keep all of the files for this lesson. In the project directory (where the .Rproj file is), create a subdirectory called data.
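The data subdirectory can also be created from the R console. The following is a minimal sketch that assumes the working directory is the project directory, which is the default when the project is opened in RStudio.

```r
# Create the "data" subdirectory in the current working directory
# (assumed here to be the RStudio project directory).
data_dir <- "data"
if (!dir.exists(data_dir)) {
  dir.create(data_dir)
}

# Confirm that the directory now exists
dir.exists(data_dir)
```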

Links to the relevant datasets and instructions on how to download them are provided in each episode.

Software Setup

R and RStudio

Learners will need up-to-date versions of R and RStudio installed. There are instructions for installing R, RStudio, and additional R packages for all the main operating systems in the R Ecology Lesson.

Please install the following R packages. You will need to install the package BiocManager to be able to install packages from Bioconductor.

tidyverse
reshape2
scales
BiocManager
Biobase
GEOquery
DESeq2

Executing the following lines of code in the R console will install all of these packages.

R


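# Install the CRAN packages, including BiocManager
install.packages(c("tidyverse", "reshape2", "scales", "BiocManager"))

# Install the Bioconductor packages using BiocManager
BiocManager::install(c("Biobase", "GEOquery", "DESeq2"))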

You can check that all packages have been installed using the following command. It returns character(0) if every package is installed; any names it returns are packages that are still missing.

R


setdiff(c("tidyverse", "reshape2", "scales", "BiocManager", "Biobase", "GEOquery", "DESeq2"),
        rownames(installed.packages()))
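A package can be listed as installed and still fail to load, for example when one of its dependencies is missing. As a complementary check, the following sketch tests whether each package's namespace can actually be loaded. The variable names pkgs and loadable are illustrative only.

```r
# Packages required for this lesson
pkgs <- c("tidyverse", "reshape2", "scales", "BiocManager",
          "Biobase", "GEOquery", "DESeq2")

# requireNamespace() returns TRUE if the package can be loaded and
# FALSE otherwise, without attaching it to the search path
loadable <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of any packages that could not be loaded
names(loadable)[!loadable]
```

As with the check above, a result of character(0) means every package loaded successfully.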