RNA-seq analysis with Bioconductor: Key Points

Introduction to RNA-seq
What are we measuring in an RNA-seq experiment?
Experimental design considerations
RNA-seq quantification: from reads to count matrix
Finding the reference sequences
Where are we heading towards in this workshop?

RNA-seq is a technique for measuring the amount of RNA expressed in a cell or tissue, in a given state and at a given time.
Many choices have to be made when planning an RNA-seq experiment, such as whether to perform poly-A selection or ribosomal depletion, whether to apply a stranded or an unstranded protocol, and whether to sequence the reads in a single-end or paired-end fashion. Each of these choices has consequences for the processing and interpretation of the data.
Many approaches exist for quantification of RNA-seq data. Some methods align reads to the genome and count the number of reads overlapping gene loci. Other methods map reads to the transcriptome and use a probabilistic approach to estimate the abundance of each gene or transcript.
Information about annotated genes can be accessed via several sources, including Ensembl, UCSC and GENCODE.
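
To make the alignment-based counting idea concrete, here is a toy sketch in base R. The gene and read coordinates are invented for illustration. Real counting tools also have to deal with strand, spliced reads, and reads that overlap several genes, none of which is modelled here.

```r
# Toy gene loci on a single chromosome (coordinates are made up).
genes <- data.frame(
  gene_id = c("geneA", "geneB"),
  start   = c(100, 500),
  end     = c(300, 900)
)

# Toy aligned reads, each 50 bases long (start positions are made up).
read_start <- c(90, 150, 290, 400, 510, 880, 950)
read_end   <- read_start + 49

# A read overlaps a gene if it starts before the gene ends
# and ends after the gene starts.
count_overlaps <- function(gene_start, gene_end) {
  sum(read_start <= gene_end & read_end >= gene_start)
}

# One count per gene: this is a single column of a count matrix.
gene_counts <- mapply(count_overlaps, genes$start, genes$end)
names(gene_counts) <- genes$gene_id
gene_counts
```

Repeating this for every sample and binding the resulting vectors column-wise gives the genes-by-samples count matrix that the rest of the analysis works on.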

Proper organisation of the files required for your project in a working directory is crucial for maintaining order and ensuring easy access in the future.
An RStudio project is a valuable tool for managing your project’s working directory and facilitating analysis.
The download.file function in R can be used for downloading datasets from the internet.
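
As a minimal sketch of these points, the following base R code creates a simple project layout and wraps download.file() so that a file is only fetched once. The folder names, the helper name fetch_file, and the URL in the commented-out call are all hypothetical placeholders. The layout is created under tempdir() here only so the example is safe to run anywhere.

```r
# Hypothetical project layout; adjust the folder names to your own project.
project_dir <- file.path(tempdir(), "rnaseq_project")
for (sub in c("data", "scripts", "output")) {
  dir.create(file.path(project_dir, sub), recursive = TRUE, showWarnings = FALSE)
}

# Small wrapper around download.file() that skips files already present,
# so re-running a script does not download the same data again.
fetch_file <- function(url, destfile) {
  if (!file.exists(destfile)) {
    download.file(url = url, destfile = destfile, mode = "wb")
  }
  invisible(destfile)
}

# Example call (placeholder URL, not a real dataset):
# fetch_file("https://example.org/counts.csv",
#            file.path(project_dir, "data", "counts.csv"))
```

In a real project, project_dir would be the directory of the RStudio project, so that all paths can be written relative to it.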

Depending on the gene expression quantification tool used, there are different ways (often distributed in Bioconductor packages) to read the output into a SummarizedExperiment or DGEList object for further processing in R.
Stable gene identifiers such as Ensembl or Entrez IDs should preferably be used as the main identifiers throughout an RNA-seq analysis, with gene symbols added for easier interpretation.
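
The following base R sketch illustrates why stable identifiers work better as the main key, with symbols kept as an extra annotation column. All gene IDs, symbols, and counts are invented for illustration. In practice the annotation table would come from an annotation resource rather than being typed in by hand.

```r
# Toy count matrix with stable gene IDs as row names (values are made up).
counts <- matrix(
  c(10L, 12L, 0L, 3L, 250L, 270L),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("ENSG_TOY_0001", "ENSG_TOY_0002", "ENSG_TOY_0003"),
    c("sample1", "sample2")
  )
)

# Toy annotation table in a different row order, with a duplicated
# symbol and a missing symbol (both situations occur in real annotations).
annotation <- data.frame(
  gene_id = c("ENSG_TOY_0003", "ENSG_TOY_0001", "ENSG_TOY_0002"),
  symbol  = c(NA, "GeneA", "GeneA"),
  stringsAsFactors = FALSE
)

# Align the annotation to the count matrix using the stable ID.
row_data <- annotation[match(rownames(counts), annotation$gene_id), ]
rownames(row_data) <- row_data$gene_id

# Symbols could not serve as unique row names here.
anyDuplicated(row_data$symbol) > 0

# For plots and tables, show the symbol when available, else the stable ID.
row_data$label <- ifelse(is.na(row_data$symbol), row_data$gene_id, row_data$symbol)
row_data
```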

Exploratory analysis is essential for quality control and to detect potential problems with a data set.
Different classes of exploratory analysis methods expect differently preprocessed data. The most commonly used methods expect counts to be normalized and log-transformed (or transformed with a similar, more sophisticated method) so that the data are closer to homoskedastic. Other methods work directly on the raw counts.
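
A minimal base R sketch of the "normalize, log-transform, then explore" route is shown below. It uses a simple log2 counts-per-million transformation with a pseudocount of 1 on invented counts. This is an assumption made for illustration, not the more sophisticated variance-stabilizing approaches alluded to above.

```r
# Toy count matrix: 5 genes x 4 samples (values are made up).
counts <- matrix(
  c( 100,  120,  400,  380,
      50,   45,   55,   60,
       8,   10,    9,   12,
    1000, 1100,  950, 1020,
      30,   25,   90,  110),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("gene", 1:5), c("ctrl1", "ctrl2", "trt1", "trt2"))
)

# Normalize for library size: counts per million (CPM).
lib_sizes <- colSums(counts)
cpm <- t(t(counts) / lib_sizes) * 1e6

# Log-transform with a pseudocount so that zeros stay finite.
log_cpm <- log2(cpm + 1)

# Exploratory views that expect transformed data:
# sample-to-sample distances and a principal component analysis.
sample_dist <- dist(t(log_cpm))
pca <- prcomp(t(log_cpm))

sample_dist
summary(pca)
```

Plotting pca$x[, 1:2] coloured by experimental group is a common first check for outlier samples and unexpected batch structure.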

With DESeq2, the main steps of a differential expression analysis (size factor estimation, dispersion estimation, calculation of test statistics) are wrapped in a single function: DESeq().
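Independent filtering of lowly expressed genes is often beneficial, because it reduces the number of tests to correct for and can thereby increase the power to detect differentially expressed genes.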
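
To give a feel for what one of the steps wrapped by DESeq() does, the sketch below implements the median-of-ratios idea behind size factor estimation in base R. It is a simplified illustration on invented counts, not DESeq2's own code, and the function name median_ratio_size_factors is hypothetical.

```r
# Toy counts: sample2 was sequenced exactly twice as deeply as sample1
# (values are made up; gene3 has a zero and is ignored by the estimator).
counts <- matrix(
  c( 10,  20,
    100, 200,
      0,   5,
     50, 100),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), c("sample1", "sample2"))
)

# Median-of-ratios: compare each sample to a per-gene geometric mean
# and take the median ratio across genes as that sample's size factor.
median_ratio_size_factors <- function(counts) {
  log_counts <- log(counts)
  log_geo_means <- rowMeans(log_counts)   # -Inf if a gene has any zero count
  keep <- is.finite(log_geo_means)        # drop genes with zero counts
  apply(log_counts[keep, , drop = FALSE], 2, function(lc) {
    exp(median(lc - log_geo_means[keep]))
  })
}

size_factors <- median_ratio_size_factors(counts)

# Dividing each column by its size factor puts samples on a common scale.
normalized <- sweep(counts, 2, size_factors, "/")

size_factors
normalized
```

Because the median is used, a handful of strongly differentially expressed genes has little influence on the estimated size factors.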

The formula framework in R allows the creation of design matrices, which detail the variables expected to be associated with systematic differences in gene expression levels.
Comparisons of interest can be defined using contrasts, which are linear combinations of the model coefficients.
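
The two points above can be sketched with base R alone. The sample annotation and the coefficient values below are invented for illustration. In a real analysis the coefficients would be estimated by the model-fitting software.

```r
# Toy sample annotation: three groups with two replicates each.
group <- factor(c("A", "A", "B", "B", "C", "C"))

# A formula without an intercept gives one coefficient per group.
design <- model.matrix(~ 0 + group)
colnames(design)
# "groupA" "groupB" "groupC"

# A contrast is a vector of weights, one per coefficient.
# This one asks: is group C different from group B?
contrast_C_vs_B <- c(groupA = 0, groupB = -1, groupC = 1)

# Hypothetical fitted coefficients for one gene (log2 scale, made up).
coefs <- c(groupA = 5, groupB = 6, groupC = 8.5)

# The contrast value is the corresponding linear combination,
# here the log2 fold change of C relative to B.
sum(contrast_C_vs_B * coefs)
# 2.5
```

With an intercept in the formula (for example ~ group), the coefficients change meaning: they become differences relative to the reference level, so the contrast weights have to be chosen accordingly.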

Over-representation analysis (ORA) is based on counting genes (how many genes of interest fall in a gene set, not read counts) and uses Fisher’s exact test or, equivalently, the hypergeometric distribution.
In R, it is easy to obtain gene sets from a large number of sources.
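
The test behind ORA can be sketched in base R on an invented example. The gene universe, the gene set, and the list of differentially expressed genes below are all made up. The sketch only shows that the hypergeometric tail probability and the one-sided Fisher's exact test give the same answer.

```r
# Toy universe of 1000 tested genes (IDs are made up).
universe <- paste0("gene", 1:1000)

# Toy gene set of 50 genes and toy list of 100 differentially expressed genes,
# constructed so that 15 genes are in both.
gene_set <- universe[1:50]
de_genes <- c(universe[1:15], universe[51:135])

N <- length(universe)                 # genes tested
K <- length(gene_set)                 # genes in the set
n <- length(de_genes)                 # DE genes
k <- length(intersect(de_genes, gene_set))  # DE genes in the set

# Hypergeometric: probability of seeing k or more set members among n DE genes.
p_hyper <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)

# The same test written as a 2 x 2 table for Fisher's exact test.
tab <- matrix(
  c(k, n - k, K - k, N - K - n + k),
  nrow = 2,
  dimnames = list(c("in_set", "not_in_set"), c("DE", "not_DE"))
)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

c(hypergeometric = p_hyper, fisher = p_fisher)
```

Note that the choice of universe matters: it should contain only the genes that could have been detected as differentially expressed, not every annotated gene.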

RNA-seq data is very versatile and can be used for a number of different purposes. It is important, however, to carefully plan one’s analyses, to make sure that enough data is available and that abundances for appropriate features (e.g., genes, transcripts, or exons) are quantified.