RNA-seq is a technique for measuring the amount of RNA expressed
within a cell or tissue in a given state at a given time.
Many choices have to be made when planning an RNA-seq experiment,
such as whether to perform poly-A selection or ribosomal depletion,
whether to apply a stranded or an unstranded protocol, and whether to
sequence the reads in a single-end or paired-end fashion. Each of these
choices has consequences for the processing and interpretation of the
data.
Many approaches exist for quantification of RNA-seq data. Some
methods align reads to the genome and count the number of reads
overlapping gene loci. Other methods map reads to the transcriptome and
use a probabilistic approach to estimate the abundance of each gene or
transcript.
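The alignment-and-count strategy described above can be illustrated with a minimal sketch. It assumes reads have already been aligned to the genome and reduced to simple intervals. The gene names, coordinates, and the `count_reads` helper are hypothetical, and the rule of discarding reads that overlap zero or several genes is only one possible convention, not the behavior of any particular tool.

```python
from collections import Counter

# Hypothetical gene annotation: gene ID -> (chromosome, start, end),
# using 1-based inclusive coordinates.
genes = {
    "geneA": ("chr1", 100, 500),
    "geneB": ("chr1", 800, 1200),
}

# Hypothetical aligned reads: (chromosome, start, end).
reads = [
    ("chr1", 120, 170),   # inside geneA
    ("chr1", 480, 530),   # partially overlaps geneA
    ("chr1", 600, 650),   # intergenic, not counted
    ("chr1", 900, 950),   # inside geneB
    ("chr2", 100, 150),   # chromosome without annotated genes
]


def count_reads(genes, reads):
    """Count reads that overlap exactly one gene; other reads are skipped."""
    counts = Counter({gene_id: 0 for gene_id in genes})
    for chrom, start, end in reads:
        hits = [
            gene_id
            for gene_id, (g_chrom, g_start, g_end) in genes.items()
            if chrom == g_chrom and start <= g_end and end >= g_start
        ]
        if len(hits) == 1:
            counts[hits[0]] += 1
    return dict(counts)


gene_counts = count_reads(genes, reads)
print(gene_counts)
```

Real counting tools additionally handle strandedness, exon structure, and multi-mapping reads, which this sketch deliberately ignores.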
Information about annotated genes can be accessed via several
sources, including Ensembl, UCSC and GENCODE.
Depending on the gene expression quantification tool used, there are
different ways (often distributed in Bioconductor packages) to read the
output into a SummarizedExperiment or DGEList
object for further processing in R.
Stable gene identifiers such as Ensembl or Entrez IDs should
preferably be used as the main identifiers throughout an RNA-seq
analysis, with gene symbols added for easier interpretation.
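As a rough illustration of this recommendation, the sketch below keeps a stable identifier as the unique key of every row and attaches the gene symbol only as an extra annotation column. The identifiers and symbols are made-up placeholders (not real Ensembl or Entrez IDs), and the `annotate` helper is hypothetical. The example assumes that symbols can be missing or shared between identifiers, which is why they are not used as keys.

```python
# Hypothetical counts keyed by a stable gene identifier (placeholder IDs).
counts = {"GENE0001": 250, "GENE0002": 0, "GENE0003": 37}

# Hypothetical ID-to-symbol annotation. Two IDs share a symbol and one ID
# has no symbol at all.
symbols = {"GENE0001": "SYM1", "GENE0002": "SYM1"}


def annotate(counts, symbols):
    """Return one row per stable ID, with the symbol as an optional label."""
    return [
        {"gene_id": gene_id, "symbol": symbols.get(gene_id), "count": count}
        for gene_id, count in counts.items()
    ]


table = annotate(counts, symbols)
for row in table:
    print(row)
```

Because the stable ID remains the key, no rows are merged or lost when a symbol is duplicated or absent.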
Exploratory analysis is essential for quality control and for detecting
potential problems with a data set.
Different classes of exploratory analysis methods expect differently
preprocessed data. The most commonly used methods expect counts to be
normalized and log-transformed (or transformed with a similar, more
sophisticated method) so that the data are closer to homoskedastic. Other methods
work directly on the raw counts.
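To make the "normalized and log-transformed" preprocessing concrete, here is a minimal sketch that scales each sample by its library size (counts per million) and then takes log2 with a pseudocount. The sample names, counts, `log_cpm` helper, and pseudocount of 1 are all assumptions chosen for illustration. This is the simplest variant, not the more sophisticated transformations mentioned above.

```python
import math

# Hypothetical raw counts: sample -> gene -> count.
# sample2 was sequenced four times as deeply as sample1.
counts = {
    "sample1": {"geneA": 10, "geneB": 90, "geneC": 0},
    "sample2": {"geneA": 40, "geneB": 360, "geneC": 0},
}


def log_cpm(sample_counts, pseudocount=1.0):
    """Scale counts to counts per million, then apply log2(x + pseudocount)."""
    library_size = sum(sample_counts.values())
    return {
        gene: math.log2(count / library_size * 1e6 + pseudocount)
        for gene, count in sample_counts.items()
    }


normalized = {sample: log_cpm(c) for sample, c in counts.items()}
for sample, values in normalized.items():
    print(sample, values)
```

After this transformation the two samples have matching values for each gene, since they differ only in sequencing depth. The pseudocount keeps genes with zero counts finite.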
RNA-seq data is very versatile and can be used for a number of
different purposes. It is important, however, to carefully plan one’s
analyses, to make sure that enough data is available and that abundances
for appropriate features (e.g., genes, transcripts, or exons) are
quantified.