This lesson is still being designed and assembled (Pre-Alpha version)

Quality Control & Normalization

Overview

Teaching: 15 min
Exercises: 5 min
Questions
  • What is the aim of normalization?

Objectives
  • To understand the structure of expression matrix.

  • To use the Scanpy package for normalization.

  • To create an object containing expression matrix, barcodes, and cell IDs.

  • Understanding the necessity of normalizing counts for accurate comparison between cells.

Generating Expression Metadata

Sequencing centers deliver the results of scRNA-seq in three formats including BCL, fastq, and .mtx. Raw data for scRNA-seq data are received as BCL2 or fastq files. BCL2 files should be converted into FASTQ files using a command line software called bcl2fastq. Analysis of data in FASTQ format includes Quality Control, Trimming, Alignment, Mapping which are mainly similar to bulk RNA-seq analysis and you can find in Data Carpentry Genomics Lessons. Quantification analysis uses statistical analysis and machine learning methods to detect the number of each transcript and count them per cell. Some of the methodologies normalize the counts of transcripts and filter the genes with no significantly different expression levels among which edgeR, DESeq, DESeq2, etc can be mentioned. The output of quantification analysis is a text file containing gene IDs in rows and cell IDs in columns which is called an expression matrix. In this curriculum, we will work on the expression matrix and will apply the analysis steps on it.

Get Data

The dataset which we use for this workshop includes 6 files in .tsv format with information about cell barcodes, metadata, cell IDs and gene IDs and one file in .mtx format which includes the number of RNAs per cell.

Project Directory

scRNA-seq analysis workflow begins with a few files and will produce a lot of files. So it is a big data project and it is useful to manage our files and directories. For this, create a directory called scRNA-seq and four subdirectories called data, scripts, output, and docs.

$ mkdir -p scRNA/{data,docs,output,scripts}.
$ ls
$ cd scRNA/data

As we do for all big data projects, for a better data management in here, we can add the date of files generation at the first part of files names. For example, 2021-09-20-normalization.txt.

Exercise

Discuss your results with your classmate. What are the advantages and disadvantages of file name system suggested? Which system do you suggest for file names which helps for better management of scRNA-seq as a big data project?

Solution

  • Advantages: Reproducibility. Easy to track the project progress. Disadvantage: Not all of the collaborators use the same manner for the data management. It makes hard to merge files and data.

Please download dataset using the link in setup (should be linked) into the data directory using the command below:

$ wget -c data-link
$ ls
barcodes.tsv.gz  deng-reads.rds  features.tsv.gz  matrix.mtx.gz  molecules.txt reads.rds  reads.txt  TPs.txt  umi.rds

Preprocessing

The first step using expression matrix is preprocessing divided into two main steps of preprocessing and normalization.

SCANPY

Scanpy is a large scale toolkit for analysis of single-cell gene expression data. The methods for preprocessing, visualization, clustering, pseudotime and trajectory inference, differential expression testing, and simulation of gene regulatory networks are included. The Python-based implementation of SCANPY efficiently deals with data sets of more than one million cells.

You can see that there are different data structures. For this lesson, scanpy is used in combination with pandas and numpy libraries. In the following steps, matplotlib will be needed for visualization, too. For this, the libraries should be imported as below:

$ import scanpy as sc
$ import pandas as pd
$ import numpy as np

Now, configuration settings of scanpy should be performed. Verbosity is an attribute for management of warning messages. The second line is used for creating a log file and the thهrd line is the method for general settings of the graph.

$ sc.settings.verbosity = 3             # verbosity: errors (0), warnings (1), info (2), hints (3)
$ sc.logging.print_header()
$ sc.settings.set_figure_params(dpi=80, facecolor='white')

Which will give you the following output:

scanpy==1.6.0 anndata==0.7.5.dev7+gefffdfb umap==0.4.2 numpy==1.18.1 scipy==1.4.1 pandas==1.0.3 scikit-learn==0.22.1 statsmodels==0.11.0 python-igraph==0.7.1 leidenalg==0.7.0

A file is generated to store the analysis results:

$ results_file = 'write/pbmc3k.h5ad'

Based on the different types of files, it is needed to create an object which contains data of count matrix, annotations, barcodes, etc. The object is called AnnData. AnnData stores data in its special format called h5ad. The object named as adata is created using the code below:

$ adata = sc.read_10x_mtx(
$   'data/filtered_gene_bc_matrices/hg19/',  # the directory with the `.mtx` file
$   var_names='gene_symbols',                # use gene symbols for the variable names (variables-axis index)
$   cache=True)                              # write a cache file for faster subsequent reading

sc.read_10x_mtx is used for reading expression data matrix. hg19 is the directory where .mtx file is stored. var_names is set to use gene_symbols. cahe=True is the code to write a cache file for faster subsequent reading.

... reading from cache file cache/data-filtered_gene_bc_matrices-hg19-matrix.h5ad

Now, it is possible to check out the contents of adata object:

$ adata
AnnData object with n_obs × n_vars = 2700 × 32738
    var: 'gene_ids'

The output shows that there are 2700 columns containing observations and 32738 rows including the variables with gene_id information in the reads count expression matrix.

Preprocessing & Normalization

Expression matrix contains all the genes and their expression levels. As the first step of filtering, it is possible to find the first twenty genes with highest expression levels using the code below:

$ sc.pl.highest_expr_genes(adata, n_top=20, )
Figure should be added.

Usually, filtering is applied on two dimensions of the expression matrix: cells and genes. For this, scanpy includes four important filteration functions included in pp (preprocessing) module: filter_cells, filter_genes, and filter_highly_variable_genes, and normalize_total. Each function has parameter. Here, some challenges are required!

Exercise

Which one of the following options can descripe min_counts better?

1) Minimum number of genes required for a count to pass filtering. 2) Minimum number of counts required for a cell to pass filtering. 3) Minimum number of counts required for a gene to pass filtering. 4) Minimum number of cells required to pass filtering.

Solution

Option No.2.

In the same way, it is possible to explain min_genes, min_cells, max_counts, and max_genes, etc,.

Exercise

Which one of the following options can descripe n_top_genes better?

1) Number of highly-variable genes to remove. 2) Number of highly expressed genes to keep. 3) Number of highly-variable genes to keep. Mandatory if flavor=’seurat_v3’. 4) Number of highly expressed genes to remove.

Solution

Option No.3.

Now, basic and actual filtering can be performed.

$ sc.pp.filter_cells(adata, min_genes=200)
$ sc.pp.filter_genes(adata, min_cells=3)
filtered out 19024 genes that are detected in less than 3 cells

One main step for quality control of expression matrix is to achieve the information about the proportion of mitochondrial genes.

Mitochondiral genes a challenge for scRNA-seq

Cells with poor quality in scRNA-sq contain high proportions of mitochondrial genes. These cells are assumed to be discarded before entering into the droplet. Not filtering mitochondrial genes can cause formation a separate cluster in the further steps.

Quality control is performed using calculate_qc_metrics function in pp module of scanpy using the code below:

$ adata.var['mt'] = adata.var_names.str.startswith('MT-')  
$ sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)

This code annotates the group of mitochondrial genes as ‘mt’.

Now, visualization is possible.

Violin Plots

  • Violin plots are used for visualization of quality control results in scRNA-seq.
  • The number of genes expressed in the count matrix, the total counts per cell, and the percentage of counts in mitochondrial genes.
  • These plots are similar to box plots with addition of the probability density of the data at different values.
$ sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],
             jitter=0.4, multi_panel=True)
The figures should be added.

Scatter plots are applied to visualize data after removal of mitochondrial genes with very high level of expression.

$ sc.pl.scatter(adata, x='total_counts', y='pct_counts_mt')
$ sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts')
The figures should be added.

The results of filters are applied to splice AnnData object.

$ adata = adata[adata.obs.n_genes_by_counts < 2500, :]
$ adata = adata[adata.obs.pct_counts_mt < 5, :]

Total-count normalize (library-size correct) is used to make counts comparable among cells. Using this method, the data matrix contains 10,000 reads per cell.

$ sc.pp.normalize_total(adata, target_sum=1e4)

The process takes a few seconds …

normalizing counts per cell
    finished (0:00:00)

Logartimaize data:

$ sc.pp.log1p(adata)

And, identify highly-variable genes:

$ sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)

highly_variable_genes function expects normalized and logarithmized data and the variation in genes expression level are rated using the normalized variance of count number. Highly variable gene (HVG) is an important parameter in scRNA-seq that helps to find the genes contribute strongly to cell-to-cell variation within a homogeneous cell population, such as a population of embryonic stem cells.

extracting highly variable genes
    finished (0:00:00)
--> added
    'highly_variable', boolean vector (adata.var)
    'means', float vector (adata.var)
    dispersions', float vector (adata.var)
    'dispersions_norm', float vector (adata.var)

Visualizing the HVGs:

$ sc.pl.highly_variable_genes(adata)
The figure should be added.

Key Points

  • scRNA-seq requires much pre-processing before analysis can be performed.