Gene set analysis

Last updated on 2023-05-12 | Edit this page

Overview

Questions

  • Why do we perform gene set enrichment analysis?
  • How does one retrieve predefined sets of genes with shared characteristics for gene set testing?
  • What different classes of methods are available for gene set analysis?

Objectives

  • Explain how to find differentially expressed pathways with gene set analysis in R.
  • Understand how differentially expressed genes can enrich a gene set.
  • Explain how to perform a gene set analysis in R, using clusterProfiler.

Contribute!

This episode is intended to introduce the concept of how to carry out a functional analysis of a subset of differentially expressed (DE) genes, by means of assessing how significantly DE genes enrich gene sets of our interest.

First, we are going to explore the basic concept of enriching a gene set with differentially expressed (DE) genes. Recall the differential expression analysis.

R

library(SummarizedExperiment)
library(DESeq2)

R

se <- readRDS("data/GSE96870_se.rds")

R

dds <- DESeq2::DESeqDataSet(se[, se$tissue == "Cerebellum"],
                            design = ~ sex + time)

WARNING

Warning in DESeq2::DESeqDataSet(se[, se$tissue == "Cerebellum"], design = ~sex
+ : some variables in design formula are characters, converting to factors

R

dds <- DESeq2::DESeq(dds)

Fetch results for the contrast between male and female mice.

R

resSex <- DESeq2::results(dds, contrast = c("sex", "Male", "Female"))
summary(resSex)

OUTPUT


out of 32652 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 53, 0.16%
LFC < 0 (down)     : 71, 0.22%
outliers [1]       : 10, 0.031%
low counts [2]     : 13717, 42%
(mean count < 6)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

Select DE genes between males and females with FDR < 5%.

R

sexDE <- as.data.frame(subset(resSex, padj < 0.05))
dim(sexDE)

OUTPUT

[1] 54  6

R

sexDE <- sexDE[order(abs(sexDE$log2FoldChange), decreasing=TRUE), ]
head(sexDE)

OUTPUT

               baseMean log2FoldChange     lfcSE      stat        pvalue
Eif2s3y       1410.8750       12.62514 0.5652155  22.33685 1.620659e-110
Kdm5d          692.1672       12.55386 0.5936267  21.14773  2.895664e-99
Uty            667.4375       12.01728 0.5935911  20.24505  3.927797e-91
Ddx3y         2072.9436       11.87241 0.3974927  29.86825 5.087220e-196
Xist         22603.0359      -11.60429 0.3362822 -34.50761 6.168523e-261
LOC105243748    52.9669        9.08325 0.5976242  15.19893  3.594320e-52
                      padj
Eif2s3y      1.022366e-106
Kdm5d         1.370011e-95
Uty           1.486671e-87
Ddx3y        4.813782e-192
Xist         1.167393e-256
LOC105243748  1.133708e-48

R

sexDEgenes <- rownames(sexDE)
head(sexDEgenes)

OUTPUT

[1] "Eif2s3y"      "Kdm5d"        "Uty"          "Ddx3y"        "Xist"        
[6] "LOC105243748"

R

length(sexDEgenes)

OUTPUT

[1] 54

Enrichment of a curated gene set


Contribute!

Here we illustrate how to assess the enrichment of one gene set we curate ourselves with our subset of DE genes with sex-specific expression. Here we form such a gene set with genes from sex chromosomes. Could you think of another more accurate gene set formed by genes with sex-specific expression?

Build a gene set formed by genes located in the sex chromosomes X and Y.

R

xygenes <- rownames(se)[decode(seqnames(rowRanges(se)) %in% c("X", "Y"))]
length(xygenes)

OUTPUT

[1] 2123

Build a contingency table and conduct a one-tailed Fisher’s exact test that verifies the association between genes being DE between male and female mice and being located in a sex chromosome.

R

N <- nrow(se)
n <- length(sexDEgenes)
m <- length(xygenes)
k <- length(intersect(xygenes, sexDEgenes)) 
dnames <- list(GS=c("inside", "outside"), DE=c("yes", "no"))
t <- matrix(c(k, n-k, m-k, N+k-n-m),
                        nrow=2, ncol=2, dimnames=dnames)
t

OUTPUT

         DE
GS        yes    no
  inside   18  2105
  outside  36 39627

R

fisher.test(t, alternative="greater")

OUTPUT


	Fisher's Exact Test for Count Data

data:  t
p-value = 7.944e-11
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
 5.541517      Inf
sample estimates:
odds ratio 
  9.411737 

Gene ontology analysis with clusterProfiler


Contribute!

Here we illustrate how to assess the enrichment on the entire collection of Gene Ontology (GO) gene sets using the package clusterProfiler. Could we illustrate any missing important feature of this package for this analysis objective? Could we briefly mention other packages that may be useful for this task?

Second, let’s perform a gene set analysis for an entire collection of gene sets using the Bioconductor package clusterProfiler. For this purpose, we will fetch the results for the contrast between two time points.

R

resTime <- DESeq2::results(dds, contrast = c("time", "Day8", "Day0"))
summary(resTime)

OUTPUT


out of 32652 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 4472, 14%
LFC < 0 (down)     : 4276, 13%
outliers [1]       : 10, 0.031%
low counts [2]     : 8732, 27%
(mean count < 1)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

Select DE genes between Day0 and Day8 with FDR < 5% and minimum 1.5-fold change.

R

timeDE <- as.data.frame(subset(resTime, padj < 0.05 & abs(log2FoldChange) > log2(1.5)))
dim(timeDE)

OUTPUT

[1] 2110    6

R

timeDE <- timeDE[order(abs(timeDE$log2FoldChange), decreasing=TRUE), ]
head(timeDE)

OUTPUT

              baseMean log2FoldChange     lfcSE      stat       pvalue
LOC105245444  2.441873       4.768938 0.9013067  5.291138 1.215573e-07
LOC105246405  9.728219       4.601505 0.6101832  7.541186 4.657174e-14
4933427D06Rik 1.480365       4.556126 1.0318402  4.415535 1.007607e-05
A930006I01Rik 2.312732      -4.353155 0.9176026 -4.744053 2.094837e-06
LOC105245223  3.272536       4.337202 0.8611255  5.036666 4.737099e-07
A530053G22Rik 1.554735       4.243903 1.0248977  4.140806 3.460875e-05
                      padj
LOC105245444  1.800765e-06
LOC105246405  2.507951e-12
4933427D06Rik 9.169093e-05
A930006I01Rik 2.252139e-05
LOC105245223  6.047199e-06
A530053G22Rik 2.720142e-04

R

timeDEgenes <- rownames(timeDE)
head(timeDEgenes)

OUTPUT

[1] "LOC105245444"  "LOC105246405"  "4933427D06Rik" "A930006I01Rik"
[5] "LOC105245223"  "A530053G22Rik"

R

length(timeDEgenes)

OUTPUT

[1] 2110

Call the enrichGO() function from clusterProfiler as follows.

R

library(clusterProfiler)
library(org.Mm.eg.db)
library(enrichplot)

resTimeGO <- enrichGO(gene = timeDEgenes,
                      keyType = "SYMBOL",
                      universe = rownames(se),
                      OrgDb = org.Mm.eg.db,
                      ont = "BP",
                      pvalueCutoff = 0.01,
                      qvalueCutoff = 0.01)
dim(resTimeGO)

OUTPUT

[1] 18  9

R

head(resTimeGO)

OUTPUT

                   ID                                           Description
GO:0071674 GO:0071674                            mononuclear cell migration
GO:0035456 GO:0035456                           response to interferon-beta
GO:0050900 GO:0050900                                   leukocyte migration
GO:0030595 GO:0030595                                  leukocyte chemotaxis
GO:0035458 GO:0035458                  cellular response to interferon-beta
GO:0002523 GO:0002523 leukocyte migration involved in inflammatory response
           GeneRatio   BgRatio       pvalue     p.adjust       qvalue
GO:0071674   32/1245 178/21025 1.612273e-08 7.986143e-05 7.581201e-05
GO:0035456   17/1245  59/21025 3.142295e-08 7.986143e-05 7.581201e-05
GO:0050900   50/1245 374/21025 5.961138e-08 1.010016e-04 9.588020e-05
GO:0030595   34/1245 221/21025 2.991816e-07 3.801850e-04 3.609074e-04
GO:0035458   14/1245  48/21025 4.368673e-07 4.441193e-04 4.216000e-04
GO:0002523   10/1245  25/21025 7.368444e-07 6.242300e-04 5.925780e-04
                                                                                                                                                                                                                                                                                           geneID
GO:0071674                                                                                                  Tnfsf18/Aire/Ccl17/Ccr7/Nlrp12/Ccl2/Retnlg/Apod/Il12a/Ccl5/Fut7/Ccl7/Spn/Itgb3/Grem1/Ptk2b/Lgals3/Adam8/Dusp1/Ch25h/Nbl1/Alox5/Padi2/Plg/Calr/Ager/Slamf9/Ccl6/Mdk/Itga4/Hsd3b7/Trpm4
GO:0035456                                                                                                                                                                          Tgtp1/Tgtp2/F830016B08Rik/Iigp1/Ifitm6/Igtp/Gm4951/Bst2/Oas1c/Irgm1/Gbp6/Ifi47/Aim2/Ifitm7/Irgm2/Ifit1/Ifi204
GO:0050900 Tnfsf18/Aire/Ccl17/Ccr7/Nlrp12/Bst1/Ccl2/Retnlg/Ppbp/Cxcl5/Apod/Il12a/Ccl5/Fut7/Ccl7/Ccl28/Spn/Sell/Itgb3/Grem1/Cxcl1/Ptk2b/Lgals3/Adam8/Pf4/Dusp1/Ch25h/S100a8/Nbl1/Alox5/Padi2/Plg/Edn3/Il33/Ptn/Ada/Emp2/Enpp1/Calr/Ager/Slamf9/Ccl6/Prex1/Aoc3/Itgam/Mdk/Itga4/Hsd3b7/P2ry12/Trpm4
GO:0030595                                                                                      Tnfsf18/Ccl17/Ccr7/Bst1/Ccl2/Retnlg/Ppbp/Cxcl5/Il12a/Ccl5/Ccl7/Sell/Grem1/Cxcl1/Ptk2b/Lgals3/Adam8/Pf4/Dusp1/Ch25h/S100a8/Nbl1/Alox5/Padi2/Edn3/Ptn/Calr/Slamf9/Ccl6/Prex1/Itgam/Mdk/Hsd3b7/Trpm4
GO:0035458                                                                                                                                                                                             Tgtp1/Tgtp2/F830016B08Rik/Iigp1/Igtp/Gm4951/Oas1c/Irgm1/Gbp6/Ifi47/Aim2/Irgm2/Ifit1/Ifi204
GO:0002523                                                                                                                                                                                                                                   Ccl2/Ppbp/Fut7/Adam8/S100a8/Alox5/Ptn/Aoc3/Itgam/Mdk
           Count
GO:0071674    32
GO:0035456    17
GO:0050900    50
GO:0030595    34
GO:0035458    14
GO:0002523    10

Let’s build a more readable table of results.

R

library(kableExtra)

resTimeGOtab <- as.data.frame(resTimeGO)
resTimeGOtab$ID <- NULL
resTimeGOtab$geneID <- sapply(strsplit(resTimeGO$geneID, "/"), paste, collapse=", ")
ktab <- kable(resTimeGOtab, row.names=TRUE, caption="GO results for DE genes between time points.")
kable_styling(ktab, bootstrap_options=c("stripped", "hover", "responsive"), fixed_thead=TRUE)
GO results for DE genes between time points.
Description GeneRatio BgRatio pvalue p.adjust qvalue geneID Count
GO:0071674 mononuclear cell migration 32/1245 178/21025 0.00e+00 0.0000799 0.0000758 Tnfsf18, Aire, Ccl17, Ccr7, Nlrp12, Ccl2, Retnlg, Apod, Il12a, Ccl5, Fut7, Ccl7, Spn, Itgb3, Grem1, Ptk2b, Lgals3, Adam8, Dusp1, Ch25h, Nbl1, Alox5, Padi2, Plg, Calr, Ager, Slamf9, Ccl6, Mdk, Itga4, Hsd3b7, Trpm4 32
GO:0035456 response to interferon-beta 17/1245 59/21025 0.00e+00 0.0000799 0.0000758 Tgtp1, Tgtp2, F830016B08Rik, Iigp1, Ifitm6, Igtp, Gm4951, Bst2, Oas1c, Irgm1, Gbp6, Ifi47, Aim2, Ifitm7, Irgm2, Ifit1, Ifi204 17
GO:0050900 leukocyte migration 50/1245 374/21025 1.00e-07 0.0001010 0.0000959 Tnfsf18, Aire, Ccl17, Ccr7, Nlrp12, Bst1, Ccl2, Retnlg, Ppbp, Cxcl5, Apod, Il12a, Ccl5, Fut7, Ccl7, Ccl28, Spn, Sell, Itgb3, Grem1, Cxcl1, Ptk2b, Lgals3, Adam8, Pf4, Dusp1, Ch25h, S100a8, Nbl1, Alox5, Padi2, Plg, Edn3, Il33, Ptn, Ada, Emp2, Enpp1, Calr, Ager, Slamf9, Ccl6, Prex1, Aoc3, Itgam, Mdk, Itga4, Hsd3b7, P2ry12, Trpm4 50
GO:0030595 leukocyte chemotaxis 34/1245 221/21025 3.00e-07 0.0003802 0.0003609 Tnfsf18, Ccl17, Ccr7, Bst1, Ccl2, Retnlg, Ppbp, Cxcl5, Il12a, Ccl5, Ccl7, Sell, Grem1, Cxcl1, Ptk2b, Lgals3, Adam8, Pf4, Dusp1, Ch25h, S100a8, Nbl1, Alox5, Padi2, Edn3, Ptn, Calr, Slamf9, Ccl6, Prex1, Itgam, Mdk, Hsd3b7, Trpm4 34
GO:0035458 cellular response to interferon-beta 14/1245 48/21025 4.00e-07 0.0004441 0.0004216 Tgtp1, Tgtp2, F830016B08Rik, Iigp1, Igtp, Gm4951, Oas1c, Irgm1, Gbp6, Ifi47, Aim2, Irgm2, Ifit1, Ifi204 14
GO:0002523 leukocyte migration involved in inflammatory response 10/1245 25/21025 7.00e-07 0.0006242 0.0005926 Ccl2, Ppbp, Fut7, Adam8, S100a8, Alox5, Ptn, Aoc3, Itgam, Mdk 10
GO:0050953 sensory perception of light stimulus 26/1245 164/21025 4.10e-06 0.0029769 0.0028259 Aipl1, Vsx2, Nxnl2, Lrit3, Cryba2, Bfsp2, Lrat, Gabrr2, Lum, Rlbp1, Pde6g, Gpr179, Col1a1, Cplx3, Best1, Ush1g, Rs1, Rdh5, Guca1b, Th, Ppef2, Rbp4, Olfm2, Rom1, Vsx1, Rpe65 26
GO:0071675 regulation of mononuclear cell migration 21/1245 117/21025 4.70e-06 0.0029799 0.0028288 Tnfsf18, Aire, Ccr7, Ccl2, Apod, Il12a, Ccl5, Ccl7, Spn, Itgb3, Grem1, Ptk2b, Lgals3, Adam8, Dusp1, Nbl1, Padi2, Calr, Ager, Mdk, Itga4 21
GO:0060326 cell chemotaxis 38/1245 298/21025 7.00e-06 0.0039475 0.0037473 Tnfsf18, Ccl17, Ccr7, Bst1, Ccl2, Retnlg, Ppbp, Cxcl5, Nr4a1, Il12a, Ccl5, Ccl7, Ccl28, Sell, Grem1, Cxcl1, Ptk2b, Lgals3, Adam8, Pf4, Dusp1, Ch25h, S100a8, Nbl1, Alox5, Padi2, Edn3, Ptn, Plxnb3, Calr, Lpar1, Slamf9, Ccl6, Prex1, Itgam, Mdk, Hsd3b7, Trpm4 38
GO:0007601 visual perception 25/1245 160/21025 8.10e-06 0.0041224 0.0039134 Aipl1, Vsx2, Nxnl2, Lrit3, Cryba2, Bfsp2, Lrat, Gabrr2, Lum, Rlbp1, Pde6g, Gpr179, Col1a1, Cplx3, Best1, Rs1, Rdh5, Guca1b, Th, Ppef2, Rbp4, Olfm2, Rom1, Vsx1, Rpe65 25
GO:0002460 adaptive immune response based on somatic recombination of immune receptors built from immunoglobulin superfamily domains 42/1245 350/21025 1.09e-05 0.0050165 0.0047621 Tnfsf18, Aire, Ccr7, Gzmb, Cd70, H2-Q6, H2-Q7, Il12a, Il12rb1, H2-Q4, Il18rap, Fut7, Spn, Icosl, Pirb, H2-Q2, Irf7, Cd274, Tnfrsf13c, Il2rb, Masp2, C8g, C4b, Il33, H2-Q1, Ada, Emp2, C3, Enpp1, Tfrc, Cd46, H2-K1, Rorc, Csf2rb, Ager, H2-T23, Tap2, Tnfsf13b, Pla2g4a, Trpm4, Parp3, Klhl6 42
GO:1990266 neutrophil migration 21/1245 125/21025 1.36e-05 0.0057697 0.0054771 Ccl17, Ccr7, Bst1, Ccl2, Ppbp, Cxcl5, Ccl5, Fut7, Ccl7, Sell, Cxcl1, Lgals3, Adam8, Pf4, S100a8, Edn3, Emp2, Ccl6, Prex1, Itgam, Mdk 21
GO:0036336 dendritic cell migration 9/1245 27/21025 1.54e-05 0.0057975 0.0055036 Tnfsf18, Ccr7, Nlrp12, Retnlg, Il12a, Alox5, Calr, Slamf9, Trpm4 9
GO:0002685 regulation of leukocyte migration 31/1245 230/21025 1.60e-05 0.0057975 0.0055036 Tnfsf18, Aire, Ccr7, Bst1, Ccl2, Apod, Il12a, Ccl5, Fut7, Ccl7, Ccl28, Spn, Sell, Itgb3, Grem1, Ptk2b, Lgals3, Adam8, Dusp1, Nbl1, Padi2, Edn3, Il33, Ptn, Ada, Calr, Ager, Aoc3, Mdk, Itga4, P2ry12 31
GO:0030593 neutrophil chemotaxis 18/1245 100/21025 2.11e-05 0.0071625 0.0067993 Ccl17, Ccr7, Bst1, Ccl2, Ppbp, Cxcl5, Ccl5, Ccl7, Sell, Cxcl1, Lgals3, Pf4, S100a8, Edn3, Ccl6, Prex1, Itgam, Mdk 18
GO:0016264 gap junction assembly 7/1245 17/21025 2.88e-05 0.0086355 0.0081976 Gjd3, Aplnr, Gja5, Agt, Hopx, Cav1, Ace 7
GO:1903028 positive regulation of opsonization 6/1245 12/21025 2.89e-05 0.0086355 0.0081976 Pla2g5, Masp2, C4b, Colec11, C3, Cfp 6
GO:0034341 response to type II interferon 22/1245 142/21025 3.18e-05 0.0089664 0.0085118 Ccl17, Gbp4, Ccl2, Tgtp1, H2-Q7, Il12rb1, Ifitm6, Ccl5, Ccl7, Igtp, Nos2, Nlrc5, Bst2, Irgm1, Gbp6, Capg, Ifitm7, Gbp9, Gbp5, Irgm2, Ccl6, Aqp4 22

Keypoints

  • Gene set enrichment analysis helps to answer “what are the common biological functions affected in the experiment”.
  • Gene sets can be retrieved from several online resources, including mSigDB, KEGG, reactome, Gene Ontology.
  • A rich set of gene set analysis methods are available. Some methods require genes to be classified into ‘significant’ and ‘non-significant’ before the gene set analysis, while others ingest the full ranking of genes.