Multi-sample analyses

Last updated on 2025-12-19 | Edit this page

Estimated time: 45 minutes

Overview

Questions

  • How can we integrate data from multiple batches, samples, and studies?
  • How can we identify differentially expressed genes between experimental conditions for each cell type?
  • How can we identify changes in cell type abundance between experimental conditions?

Objectives

  • Correct batch effects and diagnose potential problems such as over-correction.
  • Perform differential expression comparisons between conditions based on pseudo-bulk samples.
  • Perform differential abundance comparisons between conditions.

Setup and data exploration


As before, we will use the the wild-type data from the Tal1 chimera experiment:

  • Sample 5: E8.5 injected cells (tomato positive), pool 3
  • Sample 6: E8.5 host cells (tomato negative), pool 3
  • Sample 7: E8.5 injected cells (tomato positive), pool 4
  • Sample 8: E8.5 host cells (tomato negative), pool 4
  • Sample 9: E8.5 injected cells (tomato positive), pool 5
  • Sample 10: E8.5 host cells (tomato negative), pool 5

Note that this is a paired design in which for each biological replicate (pool 3, 4, and 5), we have both host and injected cells.

We start by loading the data and doing a quick exploratory analysis, essentially applying the normalization and visualization techniques that we have seen in the previous lectures to all samples. Note that this time we’re selecting samples 5 to 10, not just 5 by itself. Also note the type = "processed" argument: we are explicitly selecting the version of the data that has already been QC processed.

R

library(MouseGastrulationData)
library(batchelor)
library(edgeR)
library(scater)
library(ggplot2)
library(scran)
library(pheatmap)
library(scuttle)

sce <- WTChimeraData(samples = 5:10, type = "processed")

R

sce

OUTPUT

class: SingleCellExperiment
dim: 29453 20935
metadata(0):
assays(1): counts
rownames(29453): ENSMUSG00000051951 ENSMUSG00000089699 ...
  ENSMUSG00000095742 tomato-td
rowData names(2): ENSEMBL SYMBOL
colnames(20935): cell_9769 cell_9770 ... cell_30702 cell_30703
colData names(11): cell barcode ... doub.density sizeFactor
reducedDimNames(2): pca.corrected.E7.5 pca.corrected.E8.5
mainExpName: NULL
altExpNames(0):

R

colData(sce)

OUTPUT

DataFrame with 20935 rows and 11 columns
                  cell          barcode    sample       stage    tomato
           <character>      <character> <integer> <character> <logical>
cell_9769    cell_9769 AAACCTGAGACTGTAA         5        E8.5      TRUE
cell_9770    cell_9770 AAACCTGAGATGCCTT         5        E8.5      TRUE
cell_9771    cell_9771 AAACCTGAGCAGCCTC         5        E8.5      TRUE
cell_9772    cell_9772 AAACCTGCATACTCTT         5        E8.5      TRUE
cell_9773    cell_9773 AAACGGGTCAACACCA         5        E8.5      TRUE
...                ...              ...       ...         ...       ...
cell_30699  cell_30699 TTTGTCACAGCTCGCA        10        E8.5     FALSE
cell_30700  cell_30700 TTTGTCAGTCTAGTCA        10        E8.5     FALSE
cell_30701  cell_30701 TTTGTCATCATCGGAT        10        E8.5     FALSE
cell_30702  cell_30702 TTTGTCATCATTATCC        10        E8.5     FALSE
cell_30703  cell_30703 TTTGTCATCCCATTTA        10        E8.5     FALSE
                pool stage.mapped        celltype.mapped closest.cell
           <integer>  <character>            <character>  <character>
cell_9769          3        E8.25             Mesenchyme   cell_24159
cell_9770          3         E8.5            Endothelium   cell_96660
cell_9771          3         E8.5              Allantois  cell_134982
cell_9772          3         E8.5             Erythroid3  cell_133892
cell_9773          3        E8.25             Erythroid1   cell_76296
...              ...          ...                    ...          ...
cell_30699         5         E8.5             Erythroid3   cell_38810
cell_30700         5         E8.5       Surface ectoderm   cell_38588
cell_30701         5        E8.25 Forebrain/Midbrain/H..   cell_66082
cell_30702         5         E8.5             Erythroid3  cell_138114
cell_30703         5         E8.0                Doublet   cell_92644
           doub.density sizeFactor
              <numeric>  <numeric>
cell_9769    0.02985045    1.41243
cell_9770    0.00172753    1.22757
cell_9771    0.01338013    1.15439
cell_9772    0.00218402    1.28676
cell_9773    0.00211723    1.78719
...                 ...        ...
cell_30699   0.00146287   0.389311
cell_30700   0.00374155   0.588784
cell_30701   0.05651258   0.624455
cell_30702   0.00108837   0.550807
cell_30703   0.82369305   1.184919

For the sake of making these examples run faster, we drop low quality cells (stripped nuclei and doublets) and also randomly select 50% cells per sample.

R

drop <- sce$celltype.mapped %in% c("stripped", "Doublet")

sce <- sce[,!drop]

set.seed(29482)

idx <- unlist(tapply(colnames(sce), sce$sample, function(x) {
    perc <- round(0.50 * length(x))
    sample(x, perc)
}))

sce <- sce[,idx]

We now normalize the data, run some dimensionality reduction steps, and visualize the data in a tSNE plot. In this case we have many different cell types, so we define a custom palette with many visually distinct colors (adapted from the polychrome palette in the pals package).

R

sce <- logNormCounts(sce)

dec <- modelGeneVar(sce, block = sce$sample)

chosen.hvgs <- dec$bio > 0

sce <- runPCA(sce, subset_row = chosen.hvgs, ntop = 1000)

sce <- runTSNE(sce, dimred = "PCA")

sce$sample <- as.factor(sce$sample)

plotTSNE(sce, colour_by = "sample")

R

color_vec <- c("#5A5156", "#E4E1E3", "#F6222E", "#FE00FA", "#16FF32", "#3283FE", 
               "#FEAF16", "#B00068", "#1CFFCE", "#90AD1C", "#2ED9FF", "#DEA0FD", 
               "#AA0DFE", "#F8A19F", "#325A9B", "#C4451C", "#1C8356", "#85660D", 
               "#B10DA1", "#3B00FB", "#1CBE4F", "#FA0087", "#333333", "#F7E1A0", 
               "#C075A6", "#782AB6", "#AAF400", "#BDCDFF", "#822E1C", "#B5EFB5", 
               "#7ED7D1", "#1C7F93", "#D85FF7", "#683B79", "#66B0FF", "#FBE426")

plotTSNE(sce, colour_by = "celltype.mapped") +
    scale_color_manual(values = color_vec) +
    theme(legend.position = "bottom")

There are evident sample effects. Depending on the analysis that you want to perform you may want to remove or retain the sample effect. For instance, if the goal is to identify cell types with a clustering method, one may want to remove the sample effects with “batch effect” correction methods.

For now, let’s assume that we want to remove this effect.

Challenge

Challenge

It seems like samples 5 and 6 are clearly separated off the other samples in gene expression space. Given the group of cells in each sample, why might this make sense for these samples as opposed to some other pair of samples? What is the factor presumably leading to this difference?

Samples 5 and 6 were from the same “pool” of cells. Looking at the documentation for the dataset under ?WTChimeraData we see that the pool variable is defined as: “Integer, embryo pool from which cell derived; samples with same value are matched.” So samples 5 and 6 have an experimental factor in common which causes a shared, systematic difference in their gene expression profiles compared to the other samples. That’s why you can see many isolated blue/orange clusters on the first TSNE plot. If you were developing single-cell library preparation protocols you might want to preserve this effect to understand how variation in pools leads to variation in expression, but for now, given that we’re investigating other effects, we’ll want to remove this as undesired technical variation.

Correcting batch effects


We “correct” the effect of samples with the correctExperiment function in the batchelor package, using the sample column as the batch variable.

R

set.seed(10102)

merged <- correctExperiments(
    sce, 
    batch = sce$sample, 
    subset.row = chosen.hvgs,
    PARAM = FastMnnParam(
        merge.order = list(
            list(1,3,5), # WT (3 replicates)
            list(2,4,6)  # td-Tomato (3 replicates)
        )
    )
)

merged <- runTSNE(merged, dimred = "corrected")

plotTSNE(merged, colour_by = "batch")

We can also see that when coloring cells by cell type, the cell types are now largely confined to individual clusters:

R

plotTSNE(merged, colour_by = "celltype.mapped") +
    scale_color_manual(values = color_vec) +
    theme(legend.position = "bottom")

Once we have removed the sample effect, we can proceed with the differential expression (DE) analysis.

Challenge

Challenge

True or False? After batch correction, no batch-level information is present in the corrected data.

False. Batch-level data can be retained through confounding with experimental factors or poor ability to distinguish experimental effects from batch effects. Remember, the changes needed to correct the data are empirically estimated, so they can carry along error.

While batch effect correction algorithms usually do a pretty good job, it’s smart to do a sanity check for batch effects at the end of your analysis. You always want to make sure that that effect you’re resting your paper submission on isn’t driven by batch effects.

Differential Expression


In order to perform a differential expression (DE) analysis, we need to identify groups of cells across samples/conditions (depending on the experimental design and the overall goal of the experiment).

As we have seen before, there are two ways of grouping cells, cell clustering and cell labeling. Here, we apply the second approach to group cells according to the already annotated cell types to proceed with the computation of the pseudo-bulk samples.

Pseudo-bulk samples

To compute differences between groups of cells, a possible way is to compute pseudo-bulk samples, where we summarize the gene expression for all the cells of each specific cell type. We are then able to detect differences in gene expression between two different conditions for one cell type at a time.

To compute pseudo-bulk samples, we use the aggregateAcrossCells function in the scuttle package, which takes as input not only a SingleCellExperiment, but also the label used for the identification of cell groups/types. Here, we use as not just the cell type label, but also the sample ID, as we want be able to discern between replicates and conditions later in the analysis.

R

# Using 'label' and 'sample' as our two factors; each column of the output
# corresponds to one unique combination of these two factors.

summed <- aggregateAcrossCells(
    merged, 
    id = colData(merged)[,c("celltype.mapped", "sample")]
)

summed

OUTPUT

class: SingleCellExperiment
dim: 13641 179
metadata(2): merge.info pca.info
assays(1): counts
rownames(13641): ENSMUSG00000051951 ENSMUSG00000025900 ...
  ENSMUSG00000096730 ENSMUSG00000095742
rowData names(3): rotation ENSEMBL SYMBOL
colnames: NULL
colData names(15): batch cell ... sample ncells
reducedDimNames(5): corrected pca.corrected.E7.5 pca.corrected.E8.5 PCA
  TSNE
mainExpName: NULL
altExpNames(0):

Differential Expression (DE) Analysis

The main advantage of using pseudo-bulk samples is that we can use established methods for bulk DE analysis like edgeR and DESeq2. Both, edgeR and DESeq2, are based on negative binomial models, but differ in their normalization strategies and several implementation details.

First, let’s start with a specific cell type, for instance the “Mesenchymal stem cells”, and analyze gene expression differences between conditions for this cell type. We store the counts table in a DGEList data container called y, along with experimental metadata.

R

current <- summed[, summed$celltype.mapped == "Mesenchyme"]

y <- DGEList(counts(current), samples = colData(current))

y

OUTPUT

An object of class "DGEList"
$counts
                   Sample1 Sample2 Sample3 Sample4 Sample5 Sample6
ENSMUSG00000051951       2       0       0       0       1       0
ENSMUSG00000025900       0       0       0       0       0       0
ENSMUSG00000025902       4       0       2       0       3       6
ENSMUSG00000033845     765     130     508     213     781     305
ENSMUSG00000002459       2       0       1       0       0       0
13636 more rows ...

$samples
        group lib.size norm.factors batch cell barcode sample stage tomato pool
Sample1     1  2478901            1     5 <NA>    <NA>      5  E8.5   TRUE    3
Sample2     1   548407            1     6 <NA>    <NA>      6  E8.5  FALSE    3
Sample3     1  1260187            1     7 <NA>    <NA>      7  E8.5   TRUE    4
Sample4     1   578699            1     8 <NA>    <NA>      8  E8.5  FALSE    4
Sample5     1  2092329            1     9 <NA>    <NA>      9  E8.5   TRUE    5
Sample6     1   904929            1    10 <NA>    <NA>     10  E8.5  FALSE    5
        stage.mapped celltype.mapped closest.cell doub.density sizeFactor
Sample1         <NA>      Mesenchyme         <NA>           NA         NA
Sample2         <NA>      Mesenchyme         <NA>           NA         NA
Sample3         <NA>      Mesenchyme         <NA>           NA         NA
Sample4         <NA>      Mesenchyme         <NA>           NA         NA
Sample5         <NA>      Mesenchyme         <NA>           NA         NA
Sample6         <NA>      Mesenchyme         <NA>           NA         NA
        celltype.mapped.1 sample.1 ncells
Sample1        Mesenchyme        5    151
Sample2        Mesenchyme        6     28
Sample3        Mesenchyme        7    127
Sample4        Mesenchyme        8     75
Sample5        Mesenchyme        9    239
Sample6        Mesenchyme       10    146

We usually want to discard low quality samples with low sequencing depth / library size as they have the potential to skew normalization and/or DE analysis.

We can see that in our case we don’t have low quality samples, so there is no need for such a filtering step.

R

discarded <- current$ncells < 10

y <- y[,!discarded]

summary(discarded)

OUTPUT

   Mode   FALSE
logical       6 

Typically, we also want to filter out genes with too low of an expression to be meaningfully retained in a statistcal analysis for differential expression.

R

keep <- filterByExpr(y, group = current$tomato)

y <- y[keep,]

summary(keep)

OUTPUT

   Mode   FALSE    TRUE
logical    9121    4520 

We can now proceed with normalizing the data. There are several approaches for normalizing bulk data, that are thus readily applicable to pseudo-bulk data. Here, we use the Trimmed Mean of M-values (TMM) method, implemented in the edgeR package within the calcNormFactors function.

R

y <- calcNormFactors(y)

y$samples

OUTPUT

        group lib.size norm.factors batch cell barcode sample stage tomato pool
Sample1     1  2478901    1.0506857     5 <NA>    <NA>      5  E8.5   TRUE    3
Sample2     1   548407    1.0399112     6 <NA>    <NA>      6  E8.5  FALSE    3
Sample3     1  1260187    0.9700083     7 <NA>    <NA>      7  E8.5   TRUE    4
Sample4     1   578699    0.9871129     8 <NA>    <NA>      8  E8.5  FALSE    4
Sample5     1  2092329    0.9695559     9 <NA>    <NA>      9  E8.5   TRUE    5
Sample6     1   904929    0.9858611    10 <NA>    <NA>     10  E8.5  FALSE    5
        stage.mapped celltype.mapped closest.cell doub.density sizeFactor
Sample1         <NA>      Mesenchyme         <NA>           NA         NA
Sample2         <NA>      Mesenchyme         <NA>           NA         NA
Sample3         <NA>      Mesenchyme         <NA>           NA         NA
Sample4         <NA>      Mesenchyme         <NA>           NA         NA
Sample5         <NA>      Mesenchyme         <NA>           NA         NA
Sample6         <NA>      Mesenchyme         <NA>           NA         NA
        celltype.mapped.1 sample.1 ncells
Sample1        Mesenchyme        5    151
Sample2        Mesenchyme        6     28
Sample3        Mesenchyme        7    127
Sample4        Mesenchyme        8     75
Sample5        Mesenchyme        9    239
Sample6        Mesenchyme       10    146

To investigate the effect of the normalization, we use a Mean-Difference (MD) plot for each sample in order to detect possible normalization issues due to insufficient cells/reads/UMIs in any of the pseudo-bulk profiles.

In our case, we verify that all these plots are centered on 0 (\(y\)-axis) and display a trumpet shape, as expected.

R

par(mfrow = c(2,3))

for (i in seq_len(ncol(y))) {
    plotMD(y, column = i)
}

R

par(mfrow = c(1,1))

Furthermore, we want to check if the samples cluster together based on known experimental factors (like the tomato injection in this case).

Here, we use a multidimensional scaling (MDS) plot to inspect this. Multidimensional scaling (also called principal coordinate analysis (PCoA)) is a dimensionality reduction technique that’s conceptually similar to principal component analysis (PCA).

R

limma::plotMDS(cpm(y, log = TRUE), 
               col = ifelse(y$samples$tomato, "red", "blue"))

We then construct a design matrix with the tomato variable as the main factors and pool as an additional covariate.

R

design <- model.matrix(~factor(pool) + factor(tomato),
                       data = y$samples)
design

OUTPUT

        (Intercept) factor(pool)4 factor(pool)5 factor(tomato)TRUE
Sample1           1             0             0                  1
Sample2           1             0             0                  0
Sample3           1             1             0                  1
Sample4           1             1             0                  0
Sample5           1             0             1                  1
Sample6           1             0             1                  0
attr(,"assign")
[1] 0 1 1 2
attr(,"contrasts")
attr(,"contrasts")$`factor(pool)`
[1] "contr.treatment"

attr(,"contrasts")$`factor(tomato)`
[1] "contr.treatment"

Now we can estimate the Negative Binomial (NB) overdispersion parameter, to model the mean-variance trend.

R

y <- estimateDisp(y, design)

summary(y$trended.dispersion)

OUTPUT

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.01002 0.01591 0.02472 0.02131 0.02574 0.02652 

The BCV plot allows us to visualize the relationship between the Biological Coefficient of Variation and the Average log CPM for each gene. Additionally, the Common and Trend BCV are shown in red and blue.

R

plotBCV(y)

We then fit a Quasi-Likelihood (QL) negative binomial generalized linear model for each gene. The robust = TRUE parameter avoids distortions from highly variable clusters. The QL method includes an additional dispersion parameter for incorporating the uncertainty and variability of the per-gene variance, which is not well estimated by the NB dispersions, so the two dispersion types complement each other in the final analysis.

R

fit <- glmQLFit(y, design, robust = TRUE)

summary(fit$var.prior)

OUTPUT

Length  Class   Mode
     0   NULL   NULL 

R

summary(fit$df.prior)

OUTPUT

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.6968  7.9221  7.9221  7.8972  7.9221  7.9221 

QL dispersion estimates for each gene as a function of abundance. Raw estimates (black) are shrunk towards the trend (blue) to yield squeezed estimates (red).

R

plotQLDisp(fit)

We then use an empirical Bayes quasi-likelihood F-test to test for differential expression (due to tomato injection) for each gene at a False Discovery Rate (FDR) of 5%. The low number of DE genes shwos that tomato injection does not have a major impact on gene expression in mesenchymal cells.

R

res <- glmQLFTest(fit, coef = ncol(design))

summary(decideTests(res))

OUTPUT

       factor(tomato)TRUE
Down                    5
NotSig               4510
Up                      5

R

topTags(res)

OUTPUT

Coefficient:  factor(tomato)TRUE
                        logFC   logCPM          F       PValue          FDR
ENSMUSG00000010760 -4.1560168 9.973704 1070.96054 1.940322e-11 8.770254e-08
ENSMUSG00000096768  1.9992563 8.844258  379.81665 3.081008e-09 6.963079e-06
ENSMUSG00000086503 -6.4902580 7.411257  248.15766 1.911494e-07 2.879984e-04
ENSMUSG00000035299  1.7979019 6.904163  126.18570 5.814967e-07 6.570913e-04
ENSMUSG00000101609  1.3746824 7.310009   81.40822 4.275035e-06 3.864631e-03
ENSMUSG00000019188 -1.0193854 7.545530   62.43651 1.374972e-05 1.035812e-02
ENSMUSG00000024423  0.9943824 7.391075   59.10516 1.742749e-05 1.125318e-02
ENSMUSG00000042607 -0.9541504 7.468203   45.63814 5.201678e-05 2.938948e-02
ENSMUSG00000027520  1.5873733 6.952923   41.78827 7.480284e-05 3.594063e-02
ENSMUSG00000036446 -0.8307640 9.401028   41.16271 7.951468e-05 3.594063e-02

All the previous steps can be conveniently performed for each cell type, with the pseudoBulkDGE function from the scran package.

R

summed.filt <- summed[,summed$ncells >= 10]

de.results <- pseudoBulkDGE(
    summed.filt, 
    label = summed.filt$celltype.mapped,
    design = ~factor(pool) + tomato,
    coef = "tomatoTRUE",
    condition = summed.filt$tomato 
)

The returned object is a list of DataFrames each storing the results for one of the cell types. Each of these DataFrames also contains the intermediate results of the full edgeR pipeline carried out above, which allows us to apply diagnostics and visualization of individual steps of the pipeline.

R

cur.results <- de.results[["Allantois"]]

cur.results[order(cur.results$PValue),]

OUTPUT

DataFrame with 13641 rows and 5 columns
                       logFC    logCPM         F      PValue         FDR
                   <numeric> <numeric> <numeric>   <numeric>   <numeric>
ENSMUSG00000037664  -8.00072  11.56010  3349.683 1.15757e-27 4.81318e-24
ENSMUSG00000010760  -2.57691  12.41096  1095.755 8.21809e-22 1.70854e-18
ENSMUSG00000086503  -7.01824   7.50026   762.544 6.36342e-20 8.81970e-17
ENSMUSG00000096768   1.82579   9.33347   314.684 1.83678e-15 1.90934e-12
ENSMUSG00000022464   0.96701  10.28038   119.758 6.65050e-11 5.53056e-08
...                      ...       ...       ...         ...         ...
ENSMUSG00000095247        NA        NA        NA          NA          NA
ENSMUSG00000096808        NA        NA        NA          NA          NA
ENSMUSG00000079808        NA        NA        NA          NA          NA
ENSMUSG00000096730        NA        NA        NA          NA          NA
ENSMUSG00000095742        NA        NA        NA          NA          NA
Challenge

Challenge

Clearly some of the results have low p-values. What about the effect sizes? What does logFC stand for?

“logFC” stands for log fold-change, typically on a log2 scale. That means a 2-fold increase in gene expression corresponds to a logFC of log2(2) = 1.

ENSMUSG00000037664 seems to have an estimated logFC of about -8. That points to a large decrease in expression of that gene in Allantois cells of the tomato positive samples.

Differential Abundance (DA) analysis


In addition to differences in gene expression, we also want to find differences in cell type abundance between conditions (here in tomato positive vs wild type samples).

Therefore, we first quantify the number of cells for each cell type, and then fit a model to detect differences between the injected cells and the background.

This process is very similar to differential expression analysis, but here we apply the analysis on the computed abundances without normalizing the data first.

R

abundances <- table(merged$celltype.mapped, merged$sample) 

abundances <- unclass(abundances) 

extra.info <- colData(merged)[match(colnames(abundances), merged$sample),]

y.ab <- DGEList(abundances, samples = extra.info)

design <- model.matrix(~factor(pool) + factor(tomato), y.ab$samples)

y.ab <- estimateDisp(y.ab, design, trend = "none")

ERROR

Error in loglik + prior.n * m0: non-conformable arrays

R

fit.ab <- glmQLFit(y.ab, design, robust = TRUE, abundance.trend = FALSE)

Background on compositional effect

We don’t normalize the abundance data with the calcNormFactors function, as this would implicitly work under the assumption that most of the input features do not vary between conditions. This is typically not a reasonable assumption for cell type abundances as we often only have a few different cell populations that all can change with different experimental conditions. This means that here we will not normalize for library size, which in abundance data corresponds to the total number of cells in each sample (cell type).

However, this can lead our data to be susceptible to compositional effects. “Compositional” refers to the fact that the cluster abundances in a sample are not independent of one another because each cell type is effectively competing for space in the sample. They behave like proportions in that they must sum to 1. If the abundance of cell type A increases under a certain condition, we consequenlty observe less abundance of all other cell types, even if all other cell types are not directly affected by this condition.

Not accounting for compositionality means that any conclusions derived from the DA analysis can be biased by the amount of cells present for each cell type. And it is not uncommon that the number of cells can be strongly unbalanced between cell types, with some low abundance cell types comprising close to 0 percent and certain high abundance cell types making up close to 100 percent of all cells in a sample.

We now look at different approaches for handling the compositional effect.

Assuming most labels do not change

We can use a similar approach as for the DE analysis, assuming that most labels are not changing, in particular if we consider the fact that only few genes where found to be differentially expressed in the analysis above.

To do so, we first normalize the data with calcNormFactors and then we fit and estimate a QL-model for the abundance data.

R

y.ab2 <- calcNormFactors(y.ab)

y.ab2$samples$norm.factors

OUTPUT

[1] 1.1029040 1.0228173 1.0695358 0.7686501 1.0402941 1.0365354

We then use functions from edgeR as before:

R

y.ab2 <- estimateDisp(y.ab2, design, trend = "none")

ERROR

Error in loglik + prior.n * m0: non-conformable arrays

R

fit.ab2 <- glmQLFit(y.ab2, design, robust = TRUE, abundance.trend = FALSE)

res2 <- glmQLFTest(fit.ab2, coef = ncol(design))

summary(decideTests(res2))

OUTPUT

       factor(tomato)TRUE
Down                    2
NotSig                 32
Up                      0

R

topTags(res2, n = 10)

OUTPUT

Coefficient:  factor(tomato)TRUE
                       logFC   logCPM         F       PValue          FDR
ExE ectoderm      -5.7548150 13.13052 36.592240 6.975860e-08 2.371793e-06
Parietal endoderm -6.9054141 12.34396 25.026083 4.230885e-06 7.192505e-05
Mesenchyme         0.9661857 16.32776  6.489987 1.311439e-02 1.220524e-01
Erythroid3        -0.9188382 17.34602  6.313470 1.435911e-02 1.220524e-01
Neural crest      -1.0212778 14.84714  5.444709 2.259224e-02 1.536272e-01
ExE endoderm      -3.9992886 10.75223  4.356716 4.061361e-02 2.301438e-01
Endothelium        0.8725903 14.12053  3.393104 6.983028e-02 3.391756e-01
Cardiomyocytes     0.6943555 14.93781  2.662584 1.073567e-01 4.562660e-01
Allantois          0.5914769 15.55508  2.120304 1.499595e-01 5.665138e-01
Erythroid2        -0.5257535 15.97144  1.742353 1.912673e-01 6.503088e-01

Testing against a log-fold change threshold

An alternative approach assumes that the composition bias introduces a spurious log2-fold change of no more than a quantity for a non-DA label.

In other words, we interpret this as the maximum log-fold change in the total number of cells given by DA in other labels. On the other hand, when choosing , we should not consider fold-differences in the totals due to differences in capture efficiency or for the case that the size of the original cell population is not attributable to composition bias. We then mitigate the effect of composition biases by testing each label for changes in abundance beyond .

R

res.lfc <- glmTreat(fit.ab, coef = ncol(design), lfc = 1)

summary(decideTests(res.lfc))

OUTPUT

       factor(tomato)TRUE
Down                    2
NotSig                 32
Up                      0

R

topTags(res.lfc)

OUTPUT

Coefficient:  factor(tomato)TRUE
                         logFC unshrunk.logFC   logCPM       PValue
ExE ectoderm        -5.5017654     -5.9296357 13.07357 6.705761e-06
Parietal endoderm   -6.5845020    -27.4411287 12.28571 1.247000e-04
ExE endoderm        -3.9304866    -23.9322019 10.76304 6.597463e-02
Mesenchyme           1.1604318      1.1616585 16.35326 1.442717e-01
Endothelium          1.0475417      1.0530630 14.14043 2.211749e-01
Caudal neurectoderm -1.4682413     -1.6212501 11.10535 3.169903e-01
Cardiomyocytes       0.8628677      0.8654665 14.97008 3.478012e-01
Neural crest        -0.8281842     -0.8307410 14.84525 3.726793e-01
Allantois            0.7832266      0.7846135 15.55470 4.217972e-01
Def. endoderm        0.7225404      0.7356721 12.49927 4.304678e-01
                             FDR
ExE ectoderm        0.0002279959
Parietal endoderm   0.0021198993
ExE endoderm        0.7477125066
Mesenchyme          0.9876518478
Endothelium         0.9876518478
Caudal neurectoderm 0.9876518478
Cardiomyocytes      0.9876518478
Neural crest        0.9876518478
Allantois           0.9876518478
Def. endoderm       0.9876518478

Addionally, the choice of can be guided by other external experimental data, like a previous or a pilot experiment.

Exercises


Challenge

Exercise 1: Heatmaps

Use the pheatmap package to create a heatmap of the abundances table. Does it comport with the model results?

You can simply hand pheatmap() a matrix as its only argument. pheatmap() has a million options you can adjust, but the defaults are usually pretty good. Try to overlay sample-level information with the annotation_col argument for an extra challenge.

R

pheatmap(y.ab$counts)

R

anno_df <- y.ab$samples[,c("tomato", "pool")]

anno_df$pool = as.character(anno_df$pool)

anno_df$tomato <- ifelse(anno_df$tomato,
                         "tomato+",
                         "tomato-")

pheatmap(y.ab$counts,
         annotation_col = anno_df)

The top DA result was a decrease in ExE ectoderm in the tomato condition, which you can sort of see, especially if you log1p() the counts or discard rows that show much higher values. ExE ectoderm counts were much higher in samples 8 and 10 compared to 5, 7, and 9.

Challenge

Exercise 2: Model specification and comparison

Try re-running the pseudobulk DGE without the pool factor in the design specification. Compare the logFC estimates and the distribution of p-values for the Erythroid3 cell type.

After running the second pseudobulk DGE, you can join the two DataFrames of Erythroid3 statistics using the merge() function. You will need to create a common key column from the gene IDs.

R

de.results2 <- pseudoBulkDGE(
    summed.filt, 
    label = summed.filt$celltype.mapped,
    design = ~tomato,
    coef = "tomatoTRUE",
    condition = summed.filt$tomato 
)

eryth1 <- de.results$Erythroid3

eryth2 <- de.results2$Erythroid3

eryth1$gene <- rownames(eryth1)

eryth2$gene <- rownames(eryth2)

comp_df <- merge(eryth1, eryth2, by = 'gene')

comp_df <- comp_df[!is.na(comp_df$logFC.x),]

ggplot(comp_df, aes(logFC.x, logFC.y)) + 
    geom_abline(lty = 2, color = "grey") +
    geom_point() 

R

# Reshape to long format for ggplot facets. This is 1000x times easier to do
# with tidyverse packages:
pval_df <- reshape(comp_df[,c("gene", "PValue.x", "PValue.y")],
                   direction = "long", 
                   v.names = "Pvalue",
                   timevar = "pool_factor",
                   times = c("with pool factor", "no pool factor"),
                   varying = c("PValue.x", "PValue.y"))

ggplot(pval_df, aes(Pvalue)) + 
    geom_histogram(boundary = 0,
                   bins = 30) + 
    facet_wrap("pool_factor")

We can see that in this case, the logFC estimates are strongly consistent between the two models, which tells us that the inclusion of the pool factor in the model doesn’t strongly influence the estimate of the tomato coefficients in this case.

The p-value histograms both look alright here, with a largely flat plateau over most of the 0 - 1 range and a spike near 0. This is consistent with the hypothesis that most genes are unaffected by tomato but there are a small handful that clearly are.

If there were large shifts in the logFC estimates or p-value distributions, that’s a sign that the design specification change has a large impact on how the model sees the data. If that happens, you’ll need to think carefully and critically about what variables should and should not be included in the model formula.

Challenge

Extension challenge 1: Group effects

Having multiple independent samples in each experimental group is always helpful, but it’s particularly important when it comes to batch effect correction. Why?

It’s important to have multiple samples within each experimental group because it helps the batch effect correction algorithm distinguish differences due to batch effects (uninteresting) from differences due to group/treatment/biology (interesting).

Imagine you had one sample that received a drug treatment and one that did not, each with 10,000 cells. They differ substantially in expression of gene X. Is that an important scientific finding? You can’t tell for sure, because the effect of drug is indistinguishable from a sample-wise batch effect. But if the difference in gene X holds up when you have five treated samples and five untreated samples, now you can be a bit more confident. Many batch effect correction methods will take information on experimental factors as additional arguments, which they can use to help remove batch effects while retaining experimental differences.

Checklist

Further Reading

Key Points
  • Batch effects are systematic technical differences in the observed expression in cells measured in different experimental batches.
  • Computational removal of batch-to-batch variation with the correctExperiment function from the batchelor package allows us to combine data across multiple batches for a consolidated downstream analysis.
  • Differential expression (DE) analysis of replicated multi-condition scRNA-seq experiments is typically based on pseudo-bulk expression profiles, generated by summing counts for all cells with the same combination of label and sample.
  • The aggregateAcrossCells function from the scater package facilitates the creation of pseudo-bulk samples.
  • The pseudoBulkDGE function from the scran package can be used to detect significant changes in expression between conditions for pseudo-bulk samples consisting of cells of the same type.
  • Differential abundance (DA) analysis aims at identifying significant changes in cell type abundance across conditions.
  • DA analysis uses bulk DE methods such as edgeR and DESeq2, which provide suitable statistical models for count data in the presence of limited replication - except that the counts are not of reads per gene, but of cells per label.

Session Info


R

sessionInfo()

OUTPUT

R version 4.5.2 (2025-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

time zone: UTC
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
 [1] pheatmap_1.0.13              scran_1.38.0
 [3] scater_1.38.0                ggplot2_4.0.1
 [5] scuttle_1.20.0               edgeR_4.8.1
 [7] limma_3.66.0                 batchelor_1.26.0
 [9] MouseGastrulationData_1.24.0 SpatialExperiment_1.20.0
[11] SingleCellExperiment_1.32.0  SummarizedExperiment_1.40.0
[13] Biobase_2.70.0               GenomicRanges_1.62.1
[15] Seqinfo_1.0.0                IRanges_2.44.0
[17] S4Vectors_0.48.0             BiocGenerics_0.56.0
[19] generics_0.1.4               MatrixGenerics_1.22.0
[21] matrixStats_1.5.0            BiocStyle_2.38.0

loaded via a namespace (and not attached):
 [1] DBI_1.2.3                 formatR_1.14
 [3] gridExtra_2.3             httr2_1.2.2
 [5] rlang_1.1.6               magrittr_2.0.4
 [7] otel_0.2.0                compiler_4.5.2
 [9] RSQLite_2.4.5             DelayedMatrixStats_1.32.0
[11] png_0.1-8                 vctrs_0.6.5
[13] pkgconfig_2.0.3           crayon_1.5.3
[15] fastmap_1.2.0             dbplyr_2.5.1
[17] magick_2.9.0              XVector_0.50.0
[19] labeling_0.4.3            rmarkdown_2.30
[21] ggbeeswarm_0.7.3          purrr_1.2.0
[23] bit_4.6.0                 bluster_1.20.0
[25] xfun_0.55                 cachem_1.1.0
[27] beachmat_2.26.0           blob_1.2.4
[29] DelayedArray_0.36.0       BiocParallel_1.44.0
[31] cluster_2.1.8.1           irlba_2.3.5.1
[33] parallel_4.5.2            R6_2.6.1
[35] RColorBrewer_1.1-3        Rcpp_1.1.0
[37] knitr_1.50                splines_4.5.2
[39] Matrix_1.7-4              igraph_2.2.1
[41] tidyselect_1.2.1          viridis_0.6.5
[43] abind_1.4-8               yaml_2.3.12
[45] codetools_0.2-20          curl_7.0.0
[47] lattice_0.22-7            tibble_3.3.0
[49] withr_3.0.2               KEGGREST_1.50.0
[51] BumpyMatrix_1.18.0        S7_0.2.1
[53] Rtsne_0.17                evaluate_1.0.5
[55] BiocFileCache_3.0.0       ExperimentHub_3.0.0
[57] Biostrings_2.78.0         pillar_1.11.1
[59] BiocManager_1.30.27       filelock_1.0.3
[61] renv_1.1.5                BiocVersion_3.22.0
[63] sparseMatrixStats_1.22.0  scales_1.4.0
[65] glue_1.8.0                metapod_1.18.0
[67] tools_4.5.2               AnnotationHub_4.0.0
[69] BiocNeighbors_2.4.0       ScaledMatrix_1.18.0
[71] locfit_1.5-9.12           cowplot_1.2.0
[73] grid_4.5.2                AnnotationDbi_1.72.0
[75] beeswarm_0.4.0            BiocSingular_1.26.1
[77] vipor_0.4.7               cli_3.6.5
[79] rsvd_1.0.5                rappdirs_0.3.3
[81] viridisLite_0.4.2         S4Arrays_1.10.1
[83] dplyr_1.1.4               ResidualMatrix_1.20.0
[85] gtable_0.3.6              digest_0.6.39
[87] dqrng_0.4.1               ggrepel_0.9.6
[89] SparseArray_1.10.7        rjson_0.2.23
[91] farver_2.1.2              memoise_2.0.1
[93] htmltools_0.5.9           lifecycle_1.0.4
[95] httr_1.4.7                statmod_1.5.1
[97] bit64_4.6.0-1