This tutorial will walk through multimodal intersection analysis (MIA) for the integration of single-cell RNA-seq (scRNA-seq) and microarray-based spatial transcriptomics (ST) data, as described in Moncada et al. Nature Biotechnology (2020). We will start with the scRNA-seq and ST for the PDAC-A sample.
Let’s begin by loading the filtered scRNA-seq data for PDAC-A (GEO accession GSE111672) and converting the data to a Seurat object for analysis.
The scRNA-seq expression matrix from GEO already has cluster annotations as the column header, so we won’t need to cluster the data.
Now that we have a set of marker genes for the cell types and for the ST tissue regions, we can now integrate both datasets with MIA to determine where the scRNA-seq defined cell types localize in the tissue.
Briefly, MIA uses the hypergeometric cumulative distribution to measure the enrichment of cell-type specific marker genes (from scRNA-seq) within a list of tissue-region specific genes (from ST). We use this as a proxy for estimating the likelihood each cell type is localized to a given tissue region.
First, let’s compile a list of marker genes for both the ST and scRNA-seq data:
# Create a list object containing the marker genes for each ST region:
st.clusts <- Idents(st) %>% levels()
N <- length(st.clusts)
st.marker.list <- vector(mode = 'list', length = N)
names(st.marker.list) <- st.clusts
for(i in st.clusts) {
st.marker.list[[i]] <- st.markers[st.markers$cluster == i,'gene']
}
# Create a list object containing the marker genes for each cell type:
sc.clusts <- Idents(sc) %>% levels()
M <- length(sc.clusts)
sc.marker.list <- vector(mode = 'list', length = M)
names(sc.marker.list) <- sc.clusts
for (i in sc.clusts) {
sc.marker.list[[i]] <- sc.markers[sc.markers$cluster == i,'gene']
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
Let’s first demonstrate MIA with a single example: fibroblasts (scRNA-seq) and the cancer region (ST):
gene.universe <- length(rownames(st))
cell.type.markers <- sc.marker.list[["Fibroblasts"]] # Genes specific to fibroblasts
tissue.region.markers <- st.marker.list[["Cancer"]] # Genes specific to the cancer region
common.genes <- intersect(cell.type.markers, # Common genes between these two gene sets
tissue.region.markers)
# MIA:
A <- length(common.genes)
B <- length(cell.type.markers)
C <- length(tissue.region.markers)
enr <- -log10(phyper(A, B, gene.universe-B, C, lower.tail = FALSE))
dep <- -log10(1-phyper(A, B, gene.universe-B, C, lower.tail = FALSE))
if (enr < dep) {
MIA.results <- -dep
} else {
MIA.results <- enr
}
print(MIA.results)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
What this signifies is that the fibroblast genes are significantly enriched within the genes specific to the cancer region of the tissue.
With a -log10(p-value) of ~13, one way to think about this is that there’s approximately a 0.00000000000001% chance we observed such an overlap between the cell-type and tissue-region genes simply by chance.
In parallel we also test for cell type depletion by computing -log10(1-p). Here, we are estimating the likelihood for an observed lack of overlap between gene sets.
Let’s use MIA to determine the cell type enrichments for all cell types and tissue regions:
# Initialize a dataframe for us to store values in:
N <- length(st.clusts) ; M <- length(sc.clusts)
MIA.results <- matrix(0,nrow = M, ncol = N)
row.names(MIA.results) <- sc.clusts
colnames(MIA.results) <- st.clusts
# Gene universe
gene.universe <- length(rownames(st))
# Loop over ST clusters
for (i in 1:N) {
# Then loop over SC clusters
for (j in 1:M) {
genes1 <- st.marker.list[[st.clusts[i]]]
genes2 <- sc.marker.list[[sc.clusts[j]]]
# Hypergeometric
A <- length(intersect(genes1,genes2))
B <- length(genes1)
C <- length(genes2)
enr <- -log10(phyper(A, B, gene.universe-B, C, lower.tail = FALSE))
dep <- -log10(1-phyper(A, B, gene.universe-B, C, lower.tail = FALSE))
if (enr < dep) {
MIA.results[j,i] = -dep
} else {
MIA.results[j,i] = enr
}
}
}
# Some results were -Inf...check why this is the case...
MIA.results[is.infinite(MIA.results)] <- 0