Discovery of Spatial Ecotypes from Multiple Samples
Source:vignettes/Integration.Rmd
Integration.Rmd
Overview
In this tutorial, we will illustrate how to identify conserved spatial ecotypes (SEs) across multiple samples using the MultiSpatialEcoTyper function. Each sample represents a single-cell spatial transcriptomics data, including a gene expression profile and associated single-cell metadata.
We will be analyzing single-cell spatial transcriptomics data from
cancer samples provided by Vizgen’s MERSCOPE FFPE
Human Immuno-oncology. For efficiency, we have selected a subset
regions from a melanoma sample and a colon cancer sample for the
integrative analysis. You can download the gene expression profiles and
single-cell metadata here
.
The melanoma sample includes spatial expression data for 500 genes across 27,907 cells, while the colon cancer sample contains data for 38,080 cells. In both samples, cells were categorized into ten distinct cell types: B cells, CD4+ T cells, CD8+ T cells, NK cells, plasma cells, macrophages, dendritic cells (DC), fibroblasts, endothelial cells, and the cancer cell of origin (CCOs). Due to their high heterogeneity, CCOs were excluded from this demonstration.
All cells were also annotated into four spatial regions, including tumor, inner margin, outer margin, and stroma. The tumor and stroma regions were defined based on the density of cancer cells. The inner and outer margins were delineated as regions extending 250 μm inside and outside the tumor boundaries, respectively. In addition, we assessed distance of each single cell to tumor/stroma interface by computing the shortest Euclidean distance to tumor regions (or stromal regions) for cells localized in stromal region (or tumor regions). A positive distance is used for cells localized in tumor region, and a negative distance is used for cells localized in stroma region.
First load required packages for this vignette
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(parallel))
suppressPackageStartupMessages(library(Seurat))
suppressPackageStartupMessages(library(data.table))
suppressPackageStartupMessages(library(googledrive))
suppressPackageStartupMessages(library(NMF))
suppressPackageStartupMessages(library(ComplexHeatmap))
library(SpatialEcoTyper)
Loading data
Spatial EcoTyper analysis requires two input data for each sample:
- gene expression matrix: rows represent gene names and columns represent cell IDs
- meta data: a data frame with at least three columns, including “X” (X-coordinate), “Y” (Y-coordinate), and “CellType” (the cell type annotation). The row names of the meta data should match the column names (cell IDs) in the expression matrix.
Text files as input
First, download the demo data from Google Drive
drive_deauth() # Disable Google sign-in requirement
drive_download(as_id("1CoQmU3u8MoVC8RbLUvTDQmOuJJ703HHB"), "HumanMelanomaPatient1_subset_counts.tsv.gz", overwrite = TRUE)
drive_download(as_id("1CgUOQKrWY_TG61o5aw7J9LZzE20D6NuI"), "HumanMelanomaPatient1_subset_scmeta.tsv", overwrite = TRUE)
drive_download(as_id("1ChwONUjr_yoURodnkDBj68ZUbjdHtmP6"), "HumanColonCancerPatient2_subset_counts.tsv.gz", overwrite = TRUE)
drive_download(as_id("1CipRjjD7cqzqKO0Yf4LUdsEw1XDzP6JS"), "HumanColonCancerPatient2_subset_scmeta.tsv", overwrite = TRUE)
Then, load text files into R. You can use read.table
for
small files or data.table::fread
for larger files.
# Load single-cell metadata
# Row names should be cell ids. Required columns: X, Y, CellType;
# Recommend a 'Region' column if pathologist region annotations are available
scmeta1 <- read.table("HumanMelanomaPatient1_subset_scmeta.tsv",
sep = "\t", header = TRUE, row.names = 1)
scmeta2 <- read.table("HumanColonCancerPatient2_subset_scmeta.tsv",
sep = "\t", header = TRUE, row.names = 1)
head(scmeta1[, c("X", "Y", "CellType", "Region")])
## X Y CellType Region
## HumanMelanomaPatient1__cell_3655 1894.706 -6367.766 Fibroblast Stroma
## HumanMelanomaPatient1__cell_3657 1942.480 -6369.602 Fibroblast Stroma
## HumanMelanomaPatient1__cell_3658 1963.007 -6374.026 Fibroblast Stroma
## HumanMelanomaPatient1__cell_3660 1981.600 -6372.266 Fibroblast Stroma
## HumanMelanomaPatient1__cell_3661 1742.939 -6374.851 Fibroblast Stroma
## HumanMelanomaPatient1__cell_3663 1921.683 -6383.309 Fibroblast Stroma
## X Y CellType Region
## HumanColonCancerPatient2__cell_1085 5057.912 -1721.871 Macrophage Inner margin
## HumanColonCancerPatient2__cell_1205 5035.339 -1783.352 Macrophage Inner margin
## HumanColonCancerPatient2__cell_1312 5012.093 -1908.390 Macrophage Inner margin
## HumanColonCancerPatient2__cell_1625 5073.743 -1861.740 Macrophage Inner margin
## HumanColonCancerPatient2__cell_1629 5030.235 -1867.441 Macrophage Inner margin
## HumanColonCancerPatient2__cell_1639 5020.448 -1882.751 Macrophage Inner margin
# Load single-cell gene expression data. Rows represent gene names and columns represent cell IDs
scdata1 <- fread("HumanMelanomaPatient1_subset_counts.tsv.gz",
sep = "\t",header = TRUE, data.table = FALSE)
rownames(scdata1) <- scdata1[, 1] # Setting the first column as row names
scdata1 <- as.matrix(scdata1[, -1]) # Dropping first column
scdata2 <- fread("HumanColonCancerPatient2_subset_counts.tsv.gz",
sep = "\t",header = TRUE, data.table = FALSE)
rownames(scdata2) <- scdata2[, 1] # Setting the first column as row names
scdata2 <- as.matrix(scdata2[, -1]) # Dropping first column
head(scdata1[,1:5])
## HumanMelanomaPatient1__cell_3655 HumanMelanomaPatient1__cell_3657
## PDK4 0 1
## TNFRSF17 0 0
## ICAM3 0 0
## FAP 1 0
## GZMB 0 0
## TSC2 0 0
## HumanMelanomaPatient1__cell_3658 HumanMelanomaPatient1__cell_3660
## PDK4 1 0
## TNFRSF17 0 0
## ICAM3 0 0
## FAP 0 0
## GZMB 0 0
## TSC2 0 0
## HumanMelanomaPatient1__cell_3661
## PDK4 0
## TNFRSF17 0
## ICAM3 0
## FAP 0
## GZMB 0
## TSC2 0
head(scdata2[,1:5])
## HumanColonCancerPatient2__cell_1085 HumanColonCancerPatient2__cell_1205
## PDK4 0 0
## CCL26 0 0
## CX3CL1 0 0
## PGLYRP1 0 0
## CD4 0 0
## SNAI2 0 0
## HumanColonCancerPatient2__cell_1312 HumanColonCancerPatient2__cell_1625
## PDK4 1 0
## CCL26 0 0
## CX3CL1 0 0
## PGLYRP1 0 0
## CD4 0 0
## SNAI2 0 0
## HumanColonCancerPatient2__cell_1629
## PDK4 0
## CCL26 0
## CX3CL1 0
## PGLYRP1 0
## CD4 0
## SNAI2 0
Sparse matrix as input
Mtx files can be loaded into R using the ReadMtx
function from the Seurat package.
drive_download(as_id("13M3xhRxp0xK9gf5F4DE9idSBFqVQIXDT"), "HumanMelanomaPatient1_subset_counts.mtx.gz", overwrite = TRUE)
drive_download(as_id("136feRaFjMtNvduLTm5xqa3WhyyoG4Xzo"), "HumanMelanomaPatient1_subset_cells.tsv.gz", overwrite = TRUE)
drive_download(as_id("13QprWzJhzzUy_w3XSrjlt9pjf2n-G7HV"), "HumanMelanomaPatient1_subset_genes.tsv.gz", overwrite = TRUE)
drive_download(as_id("17fH9BAAugYi1FMLrMuTxojtOoklFBB-K"), "HumanColonCancerPatient2_subset_counts.mtx.gz", overwrite = TRUE)
drive_download(as_id("17a1f1VjxJSje_uyPt6zA97zTps9ko6rk"), "HumanColonCancerPatient2_subset_cells.tsv.gz", overwrite = TRUE)
drive_download(as_id("17fH0jE5b2YqJ5fpQtYil27k-7YhzArD6"), "HumanColonCancerPatient2_subset_genes.tsv.gz", overwrite = TRUE)
scdata1 <- ReadMtx(mtx = "HumanMelanomaPatient1_subset_counts.mtx.gz", cells = "HumanMelanomaPatient1_subset_cells.tsv.gz", features = "HumanMelanomaPatient1_subset_genes.tsv.gz", feature.column = 1, cell.column = 1)
scdata2 <- ReadMtx(mtx = "HumanColonCancerPatient2_subset_counts.mtx.gz", cells = "HumanColonCancerPatient2_subset_cells.tsv.gz", features = "HumanColonCancerPatient2_subset_genes.tsv.gz", feature.column = 1, cell.column = 1)
Data normalization
Gene expression data should be normalized for the
SpatialEcoTyper
analysis. This can be achieved using either
NormalizeData
or SCTransform.
Alternatively, you may skip this step and allow the MultiSpatialEcoTyper
function to handle normalization by specifying the desired method (e.g.,
SCT) in the normalization.method
argument.
Using SCTransform
for the normalization:
For faster computation, it is recommended to install the
glmGamPoi
package for the SCTransform
normalization.
## Install the glmGamPoi package
if(!"glmGamPoi" %in% installed.packages()){
BiocManager::install("glmGamPoi")
}
## Data normalization
tmpobj1 <- CreateSeuratObject(scdata1) %>%
SCTransform(clip.range = c(-10, 10), verbose = FALSE)
tmpobj2 <- CreateSeuratObject(scdata2) %>%
SCTransform(clip.range = c(-10, 10), verbose = FALSE)
## Extract the normalized gene expression data
seurat_version = as.integer(gsub("\\..*", "", as.character(packageVersion("SeuratObject"))))
if(seurat_version<5){
normdata1 <- GetAssayData(tmpobj1, "data")
normdata2 <- GetAssayData(tmpobj2, "data")
}else{
normdata1 <- tmpobj1[["SCT"]]$data
normdata2 <- tmpobj2[["SCT"]]$data
}
Using NormalizeData
for the normalization:
normdata1 <- NormalizeData(scdata1)
normdata2 <- NormalizeData(scdata2)
Preview of the samples
The SpatialView function can be used to visualize the spatial distribution of single cells within the tissues.
# Visualize the cell type annotations in the tissue
p1 <- SpatialView(scmeta1, by = "CellType") + labs(title = "SKCM") +
scale_color_manual(values = pals::cols25()) + theme(legend.position = "none")
p2 <- SpatialView(scmeta2, by = "CellType") + labs(title = "CRC") +
scale_color_manual(values = pals::cols25())
p1 + p2
The SpatialView function can also be used to visualize spatial landscape of these regions in the samples.
# Visualize the regions in the tissue
p1 <- SpatialView(scmeta1, by = "Region") + labs(title = "SKCM") +
scale_color_brewer(type = "qual", palette = "Set1") +
theme(legend.position = "none")
p2 <- SpatialView(scmeta2, by = "Region") + labs(title = "CRC") +
scale_color_brewer(type = "qual", palette = "Set1")
p1 + p2
You can also visualize continuous characteristics, such as the minimum distance of each single cell to the tumor/stroma margin, using the SpatialView function. Here, positive distances indicate cells located within the tumor region, while negative distances denote cells within the stroma.
# Visualize the distance to tumor margin
p1 <- SpatialView(scmeta1, by = "Dist2Interface") + labs(title = "SKCM") +
scale_colour_gradient2(low = "#5e3c99", high = "#e66101", mid = "#d9d9d9", midpoint = 0) +
theme(legend.position = "none")
p2 <- SpatialView(scmeta2, by = "Dist2Interface") + labs(title = "CRC") +
scale_colour_gradient2(low = "#5e3c99", high = "#e66101", mid = "#d9d9d9", midpoint = 0) + labs(color = "Distance to\ntumor margin")
p1 + p2
## Visualize gene expression in the tissue
gg <- scmeta1
gg$Expression <- normdata1["STAT1", ]
p1 <- SpatialView(gg, by = "Expression") +
scale_color_viridis_c() + labs(color = "STAT1")
gg <- scmeta2
gg$Expression <- normdata2["STAT1", ]
p2 <- SpatialView(gg, by = "Expression") +
scale_color_viridis_c() + labs(color = "STAT1")
p1 + p2
Spatial EcoTyper analysis across multiple samples
The MultiSpatialEcoTyper function enables the integration of single-cell spatial transcriptomics data across multiple samples, facilitating the identification of SEs shared across samples or cancer types. The analysis is performed in three stages:
- Independent Spatial EcoTyper Analysis: Spatial EcoTyper is run on each sample independently to identify spatial clusters and construct cell-type-specific gene expression profiles for all spatial clusters.
- Integration: Spatial clusters from different samples are integrated to learn a unified embedding.
- Clustering: Clustering is applied to the unified embedding to identify conserved SEs across samples.
Key arguments for MultiSpatialEcoTyper
-
data_list
A named list of expression matrices where each matrix represents gene expression data for a sample. The columns of each matrix correspond to cells, and the rows correspond to genes. The list names should match sample names; otherwise, samples will be named as ‘Sample1’, ‘Sample2’, etc. -
metadata_list
A named list of metadata data frames, where each data frame contains metadata corresponding to the cells in the expression matrices. Each metadata data frame should include at least three columns (X, Y, CellType), and the row names should match the cell IDs (column names) in the expression matrix. -
outdir
Directory where the results will be saved. Defaults to the current directory with a subdirectory named “SpatialEcoTyper_results_” followed by the current date. -
normalization.method
Method for normalizing the expression data. Options include “None” (default), “SCT”, or other methods compatible with Seurat’sNormalizeData
function. -
nmf_ranks
Integer or vector specifying the number of clusters (10 by default). If a vector is provided, the function tests all ranks and selects the optimal rank for NMF, though this may increase computation time. -
radius
Numeric specifying the radius (in the same units as spatial coordinates) for defining spatial neighborhoods around each cell. Default is 50. -
minibatch
Integer specifying the number of columns (default: 5000) to process in each minibatch in the SNF analysis. This option splits the matrix into smaller chunks, thus reducing memory usage. -
ncores
Integer specifying the number of cores for parallel processing (default: 1).
You can type ?MultiSpatialEcoTyper
to visualize the full
manual.
Important note: In the examples below, we will use
10 clusters and 5 runs per rank for demonstration purposes. To select
the optimal number of clusters and obtain robust results, you can use
the nmf_ranks
argument with a vector of integers to test
multiple ranks for NMF analysis. You can also use the nmfClustering function for
the rank selection, as outlined in the next section
(NMF clustering for SpatialEcoTyper
).
Pathologist region annotation unavailable
MultiSpatialEcoTyper
analysis (without region annotation)
MultiSpatialEcoTyper(data_list, metadata_list,
normalization.method = "None", # Normalization method
nmf_ranks = 10, # Number of clusters
nrun.per.rank = 5, # Recommend 30 or higher for robust results
ncores = 2)
# This demo takes ~3 minutes on a macOS with an Apple M1 Pro chip and 16 GB memory.
Pathologist region annotation available
If region annotations are available, we recommend to include the
region annotations in all metadata and specify the Region
in the MultiSpatialEcoTyper
function, which could improve integration accuracy.
## Annotation of tumor and stroma regions will be considered for the analysis.
metadata_list$SKCM$Region2 = metadata_list$SKCM$Region
metadata_list$CRC$Region2 = metadata_list$CRC$Region
metadata_list$SKCM$Region2 = ifelse(metadata_list$SKCM$Dist2Interface<0, "Stroma", "Tumor")
metadata_list$CRC$Region2 = ifelse(metadata_list$CRC$Dist2Interface<0, "Stroma", "Tumor")
MultiSpatialEcoTyper(data_list, metadata_list,
normalization.method = "None", # Normalization method
min.cts.per.region = 2, # At least two cell types in each microregion
nmf_ranks = 10, # Number of clusters
nrun.per.rank = 5, # Recommend 30 or higher for robust results
Region = "Region2", # Use pathologist annotations if available
ncores = 2)
## This demo takes ~3 minutes to complete on macOS with an Apple M1 Pro chip and 16 GB memory.
Pathologist region annotation available (2)
You can specify the nmf_ranks
argument with a vector of
integers to test multiple ranks, enabling the selection of the optimal
number of clusters.
## Annotation of tumor and stroma regions will be considered for the analysis.
metadata_list$SKCM$Region2 = metadata_list$SKCM$Region
metadata_list$CRC$Region2 = metadata_list$CRC$Region
metadata_list$SKCM$Region2 = ifelse(metadata_list$SKCM$Dist2Interface<0, "Stroma", "Tumor")
metadata_list$CRC$Region2 = ifelse(metadata_list$CRC$Dist2Interface<0, "Stroma", "Tumor")
MultiSpatialEcoTyper(data_list, metadata_list,
normalization.method = "None", # Normalization method
nmf_ranks = seq(4,15,2), # Number of clusters
nrun.per.rank = 30, # 30 runs per rank for robust results
Region = "Region2", # Use pathologist annotations
ncores = 2)
## This demo takes ~16 minutes to complete on macOS with an Apple M1 Pro chip and 16 GB memory.
Using pre-existing Spatial EcoTyper results
If you’ve already run SpatialEcoTyper for each sample, you can integrate the results directly using IntegrateSpatialEcoTyper.
drive_download(as_id("126OSOk8sAaxhJID1D3-75bI9S-to9Lkg"), "CRC_SpatialEcoTyper_results.rds", overwrite = TRUE)
drive_download(as_id("11y0lcpawQAZAvDCnt3tBQsF90Wet0lKe"), "SKCM_SpatialEcoTyper_results.rds", overwrite = TRUE)
data_list <- list(SKCM = normdata1, CRC = normdata2)
SpatialEcoTyper_list <- list(SKCM = readRDS("SKCM_SpatialEcoTyper_results.rds"),
CRC = readRDS("CRC_SpatialEcoTyper_results.rds"))
IntegrateSpatialEcoTyper(SpatialEcoTyper_list, data_list,
outdir = "SpatialEcoTyper_results/",
normalization.method = "None",
nmf_ranks = 10, # Number of clusters
nrun.per.rank = 5, # recommend 30 or higher for robust results
Region = "Region2", # Use pathologist annotations if available
ncores = 2)
# This demo takes ~3 minutes on a macOS with an Apple M1 Pro chip and 16 GB memory.
Optimizing memory usage
For real-world single-cell ST datasets with over 100,000 cells, the
analysis can be both time- and memory-intensive. To speed up the
process, you can increase the number of cores used
(ncores
), but this will also increase memory
consumption.
If you have limited computational memory, there are several
strategies to reduce usage. One approach is to decrease the
minibatch
size and reduce the number of cores
(ncores
) allocated. However, for larger datasets, the
minimum memory requirement may remain high. Another option is to
increase the grid.size
, which reduces the number of spatial
microregions. This can significantly decrease memory usage by
downsampling the ST data, potentially excluding many cells from the
analysis though.
Interested in multicellular communities associated with specific cell types?
If you’re interested in studying multicellular communities associated
with specific cell types, you can use the
filter.region.by.celltypes
argument to focus on
microregions containing at least one cell from the specified cell types.
An example is available in Discovery of
Spatial Ecotypes from a Single Sample.
Interested in regions with multiple cell types?
If you’re interested in regions composed of multiple cell types, you
can use the min.cts.per.region
argument to specify the
minimum number of distinct cell types required for each microregion to
be included in the analysis.
Output results
All results will be saved in the specified directory
(outdir
) with a subdirectory named
SpatialEcoTyper_results_
followed by the current date. The
output files include:
-
(SampleName)_SpatialEcoTyper_results.rds
SpatialEcoTyper
result for each sample. - MultiSE_integrated_final.rds A matrix representing the fused similarity matrix of spatial clusters across all samples.
- MultiSE_integrated_final_hmap.pdf A heatmap visualizing the fused similarity matrix of spatial clusters, grouped by samples and SEs.
- MultiSE_metadata_final.rds Single-cell metadata with an ‘SE’ column annotating the discovered SEs, saved as an RDS file.
-
MultiSE_metadata_final.tsv The same as
MultiSE_metadata_final.rds
, but saved in TSV format. - SpatialView_SEs_by_Sample.pdf A figure showing the spatial landscape of SEs across samples.
- BarView_SEs_Sample_Frac.pdf A bar plot showing the fraction of cells from each sample represented within each SE.
- BarView_SEs_Region_Frac_Avg.pdf A bar plot showing the average fraction of regions represented within each SE.
- BarView_SEs_CellType_Frac_Avg.pdf A bar plot showing the average cell type composition within each SE.
-
MultiSE_NMF_results.rds The NMF result derived from
the
nmfClustering
function. If a single rank was provided, it contains anNMFfitX1
object . If multiple ranks were provided, it is alist
containing the optimal number of communities (bestK
), a list ofNMFfitX1
objects (NMFfits
), and aggplot
object (p
) displaying the cophenetic coefficient for different cluster numbers. - MultiSE_NMF_Cophenetic_Dynamics.pdf A figure showing the cophenetic coefficients for different numbers of clusters, only available when multiple ranks were provided.
The results for the demo data can be downloaded from MultiSpatialEcoTyper_results
.
NMF clustering
Spatial EcoTyper uses NMF to group spatial clusters from multiple samples into conserved SEs. To determine the optimal number of clusters, or rank, we computed cophenetic coefficient, which quantifies the classification stability for a given rank (i.e., the number of clusters) and ranges from 0 to 1, with 1 indicating maximally stability. Typically, the rank at which the cophenetic coefficient begins to decrease is chosen. However, this method can be challenging when the cophenetic coefficient exhibits a multi-modal shape across different ranks. In such cases, for Spatial EcoTyper analysis, we recommend selecting the number of SEs after which the coefficient drops below 0.95 by default.
Selecting the optimal rank
To determine the optimal rank, you can use the nmfClustering function, which
tests multiple ranks and computes the cophenetic coefficient for each.
This analysis requires the fused similarity network matrix
(MultiSE_integrated_final.rds
) generated from previous
step.
Load the files into R:
## Fused similarity matrix of spatial clusters across all samples
outdir = paste0("SpatialEcoTyper_results_", Sys.Date())
integrated = readRDS(file.path(outdir, "MultiSE_integrated_final.rds"))
dim(integrated)
## [1] 376 376
## Single-cell metadata with an 'SE' column annotating the discovered SEs
finalmeta = readRDS(file.path(outdir, "MultiSE_metadata_final.rds"))
head(finalmeta[, c("Sample", "X", "Y", "CellType", "Region", "InitSE", "SE")])
## Sample X Y CellType Region
## HumanMelanomaPatient1__cell_3655 SKCM 1894.706 -6367.766 Fibroblast Stroma
## HumanMelanomaPatient1__cell_3657 SKCM 1942.480 -6369.602 Fibroblast Stroma
## HumanMelanomaPatient1__cell_3658 SKCM 1963.007 -6374.026 Fibroblast Stroma
## HumanMelanomaPatient1__cell_3660 SKCM 1981.600 -6372.266 Fibroblast Stroma
## HumanMelanomaPatient1__cell_3661 SKCM 1742.939 -6374.851 Fibroblast Stroma
## HumanMelanomaPatient1__cell_3663 SKCM 1921.683 -6383.309 Fibroblast Stroma
## InitSE SE
## HumanMelanomaPatient1__cell_3655 SKCM..InitSE77 SE3
## HumanMelanomaPatient1__cell_3657 SKCM..InitSE63 SE4
## HumanMelanomaPatient1__cell_3658 SKCM..InitSE63 SE4
## HumanMelanomaPatient1__cell_3660 SKCM..InitSE63 SE4
## HumanMelanomaPatient1__cell_3661 SKCM..InitSE60 SE3
## HumanMelanomaPatient1__cell_3663 SKCM..InitSE77 SE3
Next, test multiple ranks to find the optimal number of SEs. The running time for this step depends on the data size, number of ranks to test, the number of runs per rank, and the number of cores for parallel analysis. For demonstration, here we test six different ranks ranging from 4 to 15 with 5 runs per rank.
nmf_res <- nmfClustering(integrated, ranks = seq(4,15,2),
nrun.per.rank = 5, # Recommend 30 or higher
min.coph = 0.99, # minimum cophenetic coefficient threshold
ncores = 4,
seed = 2024)
## This process takes ~3 minutes on a macOS with an Apple M1 Pro chip and 16 GB of memory
paste0("The selected rank is ", nmf_res$bestK)
## [1] "The selected rank is 4"
To better understand how the cophenetic coefficient changes with different ranks, you can visualize the results.
plot(nmf_res$p)
Once the optimal rank is determined, you can use it to generate the final SE annotations.
## SKCM..InitSE1 SKCM..InitSE74 SKCM..InitSE110 SKCM..InitSE16 SKCM..InitSE111
## 1 2 2 2 2
## SKCM..InitSE24
## 2
## Levels: 1 2 3 4
You can add the newly defined SEs into the meta data
## CID X
## HumanMelanomaPatient1__cell_3655 HumanMelanomaPatient1__cell_3655 1894.706
## HumanMelanomaPatient1__cell_3657 HumanMelanomaPatient1__cell_3657 1942.480
## HumanMelanomaPatient1__cell_3658 HumanMelanomaPatient1__cell_3658 1963.007
## HumanMelanomaPatient1__cell_3660 HumanMelanomaPatient1__cell_3660 1981.600
## HumanMelanomaPatient1__cell_3661 HumanMelanomaPatient1__cell_3661 1742.939
## HumanMelanomaPatient1__cell_3663 HumanMelanomaPatient1__cell_3663 1921.683
## Y CellType CellTypeName Region
## HumanMelanomaPatient1__cell_3655 -6367.766 Fibroblast Fibroblasts Stroma
## HumanMelanomaPatient1__cell_3657 -6369.602 Fibroblast Fibroblasts Stroma
## HumanMelanomaPatient1__cell_3658 -6374.026 Fibroblast Fibroblasts Stroma
## HumanMelanomaPatient1__cell_3660 -6372.266 Fibroblast Fibroblasts Stroma
## HumanMelanomaPatient1__cell_3661 -6374.851 Fibroblast Fibroblasts Stroma
## HumanMelanomaPatient1__cell_3663 -6383.309 Fibroblast Fibroblasts Stroma
## Dist2Interface Region2 InitSE Sample
## HumanMelanomaPatient1__cell_3655 -883.1752 Stroma SKCM..InitSE77 SKCM
## HumanMelanomaPatient1__cell_3657 -894.8463 Stroma SKCM..InitSE63 SKCM
## HumanMelanomaPatient1__cell_3658 -904.1115 Stroma SKCM..InitSE63 SKCM
## HumanMelanomaPatient1__cell_3660 -907.8909 Stroma SKCM..InitSE63 SKCM
## HumanMelanomaPatient1__cell_3661 -874.2712 Stroma SKCM..InitSE60 SKCM
## HumanMelanomaPatient1__cell_3663 -903.6559 Stroma SKCM..InitSE77 SKCM
## SE
## HumanMelanomaPatient1__cell_3655 SE1
## HumanMelanomaPatient1__cell_3657 SE1
## HumanMelanomaPatient1__cell_3658 SE1
## HumanMelanomaPatient1__cell_3660 SE1
## HumanMelanomaPatient1__cell_3661 SE1
## HumanMelanomaPatient1__cell_3663 SE1
Using a different rank
If you simply want to experiment with a different rank, you can use the nmfClustering function. When a single rank is specified, it directly returns an NMFfit object for predicting SE grouping, as shown below:
nmf_res <- nmfClustering(integrated, ranks = 10, nrun.per.rank = 30,
seed = 1, ncores = 1)
ses <- predict(nmf_res)
## This process takes ~6 minutes on a macOS with an Apple M1 Pro chip and 16 GB of memory
You can add the newly defined SEs into the meta data
## CID X
## HumanMelanomaPatient1__cell_3655 HumanMelanomaPatient1__cell_3655 1894.706
## HumanMelanomaPatient1__cell_3657 HumanMelanomaPatient1__cell_3657 1942.480
## HumanMelanomaPatient1__cell_3658 HumanMelanomaPatient1__cell_3658 1963.007
## HumanMelanomaPatient1__cell_3660 HumanMelanomaPatient1__cell_3660 1981.600
## HumanMelanomaPatient1__cell_3661 HumanMelanomaPatient1__cell_3661 1742.939
## HumanMelanomaPatient1__cell_3663 HumanMelanomaPatient1__cell_3663 1921.683
## Y CellType CellTypeName Region
## HumanMelanomaPatient1__cell_3655 -6367.766 Fibroblast Fibroblasts Stroma
## HumanMelanomaPatient1__cell_3657 -6369.602 Fibroblast Fibroblasts Stroma
## HumanMelanomaPatient1__cell_3658 -6374.026 Fibroblast Fibroblasts Stroma
## HumanMelanomaPatient1__cell_3660 -6372.266 Fibroblast Fibroblasts Stroma
## HumanMelanomaPatient1__cell_3661 -6374.851 Fibroblast Fibroblasts Stroma
## HumanMelanomaPatient1__cell_3663 -6383.309 Fibroblast Fibroblasts Stroma
## Dist2Interface Region2 InitSE Sample
## HumanMelanomaPatient1__cell_3655 -883.1752 Stroma SKCM..InitSE77 SKCM
## HumanMelanomaPatient1__cell_3657 -894.8463 Stroma SKCM..InitSE63 SKCM
## HumanMelanomaPatient1__cell_3658 -904.1115 Stroma SKCM..InitSE63 SKCM
## HumanMelanomaPatient1__cell_3660 -907.8909 Stroma SKCM..InitSE63 SKCM
## HumanMelanomaPatient1__cell_3661 -874.2712 Stroma SKCM..InitSE60 SKCM
## HumanMelanomaPatient1__cell_3663 -903.6559 Stroma SKCM..InitSE77 SKCM
## SE
## HumanMelanomaPatient1__cell_3655 SE4
## HumanMelanomaPatient1__cell_3657 SE10
## HumanMelanomaPatient1__cell_3658 SE10
## HumanMelanomaPatient1__cell_3660 SE10
## HumanMelanomaPatient1__cell_3661 SE4
## HumanMelanomaPatient1__cell_3663 SE4
Visualizing SEs
Visualizing the integrated similarity matrix
After reordering spatial clusters based on the newly identified SEs, you can re-draw the heatmap to show the fused similarity matrix of spatial clusters across samples.
ords = names(ses)[order(ses)]
integrated = integrated[ords, ords] # reorder the cells
ann <- data.frame(Sample = gsub("\\.\\..*", "", rownames(integrated)),
SE = paste0("SE", ses[ords]),
row.names = rownames(integrated)) # cell groups
SE_cols <- getColors(length(unique(ann$SE)), palette = 1) # colors for SEs
names(SE_cols) <- unique(ann$SE)
sample_cols <- getColors(length(unique(ann$Sample)), palette = 2) # colors for samples
names(sample_cols) <- unique(ann$Sample)
## draw heatmap
HeatmapView(integrated, show_row_names = FALSE, show_column_names = FALSE,
top_ann = ann, top_ann_col = list(Sample = sample_cols, SE = SE_cols))
drawRectangleAnnotation(ann$SE, ann$SE)
Visualizing SEs in the tissue
To examine the spatial distribution of the identified SEs, you can use the SpatialView function visualize the SEs in the tissues.
SpatialView(finalmeta, by = "SE") + facet_wrap(~Sample, scales = "free")
Association between SEs and pre-annotated regions
To examine the enrichment of SEs in pre-annotated regions, you can visualize the cell fractions from each region, for each SE.
gg <- finalmeta %>% filter(!is.na(SE)) %>% count(SE, Region, Sample) %>%
group_by(Sample, SE) %>% mutate(Frac = n / sum(n)) %>% ## cell type fractions within each sample
group_by(SE, Region) %>% summarise(Frac = mean(Frac)) ## average cell type fractions across all samples
ggplot(gg, aes(SE, Frac, fill = Region)) +
geom_bar(stat = "identity", position = "fill") +
scale_fill_brewer(type = "qual", palette = "Set1") +
theme_bw(base_size = 14) + coord_flip() +
labs(y = "Fraction")
Association between SEs and pre-annotated regions within each sample
gg <- finalmeta %>% filter(!is.na(SE)) %>% count(SE, Region, Sample)
ggplot(gg, aes(SE, n, fill = Region)) +
geom_bar(stat = "identity", position = "fill") +
scale_fill_brewer(type = "qual", palette = "Set1") +
facet_wrap(~Sample) +
theme_bw(base_size = 14) + coord_flip() +
labs(y = "Fraction of regions")
Distance of SEs to tumor/stroma interface
This box plot visualizes the distribution of distances of SEs to the tumor/stroma interface. Positive distances indicate cells located within the tumor region, while negative distances denote cells within the stroma. The SEs are ordered by their median distance, highlighting their spatial localization relative to the tumor/stroma interface.
gg <- finalmeta %>% filter(!is.na(SE)) %>% group_by(Sample, SE) %>%
summarise(Dist2Interface = mean(Dist2Interface)) %>% arrange(Dist2Interface)
gg$SE = factor(gg$SE, levels = unique(gg$SE))
ggplot(gg, aes(SE, Dist2Interface)) + geom_boxplot() +
geom_point(aes(color = Sample)) + theme_bw() +
labs(y = "Distance to tumor/stroma interface (μm)")
Cell type composition of SEs
We can visualize the cell type composition of SEs in a bar plot. The cell type fractions can be averaged across the two samples.
Average cell type composition of SEs across all samples
gg <- finalmeta %>% filter(!is.na(SE)) %>% count(SE, CellType, Sample) %>%
group_by(Sample, SE) %>% mutate(Frac = n / sum(n)) %>% ## cell type fractions within each sample
group_by(SE, CellType) %>% summarise(Frac = mean(Frac)) ## average cell type fractions across all samples
ggplot(gg, aes(SE, Frac, fill = CellType)) +
geom_bar(stat = "identity", position = "fill") +
scale_fill_manual(values = pals::cols25()) +
theme_bw(base_size = 14) + coord_flip() +
labs(y = "Cell type abundance")
Cell type composition of SEs within each sample
gg <- finalmeta %>% filter(!is.na(SE)) %>% count(SE, CellType, Sample)
ggplot(gg, aes(SE, n, fill = CellType)) +
geom_bar(stat = "identity", position = "fill") +
scale_fill_manual(values = pals::cols25()) +
facet_wrap(~Sample) +
theme_bw(base_size = 14) + coord_flip() +
labs(y = "Cell type abundance")
Identification of cell-type-specific SE markers
You can identify cell-type-specific SE markers by differential
expression analysis using the presto
package. Below is an example of how to identify fibroblast-specific
markers for each SE.
DE analysis within each sample
require("presto")
## DE analysis within the first sample
tmpmeta1 = finalmeta %>% filter(CellType=="Fibroblast" & Sample=="SKCM" & (!is.na(SE)))
tmpdata1 = normdata1[, tmpmeta1$CID]
degs1 = wilcoxauc(tmpdata1, tmpmeta1$SE)
## DE analysis within the second sample
tmpmeta2 = finalmeta %>% filter(CellType=="Fibroblast" & Sample=="CRC" & (!is.na(SE)))
tmpdata2 = normdata2[, tmpmeta2$CID]
degs2 = wilcoxauc(tmpdata2, tmpmeta2$SE) # DE analysis
head(degs1)
## feature group avgExpr logFC statistic auc pval
## 1 PDK4 SE10 0.117555097 0.047787757 4701164 0.5269201 2.300787e-09
## 2 TNFRSF17 SE10 0.000000000 0.000000000 4460984 0.5000000 1.000000e+00
## 3 ICAM3 SE10 0.000000000 0.000000000 4460984 0.5000000 1.000000e+00
## 4 FAP SE10 0.102200894 -0.037071606 4241882 0.4754425 1.963176e-05
## 5 GZMB SE10 0.007441764 -0.008093071 4423277 0.4957737 3.439501e-02
## 6 TSC2 SE10 0.006910210 -0.002427312 4446112 0.4983331 3.257778e-01
## padj pct_in pct_out
## 1 1.030353e-08 14.2638037 9.003215
## 2 1.000000e+00 0.0000000 0.000000
## 3 1.000000e+00 0.0000000 0.000000
## 4 6.684535e-05 13.3435583 18.269512
## 5 8.335732e-02 1.0736196 1.914645
## 6 6.484080e-01 0.9969325 1.330020
head(degs2)
## feature group avgExpr logFC statistic auc pval
## 1 PDK4 SE1 0.07290325 -0.08483371 15082423 0.4580113 3.871812e-25
## 2 CX3CL1 SE1 0.03282075 -0.01465999 16182732 0.4914246 7.090800e-04
## 3 CD4 SE1 0.00000000 0.00000000 16465120 0.5000000 1.000000e+00
## 4 SNAI2 SE1 0.35501044 0.11399360 18215060 0.5531408 6.627100e-26
## 5 TNFRSF17 SE1 0.00000000 0.00000000 16465120 0.5000000 1.000000e+00
## 6 ICAM3 SE1 0.00000000 0.00000000 16465120 0.5000000 1.000000e+00
## padj pct_in pct_out
## 1 2.380875e-24 8.921162 17.125293
## 2 1.848993e-03 4.107884 5.818208
## 3 1.000000e+00 0.000000 0.000000
## 4 4.333913e-25 36.431535 26.939403
## 5 1.000000e+00 0.000000 0.000000
## 6 1.000000e+00 0.000000 0.000000
Identifying conserved markers across samples
To identify markers conserved across different samples, we conduct a meta-analysis by averaging the log2 fold changes (log2FC) from the DE analyses.
library(tidyr)
degs <- merge(degs1[, c(1,2,4)], degs2[, c(1,2,4)], by = c("feature", "group"))
degs$AvgLogFC = (degs$logFC.x + degs$logFC.y) / 2
lfcs = degs %>% pivot_wider(id_cols = feature, names_from = group,
values_from = AvgLogFC) %>% as.data.frame
rownames(lfcs) <- lfcs$feature
lfcs <- lfcs[, -1]
head(lfcs)
## SE10 SE3 SE4 SE6 SE7 SE8
## ACKR3 0.04539229 -0.11089193 0.22761738 -0.13915649 -0.16230522 0.05704506
## ACTA2 0.64868983 0.19121333 -0.28277467 -0.15495698 -0.13060735 0.26294794
## ADAMTS4 -0.07880286 -0.03558349 -0.10298639 0.15231638 0.14598004 -0.10437673
## AKT1 -0.11112575 -0.00885807 -0.19659536 0.11531490 0.32156308 -0.09316424
## AKT2 0.01227246 -0.01219427 0.01067868 -0.01456746 0.01794543 -0.02832473
## AKT3 0.05820973 -0.05104454 -0.02226315 -0.01922768 0.02741800 -0.03496944
## SE9
## ACKR3 -2.297758e-02
## ACTA2 -1.850714e-01
## ADAMTS4 -3.370712e-03
## AKT1 -3.274771e-02
## AKT2 -2.510557e-05
## AKT3 -2.496374e-02
To identify markers that are specific to each SE-associated cell state, we calculate the difference between the maximum log2FC for each gene and the second-highest log2FC:
secondmax = apply(lfcs, 1, function(x){ -sort(-x)[2] })
delta = lfcs - secondmax
## Markers are considered specific if they have a positive delta and a log2FC greater than 0.05:
idx = delta>0 & lfcs>0.05
markers = lapply(colnames(lfcs), function(se){
gs = rownames(idx)[idx[, se]]
gs = gs[order(-lfcs[gs, se])]
gs
})
names(markers) = colnames(lfcs)
markers = markers[lengths(markers)>0]
## Select the top five markers
top5 = lapply(markers, function(x){ x[1:min(5, length(x))] })
top5
## $SE10
## [1] "ACTA2" "COL4A1" "MYH11" "TGFB3" "TNC"
##
## $SE3
## [1] "FN1" "CSF1" "COL5A1"
##
## $SE4
## [1] "FOS" "PLA2G2A" "DUSP1" "EGR1" "FBLN1"
##
## $SE6
## [1] "HLA-DRA" "BST2" "CXCL9" "STAT1" "SOD2"
##
## $SE7
## [1] "PKM" "MMP11" "CTNNB1" "TGFBI" "YAP1"
##
## $SE8
## [1] "SFRP2" "COL1A1" "SMOC2" "COL11A1" "MFAP5"
##
## $SE9
## [1] "XBP1" "CXCR4" "CCL2" "PTPRC" "STAT3"
Session info
The session info allows users to replicate the exact environment and identify potential discrepancies in package versions or configurations that might be causing problems.
## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin20
## Running under: macOS 15.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Los_Angeles
## tzcode source: internal
##
## attached base packages:
## [1] grid parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] tidyr_1.3.1 presto_1.0.0 Rcpp_1.0.13
## [4] pals_1.9 SpatialEcoTyper_1.0.0 RANN_2.6.2
## [7] Matrix_1.7-0 ComplexHeatmap_2.20.0 NMF_0.28
## [10] Biobase_2.64.0 BiocGenerics_0.50.0 cluster_2.1.6
## [13] rngtools_1.5.2 registry_0.5-1 googledrive_2.1.1
## [16] data.table_1.16.0 Seurat_5.1.0 SeuratObject_5.0.2
## [19] sp_2.1-4 ggplot2_3.5.1 dplyr_1.1.4
##
## loaded via a namespace (and not attached):
## [1] RcppAnnoy_0.0.22 splines_4.4.1
## [3] later_1.3.2 tibble_3.2.1
## [5] R.oo_1.26.0 polyclip_1.10-7
## [7] fastDummies_1.7.4 lifecycle_1.0.4
## [9] doParallel_1.0.17 globals_0.16.3
## [11] lattice_0.22-6 MASS_7.3-60.2
## [13] magrittr_2.0.3 plotly_4.10.4
## [15] sass_0.4.9 rmarkdown_2.28
## [17] jquerylib_0.1.4 yaml_2.3.10
## [19] httpuv_1.6.15 glmGamPoi_1.16.0
## [21] sctransform_0.4.1 spam_2.10-0
## [23] spatstat.sparse_3.1-0 reticulate_1.39.0
## [25] mapproj_1.2.11 cowplot_1.1.3
## [27] pbapply_1.7-2 RColorBrewer_1.1-3
## [29] maps_3.4.2 zlibbioc_1.50.0
## [31] abind_1.4-5 GenomicRanges_1.56.1
## [33] Rtsne_0.17 purrr_1.0.2
## [35] R.utils_2.12.3 GenomeInfoDbData_1.2.12
## [37] circlize_0.4.16 IRanges_2.38.1
## [39] S4Vectors_0.42.1 ggrepel_0.9.6
## [41] irlba_2.3.5.1 listenv_0.9.1
## [43] spatstat.utils_3.1-0 goftest_1.2-3
## [45] RSpectra_0.16-2 spatstat.random_3.3-1
## [47] fitdistrplus_1.2-1 parallelly_1.38.0
## [49] DelayedMatrixStats_1.26.0 pkgdown_2.1.0
## [51] DelayedArray_0.30.1 leiden_0.4.3.1
## [53] codetools_0.2-20 tidyselect_1.2.1
## [55] shape_1.4.6.1 farver_2.1.2
## [57] UCSC.utils_1.0.0 matrixStats_1.4.1
## [59] stats4_4.4.1 spatstat.explore_3.3-2
## [61] jsonlite_1.8.8 GetoptLong_1.0.5
## [63] progressr_0.14.0 ggridges_0.5.6
## [65] survival_3.6-4 iterators_1.0.14
## [67] systemfonts_1.1.0 foreach_1.5.2
## [69] tools_4.4.1 ragg_1.3.2
## [71] ica_1.0-3 glue_1.7.0
## [73] SparseArray_1.4.8 gridExtra_2.3
## [75] xfun_0.47 MatrixGenerics_1.16.0
## [77] GenomeInfoDb_1.40.1 withr_3.0.1
## [79] BiocManager_1.30.25 fastmap_1.2.0
## [81] fansi_1.0.6 digest_0.6.37
## [83] R6_2.5.1 mime_0.12
## [85] textshaping_0.4.0 colorspace_2.1-1
## [87] scattermore_1.2 tensor_1.5
## [89] dichromat_2.0-0.1 spatstat.data_3.1-2
## [91] R.methodsS3_1.8.2 utf8_1.2.4
## [93] generics_0.1.3 S4Arrays_1.4.1
## [95] httr_1.4.7 htmlwidgets_1.6.4
## [97] uwot_0.2.2 pkgconfig_2.0.3
## [99] gtable_0.3.5 lmtest_0.9-40
## [101] XVector_0.44.0 htmltools_0.5.8.1
## [103] dotCall64_1.1-1 clue_0.3-65
## [105] scales_1.3.0 png_0.1-8
## [107] spatstat.univar_3.0-1 knitr_1.48
## [109] rstudioapi_0.16.0 reshape2_1.4.4
## [111] rjson_0.2.22 nlme_3.1-164
## [113] curl_5.2.2 cachem_1.1.0
## [115] zoo_1.8-12 GlobalOptions_0.1.2
## [117] stringr_1.5.1 KernSmooth_2.23-24
## [119] miniUI_0.1.1.1 desc_1.4.3
## [121] pillar_1.9.0 vctrs_0.6.5
## [123] promises_1.3.0 xtable_1.8-4
## [125] evaluate_0.24.0 magick_2.8.5
## [127] cli_3.6.3 compiler_4.4.1
## [129] rlang_1.1.4 crayon_1.5.3
## [131] future.apply_1.11.2 labeling_0.4.3
## [133] plyr_1.8.9 fs_1.6.4
## [135] stringi_1.8.4 viridisLite_0.4.2
## [137] deldir_2.0-4 gridBase_0.4-7
## [139] munsell_0.5.1 lazyeval_0.2.2
## [141] spatstat.geom_3.3-2 RcppHNSW_0.6.0
## [143] patchwork_1.2.0 sparseMatrixStats_1.16.0
## [145] future_1.34.0 shiny_1.9.1
## [147] highr_0.11 SummarizedExperiment_1.34.0
## [149] ROCR_1.0-11 gargle_1.5.2
## [151] igraph_2.0.3 bslib_0.8.0