Recovery of Spatial Ecotypes from Spatial Transcriptomics Data • SpatialEcoTyper

SE recovery from single-cell spatial data

In this tutorial, we will demonstrate how to recover spatial ecotypes (SEs) from single-cell spatial transcriptomics data profiled by platforms such as MERSCOPE, CosMx SMI, or Xenium. We will use a subset of a melanoma MERSCOPE sample to illustrate the SE recovery process. The expression data and single-cell metadata can be downloaded from the Google Drive. In this data, single cells were categorized into nine major cell types, including B cells, CD4+ T cells, CD8+ T cells, NK cells, plasma cells, macrophages, dendritic cells (DC), fibroblasts, and endothelial cells.

First load required packages for this vignette

suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(Seurat))
suppressPackageStartupMessages(library(data.table))
suppressPackageStartupMessages(library(googledrive))
library(SpatialEcoTyper)

Loading data

Download the demo data from Google Drive:

drive_deauth() # Disable Google sign-in requirement
drive_download(as_id("13Rc5Rsu8jbnEYYfUse-xQ7ges51LcI7n"), "HumanMelanomaPatient1_subset_counts.tsv.gz", overwrite = TRUE)
drive_download(as_id("12xcZNhpT-xbhcG8kX1QAdTeM9TKeFAUW"), "HumanMelanomaPatient1_subset_scmeta.tsv", overwrite = TRUE)

Load the gene expression matrix and meta data into R.

# Load single-cell gene expression matrix. Rows are genes, columns are cells.
scdata <- fread("HumanMelanomaPatient1_subset_counts.tsv.gz", 
                sep = "\t",header = TRUE, data.table = FALSE)
rownames(scdata) <- scdata[, 1]
scdata <- as.matrix(scdata[, -1])
head(scdata[, 1:5])

##          HumanMelanomaPatient1__cell_3655 HumanMelanomaPatient1__cell_3657
## PDK4                                    0                                1
## TNFRSF17                                0                                0
## ICAM3                                   0                                0
## FAP                                     1                                0
## GZMB                                    0                                0
## TSC2                                    0                                0
##          HumanMelanomaPatient1__cell_3658 HumanMelanomaPatient1__cell_3660
## PDK4                                    1                                0
## TNFRSF17                                0                                0
## ICAM3                                   0                                0
## FAP                                     0                                0
## GZMB                                    0                                0
## TSC2                                    0                                0
##          HumanMelanomaPatient1__cell_3661
## PDK4                                    0
## TNFRSF17                                0
## ICAM3                                   0
## FAP                                     0
## GZMB                                    0
## TSC2                                    0

# Load single-cell metadata, with at least three columns, including X, Y, and CellType
scmeta <- read.table("HumanMelanomaPatient1_subset_scmeta.tsv", 
                      sep = "\t",header = TRUE, row.names = 1)
scdata = scdata[, match(rownames(scmeta), colnames(scdata))]

head(scmeta[, c("X", "Y", "CellType")])

##                                         X         Y   CellType
## HumanMelanomaPatient1__cell_3655 1894.706 -6367.766 Fibroblast
## HumanMelanomaPatient1__cell_3657 1942.480 -6369.602 Fibroblast
## HumanMelanomaPatient1__cell_3658 1963.007 -6374.026 Fibroblast
## HumanMelanomaPatient1__cell_3660 1981.600 -6372.266 Fibroblast
## HumanMelanomaPatient1__cell_3661 1742.939 -6374.851 Fibroblast
## HumanMelanomaPatient1__cell_3663 1921.683 -6383.309 Fibroblast

Data normalization

The gene expression data should be normalized for the SE recovery. The data can be normalized using SCTransform or NormalizeData.

Here, we are normalizing using SCTransform normalization. We recommend to install the glmGamPoi package for faster computation.

if(!"glmGamPoi" %in% installed.packages()){
  BiocManager::install("glmGamPoi")
}
tmpobj <- CreateSeuratObject(scdata) %>%
      SCTransform(clip.range = c(-10, 10), verbose = FALSE)

seurat_version = as.integer(gsub("\\..*", "", as.character(packageVersion("SeuratObject"))))
if(seurat_version<5){
  normdata <- GetAssayData(tmpobj, "data")
}else{
  normdata <- tmpobj[["SCT"]]$data
}

Using NormalizeData for the normalization

normdata <- NormalizeData(scdata)

SE recovery

The RecoverSE function will be used to assign single cells into SEs. Users can either use the default models to recover predefined SEs or use custom models to recover newly defined SEs.

Note: To recover SEs from single-cell data, you must specify either celltypes or se_results in the RecoverSE function. If neither is provided, it will assume that the input data represents spot-level spatial transcriptomics, and SE abundances will be inferred directly from each spot.

The default NMF models were trained on discovery MERSCOPE data, encompassing five cancer types: melanoma, and four carcinomas. These models are tailored to nine distinct cell types: B cells, CD4+ T cells, CD8+ T cells, NK cells, plasma cells, macrophages, dendritic cells, fibroblasts, and endothelial cells. Each model facilitates the recovery of SEs from single-cell datasets, allowing for cell-type-specific SE analysis. Thus, for SE recovery using default models, the cells in the query data should be grouped into one of “B”, “CD4T”, “CD8T”, “NK”, “Plasma”, “Macrophage”, “DC”, “Fibroblast”, and “Endothelial”, case sensitive. All the other cell types will be considered non-SE compartments.

Using default models (1)

Before SE recovery, we recommend to use a unified embedding of spatial microregions by performing de novo Spatial EcoTyper analysis, which integrate gene expression and spatial information. This embedding could enhance the accuracy and refinement of SE recovery results. The detailed tutorial for Spatial EcoTyper analysis is available at Discovery of Spatial Ecotypes from a Single-cell Spatial Dataset.

For demonstration purposes, we used SpatialEcoTyper to group cells into spatial clusters with a resolution of 10. In practice, we recommend experimenting with multiple resolutions to ensure robust and reliable results.

## make sure the cells are grouped into one of "B", "CD4T", "CD8T", "NK", "Plasma", "Macrophage", "DC", "Fibroblast", and "Endothelial".
print(unique(scmeta$CellType))

## [1] "Fibroblast"  "Endothelial" "DC"          "Macrophage"  "CD8T"       
## [6] "CD4T"        "Plasma"      "NK"          "B"

## Spatial EcoTyper analysis: it would take ~2 min
emb = SpatialEcoTyper(normdata, scmeta, resolution = 10)
emb$obj ## A seurat object including the spatial embedding

## An object of class Seurat 
## 2315 features across 2315 samples within 1 assay 
## Active assay: RNA (2315 features, 2315 variable features)
##  3 layers present: counts, data, scale.data
##  2 dimensional reductions calculated: pca, umap

head(emb$metadata[, c("X", "Y", "CellType", "SE")])  ## Single-cell meta data with SE annotations

##                                         X         Y   CellType   SE
## HumanMelanomaPatient1__cell_3655 1894.706 -6367.766 Fibroblast SE25
## HumanMelanomaPatient1__cell_3657 1942.480 -6369.602 Fibroblast SE29
## HumanMelanomaPatient1__cell_3658 1963.007 -6374.026 Fibroblast SE29
## HumanMelanomaPatient1__cell_3660 1981.600 -6372.266 Fibroblast SE29
## HumanMelanomaPatient1__cell_3661 1742.939 -6374.851 Fibroblast SE48
## HumanMelanomaPatient1__cell_3663 1921.683 -6383.309 Fibroblast SE25

Then specify the se_results in the RecoverSE function for SE recovery.

sepreds <- RecoverSE(normdata, se_results = emb)
head(sepreds)

## HumanMelanomaPatient1__cell_3655 HumanMelanomaPatient1__cell_3657 
##                            "SE1"                            "SE4" 
## HumanMelanomaPatient1__cell_3658 HumanMelanomaPatient1__cell_3660 
##                            "SE4"                            "SE4" 
## HumanMelanomaPatient1__cell_3661 HumanMelanomaPatient1__cell_3663 
##                            "SE3"                            "SE1"

Using default models (2)

You can also recover the SEs without using spatial embedding, which could be less accurate due to the lack of spatial information. The cell type annotations are required in this case. Cells should be grouped into one of “B”, “CD4T”, “CD8T”, “NK”, “Plasma”, “Macrophage”, “DC”, “Fibroblast”, and “Endothelial”, case sensitive. All the other cell types will be considered non-SE compartments.

## make sure the cells are grouped into one of "B", "CD4T", "CD8T", "NK", "Plasma", "Macrophage", "DC", "Fibroblast", and "Endothelial".
print(unique(scmeta$CellType))
sepreds <- RecoverSE(normdata, celltypes = scmeta$CellType)
head(sepreds)

Using custom models

To use custom models, users should first develop recovery models following the tutorial Development of SE Recovery Models. The resulting models can be used for SE recovery. An example model is available at SE_Recovery_W_list.rds.

# Download SE recovery model
drive_download(as_id("171WaAe49babYB85Cn1FcoNNE-lzYp1T_"), "SE_Recovery_W_list.rds", overwrite = TRUE)

Using custom models by specifying the Ws argument:

Ws <- readRDS("SE_Recovery_W_list.rds")
## make sure the cell type names are consistent.
print(unique(scmeta$CellType))
print(names(Ws))
sepreds <- RecoverSE(normdata, celltypes = scmeta$CellType, Ws = Ws)

## If Spatial EcoTyper result is available, we recommend:
sepreds <- RecoverSE(normdata, se_results = emb, Ws = Ws)

Visualization of SEs in the tissue

## Add the recovery result into the meta data
scmeta$RecoveredSE <- sepreds[rownames(scmeta)]
## Visualize the SE recovery results
SpatialView(scmeta, by = "RecoveredSE")

Recovery of spatial ecotypes from Visium Spatial Gene Expression data

To recover SEs from Visium spatial transcriptomics data, we first infer SE abundances from each spot. Each spot is then assigned to the SE with the highest inferred abundance, enabling the spatial mapping of SEs across the tissue. In this tutorial, we will use a breast cancer sample to demonstrate the SE recovery from a Visium data. The expression data can be accessed from: V1_Breast_Cancer_Block_A_Section_1_filtered_feature_bc_matrix.h5, which was downloaded from 10x Genomics datasets.

Loading data

First, download the data from Google Drive

drive_download(as_id("15D9LgvZmCZUfsL62cD67JMf8Jcq_UyuB"), "V1_Breast_Cancer_Block_A_Section_1_filtered_feature_bc_matrix.h5", overwrite = TRUE)
drive_download(as_id("15NTZc1HrW_gLS_pmi1ckYLw30kNarf4w"), "V1_Breast_Cancer_Block_A_Section_1_tissue_positions_list.csv", overwrite = TRUE)
## This download should be done within 1 min.

Load the expression data into R using the Read10X_h5 function from the Seurat package.

if(!"hdf5r" %in% installed.packages()) BiocManager::install("hdf5r")
require("hdf5r")
# Load Visium gene expression matrix. Rows are genes, columns are spots.
dat <- Read10X_h5("V1_Breast_Cancer_Block_A_Section_1_filtered_feature_bc_matrix.h5")
# normalize the expression data
dat <- NormalizeData(dat)

meta <- read.csv("V1_Breast_Cancer_Block_A_Section_1_tissue_positions_list.csv", 
                 header = FALSE, row.names = 1)
colnames(meta) <- c("tissue", "row", "col", "imagerow", "imagecol")
meta <- meta[colnames(dat), ]
head(meta)

##                    tissue row col imagerow imagecol
## AAACAAGTATCTCCCA-1      1  50 102    15632    17782
## AAACACCAATAACTGC-1      1  59  19    17734     6447
## AAACAGAGCGACTCCT-1      1  14  94     7079    16716
## AAACAGGGTCTATATT-1      1  47  13    14882     5637
## AAACAGTGTTCCTGGG-1      1  73  43    21069     9712
## AAACATTTCCCGGATT-1      1  61  97    18242    17091

SE recovery

If neither celltypes nor se_results are specified, the RecoverSE function will assume the input data is a gene-by-spot gene expression matrix and infer SE abundances across spots. Users have the option to either use the default models to recover predefined SEs or apply custom models to recover newly defined SEs.

Using default models

preds <- RecoverSE(dat)
head(preds)

##                             SE1          SE2          SE3          SE4
## AAACAAGTATCTCCCA-1 2.127701e-04 6.877800e-05 3.979458e-01 8.534547e-02
## AAACACCAATAACTGC-1 1.321616e-01 6.347672e-02 4.580604e-02 9.893653e-02
## AAACAGAGCGACTCCT-1 1.658118e-01 2.438595e-15 1.257332e-01 1.021615e-01
## AAACAGGGTCTATATT-1 6.495088e-02 1.975689e-16 2.208619e-01 6.972421e-10
## AAACAGTGTTCCTGGG-1 9.104830e-02 2.073793e-01 2.579438e-09 6.489594e-02
## AAACATTTCCCGGATT-1 6.684999e-13 2.987802e-01 2.908525e-08 7.202254e-07
##                             SE5          SE6          SE7          SE8
## AAACAAGTATCTCCCA-1 1.162860e-01 1.723712e-16 3.026670e-01 2.078080e-02
## AAACACCAATAACTGC-1 4.542600e-07 4.511416e-01 3.746409e-16 1.333849e-01
## AAACAGAGCGACTCCT-1 7.829883e-03 3.928577e-05 3.462729e-01 2.635185e-04
## AAACAGGGTCTATATT-1 8.604433e-03 1.976922e-16 7.055828e-01 1.739479e-09
## AAACAGTGTTCCTGGG-1 1.529680e-06 2.947065e-01 2.356171e-15 1.632831e-01
## AAACATTTCCCGGATT-1 1.993179e-11 1.996192e-01 3.022784e-16 2.576151e-01
##                             SE9        nonSE
## AAACAAGTATCTCCCA-1 1.755819e-02 5.913523e-02
## AAACACCAATAACTGC-1 4.131729e-03 7.096041e-02
## AAACAGAGCGACTCCT-1 1.684507e-02 2.350429e-01
## AAACAGGGTCTATATT-1 1.975689e-16 2.054490e-11
## AAACAGTGTTCCTGGG-1 2.351639e-06 1.786829e-01
## AAACATTTCCCGGATT-1 7.419167e-02 1.697931e-01

## This step would take ~3 minutes to complete

Using custom models

To develop SE recovery models, users should follow the tutorial Developing SE Recovery Models. The resulting models can be used for SE recovery. An example model is available at SE_Recovery_W_list.rds. By specifying the Ws parameter (a list of cell-type-specific W matrices) in the RecoverSE function, the custom models will be used for recovering SEs.

# Download SE recovery model
drive_download(as_id("171WaAe49babYB85Cn1FcoNNE-lzYp1T_"), "SE_Recovery_W_list.rds", overwrite = TRUE)
Ws <- readRDS("SE_Recovery_W_list.rds")

## Using custom models by specifying the Ws argument
preds <- RecoverSE(normdata, Ws = Ws)
## This step would take ~3 minutes to complete

Visualization of SEs in the tissue

Each spot can be assigned to the SE with the highest inferred abundance, and the spatial mapping of SEs can be visualized using the SpatialView function.

meta$RecoveredSE = colnames(preds)[apply(preds, 1, which.max)]
meta$Y = -meta$row
SpatialView(meta, by = "RecoveredSE", X = "col", Y = "Y", pt.size = 2)

Session info

The session info allows users to replicate the exact environment and identify potential discrepancies in package versions or configurations that might be causing problems.

sessionInfo()

## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin20
## Running under: macOS 15.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Los_Angeles
## tzcode source: internal
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] hdf5r_1.3.11          SpatialEcoTyper_1.0.0 NMF_0.28             
##  [4] Biobase_2.64.0        BiocGenerics_0.50.0   cluster_2.1.6        
##  [7] rngtools_1.5.2        registry_0.5-1        RANN_2.6.2           
## [10] Matrix_1.7-0          googledrive_2.1.1     data.table_1.16.0    
## [13] Seurat_5.1.0          SeuratObject_5.0.2    sp_2.1-4             
## [16] ggplot2_3.5.1         dplyr_1.1.4          
## 
## loaded via a namespace (and not attached):
##   [1] RcppAnnoy_0.0.22            splines_4.4.1              
##   [3] later_1.3.2                 tibble_3.2.1               
##   [5] R.oo_1.26.0                 polyclip_1.10-7            
##   [7] fastDummies_1.7.4           lifecycle_1.0.4            
##   [9] doParallel_1.0.17           pals_1.9                   
##  [11] globals_0.16.3              lattice_0.22-6             
##  [13] MASS_7.3-60.2               magrittr_2.0.3             
##  [15] plotly_4.10.4               sass_0.4.9                 
##  [17] rmarkdown_2.28              jquerylib_0.1.4            
##  [19] yaml_2.3.10                 httpuv_1.6.15              
##  [21] glmGamPoi_1.16.0            sctransform_0.4.1          
##  [23] spam_2.10-0                 spatstat.sparse_3.1-0      
##  [25] reticulate_1.39.0           mapproj_1.2.11             
##  [27] cowplot_1.1.3               pbapply_1.7-2              
##  [29] RColorBrewer_1.1-3          maps_3.4.2                 
##  [31] zlibbioc_1.50.0             abind_1.4-5                
##  [33] GenomicRanges_1.56.1        Rtsne_0.17                 
##  [35] purrr_1.0.2                 R.utils_2.12.3             
##  [37] GenomeInfoDbData_1.2.12     circlize_0.4.16            
##  [39] IRanges_2.38.1              S4Vectors_0.42.1           
##  [41] ggrepel_0.9.6               irlba_2.3.5.1              
##  [43] listenv_0.9.1               spatstat.utils_3.1-0       
##  [45] goftest_1.2-3               RSpectra_0.16-2            
##  [47] spatstat.random_3.3-1       fitdistrplus_1.2-1         
##  [49] parallelly_1.38.0           DelayedMatrixStats_1.26.0  
##  [51] pkgdown_2.1.0               DelayedArray_0.30.1        
##  [53] leiden_0.4.3.1              codetools_0.2-20           
##  [55] tidyselect_1.2.1            shape_1.4.6.1              
##  [57] farver_2.1.2                UCSC.utils_1.0.0           
##  [59] matrixStats_1.4.1           stats4_4.4.1               
##  [61] spatstat.explore_3.3-2      jsonlite_1.8.8             
##  [63] GetoptLong_1.0.5            progressr_0.14.0           
##  [65] ggridges_0.5.6              survival_3.6-4             
##  [67] iterators_1.0.14            systemfonts_1.1.0          
##  [69] foreach_1.5.2               tools_4.4.1                
##  [71] ragg_1.3.2                  ica_1.0-3                  
##  [73] Rcpp_1.0.13                 glue_1.7.0                 
##  [75] SparseArray_1.4.8           gridExtra_2.3              
##  [77] xfun_0.47                   MatrixGenerics_1.16.0      
##  [79] GenomeInfoDb_1.40.1         withr_3.0.1                
##  [81] BiocManager_1.30.25         fastmap_1.2.0              
##  [83] fansi_1.0.6                 digest_0.6.37              
##  [85] R6_2.5.1                    mime_0.12                  
##  [87] textshaping_0.4.0           colorspace_2.1-1           
##  [89] scattermore_1.2             tensor_1.5                 
##  [91] dichromat_2.0-0.1           spatstat.data_3.1-2        
##  [93] R.methodsS3_1.8.2           utf8_1.2.4                 
##  [95] tidyr_1.3.1                 generics_0.1.3             
##  [97] S4Arrays_1.4.1              httr_1.4.7                 
##  [99] htmlwidgets_1.6.4           uwot_0.2.2                 
## [101] pkgconfig_2.0.3             gtable_0.3.5               
## [103] ComplexHeatmap_2.20.0       lmtest_0.9-40              
## [105] XVector_0.44.0              htmltools_0.5.8.1          
## [107] dotCall64_1.1-1             clue_0.3-65                
## [109] scales_1.3.0                png_0.1-8                  
## [111] spatstat.univar_3.0-1       knitr_1.48                 
## [113] rstudioapi_0.16.0           reshape2_1.4.4             
## [115] rjson_0.2.22                nlme_3.1-164               
## [117] curl_5.2.2                  cachem_1.1.0               
## [119] zoo_1.8-12                  GlobalOptions_0.1.2        
## [121] stringr_1.5.1               KernSmooth_2.23-24         
## [123] miniUI_0.1.1.1              desc_1.4.3                 
## [125] pillar_1.9.0                grid_4.4.1                 
## [127] vctrs_0.6.5                 promises_1.3.0             
## [129] xtable_1.8-4                evaluate_0.24.0            
## [131] cli_3.6.3                   compiler_4.4.1             
## [133] rlang_1.1.4                 crayon_1.5.3               
## [135] future.apply_1.11.2         labeling_0.4.3             
## [137] plyr_1.8.9                  fs_1.6.4                   
## [139] stringi_1.8.4               viridisLite_0.4.2          
## [141] deldir_2.0-4                gridBase_0.4-7             
## [143] munsell_0.5.1               lazyeval_0.2.2             
## [145] spatstat.geom_3.3-2         RcppHNSW_0.6.0             
## [147] patchwork_1.2.0             bit64_4.5.2                
## [149] sparseMatrixStats_1.16.0    future_1.34.0              
## [151] shiny_1.9.1                 highr_0.11                 
## [153] SummarizedExperiment_1.34.0 ROCR_1.0-11                
## [155] gargle_1.5.2                igraph_2.0.3               
## [157] bslib_0.8.0                 bit_4.5.0