NMF Model Development for Spatial Ecotype Recovery from Single-Cell and Spatial Transcriptomics Data • SpatialEcoTyper

Overview

Users can develop machine learning models to enable spatial ecotype (SE) recovery from held-out single-cell datasets (e.g., scRNA-seq or single-cell spatial transcriptomics). The Spatial EcoTyper framework leverages non-negative matrix factorization (NMF) as its core algorithm for SE recovery, given its robustness and demonstrated effectiveness in recovering carcinoma ecotypes from multimodal expression data in our prior study. Here, we will demonstrate how to train the model to recover SEs from single-cell gene expression data.

Development of SE recovery model

In this tutorial, we will use the Vizgen MERSCOPE data of two samples to demonstrate the process for developing cell-type-specific NMF models for recovering SEs from single-cell gene expression profiles, including both single-cell spatial transcriptomics and single-cell RNA-seq data.

First load required packages for this vignette

suppressPackageStartupMessages(library(Seurat))
suppressPackageStartupMessages(library(NMF))
suppressPackageStartupMessages(library(data.table))
suppressPackageStartupMessages(library(dplyr))
library(SpatialEcoTyper)

Loading data

Load gene expression matrix

## Read the expression matrix for the first sample
scdata <- fread("https://spatialecotyper.stanford.edu/inc/inc.public.vignettes.php?file=Melanoma1_subset_counts.tsv.gz", 
                sep = "\t",header = TRUE, data.table = FALSE)
rownames(scdata) <- scdata[, 1]  ## set the first column as row names
scdata <- scdata[, -1]  ## drop the first column
scdata <- NormalizeData(scdata)  ## Normalize the expression matrix

## Read the expression matrix for the second sample
scdata2 <- fread("https://spatialecotyper.stanford.edu/inc/inc.public.vignettes.php?file=CRC2_subset_counts.tsv.gz", 
                sep = "\t",header = TRUE, data.table = FALSE)
rownames(scdata2) <- scdata2[, 1]  ## set the first column as row names
scdata2 <- scdata2[, -1]  ## drop the first column
scdata2 <- NormalizeData(scdata2)  ## Normalize the expression matrix

## Combine the two expression matrices
genes <- intersect(rownames(scdata), rownames(scdata2))
scdata <- cbind(scdata[genes, ], scdata2[genes, ])
head(scdata[, 1:5])

##          HumanMelanomaPatient1__cell_3655 HumanMelanomaPatient1__cell_3657
## PDK4                             0.000000                         4.221184
## TNFRSF17                         0.000000                         0.000000
## ICAM3                            0.000000                         0.000000
## FAP                              3.552438                         0.000000
## GZMB                             0.000000                         0.000000
## TSC2                             0.000000                         0.000000
##          HumanMelanomaPatient1__cell_3658 HumanMelanomaPatient1__cell_3660
## PDK4                             3.208945                                0
## TNFRSF17                         0.000000                                0
## ICAM3                            0.000000                                0
## FAP                              0.000000                                0
## GZMB                             0.000000                                0
## TSC2                             0.000000                                0
##          HumanMelanomaPatient1__cell_3661
## PDK4                                    0
## TNFRSF17                                0
## ICAM3                                   0
## FAP                                     0
## GZMB                                    0
## TSC2                                    0

Load cell type and SE annotations

## Load the single-cell meta data derived from MultiSpatialEcoTyper analysis
scmeta <- read.table("https://spatialecotyper.stanford.edu/inc/inc.public.vignettes.php?file=MultiSE_metadata_final.tsv", sep = "\t", 
                     header = TRUE, row.names = 1)
scmeta$SE[is.na(scmeta$SE)] = "nonSE" ## cells with NA predictions were assigned into the nonSE group

## Matches the row names of meta data and column names of expression matrix
cells <- intersect(colnames(scdata), rownames(scmeta))
scdata <- scdata[, cells]
scmeta <- scmeta[cells, ]
head(scmeta[, c("Sample", "CellType", "SE")])

##                                  Sample   CellType  SE
## HumanMelanomaPatient1__cell_3655   SKCM Fibroblast SE7
## HumanMelanomaPatient1__cell_3657   SKCM Fibroblast SE1
## HumanMelanomaPatient1__cell_3658   SKCM Fibroblast SE1
## HumanMelanomaPatient1__cell_3660   SKCM Fibroblast SE1
## HumanMelanomaPatient1__cell_3661   SKCM Fibroblast SE7
## HumanMelanomaPatient1__cell_3663   SKCM Fibroblast SE7

Training

The NMFGenerateWList function will be used to train cell-type specific NMF models for recovering SEs. Prior to NMF, each gene is scaled to mean 0 and unit variance within each cell type-specific expression matrix. To satisfy the non-negativity requirement of NMF, the cell type-specific expression matrices are individually processed using posneg transformation, which converts each expression matrix into two matrices, one containing only positive values and the other containing only negative values with the sign inverted. The two matrices are subsequently concatenated to produce the training data. The NMFGenerateWList function will return a list of feature × SE matrices, each representing a cell type-specific gene signature matrix with rows as features and columns as SEs.

By default, for each cell type, the NMFGenerateWList balances cell numbers across samples (via the Sample argument) by randomly subsampling each sample to a target size equal to the median number of cells per sample, downsamples a total of 2,500 cells across all samples (downsample argument), and selects the top 300 variable features (nfeature argument) for model training. We recommend adjusting these parameters based on the characteristics of your dataset and available computational resources.

For a quick demonstration, we train the model on the small example dataset using a simplified, built-in feature selection strategy.

# training
Ws = NMFGenerateWList(scdata, scmeta, 
                      CellType = "CellType",
                      SE = "SE", Sample = "Sample",
                      nfeature = 300, 
                      nfeature.per.se = 50, 
                      ncores = 8)
class(Ws)

## [1] "list"

names(Ws)

## [1] "B"           "CD4T"        "CD8T"        "DC"          "Endothelial"
## [6] "Fibroblast"  "Macrophage"  "NK"          "Plasma"

head(Ws[[2]])

##                   SE7       SE3       SE4       SE8       SE6       SE1
## FOXP3__pos  0.5004127 0.6673784 0.3470760 0.6797704 0.4872819 1.0338540
## CXCR4__neg  0.2357855 0.6635520 0.2114708 1.0913564 0.3938264 0.6861448
## ICAM3__neg  0.4751354 0.5266059 0.4683919 0.6175067 0.4831921 0.5026104
## STAT3__neg  0.4748124 0.5139846 0.4922895 0.6149234 0.5547521 0.5395274
## CCL5__neg   0.5197869 0.5246213 0.4483178 0.5640050 0.5111583 0.6468350
## BCL2L1__pos 0.5060547 0.5936575 0.4228397 0.6190426 0.5072450 0.4724431
##                   SE9       SE5       SE2
## FOXP3__pos  0.5133462 0.5363018 0.7832876
## CXCR4__neg  0.2967064 0.5645048 0.9790789
## ICAM3__neg  0.4267446 0.6177860 0.8404026
## STAT3__neg  0.5390224 0.5108202 0.7087075
## CCL5__neg   0.5040774 0.5681576 0.6594520
## BCL2L1__pos 0.6042824 0.5519758 0.6467089

The resulting W matrices could be used to recover SEs from spatial transcriptomics and single-cell RNA-seq using the RecoverSE function. The detailed documentation for SE recovery is available at Recovering Spatial Ecotypes from Single-Cell RNA-seq Data and Recovering Spatial Ecotypes from Single-Cell Spatial Transcriptomics Data.

Note: To obtain a robust model, we recommend performing multiple runs and constructing an ensemble. In our analysis (Zhang et al.), we repeated the training 50 times and averaged the resulting basis matrices to generate an ensemble matrix W for each cell type.

The NMF model training can be time-consuming. To improve efficiency, we recommend running training jobs as Slurm array jobs and allocating sufficient computational resources (e.g., CPU cores) to each task. An example Slurm array script is provided below.

nmf_trainings.R

# ============================================================
# Demo script for NMF training (do not run as-is without setup)
# Designed for Slurm array jobs: one run per task
# ============================================================

# ---- Parse command-line argument (task index) ----
args <- commandArgs(trailingOnly = TRUE)

if (length(args) < 1) {
  stop("Usage: Rscript nmf_trainings.R <task_id>")
}

i <- as.integer(args[1])
if (is.na(i)) {
  stop("Error: <task_id> must be an integer.")
}

message("Running NMF training for iteration: ", i)

# ---- Load required libraries and data (user-defined) ----
library(SpatialEcoTyper)
# scdata <- ...
# scmeta <- ...

# ---- Run NMF training ----
Ws <- NMFGenerateWList(
  scdata, scmeta,
  CellType = "CellType",
  SE = "SE",
  Sample = "Sample",
  nfeature = 500,
  nfeature.per.se = Inf,  # disable internal feature selection
  ncores = 16
)

# ---- Save output ----
saveRDS(Ws, paste0("path/to/SE_recovery_nmf_models_", i, ".rds"))

submit_slurm_jobs_nmf.sh

#!/bin/bash
#SBATCH --job-name=nmf_train
#SBATCH --array=1-50
#SBATCH --cpus-per-task=16
#SBATCH --mem=32G
#SBATCH --time=24:00:00
#SBATCH --output=logs/nmf_%A_%a.out
Rscript nmf_trainings.R $SLURM_ARRAY_TASK_ID

After all jobs are completed, the resulting basis matrices can be aggregated for each cell type using the AggregateRecoverModels function. For each basis matrix, genes with the highest and most specific expression for each cell state are identified. Genes that are consistently selected for the same cell state across more than half of the repetitions are retained in the final model by default. The final basis matrix is then generated by averaging the corresponding weights across repetitions.

files = list.files("path/to/", "SE_recovery_nmf_models_", 
                   full.names = TRUE)
Ws_list = lapply(files, readRDS)
Ws_final = AggregateRecoverModels(Ws_list)

Next step

Using the SE recovery model, users can recover spatial ecotypes from held-out single-cell datasets, including both single-cell spatial transcriptomics and single-cell RNA-seq data, as demonstrated separately in Tutorial 5 and Tutorial 6 separately.

Session info

The session info allows users to replicate the exact environment and identify potential discrepancies in package versions or configurations that might be causing problems.

sessionInfo()

## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin20
## Running under: macOS 26.4.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Los_Angeles
## tzcode source: internal
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] SpatialEcoTyper_1.0.2 RANN_2.6.2            Matrix_1.7-0         
##  [4] dplyr_1.1.4           data.table_1.16.0     NMF_0.28             
##  [7] Biobase_2.64.0        BiocGenerics_0.50.0   cluster_2.1.6        
## [10] rngtools_1.5.2        registry_0.5-1        Seurat_5.1.0         
## [13] SeuratObject_5.0.2    sp_2.1-4             
## 
## loaded via a namespace (and not attached):
##   [1] RcppAnnoy_0.0.22       splines_4.4.1          later_1.3.2           
##   [4] R.oo_1.26.0            tibble_3.2.1           polyclip_1.10-7       
##   [7] fastDummies_1.7.4      lifecycle_1.0.4        sf_1.1-0              
##  [10] doParallel_1.0.17      globals_0.16.3         lattice_0.22-6        
##  [13] MASS_7.3-60.2          magrittr_2.0.3         plotly_4.10.4         
##  [16] sass_0.4.9             rmarkdown_2.28         jquerylib_0.1.4       
##  [19] yaml_2.3.10            httpuv_1.6.15          sctransform_0.4.1     
##  [22] spam_2.10-0            spatstat.sparse_3.1-0  reticulate_1.39.0     
##  [25] cowplot_1.1.3          pbapply_1.7-2          DBI_1.3.0             
##  [28] RColorBrewer_1.1-3     abind_1.4-5            Rtsne_0.17            
##  [31] R.utils_2.12.3         purrr_1.0.2            circlize_0.4.16       
##  [34] IRanges_2.38.1         S4Vectors_0.42.1       ggrepel_0.9.6         
##  [37] irlba_2.3.5.1          listenv_0.9.1          spatstat.utils_3.1-0  
##  [40] units_1.0-1            goftest_1.2-3          RSpectra_0.16-2       
##  [43] spatstat.random_3.3-1  fitdistrplus_1.2-1     parallelly_1.38.0     
##  [46] pkgdown_2.1.0          leiden_0.4.3.1         codetools_0.2-20      
##  [49] tidyselect_1.2.1       shape_1.4.6.1          matrixStats_1.4.1     
##  [52] stats4_4.4.1           spatstat.explore_3.3-2 jsonlite_1.8.8        
##  [55] GetoptLong_1.0.5       e1071_1.7-16           progressr_0.14.0      
##  [58] ggridges_0.5.6         survival_3.6-4         iterators_1.0.14      
##  [61] systemfonts_1.1.0      foreach_1.5.2          tools_4.4.1           
##  [64] ragg_1.3.2             ica_1.0-3              Rcpp_1.0.13           
##  [67] glue_1.7.0             gridExtra_2.3          xfun_0.52             
##  [70] withr_3.0.1            BiocManager_1.30.25    fastmap_1.2.0         
##  [73] boot_1.3-30            fansi_1.0.6            spData_2.3.4          
##  [76] digest_0.6.37          R6_2.5.1               mime_0.12             
##  [79] wk_0.9.5               textshaping_0.4.0      colorspace_2.1-1      
##  [82] scattermore_1.2        tensor_1.5             spatstat.data_3.1-2   
##  [85] R.methodsS3_1.8.2      utf8_1.2.4             tidyr_1.3.1           
##  [88] generics_0.1.3         class_7.3-22           httr_1.4.7            
##  [91] htmlwidgets_1.6.4      spdep_1.4-2            uwot_0.2.2            
##  [94] pkgconfig_2.0.3        gtable_0.3.5           ComplexHeatmap_2.20.0 
##  [97] lmtest_0.9-40          htmltools_0.5.8.1      dotCall64_1.1-1       
## [100] clue_0.3-65            scales_1.3.0           png_0.1-8             
## [103] spatstat.univar_3.0-1  knitr_1.48             rstudioapi_0.16.0     
## [106] reshape2_1.4.4         rjson_0.2.22           nlme_3.1-164          
## [109] proxy_0.4-27           cachem_1.1.0           zoo_1.8-12            
## [112] GlobalOptions_0.1.2    stringr_1.5.1          KernSmooth_2.23-24    
## [115] miniUI_0.1.1.1         s2_1.1.9               desc_1.4.3            
## [118] pillar_1.9.0           grid_4.4.1             vctrs_0.6.5           
## [121] promises_1.3.0         xtable_1.8-4           evaluate_0.24.0       
## [124] cli_3.6.3              compiler_4.4.1         rlang_1.1.4           
## [127] crayon_1.5.3           future.apply_1.11.2    classInt_0.4-11       
## [130] plyr_1.8.9             fs_1.6.4               stringi_1.8.4         
## [133] viridisLite_0.4.2      deldir_2.0-4           gridBase_0.4-7        
## [136] munsell_0.5.1          lazyeval_0.2.2         spatstat.geom_3.3-2   
## [139] RcppHNSW_0.6.0         patchwork_1.2.0        future_1.34.0         
## [142] ggplot2_3.5.1          shiny_1.9.1            ROCR_1.0-11           
## [145] igraph_2.0.3           bslib_0.8.0