
NMF Model Development for Spatial Ecotype Recovery from Single-Cell and Spatial Transcriptomics Data
Source:vignettes/TrainRecoveryModel.Rmd
TrainRecoveryModel.RmdOverview
Users can develop machine learning models to enable spatial ecotype (SE) recovery from held-out single-cell datasets (e.g., scRNA-seq or single-cell spatial transcriptomics). The Spatial EcoTyper framework leverages non-negative matrix factorization (NMF) as its core algorithm for SE recovery, given its robustness and demonstrated effectiveness in recovering carcinoma ecotypes from multimodal expression data in our prior study. Here, we will demonstrate how to train the model to recover SEs from single-cell gene expression data.
Development of SE recovery model
In this tutorial, we will use the Vizgen MERSCOPE data of two samples to demonstrate the process for developing cell-type-specific NMF models for recovering SEs from single-cell gene expression profiles, including both single-cell spatial transcriptomics and single-cell RNA-seq data.
First load required packages for this vignette
suppressPackageStartupMessages(library(Seurat))
suppressPackageStartupMessages(library(NMF))
suppressPackageStartupMessages(library(data.table))
suppressPackageStartupMessages(library(dplyr))
library(SpatialEcoTyper)Loading data
Load gene expression matrix
## Read the expression matrix for the first sample
scdata <- fread("https://spatialecotyper.stanford.edu/inc/inc.public.vignettes.php?file=Melanoma1_subset_counts.tsv.gz",
sep = "\t",header = TRUE, data.table = FALSE)
rownames(scdata) <- scdata[, 1] ## set the first column as row names
scdata <- scdata[, -1] ## drop the first column
scdata <- NormalizeData(scdata) ## Normalize the expression matrix
## Read the expression matrix for the second sample
scdata2 <- fread("https://spatialecotyper.stanford.edu/inc/inc.public.vignettes.php?file=CRC2_subset_counts.tsv.gz",
sep = "\t",header = TRUE, data.table = FALSE)
rownames(scdata2) <- scdata2[, 1] ## set the first column as row names
scdata2 <- scdata2[, -1] ## drop the first column
scdata2 <- NormalizeData(scdata2) ## Normalize the expression matrix
## Combine the two expression matrices
genes <- intersect(rownames(scdata), rownames(scdata2))
scdata <- cbind(scdata[genes, ], scdata2[genes, ])
head(scdata[, 1:5])## HumanMelanomaPatient1__cell_3655 HumanMelanomaPatient1__cell_3657
## PDK4 0.000000 4.221184
## TNFRSF17 0.000000 0.000000
## ICAM3 0.000000 0.000000
## FAP 3.552438 0.000000
## GZMB 0.000000 0.000000
## TSC2 0.000000 0.000000
## HumanMelanomaPatient1__cell_3658 HumanMelanomaPatient1__cell_3660
## PDK4 3.208945 0
## TNFRSF17 0.000000 0
## ICAM3 0.000000 0
## FAP 0.000000 0
## GZMB 0.000000 0
## TSC2 0.000000 0
## HumanMelanomaPatient1__cell_3661
## PDK4 0
## TNFRSF17 0
## ICAM3 0
## FAP 0
## GZMB 0
## TSC2 0
Load cell type and SE annotations
## Load the single-cell meta data derived from MultiSpatialEcoTyper analysis
scmeta <- read.table("https://spatialecotyper.stanford.edu/inc/inc.public.vignettes.php?file=MultiSE_metadata_final.tsv", sep = "\t",
header = TRUE, row.names = 1)
scmeta$SE[is.na(scmeta$SE)] = "nonSE" ## cells with NA predictions were assigned into the nonSE group
## Matches the row names of meta data and column names of expression matrix
cells <- intersect(colnames(scdata), rownames(scmeta))
scdata <- scdata[, cells]
scmeta <- scmeta[cells, ]
head(scmeta[, c("Sample", "CellType", "SE")])## Sample CellType SE
## HumanMelanomaPatient1__cell_3655 SKCM Fibroblast SE7
## HumanMelanomaPatient1__cell_3657 SKCM Fibroblast SE1
## HumanMelanomaPatient1__cell_3658 SKCM Fibroblast SE1
## HumanMelanomaPatient1__cell_3660 SKCM Fibroblast SE1
## HumanMelanomaPatient1__cell_3661 SKCM Fibroblast SE7
## HumanMelanomaPatient1__cell_3663 SKCM Fibroblast SE7
Training
The NMFGenerateWList
function will be used to train cell-type specific NMF models for
recovering SEs. Prior to NMF, each gene is scaled to mean 0 and unit
variance within each cell type-specific expression matrix. To satisfy
the non-negativity requirement of NMF, the cell type-specific expression
matrices are individually processed using posneg transformation, which
converts each expression matrix into two matrices, one containing only
positive values and the other containing only negative values with the
sign inverted. The two matrices are subsequently concatenated to produce
the training data. The NMFGenerateWList function
will return a list of feature × SE matrices, each
representing a cell type-specific gene signature matrix with rows as
features and columns as SEs.
By default, for each cell type, the NMFGenerateWList balances
cell numbers across samples (via the Sample argument) by
randomly subsampling each sample to a target size equal to the median
number of cells per sample, downsamples a total of 2,500 cells across
all samples (downsample argument), and selects the top 300
variable features (nfeature argument) for model training.
We recommend adjusting these parameters based on the characteristics of
your dataset and available computational resources.
For a quick demonstration, we train the model on the small example dataset using a simplified, built-in feature selection strategy.
# training
Ws = NMFGenerateWList(scdata, scmeta,
CellType = "CellType",
SE = "SE", Sample = "Sample",
nfeature = 300,
nfeature.per.se = 50,
ncores = 8)
class(Ws)## [1] "list"
names(Ws)## [1] "B" "CD4T" "CD8T" "DC" "Endothelial"
## [6] "Fibroblast" "Macrophage" "NK" "Plasma"
head(Ws[[2]])## SE7 SE3 SE4 SE8 SE6 SE1
## FOXP3__pos 0.5004127 0.6673784 0.3470760 0.6797704 0.4872819 1.0338540
## CXCR4__neg 0.2357855 0.6635520 0.2114708 1.0913564 0.3938264 0.6861448
## ICAM3__neg 0.4751354 0.5266059 0.4683919 0.6175067 0.4831921 0.5026104
## STAT3__neg 0.4748124 0.5139846 0.4922895 0.6149234 0.5547521 0.5395274
## CCL5__neg 0.5197869 0.5246213 0.4483178 0.5640050 0.5111583 0.6468350
## BCL2L1__pos 0.5060547 0.5936575 0.4228397 0.6190426 0.5072450 0.4724431
## SE9 SE5 SE2
## FOXP3__pos 0.5133462 0.5363018 0.7832876
## CXCR4__neg 0.2967064 0.5645048 0.9790789
## ICAM3__neg 0.4267446 0.6177860 0.8404026
## STAT3__neg 0.5390224 0.5108202 0.7087075
## CCL5__neg 0.5040774 0.5681576 0.6594520
## BCL2L1__pos 0.6042824 0.5519758 0.6467089
The resulting W matrices could be used to recover SEs from spatial transcriptomics and single-cell RNA-seq using the RecoverSE function. The detailed documentation for SE recovery is available at Recovering Spatial Ecotypes from Single-Cell RNA-seq Data and Recovering Spatial Ecotypes from Single-Cell Spatial Transcriptomics Data.
Note: To obtain a robust model, we recommend performing multiple runs and constructing an ensemble. In our analysis (Zhang et al.), we repeated the training 50 times and averaged the resulting basis matrices to generate an ensemble matrix W for each cell type. For feature selection, we identified genes with the highest and most specific expression in each cell state from each run, and retained those consistently selected for the same cell state in more than half of the runs.
The NMF model training can be time-consuming. To improve efficiency, we recommend running training jobs as Slurm array jobs and and allocating sufficient computational resources (e.g., CPU cores) to each task. the resulting W matrices according to the procedure described above. An example Slurm array script is provided below.
nmf_trainings.R
# ============================================================
# Demo script for NMF training (do not run as-is without setup)
# Designed for Slurm array jobs: one run per task
# ============================================================
# ---- Parse command-line argument (task index) ----
args <- commandArgs(trailingOnly = TRUE)
if (length(args) < 1) {
stop("Usage: Rscript nmf_trainings.R <task_id>")
}
i <- as.integer(args[1])
if (is.na(i)) {
stop("Error: <task_id> must be an integer.")
}
message("Running NMF training for iteration: ", i)
# ---- Load required libraries and data (user-defined) ----
library(SpatialEcoTyper)
# scdata <- ...
# scmeta <- ...
# ---- Run NMF training ----
Ws <- NMFGenerateWList(
scdata, scmeta,
CellType = "CellType",
SE = "SE",
Sample = "Sample",
nfeature = 500,
nfeature.per.se = Inf, # disable internal feature selection
ncores = 16
)
# ---- Save output ----
saveRDS(Ws, paste0("path/to/SE_recovery_nmf_models_", i, ".rds"))submit_slurm_jobs_nmf.sh
#!/bin/bash
#SBATCH --job-name=nmf_train
#SBATCH --array=1-50
#SBATCH --cpus-per-task=16
#SBATCH --mem=32G
#SBATCH --time=24:00:00
#SBATCH --output=logs/nmf_%A_%a.out
Rscript nmf_trainings.R $SLURM_ARRAY_TASK_IDAfter all jobs finish, you can average the resulting basis matrices. For feature selection, genes showing the highest and most specific expression in each cell state were identified from each basis matrix as described, and then genes that were selected for the same cell state in over half of the repetitions were retained in the ensemble matrix.
Next step
Using the SE recovery model, users can recover spatial ecotypes from held-out single-cell datasets, including both single-cell spatial transcriptomics and single-cell RNA-seq data, as demonstrated separately in Tutorial 5 and Tutorial 6 separately.
Session info
The session info allows users to replicate the exact environment and identify potential discrepancies in package versions or configurations that might be causing problems.
## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin20
## Running under: macOS 26.4.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Los_Angeles
## tzcode source: internal
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] SpatialEcoTyper_1.0.2 RANN_2.6.2 Matrix_1.7-0
## [4] dplyr_1.1.4 data.table_1.16.0 NMF_0.28
## [7] Biobase_2.64.0 BiocGenerics_0.50.0 cluster_2.1.6
## [10] rngtools_1.5.2 registry_0.5-1 Seurat_5.1.0
## [13] SeuratObject_5.0.2 sp_2.1-4
##
## loaded via a namespace (and not attached):
## [1] RcppAnnoy_0.0.22 splines_4.4.1 later_1.3.2
## [4] R.oo_1.26.0 tibble_3.2.1 polyclip_1.10-7
## [7] fastDummies_1.7.4 lifecycle_1.0.4 sf_1.1-0
## [10] doParallel_1.0.17 globals_0.16.3 lattice_0.22-6
## [13] MASS_7.3-60.2 magrittr_2.0.3 plotly_4.10.4
## [16] sass_0.4.9 rmarkdown_2.28 jquerylib_0.1.4
## [19] yaml_2.3.10 httpuv_1.6.15 sctransform_0.4.1
## [22] spam_2.10-0 spatstat.sparse_3.1-0 reticulate_1.39.0
## [25] cowplot_1.1.3 pbapply_1.7-2 DBI_1.3.0
## [28] RColorBrewer_1.1-3 abind_1.4-5 Rtsne_0.17
## [31] R.utils_2.12.3 purrr_1.0.2 circlize_0.4.16
## [34] IRanges_2.38.1 S4Vectors_0.42.1 ggrepel_0.9.6
## [37] irlba_2.3.5.1 listenv_0.9.1 spatstat.utils_3.1-0
## [40] units_1.0-1 goftest_1.2-3 RSpectra_0.16-2
## [43] spatstat.random_3.3-1 fitdistrplus_1.2-1 parallelly_1.38.0
## [46] pkgdown_2.1.0 leiden_0.4.3.1 codetools_0.2-20
## [49] tidyselect_1.2.1 shape_1.4.6.1 matrixStats_1.4.1
## [52] stats4_4.4.1 spatstat.explore_3.3-2 jsonlite_1.8.8
## [55] GetoptLong_1.0.5 e1071_1.7-16 progressr_0.14.0
## [58] ggridges_0.5.6 survival_3.6-4 iterators_1.0.14
## [61] systemfonts_1.1.0 foreach_1.5.2 tools_4.4.1
## [64] ragg_1.3.2 ica_1.0-3 Rcpp_1.0.13
## [67] glue_1.7.0 gridExtra_2.3 xfun_0.52
## [70] withr_3.0.1 BiocManager_1.30.25 fastmap_1.2.0
## [73] boot_1.3-30 fansi_1.0.6 spData_2.3.4
## [76] digest_0.6.37 R6_2.5.1 mime_0.12
## [79] wk_0.9.5 textshaping_0.4.0 colorspace_2.1-1
## [82] scattermore_1.2 tensor_1.5 spatstat.data_3.1-2
## [85] R.methodsS3_1.8.2 utf8_1.2.4 tidyr_1.3.1
## [88] generics_0.1.3 class_7.3-22 httr_1.4.7
## [91] htmlwidgets_1.6.4 spdep_1.4-2 uwot_0.2.2
## [94] pkgconfig_2.0.3 gtable_0.3.5 ComplexHeatmap_2.20.0
## [97] lmtest_0.9-40 htmltools_0.5.8.1 dotCall64_1.1-1
## [100] clue_0.3-65 scales_1.3.0 png_0.1-8
## [103] spatstat.univar_3.0-1 knitr_1.48 rstudioapi_0.16.0
## [106] reshape2_1.4.4 rjson_0.2.22 nlme_3.1-164
## [109] proxy_0.4-27 cachem_1.1.0 zoo_1.8-12
## [112] GlobalOptions_0.1.2 stringr_1.5.1 KernSmooth_2.23-24
## [115] miniUI_0.1.1.1 s2_1.1.9 desc_1.4.3
## [118] pillar_1.9.0 grid_4.4.1 vctrs_0.6.5
## [121] promises_1.3.0 xtable_1.8-4 evaluate_0.24.0
## [124] cli_3.6.3 compiler_4.4.1 rlang_1.1.4
## [127] crayon_1.5.3 future.apply_1.11.2 classInt_0.4-11
## [130] plyr_1.8.9 fs_1.6.4 stringi_1.8.4
## [133] viridisLite_0.4.2 deldir_2.0-4 gridBase_0.4-7
## [136] munsell_0.5.1 lazyeval_0.2.2 spatstat.geom_3.3-2
## [139] RcppHNSW_0.6.0 patchwork_1.2.0 future_1.34.0
## [142] ggplot2_3.5.1 shiny_1.9.1 ROCR_1.0-11
## [145] igraph_2.0.3 bslib_0.8.0