
NMF Model Development for Spatial Ecotype Deconvolution from Bulk Gene Expression Data
Source:vignettes/TrainDeconvModel.Rmd
TrainDeconvModel.RmdDevelopment of SE deconvolution model to infer SE abundances from bulk gene expression data
This tutorial will demonstrate how to develop an NMF model to deconvolve SEs from bulk gene expression data. This process requires bulk gene expression data with known SE compositions, which we can generate by aggregating single-cell RNA-seq data into pseudo-bulk mixtures. For demonstration, we will use single-cell RNA-seq data from a melanoma sample.
The raw count for this sample is available at scRNAseq_demo_counts.tsv and SE recovery results are available at scRNAseq_demo_RecoveredSEs.tsv.
First load required packages for this vignette
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(parallel))
suppressPackageStartupMessages(library(Seurat))
suppressPackageStartupMessages(library(data.table))
library(SpatialEcoTyper)## Loading required package: Matrix
## Loading required package: RANN
## Loading required package: NMF
## Loading required package: registry
## Loading required package: rngtools
## Loading required package: cluster
## NMF - BioConductor layer [OK] | Shared memory capabilities [NO: bigmemory] | Cores 2/2
## To enable shared memory capabilities, try: install.extras('
## NMF
## ')
## Registered S3 methods overwritten by 'proxy':
## method from
## print.registry_field registry
## print.registry_entry registry
Loading data
# Load single-cell gene expression matrix.
counts = fread("https://spatialecotyper.stanford.edu/inc/inc.public.vignettes.php?file=scRNAseq_demo_counts.tsv", sep = "\t", data.table = FALSE)
rownames(counts) = counts[, 1] ## Set the first column as row names
counts = counts[, -1] ## Drop the first column
head(counts[, 1:5])## AAACCTGCACATGACT-1 AAACCTGGTGGTCCGT-1 AAACCTGGTTTGTGTG-1
## AL627309.1 0 0 0
## AL627309.5 0 0 0
## AP006222.2 0 0 0
## AC114498.1 0 0 0
## AL669831.2 0 0 0
## LINC01409 0 0 0
## AAACCTGTCCGATATG-1 AAACCTGTCTAACGGT-1
## AL627309.1 0 0
## AL627309.5 1 0
## AP006222.2 0 0
## AC114498.1 0 0
## AL669831.2 0 0
## LINC01409 0 0
# Load SE recovery results
SEs = read.table("https://spatialecotyper.stanford.edu/inc/inc.public.vignettes.php?file=scRNAseq_demo_RecoveredSEs.tsv", sep = "\t", row.names = 1, header = TRUE)
SEs = SEs[match(colnames(counts), rownames(SEs)), 1] # extract SE predictions
length(SEs)## [1] 1337
table(SEs)## SEs
## NonSE SE1 SE2 SE3 SE4 SE5 SE6 SE7 SE8 SE9
## 912 71 45 24 5 43 74 75 71 17
Note: For demonstration purposes, we used 1,337 cells grouped into SEs to create pseudo-bulk mixtures. While this limited cell number offers a basic example, it does not fully capture the diverse characteristics of SEs, potentially affecting the robustness of model training and subsequent SE recovery. In practice, we recommend using a more comprehensive dataset that accurately reflects the properties of SEs, ensuring that the training process results in a reliable recovery model.
Creating pseudobulk data
The CreatePseudobulks
function will be used to create pseudo-bulk mixtures from single-cell
(spatial) transcriptomics data with all cells grouped into SEs. It
samples cell fractions from a Gaussian distribution, sets negative
fractions to 0 and re-normalizes fractions to sum to 1 across all SEs.
Based on the resulting fractions, it samples 1,000 cells per pseudo-bulk
mixture with replacement, aggregates their transcriptomes in non-log
linear space, and normalizes the resulting mixture to log CPM using
Seurat’s NormalizeData.
result = CreatePseudobulks(counts = counts, groups = SEs, n_mixtures = 100)
## This step took ~4 min for the demo.
head(result$Mixtures[, 1:5]) ## Gene expression matrix of pseudobulks## Pseudobulk1 Pseudobulk2 Pseudobulk3 Pseudobulk4 Pseudobulk5
## AL627309.1 0.0174480160 0.021700286 0.025335709 0.021771366 0.0178844132
## AL627309.5 0.0556490311 0.066083860 0.047873652 0.048452191 0.0458422569
## AP006222.2 0.0015996830 0.001385931 0.001539961 0.001719366 0.0019482776
## AC114498.1 0.0009226095 0.001849908 0.012487922 0.007915024 0.0009235317
## AL669831.2 0.0003700004 0.000000000 0.000000000 0.000000000 0.0000000000
## LINC01409 0.0388370335 0.046997175 0.048489613 0.033545565 0.0350300209
head(result$Fracs) ## SE fractions in pseudobulks## NonSE SE1 SE2 SE3 SE4 SE5
## Pseudobulk1 0.09634633 0.06078223 0.08168480 0.18168768 0.15460683 0.05722716
## Pseudobulk2 0.05791127 0.11749288 0.08818621 0.09086087 0.05990615 0.02844153
## Pseudobulk3 0.09915892 0.13414687 0.10468587 0.07654495 0.09224340 0.10049175
## Pseudobulk4 0.14761371 0.10797752 0.09270873 0.09114589 0.04933215 0.07106387
## Pseudobulk5 0.00000000 0.11593232 0.13905860 0.13928982 0.09286052 0.09044864
## Pseudobulk6 0.07690522 0.26140848 0.11224667 0.00000000 0.09605090 0.14198815
## SE6 SE7 SE8 SE9
## Pseudobulk1 0.11596934 0.14275343 0.07695790 0.03198430
## Pseudobulk2 0.08035147 0.17205767 0.19535529 0.10943665
## Pseudobulk3 0.13693882 0.07876109 0.14704196 0.02998637
## Pseudobulk4 0.07181558 0.06432914 0.18721874 0.11679468
## Pseudobulk5 0.10664955 0.09198408 0.11168859 0.11208787
## Pseudobulk6 0.06597923 0.18534277 0.03238605 0.02769253
Training
The NMFGenerateW function will be used to train an NMF model for SE deconvolution based on the provided SE fractions and gene expression matrix. Before applying NMF, each gene’s expression is scaled to have a mean of 0 and unit variance by default. To meet the non-negativity requirement of NMF, the expression matrix is transformed using the posneg method. This transformation splits the expression matrix into two matrices: one containing only positive values and the other containing the absolute values of the negative values. These two matrices are then concatenated to form the final training data for the NMF model.
Note: By default, the NMFGenerateW selects the top
2000 variable features (nfeature argument) for training.
After training, it retains 50 genes per SE (nfeature.per.se
argument) in the model. Users can tune the parameters accordingly.
W = NMFGenerateW(result$Fracs, result$Mixtures)
## This step took less than 1 min for the demo.
head(W)## NonSE SE1 SE2 SE3 SE4
## IGLV2-14__neg 2.220446e-16 5.000166e-06 1.299754e-13 3.953218e-01 4.777894e-03
## IGLV2-14__pos 4.300850e+00 1.036021e-02 2.019052e-01 2.220446e-16 4.481028e-14
## ADIRF__pos 2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16 4.460434e+00
## IGKV1-5__neg 2.094720e-12 2.220446e-16 2.004657e-01 5.162830e-01 2.379044e+00
## PPP1R14A__pos 2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16 4.435008e+00
## FRZB__pos 2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16 4.428378e+00
## SE5 SE6 SE7 SE8 SE9
## IGLV2-14__neg 1.757031e+00 8.017933e-01 4.792380e-01 1.355624e+00 5.159578e-02
## IGLV2-14__pos 2.220446e-16 2.220446e-16 4.000668e-16 2.220446e-16 2.220446e-16
## ADIRF__pos 2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16
## IGKV1-5__neg 3.935260e-05 6.133635e-01 2.220446e-16 3.135628e-01 4.204170e-01
## PPP1R14A__pos 2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16
## FRZB__pos 2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16
## Save the model for late use.
saveRDS(W, "SE_Deconvolution_W.rds")The resulting W matrix can be used to infer SE abundances from bulk gene expression profiles using the DeconvoluteSE function. The detailed documentation for SE deconvolution is available at Inferring Spatial Ecotype Abundances from Bulk Gene Expression Data.
Session info
The session info allows users to replicate the exact environment and identify potential discrepancies in package versions or configurations that might be causing problems.
## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin20
## Running under: macOS 26.4.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Los_Angeles
## tzcode source: internal
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] SpatialEcoTyper_1.0.2 NMF_0.28 Biobase_2.64.0
## [4] BiocGenerics_0.50.0 cluster_2.1.6 rngtools_1.5.2
## [7] registry_0.5-1 RANN_2.6.2 Matrix_1.7-0
## [10] data.table_1.16.0 Seurat_5.1.0 SeuratObject_5.0.2
## [13] sp_2.1-4 ggplot2_3.5.1 dplyr_1.1.4
##
## loaded via a namespace (and not attached):
## [1] RcppAnnoy_0.0.22 splines_4.4.1 later_1.3.2
## [4] tibble_3.2.1 polyclip_1.10-7 fastDummies_1.7.4
## [7] lifecycle_1.0.4 sf_1.1-0 doParallel_1.0.17
## [10] globals_0.16.3 lattice_0.22-6 MASS_7.3-60.2
## [13] magrittr_2.0.3 plotly_4.10.4 sass_0.4.9
## [16] rmarkdown_2.28 jquerylib_0.1.4 yaml_2.3.10
## [19] httpuv_1.6.15 sctransform_0.4.1 spam_2.10-0
## [22] spatstat.sparse_3.1-0 reticulate_1.39.0 cowplot_1.1.3
## [25] pbapply_1.7-2 DBI_1.3.0 RColorBrewer_1.1-3
## [28] abind_1.4-5 Rtsne_0.17 purrr_1.0.2
## [31] circlize_0.4.16 IRanges_2.38.1 S4Vectors_0.42.1
## [34] ggrepel_0.9.6 irlba_2.3.5.1 listenv_0.9.1
## [37] spatstat.utils_3.1-0 units_1.0-1 goftest_1.2-3
## [40] RSpectra_0.16-2 spatstat.random_3.3-1 fitdistrplus_1.2-1
## [43] parallelly_1.38.0 pkgdown_2.1.0 leiden_0.4.3.1
## [46] codetools_0.2-20 tidyselect_1.2.1 shape_1.4.6.1
## [49] matrixStats_1.4.1 stats4_4.4.1 spatstat.explore_3.3-2
## [52] jsonlite_1.8.8 GetoptLong_1.0.5 e1071_1.7-16
## [55] progressr_0.14.0 ggridges_0.5.6 survival_3.6-4
## [58] iterators_1.0.14 systemfonts_1.1.0 foreach_1.5.2
## [61] tools_4.4.1 ragg_1.3.2 ica_1.0-3
## [64] Rcpp_1.0.13 glue_1.7.0 gridExtra_2.3
## [67] xfun_0.52 withr_3.0.1 BiocManager_1.30.25
## [70] fastmap_1.2.0 boot_1.3-30 fansi_1.0.6
## [73] spData_2.3.4 digest_0.6.37 R6_2.5.1
## [76] mime_0.12 wk_0.9.5 textshaping_0.4.0
## [79] colorspace_2.1-1 scattermore_1.2 tensor_1.5
## [82] spatstat.data_3.1-2 utf8_1.2.4 tidyr_1.3.1
## [85] generics_0.1.3 class_7.3-22 httr_1.4.7
## [88] htmlwidgets_1.6.4 spdep_1.4-2 uwot_0.2.2
## [91] pkgconfig_2.0.3 gtable_0.3.5 ComplexHeatmap_2.20.0
## [94] lmtest_0.9-40 htmltools_0.5.8.1 dotCall64_1.1-1
## [97] clue_0.3-65 scales_1.3.0 png_0.1-8
## [100] spatstat.univar_3.0-1 knitr_1.48 rstudioapi_0.16.0
## [103] reshape2_1.4.4 rjson_0.2.22 nlme_3.1-164
## [106] proxy_0.4-27 cachem_1.1.0 zoo_1.8-12
## [109] GlobalOptions_0.1.2 stringr_1.5.1 KernSmooth_2.23-24
## [112] miniUI_0.1.1.1 s2_1.1.9 desc_1.4.3
## [115] pillar_1.9.0 grid_4.4.1 vctrs_0.6.5
## [118] promises_1.3.0 xtable_1.8-4 evaluate_0.24.0
## [121] cli_3.6.3 compiler_4.4.1 rlang_1.1.4
## [124] crayon_1.5.3 future.apply_1.11.2 classInt_0.4-11
## [127] plyr_1.8.9 fs_1.6.4 stringi_1.8.4
## [130] viridisLite_0.4.2 deldir_2.0-4 gridBase_0.4-7
## [133] munsell_0.5.1 lazyeval_0.2.2 spatstat.geom_3.3-2
## [136] RcppHNSW_0.6.0 patchwork_1.2.0 future_1.34.0
## [139] shiny_1.9.1 ROCR_1.0-11 igraph_2.0.3
## [142] bslib_0.8.0