NMF Model Development for Spatial Ecotype Deconvolution from Bulk Gene Expression Data • SpatialEcoTyper

Development of SE deconvolution model to infer SE abundances from bulk gene expression data

This tutorial will demonstrate how to develop an NMF model to deconvolve SEs from bulk gene expression data. This process requires bulk gene expression data with known SE compositions, which we can generate by aggregating single-cell RNA-seq data into pseudo-bulk mixtures. For demonstration, we will use single-cell RNA-seq data from a melanoma sample.

Raw count data for this sample are available at scRNAseq_demo_counts.tsv and SE recovery results are available at scRNAseq_demo_RecoveredSEs.tsv.

First load required packages for this vignette

suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(parallel))
suppressPackageStartupMessages(library(Seurat))
suppressPackageStartupMessages(library(data.table))
library(SpatialEcoTyper)

## Loading required package: Matrix

## Loading required package: RANN

## Loading required package: NMF

## Loading required package: registry

## Loading required package: rngtools

## Loading required package: cluster

## NMF - BioConductor layer [OK] | Shared memory capabilities [NO: bigmemory] | Cores 2/2

##   To enable shared memory capabilities, try: install.extras('
## NMF
## ')

## Registered S3 methods overwritten by 'proxy':
##   method               from    
##   print.registry_field registry
##   print.registry_entry registry

Loading data

# Load single-cell gene expression matrix.
counts = fread("https://spatialecotyper.stanford.edu/inc/inc.public.vignettes.php?file=scRNAseq_demo_counts.tsv", sep = "\t", data.table = FALSE)
rownames(counts) = counts[, 1] ## Set the first column as row names
counts = counts[, -1] ## Drop the first column
head(counts[, 1:5])

##            AAACCTGCACATGACT-1 AAACCTGGTGGTCCGT-1 AAACCTGGTTTGTGTG-1
## AL627309.1                  0                  0                  0
## AL627309.5                  0                  0                  0
## AP006222.2                  0                  0                  0
## AC114498.1                  0                  0                  0
## AL669831.2                  0                  0                  0
## LINC01409                   0                  0                  0
##            AAACCTGTCCGATATG-1 AAACCTGTCTAACGGT-1
## AL627309.1                  0                  0
## AL627309.5                  1                  0
## AP006222.2                  0                  0
## AC114498.1                  0                  0
## AL669831.2                  0                  0
## LINC01409                   0                  0

# Load SE recovery results
SEs = read.table("https://spatialecotyper.stanford.edu/inc/inc.public.vignettes.php?file=scRNAseq_demo_RecoveredSEs.tsv", sep = "\t", row.names = 1, header = TRUE)
SEs = SEs[match(colnames(counts), rownames(SEs)), 1] # extract SE predictions
length(SEs)

## [1] 1337

table(SEs)

## SEs
## NonSE   SE1   SE2   SE3   SE4   SE5   SE6   SE7   SE8   SE9 
##   912    71    45    24     5    43    74    75    71    17

Note: For demonstration purposes, we used 1,337 cells grouped into SEs to create pseudo-bulk mixtures. While this limited cell number offers a basic example, it does not fully capture the diverse characteristics of SEs, potentially affecting the robustness of model training and subsequent SE recovery. In practice, we recommend using a more comprehensive dataset that accurately reflects the properties of SEs, ensuring that the training process results in a reliable recovery model.

Creating pseudobulk data

The CreatePseudobulks function will be used to create pseudo-bulk mixtures from single-cell (spatial) transcriptomics data with all cells grouped into SEs. It samples cell fractions from a Gaussian distribution, sets negative fractions to 0 and re-normalizes fractions to sum to 1 across all SEs. Based on the resulting fractions, it samples 1,000 cells per pseudo-bulk mixture with replacement, aggregates their transcriptomes in non-log linear space, and normalizes the resulting mixture to log CPM using Seurat’s NormalizeData.

result = CreatePseudobulks(counts = counts, groups = SEs, n_mixtures = 100)
## This step took ~4 min for the demo.
head(result$Mixtures[, 1:5]) ## Gene expression matrix of pseudobulks

##             Pseudobulk1 Pseudobulk2 Pseudobulk3 Pseudobulk4  Pseudobulk5
## AL627309.1 0.0174480160 0.021700286 0.025335709 0.021771366 0.0178844132
## AL627309.5 0.0556490311 0.066083860 0.047873652 0.048452191 0.0458422569
## AP006222.2 0.0015996830 0.001385931 0.001539961 0.001719366 0.0019482776
## AC114498.1 0.0009226095 0.001849908 0.012487922 0.007915024 0.0009235317
## AL669831.2 0.0003700004 0.000000000 0.000000000 0.000000000 0.0000000000
## LINC01409  0.0388370335 0.046997175 0.048489613 0.033545565 0.0350300209

head(result$Fracs) ## SE fractions in pseudobulks

##                  NonSE        SE1        SE2        SE3        SE4        SE5
## Pseudobulk1 0.09634633 0.06078223 0.08168480 0.18168768 0.15460683 0.05722716
## Pseudobulk2 0.05791127 0.11749288 0.08818621 0.09086087 0.05990615 0.02844153
## Pseudobulk3 0.09915892 0.13414687 0.10468587 0.07654495 0.09224340 0.10049175
## Pseudobulk4 0.14761371 0.10797752 0.09270873 0.09114589 0.04933215 0.07106387
## Pseudobulk5 0.00000000 0.11593232 0.13905860 0.13928982 0.09286052 0.09044864
## Pseudobulk6 0.07690522 0.26140848 0.11224667 0.00000000 0.09605090 0.14198815
##                    SE6        SE7        SE8        SE9
## Pseudobulk1 0.11596934 0.14275343 0.07695790 0.03198430
## Pseudobulk2 0.08035147 0.17205767 0.19535529 0.10943665
## Pseudobulk3 0.13693882 0.07876109 0.14704196 0.02998637
## Pseudobulk4 0.07181558 0.06432914 0.18721874 0.11679468
## Pseudobulk5 0.10664955 0.09198408 0.11168859 0.11208787
## Pseudobulk6 0.06597923 0.18534277 0.03238605 0.02769253

Training

The NMFGenerateW function will be used to train an NMF model for SE deconvolution based on the provided SE fractions and gene expression matrix. Before applying NMF, each gene’s expression is scaled to have a mean of 0 and unit variance by default. To meet the non-negativity requirement of NMF, the expression matrix is transformed using the posneg method. This transformation splits the expression matrix into two matrices: one containing only positive values and the other containing the absolute values of the negative values. These two matrices are then concatenated to form the final training data for the NMF model.

Note: By default, the NMFGenerateW selects the top 2000 variable features (nfeature argument) for training. After training, it retains 50 genes per SE (nfeature.per.se argument) in the model. Users can tune the parameters accordingly.

W = NMFGenerateW(result$Fracs, result$Mixtures)
## This step took less than 1 min for the demo.
head(W)

##                      NonSE          SE1          SE2          SE3          SE4
## IGLV2-14__neg 2.220446e-16 5.000166e-06 1.299754e-13 3.953218e-01 4.777894e-03
## IGLV2-14__pos 4.300850e+00 1.036021e-02 2.019052e-01 2.220446e-16 4.481028e-14
## ADIRF__pos    2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16 4.460434e+00
## IGKV1-5__neg  2.094720e-12 2.220446e-16 2.004657e-01 5.162830e-01 2.379044e+00
## PPP1R14A__pos 2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16 4.435008e+00
## FRZB__pos     2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16 4.428378e+00
##                        SE5          SE6          SE7          SE8          SE9
## IGLV2-14__neg 1.757031e+00 8.017933e-01 4.792380e-01 1.355624e+00 5.159578e-02
## IGLV2-14__pos 2.220446e-16 2.220446e-16 4.000668e-16 2.220446e-16 2.220446e-16
## ADIRF__pos    2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16
## IGKV1-5__neg  3.935260e-05 6.133635e-01 2.220446e-16 3.135628e-01 4.204170e-01
## PPP1R14A__pos 2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16
## FRZB__pos     2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16

## Save the model for late use.
saveRDS(W, "SE_Deconvolution_W.rds")

The resulting W matrix can be used to infer SE abundances from bulk gene expression profiles using the DeconvoluteSE function. The detailed documentation for SE deconvolution is available at Inferring Spatial Ecotype Abundances from Bulk Gene Expression Data.

Session info

The session info allows users to replicate the exact environment and identify potential discrepancies in package versions or configurations that might be causing problems.

sessionInfo()

## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin20
## Running under: macOS 26.4.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Los_Angeles
## tzcode source: internal
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] SpatialEcoTyper_1.0.3 NMF_0.28              Biobase_2.64.0       
##  [4] BiocGenerics_0.50.0   cluster_2.1.6         rngtools_1.5.2       
##  [7] registry_0.5-1        RANN_2.6.2            Matrix_1.7-0         
## [10] data.table_1.16.0     Seurat_5.1.0          SeuratObject_5.0.2   
## [13] sp_2.1-4              ggplot2_3.5.1         dplyr_1.1.4          
## 
## loaded via a namespace (and not attached):
##   [1] RcppAnnoy_0.0.22       splines_4.4.1          later_1.3.2           
##   [4] tibble_3.2.1           polyclip_1.10-7        fastDummies_1.7.4     
##   [7] lifecycle_1.0.4        sf_1.1-0               doParallel_1.0.17     
##  [10] globals_0.16.3         lattice_0.22-6         MASS_7.3-60.2         
##  [13] magrittr_2.0.3         plotly_4.10.4          sass_0.4.9            
##  [16] rmarkdown_2.28         jquerylib_0.1.4        yaml_2.3.10           
##  [19] httpuv_1.6.15          sctransform_0.4.1      spam_2.10-0           
##  [22] spatstat.sparse_3.1-0  reticulate_1.39.0      cowplot_1.1.3         
##  [25] pbapply_1.7-2          DBI_1.3.0              RColorBrewer_1.1-3    
##  [28] abind_1.4-5            Rtsne_0.17             purrr_1.0.2           
##  [31] circlize_0.4.16        IRanges_2.38.1         S4Vectors_0.42.1      
##  [34] ggrepel_0.9.6          irlba_2.3.5.1          listenv_0.9.1         
##  [37] spatstat.utils_3.1-0   units_1.0-1            goftest_1.2-3         
##  [40] RSpectra_0.16-2        spatstat.random_3.3-1  fitdistrplus_1.2-1    
##  [43] parallelly_1.38.0      pkgdown_2.1.0          leiden_0.4.3.1        
##  [46] codetools_0.2-20       tidyselect_1.2.1       shape_1.4.6.1         
##  [49] matrixStats_1.4.1      stats4_4.4.1           spatstat.explore_3.3-2
##  [52] jsonlite_1.8.8         GetoptLong_1.0.5       e1071_1.7-16          
##  [55] progressr_0.14.0       ggridges_0.5.6         survival_3.6-4        
##  [58] iterators_1.0.14       systemfonts_1.1.0      foreach_1.5.2         
##  [61] tools_4.4.1            ragg_1.3.2             ica_1.0-3             
##  [64] Rcpp_1.0.13            glue_1.7.0             gridExtra_2.3         
##  [67] xfun_0.52              withr_3.0.1            BiocManager_1.30.25   
##  [70] fastmap_1.2.0          boot_1.3-30            fansi_1.0.6           
##  [73] spData_2.3.4           digest_0.6.37          R6_2.5.1              
##  [76] mime_0.12              wk_0.9.5               textshaping_0.4.0     
##  [79] colorspace_2.1-1       scattermore_1.2        tensor_1.5            
##  [82] spatstat.data_3.1-2    utf8_1.2.4             tidyr_1.3.1           
##  [85] generics_0.1.3         class_7.3-22           httr_1.4.7            
##  [88] htmlwidgets_1.6.4      spdep_1.4-2            uwot_0.2.2            
##  [91] pkgconfig_2.0.3        gtable_0.3.5           ComplexHeatmap_2.20.0 
##  [94] lmtest_0.9-40          htmltools_0.5.8.1      dotCall64_1.1-1       
##  [97] clue_0.3-65            scales_1.3.0           png_0.1-8             
## [100] spatstat.univar_3.0-1  knitr_1.48             rstudioapi_0.16.0     
## [103] reshape2_1.4.4         rjson_0.2.22           nlme_3.1-164          
## [106] proxy_0.4-27           cachem_1.1.0           zoo_1.8-12            
## [109] GlobalOptions_0.1.2    stringr_1.5.1          KernSmooth_2.23-24    
## [112] miniUI_0.1.1.1         s2_1.1.9               desc_1.4.3            
## [115] pillar_1.9.0           grid_4.4.1             vctrs_0.6.5           
## [118] promises_1.3.0         xtable_1.8-4           evaluate_0.24.0       
## [121] cli_3.6.3              compiler_4.4.1         rlang_1.1.4           
## [124] crayon_1.5.3           future.apply_1.11.2    classInt_0.4-11       
## [127] plyr_1.8.9             fs_1.6.4               stringi_1.8.4         
## [130] viridisLite_0.4.2      deldir_2.0-4           gridBase_0.4-7        
## [133] munsell_0.5.1          lazyeval_0.2.2         spatstat.geom_3.3-2   
## [136] RcppHNSW_0.6.0         patchwork_1.2.0        future_1.34.0         
## [139] shiny_1.9.1            ROCR_1.0-11            igraph_2.0.3          
## [142] bslib_0.8.0