Development of NMF Models for Spatial Ecotype Recovery • SpatialEcoTyper

Development of SE recovery model

After performing Spatial EcoTyper analysis, users can develop a machine learning model to recover spatial ecotypes (SEs) from various types of gene expression data, including spatial transcriptomics and single-cell RNA-seq. In this tutorial, we will use the Vizgen MERSCOPE data of two samples to demonstrate the training process.

The gene expression matrices, cell type and SE annotations required for the training can be obtained from here.

First load required packages for this vignette

suppressPackageStartupMessages(library(googledrive))
suppressPackageStartupMessages(library(Seurat))
suppressPackageStartupMessages(library(NMF))
suppressPackageStartupMessages(library(data.table))
suppressPackageStartupMessages(library(dplyr))
library(SpatialEcoTyper)

Loading data

Download the data from Google Drive

drive_deauth() # Disable Google sign-in requirement
drive_download(as_id("17mbc56M0GTq-9MR0GqHas-L-yIXdnxtu"), "HumanMelanomaPatient1_subset_counts.tsv.gz", overwrite = TRUE)
drive_download(as_id("17nL0ceONPw8ByAtDe34DiXiGLMVBLlPX"), "HumanColonCancerPatient2_subset_counts.tsv.gz", overwrite = TRUE)
drive_download(as_id("18HsreR4QH6hh5L-YO7toP9lg62dVv8iX"), "MultiSE_metadata_final.tsv", overwrite = TRUE)

Load gene expression matrix

## Read the expression matrix for the first sample
scdata <- fread("HumanMelanomaPatient1_subset_counts.tsv.gz", 
                sep = "\t",header = TRUE, data.table = FALSE)
rownames(scdata) <- scdata[, 1]  ## set the first column as row names
scdata <- scdata[, -1]  ## drop the first column
scdata <- NormalizeData(scdata)  ## Normalize the expression matrix

## Read the expression matrix for the second sample
scdata2 <- fread("HumanColonCancerPatient2_subset_counts.tsv.gz", 
                sep = "\t",header = TRUE, data.table = FALSE)
rownames(scdata2) <- scdata2[, 1]  ## set the first column as row names
scdata2 <- scdata2[, -1]  ## drop the first column
scdata2 <- NormalizeData(scdata2)  ## Normalize the expression matrix

## Combine the two expression matrices
genes <- intersect(rownames(scdata), rownames(scdata2))
scdata <- cbind(scdata[genes, ], scdata2[genes, ])
head(scdata[, 1:5])

##          HumanMelanomaPatient1__cell_3655 HumanMelanomaPatient1__cell_3657
## PDK4                             0.000000                         4.221184
## TNFRSF17                         0.000000                         0.000000
## ICAM3                            0.000000                         0.000000
## FAP                              3.552438                         0.000000
## GZMB                             0.000000                         0.000000
## TSC2                             0.000000                         0.000000
##          HumanMelanomaPatient1__cell_3658 HumanMelanomaPatient1__cell_3660
## PDK4                             3.208945                                0
## TNFRSF17                         0.000000                                0
## ICAM3                            0.000000                                0
## FAP                              0.000000                                0
## GZMB                             0.000000                                0
## TSC2                             0.000000                                0
##          HumanMelanomaPatient1__cell_3661
## PDK4                                    0
## TNFRSF17                                0
## ICAM3                                   0
## FAP                                     0
## GZMB                                    0
## TSC2                                    0

Load cell type and SE annotations

## Load the single-cell meta data derived from MultiSpatialEcoTyper analysis
scmeta <- read.table("MultiSE_metadata_final.tsv", sep = "\t", 
                     header = TRUE, row.names = 1)
scmeta$SE[is.na(scmeta$SE)] = "nonSE" ## cells with NA predictions were assigned into the NonSE group

## Matches the row names of meta data and column names of expression matrix
cells <- intersect(colnames(scdata), rownames(scmeta))
scdata <- scdata[, cells]
scmeta <- scmeta[cells, ]
head(scmeta[, c("Sample", "CellType", "SE")])

##                                  Sample   CellType  SE
## HumanMelanomaPatient1__cell_3655   SKCM Fibroblast SE3
## HumanMelanomaPatient1__cell_3657   SKCM Fibroblast SE4
## HumanMelanomaPatient1__cell_3658   SKCM Fibroblast SE4
## HumanMelanomaPatient1__cell_3660   SKCM Fibroblast SE4
## HumanMelanomaPatient1__cell_3661   SKCM Fibroblast SE3
## HumanMelanomaPatient1__cell_3663   SKCM Fibroblast SE3

Training

The NMFGenerateWList function will be used to train cell-type specific NMF models for recovering SEs. Prior to NMF, each gene is scaled to mean 0 and unit variance within each cell type-specific expression matrix. To satisfy the non-negativity requirement of NMF, the cell type-specific expression matrices are individually processed using posneg transformation, which converts each expression matrix into two matrices, one containing only positive values and the other containing only negative values with the sign inverted. The two matrices are subsequently concatenated to produce the training data. The NMFGenerateWList function will return a list of feature × SE matrices, each representing a cell type-specific gene signature matrix with rows as features and columns as SEs.

Note: By default, the NMFGenerateWList selects the top 2000 variable features (nfeature argument) and down-samples 2500 cells (downsample argument) within each cell type for training. After training, it retains 50 genes per SE (nfeature.per.se argument) in the model. Users can tune the parameters accordingly.

# training
Ws = NMFGenerateWList(scdata, scmeta, 
                      CellType = "CellType",
                      SE = "SE", Sample = "Sample")
## This step would take ~3 mins for the demo.
class(Ws)

## [1] "list"

names(Ws)

## [1] "B"           "CD4T"        "CD8T"        "DC"          "Endothelial"
## [6] "Fibroblast"  "Macrophage"  "NK"          "Plasma"

head(Ws[[2]])

##                   SE1       SE5       SE6      SE10       SE8       SE3
## FOXP3__pos  0.4719930 0.4885273 0.3651733 0.6901381 0.8495508 0.5111029
## CCL5__neg   0.4636977 0.5029325 0.4671653 0.5279928 0.6660964 0.5879144
## NFKB2__neg  0.5959560 0.5353461 0.3610576 0.4095969 0.5328200 0.6356051
## NFKBIA__neg 0.5449949 0.5124919 0.4053110 0.3867929 0.4872992 0.5444012
## XBP1__neg   0.3641057 0.4390831 0.4536565 0.5994809 0.6964391 0.4286859
## IRF3__neg   0.4771004 0.4980158 0.4737179 0.5178389 0.4861118 0.4835203
##                   SE4       SE9       SE2       SE7
## FOXP3__pos  0.7223714 0.4173739 0.5810581 0.5287318
## CCL5__neg   0.7176785 0.3887930 0.5669015 0.7065557
## NFKB2__neg  0.7882044 0.4825905 1.0177761 0.1657350
## NFKBIA__neg 0.5245634 0.4819461 0.9273532 0.7001782
## XBP1__neg   0.4714610 0.5722981 0.6128531 0.8757327
## IRF3__neg   0.4070195 0.5213065 0.6902136 0.9023205

The resulting W matrices could be used to recover SEs from spatial transcriptomics and single-cell RNA-seq using the RecoverSE function. The detailed documentation for SE recovery is available at Recovery of Spatial Ecotypes from Single-Cell Gene Expression Data and Recovery of Spatial Ecotypes from Spatial Transcriptomics Data.

Development of SE deconvolution model

Users can also develop an NMF model to deconvolve SEs from bulk gene expression data. The training data can be pseudo-bulk mixtures by aggregating single-cell transcriptomics. In this tutorial, we will create pseudo-bulk mixtures from single-cell RNA-seq data of a melanoma sample.

The raw count for this sample is available at WU2161_counts.tsv and SE recovery results are available at WU2161_RecoveredSEs.tsv.

Loading data

# Download single-cell gene expression matrix and SE recovery results.
# The downloads should be finished within 1min.
drive_download(as_id("17VAeOnz6vTt2s0ZeTrK1kITdJ3Yus4ei"), "WU2161_counts.tsv", overwrite = TRUE)
drive_download(as_id("17kFjOHhmWxWAKm-LDqy6wV27o5lvz8L2"), "WU2161_RecoveredSEs.tsv", overwrite = TRUE)

# Load single-cell gene expression matrix.
counts = fread("WU2161_counts.tsv", sep = "\t", data.table = FALSE)
rownames(counts) = counts[, 1] ## Set the first column as row names
counts = counts[, -1] ## Drop the first column
head(counts[, 1:5])

##            AAACCTGCACATGACT-1 AAACCTGGTGGTCCGT-1 AAACCTGGTTTGTGTG-1
## AL627309.1                  0                  0                  0
## AL627309.5                  0                  0                  0
## AP006222.2                  0                  0                  0
## AC114498.1                  0                  0                  0
## AL669831.2                  0                  0                  0
## LINC01409                   0                  0                  0
##            AAACCTGTCCGATATG-1 AAACCTGTCTAACGGT-1
## AL627309.1                  0                  0
## AL627309.5                  1                  0
## AP006222.2                  0                  0
## AC114498.1                  0                  0
## AL669831.2                  0                  0
## LINC01409                   0                  0

# Load SE recovery results
SEs = read.table("WU2161_RecoveredSEs.tsv", sep = "\t", row.names = 1, header = TRUE)
SEs = SEs[match(colnames(counts), rownames(SEs)), 1] # extract SE predictions
length(SEs)

## [1] 1337

table(SEs)

## SEs
## nonSE   SE1   SE2   SE3   SE4   SE5   SE6   SE7   SE8   SE9 
##   912    71    45    24     5    43    74    75    71    17

Note: For demonstration purposes, we used 1,337 cells grouped into SEs to create pseudo-bulk mixtures. While this limited cell number offers a basic example, it does not fully capture the diverse characteristics of SEs, potentially affecting the robustness of model training and subsequent SE recovery. In practice, we recommend using a more comprehensive dataset that accurately reflects the properties of SEs, ensuring that the training process results in a reliable recovery model.

Creating pseudobulk data

The CreatePseudobulks function will be used to create pseudo-bulk mixtures from single-cell (spatial) transcriptomics data with all cells grouped into SEs. It samples cell fractions from a Gaussian distribution, sets negative fractions to 0 and re-normalizes fractions to sum to 1 across all SEs. Based on the resulting fractions, it samples 1,000 cells per pseudo-bulk mixture with replacement, aggregates their transcriptomes in non-log linear space, and normalize the resulting mixture to logarithm CPM using Seurat’s NormalizeData.

result = CreatePseudobulks(counts = counts, groups = SEs, n_mixtures = 100)
## This step took ~4 min for the demo.
head(result$Mixtures[, 1:5]) ## Gene expression matrix of pseudobulks

##            Pseudobulk1  Pseudobulk2 Pseudobulk3  Pseudobulk4  Pseudobulk5
## AL627309.1 0.020129003 0.0166225359 0.023307824 0.0098541618 0.0088996542
## AL627309.5 0.042782118 0.0507958729 0.065149253 0.0307522233 0.0470043992
## AP006222.2 0.001258585 0.0047970105 0.002820776 0.0005117182 0.0027994077
## AC114498.1 0.001848058 0.0009235317 0.002770807 0.0018443685 0.0009235317
## AL669831.2 0.000000000 0.0000000000 0.000000000 0.0000000000 0.0000000000
## LINC01409  0.028300784 0.0323725374 0.042793627 0.0507739102 0.0274993034

head(result$Fracs) ## SE fractions in pseudobulks

##                  nonSE        SE1        SE2        SE3         SE4        SE5
## Pseudobulk1 0.16490065 0.04918395 0.04180656 0.13380601 0.108023707 0.10246218
## Pseudobulk2 0.06857820 0.14872621 0.11331726 0.11852688 0.051279421 0.09434107
## Pseudobulk3 0.08345357 0.17803366 0.19856639 0.00000000 0.002316577 0.04424433
## Pseudobulk4 0.09014638 0.07754488 0.06834655 0.06010091 0.080744805 0.11287459
## Pseudobulk5 0.12412216 0.00000000 0.05709886 0.11705402 0.102333027 0.21175799
## Pseudobulk6 0.04653528 0.07215946 0.05076475 0.13719262 0.114854678 0.17734928
##                    SE6        SE7        SE8        SE9
## Pseudobulk1 0.15013277 0.11912636 0.09232816 0.03822966
## Pseudobulk2 0.10492044 0.08010585 0.09546042 0.12474424
## Pseudobulk3 0.15682092 0.15585820 0.03024719 0.15045916
## Pseudobulk4 0.13277544 0.06254595 0.17260478 0.14231574
## Pseudobulk5 0.06920083 0.04372533 0.22512173 0.04958606
## Pseudobulk6 0.06965773 0.16811845 0.02792748 0.13544026

Training

The NMFGenerateW function will be used to train an NMF model for SE deconvolution based on the provided SE fractions and gene expression matrix. Before applying NMF, each gene’s expression is scaled to have a mean of 0 and unit variance by default. To meet the non-negativity requirement of NMF, the expression matrix is transformed using the posneg method. This transformation splits the expression matrix into two matrices: one containing only positive values and the other containing the absolute values of the negative values. These two matrices are then concatenated to form the final training data for the NMF model.

Note: By default, the NMFGenerateW selects the top 2000 variable features (nfeature argument) for training. After training, it retains 50 genes per SE (nfeature.per.se argument) in the model. Users can tune the parameters accordingly.

W = NMFGenerateW(result$Fracs, result$Mixtures)
## This step took less than 1 min for the demo.
head(W)

##                       nonSE          SE1          SE2          SE3          SE4
## IGLV1-40__pos  6.435004e-02 2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16
## IGHV4-34__pos  2.220446e-16 2.220446e-16 1.352908e-07 1.027098e-12 2.220446e-16
## IGLV2-14__pos  4.468224e+00 2.220446e-16 1.323646e-15 2.220446e-16 1.962789e-01
## IGKV2D-30__pos 1.061495e-15 2.220446e-16 6.075700e-01 1.292995e-15 1.082039e-15
## IGKV1-33__pos  2.220446e-16 2.220446e-16 1.326303e-08 2.206336e-12 2.220446e-16
## NLRP3__pos     2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16
##                         SE5          SE6          SE7          SE8          SE9
## IGLV1-40__pos  1.003067e-06 2.220446e-16 4.618030e+00 9.300405e-10 2.220446e-16
## IGHV4-34__pos  1.264496e-11 2.220446e-16 4.671389e+00 2.220446e-16 2.220446e-16
## IGLV2-14__pos  2.220446e-16 2.220446e-16 2.220446e-16 5.860832e-09 2.220446e-16
## IGKV2D-30__pos 2.220446e-16 2.220446e-16 4.054420e+00 2.220446e-16 4.717884e-13
## IGKV1-33__pos  2.841738e-12 2.220446e-16 4.649631e+00 2.220446e-16 3.816478e-16
## NLRP3__pos     2.220446e-16 4.617904e+00 2.220446e-16 2.220446e-16 2.220446e-16

## Save the model for late use.
saveRDS(W, "SE_Deconvolution_W.rds")

The resulting W matrix can be used to infer SE abundances from bulk gene expression profiles using the DeconvoluteSE function. The detailed documentation for SE deconvolution is available at Recovery of Spatial Ecotypes from Bulk Gene Expression Data.

Session info

The session info allows users to replicate the exact environment and identify potential discrepancies in package versions or configurations that might be causing problems.

sessionInfo()

## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin20
## Running under: macOS 15.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Los_Angeles
## tzcode source: internal
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] SpatialEcoTyper_1.0.0 RANN_2.6.2            Matrix_1.7-0         
##  [4] dplyr_1.1.4           data.table_1.16.0     NMF_0.28             
##  [7] Biobase_2.64.0        BiocGenerics_0.50.0   cluster_2.1.6        
## [10] rngtools_1.5.2        registry_0.5-1        Seurat_5.1.0         
## [13] SeuratObject_5.0.2    sp_2.1-4              googledrive_2.1.1    
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3     shape_1.4.6.1          rstudioapi_0.16.0     
##   [4] jsonlite_1.8.8         magrittr_2.0.3         spatstat.utils_3.1-0  
##   [7] rmarkdown_2.28         GlobalOptions_0.1.2    fs_1.6.4              
##  [10] ragg_1.3.2             vctrs_0.6.5            ROCR_1.0-11           
##  [13] spatstat.explore_3.3-2 htmltools_0.5.8.1      curl_5.2.2            
##  [16] sass_0.4.9             sctransform_0.4.1      parallelly_1.38.0     
##  [19] KernSmooth_2.23-24     bslib_0.8.0            htmlwidgets_1.6.4     
##  [22] desc_1.4.3             ica_1.0-3              plyr_1.8.9            
##  [25] plotly_4.10.4          zoo_1.8-12             cachem_1.1.0          
##  [28] igraph_2.0.3           iterators_1.0.14       mime_0.12             
##  [31] lifecycle_1.0.4        pkgconfig_2.0.3        R6_2.5.1              
##  [34] fastmap_1.2.0          clue_0.3-65            fitdistrplus_1.2-1    
##  [37] future_1.34.0          shiny_1.9.1            digest_0.6.37         
##  [40] colorspace_2.1-1       S4Vectors_0.42.1       patchwork_1.2.0       
##  [43] tensor_1.5             RSpectra_0.16-2        irlba_2.3.5.1         
##  [46] textshaping_0.4.0      progressr_0.14.0       fansi_1.0.6           
##  [49] spatstat.sparse_3.1-0  httr_1.4.7             polyclip_1.10-7       
##  [52] abind_1.4-5            compiler_4.4.1         gargle_1.5.2          
##  [55] withr_3.0.1            doParallel_1.0.17      fastDummies_1.7.4     
##  [58] R.utils_2.12.3         MASS_7.3-60.2          rjson_0.2.22          
##  [61] tools_4.4.1            lmtest_0.9-40          httpuv_1.6.15         
##  [64] future.apply_1.11.2    goftest_1.2-3          R.oo_1.26.0           
##  [67] glue_1.7.0             nlme_3.1-164           promises_1.3.0        
##  [70] grid_4.4.1             gridBase_0.4-7         Rtsne_0.17            
##  [73] reshape2_1.4.4         generics_0.1.3         gtable_0.3.5          
##  [76] spatstat.data_3.1-2    R.methodsS3_1.8.2      tidyr_1.3.1           
##  [79] utf8_1.2.4             spatstat.geom_3.3-2    RcppAnnoy_0.0.22      
##  [82] foreach_1.5.2          ggrepel_0.9.6          pillar_1.9.0          
##  [85] stringr_1.5.1          spam_2.10-0            RcppHNSW_0.6.0        
##  [88] later_1.3.2            circlize_0.4.16        splines_4.4.1         
##  [91] lattice_0.22-6         survival_3.6-4         deldir_2.0-4          
##  [94] tidyselect_1.2.1       ComplexHeatmap_2.20.0  miniUI_0.1.1.1        
##  [97] pbapply_1.7-2          knitr_1.48             gridExtra_2.3         
## [100] IRanges_2.38.1         scattermore_1.2        stats4_4.4.1          
## [103] xfun_0.47              matrixStats_1.4.1      stringi_1.8.4         
## [106] lazyeval_0.2.2         yaml_2.3.10            evaluate_0.24.0       
## [109] codetools_0.2-20       tibble_3.2.1           BiocManager_1.30.25   
## [112] cli_3.6.3              uwot_0.2.2             xtable_1.8-4          
## [115] reticulate_1.39.0      systemfonts_1.1.0      munsell_0.5.1         
## [118] jquerylib_0.1.4        Rcpp_1.0.13            globals_0.16.3        
## [121] spatstat.random_3.3-1  png_0.1-8              spatstat.univar_3.0-1 
## [124] pkgdown_2.1.0          ggplot2_3.5.1          dotCall64_1.1-1       
## [127] listenv_0.9.1          viridisLite_0.4.2      scales_1.3.0          
## [130] ggridges_0.5.6         crayon_1.5.3           leiden_0.4.3.1        
## [133] purrr_1.0.2            GetoptLong_1.0.5       rlang_1.1.4           
## [136] cowplot_1.1.3