Skip to contents

Overview

After performing Spatial EcoTyper analysis, users can develop machine learning models for spatial ecotype (SE) recovery, including reference-guided SE annotation in held-out single-cell datasets (scRNA-seq or single-cell spatial transcriptomics) and SE deconvolution to infer SE abundances from bulk gene expression data (bulk RNA-seq or Visium). The Spatial EcoTyper framework employs non-negative matrix factorization (NMF) as the core algorithm in both analyses, given its robustness and proven effectiveness in recovering carcinoma ecotypes from multimodal expression data in our prior study. Here, we will demonstrate how to train NMF models for SE recovery from single-cell gene expression data and for SE deconvolution from bulk gene expression data separately.

Development of recovery model to annotate SEs in single-cell gene expression data

In this tutorial, we will use the Vizgen MERSCOPE data of two samples to demonstrate the process for developing cell-type-specific NMF models for recovering SEs from single-cell gene expression profiles, including both single-cell spatial transcriptomics and single-cell RNA-seq data.

The gene expression matrices, cell type and SE annotations required for the training can be obtained from here.

First load required packages for this vignette

Loading data

Download the data from Google Drive

drive_deauth() # Disable Google sign-in requirement
drive_download(as_id("17mbc56M0GTq-9MR0GqHas-L-yIXdnxtu"), "HumanMelanomaPatient1_subset_counts.tsv.gz", overwrite = TRUE)
drive_download(as_id("17nL0ceONPw8ByAtDe34DiXiGLMVBLlPX"), "HumanColonCancerPatient2_subset_counts.tsv.gz", overwrite = TRUE)
drive_download(as_id("1t3MX84RHm5BQMzcAvXtmzc-Dj4e6Y7Ik"), "MultiSE_metadata_final.tsv", overwrite = TRUE)

Load gene expression matrix

## Read the expression matrix for the first sample
scdata <- fread("HumanMelanomaPatient1_subset_counts.tsv.gz", 
                sep = "\t",header = TRUE, data.table = FALSE)
rownames(scdata) <- scdata[, 1]  ## set the first column as row names
scdata <- scdata[, -1]  ## drop the first column
scdata <- NormalizeData(scdata)  ## Normalize the expression matrix

## Read the expression matrix for the second sample
scdata2 <- fread("HumanColonCancerPatient2_subset_counts.tsv.gz", 
                sep = "\t",header = TRUE, data.table = FALSE)
rownames(scdata2) <- scdata2[, 1]  ## set the first column as row names
scdata2 <- scdata2[, -1]  ## drop the first column
scdata2 <- NormalizeData(scdata2)  ## Normalize the expression matrix

## Combine the two expression matrices
genes <- intersect(rownames(scdata), rownames(scdata2))
scdata <- cbind(scdata[genes, ], scdata2[genes, ])
head(scdata[, 1:5])
##          HumanMelanomaPatient1__cell_3655 HumanMelanomaPatient1__cell_3657
## PDK4                             0.000000                         4.221184
## TNFRSF17                         0.000000                         0.000000
## ICAM3                            0.000000                         0.000000
## FAP                              3.552438                         0.000000
## GZMB                             0.000000                         0.000000
## TSC2                             0.000000                         0.000000
##          HumanMelanomaPatient1__cell_3658 HumanMelanomaPatient1__cell_3660
## PDK4                             3.208945                                0
## TNFRSF17                         0.000000                                0
## ICAM3                            0.000000                                0
## FAP                              0.000000                                0
## GZMB                             0.000000                                0
## TSC2                             0.000000                                0
##          HumanMelanomaPatient1__cell_3661
## PDK4                                    0
## TNFRSF17                                0
## ICAM3                                   0
## FAP                                     0
## GZMB                                    0
## TSC2                                    0

Load cell type and SE annotations

## Load the single-cell meta data derived from MultiSpatialEcoTyper analysis
scmeta <- read.table("MultiSE_metadata_final.tsv", sep = "\t", 
                     header = TRUE, row.names = 1)
scmeta$SE[is.na(scmeta$SE)] = "nonSE" ## cells with NA predictions were assigned into the nonSE group

## Matches the row names of meta data and column names of expression matrix
cells <- intersect(colnames(scdata), rownames(scmeta))
scdata <- scdata[, cells]
scmeta <- scmeta[cells, ]
head(scmeta[, c("Sample", "CellType", "SE")])
##                                  Sample   CellType  SE
## HumanMelanomaPatient1__cell_3655   SKCM Fibroblast SE6
## HumanMelanomaPatient1__cell_3657   SKCM Fibroblast SE1
## HumanMelanomaPatient1__cell_3658   SKCM Fibroblast SE1
## HumanMelanomaPatient1__cell_3660   SKCM Fibroblast SE1
## HumanMelanomaPatient1__cell_3661   SKCM Fibroblast SE6
## HumanMelanomaPatient1__cell_3663   SKCM Fibroblast SE6

Training

The NMFGenerateWList function will be used to train cell-type specific NMF models for recovering SEs. Prior to NMF, each gene is scaled to mean 0 and unit variance within each cell type-specific expression matrix. To satisfy the non-negativity requirement of NMF, the cell type-specific expression matrices are individually processed using posneg transformation, which converts each expression matrix into two matrices, one containing only positive values and the other containing only negative values with the sign inverted. The two matrices are subsequently concatenated to produce the training data. The NMFGenerateWList function will return a list of feature × SE matrices, each representing a cell type-specific gene signature matrix with rows as features and columns as SEs.

Note: By default, the NMFGenerateWList selects the top 2000 variable features (nfeature argument) and down-samples 2500 cells (downsample argument) within each cell type for training. After training, it retains 50 genes per SE (nfeature.per.se argument) in the model. Users can tune the parameters accordingly.

# training
Ws = NMFGenerateWList(scdata, scmeta, 
                      CellType = "CellType",
                      SE = "SE", Sample = "Sample")
## This step would take ~3 mins for the demo.
class(Ws)
## [1] "list"
names(Ws)
## [1] "B"           "CD4T"        "CD8T"        "DC"          "Endothelial"
## [6] "Fibroblast"  "Macrophage"  "NK"          "Plasma"
head(Ws[[2]])
##                   SE6      SE10       SE8       SE5       SE4       SE3
## CCL5__neg   0.5145161 0.5397050 0.4430638 0.7122035 0.5261429 0.4446076
## XBP1__neg   0.4026504 0.5881324 0.4534266 0.9834871 0.4475261 0.3841000
## NFKBIA__neg 0.5205229 0.3957970 0.3920701 0.3677413 0.6758230 0.5218947
## CDKN1B__neg 0.3609662 0.5577056 0.5121339 0.7531579 0.4968649 0.4651990
## CXCR6__pos  0.4714863 0.4993970 0.2598859 0.8154450 0.6067278 0.4657444
## NFKB1__neg  0.5238008 0.4750225 0.4536049 0.4313634 0.6072406 0.5600733
##                   SE1       SE2       SE7       SE9
## CCL5__neg   0.6530544 0.5411019 0.7678666 0.6780336
## XBP1__neg   0.6401507 0.4460380 0.5866628 0.8487692
## NFKBIA__neg 0.6758407 0.5994373 0.5995011 1.0137611
## CDKN1B__neg 0.7280766 0.4544132 0.6294411 0.7415819
## CXCR6__pos  0.6089321 0.6168503 0.4655022 0.8689492
## NFKB1__neg  0.5656818 0.4970913 0.6098723 0.8485226

The resulting W matrices could be used to recover SEs from spatial transcriptomics and single-cell RNA-seq using the RecoverSE function. The detailed documentation for SE recovery is available at Recovery of Spatial Ecotypes from Single-Cell Gene Expression Data and Recovery of Spatial Ecotypes from Spatial Transcriptomics Data.

Development of SE deconvolution model to infer SE abundances from bulk gene expression data

In this tutorial, we will demonstrate how to develop an NMF model to deconvolve SEs from bulk gene expression data. This process requires bulk training data with known SE compositions, which we can generated by aggregating single-cell RNA-seq data into pseudo-bulk mixtures. For demonstration, we will use single-cell RNA-seq data from a melanoma sample.

The raw count for this sample is available at WU2161_counts.tsv and SE recovery results are available at WU2161_RecoveredSEs.tsv.

Loading data

# Download single-cell gene expression matrix and SE recovery results.
# The downloads should be finished within 1min.
drive_download(as_id("17VAeOnz6vTt2s0ZeTrK1kITdJ3Yus4ei"), "WU2161_counts.tsv", overwrite = TRUE)
drive_download(as_id("17kFjOHhmWxWAKm-LDqy6wV27o5lvz8L2"), "WU2161_RecoveredSEs.tsv", overwrite = TRUE)
# Load single-cell gene expression matrix.
counts = fread("WU2161_counts.tsv", sep = "\t", data.table = FALSE)
rownames(counts) = counts[, 1] ## Set the first column as row names
counts = counts[, -1] ## Drop the first column
head(counts[, 1:5])
##            AAACCTGCACATGACT-1 AAACCTGGTGGTCCGT-1 AAACCTGGTTTGTGTG-1
## AL627309.1                  0                  0                  0
## AL627309.5                  0                  0                  0
## AP006222.2                  0                  0                  0
## AC114498.1                  0                  0                  0
## AL669831.2                  0                  0                  0
## LINC01409                   0                  0                  0
##            AAACCTGTCCGATATG-1 AAACCTGTCTAACGGT-1
## AL627309.1                  0                  0
## AL627309.5                  1                  0
## AP006222.2                  0                  0
## AC114498.1                  0                  0
## AL669831.2                  0                  0
## LINC01409                   0                  0
# Load SE recovery results
SEs = read.table("WU2161_RecoveredSEs.tsv", sep = "\t", row.names = 1, header = TRUE)
SEs = SEs[match(colnames(counts), rownames(SEs)), 1] # extract SE predictions
length(SEs)
## [1] 1337
table(SEs)
## SEs
## nonSE   SE1   SE2   SE3   SE4   SE5   SE6   SE7   SE8   SE9 
##   912    71    45    24     5    43    74    75    71    17

Note: For demonstration purposes, we used 1,337 cells grouped into SEs to create pseudo-bulk mixtures. While this limited cell number offers a basic example, it does not fully capture the diverse characteristics of SEs, potentially affecting the robustness of model training and subsequent SE recovery. In practice, we recommend using a more comprehensive dataset that accurately reflects the properties of SEs, ensuring that the training process results in a reliable recovery model.

Creating pseudobulk data

The CreatePseudobulks function will be used to create pseudo-bulk mixtures from single-cell (spatial) transcriptomics data with all cells grouped into SEs. It samples cell fractions from a Gaussian distribution, sets negative fractions to 0 and re-normalizes fractions to sum to 1 across all SEs. Based on the resulting fractions, it samples 1,000 cells per pseudo-bulk mixture with replacement, aggregates their transcriptomes in non-log linear space, and normalize the resulting mixture to logarithm CPM using Seurat’s NormalizeData.

result = CreatePseudobulks(counts = counts, groups = SEs, n_mixtures = 100)
## This step took ~4 min for the demo.
head(result$Mixtures[, 1:5]) ## Gene expression matrix of pseudobulks
##            Pseudobulk1  Pseudobulk2 Pseudobulk3  Pseudobulk4 Pseudobulk5
## AL627309.1 0.034510811 0.0248976190 0.017730486 0.0116095624 0.012006108
## AL627309.5 0.048433430 0.0423069886 0.037175210 0.0388799578 0.053369352
## AP006222.2 0.001979247 0.0017176501 0.002230716 0.0014918933 0.003404467
## AC114498.1 0.002768040 0.0009226095 0.000000000 0.0009244557 0.000000000
## AL669831.2 0.000000000 0.0000000000 0.000000000 0.0000000000 0.000000000
## LINC01409  0.049101136 0.0454449643 0.039554865 0.0454409663 0.040022663
head(result$Fracs) ## SE fractions in pseudobulks
##                  nonSE        SE1        SE2        SE3         SE4        SE5
## Pseudobulk1 0.13566259 0.03714595 0.08700476 0.11247413 0.006952763 0.16925317
## Pseudobulk2 0.11053902 0.09494654 0.07522259 0.13304700 0.035826405 0.09667114
## Pseudobulk3 0.09117524 0.10499265 0.09641997 0.08563370 0.131132847 0.14371435
## Pseudobulk4 0.14938164 0.05010742 0.06586831 0.11229859 0.086042379 0.12967002
## Pseudobulk5 0.16958606 0.04087059 0.07243148 0.12060727 0.035014712 0.12229811
## Pseudobulk6 0.15181360 0.05296392 0.14561045 0.05306015 0.117005773 0.06958625
##                    SE6        SE7        SE8        SE9
## Pseudobulk1 0.13480120 0.04463829 0.10635447 0.16571267
## Pseudobulk2 0.10409005 0.10669564 0.15664085 0.08632077
## Pseudobulk3 0.06665823 0.04219794 0.08302129 0.15505378
## Pseudobulk4 0.14790042 0.10413103 0.05321619 0.10138399
## Pseudobulk5 0.12102642 0.14975237 0.04712587 0.12128714
## Pseudobulk6 0.11155113 0.12587378 0.06088209 0.11165287

Training

The NMFGenerateW function will be used to train an NMF model for SE deconvolution based on the provided SE fractions and gene expression matrix. Before applying NMF, each gene’s expression is scaled to have a mean of 0 and unit variance by default. To meet the non-negativity requirement of NMF, the expression matrix is transformed using the posneg method. This transformation splits the expression matrix into two matrices: one containing only positive values and the other containing the absolute values of the negative values. These two matrices are then concatenated to form the final training data for the NMF model.

Note: By default, the NMFGenerateW selects the top 2000 variable features (nfeature argument) for training. After training, it retains 50 genes per SE (nfeature.per.se argument) in the model. Users can tune the parameters accordingly.

W = NMFGenerateW(result$Fracs, result$Mixtures)
## This step took less than 1 min for the demo.
head(W)
##                      nonSE          SE1          SE2          SE3          SE4
## IGHV3-49__pos 2.220446e-16 2.220446e-16 1.176118e-15 2.220446e-16 2.220446e-16
## IGHG2__pos    2.220446e-16 2.220446e-16 2.220446e-16 4.371307e-12 2.220446e-16
## IGKV2-28__pos 2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16
## IGHV1-2__neg  3.516628e-15 4.123817e-01 9.560283e-02 5.087919e-01 1.835669e+00
## IGHV3-49__neg 3.818675e-05 1.055389e+00 2.649221e-13 1.820891e-01 4.904564e-05
## IGHV5-51__pos 2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16
##                        SE5          SE6          SE7          SE8          SE9
## IGHV3-49__pos 2.220446e-16 1.390739e-13 4.909769e+00 1.867123e-15 2.220446e-16
## IGHG2__pos    2.220446e-16 7.146394e-05 4.572860e+00 2.220446e-16 2.220446e-16
## IGKV2-28__pos 2.220446e-16 2.220446e-16 4.542140e+00 1.800026e-12 2.220446e-16
## IGHV1-2__neg  2.811440e-01 3.454982e-03 1.250619e+00 1.488007e-01 1.197966e-06
## IGHV3-49__neg 7.453432e-07 4.681541e-02 2.220446e-16 1.442175e-01 3.107528e+00
## IGHV5-51__pos 2.220446e-16 2.220446e-16 4.513844e+00 2.220446e-16 2.220446e-16
## Save the model for late use.
saveRDS(W, "SE_Deconvolution_W.rds")

The resulting W matrix can be used to infer SE abundances from bulk gene expression profiles using the DeconvoluteSE function. The detailed documentation for SE deconvolution is available at Recovery of Spatial Ecotypes from Bulk Gene Expression Data.

Session info

The session info allows users to replicate the exact environment and identify potential discrepancies in package versions or configurations that might be causing problems.

## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin20
## Running under: macOS 15.6.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Los_Angeles
## tzcode source: internal
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] SpatialEcoTyper_1.0.1 RANN_2.6.2            Matrix_1.7-0         
##  [4] dplyr_1.1.4           data.table_1.16.0     NMF_0.28             
##  [7] Biobase_2.64.0        BiocGenerics_0.50.0   cluster_2.1.6        
## [10] rngtools_1.5.2        registry_0.5-1        Seurat_5.1.0         
## [13] SeuratObject_5.0.2    sp_2.1-4              googledrive_2.1.1    
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3     shape_1.4.6.1          rstudioapi_0.16.0     
##   [4] jsonlite_1.8.8         magrittr_2.0.3         spatstat.utils_3.1-0  
##   [7] rmarkdown_2.28         GlobalOptions_0.1.2    fs_1.6.4              
##  [10] ragg_1.3.2             vctrs_0.6.5            ROCR_1.0-11           
##  [13] spatstat.explore_3.3-2 htmltools_0.5.8.1      curl_5.2.2            
##  [16] sass_0.4.9             sctransform_0.4.1      parallelly_1.38.0     
##  [19] KernSmooth_2.23-24     bslib_0.8.0            htmlwidgets_1.6.4     
##  [22] desc_1.4.3             ica_1.0-3              plyr_1.8.9            
##  [25] plotly_4.10.4          zoo_1.8-12             cachem_1.1.0          
##  [28] igraph_2.0.3           iterators_1.0.14       mime_0.12             
##  [31] lifecycle_1.0.4        pkgconfig_2.0.3        R6_2.5.1              
##  [34] fastmap_1.2.0          clue_0.3-65            fitdistrplus_1.2-1    
##  [37] future_1.34.0          shiny_1.9.1            digest_0.6.37         
##  [40] colorspace_2.1-1       S4Vectors_0.42.1       patchwork_1.2.0       
##  [43] tensor_1.5             RSpectra_0.16-2        irlba_2.3.5.1         
##  [46] textshaping_0.4.0      progressr_0.14.0       fansi_1.0.6           
##  [49] spatstat.sparse_3.1-0  httr_1.4.7             polyclip_1.10-7       
##  [52] abind_1.4-5            compiler_4.4.1         gargle_1.5.2          
##  [55] withr_3.0.1            doParallel_1.0.17      fastDummies_1.7.4     
##  [58] R.utils_2.12.3         MASS_7.3-60.2          rjson_0.2.22          
##  [61] tools_4.4.1            lmtest_0.9-40          httpuv_1.6.15         
##  [64] future.apply_1.11.2    goftest_1.2-3          R.oo_1.26.0           
##  [67] glue_1.7.0             nlme_3.1-164           promises_1.3.0        
##  [70] grid_4.4.1             gridBase_0.4-7         Rtsne_0.17            
##  [73] reshape2_1.4.4         generics_0.1.3         gtable_0.3.5          
##  [76] spatstat.data_3.1-2    R.methodsS3_1.8.2      tidyr_1.3.1           
##  [79] utf8_1.2.4             spatstat.geom_3.3-2    RcppAnnoy_0.0.22      
##  [82] foreach_1.5.2          ggrepel_0.9.6          pillar_1.9.0          
##  [85] stringr_1.5.1          spam_2.10-0            RcppHNSW_0.6.0        
##  [88] later_1.3.2            circlize_0.4.16        splines_4.4.1         
##  [91] lattice_0.22-6         survival_3.6-4         deldir_2.0-4          
##  [94] tidyselect_1.2.1       ComplexHeatmap_2.20.0  miniUI_0.1.1.1        
##  [97] pbapply_1.7-2          knitr_1.48             gridExtra_2.3         
## [100] IRanges_2.38.1         scattermore_1.2        stats4_4.4.1          
## [103] xfun_0.52              matrixStats_1.4.1      stringi_1.8.4         
## [106] lazyeval_0.2.2         yaml_2.3.10            evaluate_0.24.0       
## [109] codetools_0.2-20       tibble_3.2.1           BiocManager_1.30.25   
## [112] cli_3.6.3              uwot_0.2.2             xtable_1.8-4          
## [115] reticulate_1.39.0      systemfonts_1.1.0      munsell_0.5.1         
## [118] jquerylib_0.1.4        Rcpp_1.0.13            globals_0.16.3        
## [121] spatstat.random_3.3-1  png_0.1-8              spatstat.univar_3.0-1 
## [124] pkgdown_2.1.0          ggplot2_3.5.1          dotCall64_1.1-1       
## [127] listenv_0.9.1          viridisLite_0.4.2      scales_1.3.0          
## [130] ggridges_0.5.6         crayon_1.5.3           leiden_0.4.3.1        
## [133] purrr_1.0.2            GetoptLong_1.0.5       rlang_1.1.4           
## [136] cowplot_1.1.3