Development of NMF Models for Spatial Ecotype Recovery
Source:vignettes/TrainRecoveryModels.Rmd
TrainRecoveryModels.Rmd
Development of SE recovery model
After performing Spatial EcoTyper analysis, users can develop a machine learning model to recover spatial ecotypes (SEs) from various types of gene expression data, including spatial transcriptomics and single-cell RNA-seq. In this tutorial, we will use the Vizgen MERSCOPE data of two samples to demonstrate the training process.
The gene expression matrices, cell type and SE annotations required for the training can be obtained from here.
First load required packages for this vignette
suppressPackageStartupMessages(library(googledrive))
suppressPackageStartupMessages(library(Seurat))
suppressPackageStartupMessages(library(NMF))
suppressPackageStartupMessages(library(data.table))
suppressPackageStartupMessages(library(dplyr))
library(SpatialEcoTyper)
Loading data
Download the data from Google Drive
drive_deauth() # Disable Google sign-in requirement
drive_download(as_id("17mbc56M0GTq-9MR0GqHas-L-yIXdnxtu"), "HumanMelanomaPatient1_subset_counts.tsv.gz", overwrite = TRUE)
drive_download(as_id("17nL0ceONPw8ByAtDe34DiXiGLMVBLlPX"), "HumanColonCancerPatient2_subset_counts.tsv.gz", overwrite = TRUE)
drive_download(as_id("18HsreR4QH6hh5L-YO7toP9lg62dVv8iX"), "MultiSE_metadata_final.tsv", overwrite = TRUE)
Load gene expression matrix
## Read the expression matrix for the first sample
scdata <- fread("HumanMelanomaPatient1_subset_counts.tsv.gz",
sep = "\t",header = TRUE, data.table = FALSE)
rownames(scdata) <- scdata[, 1] ## set the first column as row names
scdata <- scdata[, -1] ## drop the first column
scdata <- NormalizeData(scdata) ## Normalize the expression matrix
## Read the expression matrix for the second sample
scdata2 <- fread("HumanColonCancerPatient2_subset_counts.tsv.gz",
sep = "\t",header = TRUE, data.table = FALSE)
rownames(scdata2) <- scdata2[, 1] ## set the first column as row names
scdata2 <- scdata2[, -1] ## drop the first column
scdata2 <- NormalizeData(scdata2) ## Normalize the expression matrix
## Combine the two expression matrices
genes <- intersect(rownames(scdata), rownames(scdata2))
scdata <- cbind(scdata[genes, ], scdata2[genes, ])
head(scdata[, 1:5])
## HumanMelanomaPatient1__cell_3655 HumanMelanomaPatient1__cell_3657
## PDK4 0.000000 4.221184
## TNFRSF17 0.000000 0.000000
## ICAM3 0.000000 0.000000
## FAP 3.552438 0.000000
## GZMB 0.000000 0.000000
## TSC2 0.000000 0.000000
## HumanMelanomaPatient1__cell_3658 HumanMelanomaPatient1__cell_3660
## PDK4 3.208945 0
## TNFRSF17 0.000000 0
## ICAM3 0.000000 0
## FAP 0.000000 0
## GZMB 0.000000 0
## TSC2 0.000000 0
## HumanMelanomaPatient1__cell_3661
## PDK4 0
## TNFRSF17 0
## ICAM3 0
## FAP 0
## GZMB 0
## TSC2 0
Load cell type and SE annotations
## Load the single-cell meta data derived from MultiSpatialEcoTyper analysis
scmeta <- read.table("MultiSE_metadata_final.tsv", sep = "\t",
header = TRUE, row.names = 1)
scmeta$SE[is.na(scmeta$SE)] = "nonSE" ## cells with NA predictions were assigned into the NonSE group
## Matches the row names of meta data and column names of expression matrix
cells <- intersect(colnames(scdata), rownames(scmeta))
scdata <- scdata[, cells]
scmeta <- scmeta[cells, ]
head(scmeta[, c("Sample", "CellType", "SE")])
## Sample CellType SE
## HumanMelanomaPatient1__cell_3655 SKCM Fibroblast SE3
## HumanMelanomaPatient1__cell_3657 SKCM Fibroblast SE4
## HumanMelanomaPatient1__cell_3658 SKCM Fibroblast SE4
## HumanMelanomaPatient1__cell_3660 SKCM Fibroblast SE4
## HumanMelanomaPatient1__cell_3661 SKCM Fibroblast SE3
## HumanMelanomaPatient1__cell_3663 SKCM Fibroblast SE3
Training
The NMFGenerateWList
function will be used to train cell-type specific NMF models for
recovering SEs. Prior to NMF, each gene is scaled to mean 0 and unit
variance within each cell type-specific expression matrix. To satisfy
the non-negativity requirement of NMF, the cell type-specific expression
matrices are individually processed using posneg transformation, which
converts each expression matrix into two matrices, one containing only
positive values and the other containing only negative values with the
sign inverted. The two matrices are subsequently concatenated to produce
the training data. The NMFGenerateWList function
will return a list of feature × SE
matrices, each
representing a cell type-specific gene signature matrix with rows as
features and columns as SEs.
Note: By default, the NMFGenerateWList selects
the top 2000 variable features (nfeature
argument) and
down-samples 2500 cells (downsample
argument) within each
cell type for training. After training, it retains 50 genes per SE
(nfeature.per.se
argument) in the model. Users can tune the
parameters accordingly.
# training
Ws = NMFGenerateWList(scdata, scmeta,
CellType = "CellType",
SE = "SE", Sample = "Sample")
## This step would take ~3 mins for the demo.
class(Ws)
## [1] "list"
names(Ws)
## [1] "B" "CD4T" "CD8T" "DC" "Endothelial"
## [6] "Fibroblast" "Macrophage" "NK" "Plasma"
head(Ws[[2]])
## SE1 SE5 SE6 SE10 SE8 SE3
## FOXP3__pos 0.4719930 0.4885273 0.3651733 0.6901381 0.8495508 0.5111029
## CCL5__neg 0.4636977 0.5029325 0.4671653 0.5279928 0.6660964 0.5879144
## NFKB2__neg 0.5959560 0.5353461 0.3610576 0.4095969 0.5328200 0.6356051
## NFKBIA__neg 0.5449949 0.5124919 0.4053110 0.3867929 0.4872992 0.5444012
## XBP1__neg 0.3641057 0.4390831 0.4536565 0.5994809 0.6964391 0.4286859
## IRF3__neg 0.4771004 0.4980158 0.4737179 0.5178389 0.4861118 0.4835203
## SE4 SE9 SE2 SE7
## FOXP3__pos 0.7223714 0.4173739 0.5810581 0.5287318
## CCL5__neg 0.7176785 0.3887930 0.5669015 0.7065557
## NFKB2__neg 0.7882044 0.4825905 1.0177761 0.1657350
## NFKBIA__neg 0.5245634 0.4819461 0.9273532 0.7001782
## XBP1__neg 0.4714610 0.5722981 0.6128531 0.8757327
## IRF3__neg 0.4070195 0.5213065 0.6902136 0.9023205
The resulting W matrices could be used to recover SEs from spatial transcriptomics and single-cell RNA-seq using the RecoverSE function. The detailed documentation for SE recovery is available at Recovery of Spatial Ecotypes from Single-Cell Gene Expression Data and Recovery of Spatial Ecotypes from Spatial Transcriptomics Data.
Development of SE deconvolution model
Users can also develop an NMF model to deconvolve SEs from bulk gene expression data. The training data can be pseudo-bulk mixtures by aggregating single-cell transcriptomics. In this tutorial, we will create pseudo-bulk mixtures from single-cell RNA-seq data of a melanoma sample.
The raw count for this sample is available at WU2161_counts.tsv and SE recovery results are available at WU2161_RecoveredSEs.tsv.
Loading data
# Download single-cell gene expression matrix and SE recovery results.
# The downloads should be finished within 1min.
drive_download(as_id("17VAeOnz6vTt2s0ZeTrK1kITdJ3Yus4ei"), "WU2161_counts.tsv", overwrite = TRUE)
drive_download(as_id("17kFjOHhmWxWAKm-LDqy6wV27o5lvz8L2"), "WU2161_RecoveredSEs.tsv", overwrite = TRUE)
# Load single-cell gene expression matrix.
counts = fread("WU2161_counts.tsv", sep = "\t", data.table = FALSE)
rownames(counts) = counts[, 1] ## Set the first column as row names
counts = counts[, -1] ## Drop the first column
head(counts[, 1:5])
## AAACCTGCACATGACT-1 AAACCTGGTGGTCCGT-1 AAACCTGGTTTGTGTG-1
## AL627309.1 0 0 0
## AL627309.5 0 0 0
## AP006222.2 0 0 0
## AC114498.1 0 0 0
## AL669831.2 0 0 0
## LINC01409 0 0 0
## AAACCTGTCCGATATG-1 AAACCTGTCTAACGGT-1
## AL627309.1 0 0
## AL627309.5 1 0
## AP006222.2 0 0
## AC114498.1 0 0
## AL669831.2 0 0
## LINC01409 0 0
# Load SE recovery results
SEs = read.table("WU2161_RecoveredSEs.tsv", sep = "\t", row.names = 1, header = TRUE)
SEs = SEs[match(colnames(counts), rownames(SEs)), 1] # extract SE predictions
length(SEs)
## [1] 1337
table(SEs)
## SEs
## nonSE SE1 SE2 SE3 SE4 SE5 SE6 SE7 SE8 SE9
## 912 71 45 24 5 43 74 75 71 17
Note: For demonstration purposes, we used 1,337 cells grouped into SEs to create pseudo-bulk mixtures. While this limited cell number offers a basic example, it does not fully capture the diverse characteristics of SEs, potentially affecting the robustness of model training and subsequent SE recovery. In practice, we recommend using a more comprehensive dataset that accurately reflects the properties of SEs, ensuring that the training process results in a reliable recovery model.
Creating pseudobulk data
The CreatePseudobulks
function will be used to create pseudo-bulk mixtures from single-cell
(spatial) transcriptomics data with all cells grouped into SEs. It
samples cell fractions from a Gaussian distribution, sets negative
fractions to 0 and re-normalizes fractions to sum to 1 across all SEs.
Based on the resulting fractions, it samples 1,000 cells per pseudo-bulk
mixture with replacement, aggregates their transcriptomes in non-log
linear space, and normalize the resulting mixture to logarithm CPM using
Seurat’s NormalizeData
.
result = CreatePseudobulks(counts = counts, groups = SEs, n_mixtures = 100)
## This step took ~4 min for the demo.
head(result$Mixtures[, 1:5]) ## Gene expression matrix of pseudobulks
## Pseudobulk1 Pseudobulk2 Pseudobulk3 Pseudobulk4 Pseudobulk5
## AL627309.1 0.020129003 0.0166225359 0.023307824 0.0098541618 0.0088996542
## AL627309.5 0.042782118 0.0507958729 0.065149253 0.0307522233 0.0470043992
## AP006222.2 0.001258585 0.0047970105 0.002820776 0.0005117182 0.0027994077
## AC114498.1 0.001848058 0.0009235317 0.002770807 0.0018443685 0.0009235317
## AL669831.2 0.000000000 0.0000000000 0.000000000 0.0000000000 0.0000000000
## LINC01409 0.028300784 0.0323725374 0.042793627 0.0507739102 0.0274993034
head(result$Fracs) ## SE fractions in pseudobulks
## nonSE SE1 SE2 SE3 SE4 SE5
## Pseudobulk1 0.16490065 0.04918395 0.04180656 0.13380601 0.108023707 0.10246218
## Pseudobulk2 0.06857820 0.14872621 0.11331726 0.11852688 0.051279421 0.09434107
## Pseudobulk3 0.08345357 0.17803366 0.19856639 0.00000000 0.002316577 0.04424433
## Pseudobulk4 0.09014638 0.07754488 0.06834655 0.06010091 0.080744805 0.11287459
## Pseudobulk5 0.12412216 0.00000000 0.05709886 0.11705402 0.102333027 0.21175799
## Pseudobulk6 0.04653528 0.07215946 0.05076475 0.13719262 0.114854678 0.17734928
## SE6 SE7 SE8 SE9
## Pseudobulk1 0.15013277 0.11912636 0.09232816 0.03822966
## Pseudobulk2 0.10492044 0.08010585 0.09546042 0.12474424
## Pseudobulk3 0.15682092 0.15585820 0.03024719 0.15045916
## Pseudobulk4 0.13277544 0.06254595 0.17260478 0.14231574
## Pseudobulk5 0.06920083 0.04372533 0.22512173 0.04958606
## Pseudobulk6 0.06965773 0.16811845 0.02792748 0.13544026
Training
The NMFGenerateW function will be used to train an NMF model for SE deconvolution based on the provided SE fractions and gene expression matrix. Before applying NMF, each gene’s expression is scaled to have a mean of 0 and unit variance by default. To meet the non-negativity requirement of NMF, the expression matrix is transformed using the posneg method. This transformation splits the expression matrix into two matrices: one containing only positive values and the other containing the absolute values of the negative values. These two matrices are then concatenated to form the final training data for the NMF model.
Note: By default, the NMFGenerateW selects the top
2000 variable features (nfeature
argument) for training.
After training, it retains 50 genes per SE (nfeature.per.se
argument) in the model. Users can tune the parameters accordingly.
W = NMFGenerateW(result$Fracs, result$Mixtures)
## This step took less than 1 min for the demo.
head(W)
## nonSE SE1 SE2 SE3 SE4
## IGLV1-40__pos 6.435004e-02 2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16
## IGHV4-34__pos 2.220446e-16 2.220446e-16 1.352908e-07 1.027098e-12 2.220446e-16
## IGLV2-14__pos 4.468224e+00 2.220446e-16 1.323646e-15 2.220446e-16 1.962789e-01
## IGKV2D-30__pos 1.061495e-15 2.220446e-16 6.075700e-01 1.292995e-15 1.082039e-15
## IGKV1-33__pos 2.220446e-16 2.220446e-16 1.326303e-08 2.206336e-12 2.220446e-16
## NLRP3__pos 2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16
## SE5 SE6 SE7 SE8 SE9
## IGLV1-40__pos 1.003067e-06 2.220446e-16 4.618030e+00 9.300405e-10 2.220446e-16
## IGHV4-34__pos 1.264496e-11 2.220446e-16 4.671389e+00 2.220446e-16 2.220446e-16
## IGLV2-14__pos 2.220446e-16 2.220446e-16 2.220446e-16 5.860832e-09 2.220446e-16
## IGKV2D-30__pos 2.220446e-16 2.220446e-16 4.054420e+00 2.220446e-16 4.717884e-13
## IGKV1-33__pos 2.841738e-12 2.220446e-16 4.649631e+00 2.220446e-16 3.816478e-16
## NLRP3__pos 2.220446e-16 4.617904e+00 2.220446e-16 2.220446e-16 2.220446e-16
## Save the model for late use.
saveRDS(W, "SE_Deconvolution_W.rds")
The resulting W matrix can be used to infer SE abundances from bulk gene expression profiles using the DeconvoluteSE function. The detailed documentation for SE deconvolution is available at Recovery of Spatial Ecotypes from Bulk Gene Expression Data.
Session info
The session info allows users to replicate the exact environment and identify potential discrepancies in package versions or configurations that might be causing problems.
## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin20
## Running under: macOS 15.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Los_Angeles
## tzcode source: internal
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] SpatialEcoTyper_1.0.0 RANN_2.6.2 Matrix_1.7-0
## [4] dplyr_1.1.4 data.table_1.16.0 NMF_0.28
## [7] Biobase_2.64.0 BiocGenerics_0.50.0 cluster_2.1.6
## [10] rngtools_1.5.2 registry_0.5-1 Seurat_5.1.0
## [13] SeuratObject_5.0.2 sp_2.1-4 googledrive_2.1.1
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 shape_1.4.6.1 rstudioapi_0.16.0
## [4] jsonlite_1.8.8 magrittr_2.0.3 spatstat.utils_3.1-0
## [7] rmarkdown_2.28 GlobalOptions_0.1.2 fs_1.6.4
## [10] ragg_1.3.2 vctrs_0.6.5 ROCR_1.0-11
## [13] spatstat.explore_3.3-2 htmltools_0.5.8.1 curl_5.2.2
## [16] sass_0.4.9 sctransform_0.4.1 parallelly_1.38.0
## [19] KernSmooth_2.23-24 bslib_0.8.0 htmlwidgets_1.6.4
## [22] desc_1.4.3 ica_1.0-3 plyr_1.8.9
## [25] plotly_4.10.4 zoo_1.8-12 cachem_1.1.0
## [28] igraph_2.0.3 iterators_1.0.14 mime_0.12
## [31] lifecycle_1.0.4 pkgconfig_2.0.3 R6_2.5.1
## [34] fastmap_1.2.0 clue_0.3-65 fitdistrplus_1.2-1
## [37] future_1.34.0 shiny_1.9.1 digest_0.6.37
## [40] colorspace_2.1-1 S4Vectors_0.42.1 patchwork_1.2.0
## [43] tensor_1.5 RSpectra_0.16-2 irlba_2.3.5.1
## [46] textshaping_0.4.0 progressr_0.14.0 fansi_1.0.6
## [49] spatstat.sparse_3.1-0 httr_1.4.7 polyclip_1.10-7
## [52] abind_1.4-5 compiler_4.4.1 gargle_1.5.2
## [55] withr_3.0.1 doParallel_1.0.17 fastDummies_1.7.4
## [58] R.utils_2.12.3 MASS_7.3-60.2 rjson_0.2.22
## [61] tools_4.4.1 lmtest_0.9-40 httpuv_1.6.15
## [64] future.apply_1.11.2 goftest_1.2-3 R.oo_1.26.0
## [67] glue_1.7.0 nlme_3.1-164 promises_1.3.0
## [70] grid_4.4.1 gridBase_0.4-7 Rtsne_0.17
## [73] reshape2_1.4.4 generics_0.1.3 gtable_0.3.5
## [76] spatstat.data_3.1-2 R.methodsS3_1.8.2 tidyr_1.3.1
## [79] utf8_1.2.4 spatstat.geom_3.3-2 RcppAnnoy_0.0.22
## [82] foreach_1.5.2 ggrepel_0.9.6 pillar_1.9.0
## [85] stringr_1.5.1 spam_2.10-0 RcppHNSW_0.6.0
## [88] later_1.3.2 circlize_0.4.16 splines_4.4.1
## [91] lattice_0.22-6 survival_3.6-4 deldir_2.0-4
## [94] tidyselect_1.2.1 ComplexHeatmap_2.20.0 miniUI_0.1.1.1
## [97] pbapply_1.7-2 knitr_1.48 gridExtra_2.3
## [100] IRanges_2.38.1 scattermore_1.2 stats4_4.4.1
## [103] xfun_0.47 matrixStats_1.4.1 stringi_1.8.4
## [106] lazyeval_0.2.2 yaml_2.3.10 evaluate_0.24.0
## [109] codetools_0.2-20 tibble_3.2.1 BiocManager_1.30.25
## [112] cli_3.6.3 uwot_0.2.2 xtable_1.8-4
## [115] reticulate_1.39.0 systemfonts_1.1.0 munsell_0.5.1
## [118] jquerylib_0.1.4 Rcpp_1.0.13 globals_0.16.3
## [121] spatstat.random_3.3-1 png_0.1-8 spatstat.univar_3.0-1
## [124] pkgdown_2.1.0 ggplot2_3.5.1 dotCall64_1.1-1
## [127] listenv_0.9.1 viridisLite_0.4.2 scales_1.3.0
## [130] ggridges_0.5.6 crayon_1.5.3 leiden_0.4.3.1
## [133] purrr_1.0.2 GetoptLong_1.0.5 rlang_1.1.4
## [136] cowplot_1.1.3