Recovery of Spatial Ecotypes from Single-Cell Gene Expression Data
Source:vignettes/Recovery_scRNA.Rmd
Recovery_scRNA.Rmd
Overview
In this tutorial, we will illustrate how to recover spatial ecotypes (SEs) from single-cell RNA-seq data.
First load required packages for this vignette
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(parallel))
suppressPackageStartupMessages(library(Seurat))
suppressPackageStartupMessages(library(data.table))
suppressPackageStartupMessages(library(googledrive))
library(SpatialEcoTyper)
Data preparation
The recovery process requires a normalized gene expression matrix and a vector of cell type annotations.
Starting from a Seurat object
A seurat object for the demo data can be accessed from WU2161_seurat_obj.rds.
drive_deauth() # Disable Google sign-in requirement
drive_download(as_id("11ORWQxxWCNUtceEDwu2IyBpDMa6P-bxS"), "WU2161_seurat_obj.rds", overwrite = TRUE)
First, load the seurat object into R:
# Load the seurat object
obj <- readRDS("WU2161_seurat_obj.rds")
obj
## An object of class Seurat
## 27425 features across 1337 samples within 1 assay
## Active assay: RNA (27425 features, 2000 variable features)
## 3 layers present: counts, data, scale.data
## 2 dimensional reductions calculated: pca, umap
# Cell type annotations
unique(obj$CellType)
## [1] "Macrophage" "B" "CD4T" "CD8T" "Other"
## [6] "Fibroblast" "Epithelial" "Plasma"
Next, normalize the gene expression data using SCTransform
or NormalizeData.
Here, we are normalizing using SCTransform
. We recommend to
install the glmGamPoi
package for faster computation.
if(!"glmGamPoi" %in% installed.packages()){
BiocManager::install("glmGamPoi")
}
obj <- SCTransform(obj, verbose = FALSE)
Then, extract the normalized gene expression matrix and cell type annotations from the Seurat object.
seurat_version = as.integer(gsub("\\..*", "", as.character(packageVersion("SeuratObject"))))
if(seurat_version<5){
normdata <- GetAssayData(obj, "data")
}else{
normdata <- obj[["SCT"]]$data
}
head(normdata[, 1:5])
## 6 x 5 sparse Matrix of class "dgCMatrix"
## AAACCTGCACATGACT-1 AAACCTGGTGGTCCGT-1 AAACCTGGTTTGTGTG-1
## AL627309.1 . . .
## AL627309.5 . . .
## AP006222.2 . . .
## LINC01409 . . .
## LINC01128 . . .
## LINC00115 . . .
## AAACCTGTCCGATATG-1 AAACCTGTCTAACGGT-1
## AL627309.1 . .
## AL627309.5 . .
## AP006222.2 . .
## LINC01409 . .
## LINC01128 . .
## LINC00115 . .
ctann = obj$CellType
head(ctann)
## AAACCTGCACATGACT-1 AAACCTGGTGGTCCGT-1 AAACCTGGTTTGTGTG-1 AAACCTGTCCGATATG-1
## "Macrophage" "B" "CD4T" "Macrophage"
## AAACCTGTCTAACGGT-1 AAACGGGCACTTAAGC-1
## "B" "CD8T"
table(ctann)
## ctann
## B CD4T CD8T Epithelial Fibroblast Macrophage Other
## 97 138 425 121 44 374 91
## Plasma
## 47
Starting from Sparse matrix
Sparse matrix in the .mtx format can be imported using the
ReadMtx
function from the Seurat
package. The
demo data can be accessed from WU2161.
The cell type annotations are available in WU2161_celltype_ann.tsv.
## Load the gene expression matrix
scdata = ReadMtx(mtx = "matrix.mtx", features = "features.tsv",
cells = "barcodes.tsv", feature.column = 1, cell.column = 1)
## Normalize the data
normdata = NormalizeData(scdata)
head(normdata[, 1:5])
## Load cell type annotation
ctann = read.table("WU2161_celltype_ann.tsv", sep = "\t",
header = TRUE, row.names = 1)
ctann = ctann[match(colnames(normdata), rownames(ctann)), 1]
table(ctann)
Starting from tab-delimited files
Tab-delimited files can be loaded into R using the fread
function from the data.table
package. The TSV file for the
expression matrix can be accessed from WU2161_counts.tsv
and the cell type annotations are available in WU2161_celltype_ann.tsv.
## Download data from google drive
drive_download(as_id("17VAeOnz6vTt2s0ZeTrK1kITdJ3Yus4ei"), "WU2161_counts.tsv", overwrite = TRUE)
drive_download(as_id("17Ax4LMOClMBu6h_WUcwXtFw4HuIU8_AQ"), "WU2161_celltype_ann.tsv", overwrite = TRUE)
## Load the gene expression matrix
scdata = fread("WU2161_counts.tsv", sep = "\t", data.table = FALSE)
rownames(scdata) = scdata[, 1] ## Set the first column as row names
scdata = scdata[, -1] ## Drop the first column
## Normalize the data
normdata = NormalizeData(scdata)
head(normdata[, 1:5])
## AAACCTGCACATGACT-1 AAACCTGGTGGTCCGT-1 AAACCTGGTTTGTGTG-1
## AL627309.1 0 0 0
## AL627309.5 0 0 0
## AP006222.2 0 0 0
## AC114498.1 0 0 0
## AL669831.2 0 0 0
## LINC01409 0 0 0
## AAACCTGTCCGATATG-1 AAACCTGTCTAACGGT-1
## AL627309.1 0.0000000 0
## AL627309.5 0.2951209 0
## AP006222.2 0.0000000 0
## AC114498.1 0.0000000 0
## AL669831.2 0.0000000 0
## LINC01409 0.0000000 0
## Load cell type annotation
ctann = read.table("WU2161_celltype_ann.tsv", sep = "\t",
header = TRUE, row.names = 1)
ctann = ctann[match(colnames(normdata), rownames(ctann)), 1]
table(ctann)
## ctann
## B CD4T CD8T Epithelial Fibroblast Macrophage Other
## 97 138 425 121 44 374 91
## Plasma
## 47
SE recovery
The RecoverSE function will be used to assign single cells into SEs. Users can either use default model to recover predefined SEs or use custom model to recover newly defined SEs.
Note: When using RecoverSE with single-cell
RNA-seq data, it is essential to specify the celltypes
parameter. If cell type annotations are not provided, the function will
assume that the input data corresponds to bulk spatial transcriptomics
(e.g., Visium), and will infer SE abundances from each spot.
Using default models
The default NMF models were trained on discovery MERSCOPE data, encompassing five cancer types: melanoma, and four carcinomas. These models are tailored to nine distinct cell types: B cells, CD4+ T cells, CD8+ T cells, NK cells, plasma cells, macrophages, dendritic cells, fibroblasts, and endothelial cells. Each model facilitates the recovery of SEs from single-cell datasets, allowing for cell-type-specific SE analysis.
For SE recovery, the cells in the query data should be grouped into one of “B”, “CD4T”, “CD8T”, “NK”, “Plasma”, “Macrophage”, “DC”, “Fibroblast”, and “Endothelial”, case sensitive. All the other cell types will be considered non-SE compartments.
## AAACCTGCACATGACT-1 AAACCTGGTGGTCCGT-1 AAACCTGGTTTGTGTG-1 AAACCTGTCCGATATG-1
## "SE8" "nonSE" "SE1" "SE8"
## AAACCTGTCTAACGGT-1 AAACGGGCACTTAAGC-1
## "nonSE" "nonSE"
Using custom models
To use custom models, users should first develop recovery models following the tutorial Development of SE Recovery Models. The resulting models can be used for SE recovery. An example model is available at SE_Recovery_W_list.rds.
Download the example models
drive_download(as_id("171WaAe49babYB85Cn1FcoNNE-lzYp1T_"), "SE_Recovery_W_list.rds", overwrite = TRUE)
Load the custom models
## [1] "CD4T" "CD8T" "DC" "Endothelial" "Fibroblast"
## [6] "Macrophage" "NK" "Plasma"
head(Ws[[1]]) ## feature by SE matrix
## SE01 SE02 SE03 SE04 SE05 SE06
## CD226__pos 0.4145936 0.2555579 0.3323735 0.3241147 0.2263783 0.3279133
## CDKN1B__pos 0.5729663 0.4552065 0.5120732 0.4627805 0.4453117 0.5156023
## CXCR4__pos 0.5865790 0.2771225 0.4093821 0.3607076 0.3589891 0.3937211
## DUSP1__pos 0.4233067 0.3462390 0.2269798 0.1616814 0.3265539 0.2562118
## KLF2__pos 0.5889618 0.2314586 0.3867504 0.2295206 0.2998234 0.5173246
## ICAM2__pos 0.4593230 0.2611751 0.4427930 0.3143084 0.3679498 0.4532637
## SE07 SE08 SE09 SE10 SE11
## CD226__pos 0.2189382 0.3417897 0.2563481 0.2360192 0.2815938
## CDKN1B__pos 0.4414750 0.4747166 0.4414508 0.3922732 0.3821411
## CXCR4__pos 0.3751669 0.4456410 0.3023204 0.2629156 0.2729461
## DUSP1__pos 0.3682236 0.2211759 0.1713413 0.1699067 0.1121763
## KLF2__pos 0.3570251 0.4327206 0.1732251 0.2086729 0.2118797
## ICAM2__pos 0.2660441 0.5494585 0.2878338 0.3711269 0.2575091
Using custom models for SE recovery by specifying the Ws
argument.
## AAACCTGCACATGACT-1 AAACCTGGTGGTCCGT-1 AAACCTGGTTTGTGTG-1 AAACCTGTCCGATATG-1
## "SE10" "nonSE" "SE01" "SE10"
## AAACCTGTCTAACGGT-1 AAACGGGCACTTAAGC-1
## "nonSE" "SE03"
Session info
The session info allows users to replicate the exact environment and identify potential discrepancies in package versions or configurations that might be causing problems.
## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin20
## Running under: macOS 15.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Los_Angeles
## tzcode source: internal
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] SpatialEcoTyper_0.0.5 NMF_0.28 Biobase_2.64.0
## [4] BiocGenerics_0.50.0 cluster_2.1.6 rngtools_1.5.2
## [7] registry_0.5-1 RANN_2.6.2 Matrix_1.7-0
## [10] googledrive_2.1.1 data.table_1.16.0 Seurat_5.1.0
## [13] SeuratObject_5.0.2 sp_2.1-4 ggplot2_3.5.1
## [16] dplyr_1.1.4
##
## loaded via a namespace (and not attached):
## [1] RcppAnnoy_0.0.22 splines_4.4.1
## [3] later_1.3.2 tibble_3.2.1
## [5] polyclip_1.10-7 fastDummies_1.7.4
## [7] lifecycle_1.0.4 doParallel_1.0.17
## [9] globals_0.16.3 lattice_0.22-6
## [11] MASS_7.3-60.2 magrittr_2.0.3
## [13] plotly_4.10.4 sass_0.4.9
## [15] rmarkdown_2.28 jquerylib_0.1.4
## [17] yaml_2.3.10 httpuv_1.6.15
## [19] glmGamPoi_1.16.0 sctransform_0.4.1
## [21] spam_2.10-0 spatstat.sparse_3.1-0
## [23] reticulate_1.39.0 cowplot_1.1.3
## [25] pbapply_1.7-2 RColorBrewer_1.1-3
## [27] zlibbioc_1.50.0 abind_1.4-5
## [29] GenomicRanges_1.56.1 Rtsne_0.17
## [31] purrr_1.0.2 circlize_0.4.16
## [33] GenomeInfoDbData_1.2.12 IRanges_2.38.1
## [35] S4Vectors_0.42.1 ggrepel_0.9.6
## [37] irlba_2.3.5.1 listenv_0.9.1
## [39] spatstat.utils_3.1-0 goftest_1.2-3
## [41] RSpectra_0.16-2 spatstat.random_3.3-1
## [43] fitdistrplus_1.2-1 parallelly_1.38.0
## [45] DelayedMatrixStats_1.26.0 pkgdown_2.1.0
## [47] DelayedArray_0.30.1 leiden_0.4.3.1
## [49] codetools_0.2-20 tidyselect_1.2.1
## [51] shape_1.4.6.1 UCSC.utils_1.0.0
## [53] matrixStats_1.4.1 stats4_4.4.1
## [55] spatstat.explore_3.3-2 jsonlite_1.8.8
## [57] GetoptLong_1.0.5 progressr_0.14.0
## [59] ggridges_0.5.6 survival_3.6-4
## [61] iterators_1.0.14 systemfonts_1.1.0
## [63] foreach_1.5.2 tools_4.4.1
## [65] ragg_1.3.2 ica_1.0-3
## [67] Rcpp_1.0.13 glue_1.7.0
## [69] SparseArray_1.4.8 gridExtra_2.3
## [71] xfun_0.47 MatrixGenerics_1.16.0
## [73] GenomeInfoDb_1.40.1 withr_3.0.1
## [75] BiocManager_1.30.25 fastmap_1.2.0
## [77] fansi_1.0.6 digest_0.6.37
## [79] R6_2.5.1 mime_0.12
## [81] textshaping_0.4.0 colorspace_2.1-1
## [83] scattermore_1.2 tensor_1.5
## [85] spatstat.data_3.1-2 utf8_1.2.4
## [87] tidyr_1.3.1 generics_0.1.3
## [89] S4Arrays_1.4.1 httr_1.4.7
## [91] htmlwidgets_1.6.4 uwot_0.2.2
## [93] pkgconfig_2.0.3 gtable_0.3.5
## [95] ComplexHeatmap_2.20.0 lmtest_0.9-40
## [97] XVector_0.44.0 htmltools_0.5.8.1
## [99] dotCall64_1.1-1 clue_0.3-65
## [101] scales_1.3.0 png_0.1-8
## [103] spatstat.univar_3.0-1 knitr_1.48
## [105] rstudioapi_0.16.0 reshape2_1.4.4
## [107] rjson_0.2.22 nlme_3.1-164
## [109] curl_5.2.2 cachem_1.1.0
## [111] zoo_1.8-12 GlobalOptions_0.1.2
## [113] stringr_1.5.1 KernSmooth_2.23-24
## [115] miniUI_0.1.1.1 desc_1.4.3
## [117] pillar_1.9.0 grid_4.4.1
## [119] vctrs_0.6.5 promises_1.3.0
## [121] xtable_1.8-4 evaluate_0.24.0
## [123] cli_3.6.3 compiler_4.4.1
## [125] rlang_1.1.4 crayon_1.5.3
## [127] future.apply_1.11.2 plyr_1.8.9
## [129] fs_1.6.4 stringi_1.8.4
## [131] viridisLite_0.4.2 deldir_2.0-4
## [133] gridBase_0.4-7 munsell_0.5.1
## [135] lazyeval_0.2.2 spatstat.geom_3.3-2
## [137] RcppHNSW_0.6.0 patchwork_1.2.0
## [139] sparseMatrixStats_1.16.0 future_1.34.0
## [141] shiny_1.9.1 SummarizedExperiment_1.34.0
## [143] ROCR_1.0-11 gargle_1.5.2
## [145] igraph_2.0.3 bslib_0.8.0