An overview of scFeatures' functions
Source:vignettes/scFeatures_summary.Rmd
scFeatures_summary.Rmd
Introduction
This vignette provides an overview of scFeatures package. scFeatures generates features for single-cell RNA-seq, spatial proteomic and spatial transcriptomic data. The features span across six categories representing different molecular views of cellular characteristics. These include (i) cell type proportions, (ii) cell type specific gene expressions, (iii) cell type specific pathway expressions, (iv) cell type specific cell-cell interaction (CCI) scores, (v) overall aggregated gene expressions and (vi) spatial metrics. These features enable a more comprehensive multi-view representation of the expression data and can be used for downstream analysis such as association study.
library(scFeatures)
Quick start
scFeatures can be run using one line of code
scfeatures_result <- scFeatures(data)
, which generates a
list of data frames containing all feature types in the form of samples
x features.
data <- readRDS(
system.file("extdata", "example_scrnaseq.rds", package = "scFeatures")
)
data <- process_data(data, normalise = TRUE) # perform normalisation
scfeatures_result <- scFeatures(data)
Below we list the function for generating each individual feature types for scRNA-seq, spatial proteomic and spatial transcriptomic data.
scFeatures for single-cell RNA-seq data
scFeatures accept a Seurat object containing the sample
and celltype
columns. For SingleCellExperiment or
SpatialExperiment object, we provide a function
makeSeurat()
for the conversion.
#---------------------------------------------------------------------------
# read in data
#---------------------------------------------------------------------------
data <- readRDS(
system.file("extdata", "example_scrnaseq.rds", package = "scFeatures")
)
data <- process_data(data, normalise = TRUE) # perform normalisation
# format the patient ID as `a_cond_b` eg, `patient09_cond_nonresponder`.
# This is so that we can retrieve the sample ID and the corresponding condition
# when running downstream analysis on the generated features
data$sample <- paste0(data$sample, "_cond_", data$condition)
#---------------------------------------------------------------------------
# cell type proportions
#---------------------------------------------------------------------------
feature_proportion_raw <- run_proportion_raw(data)
feature_proportion_logit <- run_proportion_logit(data)
feature_proportion_ratio <- run_proportion_ratio(data)
#---------------------------------------------------------------------------
# cell type specific gene expressions
#---------------------------------------------------------------------------
# optional step, run if mitochondria amd ribosomal genes are not of interest
data_remove_mito <- remove_mito(data)
# by default, use the top variable genes
# users can change the number of top variable genes by changing `num_top_gene`
# alternatively, users can also input genes of interest in a dataframe format
genes_of_interest <- data.frame(
marker = c("S100A11", "GZMB", "DUSP1"),
celltype = c(
"CD8, T Effector Memory",
"CD8, T Effector Memory",
"Naive T Cells"
)
)
feature_gene_mean_celltype <- run_gene_mean_celltype(data_remove_mito)
feature_gene_prop_celltype <- run_gene_prop_celltype(
data_remove_mito, genes = genes_of_interest
)
feature_gene_cor_celltype <- run_gene_cor_celltype(
data_remove_mito, num_top_gene = 50
)
# by default, pick around 100 most variable genes per cell type, can change
# this number through the num_top_gene argument
#---------------------------------------------------------------------------
# cell type specific pathway expressions
#---------------------------------------------------------------------------
feature_pathway_gsva <- run_pathway_gsva(data, species = "Homo sapiens")
feature_pathway_mean <- run_pathway_mean(data, species = "Homo sapiens")
# by default, use the 50 hallmark pathways, users can also input their gene
# set of interest in a list format
geneset <- list(
"pathway_a" = c("CAPNS1", "TLCD1"),
"pathway_b" = c("PEX6", "DPRXP4")
)
feature_pathway_prop <- run_pathway_prop(
data, geneset = geneset,
species = "Homo sapiens"
)
#---------------------------------------------------------------------------
# cell type specific cell-cell interactions
#---------------------------------------------------------------------------
feature_CCI <- run_CCI(data)
#---------------------------------------------------------------------------
# Bulk expressions
#---------------------------------------------------------------------------
# by default, use the top variable genes
# users can change the number of top variable genes by changing `num_top_gene`
# alternatively, users can also input their genes of interest in a list format
feature_gene_mean_bulk <- run_gene_mean(data)
feature_gene_cor_bulk <- run_gene_cor(data, num_top_gene = 5)
genes_of_interest <- c("TIGIT", "PDCD1")
feature_gene_prop_bulk <- run_gene_prop(data, genes = genes_of_interest)
# by default, pick 1500 most variable genes, can change this number through the
# num_top_gene argument
scFeatures for spatial proteomics
The input data is a Seurat object containing celltype
,
sample
, x_coord
and y_coord
columns.
For SingleCellExperiment or SpatialExperiment object, we provide a
function makeSeurat()
for the conversion. use
Note that, as spatial proteomics contain few genes, the feature categories of pathway expressions and cell-cell interactions are not applicable.
#---------------------------------------------------------------------------
# read in data
#---------------------------------------------------------------------------
data <- readRDS(
system.file("extdata", "example_scrnaseq.rds", package = "scFeatures")
)
# randomly generate some x and y coordinate to make this a
# toy "spatial proteomics" data
x <- sample(1:100, ncol(data) , replace = T)
y <- sample(1:100, ncol(data) , replace = T)
data <- makeSeurat(data, spatialCoords = list(x,y))
data <- process_data(data, normalise = T)
#---------------------------------------------------------------------------
# cell type proportions
#---------------------------------------------------------------------------
feature_proportion_raw <- run_proportion_raw(data, type = "spatial_p")
feature_proportion_logit <- run_proportion_logit(data, type = "spatial_p")
feature_proportion_ratio <- run_proportion_ratio(data, type = "spatial_p")
#---------------------------------------------------------------------------
# cell type specific gene expressions
#---------------------------------------------------------------------------
feature_gene_mean_celltype <- run_gene_mean_celltype(data, type = "spatial_p")
feature_gene_prop_celltype <- run_gene_prop_celltype(data, type = "spatial_p")
feature_gene_cor_celltype <- run_gene_cor_celltype(data, type = "spatial_p")
#---------------------------------------------------------------------------
# Bulk expressions
#---------------------------------------------------------------------------
feature_gene_mean_bulk <- run_gene_mean(data, type = "spatial_p")
feature_gene_prop_bulk <- run_gene_prop(data, type = "spatial_p")
feature_gene_cor_bulk <- run_gene_cor(data, type = "spatial_p")
#---------------------------------------------------------------------------
# Spatial metrics
#---------------------------------------------------------------------------
feature_L_stats <- run_L_function(data, type = "spatial_p")
feature_morans_I <- run_Morans_I(data, type = "spatial_p")
feature_celltype_interact <- run_celltype_interaction(data, type = "spatial_p")
feature_nn_correlation <- run_nn_correlation(data, type = "spatial_p")
scFeature for spatial transcriptomics
The input data is a Seurat object containing sample
,
x_cord
and y_cord
columns.
Additionally, apart from the RNA
assay, which contains the
gene expression of each spot, a predictions
assay is also
needed.
The predictions
assay is a matrix in the form of cell
types x spot, which stores the cell type probability of each spot. This
can be obtained from performing cell type prediction using reference
data.
#---------------------------------------------------------------------------
# read in data
#---------------------------------------------------------------------------
data <- readRDS(
system.file("extdata", "example_scrnaseq.rds", package = "scFeatures")
)
# generate a toy "spatial transcriptomics" data
data$celltype <- NULL #spatial transcriptomics don't have celltype
# randomly generate some x and y coordinates
x <- sample(1:100, ncol(data) , replace = T)
y <- sample(1:100, ncol(data) , replace = T)
# for spatial transcriptomics, we need to estimate the number of cells per spot
data <- get_num_cell_per_spot(data)
# also need a dataframe of celltype probability at each spot
# here we randomly create one
nrow <- 5 #pretend there are 5 cell types
ncol <- ncol(data)
# Create a matrix of random numbers
matrix <- matrix(runif(nrow * ncol), nrow, ncol)
# Normalize the columns of the matrix so that each column sum to 1
prediction.scores <- sweep(matrix, 2, colSums(matrix), "/")
colnames(prediction.scores) <- colnames(data)
rownames(prediction.scores) <- c(paste0("celltype", 1:5))
# format the data using makeSeurat
data <- makeSeurat(data, spatialCoords = list(x,y), spotProbability = prediction.scores)
data <- process_data(data, normalise = T)
# format the patient ID as `a_cond_b` eg, `patient09_cond_nonresponder`.
# This is so that we can retrieve the sample ID and the corresponding
# condition when running downstream analysis on the generated features
data$sample <- paste0(data$sample, "_cond_", data$condition)
#---------------------------------------------------------------------------
# cell type proportions
#---------------------------------------------------------------------------
feature_proportion_raw <- run_proportion_raw(data, type = "spatial_t")
feature_proportion_logit <- run_proportion_logit(data, type = "spatial_t")
feature_proportion_ratio <- run_proportion_ratio(data, type = "spatial_t")
#---------------------------------------------------------------------------
# cell type specific gene expressions
#---------------------------------------------------------------------------
data_remove_mito <- remove_mito(data)
feature_gene_mean_celltype <- run_gene_mean_celltype(
data_remove_mito, type = "spatial_t", genes = rownames(data_remove_mito)[1:10] #to speed up example
)
#---------------------------------------------------------------------------
# cell type specific pathway expressions
#---------------------------------------------------------------------------
feature_pathway_mean <- run_pathway_mean(data, geneset = NULL, type = "spatial_t")
#---------------------------------------------------------------------------
# Bulk expressions
#---------------------------------------------------------------------------
feature_gene_mean_bulk <- run_gene_mean(data, type = "spatial_t")
feature_gene_prop_bulk <- run_gene_prop(data, type = "spatial_t")
feature_gene_cor_bulk <- run_gene_cor(data, type = "spatial_t")
#---------------------------------------------------------------------------
# Spatial metrics
#---------------------------------------------------------------------------
feature_L_stats <- run_L_function(data, type = "spatial_t")
feature_morans_I <- run_Morans_I(data, type = "spatial_t")
feature_celltype_interact <- run_celltype_interaction(data, type = "spatial_t")
feature_nn_correlation <- run_nn_correlation(data, type = "spatial_t")
sessionInfo()
sessionInfo()
#> R version 4.2.1 (2022-06-23)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Debian GNU/Linux 11 (bullseye)
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.13.so
#>
#> locale:
#> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] scFeatures_0.99.9 BiocStyle_2.24.0
#>
#> loaded via a namespace (and not attached):
#> [1] rappdirs_0.3.3 rtracklayer_1.56.1
#> [3] scattermore_0.8 SpatialExperiment_1.6.1
#> [5] R.methodsS3_1.8.2 SeuratObject_4.1.2
#> [7] ragg_1.2.3 tidyr_1.2.1
#> [9] ggplot2_3.3.6 bit64_4.0.5
#> [11] knitr_1.40 R.utils_2.12.0
#> [13] irlba_2.3.5.1 DelayedArray_0.22.0
#> [15] data.table_1.14.2 rpart_4.1.16
#> [17] KEGGREST_1.36.3 RCurl_1.98-1.9
#> [19] AnnotationFilter_1.20.0 generics_0.1.3
#> [21] BiocGenerics_0.42.0 GenomicFeatures_1.48.4
#> [23] ScaledMatrix_1.4.1 cowplot_1.1.1
#> [25] RSQLite_2.2.18 EnsDb.Mmusculus.v79_2.99.0
#> [27] RANN_2.6.1 future_1.28.0
#> [29] bit_4.0.4 spatstat.data_3.0-0
#> [31] xml2_1.3.3 httpuv_1.6.6
#> [33] SummarizedExperiment_1.26.1 assertthat_0.2.1
#> [35] xfun_0.33 hms_1.1.2
#> [37] jquerylib_0.1.4 babelgene_22.9
#> [39] evaluate_0.17 promises_1.2.0.1
#> [41] fansi_1.0.3 restfulr_0.0.15
#> [43] progress_1.2.2 caTools_1.18.2
#> [45] dbplyr_2.2.1 igraph_1.3.5
#> [47] DBI_1.1.3 htmlwidgets_1.5.4
#> [49] spatstat.geom_3.0-3 stats4_4.2.1
#> [51] purrr_0.3.5 ellipsis_0.3.2
#> [53] RSpectra_0.16-1 dplyr_1.0.10
#> [55] bookdown_0.30 annotate_1.74.0
#> [57] RcppParallel_5.1.5 biomaRt_2.52.0
#> [59] deldir_1.0-6 sparseMatrixStats_1.8.0
#> [61] MatrixGenerics_1.8.1 vctrs_0.4.2
#> [63] SingleCellExperiment_1.18.1 Biobase_2.56.0
#> [65] ensembldb_2.20.2 ROCR_1.0-11
#> [67] abind_1.4-5 withr_2.5.0
#> [69] cachem_1.0.6 SIMLR_1.22.0
#> [71] progressr_0.11.0 sctransform_0.3.5
#> [73] GenomicAlignments_1.32.1 prettyunits_1.1.1
#> [75] scran_1.24.1 goftest_1.2-3
#> [77] cluster_2.1.4 ape_5.6-2
#> [79] lazyeval_0.2.2 crayon_1.5.2
#> [81] spatstat.explore_3.0-5 edgeR_3.38.4
#> [83] pkgconfig_2.0.3 GenomeInfoDb_1.32.4
#> [85] nlme_3.1-160 ProtGenerics_1.28.0
#> [87] rlang_1.0.6 globals_0.16.1
#> [89] lifecycle_1.0.3 miniUI_0.1.1.1
#> [91] filelock_1.0.2 BiocFileCache_2.4.0
#> [93] rsvd_1.0.5 rprojroot_2.0.3
#> [95] polyclip_1.10-0 GSVA_1.44.5
#> [97] matrixStats_0.62.0 lmtest_0.9-40
#> [99] graph_1.74.0 Matrix_1.5-3
#> [101] Rhdf5lib_1.18.2 zoo_1.8-11
#> [103] ggridges_0.5.4 GlobalOptions_0.1.2
#> [105] pheatmap_1.0.12 png_0.1-7
#> [107] viridisLite_0.4.1 rjson_0.2.21
#> [109] bitops_1.0-7 R.oo_1.25.0
#> [111] rhdf5filters_1.8.0 KernSmooth_2.23-20
#> [113] SingleCellSignalR_1.8.0 Biostrings_2.64.1
#> [115] blob_1.2.3 DelayedMatrixStats_1.18.1
#> [117] shape_1.4.6 stringr_1.4.1
#> [119] parallelly_1.32.1 spatstat.random_3.0-1
#> [121] S4Vectors_0.34.0 beachmat_2.12.0
#> [123] scales_1.2.1 GSEABase_1.58.0
#> [125] memoise_2.0.1 magrittr_2.0.3
#> [127] plyr_1.8.7 ica_1.0-3
#> [129] gplots_3.1.3 zlibbioc_1.42.0
#> [131] compiler_4.2.1 dqrng_0.3.0
#> [133] BiocIO_1.6.0 RColorBrewer_1.1-3
#> [135] fitdistrplus_1.1-8 Rsamtools_2.12.0
#> [137] cli_3.4.1 XVector_0.36.0
#> [139] listenv_0.8.0 EnsDb.Hsapiens.v79_2.99.0
#> [141] patchwork_1.1.2 pbapply_1.5-0
#> [143] MASS_7.3-58.1 mgcv_1.8-40
#> [145] tidyselect_1.2.0 stringi_1.7.8
#> [147] textshaping_0.3.6 yaml_2.3.5
#> [149] BiocSingular_1.12.0 locfit_1.5-9.6
#> [151] ggrepel_0.9.1 grid_4.2.1
#> [153] sass_0.4.2 tools_4.2.1
#> [155] future.apply_1.9.1 parallel_4.2.1
#> [157] circlize_0.4.15 rstudioapi_0.14
#> [159] foreach_1.5.2 bluster_1.6.0
#> [161] AUCell_1.18.1 metapod_1.4.0
#> [163] gridExtra_2.3 Rtsne_0.16
#> [165] DropletUtils_1.16.0 proxyC_0.3.3
#> [167] digest_0.6.29 BiocManager_1.30.18
#> [169] rgeos_0.5-9 shiny_1.7.2
#> [171] pracma_2.4.2 Rcpp_1.0.9
#> [173] GenomicRanges_1.48.0 scuttle_1.6.3
#> [175] later_1.3.0 RcppAnnoy_0.0.19
#> [177] httr_1.4.4 AnnotationDbi_1.58.0
#> [179] colorspace_2.0-3 XML_3.99-0.11
#> [181] fs_1.5.2 tensor_1.5
#> [183] reticulate_1.26 IRanges_2.30.1
#> [185] splines_4.2.1 uwot_0.1.14
#> [187] statmod_1.4.37 spatstat.utils_3.0-1
#> [189] pkgdown_2.0.6 sp_1.5-0
#> [191] multtest_2.52.0 plotly_4.10.0
#> [193] systemfonts_1.0.4 xtable_1.8-4
#> [195] jsonlite_1.8.2 R6_2.5.1
#> [197] pillar_1.8.1 htmltools_0.5.3
#> [199] mime_0.12 glue_1.6.2
#> [201] fastmap_1.1.0 BiocParallel_1.30.4
#> [203] BiocNeighbors_1.14.0 codetools_0.2-18
#> [205] utf8_1.2.2 lattice_0.20-45
#> [207] bslib_0.4.0 spatstat.sparse_3.0-0
#> [209] tibble_3.1.8 curl_4.3.3
#> [211] leiden_0.4.3 gtools_3.9.3
#> [213] magick_2.7.3 survival_3.4-0
#> [215] limma_3.52.4 rmarkdown_2.17
#> [217] desc_1.4.2 munsell_0.5.0
#> [219] rhdf5_2.40.0 GenomeInfoDbData_1.2.8
#> [221] iterators_1.0.14 HDF5Array_1.24.2
#> [223] msigdbr_7.5.1 reshape2_1.4.4
#> [225] gtable_0.3.1 spatstat.core_2.4-4
#> [227] Seurat_4.2.0