Skip to contents

Introduction

This vignette provides an overview of scFeatures package. scFeatures generates features for single-cell RNA-seq, spatial proteomic and spatial transcriptomic data. The features span across six categories representing different molecular views of cellular characteristics. These include (i) cell type proportions, (ii) cell type specific gene expressions, (iii) cell type specific pathway expressions, (iv) cell type specific cell-cell interaction (CCI) scores, (v) overall aggregated gene expressions and (vi) spatial metrics. These features enable a more comprehensive multi-view representation of the expression data and can be used for downstream analysis such as association study.

library(scFeatures)

Quick start

scFeatures can be run using one line of code scfeatures_result <- scFeatures(data), which generates a list of data frames containing all feature types in the form of samples x features.

data <- readRDS(
  system.file("extdata", "example_scrnaseq.rds", package = "scFeatures")
)
data <- process_data(data, normalise = TRUE) #  perform normalisation

scfeatures_result <- scFeatures(data)

Below we list the function for generating each individual feature types for scRNA-seq, spatial proteomic and spatial transcriptomic data.

scFeatures for single-cell RNA-seq data

scFeatures accept a Seurat object containing the sample and celltype columns. For SingleCellExperiment or SpatialExperiment object, we provide a function makeSeurat() for the conversion.

#---------------------------------------------------------------------------
# read in data
#---------------------------------------------------------------------------
data <- readRDS(
  system.file("extdata", "example_scrnaseq.rds", package = "scFeatures")
)
data <- process_data(data, normalise = TRUE) #  perform normalisation

# format the patient ID as `a_cond_b` eg, `patient09_cond_nonresponder`.
# This is so that we can retrieve the sample ID and the corresponding condition
# when running downstream analysis on the generated features
data$sample <- paste0(data$sample, "_cond_", data$condition)

#---------------------------------------------------------------------------
# cell type proportions
#---------------------------------------------------------------------------
feature_proportion_raw <- run_proportion_raw(data)
feature_proportion_logit <- run_proportion_logit(data)
feature_proportion_ratio <- run_proportion_ratio(data)


#---------------------------------------------------------------------------
# cell type specific gene expressions
#---------------------------------------------------------------------------
# optional step, run if mitochondria amd ribosomal genes are not of interest
data_remove_mito <- remove_mito(data)

# by default, use the top variable genes
# users can change the number of top variable genes by changing `num_top_gene`
# alternatively, users can also input genes of interest in a dataframe format
genes_of_interest <- data.frame(
    marker = c("S100A11", "GZMB", "DUSP1"),
    celltype = c(
      "CD8, T Effector Memory", 
      "CD8, T Effector Memory", 
      "Naive T Cells"
    )
)

feature_gene_mean_celltype <- run_gene_mean_celltype(data_remove_mito)
feature_gene_prop_celltype <- run_gene_prop_celltype(
  data_remove_mito, genes = genes_of_interest
)
feature_gene_cor_celltype <- run_gene_cor_celltype(
  data_remove_mito, num_top_gene = 50
)
# by default, pick around 100 most variable genes per cell type, can change
# this number through the num_top_gene argument


#---------------------------------------------------------------------------
# cell type specific pathway expressions
#---------------------------------------------------------------------------
feature_pathway_gsva <- run_pathway_gsva(data, species = "Homo sapiens")
feature_pathway_mean <- run_pathway_mean(data, species = "Homo sapiens")
# by default, use the 50 hallmark pathways, users can also input their gene
# set of interest in a list format
geneset <- list(
    "pathway_a" = c("CAPNS1", "TLCD1"),
    "pathway_b" = c("PEX6", "DPRXP4")
)
feature_pathway_prop <- run_pathway_prop(
  data, geneset = geneset,
  species = "Homo sapiens"
)


#---------------------------------------------------------------------------
# cell type specific cell-cell interactions
#---------------------------------------------------------------------------
feature_CCI <- run_CCI(data)


#---------------------------------------------------------------------------
# Bulk expressions
#---------------------------------------------------------------------------

# by default, use the top variable genes
# users can change the number of top variable genes by changing `num_top_gene`
# alternatively, users can also input their genes of interest in a list format
feature_gene_mean_bulk <- run_gene_mean(data)
feature_gene_cor_bulk <- run_gene_cor(data, num_top_gene = 5)
genes_of_interest <- c("TIGIT", "PDCD1")
feature_gene_prop_bulk <- run_gene_prop(data, genes = genes_of_interest)
# by default, pick 1500 most variable genes, can change this number through the
# num_top_gene argument

scFeatures for spatial proteomics

The input data is a Seurat object containing celltype, sample, x_coord and y_coord columns.

For SingleCellExperiment or SpatialExperiment object, we provide a function makeSeurat() for the conversion. use

Note that, as spatial proteomics contain few genes, the feature categories of pathway expressions and cell-cell interactions are not applicable.

#---------------------------------------------------------------------------
# read in data
#---------------------------------------------------------------------------

data <- readRDS(
  system.file("extdata", "example_scrnaseq.rds", package = "scFeatures")
)

# randomly generate some x and y coordinate to make this a 
# toy "spatial proteomics" data
x <- sample(1:100, ncol(data) , replace = T)
y <- sample(1:100, ncol(data) , replace = T)
data <- makeSeurat(data, spatialCoords = list(x,y))
data <- process_data(data, normalise = T)  

#---------------------------------------------------------------------------
# cell type proportions
#---------------------------------------------------------------------------
feature_proportion_raw <- run_proportion_raw(data, type = "spatial_p")
feature_proportion_logit <- run_proportion_logit(data, type = "spatial_p")
feature_proportion_ratio <- run_proportion_ratio(data, type = "spatial_p")


#---------------------------------------------------------------------------
# cell type specific gene expressions
#---------------------------------------------------------------------------
feature_gene_mean_celltype <- run_gene_mean_celltype(data, type = "spatial_p")
feature_gene_prop_celltype <- run_gene_prop_celltype(data, type = "spatial_p")
feature_gene_cor_celltype <- run_gene_cor_celltype(data, type = "spatial_p")


#---------------------------------------------------------------------------
# Bulk expressions
#---------------------------------------------------------------------------
feature_gene_mean_bulk <- run_gene_mean(data, type = "spatial_p")
feature_gene_prop_bulk <- run_gene_prop(data, type = "spatial_p")
feature_gene_cor_bulk <- run_gene_cor(data, type = "spatial_p")


#---------------------------------------------------------------------------
# Spatial metrics
#---------------------------------------------------------------------------
feature_L_stats <- run_L_function(data, type = "spatial_p")
feature_morans_I <- run_Morans_I(data, type = "spatial_p")
feature_celltype_interact <- run_celltype_interaction(data, type = "spatial_p")
feature_nn_correlation <- run_nn_correlation(data, type = "spatial_p")

scFeature for spatial transcriptomics

The input data is a Seurat object containing sample, x_cord and y_cord columns.
Additionally, apart from the RNA assay, which contains the gene expression of each spot, a predictions assay is also needed.

The predictions assay is a matrix in the form of cell types x spot, which stores the cell type probability of each spot. This can be obtained from performing cell type prediction using reference data.

#---------------------------------------------------------------------------
# read in data
#---------------------------------------------------------------------------

data <- readRDS(
  system.file("extdata", "example_scrnaseq.rds", package = "scFeatures")
)

# generate a toy "spatial transcriptomics" data
data$celltype <- NULL #spatial transcriptomics don't have celltype 

# randomly generate some x and y coordinates
x <- sample(1:100, ncol(data) , replace = T)
y <- sample(1:100, ncol(data) , replace = T)

# for spatial transcriptomics, we need to estimate the number of cells per spot
data <- get_num_cell_per_spot(data)

# also need a dataframe of celltype probability at each spot 
# here we randomly create one 
nrow <- 5 #pretend there are 5 cell types 
ncol <- ncol(data) 

# Create a matrix of random numbers
matrix <- matrix(runif(nrow * ncol), nrow, ncol)
# Normalize the columns of the matrix so that each column sum to 1
prediction.scores <-  sweep(matrix, 2, colSums(matrix), "/")
colnames(prediction.scores) <- colnames(data)
rownames(prediction.scores) <- c(paste0("celltype", 1:5))

# format the data using makeSeurat
data <- makeSeurat(data, spatialCoords = list(x,y), spotProbability = prediction.scores)
data <- process_data(data, normalise = T)  




# format the patient ID as `a_cond_b` eg, `patient09_cond_nonresponder`.
# This is so that we can retrieve the sample ID and the corresponding
# condition when running downstream analysis on the generated features
data$sample <- paste0(data$sample, "_cond_", data$condition)

#---------------------------------------------------------------------------
# cell type proportions
#---------------------------------------------------------------------------
feature_proportion_raw <- run_proportion_raw(data, type = "spatial_t")
feature_proportion_logit <- run_proportion_logit(data, type = "spatial_t")
feature_proportion_ratio <- run_proportion_ratio(data, type = "spatial_t")


#---------------------------------------------------------------------------
# cell type specific gene expressions
#---------------------------------------------------------------------------
data_remove_mito <- remove_mito(data)
feature_gene_mean_celltype <- run_gene_mean_celltype(
  data_remove_mito, type = "spatial_t", genes = rownames(data_remove_mito)[1:10] #to speed up example
)


#---------------------------------------------------------------------------
# cell type specific pathway expressions
#---------------------------------------------------------------------------
feature_pathway_mean <- run_pathway_mean(data, geneset = NULL, type = "spatial_t")
 
#---------------------------------------------------------------------------
# Bulk expressions
#---------------------------------------------------------------------------
feature_gene_mean_bulk <- run_gene_mean(data, type = "spatial_t")
feature_gene_prop_bulk <- run_gene_prop(data, type = "spatial_t")
feature_gene_cor_bulk <- run_gene_cor(data, type = "spatial_t")


#---------------------------------------------------------------------------
# Spatial metrics
#---------------------------------------------------------------------------
feature_L_stats <- run_L_function(data, type = "spatial_t")
feature_morans_I <- run_Morans_I(data, type = "spatial_t")
feature_celltype_interact <- run_celltype_interaction(data, type = "spatial_t")
feature_nn_correlation <- run_nn_correlation(data, type = "spatial_t")

sessionInfo()

sessionInfo()
#> R version 4.2.1 (2022-06-23)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Debian GNU/Linux 11 (bullseye)
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.13.so
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] scFeatures_0.99.9 BiocStyle_2.24.0 
#> 
#> loaded via a namespace (and not attached):
#>   [1] rappdirs_0.3.3              rtracklayer_1.56.1         
#>   [3] scattermore_0.8             SpatialExperiment_1.6.1    
#>   [5] R.methodsS3_1.8.2           SeuratObject_4.1.2         
#>   [7] ragg_1.2.3                  tidyr_1.2.1                
#>   [9] ggplot2_3.3.6               bit64_4.0.5                
#>  [11] knitr_1.40                  R.utils_2.12.0             
#>  [13] irlba_2.3.5.1               DelayedArray_0.22.0        
#>  [15] data.table_1.14.2           rpart_4.1.16               
#>  [17] KEGGREST_1.36.3             RCurl_1.98-1.9             
#>  [19] AnnotationFilter_1.20.0     generics_0.1.3             
#>  [21] BiocGenerics_0.42.0         GenomicFeatures_1.48.4     
#>  [23] ScaledMatrix_1.4.1          cowplot_1.1.1              
#>  [25] RSQLite_2.2.18              EnsDb.Mmusculus.v79_2.99.0 
#>  [27] RANN_2.6.1                  future_1.28.0              
#>  [29] bit_4.0.4                   spatstat.data_3.0-0        
#>  [31] xml2_1.3.3                  httpuv_1.6.6               
#>  [33] SummarizedExperiment_1.26.1 assertthat_0.2.1           
#>  [35] xfun_0.33                   hms_1.1.2                  
#>  [37] jquerylib_0.1.4             babelgene_22.9             
#>  [39] evaluate_0.17               promises_1.2.0.1           
#>  [41] fansi_1.0.3                 restfulr_0.0.15            
#>  [43] progress_1.2.2              caTools_1.18.2             
#>  [45] dbplyr_2.2.1                igraph_1.3.5               
#>  [47] DBI_1.1.3                   htmlwidgets_1.5.4          
#>  [49] spatstat.geom_3.0-3         stats4_4.2.1               
#>  [51] purrr_0.3.5                 ellipsis_0.3.2             
#>  [53] RSpectra_0.16-1             dplyr_1.0.10               
#>  [55] bookdown_0.30               annotate_1.74.0            
#>  [57] RcppParallel_5.1.5          biomaRt_2.52.0             
#>  [59] deldir_1.0-6                sparseMatrixStats_1.8.0    
#>  [61] MatrixGenerics_1.8.1        vctrs_0.4.2                
#>  [63] SingleCellExperiment_1.18.1 Biobase_2.56.0             
#>  [65] ensembldb_2.20.2            ROCR_1.0-11                
#>  [67] abind_1.4-5                 withr_2.5.0                
#>  [69] cachem_1.0.6                SIMLR_1.22.0               
#>  [71] progressr_0.11.0            sctransform_0.3.5          
#>  [73] GenomicAlignments_1.32.1    prettyunits_1.1.1          
#>  [75] scran_1.24.1                goftest_1.2-3              
#>  [77] cluster_2.1.4               ape_5.6-2                  
#>  [79] lazyeval_0.2.2              crayon_1.5.2               
#>  [81] spatstat.explore_3.0-5      edgeR_3.38.4               
#>  [83] pkgconfig_2.0.3             GenomeInfoDb_1.32.4        
#>  [85] nlme_3.1-160                ProtGenerics_1.28.0        
#>  [87] rlang_1.0.6                 globals_0.16.1             
#>  [89] lifecycle_1.0.3             miniUI_0.1.1.1             
#>  [91] filelock_1.0.2              BiocFileCache_2.4.0        
#>  [93] rsvd_1.0.5                  rprojroot_2.0.3            
#>  [95] polyclip_1.10-0             GSVA_1.44.5                
#>  [97] matrixStats_0.62.0          lmtest_0.9-40              
#>  [99] graph_1.74.0                Matrix_1.5-3               
#> [101] Rhdf5lib_1.18.2             zoo_1.8-11                 
#> [103] ggridges_0.5.4              GlobalOptions_0.1.2        
#> [105] pheatmap_1.0.12             png_0.1-7                  
#> [107] viridisLite_0.4.1           rjson_0.2.21               
#> [109] bitops_1.0-7                R.oo_1.25.0                
#> [111] rhdf5filters_1.8.0          KernSmooth_2.23-20         
#> [113] SingleCellSignalR_1.8.0     Biostrings_2.64.1          
#> [115] blob_1.2.3                  DelayedMatrixStats_1.18.1  
#> [117] shape_1.4.6                 stringr_1.4.1              
#> [119] parallelly_1.32.1           spatstat.random_3.0-1      
#> [121] S4Vectors_0.34.0            beachmat_2.12.0            
#> [123] scales_1.2.1                GSEABase_1.58.0            
#> [125] memoise_2.0.1               magrittr_2.0.3             
#> [127] plyr_1.8.7                  ica_1.0-3                  
#> [129] gplots_3.1.3                zlibbioc_1.42.0            
#> [131] compiler_4.2.1              dqrng_0.3.0                
#> [133] BiocIO_1.6.0                RColorBrewer_1.1-3         
#> [135] fitdistrplus_1.1-8          Rsamtools_2.12.0           
#> [137] cli_3.4.1                   XVector_0.36.0             
#> [139] listenv_0.8.0               EnsDb.Hsapiens.v79_2.99.0  
#> [141] patchwork_1.1.2             pbapply_1.5-0              
#> [143] MASS_7.3-58.1               mgcv_1.8-40                
#> [145] tidyselect_1.2.0            stringi_1.7.8              
#> [147] textshaping_0.3.6           yaml_2.3.5                 
#> [149] BiocSingular_1.12.0         locfit_1.5-9.6             
#> [151] ggrepel_0.9.1               grid_4.2.1                 
#> [153] sass_0.4.2                  tools_4.2.1                
#> [155] future.apply_1.9.1          parallel_4.2.1             
#> [157] circlize_0.4.15             rstudioapi_0.14            
#> [159] foreach_1.5.2               bluster_1.6.0              
#> [161] AUCell_1.18.1               metapod_1.4.0              
#> [163] gridExtra_2.3               Rtsne_0.16                 
#> [165] DropletUtils_1.16.0         proxyC_0.3.3               
#> [167] digest_0.6.29               BiocManager_1.30.18        
#> [169] rgeos_0.5-9                 shiny_1.7.2                
#> [171] pracma_2.4.2                Rcpp_1.0.9                 
#> [173] GenomicRanges_1.48.0        scuttle_1.6.3              
#> [175] later_1.3.0                 RcppAnnoy_0.0.19           
#> [177] httr_1.4.4                  AnnotationDbi_1.58.0       
#> [179] colorspace_2.0-3            XML_3.99-0.11              
#> [181] fs_1.5.2                    tensor_1.5                 
#> [183] reticulate_1.26             IRanges_2.30.1             
#> [185] splines_4.2.1               uwot_0.1.14                
#> [187] statmod_1.4.37              spatstat.utils_3.0-1       
#> [189] pkgdown_2.0.6               sp_1.5-0                   
#> [191] multtest_2.52.0             plotly_4.10.0              
#> [193] systemfonts_1.0.4           xtable_1.8-4               
#> [195] jsonlite_1.8.2              R6_2.5.1                   
#> [197] pillar_1.8.1                htmltools_0.5.3            
#> [199] mime_0.12                   glue_1.6.2                 
#> [201] fastmap_1.1.0               BiocParallel_1.30.4        
#> [203] BiocNeighbors_1.14.0        codetools_0.2-18           
#> [205] utf8_1.2.2                  lattice_0.20-45            
#> [207] bslib_0.4.0                 spatstat.sparse_3.0-0      
#> [209] tibble_3.1.8                curl_4.3.3                 
#> [211] leiden_0.4.3                gtools_3.9.3               
#> [213] magick_2.7.3                survival_3.4-0             
#> [215] limma_3.52.4                rmarkdown_2.17             
#> [217] desc_1.4.2                  munsell_0.5.0              
#> [219] rhdf5_2.40.0                GenomeInfoDbData_1.2.8     
#> [221] iterators_1.0.14            HDF5Array_1.24.2           
#> [223] msigdbr_7.5.1               reshape2_1.4.4             
#> [225] gtable_0.3.1                spatstat.core_2.4-4        
#> [227] Seurat_4.2.0