Merge single-cell data from different batches and experiments by leveraging (pseudo-)replicates, control genes and pseudo-bulk.

scMerge2(
  exprsMat,
  batch,
  cellTypes = NULL,
  condition = NULL,
  ctl = rownames(exprsMat),
  chosen.hvg = NULL,
  ruvK = 20,
  use_bpparam = BiocParallel::SerialParam(),
  use_bsparam = BiocSingular::RandomParam(),
  use_bnparam = BiocNeighbors::AnnoyParam(),
  pseudoBulk_fn = "create_pseudoBulk",
  k_pseudoBulk = 30,
  k_celltype = 10,
  exprsMat_counts = NULL,
  cosineNorm = TRUE,
  return_subset = FALSE,
  return_subset_genes = NULL,
  return_matrix = TRUE,
  byChunk = TRUE,
  chunkSize = 50000,
  verbose = TRUE,
  seed = 1
)

Arguments

exprsMat

A gene (row) by cell (column) log-transformed matrix to be adjusted.

batch

A vector indicating the batch information for each cell in the batch-combined matrix.

cellTypes

An optional vector indicating the cell type of each cell in the batch-combined matrix. If NULL, the pseudo-replicate procedure is run to identify cell types.

condition

An optional vector indicating the condition information for each cell in the batch-combined matrix.

ctl

A character vector of negative control genes. It must have a non-empty intersection with the row names of exprsMat.
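Because ctl must share at least one gene with the rows of exprsMat, it can help to check the overlap before calling scMerge2. The following is a minimal base-R sketch of such a check; the matrix, the gene names, and the objects toy_mat, toy_ctl and ctl_used are made up for illustration and are not part of scMerge.

```r
# Toy gene-by-cell matrix; gene and cell names are hypothetical placeholders.
toy_mat <- matrix(
  rnorm(12),
  nrow = 4,
  dimnames = list(c("geneA", "geneB", "geneC", "geneD"), paste0("cell", 1:3))
)

# Candidate negative controls; "geneX" is deliberately absent from the matrix.
toy_ctl <- c("geneB", "geneD", "geneX")

# Keep only the controls that are actually rows of the matrix.
ctl_used <- intersect(toy_ctl, rownames(toy_mat))

# Stop early if no control gene is present in the matrix.
stopifnot(length(ctl_used) > 0)
```

The filtered vector could then be passed as the ctl argument in place of the unfiltered list.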

chosen.hvg

An optional character vector of highly variable genes chosen.

ruvK

An integer indicating the number of unwanted variation factors to remove. Default is 20.

use_bpparam

A BiocParallelParam object from the BiocParallel package. Default is SerialParam().

use_bsparam

A BiocSingularParam object from the BiocSingular package. Default is RandomParam().

use_bnparam

A BiocNeighborsParam object from the BiocNeighbors package. Default is AnnoyParam().

pseudoBulk_fn

A character string indicating how the pseudo-bulk is constructed. Default is "create_pseudoBulk".

k_pseudoBulk

An integer indicating the number of pseudo-bulk samples constructed within each cell grouping. Default is 30.

k_celltype

An integer indicating the number of nearest neighbours used in buildSNNGraph when grouping cells within each batch. Default is 10.

exprsMat_counts

A gene (row) by cell (column) counts matrix to be adjusted.

cosineNorm

A logical value indicating whether cosine normalisation is performed on the input data. Default is TRUE.
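For intuition, cosine normalisation is commonly understood as scaling each cell's expression vector to unit Euclidean length. The base-R sketch below illustrates that general idea on a toy matrix. It is an assumption about what the term usually means, not a description of the code path scMerge2 uses internally, and cosine_normalise is a hypothetical helper.

```r
# Hypothetical helper: scale every column (cell) to unit L2 norm.
cosine_normalise <- function(mat) {
  col_norms <- sqrt(colSums(mat^2))
  # Leave all-zero cells unchanged instead of dividing by zero.
  col_norms[col_norms == 0] <- 1
  sweep(mat, 2, col_norms, "/")
}

# Toy gene-by-cell matrix with made-up values.
toy <- matrix(
  c(3, 4, 0, 5, 1, 1),
  nrow = 2,
  dimnames = list(c("geneA", "geneB"), paste0("cell", 1:3))
)

normed <- cosine_normalise(toy)
```

After this transformation every non-zero cell has the same vector length, so differences in overall scale between cells no longer dominate distance calculations.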

return_subset

If TRUE, the adjusted matrix is returned for only a subset of genes (the highly variable genes, or those indicated in return_subset_genes).

return_subset_genes

An optional character vector indicating the subset of genes to be adjusted.

return_matrix

A logical value indicating whether the adjusted matrix is calculated and returned. If FALSE, then only the estimated parameters will be returned.

byChunk

A logical value indicating whether the adjusted matrix is calculated by chunk.

chunkSize

A numeric value indicating the size of each chunk. Default is 50000.

verbose

If TRUE, then all intermediate steps will be shown. Default is TRUE.

seed

A numeric value indicating the random seed used. Default is 1.

Value

Returns a list object with the following components:

  • newY: if return_matrix is TRUE, the adjusted matrix will be returned.

  • fullalpha: Alpha estimated from the fastRUVIII model.

  • M: Replicate matrix.

Author

Yingxin Lin

Examples

## Loading example data
data('example_sce', package = 'scMerge')
## Previously computed stably expressed genes
data('segList_ensemblGeneID', package = 'scMerge')
## Running an example data with minimal inputs
library(SingleCellExperiment)
exprsMat <- scMerge2(exprsMat = logcounts(example_sce),
                     batch = example_sce$batch,
                     ctl = segList_ensemblGeneID$mouse$mouse_scSEG)
#> [1] "Cluster within batch"
#> Warning: You're computing too large a percentage of total singular values, use a standard svd instead.
#> Warning: You're computing too large a percentage of total singular values, use a standard svd instead.
#> [1] "Normalising data"
#> [1] "Constructing pseudo-bulk"
#> Dimension of pseudo-bulk expression: [1] 1047  131
#> [1] "Identifying MNC using pseudo-bulk:"

#> [1] "Running RUV"
assay(example_sce, "scMerge2") <- exprsMat$newY


example_sce <- scater::runPCA(example_sce, exprs_values = 'scMerge2')
scater::plotPCA(example_sce, colour_by = 'cellTypes', shape = 'batch')