Overview
The BenchmarkStudy object encapsulates all the components of a
benchmarking study, including the data and the associated
functions. It provides a unified structure for benchmark
developers to share their work and for method developers to interact
with an existing benchmark study.
- Benchmark developers can store Trio objects (containing the input data, metrics, and supporting evidence) together with any mapping functions, and distribute a ready-to-use study object.
- Method developers can apply their methods to the provided data and evaluate their outputs using the built-in metrics.
This vignette provides a guide for both use cases.
For Benchmark Developers
This section demonstrates how to create a BenchmarkStudy
object from a benchmarking study.
Initialising the Study
We begin by creating an empty BenchmarkStudy object.
study <- BenchmarkStudy$new()
Adding a Trio
Assume the benchmark developer has created a Trio object for one of the datasets used in the benchmark. Here, we use a Trio from the Curated Trio Datasets.
# Download the Trio from the database using its name
example_trio <- suppressMessages(Trio$new("benchhub_vignette_example", cachePath = TRUE))
# Add to study
study$addTrio(example_trio)
Define mapping function
A mapping function is a helper function that processes data into a format that can be passed to the evaluation metrics.
For example, in simulation studies, the output of a simulator is a cell x gene count matrix. To compare the sparsity of genes between the simulated data and experimental data, we need a mapping function that calculates the per-gene sparsity from the count matrix.
We define two examples of mapping functions.
Example 1: calculate the sparsity (proportion of zero counts) per gene.
# Define the mapping function
proportion_zero_gene <- function(data) {
  # Proportion of zero counts in each column (gene) of the cell x gene counts
  return(colMeans(counts(data) == 0))
}
# Add the mapping function
study$addMappingFunction(
  name = "Fraction Zero Genes",
  func = proportion_zero_gene,
  inputDescription = "SingleCellExperiment or matrix with gene expression counts.",
  outputDescription = "Numeric vector of fraction of zero counts per gene.",
  exampleUsage = paste(
    "## Minimal example",
    "#mat <- matrix(rpois(100, lambda = 1), nrow = 10, ncol = 10)",
    "#mat <- SingleCellExperiment(assays = list(counts = mat))",
    "#res <- study$runMapping('Fraction.Zero.Genes', mat)",
    "#head(res)",
    sep = "\n"
  )
)
## Warning: Mapping function name has been modified to be a valid R variable name.
## ℹ Original name: Fraction Zero Genes
## ℹ Modified name: Fraction.Zero.Genes
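To make the sparsity calculation concrete, here is a minimal base R sketch on a hand-made cell x gene matrix. The object toy_counts and its values are made up for illustration, and a plain matrix is used instead of a SingleCellExperiment, so counts() is not needed.

```r
# Toy cell x gene count matrix (hypothetical values): 4 cells in rows, 3 genes in columns
toy_counts <- matrix(
  c(0, 2, 0,
    0, 0, 5,
    1, 0, 0,
    0, 3, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("cell", 1:4), paste0("gene", 1:3))
)

# Proportion of zero counts per gene, i.e. per column
gene_sparsity <- colMeans(toy_counts == 0)
gene_sparsity
# gene1 gene2 gene3
#  0.75  0.50  0.75
```

Because the genes are in columns here, colMeans() is the per-gene summary; with genes in rows, rowMeans() would be needed instead.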
Example 2: calculate the normalized library size per cell.
# Define the mapping function
norm_lib_size <- function(data) {
  # Log-transformed library size per cell (cells are in rows)
  lib_size <- log1p(rowSums(counts(data)))
  # edgeR expects genes in rows, so transpose the cell x gene counts
  dge <- edgeR::DGEList(counts = Matrix::t(counts(data)))
  norm_factors <- edgeR::calcNormFactors(dge, method = "TMM")$samples$norm.factors
  return(unname(lib_size * norm_factors))
}
# Add the mapping function. Example usage is optional but recommended.
study$addMappingFunction(
  name = "normalized library size",
  func = norm_lib_size,
  inputDescription = "SingleCellExperiment or matrix with gene expression counts.",
  outputDescription = "Numeric vector of normalised library size per cell."
)
## Warning: Mapping function name has been modified to be a valid R variable name.
## ℹ Original name: normalized library size
## ℹ Modified name: normalized.library.size
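The modified names reported in the two warnings match what base R's make.names() returns for the same inputs. Whether addMappingFunction() calls make.names() internally is an assumption; the sketch below only shows a way to predict the name that later has to be passed to runMapping().

```r
# Names as supplied to addMappingFunction() in the examples above
original_names <- c("Fraction Zero Genes", "normalized library size")

# make.names() replaces characters that are invalid in R variable names with "."
valid_names <- make.names(original_names)
valid_names
# [1] "Fraction.Zero.Genes"     "normalized.library.size"
```

Choosing a name that is already a valid R variable name (for example, with underscores instead of spaces) would avoid the renaming altogether.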
Uploading to Curated Trio Datasets
Once the BenchmarkStudy object includes the following, it can be uploaded to the database:
- A study name and description
- At least three Trio objects
- Mapping functions [optional]
# At least three Trios are required,
# so here we pretend that we have added three unique Trios
study$addTrio(example_trio)
study$addTrio(example_trio)
# Note that each Trio must already exist in the database.
# Trios can be uploaded to the database using the `writeCTD()` function.
# For more details, see the vignette on Trio construction.
# example_trio$writeCTD("example_trio")
# Set name and description manually
study$name <- "Benchhubstudy vignette"
study$description <- "This study compares simulated spatial transcriptomics data."
The object can be uploaded using the writeBenchmarkStudy() function.
# This is commented out in the vignette so that the study is not added repeatedly
# study$writeBenchmarkStudy()
For Method Developers
This section illustrates how a method developer can use the benchmark study object created by another user, apply their method, and evaluate its performance.
Loading the Study
A BenchmarkStudy object can be downloaded from the database through its name.
study <- suppressMessages(BenchmarkStudy$new("Benchhubstudy vignette", fetchFromCtd = TRUE))
Inspect the list of available Trios and mapping functions
We see that this study has three Trios. Each Trio has supporting evidence that we can compare our results against.
length(study$trios)
## [1] 3
study$trios[[1]]
##
## ── Trio Object ─────────────────────────────────────────────────────────────────
##
## ── Dataset
## Dataset Details:
## Formal class 'SingleCellExperiment' [package "SingleCellExperiment"] with 9
## slots
## Data Source: "figshare"
## Dataset ID: "29565947/57477553"
## Cache Path: "/home/runner/.cache/R/TrioR"
## Split Indices: "None"
##
## ── Supporting Evidence
## Number of Supporting Evidence: 3
## Names of Supporting Evidence: "Fraction zero genes", "Fraction zero cells", and
## "normalized library size"
##
## ── Metrics
## Number of Metrics: 1
## Names of Metrics: "KS_stats"
This study provides two mapping functions that process the data into a format that can be used for evaluation.
Each mapping function has documentation.
study$listMappingFunctions()
## [1] "Fraction.Zero.Genes" "normalized.library.size"
doc <- study$getMappingFunctionDocumentation("Fraction.Zero.Genes")
doc
## $inputDescription
## [1] "SingleCellExperiment or matrix with gene expression counts."
##
## $outputDescription
## [1] "Numeric vector of fraction of zero counts per gene."
##
## $exampleUsage
## [1] "## Minimal example\nmat <- matrix(rpois(100, lambda = 1), nrow = 10, ncol = 10)\nmat <- SingleCellExperiment(assays = list(counts = mat)\nres <- study$runMapping('Fraction.Zero.Genes', mat)\nhead(res)"
# Because the example usage is a multi-line string, use cat() to print it nicely.
cat(doc$exampleUsage, "\n")
## ## Minimal example
## mat <- matrix(rpois(100, lambda = 1), nrow = 10, ncol = 10)
## mat <- SingleCellExperiment(assays = list(counts = mat)
## res <- study$runMapping('Fraction.Zero.Genes', mat)
## head(res)
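The difference between printing the documentation entry and passing it to cat() can be reproduced with any string that contains newline characters. A small base R sketch, where the string usage is a made-up stand-in for doc$exampleUsage:

```r
# A single character string containing an embedded newline (hypothetical content)
usage <- paste("## Minimal example", "head(res)", sep = "\n")

# print() shows the string on one line with the newline escaped as \n
printed <- capture.output(print(usage))
length(printed)
# [1] 1

# cat() interprets the newline, so the text spans two lines
shown <- capture.output(cat(usage))
length(shown)
# [1] 2
```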
Preparing for evaluation
This benchmark study assesses the quality of simulated data.
Suppose the method developer has generated a simulated dataset with their method and wants to compare it with the real data to see how realistic it is.
# Suppose this is a simulated dataset
set.seed(0)
sim <- mockSCE(ncells = 4744, ngenes = 1000)
The method developer can apply the mapping functions to the simulated data to generate the features to evaluate, such as the sparsity of the genes and the library size.
mydata_prop_zero_gene <- study$runMapping("Fraction.Zero.Genes", sim)
mydata_lib_size <- study$runMapping("normalized.library.size", sim)
Evaluate
Now we can compare the simulated data against the experimental data
using the evaluate function.
The evaluate function has the form
study$evaluate(trio_name, list(supporting evidence = output to compare with)).
In the call below, Fraction zero genes and
normalized library size are the names of the supporting
evidence found in the benchhub_vignette_example Trio
object.
result <- study$evaluate(
  "benchhub_vignette_example", # name of the Trio to compare with; can be accessed via study$trios[[1]]$name
  list(
    "Fraction zero genes" = mydata_prop_zero_gene, # name of the supporting evidence
    "normalized library size" = mydata_lib_size # name of the supporting evidence
  )
)
result
## # A tibble: 2 × 4
## datasetID evidence metric result
## <chr> <chr> <chr> <dbl>
## 1 29565947/57477553 Fraction zero genes KS_stats 268.
## 2 29565947/57477553 normalized library size KS_stats 98.1
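For intuition about a Kolmogorov-Smirnov style comparison, the sketch below runs a plain two-sample KS test with stats::ks.test() on made-up feature vectors. How the KS_stats metric in the Trio is computed or scaled is not shown in this vignette, so this is an illustration of the general idea only and is not expected to reproduce the values above. All vectors are hypothetical.

```r
set.seed(1)

# Hypothetical per-gene sparsity values; not taken from the vignette data
real_sparsity <- rbeta(500, shape1 = 2, shape2 = 5)
sim_close     <- rbeta(500, shape1 = 2, shape2 = 5)  # drawn from the same distribution
sim_far       <- rbeta(500, shape1 = 5, shape2 = 2)  # drawn from a shifted distribution

# The two-sample KS statistic D is the largest gap between the two empirical CDFs
d_close <- unname(ks.test(sim_close, real_sparsity)$statistic)
d_far   <- unname(ks.test(sim_far, real_sparsity)$statistic)

c(close = d_close, far = d_far)
```

D lies between 0 and 1, and a smaller value means the simulated feature distribution is closer to the reference one.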