6 Unsupervised clustering for cell annotation
Labeling the identity of your cells is a key step in any spatial processing protocol in order to determine differential cell type compositions and changes which occur in specific cell types during disease. The method by which this is done can differ from study to study, but there are two main approaches to this: clustering and annotation.
# set parameters
set.seed(51773)
use_mc <- TRUE
if (use_mc) {
nCores <- max(parallel::detectCores()/2, 1)
} else {
nCores <- 1
}
BPPARAM <- simpleSeg:::generateBPParam(nCores)
theme_set(theme_classic())
6.1 Clustering vs Annotation
Clustering is an unsupervised method of labelling cells. An algorithm identifies clusters of similar cells based on marker expression patterns, and the resulting clusters need to be manually identified based on biological domain knowledge. Cell annotation is a supervised method which requires a separate, reference dataset. The algorithm uses a reference dataset to assign a cell type label to each cell in the dataset. There are advantages and disadvantages to both. We will first demonstrate the use of our packages to perform both clustering and annotation, and then discuss how to choose between the two approaches.
6.2 Clustering with FuseSOM
FuseSOM is an unsupervised clustering tool for highly multiplexed in situ imaging cytometry assays. It combines a Self Organising Map
architecture and a MultiView
integration of correlation-based metrics for robustness and high accuracy. It has been streamlined to accept multiple data structures including SingleCellExperiment
objects, SpatialExperiment
objects, and DataFrames
.
6.2.1 FuseSOM Matrix Input
To demonstrate the functionality of FuseSOM
, we will use the Risom 2022 dataset, which profiles the spatial landscape of ductal carcinoma in situ (DCIS). We will be using the markers used in the original study to perform clustering.
# load in the data
data("risom_dat")
# define the markers of interest
risomMarkers <- c('CD45','SMA','CK7','CK5','VIM','CD31','PanKRT','ECAD',
'Tryptase','MPO','CD20','CD3','CD8','CD4','CD14','CD68','FAP',
'CD36','CD11c','HLADRDPDQ','P63','CD44')
# we will be using the manual_gating_phenotype as the true cell type to gauge
# performance
names(risom_dat)[names(risom_dat) == 'manual_gating_phenotype'] <- 'CellType'
Now that we have loaded the data and defined the markers of interest, we can run the FuseSOM algorithm using the runFuseSOM
function. We specify the number of clusters to be 23 based on prior domain knowledge. The output contains the cluster labels as well as the Self Organizing Map
model.
risomRes <- runFuseSOM(data = risom_dat, markers = risomMarkers,
numClusters = 23)
Lets look at the distribution of the clusters.
# get the distribution of the clusters
round(table(risomRes$clusters)/sum(table(risomRes$clusters)), 2)
cluster_1 cluster_10 cluster_11 cluster_12 cluster_13 cluster_14 cluster_15
0.32 0.04 0.01 0.02 0.06 0.03 0.02
cluster_16 cluster_17 cluster_18 cluster_19 cluster_2 cluster_20 cluster_21
0.03 0.02 0.08 0.02 0.01 0.05 0.01
cluster_22 cluster_23 cluster_3 cluster_4 cluster_5 cluster_6 cluster_7
0.05 0.07 0.00 0.01 0.04 0.06 0.02
cluster_8 cluster_9
0.01 0.01
It appears that 32% of cells have been assigned to cluster_1
. Next, lets generate a heatmap of the marker expression for each cluster using the markerHeatmap
function.
risomHeat <- FuseSOM::markerHeatmap(data = risom_dat, markers = risomMarkers,
clusters = risomRes$clusters, clusterMarkers = TRUE)
How do I identify imperfect clustering?
- Do our cell-type specific markers clearly separate out by cluster? We expect to see discrete expression of our markers in specific cell types, e.g., CD4 being expressed in T cells exclusively.
- If we instead see “smearing” of our markers across clusters, where several clusters express high levels of a cell type specific marker such as CD4, it is likely a normalisation issue.
Three common issues which cause imperfect clustering have been outlined below:
- Imperfect segmentation: excessive lateral marker spill over can severely impact downstream clustering, as cell type specific markers leak into nearby cells. This should largely be diagnosed in the segmentation step and will need to be fixed by optimising the upstream segmentation algorithm.
- Imperfect normalization: excessively variable intensities across images could cause issues in the normalization process. This can generally be diagnosed with density plots and box plots for specific markers across images and can be fixed by identifying the exact issue, e.g. extremely high values for a small subset of images, and choosing a normalization strategy to remove/reduce this effect.
-
Imperfect clustering: choosing a
k
that’s too low or too high could lead to imperfect clustering. This is usually diagnosed by clusters which either express too many markers very highly or express too few markers, and is usually remedied by choosing an idealk
based on an elbow plot described below.
6.2.2 Using FuseSOM to estimate the number of clusters
When the number of expected cell types or clusters is not known beforehand, the estimateNumCluster
function can be used to estimate the number of clusters. Two methods have been developed to calculate the number of clusters:
- Discriminant based method:
- A method developed in house based on discriminant based maximum clusterability projection pursuit
- Distance based methods which includes:
- The Gap Statistic
- The Jump Statistic
- The Slope Statistic
- The Within Cluster Dissimilarity Statistic
- The Silhouette Statistic
We run estimateNumCluster
and specify method = c("Discriminant", "Distance")
to use both approaches.
# lets estimate the number of clusters using all the methods
# original clustering has 23 clusters so we will set kseq from 2:25
# we pass it the SOM model generated in the previous step
risomKest <- estimateNumCluster(data = risomRes$model, kSeq = 2:25,
method = c("Discriminant", "Distance"))
We can then use this result to determine the best number of clusters for this dataset based on the different metrics.
# what is the best number of clusters determined by the discriminant method?
risomKest$Discriminant
[1] 7
According to the Discriminant method, the optimal number of clusters is 7.
We can use the optiPlot()
function to generate an elbow plot with the optimal number of clusters for the distance based methods.
# we can plot the results using the optiplot function
pSlope <- optiPlot(risomKest, method = 'slope')
pSlope
pJump <- optiPlot(risomKest, method = 'jump')
pJump
pWcd <- optiPlot(risomKest, method = 'wcd')
pWcd
pGap <- optiPlot(risomKest, method = 'gap')
pGap
pSil <- optiPlot(risomKest, method = 'silhouette')
pSil
From the plots, we see that the Jump
statistic almost perfectly captures the correct number of clusters. The Gap
statistic is a close second with 15 clusters. All the other methods significantly underestimate the number of clusters.
- How do we choose our
k
? We’re generally looking for thek
before the point of greatest inflection, or the point beyond which increasingk
results in minimal improvement to clustering quality. - Is there one best choice for
k
? There can be several options ofk
if there are several points of inflection. Choose thek
which best reflects the number of clusters you expect to get from the tissue. For instance, if you are interested in broader cell populations, you might pick a lower value ofk
, and if you are interested in identifying subpopulations, you might pick a larger value fork
.
6.2.3 FuseSOM with SingleCellExperiment object as input
The FuseSOM algorithm is also equipped to take in a SingleCellExperiment
object as input. The results of the pipeline will be written to either the metadata
or the colData
fields.
First, we create a SingleCellExperiment
object using the Risom 2022 data.
# create an SCE object using Risom 2022 data
colDat <- risom_dat[, setdiff(colnames(risom_dat), risomMarkers)]
sce <- SingleCellExperiment(assays = list(counts = t(risom_dat[, names(risom_dat) != "CellType"])),
colData = colDat)
sce
class: SingleCellExperiment
dim: 22 69672
metadata(0):
assays(1): counts
rownames(22): CD45 SMA ... P63 CD44
rowData names(0):
colnames: NULL
colData names(1): X
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
Next, we pass it to the runFuseSOM
function. Here, we can provide the assay in which the data is stored (counts
) and specify the column to store the clusters in using clusterCol = "clusters"
. The Self Organizing Map
that is generated will be stored in the metadata field.
risomRessce <- runFuseSOM(sce, markers = risomMarkers, clusterCol = "clusters",
assay = 'counts', numClusters = 23, verbose = FALSE)
colnames(colData(risomRessce))
[1] "X" "clusters"
names(metadata(risomRessce))
[1] "SOM"
Notice how the there is now a clusters
column in the colData
and a SOM field in the metadata.
If necessary, you can run runFuseSOM
with a new cluster number and specify a new clusterCol
. If clusterCol
contains a new name, the new clusters will be stored in the new column. Otherwise, it will overwrite the the current clusters
column. Running FuseSOM on the same object will overwrite the SOM field in the metadata.
Just like before, we can plot a heatmap of the resulting clusters across all markers.
data <- risom_dat[, risomMarkers] # get the original data used
clusters <- colData(risomRessce)$clusters # extract the clusters from the SCE object
# generate the heatmap
risomHeatsce <- markerHeatmap(data = risom_dat, markers = risomMarkers,
clusters = clusters, clusterMarkers = TRUE)
Or we can directly plot from the SCE using the scater
package.
# Visualise marker expression in each cluster.
scater::plotGroupedHeatmap(
risomRessce,
features = risomMarkers,
group = "clusters",
exprs_values = "counts",
center = TRUE,
scale = TRUE,
zlim = c(-3, 3),
cluster_rows = FALSE,
block = "clusters"
)
6.2.4 Using FuseSOM to estimate the number of clusters for SingleCellExperiment objects
Just like before, we will use estimateNumCluster
on our Risom SingleCellExperiment
object.
# lets estimate the number of clusters using all the methods
# original clustering has 23 clusters so we will set kseq from 2:25
risomRessce <- estimateNumCluster(data = risomRessce, kSeq = 2:25,
method = c("Discriminant", "Distance"))
You have provided a dataset of class: SingleCellExperiment
Now Computing the Number of Clusters using Discriminant Analysis
Now Computing The Number Of Clusters Using Distance Analysis
names(metadata(risomRessce))
[1] "SOM" "clusterEstimation"
The metadata now contains a clusterEstimation
field which holds the results from the estimateNumCluster
function.
We can assess the results of cluster estimation as below.
# what is the best number of clusters determined by the discriminant method?
metadata(risomRessce)$clusterEstimation$Discriminant
[1] 7
According to the discrminant method, the optimal number of clusters is 7.
# we can plot the results using the optiplot function
pSlope <- optiPlot(risomRessce, method = 'slope')
pSlope
pJump <- optiPlot(risomRessce, method = 'jump')
pJump
pWcd <- optiPlot(risomRessce, method = 'wcd')
pWcd
pGap <- optiPlot(risomRessce, method = 'gap')
pGap
pSil <- optiPlot(risomRessce, method = 'silhouette')
pSil
Again, we see that the Jump
statistic almost perfectly captures the correct number of clusters with 24 clusters. The Gap
method is a close second with 15 clusters. All the other methods significantly underestimate the number of clusters.
In the next section, we will demonstrate how our packages can be used to perform supervised cell annotation, and then discuss how to choose between clustering and annotation approaches.
6.3 sessionInfo
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.4.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Australia/Sydney
tzcode source: internal
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] ggplot2_3.5.1 scuttle_1.16.0
[3] simpleSeg_1.8.0 MLmetrics_1.1.3
[5] STexampleData_1.14.0 SpatialExperiment_1.16.0
[7] SingleCellExperiment_1.28.1 SummarizedExperiment_1.36.0
[9] Biobase_2.66.0 GenomicRanges_1.58.0
[11] GenomeInfoDb_1.42.0 IRanges_2.40.0
[13] S4Vectors_0.44.0 MatrixGenerics_1.18.0
[15] matrixStats_1.4.1 ExperimentHub_2.14.0
[17] AnnotationHub_3.14.0 BiocFileCache_2.14.0
[19] dbplyr_2.5.0 BiocGenerics_0.52.0
[21] FuseSOM_1.8.0
loaded via a namespace (and not attached):
[1] splines_4.4.1 later_1.4.1 bitops_1.0-9
[4] ggplotify_0.1.2 filelock_1.0.3 tibble_3.2.1
[7] svgPanZoom_0.3.4 polyclip_1.10-7 lifecycle_1.0.4
[10] rstatix_0.7.2 fastcluster_1.2.6 prabclus_2.3-4
[13] lattice_0.22-6 MASS_7.3-61 backports_1.5.0
[16] magrittr_2.0.3 rmarkdown_2.29 yaml_2.3.10
[19] httpuv_1.6.15 flexmix_2.3-19 sp_2.1-4
[22] DBI_1.2.3 RColorBrewer_1.1-3 abind_1.4-8
[25] zlibbioc_1.52.0 purrr_1.0.2 RCurl_1.98-1.16
[28] nnet_7.3-19 yulab.utils_0.1.8 rappdirs_0.3.3
[31] GenomeInfoDbData_1.2.13 ggrepel_0.9.6 irlba_2.3.5.1
[34] spatstat.utils_3.1-1 terra_1.7-78 pheatmap_1.0.12
[37] vegan_2.6-8 svglite_2.1.3 permute_0.9-7
[40] codetools_0.2-20 DelayedArray_0.32.0 tidyselect_1.2.1
[43] raster_3.6-30 farver_2.1.2 UCSC.utils_1.2.0
[46] ScaledMatrix_1.14.0 viridis_0.6.5 jsonlite_1.8.9
[49] BiocNeighbors_2.0.0 Formula_1.2-5 scater_1.34.0
[52] systemfonts_1.1.0 tools_4.4.1 Rcpp_1.0.13-1
[55] glue_1.8.0 mnormt_2.1.1 gridExtra_2.3
[58] SparseArray_1.6.0 xfun_0.49 mgcv_1.9-1
[61] EBImage_4.48.0 dplyr_1.1.4 HDF5Array_1.34.0
[64] withr_3.0.2 shinydashboard_0.7.2 BiocManager_1.30.25
[67] fastmap_1.2.0 rhdf5filters_1.18.0 fansi_1.0.6
[70] rsvd_1.0.5 digest_0.6.37 R6_2.5.1
[73] mime_0.12 gridGraphics_0.5-1 DataVisualizations_1.3.2
[76] colorspace_2.1-1 spatstat.data_3.1-4 jpeg_0.1-10
[79] RSQLite_2.3.8 diptest_0.77-1 utf8_1.2.4
[82] tidyr_1.3.1 generics_0.1.3 robustbase_0.99-4-1
[85] class_7.3-22 httr_1.4.7 htmlwidgets_1.6.4
[88] S4Arrays_1.6.0 pkgconfig_2.0.3 gtable_0.3.6
[91] modeltools_0.2-23 blob_1.2.4 XVector_0.46.0
[94] htmltools_0.5.8.1 carData_3.0-5 fftwtools_0.9-11
[97] scales_1.3.0 png_0.1-8 spatstat.univar_3.1-1
[100] knitr_1.49 rstudioapi_0.17.1 rjson_0.2.23
[103] nlme_3.1-166 curl_6.0.1 proxy_0.4-27
[106] cachem_1.1.0 rhdf5_2.50.0 stringr_1.5.1
[109] BiocVersion_3.20.0 parallel_4.4.1 vipor_0.4.7
[112] AnnotationDbi_1.68.0 pillar_1.9.0 grid_4.4.1
[115] vctrs_0.6.5 promises_1.3.2 ggpubr_0.6.0
[118] BiocSingular_1.22.0 car_3.1-3 beachmat_2.22.0
[121] xtable_1.8-4 cluster_2.1.6 princurve_2.1.6
[124] beeswarm_0.4.0 evaluate_1.0.1 magick_2.8.5
[127] cli_3.6.3 locfit_1.5-9.10 compiler_4.4.1
[130] rlang_1.1.4 crayon_1.5.3 analogue_0.17-7
[133] ggsignif_0.6.4 labeling_0.4.3 mclust_6.1.1
[136] fs_1.6.5 ggbeeswarm_0.7.2 stringi_1.8.4
[139] psych_2.4.6.26 deldir_2.0-4 viridisLite_0.4.2
[142] BiocParallel_1.40.0 nnls_1.6 FCPS_1.3.4
[145] cytomapper_1.18.0 munsell_0.5.1 Biostrings_2.74.0
[148] tiff_0.1-12 coop_0.6-3 profileModel_0.6.1
[151] spatstat.geom_3.3-4 Matrix_1.7-1 bit64_4.5.2
[154] Rhdf5lib_1.28.0 fpc_2.2-13 KEGGREST_1.46.0
[157] shiny_1.9.1 brglm_0.7.2 kernlab_0.9-33
[160] broom_1.0.7 memoise_2.0.1 DEoptimR_1.1-3-1
[163] bit_4.5.0