6  Unsupervised clustering for cell annotation

Labeling the identity of your cells is a key step in any spatial processing protocol in order to determine differential cell type compositions and changes which occur in specific cell types during disease. The method by which this is done can differ from study to study, but there are two main approaches to this: clustering and annotation.

# load required libraries
library(FuseSOM)
library(STexampleData)
library(MLmetrics)
library(simpleSeg)
library(scuttle)
library(ggplot2)
library(SingleCellExperiment)
# set parameters
set.seed(51773)
use_mc <- TRUE
if (use_mc) {
  nCores <- max(parallel::detectCores()/2, 1)
} else {
  nCores <- 1
}
BPPARAM <- simpleSeg:::generateBPParam(nCores)
theme_set(theme_classic())

6.1 Clustering vs Annotation

Clustering is an unsupervised method of labelling cells. An algorithm identifies clusters of similar cells based on marker expression patterns, and the resulting clusters need to be manually identified based on biological domain knowledge. Cell annotation is a supervised method which requires a separate, reference dataset. The algorithm uses a reference dataset to assign a cell type label to each cell in the dataset. There are advantages and disadvantages to both. We will first demonstrate the use of our packages to perform both clustering and annotation, and then discuss how to choose between the two approaches.

6.2 Clustering with FuseSOM

FuseSOM is an unsupervised clustering tool for highly multiplexed in situ imaging cytometry assays. It combines a Self Organising Map architecture and a MultiView integration of correlation-based metrics for robustness and high accuracy. It has been streamlined to accept multiple data structures including SingleCellExperiment objects, SpatialExperiment objects, and DataFrames.

6.2.1 FuseSOM Matrix Input

To demonstrate the functionality of FuseSOM, we will use the Risom 2022 dataset, which profiles the spatial landscape of ductal carcinoma in situ (DCIS). We will be using the markers used in the original study to perform clustering.

# load in the data
data("risom_dat")

# define the markers of interest
risomMarkers <- c('CD45','SMA','CK7','CK5','VIM','CD31','PanKRT','ECAD',
                   'Tryptase','MPO','CD20','CD3','CD8','CD4','CD14','CD68','FAP',
                   'CD36','CD11c','HLADRDPDQ','P63','CD44')

# we will be using the manual_gating_phenotype as the true cell type to gauge 
# performance
names(risom_dat)[names(risom_dat) == 'manual_gating_phenotype'] <- 'CellType'

Now that we have loaded the data and defined the markers of interest, we can run the FuseSOM algorithm using the runFuseSOM function. We specify the number of clusters to be 23 based on prior domain knowledge. The output contains the cluster labels as well as the Self Organizing Map model.

risomRes <- runFuseSOM(data = risom_dat, markers = risomMarkers, 
                        numClusters = 23)

Lets look at the distribution of the clusters.

# get the distribution of the clusters
round(table(risomRes$clusters)/sum(table(risomRes$clusters)), 2)

 cluster_1 cluster_10 cluster_11 cluster_12 cluster_13 cluster_14 cluster_15 
      0.32       0.04       0.01       0.02       0.06       0.03       0.02 
cluster_16 cluster_17 cluster_18 cluster_19  cluster_2 cluster_20 cluster_21 
      0.03       0.02       0.08       0.02       0.01       0.05       0.01 
cluster_22 cluster_23  cluster_3  cluster_4  cluster_5  cluster_6  cluster_7 
      0.05       0.07       0.00       0.01       0.04       0.06       0.02 
 cluster_8  cluster_9 
      0.01       0.01 

It appears that 32% of cells have been assigned to cluster_1. Next, lets generate a heatmap of the marker expression for each cluster using the markerHeatmap function.

risomHeat <- FuseSOM::markerHeatmap(data = risom_dat, markers = risomMarkers,
                            clusters = risomRes$clusters, clusterMarkers = TRUE)

Common problems with clustering

How do I identify imperfect clustering?

  1. Do our cell-type specific markers clearly separate out by cluster? We expect to see discrete expression of our markers in specific cell types, e.g., CD4 being expressed in T cells exclusively.
  2. If we instead see “smearing” of our markers across clusters, where several clusters express high levels of a cell type specific marker such as CD4, it is likely a normalisation issue.
Remedying imperfect clustering

Three common issues which cause imperfect clustering have been outlined below:

  1. Imperfect segmentation: excessive lateral marker spill over can severely impact downstream clustering, as cell type specific markers leak into nearby cells. This should largely be diagnosed in the segmentation step and will need to be fixed by optimising the upstream segmentation algorithm.
  2. Imperfect normalization: excessively variable intensities across images could cause issues in the normalization process. This can generally be diagnosed with density plots and box plots for specific markers across images and can be fixed by identifying the exact issue, e.g. extremely high values for a small subset of images, and choosing a normalization strategy to remove/reduce this effect.
  3. Imperfect clustering: choosing a k that’s too low or too high could lead to imperfect clustering. This is usually diagnosed by clusters which either express too many markers very highly or express too few markers, and is usually remedied by choosing an ideal k based on an elbow plot described below.

6.2.2 Using FuseSOM to estimate the number of clusters

When the number of expected cell types or clusters is not known beforehand, the estimateNumCluster function can be used to estimate the number of clusters. Two methods have been developed to calculate the number of clusters:

  1. Discriminant based method:
    • A method developed in house based on discriminant based maximum clusterability projection pursuit
  2. Distance based methods which includes:
    • The Gap Statistic
    • The Jump Statistic
    • The Slope Statistic
    • The Within Cluster Dissimilarity Statistic
    • The Silhouette Statistic

We run estimateNumCluster and specify method = c("Discriminant", "Distance") to use both approaches.

# lets estimate the number of clusters using all the methods
# original clustering has 23 clusters so we will set kseq from 2:25
# we pass it the SOM model generated in the previous step
risomKest <- estimateNumCluster(data = risomRes$model, kSeq = 2:25, 
                                  method = c("Discriminant", "Distance"))

We can then use this result to determine the best number of clusters for this dataset based on the different metrics.

# what is the best number of clusters determined by the discriminant method?
risomKest$Discriminant 
[1] 7

According to the Discriminant method, the optimal number of clusters is 7.

We can use the optiPlot() function to generate an elbow plot with the optimal number of clusters for the distance based methods.

# we can plot the results using the optiplot function
pSlope <- optiPlot(risomKest, method = 'slope')
pSlope

pJump <- optiPlot(risomKest, method = 'jump')
pJump

pWcd <- optiPlot(risomKest, method = 'wcd')
pWcd

pGap <- optiPlot(risomKest, method = 'gap')
pGap

pSil <- optiPlot(risomKest, method = 'silhouette')
pSil

From the plots, we see that the Jump statistic almost perfectly captures the correct number of clusters. The Gap statistic is a close second with 15 clusters. All the other methods significantly underestimate the number of clusters.

Estimating the value of k
  1. How do we choose our k? We’re generally looking for the k before the point of greatest inflection, or the point beyond which increasing k results in minimal improvement to clustering quality.
  2. Is there one best choice for k? There can be several options of k if there are several points of inflection. Choose the k which best reflects the number of clusters you expect to get from the tissue. For instance, if you are interested in broader cell populations, you might pick a lower value of k , and if you are interested in identifying subpopulations, you might pick a larger value for k .

6.2.3 FuseSOM with SingleCellExperiment object as input

The FuseSOM algorithm is also equipped to take in a SingleCellExperiment object as input. The results of the pipeline will be written to either the metadata or the colData fields.

First, we create a SingleCellExperiment object using the Risom 2022 data.

# create an SCE object using Risom 2022 data
colDat <- risom_dat[, setdiff(colnames(risom_dat), risomMarkers)]
sce <- SingleCellExperiment(assays = list(counts = t(risom_dat[, names(risom_dat) != "CellType"])),
                                 colData = colDat)

sce
class: SingleCellExperiment 
dim: 22 69672 
metadata(0):
assays(1): counts
rownames(22): CD45 SMA ... P63 CD44
rowData names(0):
colnames: NULL
colData names(1): X
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):

Next, we pass it to the runFuseSOM function. Here, we can provide the assay in which the data is stored (counts) and specify the column to store the clusters in using clusterCol = "clusters". The Self Organizing Map that is generated will be stored in the metadata field.

risomRessce <- runFuseSOM(sce, markers = risomMarkers, clusterCol = "clusters",
                          assay = 'counts', numClusters = 23, verbose = FALSE)

colnames(colData(risomRessce))
[1] "X"        "clusters"
names(metadata(risomRessce))
[1] "SOM"

Notice how the there is now a clusters column in the colData and a SOM field in the metadata.

If necessary, you can run runFuseSOM with a new cluster number and specify a new clusterCol. If clusterCol contains a new name, the new clusters will be stored in the new column. Otherwise, it will overwrite the the current clusters column. Running FuseSOM on the same object will overwrite the SOM field in the metadata.

Just like before, we can plot a heatmap of the resulting clusters across all markers.

data <- risom_dat[, risomMarkers] # get the original data used
clusters <- colData(risomRessce)$clusters # extract the clusters from the SCE object

# generate the heatmap
risomHeatsce <- markerHeatmap(data = risom_dat, markers = risomMarkers,
                            clusters = clusters, clusterMarkers = TRUE)

Or we can directly plot from the SCE using the scater package.

# Visualise marker expression in each cluster.
scater::plotGroupedHeatmap(
  risomRessce,
  features = risomMarkers,
  group = "clusters",
  exprs_values = "counts",
  center = TRUE,
  scale = TRUE,
  zlim = c(-3, 3),
  cluster_rows = FALSE,
  block = "clusters"
)

6.2.4 Using FuseSOM to estimate the number of clusters for SingleCellExperiment objects

Just like before, we will use estimateNumCluster on our Risom SingleCellExperiment object.

# lets estimate the number of clusters using all the methods
# original clustering has 23 clusters so we will set kseq from 2:25
risomRessce <- estimateNumCluster(data = risomRessce, kSeq = 2:25, 
                                  method = c("Discriminant", "Distance"))
You have provided a dataset of class: SingleCellExperiment
Now Computing the Number of Clusters using Discriminant Analysis
Now Computing The Number Of Clusters Using Distance Analysis
names(metadata(risomRessce))
[1] "SOM"               "clusterEstimation"

The metadata now contains a clusterEstimation field which holds the results from the estimateNumCluster function.

We can assess the results of cluster estimation as below.

# what is the best number of clusters determined by the discriminant method?
metadata(risomRessce)$clusterEstimation$Discriminant 
[1] 7

According to the discrminant method, the optimal number of clusters is 7.

# we can plot the results using the optiplot function
pSlope <- optiPlot(risomRessce, method = 'slope')
pSlope

pJump <- optiPlot(risomRessce, method = 'jump')
pJump

pWcd <- optiPlot(risomRessce, method = 'wcd')
pWcd

pGap <- optiPlot(risomRessce, method = 'gap')
pGap

pSil <- optiPlot(risomRessce, method = 'silhouette')
pSil

Again, we see that the Jump statistic almost perfectly captures the correct number of clusters with 24 clusters. The Gap method is a close second with 15 clusters. All the other methods significantly underestimate the number of clusters.

In the next section, we will demonstrate how our packages can be used to perform supervised cell annotation, and then discuss how to choose between clustering and annotation approaches.

6.3 sessionInfo

R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.4.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Australia/Sydney
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] ggplot2_3.5.1               scuttle_1.16.0             
 [3] simpleSeg_1.8.0             MLmetrics_1.1.3            
 [5] STexampleData_1.14.0        SpatialExperiment_1.16.0   
 [7] SingleCellExperiment_1.28.1 SummarizedExperiment_1.36.0
 [9] Biobase_2.66.0              GenomicRanges_1.58.0       
[11] GenomeInfoDb_1.42.0         IRanges_2.40.0             
[13] S4Vectors_0.44.0            MatrixGenerics_1.18.0      
[15] matrixStats_1.4.1           ExperimentHub_2.14.0       
[17] AnnotationHub_3.14.0        BiocFileCache_2.14.0       
[19] dbplyr_2.5.0                BiocGenerics_0.52.0        
[21] FuseSOM_1.8.0              

loaded via a namespace (and not attached):
  [1] splines_4.4.1            later_1.4.1              bitops_1.0-9            
  [4] ggplotify_0.1.2          filelock_1.0.3           tibble_3.2.1            
  [7] svgPanZoom_0.3.4         polyclip_1.10-7          lifecycle_1.0.4         
 [10] rstatix_0.7.2            fastcluster_1.2.6        prabclus_2.3-4          
 [13] lattice_0.22-6           MASS_7.3-61              backports_1.5.0         
 [16] magrittr_2.0.3           rmarkdown_2.29           yaml_2.3.10             
 [19] httpuv_1.6.15            flexmix_2.3-19           sp_2.1-4                
 [22] DBI_1.2.3                RColorBrewer_1.1-3       abind_1.4-8             
 [25] zlibbioc_1.52.0          purrr_1.0.2              RCurl_1.98-1.16         
 [28] nnet_7.3-19              yulab.utils_0.1.8        rappdirs_0.3.3          
 [31] GenomeInfoDbData_1.2.13  ggrepel_0.9.6            irlba_2.3.5.1           
 [34] spatstat.utils_3.1-1     terra_1.7-78             pheatmap_1.0.12         
 [37] vegan_2.6-8              svglite_2.1.3            permute_0.9-7           
 [40] codetools_0.2-20         DelayedArray_0.32.0      tidyselect_1.2.1        
 [43] raster_3.6-30            farver_2.1.2             UCSC.utils_1.2.0        
 [46] ScaledMatrix_1.14.0      viridis_0.6.5            jsonlite_1.8.9          
 [49] BiocNeighbors_2.0.0      Formula_1.2-5            scater_1.34.0           
 [52] systemfonts_1.1.0        tools_4.4.1              Rcpp_1.0.13-1           
 [55] glue_1.8.0               mnormt_2.1.1             gridExtra_2.3           
 [58] SparseArray_1.6.0        xfun_0.49                mgcv_1.9-1              
 [61] EBImage_4.48.0           dplyr_1.1.4              HDF5Array_1.34.0        
 [64] withr_3.0.2              shinydashboard_0.7.2     BiocManager_1.30.25     
 [67] fastmap_1.2.0            rhdf5filters_1.18.0      fansi_1.0.6             
 [70] rsvd_1.0.5               digest_0.6.37            R6_2.5.1                
 [73] mime_0.12                gridGraphics_0.5-1       DataVisualizations_1.3.2
 [76] colorspace_2.1-1         spatstat.data_3.1-4      jpeg_0.1-10             
 [79] RSQLite_2.3.8            diptest_0.77-1           utf8_1.2.4              
 [82] tidyr_1.3.1              generics_0.1.3           robustbase_0.99-4-1     
 [85] class_7.3-22             httr_1.4.7               htmlwidgets_1.6.4       
 [88] S4Arrays_1.6.0           pkgconfig_2.0.3          gtable_0.3.6            
 [91] modeltools_0.2-23        blob_1.2.4               XVector_0.46.0          
 [94] htmltools_0.5.8.1        carData_3.0-5            fftwtools_0.9-11        
 [97] scales_1.3.0             png_0.1-8                spatstat.univar_3.1-1   
[100] knitr_1.49               rstudioapi_0.17.1        rjson_0.2.23            
[103] nlme_3.1-166             curl_6.0.1               proxy_0.4-27            
[106] cachem_1.1.0             rhdf5_2.50.0             stringr_1.5.1           
[109] BiocVersion_3.20.0       parallel_4.4.1           vipor_0.4.7             
[112] AnnotationDbi_1.68.0     pillar_1.9.0             grid_4.4.1              
[115] vctrs_0.6.5              promises_1.3.2           ggpubr_0.6.0            
[118] BiocSingular_1.22.0      car_3.1-3                beachmat_2.22.0         
[121] xtable_1.8-4             cluster_2.1.6            princurve_2.1.6         
[124] beeswarm_0.4.0           evaluate_1.0.1           magick_2.8.5            
[127] cli_3.6.3                locfit_1.5-9.10          compiler_4.4.1          
[130] rlang_1.1.4              crayon_1.5.3             analogue_0.17-7         
[133] ggsignif_0.6.4           labeling_0.4.3           mclust_6.1.1            
[136] fs_1.6.5                 ggbeeswarm_0.7.2         stringi_1.8.4           
[139] psych_2.4.6.26           deldir_2.0-4             viridisLite_0.4.2       
[142] BiocParallel_1.40.0      nnls_1.6                 FCPS_1.3.4              
[145] cytomapper_1.18.0        munsell_0.5.1            Biostrings_2.74.0       
[148] tiff_0.1-12              coop_0.6-3               profileModel_0.6.1      
[151] spatstat.geom_3.3-4      Matrix_1.7-1             bit64_4.5.2             
[154] Rhdf5lib_1.28.0          fpc_2.2-13               KEGGREST_1.46.0         
[157] shiny_1.9.1              brglm_0.7.2              kernlab_0.9-33          
[160] broom_1.0.7              memoise_2.0.1            DEoptimR_1.1-3-1        
[163] bit_4.5.0