5  Quality Control

With cell segmentation complete, we can move on to performing quality control checks. In many spatial imaging protocols, there tends to be a degree of variability in the intensity of each image. For example, in one image, the CD3 stain may be too strong, but in another image the CD3 staining may be particularly weak. This variability is often inevitable and can be hard to correct for during the imaging process. Hence, it is important that we identify when such variance occurs and correct it. The simpleSeg package includes additional functions for detecting batch effects and normalising expression data to minimise their impact.

# set parameters
set.seed(51773)
use_mc <- TRUE
if (use_mc) {
  nCores <- max(parallel::detectCores()/2, 1)
} else {
  nCores <- 1
}
BPPARAM <- simpleSeg:::generateBPParam(nCores)
theme_set(theme_classic())

5.1 simpleSeg: Do my images have a batch effect?

First, let’s load in the images we previously segmented out in the last section. The SpatialDatasets package conveniently provides the segmented out images for the Ferguson 2022 dataset.

# load in segmented data
fergusonSPE <- SpatialDatasets::spe_Ferguson_2022()

Next, we can check if the marker intensities of each cell require some form of transformation or normalisation. The reason we do this is two-fold:

  1. The intensities of images are often highly skewed, preventing any meaningful downstream analysis.
  2. The intensities across different images are often different, meaning that what is considered “positive” can be different across images.

By transforming and normalising the data, we aim to reduce these two effects. Below, we extract the marker intensities from the counts assay and take a closer look at the CD3 marker, which should be expressed in the majority of T cells.

# plot densities of CD3 for each image
fergusonSPE |> 
  join_features(features = rownames(fergusonSPE), shape = "wide", assay = "counts") |> 
  ggplot(aes(x = CD3, colour = imageID)) + 
  geom_density() + 
  theme(legend.position = "none")

Tip

What we’re looking for

  1. Do the CD3+ and CD3- peaks clearly separate out in the density plot? To ensure that downstream clustering goes smoothly, we want our cell type specific markers to show 2 distinct peaks representing our CD3+ and CD3- cells. If we see 3 or more peaks where we don’t expect, this might be an indicator that further normalisation is required.
  2. Are our CD3+ and CD3- peaks consistent across our images? We want to make sure that our density plots for CD3 are largely the same across images so that a CD3+ cell in one image is equivalent to a CD3+ cell in another image.

Here, we can see that the intensities are very clearly skewed, and it is difficult to distinguish a CD3- cell from a CD3+ cell. Further, we can clearly see some image-level batch effects, where across images, the intensity peaks differ drastically.

Another method of visualising batch effects is using a dimensionality reduction technique and visualising how the images separate out on a 2D plot. If no batch effect is expected, we should see the images largely overlap with each other.

set.seed(51773)

# specify a subset of informative markers for UMAP and clustering
ct_markers <- c("podoplanin", "CD13", "CD31",
                "panCK", "CD3", "CD4", "CD8a",
                "CD20", "CD68", "CD16", "CD14", 
                "HLADR", "CD66a")

# perform dimension reduction using UMAP
fergusonSPE <- scater::runUMAP(
  fergusonSPE,
  subset_row = ct_markers,
  exprs_values = "counts"
)

# select a subset of images to plot
someImages <- unique(fergusonSPE$imageID)[c(1, 5, 10, 20, 30, 40)]

# UMAP by imageID
scater::plotReducedDim(
  fergusonSPE[, fergusonSPE$imageID %in% someImages],
  dimred = "UMAP",
  colour_by = "imageID"
)

Time for this code chunk to run with 5.5 cores: 47.35 seconds

The UMAP also indicates that some level of batch effect exists in our dataset.

We can use the normalizeCells function from the simpleSeg package to correct for image-level batch effects. We specify the following parameters:

  • transformation is an optional argument which specifies the function to be applied to the data. We do not apply an arcsinh transformation here, as we already apply a square root transform in the simpleSeg function.
  • method = c("trim99", "mean", PC1") is an optional argument which specifies the normalisation method(s) to be performed. A comprehensive table of methods is provided below.
  • assayIn = "counts" is a required argument which specifies the name of the assay that contains our intensity data. In our context, this is called counts.
# leave out the nuclei markers from our normalisation process
useMarkers <- rownames(fergusonSPE)[!rownames(fergusonSPE) %in% c("DNA1", "DNA2", "HH3")]

# transform and normalise the marker expression of each cell type
fergusonSPE <- normalizeCells(fergusonSPE,
                        markers = useMarkers,
                        transformation = NULL,
                        method = c("trim99", "mean", "PC1"),
                        assayIn = "counts",
                        cores = nCores)

This modified data is then stored in the norm assay by default, but can be changed using the assayOut parameter.

5.1.0.1 method Parameters

Method Description
“mean” Divides the marker cellular marker intensities by their mean.
“minMax” Subtracts the minimum value and scales markers between 0 and 1.
“trim99” Sets the highest 1% of values to the value of the 99th percentile.`
“PC1” Removes the 1st principal component) can be performed with one call of the function, in the order specified by the user.

We can then plot the same density curve for the CD3 marker using the normalised data.

# plot densities of CD3 for each image
fergusonSPE |> 
  join_features(features = rownames(fergusonSPE), shape = "wide", assay = "norm") |> 
  ggplot(aes(x = CD3, colour = imageID)) + 
  geom_density() + 
  theme(legend.position = "none")

Tip

Questions revisited

  1. Do the CD3+ and CD3- peaks clearly separate out in the density plot? If not, we can try optimising the transformation if the distribution looks heavily skewed.
  2. Are our CD3+ and CD3- peaks consistent across our images? We can try to be more stringent in our normalisation, such as by removing the 1st PC (method = c(..., "PC1")) or scaling the values for all images between 0 and 1 (method = c(..., "minMax")).

Here, we can see that the normalised data appears more bimodal, and we can clearly observe a CD3+ peak at 5.00, and a CD3- peak at around 3.00. Image-level batch effects also appear to have been mitigated.

We can also visualise the effect of normalisation on the UMAP, which shows that the images now overlap with each other to a much greater extent.

set.seed(51773)

# perform dimension reduction using UMAP
fergusonSPE <- scater::runUMAP(
  fergusonSPE,
  subset_row = ct_markers,
  exprs_values = "norm",
  name = "normUMAP"
)

someImages <- unique(fergusonSPE$imageID)[c(1, 5, 10, 20, 30, 40)]

# UMAP by imageID
scater::plotReducedDim(
  fergusonSPE[, fergusonSPE$imageID %in% someImages],
  dimred = "normUMAP",
  colour_by = "imageID"
)

Time for this code chunk to run with 5.5 cores: 46.42 seconds

Now that we have completed quality control checks and normalised the expression data to address variability and batch effects, we can proceed to the next step: clustering and cell annotation.

5.2 sessionInfo

R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.4.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Australia/Sydney
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] SpatialDatasets_1.4.0           SpatialExperiment_1.16.0       
 [3] ExperimentHub_2.14.0            AnnotationHub_3.14.0           
 [5] BiocFileCache_2.14.0            dbplyr_2.5.0                   
 [7] scater_1.34.0                   scuttle_1.16.0                 
 [9] simpleSeg_1.8.0                 ggplot2_3.5.1                  
[11] ttservice_0.4.1                 tidyr_1.3.1                    
[13] dplyr_1.1.4                     tidySingleCellExperiment_1.16.0
[15] SingleCellExperiment_1.28.1     SummarizedExperiment_1.36.0    
[17] Biobase_2.66.0                  GenomicRanges_1.58.0           
[19] GenomeInfoDb_1.42.0             IRanges_2.40.0                 
[21] S4Vectors_0.44.0                BiocGenerics_0.52.0            
[23] MatrixGenerics_1.18.0           matrixStats_1.4.1              

loaded via a namespace (and not attached):
  [1] RColorBrewer_1.1-3      rstudioapi_0.17.1       jsonlite_1.8.9         
  [4] magrittr_2.0.3          spatstat.utils_3.1-1    ggbeeswarm_0.7.2       
  [7] magick_2.8.5            farver_2.1.2            rmarkdown_2.29         
 [10] zlibbioc_1.52.0         vctrs_0.6.5             memoise_2.0.1          
 [13] RCurl_1.98-1.16         terra_1.7-78            svgPanZoom_0.3.4       
 [16] htmltools_0.5.8.1       S4Arrays_1.6.0          curl_6.0.1             
 [19] BiocNeighbors_2.0.0     raster_3.6-30           Rhdf5lib_1.28.0        
 [22] SparseArray_1.6.0       rhdf5_2.50.0            htmlwidgets_1.6.4      
 [25] cachem_1.1.0            plotly_4.10.4           mime_0.12              
 [28] lifecycle_1.0.4         pkgconfig_2.0.3         rsvd_1.0.5             
 [31] Matrix_1.7-1            R6_2.5.1                fastmap_1.2.0          
 [34] GenomeInfoDbData_1.2.13 shiny_1.9.1             digest_0.6.37          
 [37] colorspace_2.1-1        AnnotationDbi_1.68.0    irlba_2.3.5.1          
 [40] RSQLite_2.3.8           beachmat_2.22.0         labeling_0.4.3         
 [43] filelock_1.0.3          fansi_1.0.6             nnls_1.6               
 [46] httr_1.4.7              polyclip_1.10-7         abind_1.4-8            
 [49] compiler_4.4.1          bit64_4.5.2             withr_3.0.2            
 [52] tiff_0.1-12             BiocParallel_1.40.0     DBI_1.2.3              
 [55] viridis_0.6.5           HDF5Array_1.34.0        cytomapper_1.18.0      
 [58] rappdirs_0.3.3          DelayedArray_0.32.0     rjson_0.2.23           
 [61] tools_4.4.1             vipor_0.4.7             beeswarm_0.4.0         
 [64] httpuv_1.6.15           glue_1.8.0              EBImage_4.48.0         
 [67] rhdf5filters_1.18.0     promises_1.3.2          grid_4.4.1             
 [70] generics_0.1.3          gtable_0.3.6            spatstat.data_3.1-4    
 [73] data.table_1.16.2       BiocSingular_1.22.0     ScaledMatrix_1.14.0    
 [76] sp_2.1-4                utf8_1.2.4              XVector_0.46.0         
 [79] RcppAnnoy_0.0.22        spatstat.geom_3.3-4     BiocVersion_3.20.0     
 [82] ggrepel_0.9.6           pillar_1.9.0            stringr_1.5.1          
 [85] later_1.4.1             lattice_0.22-6          bit_4.5.0              
 [88] deldir_2.0-4            tidyselect_1.2.1        locfit_1.5-9.10        
 [91] Biostrings_2.74.0       knitr_1.49              gridExtra_2.3          
 [94] svglite_2.1.3           xfun_0.49               shinydashboard_0.7.2   
 [97] stringi_1.8.4           UCSC.utils_1.2.0        fftwtools_0.9-11       
[100] yaml_2.3.10             lazyeval_0.2.2          evaluate_1.0.1         
[103] codetools_0.2-20        tibble_3.2.1            BiocManager_1.30.25    
[106] cli_3.6.3               uwot_0.2.2              xtable_1.8-4           
[109] systemfonts_1.1.0       munsell_0.5.1           Rcpp_1.0.13-1          
[112] png_0.1-8               spatstat.univar_3.1-1   parallel_4.4.1         
[115] ellipsis_0.3.2          blob_1.2.4              jpeg_0.1-10            
[118] bitops_1.0-9            viridisLite_0.4.2       scales_1.3.0           
[121] purrr_1.0.2             crayon_1.5.3            rlang_1.1.4            
[124] cowplot_1.1.3           KEGGREST_1.46.0