4  Cell segmentation and pre-processing

The typical first step in analysing spatial data is cell segmentation, which involves identifying and isolating individual cells in each image. This step is essential for assigning cell type labels and evaluating cellular relationships within the tissue.

In this section, we will outline the process of reading and pre-processing images obtained from various imaging technologies to prepare them for downstream analysis. Additionally, we will demonstrate how our simpleSeg package facilitates efficient and straightforward cell segmentation, as well as how to evaluate the quality of the segmentation results.

# load required libraries
library(cytomapper)
library(ggplot2)
library(simpleSeg)

We recommend setting the number of cores to enable running code in parallel. Please choose a number that is appropriate for your resources. A minimum of 2 cores is suggested since running this workflow can be computationally intensive.

# set parameters
set.seed(51773)
use_mc <- TRUE
if (use_mc) {
  nCores <- max(parallel::detectCores()/2, 1)
} else {
  nCores <- 1
}
BPPARAM <- simpleSeg:::generateBPParam(nCores)
theme_set(theme_classic())

4.1 Reading in data with cytomapper

We will be using the Ferguson 2022 dataset to demonstrate how to perform pre-processing and cell segmentation. This dataset can be accessed through the SpatialDatasets package and is available in the form of single-channel TIFF images. In single-channel images, each pixel represents intensity values for a single marker. The loadImages function from the cytomapper package can be used to load all the TIFF images into a CytoImageList object and store the images as an h5 file on-disk in a temporary directory using the h5FilesPath = HDF5Array::getHDF5DumpDir() parameter.

We will also assign the metadata columns of the CytoImageList object using the mcols function.

pathToZip <- SpatialDatasets::Ferguson_Images()
pathToImages <- "data/processing/images"
unzip(pathToZip, exdir = "data/processing/images")

# store images in a CytoImageList on_disk as h5 files to save memory
images <- cytomapper::loadImages(
  pathToImages,
  single_channel = TRUE,
  on_disk = TRUE,
  h5FilesPath = HDF5Array::getHDF5DumpDir(),
  BPPARAM = BPPARAM
)

# assign metadata columns
mcols(images) <- S4Vectors::DataFrame(imageID = names(images))

Time for this code chunk to run with 5.5 cores: 156.64 seconds

When reading the image channels directly from the names of the TIFF images, they will often need to be cleaned for ease of downstream processing. The channel names can be accessed from the CytoImageList object using the channelNames function.

channelNames(images) <- channelNames(images) |>
                          # remove preceding letters
                          sub(pattern = ".*_", replacement = "", x = _) |> 
                          # remove the .ome
                          sub(pattern = ".ome", replacement = "", x = _)

Similarly, the image names will be taken from the folder name containing the individual TIFF images for each channel. These will often also need to be cleaned.

# cleaning image names to obtain image IDs
split_names <- function(x) {
  sapply(strsplit(x, "_"), `[`, 3)
}

names(images) <- names(images) |> split_names()
mcols(images) <- S4Vectors::DataFrame(imageID = names(images))

4.2 Cell segmentation with simpleSeg

For the sake of simplicity and efficiency, we will be subsetting our images down to 10 images.

images <- images[1:10]

Next, we can perform our segmentation. The simpleSeg package on Bioconductor provides functionality for user friendly, watershed based segmentation on multiplexed cellular images based on the intensity of user-specified protein marker channels. The main function, simpleSeg, can be used to perform a simple cell segmentation process that traces out the nuclei using a specified channel.

In the particular example below, we have asked simpleSeg to do the following:

  • nucleus = c("HH3"): trace out the nuclei signal in the images using the HH3 channel.
  • pca = TRUE: segment out the nuclei mask using principal component analysis of all channels and using the principal components most aligned with the nuclei channel (in this case, HH3).
  • cellBody = "dilate": use a dilation strategy of segmentation, expanding out from the nucleus by a specified discSize. In this case, discSize = 3, which means simpleSeg dilates out from the nucleus by 3 pixels.
  • sizeSelection = 20: ensure that only cells with a size greater than 20 pixels will be used.
  • transform = "sqrt": perform square root transformation on each of the channels prior to segmentation.
  • tissue = c("panCK", "CD45", "HH3"): use the specified tissue mask to filter out all background noise outside the tissue mask. This allows us to ignore background noise which happens outside of the tumour core.

There are many other parameters that can be specified in simpleSeg (smooth, watershed, tolerance, and ext), and we encourage the user to select the parameters which best suit their biological context.

masks <- simpleSeg(images,
                   nucleus = c("HH3"),
                   pca = TRUE,
                   cellBody = "dilate",
                   discSize = 3,
                   sizeSelection = 20,
                   transform = "sqrt",
                   tissue = c("panCK", "CD45", "HH3"),
                   cores = nCores)

Time for this code chunk to run with 5.5 cores: 53.68 seconds

4.2.1 Visualise separation

We can examine the performance of the cell segmentation using the display and colorLabels functions from the EBImage package. If used in an interactive session, display allows you to zoom in and out of the image.

# display image F3
EBImage::display(colorLabels(masks[[1]]))

4.2.2 Visualise outlines

The plotPixels function from the cytomapper package makes it easy to overlay the mask on top of the nucleus intensity marker to see how well our segmentation process has performed.

plotPixels(image = images["F3"], 
           mask = masks["F3"],
           img_id = "imageID", 
           colour_by = c("HH3"), 
           display = "single",
           colour = list(HH3 = c("black","blue")),
           legend = NULL,
           bcg = list(
             HH3 = c(1, 1, 2)
           ))

Here, we can see that the segmentation appears to be performing reasonably.

If you see over or under-segmentation of your images, discSize is a key parameter in the simpleSeg function for optimising the size of the dilation disc after segmenting out the nuclei.

We can also visualise multiple markers at once instead of just the HH3 marker to see how the segmentation mask performs.

plotPixels(image = images["F3"], 
           mask = masks["F3"],
           img_id = "imageID", 
           colour_by = c("HH3", "CD31", "FX111A"), 
           display = "single",
           colour = list(HH3 = c("black","blue"),
                         CD31 = c("black", "red"),
                         FX111A = c("black", "green")),
           legend = NULL,
           bcg = list(
             HH3 = c(1, 1, 2),
             CD31 = c(0, 1, 2),
             FX111A = c(0, 1, 1.5)
           ))

Tip

What to look for and change to obtain an ideal segmentation

  1. Does the segmentation capture the full nucleus? If not, perhaps you need to try a different transformation to improve the thresholding of the nuclei marker. You could also try using pca = TRUE which will borrow information across the markers to help find the nuclei.
  2. How much of the cell body is the segmentation missing? Try increasing the dilation around the nucleus by setting discSize = 7.
  3. Are the segmentations capturing neighbouring cells? Try decreasing the dilation to limit lateral spillover of marker signal by setting discSize = 2.

Here, we can see that our segmentation mask has done a good job of capturing the CD31 signal, but perhaps not such a good job of capturing the FXIIIA signal, which often lies outside of our dilated nuclear mask. This suggests that we might need to increase the discSize or other parameters of simpleSeg.

In particular, the cellBody and watershed parameters can strongly influence the way cells are segmented using simpleSeg. We have provided further details on how the user may specify cell body identification and watershedding in the tables below.

As simpleSeg is a nuclei-based dilation method, it suffers from tissues where cells might be multi-nucleated, or where cells have non-circular or elliptical morphologies. For tissues where you might expect these cells, it may be preferable to choose a different segmentation method.

4.2.2.1 cellBody Parameters

Method Description
“distance” Performs watershedding on a distance map of the thresholded nuclei signal. With a pixels distance being defined as the distance from the closest background signal.
“intensity” Performs watershedding using the intensity of the nuclei marker.
“combine” Combines the previous two methods by multiplying the distance map by the nuclei marker intensity.

4.2.2.2 watershed Parameters

Method Description
“dilation” Dilates the nuclei by an amount defined by the user. The size of the dilatation in pixels may be specified with the discDize argument.
“discModel” Uses all the markers to predict the presence of dilated ‘discs’ around the nuclei. The model therefore learns which markers are typically present in the cell cytoplasm and generates a mask based on this.
“marker” The user may specify one or multiple dedicated cytoplasm markers to predict the cytoplasm. This can be done using cellBody = "marker name"/"index"
“None” The nuclei mask is returned directly.

4.3 Summarise cell features

We can use the measureObjects from cytomapper to calculate the average intensity of each channel within each cell, as well as other morphological features. By default, measureObjects will return a SingleCellExperiment object, where the channel intensities are stored in the counts assay and the spatial location of each cell is stored in colData in the m.cx and m.cy columns.

However, you can also specify measureObjects to return a SpatialExperiment object by specifying return_as = "spe". In a SpatialExperiment object, the spatial location of each cell is stored in the spatialCoords slot as m.cx and m.cy, which simplifies plotting. In this demonstration, we will return a SpatialExperiment object.

# summarise the expression of each marker in each cell
cells <- cytomapper::measureObjects(masks,
                                    images,
                                    img_id = "imageID",
                                    return_as = "spe",
                                    BPPARAM = BPPARAM)

spatialCoordsNames(cells) <- c("x", "y")

cells
class: SpatialExperiment 
dim: 36 37953 
metadata(0):
assays(1): counts
rownames(36): panCK CD20 ... DNA1 DNA2
rowData names(0):
colnames: NULL
colData names(8): imageID object_id ... sample_id objectNum
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
spatialCoords names(2) : x y
imgData names(1): sample_id

Time for this code chunk to run with 5.5 cores: 48.93 seconds

So far, we have processed our raw TIFF images, performed cell segmentation to isolate individual cells, and then stored our data as a SpatialExperiment object. We can now move on to quality control, data transformation, and normalisation to address batch effects.

4.4 sessionInfo

R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.4.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Australia/Sydney
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] SpatialDatasets_1.4.0       SpatialExperiment_1.16.0   
 [3] ExperimentHub_2.14.0        AnnotationHub_3.14.0       
 [5] BiocFileCache_2.14.0        dbplyr_2.5.0               
 [7] simpleSeg_1.8.0             ggplot2_3.5.1              
 [9] cytomapper_1.18.0           SingleCellExperiment_1.28.1
[11] SummarizedExperiment_1.36.0 Biobase_2.66.0             
[13] GenomicRanges_1.58.0        GenomeInfoDb_1.42.0        
[15] IRanges_2.40.0              S4Vectors_0.44.0           
[17] BiocGenerics_0.52.0         MatrixGenerics_1.18.0      
[19] matrixStats_1.4.1           EBImage_4.48.0             

loaded via a namespace (and not attached):
  [1] DBI_1.2.3               bitops_1.0-9            deldir_2.0-4           
  [4] gridExtra_2.3           rlang_1.1.4             magrittr_2.0.3         
  [7] svgPanZoom_0.3.4        shinydashboard_0.7.2    RSQLite_2.3.8          
 [10] compiler_4.4.1          spatstat.geom_3.3-4     png_0.1-8              
 [13] systemfonts_1.1.0       fftwtools_0.9-11        vctrs_0.6.5            
 [16] pkgconfig_2.0.3         crayon_1.5.3            fastmap_1.2.0          
 [19] magick_2.8.5            XVector_0.46.0          utf8_1.2.4             
 [22] promises_1.3.2          rmarkdown_2.29          UCSC.utils_1.2.0       
 [25] ggbeeswarm_0.7.2        purrr_1.0.2             bit_4.5.0              
 [28] xfun_0.49               cachem_1.1.0            zlibbioc_1.52.0        
 [31] jsonlite_1.8.9          blob_1.2.4              later_1.4.1            
 [34] rhdf5filters_1.18.0     DelayedArray_0.32.0     spatstat.utils_3.1-1   
 [37] Rhdf5lib_1.28.0         BiocParallel_1.40.0     jpeg_0.1-10            
 [40] tiff_0.1-12             terra_1.7-78            parallel_4.4.1         
 [43] R6_2.5.1                RColorBrewer_1.1-3      spatstat.data_3.1-4    
 [46] spatstat.univar_3.1-1   Rcpp_1.0.13-1           knitr_1.49             
 [49] httpuv_1.6.15           Matrix_1.7-1            nnls_1.6               
 [52] tidyselect_1.2.1        yaml_2.3.10             rstudioapi_0.17.1      
 [55] abind_1.4-8             viridis_0.6.5           codetools_0.2-20       
 [58] curl_6.0.1              lattice_0.22-6          tibble_3.2.1           
 [61] KEGGREST_1.46.0         shiny_1.9.1             withr_3.0.2            
 [64] evaluate_1.0.1          polyclip_1.10-7         Biostrings_2.74.0      
 [67] filelock_1.0.3          BiocManager_1.30.25     pillar_1.9.0           
 [70] generics_0.1.3          sp_2.1-4                RCurl_1.98-1.16        
 [73] BiocVersion_3.20.0      munsell_0.5.1           scales_1.3.0           
 [76] xtable_1.8-4            glue_1.8.0              tools_4.4.1            
 [79] locfit_1.5-9.10         rhdf5_2.50.0            grid_4.4.1             
 [82] AnnotationDbi_1.68.0    colorspace_2.1-1        GenomeInfoDbData_1.2.13
 [85] raster_3.6-30           beeswarm_0.4.0          HDF5Array_1.34.0       
 [88] vipor_0.4.7             cli_3.6.3               rappdirs_0.3.3         
 [91] fansi_1.0.6             S4Arrays_1.6.0          viridisLite_0.4.2      
 [94] svglite_2.1.3           dplyr_1.1.4             gtable_0.3.6           
 [97] digest_0.6.37           SparseArray_1.6.0       rjson_0.2.23           
[100] htmlwidgets_1.6.4       memoise_2.0.1           htmltools_0.5.8.1      
[103] lifecycle_1.0.4         httr_1.4.7              mime_0.12              
[106] bit64_4.5.2