4 Cell segmentation and pre-processing
The typical first step in analysing spatial data is cell segmentation, which involves identifying and isolating individual cells in each image. This step is essential for assigning cell type labels and evaluating cellular relationships within the tissue.
In this section, we will outline the process of reading and pre-processing images obtained from various imaging technologies to prepare them for downstream analysis. Additionally, we will demonstrate how our simpleSeg package facilitates efficient and straightforward cell segmentation, as well as how to evaluate the quality of the segmentation results.
We recommend setting the number of cores to enable running code in parallel. Please choose a number that is appropriate for your resources. A minimum of 2 cores is suggested since running this workflow can be computationally intensive.
# set parameters
set.seed(51773)
use_mc <- TRUE
if (use_mc) {
nCores <- max(parallel::detectCores()/2, 1)
} else {
nCores <- 1
}
BPPARAM <- simpleSeg:::generateBPParam(nCores)
theme_set(theme_classic())
4.1 Reading in data with cytomapper
We will be using the Ferguson 2022 dataset to demonstrate how to perform pre-processing and cell segmentation. This dataset can be accessed through the SpatialDatasets
package and is available in the form of single-channel TIFF images. In single-channel images, each pixel represents intensity values for a single marker. The loadImages
function from the cytomapper
package can be used to load all the TIFF images into a CytoImageList
object and store the images as an h5 file on-disk in a temporary directory using the h5FilesPath = HDF5Array::getHDF5DumpDir()
parameter.
We will also assign the metadata columns of the CytoImageList
object using the mcols
function.
pathToZip <- SpatialDatasets::Ferguson_Images()
pathToImages <- "data/processing/images"
unzip(pathToZip, exdir = "data/processing/images")
# store images in a CytoImageList on_disk as h5 files to save memory
images <- cytomapper::loadImages(
pathToImages,
single_channel = TRUE,
on_disk = TRUE,
h5FilesPath = HDF5Array::getHDF5DumpDir(),
BPPARAM = BPPARAM
)
# assign metadata columns
mcols(images) <- S4Vectors::DataFrame(imageID = names(images))
Time for this code chunk to run with 5.5 cores: 156.64 seconds
When reading the image channels directly from the names of the TIFF images, they will often need to be cleaned for ease of downstream processing. The channel names can be accessed from the CytoImageList
object using the channelNames
function.
channelNames(images) <- channelNames(images) |>
# remove preceding letters
sub(pattern = ".*_", replacement = "", x = _) |>
# remove the .ome
sub(pattern = ".ome", replacement = "", x = _)
Similarly, the image names will be taken from the folder name containing the individual TIFF images for each channel. These will often also need to be cleaned.
4.2 Cell segmentation with simpleSeg
For the sake of simplicity and efficiency, we will be subsetting our images down to 10 images.
images <- images[1:10]
Next, we can perform our segmentation. The simpleSeg package on Bioconductor provides functionality for user friendly, watershed based segmentation on multiplexed cellular images based on the intensity of user-specified protein marker channels. The main function, simpleSeg
, can be used to perform a simple cell segmentation process that traces out the nuclei using a specified channel.
In the particular example below, we have asked simpleSeg
to do the following:
-
nucleus = c("HH3")
: trace out the nuclei signal in the images using the HH3 channel. -
pca = TRUE
: segment out the nuclei mask using principal component analysis of all channels and using the principal components most aligned with the nuclei channel (in this case, HH3). -
cellBody = "dilate"
: use a dilation strategy of segmentation, expanding out from the nucleus by a specifieddiscSize
. In this case,discSize = 3
, which means simpleSeg dilates out from the nucleus by 3 pixels. -
sizeSelection = 20
: ensure that only cells with a size greater than 20 pixels will be used. -
transform = "sqrt"
: perform square root transformation on each of the channels prior to segmentation. -
tissue = c("panCK", "CD45", "HH3")
: use the specified tissue mask to filter out all background noise outside the tissue mask. This allows us to ignore background noise which happens outside of the tumour core.
There are many other parameters that can be specified in simpleSeg
(smooth
, watershed
, tolerance
, and ext
), and we encourage the user to select the parameters which best suit their biological context.
masks <- simpleSeg(images,
nucleus = c("HH3"),
pca = TRUE,
cellBody = "dilate",
discSize = 3,
sizeSelection = 20,
transform = "sqrt",
tissue = c("panCK", "CD45", "HH3"),
cores = nCores)
Time for this code chunk to run with 5.5 cores: 53.68 seconds
4.2.1 Visualise separation
We can examine the performance of the cell segmentation using the display
and colorLabels
functions from the EBImage
package. If used in an interactive session, display
allows you to zoom in and out of the image.
# display image F3
EBImage::display(colorLabels(masks[[1]]))
4.2.2 Visualise outlines
The plotPixels
function from the cytomapper
package makes it easy to overlay the mask on top of the nucleus intensity marker to see how well our segmentation process has performed.
plotPixels(image = images["F3"],
mask = masks["F3"],
img_id = "imageID",
colour_by = c("HH3"),
display = "single",
colour = list(HH3 = c("black","blue")),
legend = NULL,
bcg = list(
HH3 = c(1, 1, 2)
))
Here, we can see that the segmentation appears to be performing reasonably.
If you see over or under-segmentation of your images, discSize
is a key parameter in the simpleSeg
function for optimising the size of the dilation disc after segmenting out the nuclei.
We can also visualise multiple markers at once instead of just the HH3 marker to see how the segmentation mask performs.
plotPixels(image = images["F3"],
mask = masks["F3"],
img_id = "imageID",
colour_by = c("HH3", "CD31", "FX111A"),
display = "single",
colour = list(HH3 = c("black","blue"),
CD31 = c("black", "red"),
FX111A = c("black", "green")),
legend = NULL,
bcg = list(
HH3 = c(1, 1, 2),
CD31 = c(0, 1, 2),
FX111A = c(0, 1, 1.5)
))
What to look for and change to obtain an ideal segmentation
- Does the segmentation capture the full nucleus? If not, perhaps you need to try a different transformation to improve the thresholding of the nuclei marker. You could also try using
pca = TRUE
which will borrow information across the markers to help find the nuclei. - How much of the cell body is the segmentation missing? Try increasing the dilation around the nucleus by setting
discSize = 7
. - Are the segmentations capturing neighbouring cells? Try decreasing the dilation to limit lateral spillover of marker signal by setting
discSize = 2
.
Here, we can see that our segmentation mask has done a good job of capturing the CD31 signal, but perhaps not such a good job of capturing the FXIIIA signal, which often lies outside of our dilated nuclear mask. This suggests that we might need to increase the discSize
or other parameters of simpleSeg
.
In particular, the cellBody
and watershed
parameters can strongly influence the way cells are segmented using simpleSeg
. We have provided further details on how the user may specify cell body identification and watershedding in the tables below.
As simpleSeg
is a nuclei-based dilation method, it suffers from tissues where cells might be multi-nucleated, or where cells have non-circular or elliptical morphologies. For tissues where you might expect these cells, it may be preferable to choose a different segmentation method.
4.2.2.1 cellBody
Parameters
Method | Description |
---|---|
“distance” | Performs watershedding on a distance map of the thresholded nuclei signal. With a pixels distance being defined as the distance from the closest background signal. |
“intensity” | Performs watershedding using the intensity of the nuclei marker. |
“combine” | Combines the previous two methods by multiplying the distance map by the nuclei marker intensity. |
4.2.2.2 watershed
Parameters
Method | Description |
---|---|
“dilation” | Dilates the nuclei by an amount defined by the user. The size of the dilatation in pixels may be specified with the discDize argument. |
“discModel” | Uses all the markers to predict the presence of dilated ‘discs’ around the nuclei. The model therefore learns which markers are typically present in the cell cytoplasm and generates a mask based on this. |
“marker” | The user may specify one or multiple dedicated cytoplasm markers to predict the cytoplasm. This can be done using cellBody = "marker name"/"index" |
“None” | The nuclei mask is returned directly. |
4.3 Summarise cell features
We can use the measureObjects
from cytomapper
to calculate the average intensity of each channel within each cell, as well as other morphological features. By default, measureObjects
will return a SingleCellExperiment
object, where the channel intensities are stored in the counts
assay and the spatial location of each cell is stored in colData
in the m.cx
and m.cy
columns.
However, you can also specify measureObjects
to return a SpatialExperiment
object by specifying return_as = "spe"
. In a SpatialExperiment
object, the spatial location of each cell is stored in the spatialCoords
slot as m.cx
and m.cy
, which simplifies plotting. In this demonstration, we will return a SpatialExperiment
object.
# summarise the expression of each marker in each cell
cells <- cytomapper::measureObjects(masks,
images,
img_id = "imageID",
return_as = "spe",
BPPARAM = BPPARAM)
spatialCoordsNames(cells) <- c("x", "y")
cells
class: SpatialExperiment
dim: 36 37953
metadata(0):
assays(1): counts
rownames(36): panCK CD20 ... DNA1 DNA2
rowData names(0):
colnames: NULL
colData names(8): imageID object_id ... sample_id objectNum
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
spatialCoords names(2) : x y
imgData names(1): sample_id
Time for this code chunk to run with 5.5 cores: 48.93 seconds
So far, we have processed our raw TIFF images, performed cell segmentation to isolate individual cells, and then stored our data as a SpatialExperiment
object. We can now move on to quality control, data transformation, and normalisation to address batch effects.
4.4 sessionInfo
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.4.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Australia/Sydney
tzcode source: internal
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] SpatialDatasets_1.4.0 SpatialExperiment_1.16.0
[3] ExperimentHub_2.14.0 AnnotationHub_3.14.0
[5] BiocFileCache_2.14.0 dbplyr_2.5.0
[7] simpleSeg_1.8.0 ggplot2_3.5.1
[9] cytomapper_1.18.0 SingleCellExperiment_1.28.1
[11] SummarizedExperiment_1.36.0 Biobase_2.66.0
[13] GenomicRanges_1.58.0 GenomeInfoDb_1.42.0
[15] IRanges_2.40.0 S4Vectors_0.44.0
[17] BiocGenerics_0.52.0 MatrixGenerics_1.18.0
[19] matrixStats_1.4.1 EBImage_4.48.0
loaded via a namespace (and not attached):
[1] DBI_1.2.3 bitops_1.0-9 deldir_2.0-4
[4] gridExtra_2.3 rlang_1.1.4 magrittr_2.0.3
[7] svgPanZoom_0.3.4 shinydashboard_0.7.2 RSQLite_2.3.8
[10] compiler_4.4.1 spatstat.geom_3.3-4 png_0.1-8
[13] systemfonts_1.1.0 fftwtools_0.9-11 vctrs_0.6.5
[16] pkgconfig_2.0.3 crayon_1.5.3 fastmap_1.2.0
[19] magick_2.8.5 XVector_0.46.0 utf8_1.2.4
[22] promises_1.3.2 rmarkdown_2.29 UCSC.utils_1.2.0
[25] ggbeeswarm_0.7.2 purrr_1.0.2 bit_4.5.0
[28] xfun_0.49 cachem_1.1.0 zlibbioc_1.52.0
[31] jsonlite_1.8.9 blob_1.2.4 later_1.4.1
[34] rhdf5filters_1.18.0 DelayedArray_0.32.0 spatstat.utils_3.1-1
[37] Rhdf5lib_1.28.0 BiocParallel_1.40.0 jpeg_0.1-10
[40] tiff_0.1-12 terra_1.7-78 parallel_4.4.1
[43] R6_2.5.1 RColorBrewer_1.1-3 spatstat.data_3.1-4
[46] spatstat.univar_3.1-1 Rcpp_1.0.13-1 knitr_1.49
[49] httpuv_1.6.15 Matrix_1.7-1 nnls_1.6
[52] tidyselect_1.2.1 yaml_2.3.10 rstudioapi_0.17.1
[55] abind_1.4-8 viridis_0.6.5 codetools_0.2-20
[58] curl_6.0.1 lattice_0.22-6 tibble_3.2.1
[61] KEGGREST_1.46.0 shiny_1.9.1 withr_3.0.2
[64] evaluate_1.0.1 polyclip_1.10-7 Biostrings_2.74.0
[67] filelock_1.0.3 BiocManager_1.30.25 pillar_1.9.0
[70] generics_0.1.3 sp_2.1-4 RCurl_1.98-1.16
[73] BiocVersion_3.20.0 munsell_0.5.1 scales_1.3.0
[76] xtable_1.8-4 glue_1.8.0 tools_4.4.1
[79] locfit_1.5-9.10 rhdf5_2.50.0 grid_4.4.1
[82] AnnotationDbi_1.68.0 colorspace_2.1-1 GenomeInfoDbData_1.2.13
[85] raster_3.6-30 beeswarm_0.4.0 HDF5Array_1.34.0
[88] vipor_0.4.7 cli_3.6.3 rappdirs_0.3.3
[91] fansi_1.0.6 S4Arrays_1.6.0 viridisLite_0.4.2
[94] svglite_2.1.3 dplyr_1.1.4 gtable_0.3.6
[97] digest_0.6.37 SparseArray_1.6.0 rjson_0.2.23
[100] htmlwidgets_1.6.4 memoise_2.0.1 htmltools_0.5.8.1
[103] lifecycle_1.0.4 httr_1.4.7 mime_0.12
[106] bit64_4.5.2