Unlocking single cell spatial omics analyses with scdney • scdneySpatialBiocAsia2024

Presenting Authors

Farhan Ameen $^{1,2,3}$ , Alex Qin $^{1,2,3}$ , Shreya Rao $^{1,2,3}$ , Ellis Patrick $^{1,2,3}$ .

$^1$ Westmead Institute for Medical Research, University of Sydney, Australia
$^2$ Sydney Precision Data Science Centre, University of Sydney, Australia
$^3$ School of Mathematics and Statistics, University of Sydney, Australia

Contact: ellis.patrick@sydney.edu.au

Overview

Understanding the interplay between different types of cells and their immediate environment is critical for understanding the mechanisms of cells themselves and their function in the context of human diseases. Recent advances in high dimensional in situ cytometry technologies have fundamentally revolutionized our ability to observe these complex cellular relationships providing an unprecedented characterisation of cellular heterogeneity in a tissue environment.

Description

In this workshop we will introduce some of the key analytical concepts needed to analyse data from high dimensional spatial omics technologies such as, PhenoCycler, IMC, Xenium and MERFISH. We will show how functionality from our Bioconductor packages simpleSeg, FuseSOM, scClassify, scHot, spicyR, listClust, statial, scFeatures and ClassifyR can be used to address various biological hypotheses. By the end of this workshop attendees will be able to implement and assess some of the key steps of a spatial analysis pipeline including cell segmentation, feature normalisation, cell type identification, microenvironment and cell-state characterisation, spatial hypothesis testing and patient classification. Understanding these key steps will provide attendees with the core skills needed to interrogate the comprehensive spatial information generated by these exciting new technologies.

Pre-requisites

It is expected that students will have:

basic knowledge of R syntax,
this workshop will not provide an in-depth description of cell-resolution spatial omics technologies.

R / Bioconductor packages used

Several single cell R packages will be used from the scdney package, for more information visit: https://sydneybiox.github.io/scdney/.

Learning objectives

Understand and visualise spatial omics datasets.
Identify key biological questions that can be addressed with these technologies and spatial analysis.
Understand the key analytical steps involved in spatial omics analysis, and perform these steps using R.
Evaluate the performance of data normalisation and cell segmentation.
Understand and generate individual feature representations from spatial omics data.
Develop appreciation on how to assess performance of classification models.
Perform disease outcome prediction using the feature representation and robust classification framework.

Context

In the following we will re-analyse some IMC data (Ferguson et al, 2022) profiling the spatial landscape of head and neck cutaneous squamous cell carcinomas (HNcSCC), the second most common type of skin cancer. The majority of HNcSCC can be treated with surgery and good local control, but a subset of large tumours infiltrate subcutaneous tissue and are considered high risk for local recurrence and metastases. The key conclusion of this manuscript (amongst others) is that spatial information about cells and the immune environment can be used to predict primary tumour progression or metastases in patients. We will use our spicy workflow to reach a similar conclusion.

Setup

Installation

if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install(c( "simpleSeg", "cytomapper", "scClassify", "scHOT", "FuseSOM","glmnet", "spicyR", "lisaClust","Statial", "scFeatures", "ClassifyR", "tidyverse", "scater", "SingleCellExperiment", "STexampleData", "SpatialDatasets", "tidySingleCellExperiment", "scuttle", "batchelor"))

Load packages

# packages from scdney
library(scdneySpatialBiocAsia2024)
library(SpatialDatasets)
library(simpleSeg)
library(FuseSOM)
library(lisaClust)
library(spicyR)
library(Statial)
library(ClassifyR)

# Other required packages
library(tidySingleCellExperiment)
library(tidyverse)
library(SingleCellExperiment)
library(cytomapper)
library(scater)
library(glmnet)
library(knitr)

theme_set(theme_classic())

nCores <- 8  # Feel free to parallelise things if you have the cores to spare.
BPPARAM <- simpleSeg:::generateBPParam(nCores)
source(system.file("extdata", "utils.R", package = "scdneySpatialBiocAsia2024"))
options("restore_SingleCellExperiment_show" = TRUE)

Download images

Feel free to follow along for this section. Much of the segmentation pipeline is computationally demanding, and most likely won’t be able to run on the Bioconductor Galaxy servers.

# Reads images to cache in a directory specified by the pathToImages variable.
pathToImages <- SpatialDatasets::Ferguson_Images()
tmp <- tempfile() # Creates a temporary directory.
unzip(pathToImages, exdir = tmp) # Unzips files to the temporary directory.

# Store images in a CytoImageList on_disk as h5 files to save memory.
images <- cytomapper::loadImages(
  tmp, # Read files from the temporary directory.
  single_channel = TRUE,
  on_disk = TRUE,
  h5FilesPath = HDF5Array::getHDF5DumpDir(),
  BPPARAM = BPPARAM
)

mcols(images) <- S4Vectors::DataFrame(imageID = names(images))

# Clean channel names
cn <- channelNames(images) # Read in channel names
head(cn)

cn <- sub(".*_", "", cn) # Remove preceding letters
cn <- sub(".ome", "", cn) # Remove the .ome
head(cn)

channelNames(images) <- cn # Reassign channel names

# Clean image names
head(names(images))

nam <- sapply(strsplit(names(images), "_"), `[`, 3)
head(nam)

names(images) <- nam # Reassigning image names
mcols(images)[["imageID"]] <- nam # Reassigning image names

simpleSeg: Segmentation

simpleSeg Package

Run simpleSeg

If your images are stored in a list or CytoImageList they can be segmented with a simple call to simpleSeg(). To summarise, simpleSeg() is an R implementation of a simple segmentation technique which traces out the nuclei using a specified channel using nucleus then dilates around the traced nuclei by a specified amount using discSize. The nucleus can be traced out using either one specified channel, or by using the principal components of all channels most correlated to the specified nuclear channel by setting pca = TRUE.

In the particular example below, we have asked simpleSeg to do the following:

By setting nucleus = c("HH3"), we’ve asked simpleSeg to trace out the nuclei signal in the images using the HH3 channel. By setting pca = TRUE, simpleSeg segments out the nuclei mask using a principal component analysis of all channels and using the principal components most aligned with the nuclei channel, in this case, HH3. By setting cellBody = "dilate", simpleSeg uses a dilation strategy of segmentation, expanding out from the nucleus by a specified discSize. By setting discSize = 3, simpleSeg dilates out from the nucleus by 3 pixels. By setting sizeSelection = 20, simpleSeg ensures that only cells with a size greater than 20 pixels will be used. By setting transform = "sqrt", simpleSeg square root transforms each of the channels prior to segmentation. By setting tissue = c("panCK", "CD45", "HH3"), we specify a tissue mask which simpleSeg uses, filtering out all background noise outside the tissue mask. This is important as these are tumour cores, wand hence circular, so we’d want to ignore background noise which happens outside of the tumour core.

There are many other parameters that can be specified in simpleSeg (smooth, watershed, tolerance, and ext), and we encourage the user to select the best parameters which suit their biological context.

masks <- simpleSeg(images,
                   nucleus = c("HH3"),
                   pca = TRUE,
                   cellBody = "dilate",
                   discSize = 3,
                   sizeSelection = 20,
                   transform = "sqrt",
                   tissue = c("panCK", "CD45", "HH3"),
                   cores = nCores
                   )

Visualise outlines

The plotPixels function in cytomapper makes it easy to overlay the mask on top of the nucleus intensity marker to see how well our segmentation process has performed. Here we can see that the segmentation appears to be performing reasonably.

If you see over or under-segmentation of your images, discSize is a key parameter in simpleSeg() for optimising the size of the dilation disc after segmenting out the nuclei.

plotPixels(image = images["F3"], 
           mask = masks["F3"],
           img_id = "imageID", 
           colour_by = c("HH3"), 
           display = "single",
           colour = list(HH3 = c("black","blue")),
           legend = NULL,
           bcg = list(
             HH3 = c(1, 1, 2)
           ))

If you wish to visualise multiple markers instead of just the HH3 marker and see how the segmentation mask performs, this can also be done. Here, we can see that our segmentation mask has done a good job of capturing the CD31 signal, but perhaps not such a good job of capturing the FXIIIA signal, which often lies outside of our dilated nuclear mask. This could suggest that we might need to increase the discSize of our dilation.

plotPixels(image = images["F3"], 
           mask = masks["F3"],
           img_id = "imageID", 
           colour_by = c("HH3", "CD31", "FX111A"), 
           display = "single",
           colour = list(HH3 = c("black","blue"),
                         CD31 = c("black", "red"),
                         FX111A = c("black", "green") ),
           legend = NULL,
           bcg = list(
             HH3 = c(1, 1, 2),
             CD31 = c(0, 1, 2),
             FX111A = c(0, 1, 1.5)
           ))

Questions

Is there any information we’re not capturing with this segmentation?
What parameters might you change to improve the segmentation?
What are some intrinsic limitations of the simpleSeg method?

In order to characterise the phenotypes of each of the segmented cells, measureObjects() from cytomapper will calculate the average intensity of each channel within each cell as well as a few morphological features. By default, the measureObjects() function will return a SingleCellExperiment object, where the channel intensities are stored in the counts assay and the spatial location of each cell is stored in colData in the m.cx and m.cy columns.

However, you can also specify measureObjects() to return a SpatialExperiment object by specifying return_as = "spe". As a SpatialExperiment object, the spatial location of each cell is stored in the spatialCoords slot, as m.cx and m.cy, which simplifies plotting. In this demonstration, we will return a SpatialExperiment object.

# Summarise the expression of each marker in each cell
cells <- cytomapper::measureObjects(masks,
                                    images,
                                    img_id = "imageID",
                                    return_as = "spe",
                                    BPPARAM = BPPARAM)

spatialCoordsNames(cells) <- c("x", "y")

Next, to associate features in our image with disease progression, it is important to read in information which links image identifiers to their progression status. We will do this here, making sure that our imageID column matches.

clinical <- read.csv(
  system.file(
    "extdata/clinicalData_TMA1_2021_AF.csv",
    package = "scdneySpatialBiocAsia2024"
  )
)

rownames(clinical) <- clinical$imageID
clinical <- clinical[names(images), ]

colData(cells) <- cbind(colData(cells), clinical[cells$imageID, ])

simpleSeg: Normalisation

We’ve made it easy to follow on from the normalisation section. Our SpatialDatasets package has the segmented SCE file stored in the function spe_Ferguson_2022().

#Download data now
cells = SpatialDatasets::spe_Ferguson_2022()

cells$x = spatialCoords(cells)[, "x"]
cells$y = spatialCoords(cells)[, "y"]

Time for this code chunk to run: 2.73 seconds

Normalise the data

We should check to see if the marker intensities of each cell require some form of transformation or normalisation. The reason we do this is two-fold:
1) The intensities of images are often highly skewed, preventing any meaningful downstream analysis.
2) The intensities across different images are often different, meaning that what is considered “positive” can be different across images.

By transforming and normalising the data, we aim to reduce these two effects. Here we extract the intensities from the counts assay. Looking at CD3 which should be expressed in the majority of the T cells, the intensities are clearly very skewed, and it is difficult to see what is considered a CD3- cell, and what is a CD3+ cell.

# Plot densities of CD3 for each image.
cells |> 
  join_features(features = rownames(cells), shape = "wide", assay = "counts") |> 
  ggplot(aes(x = CD3, colour = imageID)) + 
  geom_density() + 
  theme(legend.position = "none")

Questions

Can we see what is a CD3+ vs a CD3- cell here?

Dimension reduction and visualisation

As our data is stored in a SpatialExperiment we can also use scater to perform and visualise our data in a lower dimension to look for batch effects in our images. We can see that before normalisation, our UMAP shows a clear batch effect between images.

# Usually we specify a subset of the original markers which are informative to separating out distinct cell types for the UMAP and clustering.
ct_markers <- c("podoplanin", "CD13", "CD31",
                "panCK", "CD3", "CD4", "CD8a",
                "CD20", "CD68", "CD16", "CD14", "HLADR", "CD66a")

set.seed(51773)
# Perform dimension reduction using UMAP.
cells <- scater::runUMAP(
  cells,
  subset_row = ct_markers,
  exprs_values = "counts"
)

# Select a subset of images to plot.
someImages <- unique(cells$imageID)[c(1, 5, 10, 20, 30, 40)]

# UMAP by imageID.
scater::plotReducedDim(
  cells[, cells$imageID %in% someImages],
  dimred = "UMAP",
  colour_by = "imageID"
)

Time for this code chunk to run: 95.78 seconds

We can transform and normalise our data using the normalizeCells function. In the normalizeCells() function, we specify the following parameters. transformation is an optional argument which specifies the function to be applied to the data. We do not apply an arcsinh transformation here, as we already apply a square root transform in the simpleSeg() function. method = c("trim99", "mean", PC1") is an optional argument which specifies the normalisation method/s to be performed. Here, we: 1) Trim the 99th percentile 2) Divide by the mean 3) Remove the 1st principal component assayIn = "counts" is a required argument which specifies what the assay you’ll be taking the intensity data from is named. In our context, this is called counts.

This modified data is then stored in the norm assay by default. We can see that this normalised data appears more bimodal, not perfectly, but likely to a sufficient degree for clustering, as we can at least observe a clear CD3+ peak at 1.00, and a CD3- peak at around 0.3.

# Leave out the nuclei markers from our normalisation process. 
useMarkers <- rownames(cells)[!rownames(cells) %in% c("DNA1", "DNA2", "HH3")]

# Transform and normalise the marker expression of each cell type.
cells <- simpleSeg::normalizeCells(cells,
                        markers = useMarkers,
                        transformation = NULL,
                        method = c("trim99", "mean", "PC1"),
                        assayIn = "counts",
                        cores = nCores
)

# Plot densities of CD3 for each image
cells |> 
  join_features(features = rownames(cells), shape = "wide", assay = "norm") |> 
  ggplot(aes(x = CD3, colour = imageID)) + 
  geom_density() + 
  theme(legend.position = "none")

Time for this code chunk to run: 7.83 seconds

Questions

Can we see what is a CD3+ vs a CD3- cell here?

set.seed(51773)
# Perform dimension reduction using UMAP.
cells <- scater::runUMAP(
  cells,
  subset_row = ct_markers,
  exprs_values = "norm",
  name = "normUMAP"
)

someImages <- unique(cells$imageID)[c(1, 5, 10, 20, 30, 40)]

# UMAP by imageID.
scater::plotReducedDim(
  cells[, cells$imageID %in% someImages],
  dimred = "normUMAP",
  colour_by = "imageID"
)

Time for this code chunk to run: 98.51 seconds

Questions

What are we hoping to achieve with normalisation?
What has changed between the pre-normalisation and post-normalisation correction UMAP?

FuseSOM

FuseSOM Package FuseSOM Article

To dissect more information about cell interactions with each other, and cell-specific differences between groups, we next perform clustering. The runFuseSOM() function from the FuseSOM package conveniently runs clustering on any SingleCellExperiment object, by specifying the number of clusters under numClusters.

# Set seed.
set.seed(51773)

# Generate SOM and cluster cells into 10 groups
cells <- runFuseSOM(
  cells,
  markers = ct_markers,
  assay = "norm",
  numClusters = 10
)

Time for this code chunk to run: 3.24 seconds

We can also observe how reasonable our choice of k = 10 was, using the estimateNumCluster() and optiPlot() functions. Here we examine the Gap method, but others such as Silhouette and Within Cluster Distance are also available. We can see that there are elbow points in the gap statistic at k = 7, k = 10, and k = 11. We’ve specified k = 10, striking a good balance between the number of clusters and the gap statistic.

cells <- estimateNumCluster(cells, kSeq = 2:30)
optiPlot(cells, method = "gap")

Questions

Was our choice of k = 10 a good choice?
What other k might have made sense?
From a biological perspective, what would increasing k do?

Attempt to interpret the phenotype of each cluster

We can begin the process of understanding what each of these cell clusters are by using the plotGroupedHeatmap function from scater. At the least, here we can see we capture all the major immune populations that we expect to see, including the CD4 and CD8 T cells, the CD20+ B cells, the CD68+ myeloid populations, the CD66+ granulocytes, the podoplanin+ epithelial cells, and the panCK+ tumour cells.

# Visualise marker expression in each cluster.
scater::plotGroupedHeatmap(
  cells,
  features = ct_markers,
  group = "clusters",
  exprs_values = "norm",
  center = TRUE,
  scale = TRUE,
  zlim = c(-3, 3),
  cluster_rows = FALSE,
  block = "clusters"
)

Given domain-specific knowledge of the tumour-immune landscape, we can go ahead and annotate these clusters as cell types given their expression profiles.

cells <- cells |>
  mutate(cellType = case_when(
    clusters == "cluster_1" ~ "GC", # Granulocytes
    clusters == "cluster_2" ~ "MC", # Myeloid cells
    clusters == "cluster_3" ~ "SC", # Squamous cells
    clusters == "cluster_4" ~ "EP", # Epithelial cells
    clusters == "cluster_5" ~ "SC", # Squamous cells
    clusters == "cluster_6" ~ "TC_CD4", # CD4 T cells
    clusters == "cluster_7" ~ "BC", # B cells
    clusters == "cluster_8" ~ "EC", # Endothelial cells
    clusters == "cluster_9" ~ "TC_CD8", # CD8 T cells
    clusters == "cluster_10" ~ "DC" # Dendritic cells
  ))

We might also be interested in how these cell types are distributed on the images themselves. Here we examine the distribution of clusters on image F3, noting the healthy epithelial and endothelial structures surrounded by tumour cells.

reducedDim(cells, "spatialCoords") <- spatialCoords(cells)

cells |> 
  filter(imageID == "F3") |> 
  plotReducedDim("spatialCoords", colour_by = "cellType")

We find it always useful to check the number of cells in each cluster.

# Check cell type frequencies.
cells$cellType |>
  table() |>
  sort()

#> 
#>     DC     BC     MC     GC TC_CD8 TC_CD4     EC     EP     SC 
#>   2411   4322   5283   7993   8534  11753  14159  14170  87288

Questions

Does our distribution of cell types make sense?
What would we normally expect as the most expressed cell type and the least expressed cell type for this dataset?

We can also use the UMAP we computed earlier to visualise our data in a lower dimension and see how well our annotated cell types cluster out.

# UMAP by cell type
scater::plotReducedDim(
  cells[, cells$imageID %in% someImages],
  dimred = "normUMAP",
  colour_by = "cellType"
)

spicyR: Test spatial relationships

spicyR Package spicyR paper

spicyR uses the L-function to determine localisation or dispersion between cell types. The L-function measures attraction or dispersion between cell types, with larger values (> 0) suggesting increased attraction, and lower values (< 0) suggesting avoidance between cell types. We specify a range of radii to calculate the L-function over and set sigma = 50 to account for tissue inhomogeneity.

spicyTest <- spicy(cells,
                  condition = "group",
                  cellTypeCol = "cellType",
                  imageIDCol = "imageID",
                  Rs = 1:10*10,
                  sigma = 50,
                  BPPARAM = BPPARAM)

Questions

What does an L-function value of 0 suggest?
How would you choose the optimal radii (Rs) for a given dataset?

To save time we have saved the output of the above code.

load(system.file("extdata", "spicyTest.rda", package = "scdneySpatialBiocAsia2024"))
topPairs(spicyTest, n = 10)

#>             intercept coefficient    p.value adj.pvalue   from     to
#> BC__EC     -1.1346644    8.304020 0.03391529  0.6931721     BC     EC
#> EC__BC     -1.5909700    7.919971 0.04134661  0.6931721     EC     BC
#> GC__TC_CD4 -7.6316673   11.005639 0.05047096  0.6931721     GC TC_CD4
#> TC_CD4__GC -7.5406782   11.077602 0.05709461  0.6931721 TC_CD4     GC
#> TC_CD4__EP -0.7117393    4.853317 0.07261120  0.6931721 TC_CD4     EP
#> EP__TC_CD4 -0.7070918    4.820661 0.07510118  0.6931721     EP TC_CD4
#> EP__TC_CD8 -3.1445464    5.779536 0.08047994  0.6931721     EP TC_CD8
#> TC_CD8__EP -2.9313474    5.688420 0.09522058  0.6931721 TC_CD8     EP
#> TC_CD4__EC  3.1984391    4.366122 0.09716499  0.6931721 TC_CD4     EC
#> EC__TC_CD4  3.1052061    4.263971 0.10461424  0.6931721     EC TC_CD4

We can visualise these tests using the signifPlot function.

# Visualise which relationships are changing the most.
signifPlot(spicyTest)

Questions 1. Which cell types appear to have significant co-localistion/dispersion? 2. Does this align with what we might expect to see in a biological context?

spicyR also has functionality for plotting out individual pairwise relationships.

spicyBoxPlot(spicyTest, 
             from = "BC", 
             to = "EC")

Alternatively, we can look at the most differentially localised relationship between progressors and non-progressors by specifying rank = 1.

spicyBoxPlot(spicyTest, 
             rank = 1)

lisaClust: Find cellular neighbourhoods

lisaClust Package lisaClust paper

lisaClust allows us to identify regions where spatial associations between cell types is similar. Here we use the lisaClust function to clusters cells into 5 regions with distinct spatial ordering.

set.seed(51773)

cells <- lisaClust(
  cells,
  k = 4,
  sigma = 50,
  cellType = "cellType",
  BPPARAM = BPPARAM
)

Time for this code chunk to run: 36.5 seconds

To save time the results of the above code have been saved and can be loaded in.

Region - cell type enrichment heatmap

We can interpret which spatial orderings the regions are quantifying using the regionMap function. This plots the frequency of each cell type in a region relative to what you would expect by chance.

# Visualise the enrichment of each cell type in each region
regionMap(cells, cellType = "cellType", limit = c(0.2, 2))

Visualise regions

By default, these identified regions are stored in the regions column in the colData of our SpatialExperiment object. We can quickly examine the spatial arrangement of these regions using ggplot on image F3.

cells |> 
  filter(imageID == "F3") |> 
  plotReducedDim("spatialCoords", colour_by = "region")

While much slower, the hatchingPlot function can also be used to view regions of co-localisation simultaneously with the cell type calls.

# Use hatching to visualise regions and cell types.
hatchingPlot(
  cells,
  useImages = "F3",
  cellType = "cellType",
  nbp = 300
)

Time for this code chunk to run: 71.16 seconds

Questions

How could the choice of k affect our clustering results?
Looking at the graph above, would a different value of k be better?

Test for association with progression

We can use the colTest function to test for associations between the proportions of cells in each region and progression status.

# Test if the proportion of each region is associated
# with progression status.
testRegion <- colTest(
  cells,
  feature = "region",
  condition = "group",
  type = "ttest")

testRegion

#>          mean in group NP mean in group P tval.t  pval adjPval  cluster
#> region_1            0.022          0.0079   1.90 0.066    0.20 region_1
#> region_3            0.092          0.1300  -1.70 0.100    0.20 region_3
#> region_4            0.340          0.3300   0.63 0.530    0.71 region_4
#> region_2            0.540          0.5400   0.20 0.840    0.84 region_2

Time for this code chunk to run: 0.19 seconds

Questions

How could results from lisaClust be used in conjunction with results from spicyR?

Statial

Statial Package

SpatioMark

The first step in analysing these changes is to calculate the spatial proximity (getDistances) of each cell to every cell type. These values will then be stored in the reducedDims slot of the SingleCellExperiment object under the names distances. SpatioMark also provides functionality to look into proximal cell abundance using the getAbundance() function, which is further explored within the Statial package vignette

cells <- getDistances(cells,
  maxDist = 200,
  nCores = nCores,
  cellType = "cellType",
  spatialCoords = c("x", "y")
)

Time for this code chunk to run: 6.95 seconds

We can then visualise an example image, specified with image = "F3" and a particular marker interaction with cell type localisation. To visualise these changes, we specify two cell types with the from and to parameters, and a marker with the marker parameter (cell-cell-marker interactions). Here, we specify the changes in the marker podoplanin in SC as its localisation to EP increases or decreases, where we can observe that podoplanin decreases as its distance to the dense clump of tumour cells increases.

p <- plotStateChanges(
  cells = cells,
  cellType = "cellType",
  spatialCoords = c("x", "y"),
  type = "distances",
  image = "F3",
  from = "SC",
  to = "EP",
  marker = "podoplanin",
  size = 1,
  shape = 19,
  interactive = FALSE,
  plotModelFit = FALSE,
  method = "lm"
)

p

#> $image

#> 
#> $scatter

SpatioMark aims to holistically uncover all such significant relationships by looking at all interactions across all images. The calcStateChanges() function provided by Statial can be expanded for this exact purpose - by not specifying cell types, a marker, or an image, calcStateChanges() will examine the most significant correlations between distance and marker expression across the entire dataset.

state_dist <- calcStateChanges(
  cells = cells,
  cellType = "cellType",
  type = "distances",
  assay = 2,
  nCores = nCores,
  minCells = 100
)

head(state_dist[state_dist$imageID == "F3",], n = 10)

#>       imageID primaryCellType otherCellType     marker          coef      tval
#> 51990      F3              SC            EP podoplanin -0.0006911103 -16.90036
#> 23703      F3              EP            GC      CXCR3  0.0012103305  15.04988
#> 23704      F3              EP            GC     pSTAT3  0.0012299816  14.79933
#> 51978      F3              SC            GC       PDL2  0.0008802428  13.12601
#> 23669      F3              EP            MC       CCR7  0.0011476888  13.32900
#> 23595      F3              EP        TC_CD8      CXCR3  0.0016741880  12.95156
#> 51981      F3              SC            GC       ICOS  0.0005186181  11.75830
#> 23696      F3              EP            GC      CADM1  0.0011398625  12.18028
#> 11426      F3              EC            GC       CD31  0.0010070252  12.40866
#> 23650      F3              EP            EC       TIM3  0.0033539928  12.05028
#>               pval          fdr
#> 51990 4.203881e-59 9.004712e-57
#> 23703 8.041089e-41 8.191646e-39
#> 23704 8.916789e-40 8.712976e-38
#> 51978 1.839014e-37 1.637728e-35
#> 23669 9.360248e-34 6.863845e-32
#> 23595 3.025912e-32 2.085497e-30
#> 51981 1.113579e-30 7.070268e-29
#> 23696 3.239532e-29 1.903862e-27
#> 11426 4.839241e-29 2.825645e-27
#> 23650 1.030900e-28 5.930114e-27

Time for this code chunk to run: 5.81 seconds

The results from our SpatioMark outputs can be converted from a data.frame to a matrix, using the prepMatrix() function. Note, the choice of extracting either the t-statistic or the coefficient of the linear regression can be specified using the column = "tval" parameter, with the coefficient being the default extracted parameter. We can see that with SpatioMark, we get some features which are significant after adjusting for FDR.

# Preparing outcome vector
outcome <- cells$group[!duplicated(cells$imageID)]
names(outcome) <- cells$imageID[!duplicated(cells$imageID)]

# Preparing features for Statial
distMat <- prepMatrix(state_dist)

distMat <- distMat[names(outcome), ]

# Remove some very small values
distMat <- distMat[, colMeans(abs(distMat) > 0.0001) > .8]

survivalResults <- colTest(distMat, outcome, type = "ttest")

head(survivalResults)

#>                   mean in group NP mean in group P tval.t    pval adjPval
#> EC__EC__CD13              -0.00240        -9.6e-04   -3.6 0.00086    0.27
#> SC__SC__VISTA             -0.00470        -1.7e-03   -3.3 0.00190    0.27
#> SC__TC_CD4__CXCR3         -0.00015         8.8e-04   -3.4 0.00230    0.27
#> SC__EP__Ki67              -0.00037        -7.3e-05   -3.2 0.00260    0.27
#> EC__SC__CD68               0.00029        -3.9e-04    3.2 0.00300    0.27
#> SC__BC__DNA1               0.02600        -1.4e-01    3.0 0.00480    0.27
#>                             cluster
#> EC__EC__CD13           EC__EC__CD13
#> SC__SC__VISTA         SC__SC__VISTA
#> SC__TC_CD4__CXCR3 SC__TC_CD4__CXCR3
#> SC__EP__Ki67           SC__EP__Ki67
#> EC__SC__CD68           EC__SC__CD68
#> SC__BC__DNA1           SC__BC__DNA1

Time for this code chunk to run: 0.4 seconds

Kontextual

Pre-print

# Cluster cell type hierarchy
hierarchy <- treekoR::getClusterTree(t(assay(cells, "norm")),
                                     clusters = cells$cellType,
                                     hierarchy_method = "hopach")


parentList = Statial::getParentPhylo(hierarchy)

# Visualise tree
hierarchy$clust_tree |> 
  plot()

# Add parent 4 and 5
parentList$parent_4 = c(parentList$parent_2, parentList$parent_3)
parentList$parent_5 = c(parentList$parent_1, "EC")

all = unique(cells$cellType)

# Create data frame with all pairwise combinations of cells
treeDf = Statial::parentCombinations(all, parentList = parentList)

Time for this code chunk to run: 1.28 seconds

kontext = Statial::Kontextual(
  cells = cells,
  cellType = "cellType",
  r = 50,
  parentDf = treeDf,
  cores = nCores
)

To save time we have saved the output of the results above.

load(system.file("extdata", "kontext.rda", package = "scdneySpatialBiocAsia2024"))

We can use the spicyR::spicy to test for associations between the Kontextual values and progression status. To do so we use the alternateResult argument.

# Converting Kontextual result into data matrix
kontextMat <- Statial::prepMatrix(kontext)

# Replace NAs with 0
kontextMat[is.na(kontextMat)] <- 0

# Test association with outcome
kontextOutcome = spicyR::spicy(
  cells = cells,
  alternateResult = kontextMat,
  condition = "group",
  BPPARAM = BPPARAM
)

# Plot results
signifPlot(kontextOutcome)

Time for this code chunk to run: 8.1 seconds

Questions

Compare this plot to the spicyR results previously using signifPlot(spicyTest), how do Kontextual results compare to spicyR?
What are the interpretation differences between spicyR and Kontextual?
Use the plotKontextual() function to visualise some relationships (example below), is there anything that strikes out to you?

spicyR::signifPlot(spicyTest)

relationship = "DC__MC__parent_1"
image = "A4"

kontext |> 
  filter(test == relationship)

#>    imageID             test    original  kontextual  r inhomL
#> 1       A2 DC__MC__parent_1          NA          NA 50  FALSE
#> 2       A3 DC__MC__parent_1          NA          NA 50  FALSE
#> 3       A4 DC__MC__parent_1  -1.2198675   4.3620608 50  FALSE
#> 4       A5 DC__MC__parent_1   4.9391183  13.7580084 50  FALSE
#> 5       A6 DC__MC__parent_1   4.8588481  21.5756601 50  FALSE
#> 6       B1 DC__MC__parent_1   1.4916789  14.6484380 50  FALSE
#> 7       B2 DC__MC__parent_1  42.0245591  42.4761851 50  FALSE
#> 8      B21 DC__MC__parent_1  18.7324922  11.7984102 50  FALSE
#> 9       B3 DC__MC__parent_1  -5.7037591 -11.1252409 50  FALSE
#> 10      B4 DC__MC__parent_1   5.7735201   7.3325112 50  FALSE
#> 11      B6 DC__MC__parent_1  -3.0038293  -4.5809347 50  FALSE
#> 12      C1 DC__MC__parent_1  -3.0485103  -3.9483082 50  FALSE
#> 13      C2 DC__MC__parent_1   9.6928669   2.9790645 50  FALSE
#> 14      C3 DC__MC__parent_1   3.5870170   8.5049826 50  FALSE
#> 15      C4 DC__MC__parent_1   4.9411490   3.5646171 50  FALSE
#> 16      D2 DC__MC__parent_1 -20.0201139 -16.3681840 50  FALSE
#> 17      D3 DC__MC__parent_1  -7.4228765 -10.0050620 50  FALSE
#> 18      D4 DC__MC__parent_1   7.3041191  12.7749710 50  FALSE
#> 19      D5 DC__MC__parent_1  17.0540227  15.7993752 50  FALSE
#> 20      D6 DC__MC__parent_1  -6.0189812  -3.6871048 50  FALSE
#> 21      E1 DC__MC__parent_1  24.2134521  23.7525460 50  FALSE
#> 22      F3 DC__MC__parent_1  19.9966298  15.6131758 50  FALSE
#> 23      F4 DC__MC__parent_1  10.9176333  15.4254430 50  FALSE
#> 24      F5 DC__MC__parent_1  18.8820455  33.1354772 50  FALSE
#> 25      G4 DC__MC__parent_1  -1.5974267  -1.7029526 50  FALSE
#> 26      G5 DC__MC__parent_1  36.5752016  48.1433899 50  FALSE
#> 27      G6 DC__MC__parent_1  -5.0998227  -5.0921937 50  FALSE
#> 28      H1 DC__MC__parent_1  -7.1141532   0.7983119 50  FALSE
#> 29      H2 DC__MC__parent_1   2.6489041   8.3324522 50  FALSE
#> 30      H3 DC__MC__parent_1  20.3715132  31.8401893 50  FALSE
#> 31      H4 DC__MC__parent_1  13.4471471  16.6023987 50  FALSE
#> 32      H5 DC__MC__parent_1   2.9177892  15.6049344 50  FALSE
#> 33      H6 DC__MC__parent_1  -2.7544379  -0.2669482 50  FALSE
#> 34      I1 DC__MC__parent_1  12.6469476  17.3885399 50  FALSE
#> 35      I2 DC__MC__parent_1  -3.7060034   4.1567045 50  FALSE
#> 36      I3 DC__MC__parent_1  34.9020933  11.2716509 50  FALSE
#> 37      I4 DC__MC__parent_1  17.5909949  15.3714102 50  FALSE
#> 38      I5 DC__MC__parent_1  29.7187708  20.4201482 50  FALSE
#> 39      J3 DC__MC__parent_1  18.4084451  33.6605655 50  FALSE
#> 40      J4 DC__MC__parent_1  -7.2197007   1.3456391 50  FALSE
#> 41      J5 DC__MC__parent_1  -3.9792716  -1.8509027 50  FALSE
#> 42      J6 DC__MC__parent_1   0.7816835   4.4492464 50  FALSE
#> 43      K1 DC__MC__parent_1  -3.7436428  -1.2394727 50  FALSE
#> 44      K2 DC__MC__parent_1  10.1971570  14.0565621 50  FALSE

# Function from inst/extdata/utils.R file
plotKontextual(relationship, imageChoose = image)

Time for this code chunk to run: 0.88 seconds

ClassifyR: Patient classification

ClassifyR Package

Journal article

Our ClassifyR package formalises a convenient framework for evaluating classification in R. We provide functionality to easily include four key modelling stages; Data transformation, feature selection, classifier training and prediction; into a cross-validation loop. Here we use the crossValidate function to perform 50 repeats of 5-fold cross-validation to evaluate the performance of a model (classifier) with feature selection (selectionMethod) applied to five quantifications of our IMC data; 1) Cell type proportions 2) Cell type localisation from spicyR 3) Region proportions from lisaClust 4) Cell type localisation in reference to a parent cell type from Kontextual 5) Cell changes in response to proximal changes from SpatioMark

# Create list to store data.frames
data <- list()

# Add proportions of each cell type in each image
data[["Proportions"]] <- getProp(cells, "cellType")

# Add pair-wise associations
spicyMat <- spicyR::bind(spicyTest)
spicyMat[is.na(spicyMat)] <- 0
spicyMat <- spicyMat |>
  select(!condition) |>
  tibble::column_to_rownames("imageID")

data[["SpicyR"]] <- spicyMat


# Add proportions of each region in each image to the list of dataframes.
data[["LisaClust"]] <- getProp(cells, "region")


# Add SpatioMark features
data[["SpatioMark"]] <- distMat

# Add Kontextual features
data[["Kontextual"]] <- kontextMat

Time for this code chunk to run: 0.1 seconds

We have saved the data list and outcomes for convenience.

load(system.file("extdata", "data.rda", package = "scdneySpatialBiocAsia2024"))

# Set seed
set.seed(51773)

# Perform 5-fold cross-validation with 50 repeats.
cv <- ClassifyR::crossValidate(
  measurements = data,
  outcome = outcome,
  selectionMethod = "t-test",
  classifier = "LASSOGLM",
  nFolds = 5,
  nFeatures = 20,
  nRepeats = 50,
  nCores = nCores
)

We have saved the results for convenience.

load(system.file("extdata", "cv.rda", package = "scdneySpatialBiocAsia2024"))

Visualise cross-validated prediction performance

Here we use the performancePlot function to assess the AUC from each repeat of the 5-fold cross-validation.

performancePlot(
  cv,
  metric = "AUC",
  characteristicsList = list(x = "Assay Name"),
  orderingList = list("Assay Name" = c("Proportions", "SpicyR", "LisaClust", "Kontextual", "SpatioMark"))
)

Questions

Run the following chunk of code, what is this plot telling us?
Do different methods perform differently on different samples?
Knowing this, what might we do to improve our model?

samplesMetricMap(cv)

#> TableGrob (2 x 1) "arrange": 2 grobs
#>   z     cells    name                 grob
#> 1 1 (2-2,1-1) arrange       gtable[layout]
#> 2 2 (1-1,1-1) arrange text[GRID.text.2314]

The following code chunk demonstrates how we might use information from multiple matrices to perform classification.

# DO NOT RUN
multiCV <- ClassifyR::crossValidate(
  measurements = data,
  outcome = outcome,
  selectionMethod = "t-test",
  classifier = "LASSOGLM",
  nFolds = 5,
  nFeatures = 20,
  nRepeats = 50,
  multiViewMethod = "merge",
  nCores = nCores
)

plot = performancePlot(
  multiCV,
  metric = "AUC",
  characteristicsList = list(x = "Assay Name"),
)

Session Information

sessionInfo()

#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.5 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] scdneySpatialBiocAsia2024_2.0.1 knitr_1.48                     
#>  [3] glmnet_4.1-8                    Matrix_1.7-1                   
#>  [5] scater_1.34.0                   scuttle_1.16.0                 
#>  [7] cytomapper_1.18.0               EBImage_4.48.0                 
#>  [9] lubridate_1.9.3                 forcats_1.0.0                  
#> [11] stringr_1.5.1                   purrr_1.0.2                    
#> [13] readr_2.1.5                     tibble_3.2.1                   
#> [15] tidyverse_2.0.0                 ggplot2_3.5.1                  
#> [17] ttservice_0.4.1                 tidyr_1.3.1                    
#> [19] dplyr_1.1.4                     tidySingleCellExperiment_1.16.0
#> [21] ClassifyR_3.10.0                survival_3.7-0                 
#> [23] BiocParallel_1.40.0             MultiAssayExperiment_1.32.0    
#> [25] generics_0.1.3                  Statial_1.8.0                  
#> [27] spicyR_1.18.0                   lisaClust_1.14.4               
#> [29] FuseSOM_1.8.0                   simpleSeg_1.8.0                
#> [31] SpatialDatasets_1.4.0           SpatialExperiment_1.16.0       
#> [33] SingleCellExperiment_1.28.0     SummarizedExperiment_1.36.0    
#> [35] GenomicRanges_1.58.0            GenomeInfoDb_1.42.0            
#> [37] IRanges_2.40.0                  S4Vectors_0.44.0               
#> [39] MatrixGenerics_1.18.0           matrixStats_1.4.1              
#> [41] ExperimentHub_2.14.0            AnnotationHub_3.14.0           
#> [43] BiocFileCache_2.14.0            dbplyr_2.5.0                   
#> [45] Biobase_2.66.0                  BiocGenerics_0.52.0            
#> 
#> loaded via a namespace (and not attached):
#>   [1] tiff_0.1-12                 FCPS_1.3.4                 
#>   [3] nnet_7.3-19                 goftest_1.2-3              
#>   [5] Biostrings_2.74.0           HDF5Array_1.34.0           
#>   [7] TH.data_1.1-2               vctrs_0.6.5                
#>   [9] spatstat.random_3.3-2       digest_0.6.37              
#>  [11] png_0.1-8                   shape_1.4.6.1              
#>  [13] proxy_0.4-27                ggrepel_0.9.6              
#>  [15] deldir_2.0-4                permute_0.9-7              
#>  [17] magick_2.8.5                MASS_7.3-61                
#>  [19] pkgdown_2.1.1               reshape2_1.4.4             
#>  [21] httpuv_1.6.15               foreach_1.5.2              
#>  [23] withr_3.0.2                 ggfun_0.1.7                
#>  [25] psych_2.4.6.26              xfun_0.48                  
#>  [27] ggpubr_0.6.0                ellipsis_0.3.2             
#>  [29] memoise_2.0.1               ggbeeswarm_0.7.2           
#>  [31] RProtoBufLib_2.18.0         diptest_0.77-1             
#>  [33] princurve_2.1.6             systemfonts_1.1.0          
#>  [35] tidytree_0.4.6              zoo_1.8-12                 
#>  [37] GlobalOptions_0.1.2         ragg_1.3.3                 
#>  [39] V8_6.0.0                    DEoptimR_1.1-3             
#>  [41] prabclus_2.3-4              Formula_1.2-5              
#>  [43] KEGGREST_1.46.0             promises_1.3.0             
#>  [45] httr_1.4.7                  rstatix_0.7.2              
#>  [47] rhdf5filters_1.18.0         fpc_2.2-13                 
#>  [49] rhdf5_2.50.0                UCSC.utils_1.2.0           
#>  [51] concaveman_1.1.0            curl_5.2.3                 
#>  [53] zlibbioc_1.52.0             ScaledMatrix_1.14.0        
#>  [55] analogue_0.17-7             polyclip_1.10-7            
#>  [57] GenomeInfoDbData_1.2.13     SparseArray_1.6.0          
#>  [59] fftwtools_0.9-11            doParallel_1.0.17          
#>  [61] xtable_1.8-4                desc_1.4.3                 
#>  [63] evaluate_1.0.1              S4Arrays_1.6.0             
#>  [65] hms_1.1.3                   irlba_2.3.5.1              
#>  [67] colorspace_2.1-1            filelock_1.0.3             
#>  [69] spatstat.data_3.1-2         flexmix_2.3-19             
#>  [71] magrittr_2.0.3              ggtree_3.14.0              
#>  [73] later_1.3.2                 viridis_0.6.5              
#>  [75] modeltools_0.2-23           lattice_0.22-6             
#>  [77] robustbase_0.99-4-1         spatstat.geom_3.3-3        
#>  [79] XML_3.99-0.17               cowplot_1.1.3              
#>  [81] RcppAnnoy_0.0.22            ggupset_0.4.0              
#>  [83] class_7.3-22                svgPanZoom_0.3.4           
#>  [85] pillar_1.9.0                nlme_3.1-166               
#>  [87] iterators_1.0.14            compiler_4.4.1             
#>  [89] beachmat_2.22.0             stringi_1.8.4              
#>  [91] tensor_1.5                  minqa_1.2.8                
#>  [93] plyr_1.8.9                  treekoR_1.14.0             
#>  [95] crayon_1.5.3                abind_1.4-8                
#>  [97] gridGraphics_0.5-1          locfit_1.5-9.10            
#>  [99] sp_2.1-4                    bit_4.5.0                  
#> [101] terra_1.7-83                sandwich_3.1-1             
#> [103] multcomp_1.4-26             fastcluster_1.2.6          
#> [105] codetools_0.2-20            textshaping_0.4.0          
#> [107] BiocSingular_1.22.0         bslib_0.8.0                
#> [109] coop_0.6-3                  GetoptLong_1.0.5           
#> [111] plotly_4.10.4               mime_0.12                  
#> [113] splines_4.4.1               circlize_0.4.16            
#> [115] Rcpp_1.0.13                 profileModel_0.6.1         
#> [117] blob_1.2.4                  utf8_1.2.4                 
#> [119] clue_0.3-65                 BiocVersion_3.20.0         
#> [121] lme4_1.1-35.5               fs_1.6.4                   
#> [123] nnls_1.6                    ggsignif_0.6.4             
#> [125] ggplotify_0.1.2             scam_1.2-17                
#> [127] statmod_1.5.0               tzdb_0.4.0                 
#> [129] svglite_2.1.3               tweenr_2.0.3               
#> [131] pkgconfig_2.0.3             pheatmap_1.0.12            
#> [133] tools_4.4.1                 cachem_1.1.0               
#> [135] RSQLite_2.3.7               viridisLite_0.4.2          
#> [137] DBI_1.2.3                   numDeriv_2016.8-1.1        
#> [139] fastmap_1.2.0               rmarkdown_2.28             
#> [141] scales_1.3.0                grid_4.4.1                 
#> [143] shinydashboard_0.7.2        broom_1.0.7                
#> [145] sass_0.4.9                  patchwork_1.3.0            
#> [147] brglm_0.7.2                 BiocManager_1.30.25        
#> [149] carData_3.0-5               farver_2.1.2               
#> [151] mgcv_1.9-1                  yaml_2.3.10                
#> [153] ggthemes_5.1.0              cli_3.6.3                  
#> [155] hopach_2.66.0               lifecycle_1.0.4            
#> [157] uwot_0.2.2                  mvtnorm_1.3-1              
#> [159] kernlab_0.9-33              backports_1.5.0            
#> [161] cytolib_2.18.0              timechange_0.3.0           
#> [163] gtable_0.3.6                rjson_0.2.23               
#> [165] ape_5.8                     parallel_4.4.1             
#> [167] limma_3.62.1                edgeR_4.4.0                
#> [169] jsonlite_1.8.9              bitops_1.0-9               
#> [171] bit64_4.5.2                 Rtsne_0.17                 
#> [173] FlowSOM_2.14.0              yulab.utils_0.1.7          
#> [175] vegan_2.6-8                 spatstat.utils_3.1-0       
#> [177] BiocNeighbors_2.0.0         ranger_0.16.0              
#> [179] flowCore_2.18.0             bdsmatrix_1.3-7            
#> [181] jquerylib_0.1.4             highr_0.11                 
#> [183] spatstat.univar_3.0-1       lazyeval_0.2.2             
#> [185] ConsensusClusterPlus_1.70.0 shiny_1.9.1                
#> [187] diffcyt_1.26.0              htmltools_0.5.8.1          
#> [189] rappdirs_0.3.3              glue_1.8.0                 
#> [191] XVector_0.46.0              RCurl_1.98-1.16            
#> [193] treeio_1.30.0               mclust_6.1.1               
#> [195] mnormt_2.1.1                coxme_2.2-22               
#> [197] jpeg_0.1-10                 gridExtra_2.3              
#> [199] boot_1.3-31                 igraph_2.1.1               
#> [201] R6_2.5.1                    ggiraph_0.8.10             
#> [203] labeling_0.4.3              ggh4x_0.2.8                
#> [205] cluster_2.1.6               Rhdf5lib_1.28.0            
#> [207] aplot_0.2.3                 nloptr_2.1.1               
#> [209] DelayedArray_0.32.0         tidyselect_1.2.1           
#> [211] vipor_0.4.7                 ggforce_0.4.2              
#> [213] raster_3.6-30               car_3.1-3                  
#> [215] AnnotationDbi_1.68.0        rsvd_1.0.5                 
#> [217] munsell_0.5.1               DataVisualizations_1.3.2   
#> [219] data.table_1.16.2           ComplexHeatmap_2.22.0      
#> [221] htmlwidgets_1.6.4           RColorBrewer_1.1-3         
#> [223] rlang_1.1.4                 spatstat.sparse_3.1-0      
#> [225] spatstat.explore_3.3-3      colorRamps_2.3.4           
#> [227] uuid_1.2-1                  lmerTest_3.1-3             
#> [229] ggnewscale_0.5.0            fansi_1.0.6                
#> [231] beeswarm_0.4.0