Unlocking single cell spatial omics analyses with scdney • ScdneySpatial

Presenting Authors

Farhan Ameen$^{1,2,3}$, Alex Qin$^{1,2,3}$, Nick Robertson$^{1,2,3}$, Ellis Patrick$^{1,2,3}$.

$^1$ Westmead Institute for Medical Research, University of Sydney, Australia
$^2$ Sydney Precision Data Science Centre, University of Sydney, Australia
$^3$ School of Mathematics and Statistics, University of Sydney, Australia

Contact: ellis.patrick@sydney.edu.au

Overview

Understanding the interplay between different types of cells and their immediate environment is critical for understanding the mechanisms of cells themselves and their function in the context of human diseases. Recent advances in high dimensional in situ cytometry technologies have fundamentally revolutionized our ability to observe these complex cellular relationships providing an unprecedented characterisation of cellular heterogeneity in a tissue environment.

Description

In this workshop we will introduce some of the key analytical concepts needed to analyse data from high dimensional spatial omics technologies such as, PhenoCycler, IMC, Xenium and MERFISH. We will show how functionality from our Bioconductor packages simpleSeg, FuseSOM, scClassify, scHot, spicyR, listClust, statial, scFeatures and ClassifyR can be used to address various biological hypotheses. By the end of this workshop attendees will be able to implement and assess some of the key steps of a spatial analysis pipeline including cell segmentation, feature normalisation, cell type identification, microenvironment and cell-state characterisation, spatial hypothesis testing and patient classification. Understanding these key steps will provide attendees with the core skills needed to interrogate the comprehensive spatial information generated by these exciting new technologies.

Pre-requisites

It is expected that students will have:

basic knowledge of R syntax,
this workshop will not provide an in-depth description of cell-resolution spatial omics technologies.

R / Bioconductor packages used

Several single cell R packages will be used from the scdney package, for more information visit: https://sydneybiox.github.io/scdney/.

Time outline

Activity	Time
Data visualisation & cell segmentation	1h 30m
Cell type clustering & classification	1h 30m
Spatial analysis & feature generation	2h
Patient Classification	1h 30m

Learning objectives

Understand and visualise spatial omics datasets.
Identify key biological questions that can be addressed with these technologies and spatial analysis.
Understand the key analytical steps involved in spatial omics analysis, and perform these steps using R.
Evaluate the performance of data normalisation and cell segmentation.
Understand and generate individual feature representations from spatial omics data.
Develop appreciation on how to assess performance of classification models.
Perform disease outcome prediction using the feature representation and robust classification framework.

Workshop

Setup

Installation

if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install(c( "simpleSeg", "cytomapper", "scClassify", "scHOT", "FuseSOM","glmnet", "spicyR", "lisaClust","Statial", "scFeatures", "ClassifyR", "tidyverse", "scater", "SingleCellExperiment", "STexampleData", "SpatialDatasets", "tidySingleCellExperiment", "scuttle", "batchelor"))

Download datasets

Please download these datasets before you start the workshop

#Download data now
spe <- STexampleData::seqFISH_mouseEmbryo()
kerenSPE = SpatialDatasets::spe_Keren_2018()

Load packages

# packages from scdney
library(ClassifyR)
library(FuseSOM)
library(lisaClust)
library(scClassify)
library(scFeatures) 
library(scHOT)
library(simpleSeg)
library(SpatialDatasets)
library(spicyR)
library(Statial)

# Other required packages
library(BumpyMatrix)
library(cytomapper)
library(batchelor)
library(ggplot2)
library(ggsurvfit)
library(glmnet)
library(plotly)
library(reshape)
library(scater)
library(scuttle)
library(SingleCellExperiment)
library(STexampleData)
library(tidySingleCellExperiment)
library(tidyverse)

theme_set(theme_classic())

nCores <- 8  # Feel free to parallelise things if you have the cores to spare.
BPPARAM <- simpleSeg:::generateBPParam(nCores)
source(system.file("extdata", "celltype_colours.R", package = "ScdneySpatial"))
options("restore_SingleCellExperiment_show" = TRUE)

Module 1: Data exploration

In this workshop, we will be working with two datasets to explore how biological phenotypes, cellular interactions, and patterns of gene expression are correlated with disease. Both of these datasets will be used in different contexts, hopefully these contexts are representative of scenarios you will encounter in your own datasets.

We will use two motivating datasets:

Keren et al, 2018: A multiplexed ion beam imaging by time-of-flight (MIBI-TOF) dataset profilining tissue from triple-negative breast cancer patients. The primary question we will address with this dataset is if we can predict risk of cancer recurrence and overall survival time based on imaging data?
Lohoff et al, 2022: A seqFISH study of early mouse organogenesis. We will use a subset of data that is made available from the STExampleData package. The primary question we will address with this dataset is if we can identify key transcriptomic drivers of the developing brain?

The purpose of the this section is primarily to introduce the SpatialExperiment class which is used to store information from the imaging experiments in R. The goal will be to get comfortable enough manipulating and exploring these objects so that you can progress through the remainder of the workshop comfortably. Here we will download a dataset stored in the STexampleData R package, examine the structure, visualise the data and perform some exploratory analyses.

Here we download the seqFISH mouse embryo data. This comes in the format of a SpatialExperiment object, where summarized information from an imaging dataset can be compiled and accessed with relative ease.

#spe <- STexampleData::seqFISH_mouseEmbryo()
spe

#> class: SpatialExperiment 
#> dim: 351 11026 
#> metadata(0):
#> assays(2): counts molecules
#> rownames(351): Abcc4 Acp5 ... Zfp57 Zic3
#> rowData names(1): gene_name
#> colnames(11026): embryo1_Pos0_cell10_z2 embryo1_Pos0_cell100_z2 ...
#>   embryo1_Pos28_cell97_z2 embryo1_Pos28_cell98_z2
#> colData names(14): cell_id embryo ... segmentation_vertices sample_id
#> spatialCoords names(2) : x y
#> imgData names(0):

We can use functions designed for SingleCellExperiment objects in the scater package for plotting via the reducedDim slot. We multiply the spatial coordinates by a matrix to flip the y-axis and ensure we fix the aspect ratio.

spe <- logNormCounts(spe)
coord_transform <- matrix(c(1,0,0,-1), 2, 2, byrow = TRUE)
reducedDim(spe, "spatialCoords") <- spatialCoords(spe) %*% coord_transform
plotReducedDim(spe, "spatialCoords", colour_by = c("Sox2"), point_size = 1) +
  coord_fixed()

Questions

How many cells are in this data?
How many genes?
Plot gene expression mapping point size to the cell area.

# try to answer the above question using the spe object. 
# you may want to check the SingleCellExperiment vignette.
# https://bioconductor.org/packages/3.17/bioc/vignettes/SingleCellExperiment/inst/doc/intro.html

We can perform a typical gene-expression based analysis for this data. Later in part two we will perform some specific analytical techniques, but for now let’s explore the dataset and use methods designed for single cell data.

Typically in single-cell data analysis, we perform dimension reduction to project the high dimensional cell x gene matrix on to 2D space. This allows us to visualise various things of interest, such as distribution of cell types and disease outcomes. Here, we will see how cells are segregated by their expression of Sox2.

spe <- runPCA(spe)

b.out <- batchelor::batchCorrect(spe, batch = spe$pos, assay.type = "logcounts", PARAM=FastMnnParam(d=20))
reducedDim(spe, "FastMnn") <- reducedDim(b.out, "corrected")
spe <- runUMAP(spe, dimred = "FastMnn")
spe

#> class: SpatialExperiment 
#> dim: 351 11026 
#> metadata(0):
#> assays(3): counts molecules logcounts
#> rownames(351): Abcc4 Acp5 ... Zfp57 Zic3
#> rowData names(1): gene_name
#> colnames(11026): embryo1_Pos0_cell10_z2 embryo1_Pos0_cell100_z2 ...
#>   embryo1_Pos28_cell97_z2 embryo1_Pos28_cell98_z2
#> colData names(15): cell_id embryo ... sample_id sizeFactor
#> spatialCoords names(2) : x y
#> imgData names(1): sample_id

g_celltype_umap <- plotReducedDim(spe, "UMAP", colour_by = "celltype_mapped_refined") + 
  scale_colour_manual(values = celltype_colours)
g_celltype_umap

plotReducedDim(spe, "UMAP", colour_by = "Sox2")

g_celltype_spatial <- plotReducedDim(spe, "spatialCoords", colour_by = "celltype_mapped_refined") + 
  scale_colour_manual(values = celltype_colours) + 
  coord_fixed()

g_all <- g_celltype_spatial + theme(legend.position = "none") + g_celltype_umap
g_all

Advanced/Extension Question

What considerations need to be made for batch correction of spatial data? What assumptions are being made and/or broken? How could you check this?
Check out the ggiraph package for extending the g_all object to an interactive plot with a tooltip that links the spatial and UMAP coordinate systems. (Hint: This may involve generating a new ggplot object outside of the plotReducedDim function.)

# try to examine answer the above questions using the spe object. 
# you may want to set up some small simulation..

As part of scdney package infrastructure, we provide several pre-processed datasets which can be loaded in with the SpatialDatasets package. Below, we’ve loaded in the MIBI-TOF triple negative breast cancer data from Keren et al. into a SpatialExperiment object.

# Load in Keren et al. data.
#kerenSPE = SpatialDatasets::spe_Keren_2018() 

# Remove cold tumour types and patients with missing survival data.
kerenSPE = kerenSPE |> 
  filter(tumour_type != "cold",
         !is.na(`Survival_days_capped.`)) |> 
  mutate(event = 1 - Censored)

colData(kerenSPE)$x <- spatialCoords(kerenSPE)[,1]
colData(kerenSPE)$y <- spatialCoords(kerenSPE)[,2]

reducedDim(kerenSPE, "spatialCoords") <- spatialCoords(kerenSPE)

kerenSPE

#> class: SpatialExperiment 
#> dim: 48 170171 
#> metadata(0):
#> assays(1): intensities
#> rownames(48): Na Si ... Ta Au
#> rowData names(0):
#> colnames(170171): 1 2 ... 197677 197678
#> colData names(43): CellID imageID ... x y
#> spatialCoords names(2) : x y
#> imgData names(1): sample_id

At the start of any analysis, it is often good to explore the data to get a sense of complexity. We can do this by exploring the distribution of the outcomes and variables in the patients’ meta-data.

Try starting off your exploration by answering the below questions.

Questions

How many cells are in this data?
How many markers? How many images?

# try to answer the above question using the imc object. 
# you may want to check the SpatialExperiment vignette.
# https://www.bioconductor.org/packages/release/bioc/vignettes/SpatialExperiment/inst/doc/SpatialExperiment.html

Again, we can perform dimension reduction to visualize this dataset on a 2D plane.

Questions

Visualise the UMAP using the plotReducedDim function and colour the UMAP by cellType. What does this UMAP tell us?
What are some observations we could make if we coloured by imageID?

set.seed(51773)
kerenSPE <- scater::runUMAP(kerenSPE, exprs_values = "intensities", name = "UMAP")


# try to answer the question here!

Module 2: Cell segmentation

To load in our images we use the loadImages function from cytomapper, here we use the patient 5 image from Keren et al. as an example.

imageLocation <- system.file("extdata", "kerenPatient5.tiff", package = "ScdneySpatial")
image5 = cytomapper::loadImages(
  x = imageLocation,
  as.is = TRUE #Needed as 8-bit image
)

mcols(image5) = data.frame(list("imageID" = "kerenPatient5"))

# Setting the channel names according to orginal paper. 
channelNames(image5) = c("Au", "Background", "Beta catenin", "Ca", "CD11b", "CD11c", "CD138", "CD16", "CD20", "CD209", "CD3", "CD31", "CD4", "CD45", "CD45RO", "CD56", "CD63", "CD68", "CD8", "dsDNA", "EGFR", "Fe", "FoxP3", "H3K27me3", "H3K9ac", "HLA_Class_1", "HLA_DR", "IDO", "Keratin17", "Keratin6", "Ki67", "Lag3", "MPO", "Na", "P", "p53", "Pan-Keratin", "PD-L1", "PD1", "phospho-S6", "Si", "SMA", "Ta", "Vimentin")

Questions

What class is image5? Hint: class()
How many images and markers are in image5?
Challenge: What is the dimension of the image5 image?

# Answer questions here

We can visualise this image to see what we have read in. Lets highlight 4 markers.

# Visualise segmentation performance another way.
cytomapper::plotPixels(
  image = image5[1],
  colour_by = c("CD45", "Pan-Keratin", "SMA", "dsDNA"),
  colour = list(
    CD45 = c("black", "blue"),
    `Pan-Keratin` = c("black", "yellow"),
    SMA = c("black", "green"),
    dsDNA = c("black", "red")
  )
)

We can manipulate the brightness, contrast and gamma levels as follows. See if you can do a better job.

# Visualise segmentation performance another way.
cytomapper::plotPixels(
  image = image5[1],
  colour_by = c("CD45", "Pan-Keratin", "SMA", "dsDNA"),
  display = "single",
  colour = list(
    CD45 = c("black", "red"),
    `Pan-Keratin` = c("black", "yellow"),
    SMA = c("black", "green"),
    dsDNA = c("black", "blue")
  )
  ,
  # Adjust the brightness, contrast and gamma of each channel.
  bcg = list(
    CD45 = c(0, 4, 1),
    `Pan-Keratin` = c(0, 3, 1),
    SMA = c(0, 2, 1),
    dsDNA = c(0, 2, 1)
  ),
  legend = NULL
)

The EBImage package on Bioconductor provides a lot of useful functions for manipulating imaging data in R. This includes functionality for finding cells, process called cell segmentation. Lets work through an example from their vignette. This will use some functionality that complements that which you’ve already learnt.

We start by loading the images of nuclei and cell bodies. To visualize the cells we overlay these images as the green and the blue channel of a false-color image. Notice, that with display you can zoom!

nuc = readImage(system.file('images', 'nuclei.tif', package='EBImage')) 
cel = readImage(system.file('images', 'cells.tif', package='EBImage'))  
cells = rgbImage(green=1.5*cel, blue=nuc) 
display(cells, all = TRUE)

We will next segment the nuclei. The nuc channel contains fluorescent intensities of a protein expressed in the nuclei of cells. First we create a nuclei mask which will threshold this channel to separate signal from noise and then clean this with some morphological operations. Then we identify single nuclei using bwlabel.

nmask = thresh(nuc, w=10, h=10, offset=0.05)
nmask = opening(nmask, makeBrush(5, shape='disc'))
nmask = fillHull(nmask)
nmask = bwlabel(nmask)

display(nmask, all=TRUE)

Questions

Do you understand what nmask is?
What are you seeing when you do table(nmask)
What happens when you using colorLabels() with display?

Next, we use the segmented nuclei as seeds to perform Voronoi segmentation of the cytoplasm.

ctmask = opening(cel>0.1, makeBrush(5, shape='disc'))
cmask = propagate(cel, seeds=nmask, mask=ctmask)

display(ctmask, all=TRUE)

We can then visualise our segmentations with paintObjects.

segmented = paintObjects(cmask, cells, col='#ff00ff')
segmented = paintObjects(nmask, segmented, col='#ffff00')

display(segmented, all=TRUE)

Images stored in a list or CytoImageList can be segmented using simpleSeg. Below simpleSeg will identify the nuclei in the image using the dsDNA, H3K27me3 and H3K9ac channel. To estimate the cell body of the cells we will simply dilate out from the nuclei by 3 pixels. We also have specified that the channels be sqrt transformed, and the 99th quantile of values removed to ensure our segmentation is not affected by outliers.

We can visualise the segmentations using the display and colorLabels functions in EBImage.

# Generate segmentation masks
masks <- simpleSeg(
  image5,
  nucleus = c("dsDNA", "H3K27me3", "H3K9ac"),
  cellBody = "dilate", 
  transform = c("sqrt", "norm99"),
  sizeSelection = 40,
  smooth = 3,
  discSize = 3,
  cores = nCores
)

One way to assess if segmentation is appropriate is to look. We now know two ways to do this.

# Visualise segmentation performance one way.
masks[[1]] |> 
  EBImage::colorLabels() |> 
  EBImage::display()

The plotPixels function in cytomapper makes it easy to overlay the masks on top of the intensities of markers in the image. Here we can s

# Visualise segmentation performance another way.
cytomapper::plotPixels(
  image = image5[1],
  mask = masks[1],
  img_id = "imageID",
  colour_by = c("CD45", "Pan-Keratin", "SMA", "dsDNA"),
  display = "single",
  colour = list(
    CD45 = c("black", "red"),
    `Pan-Keratin` = c("black", "yellow"),
    SMA = c("black", "green"),
    dsDNA = c("black", "blue")
  ),
  # Adjust the brightness, contrast and gamma of each channel.
  bcg = list(
    CD45 = c(0, 4, 1),
    `Pan-Keratin` = c(0, 3, 1),
    SMA = c(0, 2, 1),
    dsDNA = c(0, 4, 1)
  ),
  legend = NULL
)

Using the table function in R we can measure the pixel area of each cell and plot this on a histogram. In this dataset 1 µm is equivalent to $2.56$ pixels, we can transform the pixel area of a cell to physical area by dividing the pixel area by $2.56^2$. We can see below the majority of our cells are just below $100µm^2$, you should decide whether or not this is an appropriate size based on your knowledge of the cell types being imaged.

# Get area of every cell
cellSize = table(masks[[1]])/(2.56^2)

# Filter out first index, which represents background
cellSize[2:length(cellSize)] |> 
  hist(breaks = 30,
       xlab = "Pixel area of cell (µm^2)",
       main = NULL)

Questions

Do our segmentations look better if other parameters are used? Hint: use the help function to see what other parameters are available in simpleSeg.
We only have one image here, but how would you check the quality of 100 images?

With our segmentation masks we can convert our image to a SpatialExperiment object using the measureObjects function in cytomapper. measureObjects will find the center of each cell and represent the cell as an x and y coordinate. In addition the average marker expression within each cell is summarised and stored in the assays object of the SpatialExperiment. NOTE: If you are using other segmentation tools or softwares, you can import the masks as tiffs into R and use measureObjects to create a SpatialExperiment.

# Summarise the expression of each marker in each cell
cells <- cytomapper::measureObjects(
  mask = masks,
  image = image5,
  feature_types = c("basic", "moment"),
  basic_feature = "mean",
  moment_feature = c("cx", "cy"),
  img_id = "imageID",
  BPPARAM = BPPARAM
)


cells

#> class: SingleCellExperiment 
#> dim: 44 3739 
#> metadata(0):
#> assays(1): counts
#> rownames(44): Au Background ... Ta Vimentin
#> rowData names(0):
#> colnames: NULL
#> colData names(5): imageID object_id m.cx m.cy objectNum

Module 3: Feature normalisation

Before moving onto annotating the cell types, we should check if the marker intensities of each cell require any transformation or normalisation. Here we examine the distribution of CD45 in each cell across all the images, which is a general marker for immune cells. Below we can see that the intensity of CD45 looks very skewed.

# Joining the marker information with the cell information
proteinIntensities = kerenSPE |> 
  join_features(features = rownames(kerenSPE), shape = "wide", assay = "intensities")

# View the density of CD45 across all images.
ggplot(proteinIntensities, aes(x = CD45, colour = imageID)) +
  geom_density() +
  theme(legend.position = "none")

We can transform and normalise our data using the normalizeCells function in simpleSeg. Here we have taken the intensities from the intensities assay, performed a asinh transform, then for each image trimmed the 99 quantile and min-max scaled to 0-1. In addition to this we have performed principle component analysis, and regressed out the first PC. This modified data is then stored in the normIntensities assay. We can see that the distribution of CD45 looks much less skewed than before.

kerenSPE <- normalizeCells(
  cells = kerenSPE,
  transformation = "asinh",
  method = c("trim99", "minMax", "PC1"),
  assayIn = "intensities",
  assayOut = "normIntensities",
  cores = nCores
)


normProteinIntensities = kerenSPE |> 
  join_features(features = rownames(kerenSPE), shape = "wide", assay = "normIntensities")

ggplot(normProteinIntensities, aes(x = CD3, colour = imageID)) +
  geom_density() +
  theme(legend.position = "none")

Questions

CD3 is a marker which characterises T cells, plot the distribution of CD3. What does it look like? What does this tell us about T cells in our dataset.
If we set cellType == "CD4_T_cell" is this concordant with the conclusion you made in question 1.

# Answer question here

Module 4: Cell type annotation

Here we cluster using the runFuseSOM function from FuseSOM. We have chosen to specify the same subset of markers used in the original manuscript for gating cell types. We have also specified the number of clusters to identify to be numClusters = 17.

useMarkers = c("CD45", "FoxP3", "CD4", "CD8", "CD3", "CD20", "CD16", "CD68", "MPO", "HLA.DR", "Pan.Keratin", "Keratin17", "Keratin6", "p53", "Beta.catenin", "EGFR")

set.seed(51773)

kerenSPE <- runFuseSOM(
  kerenSPE,
  markers = useMarkers,
  assay = "normIntensities",
  numClusters = 17
)

We can begin the process of understanding what each of these cell clusters are by using the plotGroupedHeatmap function from scater.

scater::plotGroupedHeatmap(
  kerenSPE,
  features = useMarkers,
  group = "clusters",
  exprs_values = "normIntensities",
  center = TRUE,
  scale = TRUE,
  zlim = c(-3, 3),
  cluster_rows = FALSE
)

Questions

Have we captured distinct cell populations? Can you identify any of the clusters?
Run the code below to compare the cell types in the dataset to the clusters we have identified. How do our clusters compare? Are there any clusters that should be combined?

# Type your answer here


# Question 2 code
regionMap(kerenSPE, cellType = "clusters", region = "cellType") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  labs(x = "Cell types in data", y = "Our clusters")

Great question! There is no right answer here, however there are several statistics which can be used to estimate the “true” number of clusters. We can use the estimateNumCluster and optiPlot functions from FuseSOM to examine if our choice of 17 clusters is reasonable. Here we use the gap statistic, other methods such as jump, slope, silhouette and within cluster distance (wcd) are also available.

kerenSPE <- estimateNumCluster(kerenSPE, kSeq = 2:30)
optiPlot(kerenSPE, method = "gap")

kerenSPE@metadata$clusterEstimation$Discriminant

#> [1] 10

Finally we can run a UMAP to examine how distinct our clusters are from one another.

set.seed(51773)
# Perform dimension reduction using UMP.
kerenSPE <- scater::runUMAP(
  kerenSPE,
  subset_row = useMarkers,
  exprs_values = "normIntensities",
  name = "normUMAP"
)



# UMAP by cell type cluster.
scater::plotReducedDim(
  kerenSPE,
  dimred = "normUMAP",
  colour_by = "clusters"
)

Questions

How does this UMAP compare to the original UMAP?

# Answer question here.

Cell type clustering may difficult without the aid of domain experts who know how many cell types to expect and are able to annotate the clusters. Cell type classification is an alternative approach to clustering which annotates cell types based on an expert labeled reference dataset.

To do so we will use scClassify. Below we will use the intensities an cell type annotations from the kerenSPE dataset to predict the cell types in image 5.

# Splitting data to train and test sets
kerenSPE_train = kerenSPE |> 
  filter(imageID != 5)

kerenSPE_test = kerenSPE |> 
  filter(imageID == 5 )


# Converting intensities to dgCMatrix
train_intensities = assay(kerenSPE_train, "intensities") |>  
  as.matrix() |> 
  as( "dgCMatrix")


test_intensities = assay(kerenSPE_test, "intensities") |>  
  as.matrix() |> 
  as( "dgCMatrix")

scClassify first constructs a cell type tree from the training dataset, where cells are organised in a hierarchy with increasingly fined tuned annotations, here we use the HOPACH method to construct this tree. This tree is constructed using the proteins which are differentially expressed between one cell type to all the other cells, identified using limma. To annotate the cells in the unlabeled dataset, the unlabeled cells are projected into a lower dimensional space and a weighted KNN (WKNN) is used to classify unlabeled cells to their closest neighbours using pearson correlation as the distance metric.

scClassifyResults = scClassify(exprsMat_train = train_intensities, 
                            cellTypes_train = kerenSPE_train$cellType,
                            exprsMat_test = list(keren5 = test_intensities),
                            tree = "HOPACH",
                            algorithm = "WKNN",
                            selectFeatures = c("limma"),
                            similarity = c("pearson"),
                            returnList = FALSE,
                            verbose = FALSE,
                            BPPARAM = BPPARAM)

To save time we have preloaded in the scClassify results, below we can view the hierarchical tree constructed by scClassify.

load(file = system.file("extdata", "scClassifyResults.rda", package = "ScdneySpatial"))

plotCellTypeTree(cellTypeTree(scClassifyResults$trainRes))

Having a look at some of the predictions, we can observe some cells belonging to multiple cell types. These represent cells which are classified as parent populations in the hierarchical tree. Here we clean up the names to make plotting easier. We can compare our original labels to our classified labels to get a sense of how concordant our predictions are. As you can see, around 68% of the cell types have identical labels between the original and classified cell types.

classifiedCellTypes = scClassifyResults$testRes$keren5$pearson_WKNN_limma$predRes

table(classifiedCellTypes)

#> classifiedCellTypes
#>                                                                                           B_cell 
#>                                                                                                2 
#>                                                                                       CD4_T_cell 
#>                                                                                              209 
#>                                                                                       CD8_T_cell 
#>                                                                                              440 
#>                                                                                               DC 
#>                                                                                                1 
#>                                                                                       DC_or_Mono 
#>                                                                                              217 
#>                                                               DC_or_Mono_Macrophages_Mono_or_Neu 
#>                                                                                              122 
#>                                                                                         dn_T_CD3 
#>                                                                                               36 
#>                                                            dn_T_CD3_B_cell_CD4_T_cell_CD8_T_cell 
#>                                                                                                1 
#> dn_T_CD3_B_cell_CD4_T_cell_DC_or_Mono_Macrophages_CD8_T_cell_Mono_or_Neu_Neutrophils_NK_DC_Tregs 
#>                                                                                              194 
#>                                                                   dn_T_CD3_CD4_T_cell_CD8_T_cell 
#>                                                                                               70 
#>                                                                                      Endothelial 
#>                                                                                              102 
#>                                                                          Endothelial_Mesenchymal 
#>                                                                                               11 
#>                                                                                   Keratin_Tumour 
#>                                                                                             1901 
#>                          Keratin_Tumour_Unidentified_Other_Immune_Endothelial_Mesenchymal_Tumour 
#>                                                                                              417 
#>                                                                                      Macrophages 
#>                                                                                              857 
#>                                                                          Macrophages_Mono_or_Neu 
#>                                                                                              130 
#>                                                                                      Mesenchymal 
#>                                                                                               76 
#>                                                                                      Mono_or_Neu 
#>                                                                                               35 
#>                                                                                      Neutrophils 
#>                                                                                               65 
#>                                                                                     Other_Immune 
#>                                                                                               75 
#>                                                             Other_Immune_Endothelial_Mesenchymal 
#>                                                                                               55 
#>                                                                                            Tregs 
#>                                                                                               31 
#>                                                                                           Tumour 
#>                                                                                               16 
#>                                                                                       unassigned 
#>                                                                                              335 
#>                                                                                     Unidentified 
#>                                                                                                4 
#>                                                                              Unidentified_Tumour 
#>                                                                                                4

newCellTypes = case_when(
  classifiedCellTypes == "Keratin_Tumour_Unidentified_Other_Immune_Endothelial_Mesenchymal_Tumour" ~ "Tumour_parent",
  classifiedCellTypes == "Other_Immune_Endothelial_Mesenchymal" ~ "Structural_parent",
  classifiedCellTypes == "Macrophages_Mono_or_Neu" ~ "Monocyte_subparent",
  classifiedCellTypes == "DC_or_Mono_Macrophages_Mono_or_Neu" ~ "Monocyte_parent",
  classifiedCellTypes == "dn_T_CD3_B_cell_CD4_T_cell_DC_or_Mono_Macrophages_CD8_T_cell_Mono_or_Neu_Neutrophils_NK_DC_Tregs" ~ "Immune_parent",
  classifiedCellTypes == "dn_T_CD3_CD4_T_cell_CD8_T_cell" ~ "Tcell_parent",
  classifiedCellTypes == "Endothelial_Mesenchymal" ~ "Structural_parent",
  classifiedCellTypes == "dn_T_CD3_B_cell_CD4_T_cell_CD8_T_cell" ~ "Lymphocytes_parent",
  TRUE ~ classifiedCellTypes
)

kerenSPE_test$classifiedCellTypes = newCellTypes


(kerenSPE_test$cellType == kerenSPE_test$classifiedCellTypes) |> 
  mean()

#> [1] 0.6877543

We can look at the distribution of cells in an image. Here we compare our clusters and classified cell types to the cell types in the dataset.

reducedDim(kerenSPE, "spatialCoords") <- spatialCoords(kerenSPE)

kerenSPE |> 
  filter(imageID == "5") |> 
  plotReducedDim("spatialCoords", colour_by = "clusters") +
  ggtitle("Our clusters")

reducedDim(kerenSPE_test, "spatialCoords") = spatialCoords(kerenSPE_test)

# Filter out the "parent" classified cells
kerenSPE_test |> 
  filter(classifiedCellTypes %in% unique(kerenSPE$cellType)) |> 
  plotReducedDim("spatialCoords", colour_by = "classifiedCellTypes") +
  ggtitle("Classified cell types")

kerenSPE |> 
  filter(imageID == "5") |> 
  plotReducedDim("spatialCoords", colour_by = "cellType") +
  ggtitle("Original cell types")

# Comparing orignal cell types to clustering and classification cell types. 
regionMap(kerenSPE, cellType = "clusters", region = "cellType") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  ggtitle("Original vs Clustered") +
  labs(x = "Original cell labels", y = "Clustering cell labels")

regionMap(kerenSPE_test, cellType = "classifiedCellTypes", region = "cellType") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  ggtitle("Original vs Classified") +
  labs(x = "Original cell labels", y = "Classified cell labels")

Questions

Discuss amongst your peers, why might there be differences between the clusters, the classified cell types and the original cell type labels in the dataset?

Module 5: Single sample analysis

In this module we focus on examining gene expression patterns in a single image, from the SeqFISH mouse embryo dataset.

Here we will ask which gene patterns we observe to be changing across the spe$gutRegion cell type in space. Note that we want to assess the anatomical region corresponding to the anterior end of the developing gut developing brain so we will first subset the cells using the spatial coordinates. We can check what we have selected by plotting.

# Filter cells which are labeled gut tub, and in the anterior section.
spe$gutRegion <- spe$celltype_mapped_refined == "Gut tube" &
  reducedDim(spe, "spatialCoords")[,1] < -0.5

plotReducedDim(spe, "spatialCoords", colour_by = "gutRegion") + 
  coord_fixed() + 
  scale_colour_manual(values = c("TRUE" = "red", "FALSE" = "grey"))

Let’s subset the data to only these cells and continue with our scHOT analysis.

spe_gut <- spe[,spe$gutRegion]
spe_gut

#> class: SpatialExperiment 
#> dim: 351 472 
#> metadata(0):
#> assays(3): counts molecules logcounts
#> rownames(351): Abcc4 Acp5 ... Zfp57 Zic3
#> rowData names(1): gene_name
#> colnames(472): embryo1_Pos3_cell377_z2 embryo1_Pos3_cell388_z2 ...
#>   embryo1_Pos27_cell74_z2 embryo1_Pos28_cell373_z2
#> colData names(16): cell_id embryo ... sizeFactor gutRegion
#> spatialCoords names(2) : x y
#> imgData names(1): sample_id

We select genes with at least some proportion of expressed cells for testing, and create the scHOT object.

hist(rowMeans(counts(spe_gut)>0), 40)

gene_to_test <- as.matrix(c(rownames(spe_gut[rowMeans(counts(spe_gut)>0) > 0.2,])))
length(gene_to_test)

#> [1] 165

rownames(gene_to_test) <- apply(gene_to_test, 1, paste0, collapse = "_")
head(gene_to_test)

#>         [,1]     
#> Acvr1   "Acvr1"  
#> Acvr2a  "Acvr2a" 
#> Ahnak   "Ahnak"  
#> Akr1c19 "Akr1c19"
#> Aldh1a2 "Aldh1a2"
#> Aldh2   "Aldh2"

scHOT_spatial <- scHOT_buildFromSCE(spe_gut,
                                    assayName = "logcounts",
                                    positionType = "spatial",
                                    positionColData = c("x_global_affine", "y_global_affine"))

scHOT_spatial

#> class: scHOT 
#> dim: 351 472 
#> metadata(0):
#> assays(1): expression
#> rownames(351): Abcc4 Acp5 ... Zfp57 Zic3
#> rowData names(0):
#> colnames(472): embryo1_Pos3_cell377_z2 embryo1_Pos3_cell388_z2 ...
#>   embryo1_Pos27_cell74_z2 embryo1_Pos28_cell373_z2
#> colData names(16): cell_id embryo ... sizeFactor gutRegion
#> testingScaffold dim: 0 0 
#> weightMatrix dim: 0 0 
#> scHOT_output colnames (0):
#> param names (0):
#> position type: spatial

We now add the testing scaffold to the scHOT object, and set the local weight matrix for testing, with a choice of span of 0.1 (the proportion of cells to weight around each cell). We can speed up computation by not requiring the weight matrix correspond to every individual cell, but instead a random selection among all the cells using the thin function.

scHOT_spatial <- scHOT_addTestingScaffold(scHOT_spatial, gene_to_test)
head(scHOT_spatial@testingScaffold)

#>         gene_1   
#> Acvr1   "Acvr1"  
#> Acvr2a  "Acvr2a" 
#> Ahnak   "Ahnak"  
#> Akr1c19 "Akr1c19"
#> Aldh1a2 "Aldh1a2"
#> Aldh2   "Aldh2"

scHOT_spatial <- scHOT_setWeightMatrix(scHOT_spatial, span = 0.2)
scHOT_spatial@weightMatrix <- thin(scHOT_spatial@weightMatrix, n = 50)

dim(slot(scHOT_spatial, "weightMatrix"))

#> [1]  53 472

For a given cell we can visually examine the local weight given by the span parameter.

cellID = 10

df <- cbind(as.data.frame(colData(scHOT_spatial)),
      W = slot(scHOT_spatial, "weightMatrix")[cellID,])

ggplot(df,
       aes(x = x_global_affine, y = -y_global_affine)) +
  geom_point(aes(colour = W, size = W)) +
  scale_colour_gradient(low = "black", high = "purple") +
  scale_size_continuous(range = c(0.5,2.5)) +
  theme_classic() +
  guides(colour = guide_legend(title = "Spatial Weight"),
         size = guide_legend(title = "Spatial Weight")) +
  ggtitle(paste0("Central cell: ", cellID)) + 
  coord_fixed() +
  NULL

Question

How will the results change if the span is increased/decreased?

## Make associated changes to the code to test out the question above.

We set the higher order function as the weighted mean function, and then calculate the observed higher order test statistics. This may take around 10 seconds.

scHOT_spatial <- scHOT_calculateGlobalHigherOrderFunction(
    scHOT_spatial,
    higherOrderFunction = weightedMean,
    higherOrderFunctionType = "weighted")

slot(scHOT_spatial, "scHOT_output")

#> DataFrame with 165 rows and 2 columns
#>              gene_1 globalHigherOrderFunction
#>         <character>                  <matrix>
#> Acvr1         Acvr1                  0.216666
#> Acvr2a       Acvr2a                  0.375776
#> Ahnak         Ahnak                  0.976418
#> Akr1c19     Akr1c19                  0.744070
#> Aldh1a2     Aldh1a2                  0.245981
#> ...             ...                       ...
#> Wnt5a         Wnt5a                  0.335820
#> Wnt5b         Wnt5b                  0.220300
#> Xist           Xist                  1.162241
#> Zfp444       Zfp444                  0.744082
#> Zfp57         Zfp57                  0.595519

scHOT_spatial <- scHOT_calculateHigherOrderTestStatistics(
    scHOT_spatial, na.rm = TRUE)

Now we can plot the overall mean versus the scHOT statistic to observe any relationship. Labels can be interactively visualised using ggplotly. Some genes may have different distributions so we turn to permutation testing to assess statistical significance.

g <- ggplot(as.data.frame(scHOT_spatial@scHOT_output), 
           aes(x = globalHigherOrderFunction, y = higherOrderStatistic, label = gene_1)) + 
  xlab("Mean across all cells") +
  ylab("scHOT statistic for local weightedMean") +
  geom_point()
g

ggplotly(g)

Set up the permutation testing schema. For the purposes of this workshop we set a low number of permutations over a low number of genes in the testing scaffold, you may want to change this as you work through the workshop yourself. The testing will take a few minutes to run, here with the parallel parameters that were set at the beginning of this document.

scHOT_spatial <- scHOT_setPermutationScaffold(scHOT_spatial,
                                              numberPermutations = 50,
                                              numberScaffold = 30)

scHOT_spatial <- scHOT_performPermutationTest(
    scHOT_spatial,
    verbose = TRUE,
    parallel = FALSE)

#> Permutation testing combination 80 of 165...

slot(scHOT_spatial, "scHOT_output")

#> DataFrame with 165 rows and 9 columns
#>              gene_1 globalHigherOrderFunction            higherOrderSequence
#>         <character>                  <matrix>                  <NumericList>
#> Acvr1         Acvr1                  0.216666 0.251205,0.275076,0.286668,...
#> Acvr2a       Acvr2a                  0.375776 0.398236,0.376223,0.361763,...
#> Ahnak         Ahnak                  0.976418    1.23931,1.22101,1.19278,...
#> Akr1c19     Akr1c19                  0.744070 0.681732,0.622183,0.625407,...
#> Aldh1a2     Aldh1a2                  0.245981 0.117491,0.118105,0.121221,...
#> ...             ...                       ...                            ...
#> Wnt5a         Wnt5a                  0.335820 0.282418,0.280240,0.268180,...
#> Wnt5b         Wnt5b                  0.220300 0.262440,0.321449,0.368172,...
#> Xist           Xist                  1.162241    1.18893,1.17123,1.18238,...
#> Zfp444       Zfp444                  0.744082 0.529888,0.531771,0.538540,...
#> Zfp57         Zfp57                  0.595519 0.853046,0.844188,0.838651,...
#>         higherOrderStatistic numberPermutations storePermutations
#>                    <numeric>          <numeric>         <logical>
#> Acvr1              0.0750954                  0              TRUE
#> Acvr2a             0.0665143                  0              TRUE
#> Ahnak              0.3319897                  0              TRUE
#> Akr1c19            0.1673342                  0              TRUE
#> Aldh1a2            0.1827836                 50              TRUE
#> ...                      ...                ...               ...
#> Wnt5a               0.172300                  0              TRUE
#> Wnt5b               0.104924                  0              TRUE
#> Xist                0.120828                 50              TRUE
#> Zfp444              0.118930                  0              TRUE
#> Zfp57               0.130128                  0              TRUE
#>                              permutations pvalPermutations FDRPermutations
#>                             <NumericList>        <numeric>       <numeric>
#> Acvr1                                  NA               NA              NA
#> Acvr2a                                 NA               NA              NA
#> Ahnak                                  NA               NA              NA
#> Akr1c19                                NA               NA              NA
#> Aldh1a2 0.0482154,0.0593701,0.0513777,...        0.0196078       0.0231579
#> ...                                   ...              ...             ...
#> Wnt5a                                  NA               NA              NA
#> Wnt5b                                  NA               NA              NA
#> Xist       0.147309,0.134604,0.226559,...             0.46            0.46
#> Zfp444                                 NA               NA              NA
#> Zfp57                                  NA               NA              NA

After the permutation test we can estimate the P-values across all genes.

scHOT_plotPermutationDistributions(scHOT_spatial)

scHOT_spatial <- scHOT_estimatePvalues(scHOT_spatial,
                                       nperm_estimate = 100,
                                       maxDist = 0.1)
slot(scHOT_spatial, "scHOT_output")

#> DataFrame with 165 rows and 14 columns
#>              gene_1 globalHigherOrderFunction            higherOrderSequence
#>         <character>                  <matrix>                  <NumericList>
#> Acvr1         Acvr1                  0.216666 0.251205,0.275076,0.286668,...
#> Acvr2a       Acvr2a                  0.375776 0.398236,0.376223,0.361763,...
#> Ahnak         Ahnak                  0.976418    1.23931,1.22101,1.19278,...
#> Akr1c19     Akr1c19                  0.744070 0.681732,0.622183,0.625407,...
#> Aldh1a2     Aldh1a2                  0.245981 0.117491,0.118105,0.121221,...
#> ...             ...                       ...                            ...
#> Wnt5a         Wnt5a                  0.335820 0.282418,0.280240,0.268180,...
#> Wnt5b         Wnt5b                  0.220300 0.262440,0.321449,0.368172,...
#> Xist           Xist                  1.162241    1.18893,1.17123,1.18238,...
#> Zfp444       Zfp444                  0.744082 0.529888,0.531771,0.538540,...
#> Zfp57         Zfp57                  0.595519 0.853046,0.844188,0.838651,...
#>         higherOrderStatistic numberPermutations storePermutations
#>                    <numeric>          <numeric>         <logical>
#> Acvr1              0.0750954                  0              TRUE
#> Acvr2a             0.0665143                  0              TRUE
#> Ahnak              0.3319897                  0              TRUE
#> Akr1c19            0.1673342                  0              TRUE
#> Aldh1a2            0.1827836                 50              TRUE
#> ...                      ...                ...               ...
#> Wnt5a               0.172300                  0              TRUE
#> Wnt5b               0.104924                  0              TRUE
#> Xist                0.120828                 50              TRUE
#> Zfp444              0.118930                  0              TRUE
#> Zfp57               0.130128                  0              TRUE
#>                              permutations pvalPermutations FDRPermutations
#>                             <NumericList>        <numeric>       <numeric>
#> Acvr1                                  NA               NA              NA
#> Acvr2a                                 NA               NA              NA
#> Ahnak                                  NA               NA              NA
#> Akr1c19                                NA               NA              NA
#> Aldh1a2 0.0482154,0.0593701,0.0513777,...        0.0196078       0.0231579
#> ...                                   ...              ...             ...
#> Wnt5a                                  NA               NA              NA
#> Wnt5b                                  NA               NA              NA
#> Xist       0.147309,0.134604,0.226559,...             0.46            0.46
#> Zfp444                                 NA               NA              NA
#> Zfp57                                  NA               NA              NA
#>         numberPermutationsEstimated globalLowerRangeEstimated
#>                           <integer>                 <numeric>
#> Acvr1                           150                  0.209573
#> Acvr2a                          100                  0.286168
#> Ahnak                            50                  0.922912
#> Akr1c19                          50                  0.767661
#> Aldh1a2                         150                  0.209573
#> ...                             ...                       ...
#> Wnt5a                           150                  0.245981
#> Wnt5b                           150                  0.209573
#> Xist                            200                  1.092607
#> Zfp444                           50                  0.767661
#> Zfp57                          1100                  0.209573
#>         globalUpperRangeEstimated pvalEstimated FDREstimated
#>                         <numeric>     <numeric>    <numeric>
#> Acvr1                    0.286168    0.05333333    0.0698413
#> Acvr2a                   0.416245    0.29000000    0.3277397
#> Ahnak                    0.922912    0.01960784    0.0272727
#> Akr1c19                  0.767661    0.01960784    0.0272727
#> Aldh1a2                  0.286168    0.00662252    0.0196429
#> ...                           ...           ...          ...
#> Wnt5a                    0.416245    0.00662252    0.0196429
#> Wnt5b                    0.286168    0.00662252    0.0196429
#> Xist                     1.163058    0.31000000    0.3456081
#> Zfp444                   0.767661    0.16000000    0.1885714
#> Zfp57                    3.171420    0.15090909    0.1791367

We can now examine the spatial expression of the 5 most significant genes, both in our scHOT object and over our original spe object.

output_sorted <- slot(scHOT_spatial, "scHOT_output")[order(slot(scHOT_spatial,
                                                                "scHOT_output")$pvalEstimated),]
topgenes <- rownames(output_sorted)[1:5]

reducedDim(scHOT_spatial, "spatialCoords") <- reducedDim(spe, "spatialCoords")[colnames(scHOT_spatial),]

for (topgene in topgenes) {
  g_spe <- plotReducedDim(spe, "spatialCoords", colour_by = c(topgene), point_size = 1) +
    coord_fixed()
  
  g_scHOT <- plotReducedDim(scHOT_spatial, "spatialCoords", colour_by = c(topgene), point_size = 1,
                           by_exprs_values = "expression") +
    coord_fixed()
  
  g_all <- g_scHOT + g_spe
  print(g_all)
}

Here we are noting the genes that are found to have the most statistically significant spatial variation in their local mean expression. These genes point to specific patterns that govern the development of individual parts of the gut tube.

Advanced/Extended Questions

How would you perform such testing over multiple distinct samples?
scHOT is developed with all higher order testing in mind, use the associated vignette to get towards assessing changes in variation or correlation structure in space.

## try some code

Module 6: Multi-sample analysis

The following module will examine how we can extract different spatial and molecular features from multi-sample datasets. These features can be used to extract some biological understanding. Here we use the MIBI-TOF breast cancer dataset.

Can I measure if two cell types are co-localised?

Kontextual is a part of the Statial package which models spatial relationships between cells in the context of hierarchical cell lineage structures. By assessing spatial relationships between pairs of cells in the context of other related cell types, Kontextual provides robust quantification of cell type relationships which are invariant to changes in tissue structure.

For the purposes of using Kontextual we define cell functions as identified clusters of cells, where larger clusters represent a “parent” cell population, and finer sub-clusters representing a “child” cell population with a particular function. For example, CD4+ T cells have a highly specified function, and may be considered a child to a larger parent population of T cells. Kontextual thus aims to quantify how the localisation patterns of a child population of cells deviate from the spatial behaviour of their parent population, and how their cellular function defines their spatial localisation.

A key input for Kontextual is an annotation of cell type hierarchies. We will need these to organise all the cells present into cell state populations or clusters, e.g. all the different B cell types are put in a vector called bcells.

To make our lives easier, we will start by defining these here. I’m happy to talk about how we use our bioconductor package treekoR to define these hierarchies in a data driven way.

# Set up cell populations

tumour <- c("Keratin_Tumour", "Tumour")

bcells <- c("B_cell")
tcells <- c("dn_T_cell", "CD4_T_cell", "CD8_T_cell", "Tregs")
myeloid <- c("Dc_or_Mono", "DC", "Mono_or_Neu", "Macrophages", "Other_Immune", "Neutrophils")

endothelial <- c("Endothelial")
mesenchymal <- c("Mesenchymal")

tissue <- c(endothelial, mesenchymal)
immune <- c(bcells, tcells, myeloid, "NK") 

all <- c(tumour, tissue, immune, "Unidentified")

Here we examine an image highlighted in the Keren et al. 2018 manuscript where the relationship between two cell types depends on a parent cell population. In image 6 of the Keren et al. dataset, we can see that p53+ tumour cells and immune cells are dispersed. However when the behaviour of p53+ tumour cells are placed in the context of the spatial behaviour of its broader parent population tumour cells, p53+ tumour cells and immune would appear localised.

# Lets define a new cell type vector
kerenSPE$cellTypeNew <- kerenSPE$cellType

# Select for all cells that express higher than baseline level of p53
p53Pos = assay(kerenSPE)["p53",] > -0.300460

# Find p53+ tumour cells
kerenSPE$cellTypeNew[kerenSPE$cellType %in% tumour] <- "Tumour"
kerenSPE$cellTypeNew[p53Pos & kerenSPE$cellType %in% tumour] <- "p53_Tumour"

#Group all immune cells under the name "Immune"

kerenSPE$cellTypeNew[kerenSPE$cellType %in% immune] <- "Immune"


# Plot image 6

kerenSPE |>
  colData() |>
  as.data.frame() |>
  filter(imageID == "6") |>
  filter(cellTypeNew %in% c("Immune", "Tumour", "p53_Tumour")) |>
  arrange(cellTypeNew) |>
  ggplot(aes(x = x, y = y, color = cellTypeNew)) +
  geom_point(size = 1) +
  scale_colour_manual(values = c("#505050", "#64BC46","#D6D6D6")) + guides(colour = guide_legend(title = "Cell types", override.aes = list(size=3)))

The Kontextual function accepts a SingleCellExperiment object, or a single image, or list of images from a SingleCellExperiment object, this gets passed into the cells argument. The two cell types which will be evaluated are specified in the to and from arguments. A parent population must also be specified in the parent argument, note the parent cell population must include the to cell type. The argument r will specify the radius which the cell relationship will be evaluated on. Kontextual supports parallel processing, the number of cores can be specified using the cores argument. Kontextual can take a single value or multiple values for each argument and will test all combinations of the arguments specified.

We can calculate these relationships for a single radius.

p53_Kontextual <- Kontextual(
  cells = kerenSPE,
  image = 6,
  r = 100,
  from = "Immune",
  to = "p53_Tumour",
  parent = c("p53_Tumour", "Tumour"),
  cellType = "cellTypeNew"
)

p53_Kontextual

#>   imageID               test  original kontextual   r weightQuantile inhom edge
#> 1       6 Immune__p53_Tumour -23.49524  -1.856852 100            0.8 FALSE TRUE
#>   includeZeroCells window window.length
#> 1             TRUE convex            NA

The kontextCurve calculates the L-function value and Kontextual values over a range of radii. While kontextPlot plots these values. If the points lie above the red line (expected pattern) then localisation is indicated for that radius, if the points lie below the red line then dispersion is indicated. As seen in the following plot Kontextual is able to correctly identify localisation between p53+ tumour cells and immune cells in the example image for a certain range of radii. The original L-function is not able to identify localisation at any value of radii.

curves <- kontextCurve(
  cells = kerenSPE,
  image = "6",
  from = "Immune",
  to = "p53_Tumour",
  parent = c("p53_Tumour", "Tumour"),
  rs = seq(10, 510, 100),
  cellType = "cellTypeNew",
  cores = nCores
)

kontextPlot(curves)

Alternatively all pairwise cell relationships and their corresponding parents in the dataset can be tested. A data frame with all pairwise combinations can be creating using the parentCombinations function. This function takes in a vector of all the cells, as well as all the parent vectors set up earlier. As shown below the output is a data frame specifying the to, from, and parent arguments for Kontextual.

# Get all relationships between cell types and their parents
parentDf <- parentCombinations(
  all = all,
  tumour,
  bcells,
  tcells,
  myeloid,
  endothelial,
  mesenchymal,
  tissue,
  immune
)

Rather than specifying to, from, and parent in Kontextual, the output from parentCombinations can be input into Kontextual using the parentDf argument, to examine all pairwise relationships in the dataset. This chunk will take a significant amount of time to run (~20 minutes), for demonstration the results have been saved and are loaded in.

# Running Kontextual on all relationships across all images.
kerenKontextual <- Kontextual(
  cells = kerenSPE,
  parentDf = parentDf,
  r = 50,
  cores = nCores
)

load(system.file("extdata", "kerenKontextual.rda", package = "ScdneySpatial"))

To examine whether the features obtained from Statial are associated with patient outcomes or groupings, we can use the colTest function from SpicyR. To understand if survival outcomes differ significantly between 2 patient groups, specify type = "survival" in colTest. Here we examine which features are most associated with patient survival using the Kontextual values as an example. To do so, survival data is extracted from kerenSPE and converted into the survival object kerenSurv.

# Extracting survival data
survData = kerenSPE |>
    colData() |> 
    data.frame() |> 
    select(imageID, Survival_days_capped., event, RECURRENCE_LABEL) |> 
    unique()

# Creating survival vector
kerenSurv = Surv(survData$Survival_days_capped, survData$event)
names(kerenSurv) = survData$imageID

In addition to this, the Kontextual results must be converted from a data.frame to a wide matrix, this can be done using prepMatrix. Note, to extract the original L-function values, specify column = "original" in prepMatrix.

# Converting Kontextual result into data matrix
kontextMat = prepMatrix(kerenKontextual)

# Ensuring rownames of kontextMat match up with rownames of the survival vector 
kontextMat = kontextMat[names(kerenSurv), ]

# Replace NAs with 0
kontextMat[is.na(kontextMat )] <- 0

Finally, both the Kontextual matrix and survival object are passed into colTest, with type = "survival" to obtain the survival results.

# Running survival analysis
survivalResults = spicyR::colTest(kontextMat, kerenSurv, type = "survival")


head(survivalResults)

#>                                      coef se.coef   pval adjPval
#> CD8_T_cell__Keratin_Tumour__immune -0.150   0.043 0.0006    0.21
#> CD4_T_cell__CD8_T_cell__tcells     -0.054   0.019 0.0039    0.51
#> CD4_T_cell__CD8_T_cell__immune     -0.040   0.015 0.0094    0.51
#> Endothelial__Mono_or_Neu__tissue   -0.031   0.012 0.0110    0.51
#> Other_Immune__NK__myeloid           0.045   0.018 0.0120    0.51
#> Endothelial__CD8_T_cell__tissue    -0.033   0.013 0.0120    0.51
#>                                                               cluster
#> CD8_T_cell__Keratin_Tumour__immune CD8_T_cell__Keratin_Tumour__immune
#> CD4_T_cell__CD8_T_cell__tcells         CD4_T_cell__CD8_T_cell__tcells
#> CD4_T_cell__CD8_T_cell__immune         CD4_T_cell__CD8_T_cell__immune
#> Endothelial__Mono_or_Neu__tissue     Endothelial__Mono_or_Neu__tissue
#> Other_Immune__NK__myeloid                   Other_Immune__NK__myeloid
#> Endothelial__CD8_T_cell__tissue       Endothelial__CD8_T_cell__tissue

As we can see from the results CD8_T_cell__Keratin_Tumour__immune is the most significant pairwise relationship which contributes to patient survival. That is the relationship between CD8 T cells and Keratin+ Tumour cells, relative to the parent population of all immune cells. We can see that there is a negative coefficient associated with this relationship, which tells us a decrease in localisation of CD8_T_cell and Keratin+ Tumour leads to poorer survival outcomes for patients.

We can visualise this association sing a Kaplan-Meier curve. We must first extract the Kontextual values of this relationship across all images. Next we determine if CD8 T cells and Keratin+ Tumours are relatively attracted or avoiding in each image, by comparing the Kontextual value in each image to the median Kontextual value. Finally we plot the Kaplan-Meier curve using the ggsurvfit package.

As shown below, when CD8 T cells and Keratin+ Tumours are relatively more dispersed to one another, patients tend to have worse survival outcomes.

# Selecting most significant relationship
survRelationship = kontextMat[["CD8_T_cell__Keratin_Tumour__immune"]]
survRelationship = ifelse(survRelationship > median(survRelationship), "Localised", "Dispersed")
    
# Plotting Kaplan-Meier curve
survfit2(kerenSurv ~ survRelationship) |>
    ggsurvfit() +
    add_pvalue() +
    ggtitle("CD8_T_cell__Keratin_Tumour__immune")

Are there changes in cell states associated with cell localisation?

Changes in cell phenotype can be analytically framed as the change in abundance of a gene or protein within a particular cell type. We can analytically determine whether continuous changes occur to a cell’s proteomic or transcriptomic signature as changes occur in its spatial proximity to another cell type. In the figures below we see the expression of a marker increased in cell type A as it grows closer in spatial proximity to cell type B. This can then be quantified with a scatterplot to determine statistical significance. In the next section of this workshop, we will be exploring the analytical functionalities of Statial which can uncover these continuous changes in cell state.

ctsCombined

The first step in analysing these changes is to calculate the spatial proximity (getDistances) and abundance (getAbundances) of each cell to every cell type. These values will then be stored in the reducedDims slot of the SpatialExperiment object under the names distances and abundances respectively.

kerenSPE <- getDistances(kerenSPE,
                    maxDist = 200)

kerenSPE <- getAbundances(kerenSPE,
                     r = 50)

reducedDim(kerenSPE, "abundances")[is.na(reducedDim(kerenSPE, "abundances"))] <- 0

First, let’s examine the same effect observed earlier with Kontextual. To avoid redefining cell types we’ll examine the distance between p53-positive tumour cells and macrophages in the context of total keratin/tumour cells for image 6.

Statial provides two main functions to assess this relationship - calcStateChanges and plotStateChanges. We can use calcStateChanges to examine the relationship between 2 cell types for 1 marker in a specific image. Similar to Kontextual, we can specify the two cell types with the to and from arguments, and the marker of interest with the marker argument. We can appreciate that the fdr statistic for this relationship is significant, and with a negative coef, or coefficient value, indicating that the expression of p53 in keratin/tumour cells decreases as distance from macrophages increases.

stateChanges <- calcStateChanges(
  cells = kerenSPE,
  type = "distances",
  image = "6",
  from = "Keratin_Tumour",
  to = "Macrophages",
  marker = "p53")

stateChanges

#>   imageID primaryCellType otherCellType marker         coef      tval
#> 1       6  Keratin_Tumour   Macrophages    p53 -0.001402178 -7.010113
#>           pval          fdr
#> 1 2.868257e-12 2.868257e-12

Statial provides a convenient function for visualising this relationship - plotStateChanges. Similar to Kontextual and calcStateChanges, we can specify the cell types to be evaluated with the to and from arguments and the marker of interest with marker.

Through this analysis, we can observe that keratin/tumour cells closer to a group of macrophages tend to have higher expression of p53, as observed in the first graph. This relationship is quantified with the second graph, showing an overall decrease of p53 expression in keratin/tumour cells as distance to macrophages increase.

These results allow us to essentially arrive at the same result as Kontextual, which calculated a localisation between p53+ keratin/tumour cells and macrophages in the wider context of keratin/tumour cells.

p <- plotStateChanges(
  cells = kerenSPE,
  type = "distances",
  image = "6",
  from = "Keratin_Tumour",
  to = "Macrophages",
  marker = "p53",
  size = 1,
  shape = 19,
  interactive = FALSE,
  plotModelFit = FALSE,
  method = "lm")

p$image

p$scatter + 
  labs(y = "p53 expression in Keratin_Tumour",
       x = "Distance of Keratin_Tumour to Macrophages")

Question

What information does this form of analysis provide that Kontextual does not?
Is this observation of localisation consistent across images?
Can you find an interaction where the coefficient is positive? i.e. marker expression in the to cell type rises as distances increases from the from cell type.

Beyond looking at single cell-to-cell interactions for a single image, we can also look at all interactions across all images. The calcStateChanges function provided by Statial can be expanded for this exact purpose - by not specifying cell types, a marker, or an image, calcStateChanges will examine the most significant correlations between distance and marker expression across the entire dataset. Here, we’ve calculated all state changes across all images in case you would like to have a play, but first we’ll be taking a closer examination at the most significant interactions found within image 6 of the Keren et al. dataset.

stateChanges <- calcStateChanges(
  cells = kerenSPE,
  type = "distances",
  minCells = 100)

stateChanges |> 
  filter(imageID == 6) |>
  head(n = 10)

#>    imageID primaryCellType otherCellType       marker         coef      tval
#> 1        6  Keratin_Tumour  Unidentified           Na  0.004218419  25.03039
#> 2        6  Keratin_Tumour   Macrophages  HLA_Class_1 -0.003823497 -24.69629
#> 3        6  Keratin_Tumour    CD4_T_cell  HLA_Class_1 -0.003582774 -23.87797
#> 4        6  Keratin_Tumour  Unidentified Beta.catenin  0.005893120  23.41953
#> 5        6  Keratin_Tumour    CD8_T_cell  HLA_Class_1 -0.003154544 -23.13804
#> 6        6  Keratin_Tumour    DC_or_Mono  HLA_Class_1 -0.003353834 -22.98944
#> 7        6  Keratin_Tumour      dn_T_CD3  HLA_Class_1 -0.003123446 -22.63197
#> 8        6  Keratin_Tumour        Tumour  HLA_Class_1  0.003684079  21.94265
#> 9        6  Keratin_Tumour    CD4_T_cell           Fe -0.003457338 -21.43550
#> 10       6  Keratin_Tumour    CD4_T_cell   phospho.S6 -0.002892457 -20.50767
#>             pval           fdr
#> 1  6.971648e-127 3.357294e-123
#> 2  7.814253e-124 3.386756e-120
#> 3  1.745242e-116 5.602971e-113
#> 4  1.917245e-112 5.730680e-109
#> 5  5.444541e-110 1.498225e-106
#> 6  1.053130e-108 2.808827e-105
#> 7  1.237988e-105 2.980851e-102
#> 8  8.188258e-100  1.689930e-96
#> 9   1.287478e-95  2.325010e-92
#> 10  3.928912e-88  5.821606e-85

In image 6, the majority of the top 10 most significant interactions occur between keratin/tumour cells and an immune population, and many of these interactions appear to involve the HLA class I ligand.

We can examine some of these interactions further with the plotStateChanges function. Taking a closer examination of the relationship between macrophages and keratin/tumour HLA class I expression, the plot below shows us a clear visual correlation - as macrophage density increases, keratin/tumour cells increase their expression HLA class I.

Biologically, HLA Class I is a ligand which exists on all nucleated cells, tasked with presenting internal cell antigens for recognition by the immune system, marking aberrant cells for destruction by either CD8+ T cells or NK cells.

p <- plotStateChanges(
  cells = kerenSPE,
  type = "distances",
  image = "6",
  from = "Keratin_Tumour",
  to = "Macrophages",
  marker = "HLA_Class_1",
  size = 1,
  shape = 19,
  interactive = FALSE,
  plotModelFit = FALSE,
  method = "lm")

p$image

p$scatter + 
  labs(y = "HLA_Class_1 expression in Keratin_Tumour",
       x = "Distance of Keratin_Tumour to Macrophages")

Now, we can take a look at the top 10 most significant results across all images.

stateChanges |> head(n = 10)

#>        imageID primaryCellType otherCellType     marker         coef
#> 69468       37     Endothelial        Tumour       Lag3 -0.001621517
#> 144255      11     Neutrophils            NK       CD56 -0.059936866
#> 16402       35      CD4_T_cell        B_cell       CD20 -0.029185750
#> 16498       35      CD4_T_cell    DC_or_Mono       CD20  0.019125946
#> 4891        35          B_cell    DC_or_Mono phospho.S6  0.005282065
#> 16507       35      CD4_T_cell    DC_or_Mono phospho.S6  0.004033218
#> 4885        35          B_cell    DC_or_Mono     HLA.DR  0.011120703
#> 5043        35          B_cell  Other_Immune          P  0.011182182
#> 16354       35      CD4_T_cell      dn_T_CD3       CD20  0.016349492
#> 4888        35          B_cell    DC_or_Mono     H3K9ac  0.005096632
#>                 tval          pval           fdr
#> 69468  -4.916884e+14  0.000000e+00  0.000000e+00
#> 144255 -2.172437e+15  0.000000e+00  0.000000e+00
#> 16402  -4.057355e+01 7.019343e-282 4.056315e-277
#> 16498   4.053436e+01 1.891267e-281 8.196894e-277
#> 4891    4.041385e+01 5.306590e-278 1.839933e-273
#> 16507   3.472882e+01 4.519947e-219 1.305986e-214
#> 4885    3.415344e+01 8.401034e-212 2.080612e-207
#> 5043    3.414375e+01 1.056403e-211 2.289265e-207
#> 16354   3.391901e+01 1.219488e-210 2.349045e-206
#> 4888    3.399856e+01 3.266533e-210 5.662959e-206

Immediately, we can appreciate that a couple of interactions appear a bit strange. One of the most significant interactions occurs between B cells and CD4 T cells, where CD4 T cells are found to increase in CD20 expression when in close proximity to B cells. Biologically, CD20 is a highly specific ligand for B cells, and under healthy circumstances are usually not expressed in T cells.

Could this potentially be an artefact of calcStateChanges? We can examine the image through the plotStateChanges function, where we indeed observe an apparent localisation between B cells and T cells.

Question

Are there any other interactions here that you think might not make biological sense?
Does the relationship between T cell CD20 expression and B cell proximity occur across images?
Why are the majority of most significant interactions occurring in image 35?

HINT: Configure the parameters of plotStateChanges to examine some these other significant interactions. Do they look like artefacts?

p <- plotStateChanges(
  cells = kerenSPE,
  type = "distances",
  image = "35",
  from = "CD4_T_cell",
  to = "B_cell",
  marker = "CD20",
  size = 1,
  shape = 19,
  interactive = FALSE,
  plotModelFit = FALSE,
  method = "lm")

p$image + 
  labs(y = "CD20 expression in CD4_T_cell",
       x = "Distance of B_cell to CD4_T_cell")

p$scatter

So why are T cells expressing CD20? This brings us to a key limitation of cell segmentation.

Contamination, or more specifically known as lateral marker spill over, is an issue that results in a cell’s marker expressions being wrongly attributed to another adjacent cell. This issue arises from incorrect segmentation where components of one cell are wrongly determined as belonging to another cell. Alternatively, this issue can arise when antibodies used to tag and measure marker expressions do not latch on properly to a cell of interest, thereby resulting in residual markers being wrongly assigned as belonging to a cell near the intended target cell. It is important that we either correct or account for this incorrect attribution of markers in our modelling process. This is critical in understanding whether significant cell-cell interactions detected are an artifact of technical measurement errors driven by spill over or are real biological changes that represent a shift in a cell’s state.

Contamination

To circumvent this problem, Statial provides a function that predicts the probability that a cell is any particular cell type - calcContamination. calcContamination returns a dataframe of probabilities demarcating the chance of a cell being any particular cell type. This dataframe is stored under contaminations in the reducedDim slot of the SpatialExperiment object. It also provides the rfMainCellProb column, which provides the probability that a cell is indeed the cell type it has been designated. E.g. For a cell designated as a CD8+ T cell, rfMainCellProb could give a 80% chance that the cell is indeed CD8+ T cell, due to contamination.

We can then introduce these probabilities as covariates into our linear model by setting contamination = TRUE as a parameter in our calcStateChanges function. However, this is not a perfect solution for the issue of contamination. As we can see, despite factoring in contamination into our linear model, the correlation between B cell density and CD20 expression in CD4+ T cells remains one of the most significant interactions in our model.

kerenSPE <- calcContamination(kerenSPE)

stateChangesCorrected <- calcStateChanges(
  cells = kerenSPE,
  type = "distances",
  minCells = 100,
  contamination = TRUE)

stateChangesCorrected |> head(n = 20)

#>        imageID primaryCellType otherCellType      marker         coef
#> 69468       37     Endothelial        Tumour        Lag3 -0.001621517
#> 144255      11     Neutrophils            NK        CD56 -0.059936866
#> 16402       35      CD4_T_cell        B_cell        CD20 -0.025083624
#> 16498       35      CD4_T_cell    DC_or_Mono        CD20  0.016225621
#> 4891        35          B_cell    DC_or_Mono  phospho.S6  0.004397271
#> 16354       35      CD4_T_cell      dn_T_CD3        CD20  0.013867647
#> 16507       35      CD4_T_cell    DC_or_Mono  phospho.S6  0.003630126
#> 16357       35      CD4_T_cell      dn_T_CD3      HLA.DR  0.010356673
#> 85540        3  Keratin_Tumour            DC          Ca -0.013939833
#> 3697        28          B_cell            NK          Na -0.004462312
#> 4741        35          B_cell      dn_T_CD3      HLA.DR  0.009056688
#> 80974       20  Keratin_Tumour        Tumour HLA_Class_1  0.002965555
#> 82078       23  Keratin_Tumour  Unidentified HLA_Class_1  0.003188059
#> 5083        35          B_cell  Other_Immune  phospho.S6  0.004790796
#> 4885        35          B_cell    DC_or_Mono      HLA.DR  0.008889339
#> 5043        35          B_cell  Other_Immune           P  0.008256807
#> 5080        35          B_cell  Other_Immune      H3K9ac  0.005061792
#> 81737       21  Keratin_Tumour            DC Pan.Keratin -0.005993976
#> 4747        35          B_cell      dn_T_CD3  phospho.S6  0.003767236
#> 16363       35      CD4_T_cell      dn_T_CD3  phospho.S6  0.002966570
#>                 tval          pval           fdr
#> 69468  -2.805747e+14  0.000000e+00  0.000000e+00
#> 144255 -2.265619e+15  0.000000e+00  0.000000e+00
#> 16402  -3.527286e+01 1.471188e-224 8.501655e-220
#> 16498   3.433303e+01 9.709612e-215 4.208219e-210
#> 4891    3.070953e+01 5.072179e-177 1.758656e-172
#> 16354   3.022328e+01 3.883378e-173 1.122057e-168
#> 16507   2.993728e+01 2.466479e-170 6.108516e-166
#> 16357   2.926312e+01 8.819909e-164 1.911307e-159
#> 85540  -2.961537e+01 7.578873e-163 1.459885e-158
#> 3697   -2.885743e+01 5.843841e-156 1.013106e-151
#> 4741    2.645733e+01 7.958696e-137 1.254312e-132
#> 80974   2.595077e+01 1.410980e-135 2.038431e-131
#> 82078   2.585593e+01 5.728680e-135 7.639547e-131
#> 5083    2.612732e+01 7.639758e-134 9.460367e-130
#> 4885    2.577910e+01 1.014405e-130 1.172402e-126
#> 5043    2.570769e+01 4.402113e-130 4.769772e-126
#> 5080    2.539252e+01 2.781691e-127 2.836719e-123
#> 81737  -2.470084e+01 8.985348e-127 8.654038e-123
#> 4747    2.518766e+01 1.793125e-125 1.636114e-121
#> 16363   2.512759e+01 2.089201e-125 1.810951e-121

However, this does not mean factoring in contamination into our linear model was ineffective. In general, cell type specific markers such as CD68, CD45, and CD20 should not change in cells they are not specific to. Therefore, relationships detected to be significant involving these cell type markers are likely false positives and will be treated as such for the purposes of evaluation.

Plotting the relationship between false positives and true positives, we’d expect the contamination correction to be greatest in relationships which are detected to be more significant.

cellTypeMarkers <- c("CD3", "CD4", "CD8", "CD56", "CD11c", "CD68", "CD45", "CD20")

values = c("blue", "red")
names(values) <- c("None", "Corrected")

df <- rbind(data.frame(TP =cumsum(stateChanges$marker %in% cellTypeMarkers), FP = cumsum(!stateChanges$marker %in% cellTypeMarkers), type = "None"),
            data.frame(TP =cumsum(stateChangesCorrected$marker %in% cellTypeMarkers), FP = cumsum(!stateChangesCorrected$marker %in% cellTypeMarkers), type = "Corrected"))

ggplot(df, aes(x = TP, y = FP, colour = type)) + geom_line()+ labs(y = "Cell state marker", x = "Cell type marker") + scale_colour_manual(values = values)

Here, we zoom in on the ROC curve where the top 100 lowest p values occur, where we indeed see more true positives than false positives with contamination correction.

ggplot(df, aes(x = TP, y = FP, colour = type)) + geom_line()+ xlim(0,100) + ylim(0,1000)+ labs(y = "Cell state marker", x = "Cell type marker") + scale_colour_manual(values = values)

Question

What can we conclude from the above ROC graphs?

Similiar to Kontextual, we can run a similar survival analysis using our state changes results. Here, prepMatrix extracts the coefficients, or the coef column of stateChanges by default. To use the t values instead, specify column = "tval" in the prepMatrix function.

# Preparing features for Statial
stateMat <- prepMatrix(stateChanges)

# Ensuring rownames of stateMat match up with rownames of the survival vector
stateMat <- stateMat[names(kerenSurv), ]

# Remove some very small values
stateMat <- stateMat[,colMeans(abs(stateMat)>0.0001)>.8]

survivalResults <- colTest(stateMat, kerenSurv, type = "survival")

head(survivalResults)

#>                                          coef se.coef    pval adjPval
#> Macrophages__CD4_T_cell__CD138            440     110 8.1e-05   0.054
#> Macrophages__Other_Immune__HLA_Class_1   -500     170 2.9e-03   0.620
#> Keratin_Tumour__Mesenchymal__dsDNA       -930     330 4.9e-03   0.620
#> Keratin_Tumour__CD8_T_cell__Keratin6     -220      80 5.2e-03   0.620
#> Keratin_Tumour__Mono_or_Neu__Pan.Keratin -260      94 6.0e-03   0.620
#> Macrophages__Other_Immune__CD4           -480     180 7.6e-03   0.620
#>                                                                           cluster
#> Macrophages__CD4_T_cell__CD138                     Macrophages__CD4_T_cell__CD138
#> Macrophages__Other_Immune__HLA_Class_1     Macrophages__Other_Immune__HLA_Class_1
#> Keratin_Tumour__Mesenchymal__dsDNA             Keratin_Tumour__Mesenchymal__dsDNA
#> Keratin_Tumour__CD8_T_cell__Keratin6         Keratin_Tumour__CD8_T_cell__Keratin6
#> Keratin_Tumour__Mono_or_Neu__Pan.Keratin Keratin_Tumour__Mono_or_Neu__Pan.Keratin
#> Macrophages__Other_Immune__CD4                     Macrophages__Other_Immune__CD4

Question

Are the survival results the same for both the distance and abundance metrics?
Do these survival results change with contamination correction?
Why might the survival results look the same/different between distance and abundance?

For our state changes results, Macrophages__CD4_T_cell__CD138 is the most significant pairwise relationship which contributes to patient survival. That is, the relationship between CD138 expression in macrophages as their spatial proximity to CD4 T cells change. The positive coeffcient associated with this relationship tells us that lower CD138 expression in macrophages nearby CD4 T cells lead to poorer survival outcomes for patients.

# Selecting the most significant relationship
survRelationship = stateMat[["Macrophages__CD4_T_cell__CD138"]]
survRelationship = ifelse(survRelationship > median(survRelationship), "Higher expressed in close cells", "Lower expressed in close cells")
    
# Plotting Kaplan-Meier curve
survfit2(kerenSurv ~ survRelationship) |>
    ggsurvfit() +
    add_pvalue() +
    ggtitle("Macrophages__CD4_T_cell__CD138")

Question

How should these coefficients be interpreted?
Do any of these relationships makes sense?
Could you visualise representative images?

How do we generate a molecular representation for each individual?

The next section of this workshop will be dedicated to scFeatures - a straightforward package for generating a suite of molecular representations regarding a patient, taking a matrix of proteins x cells as input. The molecular representation is interpretable and hence facilitates downstream analysis of the individual. In general, scFeatures generates features across six categories representing different molecular views of cellular characteristics. These include:

cell type proportions
cell type specific gene expressions
cell type specific pathway expressions
cell type specific cell-cell interaction (CCI) scores
overall aggregated gene expressions
spatial metrics

The different types of features constructed enable a more comprehensive multi-view understanding of each individual from a matrix of spot x cells. By default, the function will generate a total of 13 different types of features (feature_types) shown below and store them in a list. All generated feature types are stored in a matrix of samples x features.

##  [1] "proportion_raw"       "proportion_logit"     "proportion_ratio"    
##  [4] "gene_mean_celltype"   "gene_prop_celltype"   "gene_cor_celltype"   
##  [7] "gene_mean_bulk"       "gene_prop_bulk"       "gene_cor_bulk"       
## [10] "L_stats"              "celltype_interaction" "morans_I"            
## [13] "nn_correlation"

There are different ways you can use scFeatures to generate molecular representations for individuals and it requires the following information for spatial data.

data,
sample,
X coordinates,
Y coordinates,
feature_types, and
type

There are a total of 13 different types of features (feature_types) that you can choose to generate. The argument type refers the type of input data we have. This is either scrna (single-cell RNA-sequencing data), spatial_p (spatial proteomics data), or spatial_t (single cell spatial data).

Suppose that we are interested in determining the proportion of each cell type in each individual within each region. It is necessary to specify type = spatial_p to reflect that we have spatial proteomics data and feature_types = proportion_raw to indicate we intend to calculate cell type proportion for each of the region-specific sub-cell types.

#>        B_cell  CD4_T_cell CD8_T_cell           DC  DC_or_Mono
#> 1 0.221985678 0.047029224 0.03348171 0.0003870718 0.034062319
#> 2 0.000660502 0.006935271 0.08091149 0.0122192867 0.002972259
#> 3 0.067616785 0.077751386 0.07854315 0.0057007126 0.068725257
#> 4 0.049525817 0.110341713 0.04576246 0.0003010688 0.068944754
#> 5 0.000554939 0.047169811 0.09822420 0.0005549390 0.061598224

The output of scFeatures, scfeatures_result, is a list of 13 dataframes, which each represent a different molecular feature of a patient. Below, we give an example visualisation of the raw cell type proportions for each patient.

Can I identify distinct neighbourhoods or regions of cells in my images?

We can cluster areas with similar spatial interactions to identify regions using the lisaClust package on Bioconductor. Here we set k = 5 to identify 5 regions.

So what are some ways I can check that my region clustering has worked effectively? One way we can do this is to use the hatchingPlot function to visualise the regions identified by lisaClust.

We can appreciate that the regions have separated out nicely for image 5, with region 1 clearly being a tumour region, region 5 a non-tumour region, and region 3 being the tumour-immune interaction region.

We can also use the regionMap function to get a more quantitative determination of what each region represents, to look at which cell types are encountered in which regions.

One feature we might be interested in exploring is to examine the molecular profile of each region or cells within each region. Statial provides functionality to identify the average marker expression of a given cell type in a given region, using the getMarkerMeans function. Similar to the analysis above, these features can also be used for survival analysis.

#>                                    coef se.coef    pval adjPval
#> Si__Unidentified__region_4         -3.4    1.00 0.00073    0.47
#> HLA_Class_1__Macrophages__region_4 -1.7    0.50 0.00085    0.47
#> CD45RO__dn_T_CD3__region_4         -2.4    0.73 0.00089    0.47
#> HLA_Class_1__Neutrophils__region_4 -1.7    0.51 0.00110    0.47
#> CD209__CD8_T_cell__region_1        10.0    3.20 0.00140    0.47
#> HLA_Class_1__Mono_or_Neu__region_1 -1.6    0.50 0.00150    0.47
#>                                                               cluster
#> Si__Unidentified__region_4                 Si__Unidentified__region_4
#> HLA_Class_1__Macrophages__region_4 HLA_Class_1__Macrophages__region_4
#> CD45RO__dn_T_CD3__region_4                 CD45RO__dn_T_CD3__region_4
#> HLA_Class_1__Neutrophils__region_4 HLA_Class_1__Neutrophils__region_4
#> CD209__CD8_T_cell__region_1               CD209__CD8_T_cell__region_1
#> HLA_Class_1__Mono_or_Neu__region_1 HLA_Class_1__Mono_or_Neu__region_1

What are some ways you could identify the “right” number of regions?
Can you think of some ways to generate additional features using these regions?

Module 7: Patient classification

Finally we demonstrate how we can use the Bioconductor package ClassifyR to perform patient classification with the features generated from Statial and scFeatures. In addition to this, we also calculate cell type proportions and region proportions using the getProp function in spicyR. Here we perform 5 fold cross validation with 20 repeats, using a CoxPH model for survival classification.

To prepare our metrics for classification, first we must convert our dataframes into matrix format. This can be accomplished, as we did earlier, with the prepMatrix function of Statial. As we generated matrices for Kontextual and the distance metric of SpatioMark earlier, we will now be generating matrices for the SpatioMark corrected distance metric, and the abundance metrics.

We can also prepare a metric for region proportions, to look at whether the percentage of area each region occupies in an image is capable of predicting survival.

## Distances Corrected Matrix
stateMatCorDist <- prepMatrix(stateChangesCorrected)
stateMatCorDist <- stateMatCorDist[names(kerenSurv), ]
stateMatCorDist[is.na(stateMatCorDist)] <- 0
# Remove some very small values
stateMatCorDist <- stateMatCorDist[,colMeans(abs(stateMatCorDist)>0.0001)>.8]

reducedDim(kerenSPE, "abundances")[is.na(reducedDim(kerenSPE, "abundances"))] <- 0
## Abundances
stateChangesAbund <- calcStateChanges(
  cells = kerenSPE,
  type = "abundances",
  minCells = 100)

stateMatAbund <- prepMatrix(stateChangesAbund)
stateMatAbund <- stateMatAbund[names(kerenSurv), ]
stateMatAbund[is.na(stateMatAbund)] <- 0
# Remove some very small values
stateMatAbund <- stateMatAbund[,colMeans(abs(stateMatAbund)>0.0001)>.8]

## Abundances Corrected
stateChangesCorrectedAbund <- calcStateChanges(
  cells = kerenSPE,
  type = "abundances",
  minCells = 100,
  contamination = TRUE)

stateMatCorAbund <- prepMatrix(stateChangesCorrectedAbund)
stateMatCorAbund <- stateMatCorAbund[names(kerenSurv), ]
stateMatCorAbund[is.na(stateMatCorAbund)] <- 0
# Remove some very small values
stateMatCorAbund <- stateMatCorAbund[,colMeans(abs(stateMatCorAbund)>0.0001)>.8]

## Region proportions
regionProp <- getProp(kerenSPE, 
                       feature = "region",
                       imageID = "imageID")

Then we compile these matrices into a list. scFeatures conveniently provides all its metrics as part of a list of matrices already, allowing us to easily combine our Statial metrics with scFeatures into 1 list of matrices. We then input this list into the crossValidate function of ClassifyR, specifying our survival object created earlier, and 5 folds of cross validation with 20 repeats. We will also be using a feature selection algorithm selecting for 10 features for each metric.

statialFeatureList <- list(`SpatioMark Distances` = stateMat,
                           `SpatioMark Corrected Distances` = stateMatCorDist,
                           `SpatioMark Abundances` = stateMatAbund,
                           `SpatioMark Corrected Abundances` = stateMatCorAbund,
                           `Kontextual` = kontextMat,
                           `Region Marker Means` = cellTypeRegionMeans,
                           `Region Proportions` = regionProp)

set.seed(51773)

featureList <- c(statialFeatureList, scfeatures_result)
featureList <- lapply(featureList, function(x)x[names(kerenSurv),])

kerenCV = crossValidate(
  measurements = featureList,
  outcome = kerenSurv,
  classifier = "CoxPH",
  selectionMethod  = "CoxPH",
  nFolds = 5,
  nFeatures = 10,
  nRepeats = 20,
  nCores = nCores
)

load(system.file("extdata", "featureList.rda", package = "ScdneySpatial"))

load(system.file("extdata", "kerenCV.rda", package = "ScdneySpatial"))

Next, we use the performancePlot function to assess the C-index from each repeat of the 5-fold cross-validation.

performancePlot(kerenCV,
  characteristicsList = list(x = "Assay Name"),
  orderingList = list("Assay Name" = c(names(statialFeatureList), names(scfeatures_result)))) + 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
  geom_rect(aes(xmin = 0, xmax = 5.5, ymin = -Inf, ymax = Inf), fill = "red", alpha = 0.2) +
  geom_rect(aes(xmin = 5.5, xmax = 7.5, ymin = -Inf, ymax = Inf), fill = "yellow", alpha = 0.2) +
  geom_rect(aes(xmin = 7.5, xmax = 16.5, ymin = -Inf, ymax = Inf), fill = "blue", alpha = 0.2) +
  geom_rect(aes(xmin = 16.5, xmax = 21, ymin = -Inf, ymax = Inf), fill = "green", alpha = 0.2)

Unfortunately, our spatial metrics did not outperform some of the simpler metrics generated by scFeatures, particularly gene_mean_bulk, which simply produces a matrix describing the average gene expression of each marker in each patient. To examine another variable which might be of interest, we used recurrence as another outcome to determine whether our spatial metrics could uncover signal.

outcome <- survData$RECURRENCE_LABEL
names(outcome) <- survData$imageID

outcome <- factor(outcome)

featureList <- lapply(featureList, function(x)x[names(outcome),])

kerenCV_recurrence = crossValidate(
  measurements = featureList,
  outcome = outcome,
  classifier = "elasticNetGLM",
  selectionMethod = "auto",
  nFolds = 5,
  nFeatures = 10,
  nRepeats = 20,
  nCores = nCores
)

Again, using performancePlot, this time for recurrence, we found better performance in select spatial metrics.

load(system.file("extdata", "recurrenceCV.rda", package = "ScdneySpatial"))

performancePlot(kerenCV_recurrence,
  characteristicsList = list(x = "Assay Name"),
  orderingList = list("Assay Name" = c(names(statialFeatureList), names(scfeatures_result)))) + 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
  geom_rect(aes(xmin = 0, xmax = 5.5, ymin = -Inf, ymax = Inf), fill = "red", alpha = 0.2) +
  geom_rect(aes(xmin = 5.5, xmax = 7.5, ymin = -Inf, ymax = Inf), fill = "yellow", alpha = 0.2) +
  geom_rect(aes(xmin = 7.5, xmax = 16.5, ymin = -Inf, ymax = Inf), fill = "blue", alpha = 0.2) +
  geom_rect(aes(xmin = 16.5, xmax = 21, ymin = -Inf, ymax = Inf), fill = "green", alpha = 0.2)

Questions

Discuss among your table why spatial metrics might perform better for recurrence compared to survival.

sessionInfo()

#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Debian GNU/Linux 12 (bookworm)
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.21.so;  LAPACK version 3.11.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: Australia/Sydney
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] lubridate_1.9.3                 forcats_1.0.0                  
#>  [3] stringr_1.5.1                   purrr_1.0.2                    
#>  [5] readr_2.1.4                     tibble_3.2.1                   
#>  [7] tidyverse_2.0.0                 ttservice_0.3.8                
#>  [9] tidyr_1.3.0                     dplyr_1.1.4                    
#> [11] tidySingleCellExperiment_1.12.0 scater_1.30.1                  
#> [13] scuttle_1.12.0                  reshape_0.8.9                  
#> [15] plotly_4.10.3                   glmnet_4.1-8                   
#> [17] Matrix_1.6-3                    ggsurvfit_1.0.0                
#> [19] ggplot2_3.4.4                   batchelor_1.18.0               
#> [21] cytomapper_1.14.0               EBImage_4.44.0                 
#> [23] Statial_1.4.5                   spicyR_1.14.1                  
#> [25] simpleSeg_1.4.0                 scHOT_1.14.0                   
#> [27] scFeatures_0.99.27              scClassify_1.14.0              
#> [29] lisaClust_1.10.1                FuseSOM_1.4.0                  
#> [31] ClassifyR_3.6.2                 survival_3.5-7                 
#> [33] BiocParallel_1.36.0             MultiAssayExperiment_1.28.0    
#> [35] generics_0.1.3                  SpatialDatasets_1.0.0          
#> [37] BumpyMatrix_1.10.0              STexampleData_1.10.0           
#> [39] SpatialExperiment_1.12.0        SingleCellExperiment_1.24.0    
#> [41] SummarizedExperiment_1.32.0     Biobase_2.62.0                 
#> [43] GenomicRanges_1.54.1            GenomeInfoDb_1.38.1            
#> [45] IRanges_2.36.0                  S4Vectors_0.40.2               
#> [47] MatrixGenerics_1.14.0           matrixStats_1.1.0              
#> [49] ExperimentHub_2.10.0            AnnotationHub_3.10.0           
#> [51] BiocFileCache_2.10.1            dbplyr_2.4.0                   
#> [53] BiocGenerics_0.48.1            
#> 
#> loaded via a namespace (and not attached):
#>   [1] igraph_1.5.1                  graph_1.80.0                 
#>   [3] fpc_2.2-10                    zlibbioc_1.48.0              
#>   [5] tidyselect_1.2.0              bit_4.0.5                    
#>   [7] lattice_0.22-5                rjson_0.2.21                 
#>   [9] blob_1.2.4                    S4Arrays_1.2.0               
#>  [11] parallel_4.3.1                png_0.1-8                    
#>  [13] ResidualMatrix_1.12.0         cli_3.6.1                    
#>  [15] ggplotify_0.1.2               ProtGenerics_1.34.0          
#>  [17] multtest_2.58.0               goftest_1.2-3                
#>  [19] pkgdown_2.0.7                 textshaping_0.3.7            
#>  [21] BiocIO_1.12.0                 kernlab_0.9-32               
#>  [23] bluster_1.12.0                BiocNeighbors_1.20.0         
#>  [25] uwot_0.1.16                   curl_5.1.0                   
#>  [27] mime_0.12                     evaluate_0.23                
#>  [29] tiff_0.1-11                   V8_4.4.0                     
#>  [31] stringi_1.8.2                 backports_1.4.1              
#>  [33] desc_1.4.2                    FCPS_1.3.4                   
#>  [35] lmerTest_3.1-3                XML_3.99-0.16                
#>  [37] httpuv_1.6.12                 flexmix_2.3-19               
#>  [39] AnnotationDbi_1.64.1          magrittr_2.0.3               
#>  [41] DataVisualizations_1.3.2      rappdirs_0.3.3               
#>  [43] splines_4.3.1                 mclust_6.0.1                 
#>  [45] jpeg_0.1-10                   ggraph_2.1.0                 
#>  [47] ggbeeswarm_0.7.2              DBI_1.1.3                    
#>  [49] terra_1.7-55                  HDF5Array_1.30.0             
#>  [51] genefilter_1.84.0             jquerylib_0.1.4              
#>  [53] withr_2.5.2                   class_7.3-22                 
#>  [55] systemfonts_1.0.5             rprojroot_2.0.4              
#>  [57] GSEABase_1.64.0               tidygraph_1.2.3              
#>  [59] rtracklayer_1.62.0            BiocManager_1.30.22          
#>  [61] htmlwidgets_1.6.3             fs_1.6.3                     
#>  [63] biomaRt_2.58.0                segmented_1.6-4              
#>  [65] ggrepel_0.9.4                 princurve_2.1.6              
#>  [67] Cepo_1.8.0                    labeling_0.4.3               
#>  [69] SparseArray_1.2.2             DEoptimR_1.1-3               
#>  [71] ranger_0.16.0                 SingleCellSignalR_1.14.0     
#>  [73] mixtools_2.0.0                diptest_0.76-0               
#>  [75] annotate_1.80.0               minpack.lm_1.2-4             
#>  [77] raster_3.6-26                 XVector_0.42.0               
#>  [79] knitr_1.45                    nnls_1.5                     
#>  [81] AUCell_1.24.0                 timechange_0.2.0             
#>  [83] foreach_1.5.2                 fansi_1.0.5                  
#>  [85] patchwork_1.1.3               caTools_1.18.2               
#>  [87] grid_4.3.1                    data.table_1.14.8            
#>  [89] rhdf5_2.46.0                  vegan_2.6-4                  
#>  [91] R.oo_1.25.0                   psych_2.3.9                  
#>  [93] irlba_2.3.5.1                 gridGraphics_0.5-1           
#>  [95] ellipsis_0.3.2                lazyeval_0.2.2               
#>  [97] yaml_2.3.7                    scattermore_1.2              
#>  [99] BiocVersion_3.18.1            crayon_1.5.2                 
#> [101] RcppAnnoy_0.0.21              RColorBrewer_1.1-3           
#> [103] tweenr_2.0.2                  later_1.3.1                  
#> [105] codetools_0.2-19              GlobalOptions_0.1.2          
#> [107] KEGGREST_1.42.0               Rtsne_0.16                   
#> [109] shape_1.4.6                   limma_3.58.1                 
#> [111] Rsamtools_2.18.0              filelock_1.0.2               
#> [113] pkgconfig_2.0.3               xml2_1.3.5                   
#> [115] ggpubr_0.6.0                  GenomicAlignments_1.38.0     
#> [117] spatstat.sparse_3.0-3         ape_5.7-1                    
#> [119] viridisLite_0.4.2             xtable_1.8-4                 
#> [121] coop_0.6-3                    fastcluster_1.2.3            
#> [123] highr_0.10                    car_3.1-2                    
#> [125] plyr_1.8.9                    httr_1.4.7                   
#> [127] prabclus_2.3-3                tools_4.3.1                  
#> [129] beeswarm_0.4.0                broom_1.0.5                  
#> [131] nlme_3.1-163                  MatrixModels_0.5-3           
#> [133] crosstalk_1.2.1               lme4_1.1-35.1                
#> [135] digest_0.6.33                 permute_0.9-7                
#> [137] numDeriv_2016.8-1.1           farver_2.1.1                 
#> [139] tzdb_0.4.0                    AnnotationFilter_1.26.0      
#> [141] reshape2_1.4.4                yulab.utils_0.1.0            
#> [143] viridis_0.6.4                 glue_1.6.2                   
#> [145] cachem_1.0.8                  hopach_2.62.0                
#> [147] polyclip_1.10-6               proxyC_0.3.4                 
#> [149] Biostrings_2.70.1             profileModel_0.6.1           
#> [151] mnormt_2.1.1                  statmod_1.5.0                
#> [153] concaveman_1.1.0              ragg_1.2.6                   
#> [155] ScaledMatrix_1.10.0           carData_3.0-5                
#> [157] minqa_1.2.6                   dqrng_0.3.2                  
#> [159] utf8_1.2.4                    graphlayouts_1.0.2           
#> [161] gtools_3.9.5                  analogue_0.17-6              
#> [163] ggsignif_0.6.4                gridExtra_2.3                
#> [165] shiny_1.8.0                   GSVA_1.50.0                  
#> [167] GenomeInfoDbData_1.2.11       R.utils_2.12.3               
#> [169] rhdf5filters_1.14.0           RCurl_1.98-1.13              
#> [171] memoise_2.0.1                 rmarkdown_2.25               
#> [173] pheatmap_1.0.12               scales_1.3.0                 
#> [175] R.methodsS3_1.8.2             svglite_2.1.2                
#> [177] spatstat.data_3.0-3           rstudioapi_0.15.0            
#> [179] EnsDb.Mmusculus.v79_2.99.0    cluster_2.1.4                
#> [181] msigdbr_7.5.1                 spatstat.utils_3.0-4         
#> [183] hms_1.1.3                     munsell_0.5.0                
#> [185] cowplot_1.1.1                 scam_1.2-14                  
#> [187] colorspace_2.1-0              rlang_1.1.2                  
#> [189] EnsDb.Hsapiens.v79_2.99.0     DelayedMatrixStats_1.24.0    
#> [191] sparseMatrixStats_1.14.0      shinydashboard_0.7.2         
#> [193] ggforce_0.4.1                 circlize_0.4.15              
#> [195] mgcv_1.9-0                    xfun_0.41                    
#> [197] modeltools_0.2-23             iterators_1.0.14             
#> [199] abind_1.4-5                   interactiveDisplayBase_1.40.0
#> [201] Rhdf5lib_1.24.0               bitops_1.0-7                 
#> [203] fftwtools_0.9-11              promises_1.2.1               
#> [205] RSQLite_2.3.3                 DelayedArray_0.28.0          
#> [207] proxy_0.4-27                  compiler_4.3.1               
#> [209] prettyunits_1.2.0             boot_1.3-28.1                
#> [211] beachmat_2.18.0               Rcpp_1.0.11                  
#> [213] edgeR_4.0.2                   BiocSingular_1.18.0          
#> [215] tensor_1.5                    MASS_7.3-60                  
#> [217] progress_1.2.2                ggupset_0.3.0                
#> [219] babelgene_22.9                spatstat.random_3.2-2        
#> [221] R6_2.5.1                      fastmap_1.1.1                
#> [223] rstatix_0.7.2                 vipor_0.4.5                  
#> [225] ensembldb_2.26.0              nnet_7.3-19                  
#> [227] rsvd_1.0.5                    gtable_0.3.4                 
#> [229] KernSmooth_2.23-22            deldir_2.0-2                 
#> [231] htmltools_0.5.7               RcppParallel_5.1.7           
#> [233] bit64_4.0.5                   spatstat.explore_3.2-5       
#> [235] lifecycle_1.0.4               nloptr_2.0.3                 
#> [237] restfulr_0.0.15               sass_0.4.7                   
#> [239] vctrs_0.6.5                   robustbase_0.99-0            
#> [241] spatstat.geom_3.2-7           scran_1.30.0                 
#> [243] sp_2.1-2                      bslib_0.6.1                  
#> [245] pillar_1.9.0                  GenomicFeatures_1.54.1       
#> [247] gplots_3.1.3                  magick_2.8.1                 
#> [249] metapod_1.10.0                locfit_1.5-9.8               
#> [251] jsonlite_1.8.7                brglm_0.7.2                  
#> [253] svgPanZoom_0.3.4