1 Introduction

Pathway analysis is an essential component in most of omic science. Here, we will introduce basic methods for “1-dimension” pathway analysis; and more advanced methods for implemented in the R package directPA for performing “2 and 3-dimentional” pathway analyses (Pengyi Yang et al. 2014). We will start with proteomic data and look at gene-centric pathways and then extend to phosphoproteomic data and look at signalling pathways.

Load in packages

suppressPackageStartupMessages({
  library(PhosR)
  library(directPA)
})

2 Gene-centric pathway analysis

2.1 SILAC plasma membrane proteome dataset

We will first use a SILAC-based proteomic dataset for demonstration of gene-centric pathway analysis. It contains the plasma membrane (PM) proteome of insulin responsive 3T3-L1 adipocytes under three different treatment conditions and a basal condition. The three treatment conditions are (1) insulin stimulation; (2) wortmannin inhibition plus insulin stimulation; and (3) MK inhibition plus insulin stimulation. Proteomes in this data meansured by using SILAC-based mass spectrometry. The dataset has already been pre-processed into log2 fold changes of (1) Ins vs Basal; (2) Ins+Wmn vs Basal; and (3) Ins+MK vs Basal.

This dataset is already included in the package directPA. We can load this example dataset into current working environment as data(PM).

2.2 Conventional gene pathway analysis (1D)

Let us start with conventional 1-dimensional pathway analysis. The following demonstrates finding pathways that are enriched in insulin stimulation compared to basal. directPA package has built in pathway annotations and PhosR package implements both over-representation test and rank-based test. The pathway annotations included in directPA are from KEGG database and Reactome database. You can always provide your own pathway annotations such as other popular databases (e.g. gene ontology or GO).

# load pathway annotations
data(PM)
data(Pathways)
geneSet <- names(sort(PM[,1], decreasing = TRUE))[1:100]

path1 <- pathwayOverrepresent(geneSet, annotation=Pathways.KEGG, universe = rownames(PM), alter = "greater")
path2 <- pathwayRankBasedEnrichment(PM[,1], annotation=Pathways.KEGG, alter = "greater")

path1[1:5,c("pvalue", "# of substrates")]

##                                                pvalue              
## KEGG_SNARE_INTERACTIONS_IN_VESICULAR_TRANSPORT "0.0766668712824008"
## KEGG_RIBOSOME                                  "0.140645733814865" 
## KEGG_N_GLYCAN_BIOSYNTHESIS                     "0.483905192574014" 
## KEGG_STEROID_BIOSYNTHESIS                      "0.527320456676819" 
## KEGG_RENIN_ANGIOTENSIN_SYSTEM                  "0.838917396716807" 
##                                                # of substrates
## KEGG_SNARE_INTERACTIONS_IN_VESICULAR_TRANSPORT "7"            
## KEGG_RIBOSOME                                  "12"           
## KEGG_N_GLYCAN_BIOSYNTHESIS                     "5"            
## KEGG_STEROID_BIOSYNTHESIS                      "2"            
## KEGG_RENIN_ANGIOTENSIN_SYSTEM                  "1"

path2[1:5,c("pvalue", "# of substrates")]

##                                                pvalue                
## KEGG_RIBOSOME                                  "4.29100890328692e-10"
## KEGG_SNARE_INTERACTIONS_IN_VESICULAR_TRANSPORT "6.81644137750794e-05"
## KEGG_N_GLYCAN_BIOSYNTHESIS                     "6.89092354625917e-05"
## KEGG_PROTEIN_EXPORT                            "0.00235739226937965" 
## KEGG_GLYCEROPHOSPHOLIPID_METABOLISM            "0.00335588047711627" 
##                                                # of substrates
## KEGG_RIBOSOME                                  "44"           
## KEGG_SNARE_INTERACTIONS_IN_VESICULAR_TRANSPORT "17"           
## KEGG_N_GLYCAN_BIOSYNTHESIS                     "11"           
## KEGG_PROTEIN_EXPORT                            "5"            
## KEGG_GLYCEROPHOSPHOLIPID_METABOLISM            "12"

plot(-log10(as.numeric(path1[names(Pathways.KEGG),1])), -log10(as.numeric(path2[names(Pathways.KEGG),1])), xlab="Overrepresentation (-log10 pvalue)", ylab="Rank-based enrichment (-log10 pvalue)", main="Comparison of 1D pathway analyses")

2.3 2D gene pathway analysis using directPA

What if we want to find the pathways that are up-regulated in insulin stimulation vs basal but inhibited by wortmannin or MK? Clearly, conventional 1-dimensional pathway analysis can not directly be applied to address this question. That is why we have implemented a 2-dimensional direction pathway analysis method to address this. Below is the demonstration for this.

par(mfrow=c(1,2))
dpa1 <- directPA(PM[,c(1,2)], direction=pi/2, annotation=Pathways.KEGG, main="Direction pathway analysis")
dpa2 <- directPA(PM[,c(1,3)], direction=pi/2, annotation=Pathways.KEGG, main="Direction pathway analysis")

dpa1$pathways[1:5,]

##                                                                 pvalue      
## KEGG_SNARE_INTERACTIONS_IN_VESICULAR_TRANSPORT                  6.090852e-06
## KEGG_SPHINGOLIPID_METABOLISM                                    0.04696436  
## KEGG_EPITHELIAL_CELL_SIGNALING_IN_HELICOBACTER_PYLORI_INFECTION 0.04928891  
## KEGG_GNRH_SIGNALING_PATHWAY                                     0.06978881  
## KEGG_MELANOGENESIS                                              0.08283226  
##                                                                 size
## KEGG_SNARE_INTERACTIONS_IN_VESICULAR_TRANSPORT                  17  
## KEGG_SPHINGOLIPID_METABOLISM                                    7   
## KEGG_EPITHELIAL_CELL_SIGNALING_IN_HELICOBACTER_PYLORI_INFECTION 11  
## KEGG_GNRH_SIGNALING_PATHWAY                                     8   
## KEGG_MELANOGENESIS                                              8

dpa2$pathways[1:5,]

##                                                pvalue      size
## KEGG_SNARE_INTERACTIONS_IN_VESICULAR_TRANSPORT 0.001477734 17  
## KEGG_PEROXISOME                                0.004388284 38  
## KEGG_OXIDATIVE_PHOSPHORYLATION                 0.008825662 58  
## KEGG_PARKINSONS_DISEASE                        0.02936895  50  
## KEGG_LIMONENE_AND_PINENE_DEGRADATION           0.0299775   5

We can obtain the pvalues of genes as how significant they are in the direction we are testing using dpa1$gene.pvalues. Below we plot for genes that are active in insulin stimulation and inhibited by either wortmannin or MK. As can be seen, some of the genes are different (i.e. inhibited by wortmannin only or MK only). We can dissect these two conditions and their corresponding pathways using, again, 2D test.

par(mfrow=c(1,2))
plot(dpa1$gene.pvalues[rownames(PM)], dpa2$gene.pvalues[rownames(PM)], xlab="Gene pvalue from dpa1", ylab="Gene pvalue from dpa2")

dpa3 <- directPA(PM[,c(2,3)], direction=pi/2, annotation=Pathways.KEGG, main="Direction pathway analysis")

dpa3$pathways[1:5,]

##                                                pvalue       size
## KEGG_PEROXISOME                                0.0001044089 38  
## KEGG_PARKINSONS_DISEASE                        0.005203929  50  
## KEGG_VALINE_LEUCINE_AND_ISOLEUCINE_DEGRADATION 0.01030298   29  
## KEGG_CITRATE_CYCLE_TCA_CYCLE                   0.01097305   24  
## KEGG_BUTANOATE_METABOLISM                      0.01194878   13

2.4 3D gene pathway analysis using directPA

The above repeatitve 2D test is good at answering comparisons involving 2 treatments and 1 control. What if we want to ask a question such as “genes that are up-regulated in insulin stimulation; inhibited by MK; but not inhibited by wortmannin”? For this, we need a 3D test which is what we will introduce below.

dPA <- directPA(Tc=PM, direction=c(1,1,-1), annotation=Pathways.KEGG, main="Direction pathway analysis")
dPA$gst[order(unlist(dPA$gst[,1])),][1:20,]

##                                                pvalue       size
## KEGG_PEROXISOME                                4.450701e-05 38  
## KEGG_STEROID_BIOSYNTHESIS                      0.0007532071 7   
## KEGG_ANTIGEN_PROCESSING_AND_PRESENTATION       0.00219594   10  
## KEGG_LIMONENE_AND_PINENE_DEGRADATION           0.006452643  5   
## KEGG_N_GLYCAN_BIOSYNTHESIS                     0.008243181  11  
## KEGG_RIBOSOME                                  0.01290797   44  
## KEGG_PARKINSONS_DISEASE                        0.02034015   50  
## KEGG_BIOSYNTHESIS_OF_UNSATURATED_FATTY_ACIDS   0.02246491   10  
## KEGG_PROTEIN_EXPORT                            0.03451366   5   
## KEGG_BETA_ALANINE_METABOLISM                   0.05575214   8   
## KEGG_LYSINE_DEGRADATION                        0.07957553   13  
## KEGG_STEROID_HORMONE_BIOSYNTHESIS              0.08834657   7   
## KEGG_OXIDATIVE_PHOSPHORYLATION                 0.09396306   58  
## KEGG_GLYCEROPHOSPHOLIPID_METABOLISM            0.09478043   12  
## KEGG_ALZHEIMERS_DISEASE                        0.09791927   55  
## KEGG_FATTY_ACID_METABOLISM                     0.09982681   20  
## KEGG_ALDOSTERONE_REGULATED_SODIUM_REABSORPTION 0.1043517    6   
## KEGG_GLYCEROLIPID_METABOLISM                   0.132431     10  
## KEGG_CARDIAC_MUSCLE_CONTRACTION                0.1331192    18  
## KEGG_HUNTINGTONS_DISEASE                       0.1478461    59

3 Signalling pathways analysis using phosphoproteomic data

The above demonstrated some gene-centric pathway analyses using proteomic data. Now, we will use the same methodology for testing kinase enrichment in phosphoproteomic data.

3.1 Embryonic stem cell differentiation phosphoproteomic dataset

In this section, we will use an embryonic stem cell (ESC) differentiation dataset for demonstration. This dataset contains 12 time points profiling the transition of ESCs to epiblast-like cells (EpiLCs). Details regarding the dataset can be found in (Pengyi Yang et al. 2019). The quantification of the phophorylation at each time point were converted into log2 fold change with respect to the first time point of 0 hour.

This dataset is included in the package PhosR. We can load this example dataset into current working environment as data(phosphoESC).

3.2 2D signalling pathway analysis using directPA

One key aspect in studying signalling pathways is to identify key kinases that are involved in signalling cascades. To identify these kinases, we make use of kinase-substrate annotation databases such as PhosphoSitePlus (Hornbeck et al. 2012) and Phospho.ELM (Dinkel et al. 2011). These databases are included in the PhosR and directPA packages already. To access them, simply load the package and access the data by data("PhosphoSitePlus") and data("PhosphoELM").

Back to the example. Now, suppose we partition the signalling into early (5-15 minutes) and late (24-48h) stages during the differentiation process, we can then identify kinases that are active in these time points by testing kinase-substrates that are up-regulated in phosphorylation level in both time points as below:

par(mfrow=c(1,2))
data("phosphoESC")
data("PhosphoSitePlus")
rownames(phosphoESC) <- sapply(strsplit(rownames(phosphoESC), ";"), function(x)paste(toupper(x[1]), x[2], "", sep=";"))

kPA1 <- kinasePA(Tc=phosphoESC[,c("X5m","X15m")], direction=0, annotation=PhosphoSite.mouse, main="Direction pathway analysis")
kPA1$kinase[1:5,]

##          pvalue       size
## Yang.S6K 2.777923e-36 16  
## p70S6K   2.535104e-24 12  
## P70S6KB  4.910053e-20 5   
## Yang.RSK 2.249326e-17 11  
## ERK2     1.574874e-16 60

kPA2 <- kinasePA(Tc=phosphoESC[,c("X24h","X48h")], direction=0, annotation=PhosphoSite.mouse, main="Direction pathway analysis")

kPA2$kinase[1:5,]

##           pvalue       size
## Yang.S6K  1.003406e-23 16  
## P70S6KB   1.737207e-22 5   
## Yang.GSK3 3.716738e-19 14  
## p70S6K    1.390567e-14 12  
## Yang.mTOR 1.323561e-12 47

There is also a function called perturbPlot2d for testing and visualising activity of all kinases on all possible directions. Below are the results from using this function. Details of this function and its implementation is described in our publication kinase perturbation analysis (kinasePA) (Yang et al. 2016).

z1 <- perturbPlot2d(Tc=phosphoESC[,c("X5m","X15m")], annotation=PhosphoSite.mouse, cex=1, xlim=c(-5, 12), ylim=c(-5, 11), main="Kinase perturbation analysis")

z2 <- perturbPlot2d(Tc=phosphoESC[,c("X24h","X48h")], annotation=PhosphoSite.mouse, cex=1, xlim=c(-5, 11), ylim=c(-5, 11), main="Kinase perturbation analysis")

4 Gene-centric pathway analysis of phosphoproteomic data

While the above kinase-substrate-based analysis of phosphoproteomic data is informative, the annotation of kinases and their substrates in current databases are quite limited. This leads to the limited use of a small subset of phosphosites whereas the majority of the identified phosphosites in the phosphoproteomic data are not used for the analysis.

One possible way to utilise more from the phosphoproteomic data is to perform gene-centric analysis by summarising phosphosite-level information to each protein. The following is an example for such type of analysis using the phosCollapse function in package PhosR and then directPA for gene-centric 2D pathway analysis.

# summarise phospho-level information to genes
phosphoESC.sum <- phosCollapse(phosphoESC, gsub(";.+", "", rownames(phosphoESC)), stat = apply(abs(phosphoESC), 1, max), by = "max")

par(mfrow=c(1,2))
dPA1 <- directPA(Tc=phosphoESC.sum[,c("X5m","X15m")], direction=0, annotation=Pathways.reactome, main="Direction pathway analysis")
dPA1$pathways[1:5,]

##                                                                    pvalue      
## REACTOME_RECYCLING_PATHWAY_OF_L1                                   8.032709e-09
## REACTOME_ERK_MAPK_TARGETS                                          1.839573e-08
## REACTOME_MAPK_TARGETS_NUCLEAR_EVENTS_MEDIATED_BY_MAP_KINASES       5.246676e-08
## REACTOME_NUCLEAR_EVENTS_KINASE_AND_TRANSCRIPTION_FACTOR_ACTIVATION 1.217495e-07
## REACTOME_HIV_LIFE_CYCLE                                            8.002594e-07
##                                                                    size
## REACTOME_RECYCLING_PATHWAY_OF_L1                                   9   
## REACTOME_ERK_MAPK_TARGETS                                          8   
## REACTOME_MAPK_TARGETS_NUCLEAR_EVENTS_MEDIATED_BY_MAP_KINASES       13  
## REACTOME_NUCLEAR_EVENTS_KINASE_AND_TRANSCRIPTION_FACTOR_ACTIVATION 10  
## REACTOME_HIV_LIFE_CYCLE                                            51

dPA2 <- directPA(Tc=phosphoESC.sum[,c("X24h","X48h")], direction=0, annotation=Pathways.reactome, main="Direction pathway analysis")

dPA2$pathways[1:5,]

##                                     pvalue       size
## REACTOME_MTORC1_MEDIATED_SIGNALLING 1.350253e-07 8   
## REACTOME_MEMBRANE_TRAFFICKING       5.626116e-05 31  
## REACTOME_MITOTIC_M_M_G1_PHASES      0.0001401506 78  
## REACTOME_DNA_REPLICATION            0.0001465766 84  
## REACTOME_MITOTIC_PROMETAPHASE       0.0001553383 44

5 Session Info

sessionInfo()

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] directPA_1.4 PhosR_0.1.0 
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.5        xfun_0.11               purrr_0.3.3            
##  [4] colorspace_1.4-1        miniUI_0.1.1.1          htmltools_0.4.0        
##  [7] viridisLite_0.3.0       yaml_2.2.0              rlang_0.4.2            
## [10] manipulateWidget_0.10.0 e1071_1.7-3             pillar_1.4.2           
## [13] later_1.0.0             glue_1.3.1              BiocGenerics_0.32.0    
## [16] calibrate_1.7.5         lifecycle_0.1.0         stringr_1.4.0          
## [19] munsell_0.5.0           pcaMethods_1.78.0       gtable_0.3.0           
## [22] htmlwidgets_1.5.1       evaluate_0.14           Biobase_2.46.0         
## [25] knitr_1.26              fastmap_1.0.1           httpuv_1.5.2           
## [28] crosstalk_1.0.0         parallel_3.6.1          class_7.3-15           
## [31] Rcpp_1.0.3              xtable_1.8-4            scales_1.1.0           
## [34] promises_1.1.0          limma_3.42.0            webshot_0.5.2          
## [37] jsonlite_1.6            mime_0.7                gridExtra_2.3          
## [40] ggplot2_3.2.1           digest_0.6.23           stringi_1.4.3          
## [43] dplyr_0.8.3             shiny_1.4.0             grid_3.6.1             
## [46] tools_3.6.1             magrittr_1.5            rgl_0.100.30           
## [49] lazyeval_0.2.2          tibble_2.1.3            crayon_1.3.4           
## [52] pkgconfig_2.0.3         dendextend_1.12.0       MASS_7.3-51.4          
## [55] ruv_0.9.7.1             assertthat_0.2.1        rmarkdown_1.18         
## [58] viridis_0.5.1           R6_2.4.1                compiler_3.6.1

1, 2, 3-dimensional pathway analysis of (phospho)proteomic data

Pengyi Yang

Dec 2019