1 Introduction

Before we can start interpreting biology from our phoshoproteomic data, the first step is to perform data preprocessing and QC. These include (but not limited to) assessing data quality, filtering, imputing missing values where appropriate, and normalising data to remove batch effects. PhosR package implements various functions for achieving these goals. Here, we will introduce a few QC measures and demonstrate some of the key functions for data normalisation and imputation.

2 Analysis on ESC differentiation dataset

The ESC differentiation dataset contains the temporal phosphoproteome profile during mouse embryonic stem cell (ESC) differentiation to epiblast-like cells (EpiLC) (Pengyi Yang et al. 2019), using label free approach. There are 12 profiled time points including 0h (ESC), 5min, 15min, 30min, 1h, 3h, 6h, 12h, 24h, 36h, 48h, and 72h (EpiLC). The dataset in plain text file ESC_Phospho (STY)Sites.txt is avaliable from (xxx). This file contains essentailly the orignal output from MaxQuant (https://www.maxquant.org/) which is a software for quantifying mass spectrometry machine outputs. We will start by loading PhosR package for subsequent analysis.

library(PhosR)

2.1 Data wrangling

We can load the data into R using the read.delim() function. Note that the first row contains header information.

phospho.raw <- read.delim("./ESC_Phospho (STY)Sites.txt", header = TRUE)

After reading in the text file, we will extract some useful column information and concatenate them into strings as use them as id for each of our phosphosite in the dataset.

Ids <- paste(paste(sapply(strsplit(as.character(phospho.raw[,"Protein"]), ";"), function(x){x[1]}),
             sapply(strsplit(as.character(phospho.raw[,"Gene.names"]), ";"), function(x){x[1]}),
             paste(as.character(phospho.raw[,"Amino.acid"]), as.character(phospho.raw[,"Position"]), sep=""),
             sapply(strsplit(as.character(phospho.raw[,"Sequence.window"]), ";"), function(x){x[1]}), sep="~"))

Ids[1:10]

##  [1] "A0A023T672~RBM8~S24~EAGGEDFAMDEDGDESIHKLKEKAKKRKGRG" 
##  [2] "A0A023T672~RBM8~S56~FGSEGSRARMREDYDSVEQDGDEPGPQRSVE" 
##  [3] "A0A023T672~RBM8~S42~KLKEKAKKRKGRGFGSEEGSRARMREDYDSV" 
##  [4] "A0A023T672~RBM8~S46~KAKKRKGRGFGSEEGSRARMREDYDSVEQDG" 
##  [5] "A0A023T672~RBM8~S166~VRGPPKGKRRGGRRRSRSPDRRRR_______"
##  [6] "A0A023T672~RBM8~S168~GPPKGKRRGGRRRSRSPDRRRR_________"
##  [7] "Q9R0X5~Rpgr~S914~NENPKGHMYDRVKSSSSEILGGNDPTSKDIK"    
##  [8] "Q9CWZ3-2~Rbm8a~S42~KLKEKAKKRKGRGFGSEGSRARMREDYDSVE"  
##  [9] "Q9CWZ3-2~Rbm8a~S45~EKAKKRKGRGFGSEGSRARMREDYDSVEQDG"  
## [10] "O70589-3~Cask~S570~GSITFKIVPSYRTQSSSCEDLPSTTQPKGRQ"

rownames(phospho.raw) <- Ids

Before we going forward, the first thing to do is to remove Reverse and Potential.contaminant entries. These are entires that are reported by MaxQuant. They are either false positive identifications (in case of Reverse) or real identification contaminated proteins other than those extracted from cells of interests.

del <- which(phospho.raw[,"Reverse"] == "+" | phospho.raw[,"Potential.contaminant"] == "+")
phospho.clean <- phospho.raw[-del,]

Next, we will select columns that correspond to phosphosite quantification. These columns normally start with the key word Intensity and follow with sample-specific information. After selecting extracting the raw data from these columns we will transform the data using log2 and mark the missing values as NA

selected.cols <- grep("^Intensity.+_\\d", colnames(phospho.clean))
phospho.exp.full <- log2(as.matrix(phospho.clean[,selected.cols]))
phospho.exp.full[is.infinite(phospho.exp.full)] <- NA

We will also remove the time points after 6 hours for this analysis. Do you know why?

phospho.exp <- phospho.exp.full[,c(1:4, 37:40, 9:12, 21:24, 13:16, 29:32, 41:44)]

Now, we have the quantification of all phosphosites in this experiment saved in phospho.exp.raw. Let us start with some basic QC.

dim(phospho.exp)

## [1] 37620    28

# color code by replicates at each time point
grps <- gsub("_\\d", "", colnames(phospho.exp))
cs <- rainbow(length(unique(grps)))
colorCodes <- sapply(grps, switch, Intensity.0m=cs[1], Intensity.5m=cs[2],
                 Intensity.15m=cs[3], Intensity.30m=cs[4], Intensity.1h=cs[5], Intensity.3h=cs[6], Intensity.6h=cs[7])

par(mfrow=c(2,1))
plotQC(phospho.exp, panel = 1, cols = colorCodes)
plotQC(phospho.exp, panel = 2, cols = colorCodes)

Notice that we have plotted the percentage of missing values and performed a hierarchical clustering using the plotQC function which is part of the PhosR package.

2.2 Filtering of phophosites

Given the high percentage of missing values (>50%), we will start by performing some filtering to remove phosphosites that are unquantified in majority of the samples. We introduce two filtering functions implemented in PhosR. They are selectGrps and selectOverallPercent. The selectGrps function select phosphosites by requring the percentage of quantified values in a treatment group to be equal or higher than a given value [0-1], and the number of treatment groups that meet this cutoff. For example, the following code filter to select phosphosites that have 75% (3 out of 4 replicates in case of ESC dataset) quantified in at least 3 time point.

phospho.exp.filtered1 <- selectGrps(phospho.exp, grps = grps, percent = 0.75, n = 3)

The second function selectOverallPercent allows the selection of phosphosites that have be quantified in at least a given percentage of all samples. For example, the following select sites that have been quantified in 30% of all samples.

# filter to select phosphosites that have be quantified in at least 30% of all samples
phospho.exp.filtered2 <- selectOverallPercent(phospho.exp.filtered1, percent = 0.3)
dim(phospho.exp.filtered2)

## [1] 15388    28

par(mfrow=c(2,1))
plotQC(phospho.exp.filtered2, panel = 1, cols = colorCodes)
plotQC(phospho.exp.filtered2, panel = 2, cols = colorCodes)

2.3 Imputation

Now we have filtered the phosphosites that have large number of missing values, we will perform imputation to the remaining phosphosites so that subsequent analyses that require full matrix could be conducted properly. Before we perform imputation, we use the following code to perform a median-based centring of data in each treatment/time point.

# centring data across their median
phospho.exp.scaled <- medianScaling(phospho.exp.filtered2, scale = FALSE)
boxplot(phospho.exp.scaled, col=colorCodes, las=2)

We will use the following two steps for data imputation:

First impute data by time points for each phosphosite. For each time point, we will impute the missing value if 50% (2 out of 4) replicates are quantified.
Second impute the rest of the missing values using lower-tail imputation (tImpute)

set.seed(1)
# step 1
phospho.exp.imputed1 <- ssImpute(phospho.exp.scaled, percent = 0.5, grps = grps)
# step 2
phospho.exp.imputed <- tImpute(phospho.exp.imputed1)
# finally, convert the data into ratios with respect to the first time point (0h)
phospho.exp.ratio <- phospho.exp.imputed - rowMeans(phospho.exp.imputed[,1:4])
plotQC(phospho.exp.ratio, col=colorCodes)

Quick quiz

How many phosphosites do we have now?
Did we lose any time points (columns)?

3 Session Info

sessionInfo()

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] directPA_1.4 PhosR_0.1.0 
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.5        xfun_0.11               purrr_0.3.3            
##  [4] colorspace_1.4-1        miniUI_0.1.1.1          htmltools_0.4.0        
##  [7] viridisLite_0.3.0       yaml_2.2.0              rlang_0.4.2            
## [10] manipulateWidget_0.10.0 e1071_1.7-3             pillar_1.4.2           
## [13] later_1.0.0             glue_1.3.1              BiocGenerics_0.32.0    
## [16] calibrate_1.7.5         lifecycle_0.1.0         stringr_1.4.0          
## [19] munsell_0.5.0           pcaMethods_1.78.0       gtable_0.3.0           
## [22] htmlwidgets_1.5.1       evaluate_0.14           Biobase_2.46.0         
## [25] knitr_1.26              fastmap_1.0.1           httpuv_1.5.2           
## [28] crosstalk_1.0.0         parallel_3.6.1          class_7.3-15           
## [31] Rcpp_1.0.3              xtable_1.8-4            scales_1.1.0           
## [34] promises_1.1.0          limma_3.42.0            webshot_0.5.2          
## [37] jsonlite_1.6            mime_0.7                gridExtra_2.3          
## [40] ggplot2_3.2.1           digest_0.6.23           stringi_1.4.3          
## [43] dplyr_0.8.3             shiny_1.4.0             grid_3.6.1             
## [46] tools_3.6.1             magrittr_1.5            rgl_0.100.30           
## [49] lazyeval_0.2.2          tibble_2.1.3            crayon_1.3.4           
## [52] pkgconfig_2.0.3         dendextend_1.12.0       MASS_7.3-51.4          
## [55] ruv_0.9.7.1             assertthat_0.2.1        rmarkdown_1.18         
## [58] viridis_0.5.1           R6_2.4.1                compiler_3.6.1

Preprocessing (phospho)proteomics data

Pengyi Yang

Dec 2019