1 Learning outcomes of this session

  • Join two data together using the left_join function
  • Making basic plots using ggplot2: boxplots, histograms and scatter plots
  • Use pivot_longer and pivot_wider to reshape data for visualisation purposes
  • Saving cleaned data for next stage of analysis

2 Loading packages

library(tidyverse)

3 Loading cleaned sample data

Let’s begin today by reading in the csv file that we created in session 1.

session1_data = read_csv(file = "data/clean_sample_data.csv")

4 Reading in gene expression data (top 6 rows)

Remember our aim for this workshop is to join the mice sample data with a mice gene expression data. session1_data is the sample data we have already cleaned yesterday, so we should now read in the gene expression data now.

This gene expression data was downloaded straight from the Gene Expression Omnibus and it was saved as a txt file. This txt file is more or less like a rectangular Excel sheet, with each row separated by a new line and each column is separated by a a single press of the “tab” button on the eyboard. Thus, we call this tab as the “delimiter” of this file. R understands this “tab” as \t.

The whole data has 54 rows, but for the sake of brevity, we will only read in the first 6 rows. We will also get rid of the first column of this raw data (the column of manufacturer’s ID).

raw_ge_data = read_delim(file = "data/GSE43251_NanoString_non-normalized.txt", delim = "\t", n_max = 6)
## Alternatively:
## raw_ge_data = read_tsv(file = "data/GSE43251_NanoString_non-normalized.txt", n_max = 6)

clean_ge_data = raw_ge_data %>% 
  dplyr::select(-1)

clean_ge_data

Let’s try to understand this data before we proceed to the next section. The first column of this data contains the gene symbols that was measured by the gene expression machine. From the second column onwards, each column records the gene expression measurement for one mouse and the sample ID of the mouse is the column name (i.e. HW1, HW2, etc.).

So how should we join these two data? If we just stack the two data together, there are no guarantees to say that we will have the right ordering of sample ID. In fact, if you try to just bind the two sample IDs together, R will complain about this.

length(colnames(clean_ge_data)[-1]) ## Not counting the gene symbol column
length(session1_data$sample)

cbind(
  colnames(clean_ge_data)[-1],
  session1_data$sample)

This is where the left_join function from the dplyr package can help us.

5 left_join sample data

Let’s begin by looking at the samples that have gene expression data available. We will make a data.frame documenting the sample IDs and another column that indicates the availability of the gene expression data (which always have a value of "yes, since we extracted the ID from the gene expression data itself).

ge_sample_data = data.frame(
  sample = colnames(clean_ge_data)[-1],
  ge_data_available = "yes"
)

head(ge_sample_data)

Now we will apply the left_join function between the session1_data, the full sample data, and the ge_sample_data, which only contains the samples with gene expression information available. The sample column in both data serves as an “identifier” to match each mice in the session1_data to each mice in ge_sample_data. See below.

joined_sample_data = left_join(session1_data, ge_sample_data, by = "sample")

joined_sample_data

The resulted data has the same number of rows as session1_data, and this is no surprise because it is the first input into the function and hence the name. The by input specify which column should be used as the identifier.

Looking at this data seems to suggest that nothing has gone wrong in the first 10 rows. But if we apply a small sampling on the number of rows, we can see that there are indeed sample mismatching between the sample and the gene expression data. This makes our task of combining data more difficult. Because we only know how to combine two datasets through a identifier column. This is not an issue with the session1_data, but the challenge here is the gene expression data, which has all identifiers as the column names.

If only there is a way that we can grab the column names of the gene expression data into a column while preserving the gene expression values. That’d make things so much easier! Let’s learn about the pivot_longer function first before returning to this problem.

6 tidyr::pivot_longer the gene expression data

The pivot_longer function takes a “wide” data into a “long” data. We simply tell the function which column we want to preserve, in this case, it will be the gene symbols, because we still want to distinguish each genes for each of the samples after the pivot operation.

clean_ge_data %>% 
  tidyr::pivot_longer(
    cols = -GENE_SYMBOL)

Of course, it is be better to give the new columns some informative names.

ge_data_long = clean_ge_data %>% 
  tidyr::pivot_longer(
    cols = -GENE_SYMBOL,
    names_to = "sample", 
    values_to = "gene_expression")

ge_data_long

Pay attention to how this data still maintains its tidy structure but it is now in a better format to be left_join with the session1_data. This may look simple, but you should appreciate the thinking behind this process as it will be ever more important as your data gets more complex.

long_sample_ge_data = session1_data %>% 
  left_join(ge_data_long, by = "sample") %>% 
  dplyr::filter(!is.na(GENE_SYMBOL)) %>% 
  janitor::clean_names()

long_sample_ge_data

Now that we have a very nice and cleaned data, we are now in a good position to learn about visualisation.

7 ggplot2 visualisation

There are many plotting systems and packages out there, so why should we use this ggplot2 system?

The gg in ggplot2 stands for “grammar of graphics”. That means in order to produce a ggplot, a very strict set of coding structure must be followed. These restrictions are designed in such a way that you can always look at a ggplot and know exactly which characteristic of the plot comes from which column of the data. This is extremely important for the purpose of reproducibility.

The ggplot2 plotting system is based on a layered structure, which means the final plot can always be decompose into several layers. We will illustrate what this means now step-by-step.

Let’s say we want to draw a scatter plot of the strains and the gene expression. We will first begin with a data and pipe that into the ggplot() function. This will give us a blank canvas for plotting.

long_sample_ge_data

long_sample_ge_data %>% 
  ggplot()

Now is a good time to think about what are the components of a scatter plot?

  • x-coordinates of points
  • y-coordinates of points
  • Actually placement of points

The first two components are referred to as “mapping” in ggplot. Because the x-coordinates should be stored as a column in the data.frame, so we will “map” this column to be a characteristic on our ggplot. Similary, the y-coordinates should also be a mapping. This is why we write our code in the following manner. Note, we are using the aes() function to wrap everything.

long_sample_ge_data %>% 
  ggplot(mapping = aes(x = strain, y = gene_expression))

Even though we still have a blank canvas, but notice how the x and y axes are now labelled with the variables/columns in the data.frame. So what are we missing for a scatter plot? Actually putting the points on the canvas!

In ggplot2, a plot is always a combination of layers. We have now a layer of blank canvas with axes, we will want to draw on this canvas using geom_point() which places the actual points on the canvas based on the mappings we have already specified. In ggplot2, geometries are functions that actually draws on the canvas based on the aesthetic mappings given. These are referred to as “geometries”.

long_sample_ge_data %>% 
  ggplot(aes(x = strain, 
             y = gene_expression)) + 
  geom_point()

This is your first ggplot! Let’s give it a title!

long_sample_ge_data %>% 
  ggplot(aes(x = strain, 
             y = gene_expression)) + 
  geom_point() +
  labs(title = "My first ggplot!")

We see that the scale of data points can be very different. This is due to some genes always tend to have a higher expression than others (due to biology). So it will be a good idea to make a plot for each gene instead. This means that we need to make this plot for each value in the gene_symbol column in our data. This can be done using the facet_wrap function, which will produce a plot that we have already made, but split by all unique values in the given variable, in this case gene_symbol.

long_sample_ge_data %>% 
  ggplot(aes(x = strain, 
             y = gene_expression)) + 
  geom_point() +
  facet_wrap(~gene_symbol) +
  labs(title = "My first ggplot!")

This plot shows that “Aldh3a1” is the main culprit for having a very different scale compared all other five genes.

One way to deal with this scaling is to use the logarithm scale. ggplot2 has a very convenient way to doing this. The scale_y_log10 function will scale the y-axis to log10 scale and this is a very convenient function because it retains the labelling of the original data but the points are on a log10-scale.

long_sample_ge_data %>% 
  ggplot(aes(x = strain, 
             y = gene_expression)) + 
  geom_point() +
  facet_wrap(~gene_symbol) +
  scale_y_log10() +
  labs(title = "My first ggplot!")

We now have the scatter plots on a good scale, and we can already see some interesting features from these plots. For example, the “Aldh3a1” gene seems to show off some differences between the two strains, and the “Bbs2” gene seems to have a different variance between the two strains.

So, how can we improve on this plot further?

We can layer plots of other features onto this! For example, boxplots.

long_sample_ge_data %>% 
  ggplot(aes(x = strain, 
             y = gene_expression)) + 
  geom_point() +
  geom_boxplot() +
  facet_wrap(~gene_symbol) +
  scale_y_log10() +
  labs(title = "My first ggplot!")

Now it is a good time to stop and think about what we did.

We started off with a data.frame and a blank canvas with the aim to make a scatter plot. We added aesthetic mappings onto our ggplot and evaluated the quality of our plot as we go along. Every time we identify deficiencies in our plot, we simply add more layers to the plot. This detective-work is what makes ggplot2 is so popular, because it is very flexible in handling new ideas and plotting aesthetics. But at the same time, every aesthetic of this plot can be traced back to a particular column in the data.frame.

There are so many functions and tricks that we can’t really go through in details. Here is a very helpful cheatsheet. And in general, the best way is to Google the plot that you want and copy the code that someone else already made.

Have a play with the following code. Try deleting components of the code to understand which aspect of the plot it was controlling

long_sample_ge_data %>% 
  ggplot(aes(x = strain, 
             y = gene_expression, 
             colour = strain)) + 
  geom_point() +
  geom_boxplot() +
  facet_wrap(~ gene_symbol, scales = "free_y") +
  scale_y_log10(labels = scales::comma) +
  scale_color_manual(values = c("red", "blue")) +
  labs(x = "Strain", 
       y = "Gene Expression", 
       title = "Boxplot of genes across strains") +
  theme_bw() +
  theme(legend.position = "bottom")

7.1 Histogram

We will now try to make a histogram of each gene.

Let’s begin with the question: what are the aesthetic mappings of a histogram?

  • x-axis should be numerical values of gene expression

Let’s try this out. For making a histogram, we need to use the geom_histogram() geometry.

long_sample_ge_data %>% 
  ggplot(aes(x = gene_expression)) + 
  geom_histogram()

You might be wondering, how is it possible that we didn’t specify what should be on the y-axis but still got a plot out? This is because by the very definition of a histogram, only counts or density of the give x-variable should be plotted. Here, the default of ggplot2 is to use counts.

Again, we see the same scaling issue as the above boxplots. Hence, we will try to apply the same functions to manipulate the plots, namely:

  • facet_wrap which creates a plot for each gene in small panels,
  • scale_x_log10 which plots the x-axis on the log10 scale
long_sample_ge_data %>% 
  ggplot(aes(x = gene_expression)) + 
  geom_histogram() +
  facet_wrap(~gene_symbol) +
  scale_x_log10(labels = scales::comma)

This plot looks pretty good! We can see that some genes are unimodal while others are bimodal. This is an new characteristic of the data that we didn’t observe from the boxplots above.

We can make the modality more obvious by imposing the density of each gene onto the histogram. Let’s try out the geom_density function then!

long_sample_ge_data %>% 
  ggplot(aes(x = gene_expression)) + 
  geom_histogram() +
  geom_density() +
  facet_wrap(~gene_symbol) +
  scale_x_log10(labels = scales::comma)

This doesn’t look quite right, so what happened?

If we pay attention to the y-axis of our ggplot so far, we see that it is count, which means that each of bin in the histograms is showing the number of observations in that interval. However, when we talk about density, it is on a very different scale to the number of counts.

So how can we make a density-based histogram?

Let’s think this logically: right now, the count of observation is being mapped to the y-axis (default of geom_histogram), and we need to overwrite this by making density to be mapped to the y-axis. So the answer would be adding an extra input of y = ..density.. into the aes() function. The .. is just a technical way that ggplot handles such problem which you don’t have to worry too much on the details about. (The reason that we can’t use y = density is because in that case ggplot will interpret this line of code as mapping a column named density from the data.frame to the plot - but such column does not exist.)

long_sample_ge_data %>% 
  ggplot(aes(x = gene_expression, y = ..density..)) + 
  geom_histogram() +
  geom_density() +
  facet_wrap(~gene_symbol, scales = "free") +
  scale_x_log10(labels = scales::comma)

8 Plotting two genes against each other

Now, let’s consider how we could make a scatter plot of two genes, say the “Ahrrr” and the “Cxcr7” gene. So, let’s answer the question, what are the components of a scatter plot in this case?

  • Ahrrr gene expression values on the x-axis
  • Cxcr7 gene expression values on the y-axis

But if we look at the data.frame we are using for plotting, long_sample_ge_data, it has all the gene expression values mixed in that gene_symbol column. So we can’t effectively map the expression of only “Ahrrr” and “Cxcr7” to the x and y axes respectively.

long_sample_ge_data 

So we need to think about how to separate out the gene expression values into distinct columns. You may realise this as to be exactly the opposite of the operation of pivot_longer. In the tidyr package, the tidyr::pivot_wider allows for this operation.

wide_sample_ge_data = long_sample_ge_data %>% 
  tidyr::pivot_wider(names_from = "gene_symbol",
                     values_from = "gene_expression")

wide_sample_ge_data

Now that we have each gene as a column, we can perform our familiar ggplot operations. Have a play with each of the layers below. Notice that we also added the colour of the strain as well as a new geometry, geom_smooth that allows for adding linear regression lines.

wide_sample_ge_data %>% 
  ggplot(aes(x = Ahrr,
             y = Cxcr7,
             colour = strain)) +
  geom_point() +
  geom_smooth(method = "lm") +
  scale_x_log10(labels = scales::comma) +
  scale_y_log10(labels = scales::comma)

9 Saving data

write_csv(long_sample_ge_data, path = "data/long_sample_ge_data.csv")
write_csv(wide_sample_ge_data, path = "data/wide_sample_ge_data.csv")

10 Making gene expression boxplots with p-values

library(ggpubr)

ggpubr::ggboxplot(
  data = long_sample_ge_data,
  x = "strain", 
  y = "gene_expression",
  facet.by = "gene_symbol",
  scales = "free") +
  ggpubr::stat_compare_means()

11 References

Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686

12 Session Info

sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] ggpubr_0.2.4    magrittr_1.5    visdat_0.5.3    readxl_1.3.1   
##  [5] janitor_1.2.0   forcats_0.4.0   stringr_1.4.0   dplyr_0.8.3    
##  [9] purrr_0.3.3     readr_1.3.1     tidyr_1.0.0     tibble_2.1.3   
## [13] ggplot2_3.2.1   tidyverse_1.3.0
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.5 xfun_0.11        haven_2.2.0      lattice_0.20-38 
##  [5] snakecase_0.11.0 colorspace_1.4-1 vctrs_0.2.0      generics_0.0.2  
##  [9] htmltools_0.4.0  yaml_2.2.0       utf8_1.1.4       rlang_0.4.2     
## [13] pillar_1.4.2     withr_2.1.2      glue_1.3.1       DBI_1.0.0       
## [17] dbplyr_1.4.2     modelr_0.1.5     lifecycle_0.1.0  ggsignif_0.6.0  
## [21] munsell_0.5.0    gtable_0.3.0     cellranger_1.1.0 rvest_0.3.5     
## [25] evaluate_0.14    labeling_0.3     knitr_1.26       fansi_0.4.0     
## [29] broom_0.5.2      Rcpp_1.0.3       backports_1.1.5  scales_1.1.0    
## [33] jsonlite_1.6     farver_2.0.1     fs_1.3.1         hms_0.5.2       
## [37] digest_0.6.23    stringi_1.4.3    grid_3.6.1       cli_1.1.0       
## [41] tools_3.6.1      lazyeval_0.2.2   crayon_1.3.4     pkgconfig_2.0.3 
## [45] zeallot_0.1.0    ellipsis_0.3.0   xml2_1.2.2       reprex_0.3.0    
## [49] lubridate_1.7.4  assertthat_0.2.1 rmarkdown_1.18   httr_1.4.1      
## [53] rstudioapi_0.10  R6_2.4.1         icon_0.1.0       nlme_3.1-142    
## [57] compiler_3.6.1