class: center, middle, inverse, title-slide # Enter the
tidyverse
with R and RStudio ### Kevin Wang and Garth Tarr ### 2-3 December 2019, Sydney, Australia --- class: segue # Session 1: Monday 15:20 - 16:30 --- ## Your instructors + Kevin: PhD candidate in statistics and bioinformatics.
`@KevinWang009` + Garth: Lecturer in statistics and data science.
`@garthtarr` <br> <br> .blockquote[ <h1> bit.ly/bis2019_tidy </h1> ] --- # Getting started with RStudio Cloud We will use RStudio Cloud to run this workshop. 1. Go to https://rstudio.cloud/, create an account and log in. 2. Click the little arrow next to the "New Project" and select "New Project from Git Repo". 3. Enter "https://github.com/SydneyBioX/tidy2019" (it will download some things) 4. Give your project a name (a better name than Untitled Project). 5. Run this line of code `source("install.R")` (this may take a few minutes to run, set it running then get involved in the icebreaker activity!) 6. If things seem to be running slowly after installing all the packages, just refresh the webpage. If you want to run this workshop at home, see [here](https://sydneybiox.github.io/tidy2019#local). --- # Ice breaker: find someone who... The aim of this "human bingo" activity is to meet a number of your peers, learn their names and find out something about them. Directions: - Circulate around the room, collecting names from participants who meet the criterion listed in each box. - You will have 5 minutes to try and fill in as many squares as you can with different names. - You can have more than one name in each square and you can have the same name in many different squares. - The "winner" is the first person who fills in (at least) one name in each box. --- # Workshop outline: make a boxplot similar to [this](https://www.nature.com/articles/cmi2014102.pdf) .pull-left-2[ <center> <img src="../assets/publication_boxplot.png", width="100%"> </center> ] -- .pull-right-1[ 1. Get gene expression data 1. Get clinical data 1. Understand what the data structure 1. Figure out how to join the two data 1. Reshape the data 1. Visualisation 1. Compute t-test p-values and add them to the plots ] .footnote[ Neve Polimeno, M., Ierano, C., D'Alterio, C., Simona Losito, N., Napolitano, M., Portella, L., ... Scala, S. (2015). CXCR4 expression affects overall survival of HCC patients whereas CXCR7 expression does not. _Cellular & molecular immunology_, 12(4), 474–482. [doi:10.1038/cmi.2014.102](https://www.nature.com/articles/cmi2014102.pdf) ] --- # What is tidy data? 1. Each row is an observation unit 1. Each column is a variable 1. Each cell is one value ```r sample_n(tidy_data, size = 10) ``` ``` ## # A tibble: 10 x 4 ## strain sample time dose ## <chr> <chr> <dbl> <dbl> ## 1 LE LE10D35 240 0 ## 2 HW WW44 6 100 ## 3 HW WW29 19 1000 ## 4 LE LE69 19 1 ## 5 HW WW22 19 10 ## 6 LE LE10D34 240 100 ## 7 LE LE10D30 240 100 ## 8 LE LE10D24 240 0 ## 9 LE LE92 6 100 ## 10 LE LE93 6 100 ``` --- # Why tidy data? + Because it will make our analyses and visualisation easier. + An example of dirty data: <center> <img src="../assets/UAC.png", width="100%"> </center> --- # Visualisation Gallery + NYC squirrel sighting: https://twitter.com/i/status/1189658282618699776 + Australian election result: https://srkobakian.github.io/sugarbag/index.html <center> <img src="https://github.com/srkobakian/sugarbag/raw/master/man/figures/README-animated-1.gif", width="60%"> </center> --- # A simple bioinformatics data + We will be working on a real gene expression data from [this article](https://rnajournal.cshlp.org/content/19/1/51.full): **Systematic evaluation of medium-throughput mRNA abundance platforms** by Prokopec et. al. + The original data was generated to understand reproducibility of three gene expression technologies by comparing gene expression of mice treated with different diet. + There are two pieces of data: 1. **The sample data** was downloaded from the [journal's website](https://rnajournal.cshlp.org/content/19/1/51/suppl/DC1). 1. **The gene expression data** was downloaded from the [Gene Expression Omnibus](https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE43251&format=file&file=GSE43251%5FNanoString%5Fnon%2Dnormalized%2Etxt%2Egz), accession ID GSE43251. We will play with the sample data on Monday, combine the two data for visualisation on Tuesday morning and perform basic modelling on Tuesday afternoon. --- # Session 1: reading data and basic manipulations .pull-left[ > **Why can't I do everything in Excel?** + Processing raw data to meaningful scientific result is often complex and iterative. + It is better and more rigorous to have a R-Markdown file documenting this process (reproducible). + In this session, we will learn how to read in the raw data, identify some problems with the data file, remove problematic rows and columns of the data and save a cleaned data file. ] .pull-right[ <img src="../assets/noexcel.png", width="100%"> ] .footnote[ Ziemann, M., Eren, Y. & El-Osta, A. (2016). Gene name errors are widespread in the scientific literature. _Genome Biology_ 17, 177 [doi:10.1186/s13059-016-1044-7](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1044-7) ] --- # Pipe operator `%>%` .large[ ``` awesome_data = raw_interesting_data %>% mutate(somehow) %>% filter(the_good_parts) %>% summarise(required_statistics) ``` Instead of: ``` awesome_data = summarise( filter( mutate(raw_interesting_data, somehow), the_good_parts), required_statistics) ``` ] .footnote[ Read more: https://r4ds.had.co.nz/pipes.html ] --- # Functions we covered in this session Reading and saving data: + `read_excel` from the `readxl` package + `write_csv` from the `readr` package Cleaning data + `clean_names` from the `janitor` package Column manipulations + `select`, `rename`, `mutate` from the `dplyr` package Row operations + `filter`, `group_by`, `summarise` from the `dplyr` package --- # Extension exercises 1. Load the `diamonds` data from `ggplot2`. Find the number of observations for each cut. 2. Calculate summary statistics of the price for each cut (e.g. mean, median, min, max, ...) --- class: segue # Session 2: `dplyr`, `tidyr` and `ggplot2` --- # Session 2: visualisation with `ggplot2` + `ggplot2` is rigorous framework for plotting + The `gg` in `ggplot2` stands for "grammar of graphics". + in order to produce a plot, a strict set of coding structure must be followed. + In particular, **every characteristic of your plot is produced using one variable**. This avoids the "double y-axes" plot: <center> <img src="https://statmodeling.stat.columbia.edu/wp-content/uploads/2015/05/journal.pone_.0112042.g001.png", width="40%"> </center> + Other articles about bad visualisations can be found [here](https://venngage.com/blog/misleading-graphs/) and [here](https://www.businessinsider.com.au/the-27-worst-charts-of-all-time-2013-6?r=US&IR=T#i-never-thought-it-was-possible-but-i-actually-understand-soccer-less-after-looking-at-this-chart-3). --- ## A quick example of `ggplot2` ```r divorce_margarine_scaled = read_csv("data/divorce_margarine_scaled.csv") ``` .pull-left[ ```r divorce_margarine_scaled ``` ``` ## # A tibble: 10 x 3 ## year divorce margarine ## <dbl> <dbl> <dbl> ## 1 2000 1.84 2.00 ## 2 2001 1.00 1.15 ## 3 2002 0.723 0.802 ## 4 2003 0.167 -0.0422 ## 5 2004 -0.111 -0.113 ## 6 2005 -0.667 -0.957 ## 7 2006 -0.389 -0.535 ## 8 2007 -0.389 -0.605 ## 9 2008 -0.389 -0.816 ## 10 2009 -1.78 -0.886 ``` ] <br> <br> <br> .pull-right[ + divorce should be mapped to the x-axis + margarine should be mapped to the y-axis ] --- ## First taste of `ggplot2` .pull-left[ ```r divorce_margarine_scaled %>% * ggplot() ``` ] .pull-right[ <img src="slides_files/figure-html/unnamed-chunk-7-1.png" width="504" /> ] --- ## First taste of `ggplot2` .pull-left[ ```r divorce_margarine_scaled %>% ggplot() + * aes( * x = divorce, * y = margarine * ) ``` ] .pull-right[ <img src="slides_files/figure-html/unnamed-chunk-8-1.png" width="504" /> ] --- ## First taste of `ggplot2` .pull-left[ ```r divorce_margarine_scaled %>% ggplot() + aes( x = divorce, y = margarine ) + * geom_point() ``` ] .pull-right[ <img src="slides_files/figure-html/unnamed-chunk-9-1.png" width="504" /> ] --- ## First taste of `ggplot2` .pull-left[ ```r divorce_margarine_scaled %>% ggplot() + aes( x = divorce, y = margarine, * label = year ) + geom_point() + * geom_text() ``` ] .pull-right[ <img src="slides_files/figure-html/unnamed-chunk-10-1.png" width="504" /> ] --- ## First taste of `ggplot2` .pull-left[ ```r divorce_margarine_scaled %>% ggplot() + aes( x = divorce, y = margarine, label = year ) + geom_point() + geom_text() + labs(title = "Divorce rate and margarine consumption", x = "Divorce rate", * y = "Margarine consumption") + * theme_bw(base_size = 20) ``` ] .pull-right[ <img src="slides_files/figure-html/unnamed-chunk-11-1.png" width="504" /> ] --- ## (Discuss with your neighbours) can you make this plot using this data? .pull-left[ ```r divorce_margarine_scaled ``` ``` ## # A tibble: 10 x 3 ## year divorce margarine ## <dbl> <dbl> <dbl> ## 1 2000 1.84 2.00 ## 2 2001 1.00 1.15 ## 3 2002 0.723 0.802 ## 4 2003 0.167 -0.0422 ## 5 2004 -0.111 -0.113 ## 6 2005 -0.667 -0.957 ## 7 2006 -0.389 -0.535 ## 8 2007 -0.389 -0.605 ## 9 2008 -0.389 -0.816 ## 10 2009 -1.78 -0.886 ``` ] .pull-right[ <img src="slides_files/figure-html/unnamed-chunk-13-1.png" width="504" /> ] --- ## Pivot operation <center> <img src="https://www.fromthebottomoftheheap.net/assets/img/posts/tidyr-longer-wider.gif", width="40%"> </center> --- ## `pivot_longer` .pull-left[ ```r divorce_margarine_scaled ``` ``` ## # A tibble: 10 x 3 ## year divorce margarine ## <dbl> <dbl> <dbl> ## 1 2000 1.84 2.00 ## 2 2001 1.00 1.15 ## 3 2002 0.723 0.802 ## 4 2003 0.167 -0.0422 ## 5 2004 -0.111 -0.113 ## 6 2005 -0.667 -0.957 ## 7 2006 -0.389 -0.535 ## 8 2007 -0.389 -0.605 ## 9 2008 -0.389 -0.816 ## 10 2009 -1.78 -0.886 ``` ] .pull-right[ ```r long_data = divorce_margarine_scaled %>% tidyr::pivot_longer( cols = -c("year"), names_to = "type", values_to = "values") long_data ``` ``` ## # A tibble: 20 x 3 ## year type values ## <dbl> <chr> <dbl> ## 1 2000 divorce 1.84 ## 2 2000 margarine 2.00 ## 3 2001 divorce 1.00 ## 4 2001 margarine 1.15 ## 5 2002 divorce 0.723 ## 6 2002 margarine 0.802 ## 7 2003 divorce 0.167 ## 8 2003 margarine -0.0422 ## 9 2004 divorce -0.111 ## 10 2004 margarine -0.113 ## 11 2005 divorce -0.667 ## 12 2005 margarine -0.957 ## 13 2006 divorce -0.389 ## 14 2006 margarine -0.535 ## 15 2007 divorce -0.389 ## 16 2007 margarine -0.605 ## 17 2008 divorce -0.389 ## 18 2008 margarine -0.816 ## 19 2009 divorce -1.78 ## 20 2009 margarine -0.886 ``` ] --- ```r long_data %>% ggplot(aes(x = year, y = values, colour = type)) + geom_point() + geom_path() + labs(title = "Tidy data is dependent on the plot you are making") ``` <img src="slides_files/figure-html/unnamed-chunk-16-1.png" width="864" /> --- # Functions we covered in this session Reading and saving data: + `read_csv` and `read_delim` from the `readr` package Combining data: + `left_join` from the `dplyr` package Pivoting operations: + `pivot_longer` and `pivot_wider` from the `tidyr` package Visualisation + `geom_point` and `geom_histogram` from the `ggplot2` package + `ggboxplot` and `stat_compare_means` from the `ggpubr` package --- class: segue # Session 3: modelling with `tidyverse` (advanced exercise) --- # Final remarks + If you can't get through the entire workshop, don't worry! The website will be up indefinitely. + If you're looking to learn more: - [R for Data Science](https://r4ds.had.co.nz/) - [Resources for a 5 day workshop](https://tidy-ds.wjakethompson.com/) built around the R for Data Science book - [Tidy Tuesday](https://github.com/rfordatascience/tidytuesday) - [Data Wrangling & Exploration with the Tidyverse in R](https://github.com/jobreu/tidyverse-workshop-gesis-2019) workshop + Follow some people on twitter: - [hadleywickham](https://twitter.com/hadleywickham) (Hadley Wickham, chief scientist at RStudio) - [dataandme](https://twitter.com/dataandme) (Mara Averick, tidyverse dev advocate)