R
, in particular the data.frame
When working in R, there are some functions and data sets that are always available, but the real strength of R comes from its community of developers who continually improve the set of available features and add additional functionality through an ecosystem of “packages”.
The following packages have already been installed in RStudio Cloud, but you might need to install them when working on your own machine using install.packages("name_of_package_to_be_installed")
.
Before we take a dive into the tidyverse, let’s take a minute to learn a bit variable types in R. The reason that we want to do this is because everything in R is stored as an object which ultimately determines its behaviour.
obj1 = c(TRUE, FALSE)
obj1
class(obj1)
obj2 = 1:5
obj2
class(obj2)
obj3 = rnorm(5)
obj3
class(obj3)
obj4 = c("a", "b")
obj4
class(obj4)
obj5 = c(obj1, obj2)
obj5
class(obj5)
obj6 = list(obj1, obj2)
obj6
class(obj6[[1]])
class(obj6[[2]])
data(diamonds, package = "ggplot2")
diamonds
class(diamonds)
dplyr::glimpse(diamonds)
In this session, we will read in a Excel dataset (.xls
format). We will use the readxl package to perform this task.
Of course, .xls
is not the only data type that R
can deal with. Have a look at this cheatsheet to see some data types that the readr package is able to handle. If you want to read in SAS or SPSS files, have a look at the haven package.
The read_excel()
function from the readxl package is perfect to read in a .xls
file, let’s try that!
Oh, no! We seemed to read in a very ugly dataset. What happened?
If we open the sample_data.xls
file in Excel, we see that R didn’t do anything wrong! It actually faithfully read in everything that you can see in the Excel sheet.
This .xls
file is poorly formatted with the first three rows being notes that are not part of the data table. We should ask the function read_excel
to skip the first three rows, use the 4th row as the column headings. How can we do that?
The column names of this dataset isn’t quite as nice as we would like.
We can tidy them using the clean_names()
function from the janitor package. Even though this package is not officially part of the tidyverse, it is a very useful package nonetheless.
We can see the new column names:
Note that we have also used the pipe operator, %>%
, which passes the results from one function into the next function. In the above case, these two lines of code would give identical results:
raw_data %>% janitor::clean_names()
janitor::clean_names(raw_data)
We will use the pipe operator a lot over the next two days, it’s a game changer for the way you write R code.
We will now use the dplyr package to perform some basic data cleaning. The dplyr package is one of the most popular packages in the tidyverse. Its main functions are designed to interact with the data.frame
object in R in a very intuitive way. This is why its name is an excellent pun: it is a plyer for data frames (okay, maybe it depends on your sense of humour).
Let’s see dplyr in action. If we want to select only the strain
column of this data, then we can use dplyr::select()
.
If we wanted the strain
variable as a vector (i.e. not inside a dataframe object) we use dplyr::pull()
.
If we want to select multiple columns in the data, we could include more names (unquoted) into the dplyr::select()
function.
However, sometimes it is useful to select columns using a character vector, especially considering the “dose” column has a Greek symbol (nano-gram) inside.
# select using unquoted column names
clean_col_data %>%
dplyr::select(strain, sample) %>%
dplyr::glimpse()
# or the first four columns
clean_col_data %>%
dplyr::select(1:4) %>%
dplyr::glimpse()
# a range of columns
clean_col_data %>%
dplyr::select(strain:rin) %>%
dplyr::glimpse()
# columns starting with a particular string
clean_col_data %>%
dplyr::select(starts_with("s")) %>%
dplyr::glimpse()
# select using character vector
select_columns = colnames(clean_col_data)[1:4]
sub_data = clean_col_data %>%
dplyr::select(one_of(select_columns))
glimpse(sub_data)
We can also rename the columns.
The dplyr::mutate()
function does exactly what you would expect: it changes an entire column. It has the structure that new column = some changes to the (old column)
.
If the new column has the same name as the old column, then this column will be over-written.
When we mutated our data, we saw that there was a warning message, NAs introduced by coercion
. The NA
is is R’s way to represent a missing value. So what happened?
If we have a quick look at the original xls
file, we see that someone must have copied the data with the heading four times. This is why when we converted time
into a numeric variable, R
recognised the column headings in the middle of the data as weird values and assigned those with NA
.
We can visualise NA
in our data using the vis_miss()
function from the visdat package.
We will now filtering out the three rows of column headings, which we have converted into NA
in the time
and dose
columns.
We will use the filter()
function from the dplyr package. The function takes in logical
vectors of the same length as the nrow(data)
as the input.
We will finish today’s session with something simple, but powerful.
If we want to create summary statistics of our data.frame, we need to think about which variable are we summarising over. In this case, we will count the number of samples for each of the strains.
Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] visdat_0.5.3 readxl_1.3.1 janitor_1.2.0 forcats_0.4.0
## [5] stringr_1.4.0 dplyr_0.8.3 purrr_0.3.3 readr_1.3.1
## [9] tidyr_1.0.0 tibble_2.1.3 ggplot2_3.2.1 tidyverse_1.3.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_0.2.5 xfun_0.11 haven_2.2.0 lattice_0.20-38
## [5] snakecase_0.11.0 colorspace_1.4-1 vctrs_0.2.0 generics_0.0.2
## [9] htmltools_0.4.0 yaml_2.2.0 utf8_1.1.4 rlang_0.4.2
## [13] pillar_1.4.2 withr_2.1.2 glue_1.3.1 DBI_1.0.0
## [17] dbplyr_1.4.2 modelr_0.1.5 lifecycle_0.1.0 munsell_0.5.0
## [21] gtable_0.3.0 cellranger_1.1.0 rvest_0.3.5 evaluate_0.14
## [25] labeling_0.3 knitr_1.26 fansi_0.4.0 broom_0.5.2
## [29] Rcpp_1.0.3 backports_1.1.5 scales_1.1.0 jsonlite_1.6
## [33] farver_2.0.1 fs_1.3.1 hms_0.5.2 digest_0.6.23
## [37] stringi_1.4.3 grid_3.6.1 cli_1.1.0 tools_3.6.1
## [41] magrittr_1.5 lazyeval_0.2.2 crayon_1.3.4 pkgconfig_2.0.3
## [45] zeallot_0.1.0 ellipsis_0.3.0 xml2_1.2.2 reprex_0.3.0
## [49] lubridate_1.7.4 assertthat_0.2.1 rmarkdown_1.18 httr_1.4.1
## [53] rstudioapi_0.10 R6_2.4.1 icon_0.1.0 nlme_3.1-142
## [57] compiler_3.6.1