1 Learning outcomes

  • Understand basic variable types in R, in particular the data.frame
  • Loading in an Excel sheet as an R data frame
  • Cleaning and renaming columns names
  • Selecting and creating new columns
  • Looking for missing values
  • Filtering rows and summarising
  • Saving cleaned data for next stage of analysis

2 Loading packages

When working in R, there are some functions and data sets that are always available, but the real strength of R comes from its community of developers who continually improve the set of available features and add additional functionality through an ecosystem of “packages”.

The following packages have already been installed in RStudio Cloud, but you might need to install them when working on your own machine using install.packages("name_of_package_to_be_installed").

library(tidyverse)
library(janitor)
library(readxl)
library(visdat)

3 Warm up: variable types

Before we take a dive into the tidyverse, let’s take a minute to learn a bit variable types in R. The reason that we want to do this is because everything in R is stored as an object which ultimately determines its behaviour.

obj1 = c(TRUE, FALSE)
obj1
class(obj1)

obj2 = 1:5
obj2
class(obj2)

obj3 = rnorm(5)
obj3
class(obj3)

obj4 = c("a", "b")
obj4
class(obj4)

obj5 = c(obj1, obj2)
obj5
class(obj5)

obj6 = list(obj1, obj2)
obj6
class(obj6[[1]])
class(obj6[[2]])

data(diamonds, package = "ggplot2")
diamonds
class(diamonds)
dplyr::glimpse(diamonds)

4 Reading in data

In this session, we will read in a Excel dataset (.xls format). We will use the readxl package to perform this task.

Of course, .xls is not the only data type that R can deal with. Have a look at this cheatsheet to see some data types that the readr package is able to handle. If you want to read in SAS or SPSS files, have a look at the haven package.

4.1 First attempt

The read_excel() function from the readxl package is perfect to read in a .xls file, let’s try that!

raw_data = readxl::read_excel(path = "data/sample_data.xls")
head(raw_data)

Oh, no! We seemed to read in a very ugly dataset. What happened?

If we open the sample_data.xls file in Excel, we see that R didn’t do anything wrong! It actually faithfully read in everything that you can see in the Excel sheet.

This .xls file is poorly formatted with the first three rows being notes that are not part of the data table. We should ask the function read_excel to skip the first three rows, use the 4th row as the column headings. How can we do that?

4.2 Second attempt

We use the skip argument to the read_excel() function. Have a look at this!

raw_data = readxl::read_excel("data/sample_data.xls", skip = 3)
head(raw_data)

5 Cleaning column names

The column names of this dataset isn’t quite as nice as we would like.

colnames(raw_data)

We can tidy them using the clean_names() function from the janitor package. Even though this package is not officially part of the tidyverse, it is a very useful package nonetheless.

colnames(raw_data)
clean_col_data = raw_data %>% 
  janitor::clean_names()

We can see the new column names:

colnames(clean_col_data)

Note that we have also used the pipe operator, %>%, which passes the results from one function into the next function. In the above case, these two lines of code would give identical results:

raw_data %>% janitor::clean_names()
janitor::clean_names(raw_data)

We will use the pipe operator a lot over the next two days, it’s a game changer for the way you write R code.

6 Selecting data

6.1 Selecting certain columns

We will now use the dplyr package to perform some basic data cleaning. The dplyr package is one of the most popular packages in the tidyverse. Its main functions are designed to interact with the data.frame object in R in a very intuitive way. This is why its name is an excellent pun: it is a plyer for data frames (okay, maybe it depends on your sense of humour).

Let’s see dplyr in action. If we want to select only the strain column of this data, then we can use dplyr::select().

sub_data = clean_col_data %>% 
  dplyr::select(strain)
glimpse(sub_data)

If we wanted the strain variable as a vector (i.e. not inside a dataframe object) we use dplyr::pull().

strain_vec = clean_col_data %>% pull(strain)
glimpse(strain_vec)

7 Selecting columns by character vector

If we want to select multiple columns in the data, we could include more names (unquoted) into the dplyr::select() function.

However, sometimes it is useful to select columns using a character vector, especially considering the “dose” column has a Greek symbol (nano-gram) inside.

# select using unquoted column names
clean_col_data %>% 
  dplyr::select(strain, sample) %>% 
  dplyr::glimpse()

# or the first four columns
clean_col_data %>% 
  dplyr::select(1:4) %>% 
  dplyr::glimpse()

# a range of columns
clean_col_data %>% 
  dplyr::select(strain:rin) %>% 
  dplyr::glimpse()

# columns starting with a particular string
clean_col_data %>% 
  dplyr::select(starts_with("s")) %>% 
  dplyr::glimpse()

# select using character vector
select_columns = colnames(clean_col_data)[1:4]
sub_data = clean_col_data %>% 
  dplyr::select(one_of(select_columns))
glimpse(sub_data)

8 Renaming columns

We can also rename the columns.

sub_data = sub_data %>% 
  dplyr::rename(time = time_hours,
                dose = dose_µg_kg)
glimpse(sub_data)
# a more advanced method is to use rename_at()
# rename_columns = c("strain", "sample", "time", "dose")
# sub_data = sub_data %>% 
#   dplyr::rename_at(vars(select_columns), ~ rename_columns)
# sub_data 

9 Mutating columns

The dplyr::mutate() function does exactly what you would expect: it changes an entire column. It has the structure that new column = some changes to the (old column).

sub_data %>% 
  dplyr::mutate(time_2 = as.numeric(time))

If the new column has the same name as the old column, then this column will be over-written.

mutate_data = sub_data %>% 
  dplyr::mutate(time = as.numeric(time),
                dose = as.numeric(dose))
mutate_data

10 Missing values

When we mutated our data, we saw that there was a warning message, NAs introduced by coercion. The NA is is R’s way to represent a missing value. So what happened?

If we have a quick look at the original xls file, we see that someone must have copied the data with the heading four times. This is why when we converted time into a numeric variable, R recognised the column headings in the middle of the data as weird values and assigned those with NA.

We can visualise NA in our data using the vis_miss() function from the visdat package.

mutate_data %>% 
  visdat::vis_miss()

11 Filtering out problematic rows

We will now filtering out the three rows of column headings, which we have converted into NA in the time and dose columns.

We will use the filter() function from the dplyr package. The function takes in logical vectors of the same length as the nrow(data) as the input.

## These are the rows we want to filter out
mutate_data %>% 
  dplyr::filter(is.na(time))

## The reverse of the above is written with a `!`. 
filter_data = mutate_data %>% 
  dplyr::filter(!is.na(time))

# could also use: mutate_data %>% drop_na(time)

filter_data

12 Making summary statistics

We will finish today’s session with something simple, but powerful.

If we want to create summary statistics of our data.frame, we need to think about which variable are we summarising over. In this case, we will count the number of samples for each of the strains.

12.1 Counting number of samples

filter_data %>% 
  group_by(strain) %>% 
  summarise(num_samples = n())

filter_data %>% 
  group_by(strain) %>% 
  count()

12.2 Counting with more variables

filter_data %>% 
  group_by(strain, dose) %>% 
  count()

filter_data %>% 
  group_by(strain, dose, time) %>% 
  count()

13 Saving data

write_csv(filter_data, path = "data/clean_sample_data.csv")

14 References

Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686

15 Session Info

sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] visdat_0.5.3    readxl_1.3.1    janitor_1.2.0   forcats_0.4.0  
##  [5] stringr_1.4.0   dplyr_0.8.3     purrr_0.3.3     readr_1.3.1    
##  [9] tidyr_1.0.0     tibble_2.1.3    ggplot2_3.2.1   tidyverse_1.3.0
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.5 xfun_0.11        haven_2.2.0      lattice_0.20-38 
##  [5] snakecase_0.11.0 colorspace_1.4-1 vctrs_0.2.0      generics_0.0.2  
##  [9] htmltools_0.4.0  yaml_2.2.0       utf8_1.1.4       rlang_0.4.2     
## [13] pillar_1.4.2     withr_2.1.2      glue_1.3.1       DBI_1.0.0       
## [17] dbplyr_1.4.2     modelr_0.1.5     lifecycle_0.1.0  munsell_0.5.0   
## [21] gtable_0.3.0     cellranger_1.1.0 rvest_0.3.5      evaluate_0.14   
## [25] labeling_0.3     knitr_1.26       fansi_0.4.0      broom_0.5.2     
## [29] Rcpp_1.0.3       backports_1.1.5  scales_1.1.0     jsonlite_1.6    
## [33] farver_2.0.1     fs_1.3.1         hms_0.5.2        digest_0.6.23   
## [37] stringi_1.4.3    grid_3.6.1       cli_1.1.0        tools_3.6.1     
## [41] magrittr_1.5     lazyeval_0.2.2   crayon_1.3.4     pkgconfig_2.0.3 
## [45] zeallot_0.1.0    ellipsis_0.3.0   xml2_1.2.2       reprex_0.3.0    
## [49] lubridate_1.7.4  assertthat_0.2.1 rmarkdown_1.18   httr_1.4.1      
## [53] rstudioapi_0.10  R6_2.4.1         icon_0.1.0       nlme_3.1-142    
## [57] compiler_3.6.1