Introduction to the Concepts of ClassifyR
Dario Strbenac
The University of Sydney, Australia.
Audience: This guide is a suitable starting place for anyone who is just beginning to use ClassifyR.
Purpose
ClassifyR is a modelling and evaluation framework for data sets which often have multiple kinds of measurements made on each individual, as is common in the field of bioinformatics. Despite its name, it also allows model building and evaluation of survival data. Unlike other, more generic modelling frameworks, it has seamless integration with such a data set structure. Firstly, it allows the input of data formats commonly used in the Bioconductor bioinformatics community, such as MultiAssayExperiment and DataFrame. MultiAssayExperiment is a good way to store multiple assays on the same samples and DataFrame is ideal for storing mixed features (i.e. numerical and categorical), such as clinicopathological data, because it also allows metadata about the columns (features) to be stored.
Secondly, use of the modelling functions with one assay or multiple assays follows the same syntax for ease of use. The conversion of the data into the flat table required by a typical modelling function is handled internally. Similarly, a data set with one assay that requires a relatively simple analysis and a data set with multiple assays that requires more complex evaluation of each assay, or of various methods of combining assays (e.g. concatenation, prevalidation), can both be evaluated by ClassifyR in the same way.
Lastly, ClassifyR allows more than just evaluation of test set predictions in its evaluation. A focus of the evaluation is stability and interpretability. Examples of how these are addressed are feature selection stability over repeated cross-validation, and sample-wise error rate calculation after cross-validation, which helps to identify subsets of samples that are difficult to classify and may suggest an interesting subgroup of individuals.
crossValidate and runTests: Two Ways to Perform Cross-Validation
Two functions are provided to enable running of cross-validation. In most cases, crossValidate is recommended. It provides an easier interface with a limited number of options that are specified as simple parameters, whereas runTests expects the user to create S4 parameter set objects using classes such as TrainParams and PredictParams. For example, crossValidate offers only repeat-and-fold and leave-one-out cross-validation whereas runTests additionally offers leave-k-out and Monte Carlo cross-validation. Also, crossValidate is designed for multi-view data set evaluation whereas runTests has limited support for such data, only concatenating different assays into a table. One other key difference is that crossValidate uses a prespecified range of performance tuning values whereas runTests expects tuning parameters to be specified by the user.
So, unless some less widely-used form of cross-validation is desired, crossValidate is the function to use.
One special case is when a discovery data set and a validation data set have been predetermined by a research project. Training is done only on the discovery set and predictions are made only on the validation set. In this case, the train and predict pair of functions can be used, or the runTest function if more control of model building parameter settings is desired.
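For illustration, a minimal sketch of both approaches is shown below. The measurements, outcome, discovery and validation objects are hypothetical placeholders, not objects created earlier in this guide, and the exact argument names should be checked against the package's help pages.
# Repeated cross-validation on a single data set (hypothetical inputs).
result <- crossValidate(measurements, outcome)
# Predetermined discovery and validation sets: train on one, predict on the other.
trainedModel <- train(discoveryData, discoveryOutcome)
predictions <- predict(trainedModel, validationData)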
Data Input Formats
There are a variety of allowed input data formats. Let n denote the number of samples and p denote the number of features of an assay.
Data Type | \(n \times p\) | \(p \times n\) | Only in crossValidate
---|---|---|---
matrix | ✔ | |
data.frame | ✔ | | ✔
DataFrame | ✔ | |
MultiAssayExperiment | | ✔ |
list of tabular data | ✔ | | ✔
So, MultiAssayExperiment is the only data type to require the features to be in rows and the samples to be in columns, whereas all others have the opposite expectation.
For single-cell RNA sequencing data, scFeatures is recommended for transforming the per-cell data into different per-person biological views.
Data Preparation
A convenience function, prepareData, is first run on input data provided to crossValidate or runTests and their counterpart functions for independent training and validation data set modelling. Basically, all prediction functions need non-missing values for test samples, so the default is to remove any feature which has a missing value for any sample, although this can be changed by increasing maxMissingProp from zero. For features which have only a small proportion of missing values, it is recommended to impute replacement values, rather than discarding the feature, before using ClassifyR. A second preparation which could be used is to subset numeric input features to the most variable features as a way of reducing the dimensionality of the input data set. topNvariance is an integer specifying the number of most variable features that will be kept for modelling. By default, no variance-based filtering is done and all features are used in modelling.
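As a hedged sketch, the preparation could also be run directly; the input objects are hypothetical, the thresholds are arbitrary and the call simply mirrors the parameters described above (see ?prepareData for the exact interface).
# Keep features with at most 5% missing values, then retain only the
# 2000 most variable features (hypothetical thresholds).
prepared <- prepareData(measurements, outcome, maxMissingProp = 0.05, topNvariance = 2000)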
The Four Stages of Cross-validation
Data Transformation
This typically is done before cross-validation. The only situation in which it would be done within cross-validation is if some value derived from all of the samples needs to be calculated, while avoiding the use of samples from the test set. For example, \(\log_2\)-scaling the data for a particular sample uses no information from other samples, so it shouldn't be done within cross-validation. However, subtracting each feature's measurements from a location, such as the mean or the median, should be done at this stage by specifying that subtractFromLocation be used.
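For example, assuming measurements is a matrix of non-negative values (a hypothetical object, not one defined in this guide), a purely per-sample transformation can safely be applied once, before any cross-validation:
# log2 scaling of each value uses no information from other samples,
# so it is safe to apply to the whole data set up-front.
measurements <- log2(measurements + 1)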
Feature Selection
This is recommended in most cases. In a typical data set, a small number of features, if any, will be predictive of the outcome of each sample. Providing a large number of uninformative features to most classifiers is widely known to degrade their predictive performance. By default, a t-test ranking and choice of top-p features based on resubstitution error rate is used to select features, but the full set of approaches can be seen at the R command line by running:
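available("selectionMethod")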
## selectionMethod Keyword Description
## 1 none Skip selection procedure and use all input features.
## 2 t-test T-test.
## 3 limma Moderated t-test.
## 4 edgeR edgeR likelihood ratio test.
## 5 Bartlett Bartlett's test for different variance.
## 6 Levene Levene's test for different variance.
## 7 DMD Differences in means/medians and/or deviations.
## 8 likelihoodRatio Likelihood ratio test (normal distribution).
## 9 KS Kolmogorov-Smirnov test for differences in distributions.
## 10 KL Kullback-Leibler divergence between distributions.
## 11 CoxPH Cox proportional hazards Wald test per-feature.
## 12 randomSelection Randomly selects a specified number of features.
Some model training methods perform implicit feature selection. It has also been suggested by experts in the field that explicit feature selection can be harmful for classifiers that can identify complex non-linear relationships between variables, because feature selection methods may not detect those relationships and could discard important features. There's no hard rule that applies to every data set, so it might be worthwhile to try modelling with and without feature selection, as sketched below.
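A hedged sketch of such a comparison using crossValidate; the inputs are hypothetical and the selection keywords come from the table above.
# Default t-test based feature ranking versus using all input features.
selectedResult <- crossValidate(measurements, outcome, selectionMethod = "t-test")
allFeaturesResult <- crossValidate(measurements, outcome, selectionMethod = "none")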
Model Training
This stage involves fitting a model to the training data partition and a variety of models are provided with the package. Apart from multivariate naive Bayes voting and multivariate mixtures of normals voting, the others are wrappers around functionality provided by other R packages.
available("classifier")
## classifier Keyword Description
## 1 randomForest Random forest.
## 2 DLDA Diagonal Linear Discriminant Analysis.
## 3 kNN k Nearest Neighbours.
## 4 GLM Logistic regression.
## 5 ridgeGLM Ridge GLM multinomial regression (alpha = 0).
## 6 elasticNetGLM Elastic net GLM multinomial regression (alpha = 0.5).
## 7 LASSOGLM LASSO GLM multinomial regression (alpha = 1).
## 8 SVM Support Vector Machine.
## 9 NSC Nearest Shrunken Centroids.
## 10 naiveBayes Naive Bayes kernel feature voting classifier.
## 11 mixturesNormals Mixture of normals feature voting classifier.
## 12 CoxPH Cox proportional hazards.
## 13 CoxNet Penalised Cox proportional hazards.
## 14 randomSurvivalForest Random survival forest.
## 15 XGB Extreme gradient booster.
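Any of the keywords above can be given to crossValidate to choose the model type. A minimal sketch with hypothetical inputs follows; the classifier parameter name is assumed from the crossValidate help page.
# Use a Support Vector Machine rather than the default classifier.
SVMresult <- crossValidate(measurements, outcome, classifier = "SVM")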
Cross-validation Varieties
A number of different cross-validation schemes can be chosen. The choice depends on the goals of the study and the computational running time desired. Next, each scheme will be illustrated visually and its characteristics described.
Repeated Resampling and Folding
This is the default approach. It repeatedly permutes the samples (i.e. resamples without replacement) and divides each ordering of the samples into folds. Below is illustrated repeated 5-fold cross-validation.
The benefit of this approach is that each sample is predicted multiple times using slightly different models trained on slightly different training sets, so it gives an indication of sample prediction stability (see Performance Evaluation guide for more details).
Ordinary k-fold cross-validation is effectively the case of repeated cross-validation with the number of repeats being 1.
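In crossValidate, this scheme is controlled by the nFolds and nRepeats parameters; a hedged sketch with hypothetical inputs is:
# 20 repeats of 5-fold cross-validation.
result <- crossValidate(measurements, outcome, nFolds = 5, nRepeats = 20)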
Leave-k-out
All possible combinations of k samples are determined and each combination is used as the test set once with the remainder of samples used as the training set. Below is illustrated leave-2-out cross-validation.
Using crossValidate, only leave-1-out cross-validation is possible, because it is a special case of k-fold cross-validation with k set to the number of samples in the data set (i.e. nFolds = nrow(tabularData)). Cross-validation settings for runTests are specified by creating a CrossValParams object, so any value of k is possible.
If leave-1-out cross-validation is used, each sample appears in the test set once and the stability of predictions can’t be evaluated, although the cross-validation will finish relatively quickly.
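A hedged sketch of the runTests route is shown below; the constructor argument names are assumptions based on the CrossValParams documentation and may differ between package versions (see ?CrossValParams).
# Leave-2-out cross-validation specified as an S4 parameter object for runTests.
LkOparams <- CrossValParams(samplesSplits = "Leave-k-Out", leave = 2)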
Repeated Resampling and Percentage Split
Also called Monte Carlo cross-validation, this scheme repeatedly resamples to get new permutations and partitions the samples so that a fixed percentage forms the test set and the remainder the training set. If \(x \%\) are assigned to the test set each time, the number of times a sample will appear in the test set is a random number but will be approximately \(x \div 100 \times nRepeats\) times. Below is illustrated 30% test set cross-validation.
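A hedged sketch of specifying such a scheme for runTests follows; again, the argument names are assumptions, so consult ?CrossValParams for the installed version's interface.
# 100 permutations, with 30% of the samples forming the test set each time.
MCparams <- CrossValParams(samplesSplits = "Permute Percentage Split",
                           permutations = 100, percentTest = 30)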
Next, the article about Performance Evaluation is recommended reading. Please choose it from the Articles menu above.