Ensemble classifier
Support Vector Machines (SVM), Random Forests (RF), Neural Networks (NN), and Linear and Quadratic Discriminant Analysis (LDA, QDA) are examples of standard techniques that are widely applied in classification problems. The performance of these classifiers is highly variable across problems, so no single standard classifier can be considered the best for all classification settings. In complex situations, such as classification of high dimensional genomic data, a more meaningful approach is to use an ensemble classifier, which combines many standard classification algorithms to develop an improved classifier. The ensemble classifier we use builds a number of individual models on randomly selected subsets of the data, which are then combined or averaged in some meaningful fashion; majority voting is a typical choice. By allowing data-based utilization of a multitude of classification algorithms, such a classifier upholds consistent performance across various types of data and classification problems. In this work, we use the adaptive optimal ensemble classifier developed via bagging and rank aggregation [14]. In this approach, several user-specified classifiers are trained on bootstrap samples drawn from the original data by simple random sampling. Since the sampling is done with replacement, some samples are repeated multiple times while others are left out of the bootstrap sample (known as out-of-bag (OOB) samples). Focusing on prediction performance on the OOB samples, a best classifier is selected based on various performance measures. For example, in a binary classification problem, sensitivity, specificity, and the area under the Receiver Operating Characteristic (ROC) curve are legitimate performance measures. The method is equipped with rank aggregation [15, 16], which provides great flexibility in selecting the optimal classifier with respect to multiple performance measures.
The predicted class for a given test case is the highest voted class, as predicted by the above set of “best" classifiers over all bootstrap resamples. Datta et al. [14] demonstrated the performance of the ensemble classifier in various numerical studies and real applications to gene expression data. Similar concepts have been developed in the regression context [17].
The algorithm described below gives the step-by-step procedure for developing an ensemble classifier [14]. Suppose the dataset of n samples with p-dimensional covariates is of the form \(\{\boldsymbol{X}_{n \times p}, \boldsymbol{Y}_{n \times 1}\}\), where X corresponds to the independent variables and Y represents the dependent categorical variable that specifies the class label. Assume the ensemble classifier is to be built from M classification algorithms, optimizing predictive performance with respect to K different performance measures such as overall accuracy, class sensitivities, etc. We proceed as follows:

1.
Resampling: Draw a bootstrap sample of size n, \(\left \{\boldsymbol {X}^{*}_{n \times p}, \boldsymbol {Y}^{*}_{n \times 1}\right \}\), from the original data \(\{\boldsymbol{X}_{n \times p}, \boldsymbol{Y}_{n \times 1}\}\) by resampling rows with simple random sampling. Sampling is repeated until all classes are represented in the bootstrap sample; the corresponding OOB sample then consists of all samples left out of the bootstrap sample.

2.
Classifier Training: Train the M classification algorithms, \(C_{1},\ldots,C_{M}\), on the bootstrap sample.

3.
Performance Assessment: Obtain M predicted class labels for each OOB case. Since the true classes of the OOB samples are known, calculate the K performance measures for each of the M algorithms using their corresponding predictions.

4.
Rank Aggregation: Rank the M algorithms according to the K performance measures, giving K ordered lists \((L_{1},\ldots,L_{K})\) of size M. These lists are then combined via weighted rank aggregation to determine the overall best algorithm \(C_{(1)}\).
Repeat the above procedure (steps 1–4) B times, where B is a large integer usually selected according to the available computational capacity.

5.
Prediction for a New Sample: Predict the class variable Y for a new sample X using the B prediction models \(C^{1}_{(1)},\ldots,C^{B}_{(1)}\) and take the highest voted class as the final class prediction \(\hat {Y}\).
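The full procedure (steps 1–5) can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the implementation from [14]: step 4's rank aggregation over K measures is collapsed here to a single measure (OOB overall accuracy), and generic scikit-learn classifiers stand in for the component algorithms.

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score

def ensemble_fit(X, y, classifiers, B=300, rng=None):
    """Steps 1-4, simplified: on each bootstrap sample, train every
    candidate classifier and keep the one with the best OOB accuracy."""
    rng = np.random.default_rng(rng)
    n = len(y)
    best_models = []
    for _ in range(B):
        # step 1: resample rows until every class is represented
        while True:
            idx = rng.integers(0, n, size=n)
            if set(y[idx]) == set(y):
                break
        oob = np.setdiff1d(np.arange(n), idx)
        # steps 2-3: train candidates, score them on the OOB cases
        scored = []
        for clf in classifiers:
            m = clone(clf).fit(X[idx], y[idx])
            scored.append((accuracy_score(y[oob], m.predict(X[oob])), m))
        # step 4 (single measure standing in for rank aggregation)
        best_models.append(max(scored, key=lambda t: t[0])[1])
    return best_models

def ensemble_predict(models, X_new):
    """Step 5: majority vote over the B selected models."""
    votes = np.array([m.predict(X_new) for m in models])
    preds = []
    for col in votes.T:
        labels, counts = np.unique(col, return_counts=True)
        preds.append(labels[counts.argmax()])
    return np.array(preds)
```

In practice the candidate list would hold the M chosen component algorithms and B would be large (the analyses below use B=300).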
Rank aggregation
Suppose the performances of M classifiers are evaluated on the basis of K performance measures, and we have ordered lists \(L_{1},\ldots,L_{K}\), where the ith ordered list \(L_{i}\), i=1,…,K, ranks the M algorithms by their performance on the ith measure. The rank aggregation [15, 16] procedure produces a single ranked list of the M classifiers that minimizes the weighted sum of distances from all the individual lists, given by the objective function
$$ \Phi(L)=\sum_{i=1}^{K}w_{i}\,d(L,L_{i}), $$
(1)
where L is any possible ordered list of the M classifiers and the \(w_{i}\) are weights representing the user-specified importance of each of the K performance measures. The classifier in the first position of this aggregated list is the optimal classifier overall with respect to all the validation measures. The default choice is \(w_{i}=1\) for all i, meaning all validation measures are taken as equally important in determining the optimal algorithm; throughout our analyses we have used \(w_{i}=1\). Here d is a distance function, such as Spearman’s footrule or Kendall’s tau, which measures the closeness between two ordered lists. In this work, we use Spearman’s footrule as the distance measure.
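As an illustration of minimizing Eq. (1), a brute-force sketch with Spearman's footrule and equal weights is shown below. It assumes each list orders the same M classifier names best-first, and exhaustive search is feasible only for small M (the cited procedures [15, 16] scale to larger M via stochastic search); function names are hypothetical.

```python
from itertools import permutations

def footrule(a, b):
    """Spearman's footrule: sum of absolute rank differences.
    a, b map item -> rank position (0-based)."""
    return sum(abs(a[x] - b[x]) for x in a)

def aggregate_ranks(lists, weights=None):
    """Return the ordering of classifiers minimizing the weighted sum
    of footrule distances to the given ordered lists (Eq. 1).
    Each list is a sequence of classifier names, best first."""
    items = lists[0]
    weights = weights or [1.0] * len(lists)
    ranks = [{x: i for i, x in enumerate(L)} for L in lists]
    best, best_val = None, float("inf")
    for perm in permutations(items):
        pr = {x: i for i, x in enumerate(perm)}
        val = sum(w * footrule(pr, r) for w, r in zip(weights, ranks))
        if val < best_val:
            best, best_val = perm, val
    return list(best)
```

For example, three lists that mostly agree on one classifier being best will place it first in the aggregate, even if no single input list matches the aggregate exactly.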
Often for high dimensional data, standard classifiers are combined with dimension reduction, variable selection, or penalization techniques, such as Partial Least Squares (PLS), Principal Component Analysis (PCA), Random Forest (RF) based importance measures, \(L_{1}\) regularization, etc., for greater applicability and improved prediction accuracy [18, 19]. For genomic data characterized by high dimensionality, an ensemble classifier developed on such a set of improved component classifiers represents an ideal choice.
Rat liver data
Our data for this study was released by the 2015 CAMDA competition. The microarray and RNA-Seq platforms contain gene expression measurements for nearly 31,000 and 46,000 genes, respectively. The dataset consists of gene expression responses profiled by Affymetrix microarrays and the Illumina RNA-Seq sequencer in liver tissue from 105 male Sprague-Dawley rats exposed to 27 different chemicals representing 9 different MOAs. In the original experiment, the training set was formed from 45 rats treated with 15 chemicals corresponding to the MOAs “PPARA", “CAR/PXR", “AhR", “Cytotoxic", and “DNA damage", plus 18 controls. The test set contains data on 36 rats treated with 12 chemicals corresponding to “PPARA", “CAR/PXR", “ER", and “HMGCOA", plus 6 controls. Two MOAs, “ER" and “HMGCOA", are present only in the test set. We further noticed that, on average, approximately 22,253 expressions per sample in the RNA-Seq data were recorded as “NA", indicating that an insufficient number of reads mapped onto the gene to provide a reliable expression estimate. We retained gene sets of sizes 13,686 and 16,133 for the microarray and RNA-Seq platforms after (i) removing unnamed genes, (ii) removing genes with unobserved expressions, and (iii) averaging multiple expressions reported for genes with the same name.
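The cleaning steps (i)–(iii) might look like the following pandas sketch; the long-format layout and the column names gene, sample, and expr are hypothetical, not taken from the actual data files.

```python
import pandas as pd

def clean_expression(df):
    """df: long-format table with columns 'gene', 'sample', 'expr'.
    (i) drop unnamed genes, (ii) drop genes with any missing ("NA")
    expression, (iii) average duplicate probes per gene name."""
    df = df[df["gene"].notna() & (df["gene"] != "")]
    # (iii) average duplicates for the same gene/sample pair
    df = df.groupby(["gene", "sample"], as_index=False)["expr"].mean()
    wide = df.pivot(index="sample", columns="gene", values="expr")
    # (ii) keep only genes observed in every sample
    return wide.dropna(axis=1)
```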
In this work, we used normalized expression levels: the microarray data were normalized using Robust Multi-Array Average (RMA) expression measurements [20], whereas the RNA-Seq data were already normalized via Magic normalization [6, 21]. For a comparative study, we performed separate analyses with the common set of 8336 genes represented on both platforms and with the complete gene sets.
Concordance experiments
We conducted three types of investigations for studying the performance of the proposed classifiers.

1.
Train classifiers and make predictions on individual platforms.

2.
Train classifiers on one platform to make predictions on the other platform.

3.
Identify important variables (genes) for accurate classification.
In the 1st analysis, we explore the predictability of MOAs using various classifiers developed on the given training data. To our knowledge, there is no established criterion for predicting an unknown class that was not represented in the training data. We therefore formed an adjusted test set by eliminating all test samples belonging to the two classes “ER" and “HMGCOA"; this adjusted test set was used in parts of the 1st and 3rd analyses. However, we also considered the originally given test set as part of the 1st analysis by adopting the following alternative classification approach. First, we relabeled the “ER" and “HMGCOA" samples in the original test set as “OTHER". For each classifier, we then determined the maximum class probability for a given test sample; if that probability was less than 0.5, we set the predicted class to “OTHER", else we kept the originally predicted class. For this purpose, class probabilities for the ensemble classifier were calculated from the predicted class proportions observed over the B bootstrap samples.
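The “OTHER" assignment rule described above can be sketched as follows; the function and argument names are hypothetical, with class probabilities taken to be the vote proportions over the B bootstrap models.

```python
import numpy as np

def predict_with_other(vote_matrix, class_labels, threshold=0.5):
    """vote_matrix: (n_samples, n_classes) array of predicted-class
    proportions over the B bootstrap models.  If the top class
    probability falls below `threshold`, predict "OTHER"."""
    top = vote_matrix.argmax(axis=1)
    top_prob = vote_matrix.max(axis=1)
    preds = np.asarray(class_labels)[top].astype(object)
    preds[top_prob < threshold] = "OTHER"
    return preds
```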
Our objective in the 2nd analysis was to examine the inter-platform concordance between the microarray and RNA-Seq platforms. Thus, we trained classifiers on a selected platform using the full dataset, which included both the given training and test sets, to make predictions on the other platform. Since the classifier needed to operate across both platforms for this analysis, each gene expression measurement was standardized separately within each platform prior to the analysis.
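The per-platform standardization can be sketched as a per-gene z-score; the exact scaling used is an assumption here, and the guard for constant genes is an added illustration detail.

```python
import numpy as np

def standardize_genes(X):
    """Z-score each gene (column) within one platform so that
    cross-platform expression scales become comparable."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    sd[sd == 0] = 1.0  # guard: leave constant genes unscaled
    return (X - mu) / sd
```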
For analyses 1 and 2, we selected an ensemble classifier developed with a set of M=7 standard classifiers: SVM, RF, LDA, PLS+RF, PLS+LDA, PCA+RF, PCA+LDA, and Recursive Partitioning (RPART). The classifiers were selected primarily on prior information about their suitability for high dimensional data classification. Based on the accuracies of the predicted classes, each classifier was ranked on K performance measures (for example, overall accuracy, class-specific accuracies, etc.). Since the selection of performance measures for a multiclass classification problem depends highly on the aim of the study, for the 1st analysis we optimized the overall prediction accuracy and the class-specific accuracy of each group, treating these performance measures as equally important (i.e., equal weights \(w_{i}=1\) in Eq. (1)). In the 2nd, cross-platform analysis, we focused only on the overall accuracy, without optimizing multiple group-specific performances. For these analyses, we chose B=300. We performed 10-fold cross-validation for each individual classifier to select the number of components for the PLS and PCA methods, separately for the two platforms. Assuming performance on bootstrap samples consistent with that on the original training data, we used the same numbers of components to develop the ensemble classifier.
The 3rd analysis, on identifying important variables, is subdivided into the following two parts.

1.
Detecting important genes with the adjusted test set.

2.
Detecting important genes with the full data using the cross-validation method.
To quantify a gene’s impact on the predictability of MOAs in a test set, we applied a classifier to perturbed training data obtained by randomly permuting that gene’s expressions. Each gene was then ranked by the magnitude of the accuracy reduction relative to the true accuracy (on unpermuted data), such that rank 1 corresponds to the gene with the highest negative impact on the overall prediction accuracy. To reduce the computational burden, we did not use the ensemble classifier for this purpose; instead, we used the component classifier PLS+LDA, whose overall accuracy was close to that of the ensemble classifier. We performed this analysis separately for both platforms to determine the set of genes common to the top 20 of each platform.
For Analysis 3.1, we randomly permuted a gene’s expressions in the training set and then made predictions for the (adjusted) test set using the classifier trained on the permuted training data. The permutation procedure was repeated l times for each gene to calculate an average overall prediction accuracy (A). Finally, genes were ordered by A in ascending order. We chose l=30 to achieve a reasonably stable approximation while keeping the computational cost in check.
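The permutation scheme of Analysis 3.1 can be sketched as follows; `fit` is any training routine (the text above uses PLS+LDA), and the function and argument names are hypothetical.

```python
import numpy as np

def permuted_accuracy(fit, X_tr, y_tr, X_te, y_te, gene, l=30, rng=None):
    """Average test-set accuracy (A) after refitting on training data
    with the given gene's (column's) expressions permuted, repeated
    l times.  `fit` is any function (X, y) -> model with .predict."""
    rng = np.random.default_rng(rng)
    accs = []
    for _ in range(l):
        Xp = X_tr.copy()
        Xp[:, gene] = rng.permutation(Xp[:, gene])
        model = fit(Xp, y_tr)
        accs.append(np.mean(model.predict(X_te) == y_te))
    return np.mean(accs)  # lower A => higher impact => better rank
```

Genes would then be sorted by this value in ascending order, so the most influential gene receives rank 1.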
Analysis 3.2 was performed using the full data, containing both the originally given training and test sets. Here we applied 5-fold cross-validation to evaluate the effect of each gene on classifying the MOAs. Our approach consisted of two layers of randomization. For the jth outer randomization, j=1,…,J, we randomly partitioned the dataset into 5 folds and selected 4 folds as the training set, with the remaining fold as the test set. After randomly permuting the expressions of a given gene i across this training set, a classifier was trained and used to predict on the selected test set. Using the same approach as in Analysis 3.1, we obtained an average overall prediction accuracy \((A^{cv}_{i_{j}})\) by repeating the permutation l times. The whole procedure was then repeated over the J random partitions to obtain an average overall prediction accuracy \((A^{cv}_{i})\) for the ith gene:
$$A^{cv}_{i}=\frac{1}{J}\sum\limits_{j=1}^{J}A^{cv}_{i_{j}}. $$
Suppose \(A^{cv}\) is the average true accuracy (on unpermuted data) based on the J random partition sets. Since the magnitude of \(A^{cv}\) can vary, a better measure is the relative accuracy reduction \((R_{i})\), given by
$$R_{i}=\frac{A^{cv}-A^{cv}_{i}}{A^{cv}}, $$
where large values of \(R_{i}\) indicate a high impact on the classification. For Analysis 3.2, we used l=30 and J=100, which stabilized the calculations without being computationally burdensome.
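The two-layer randomization behind \(R_{i}\) can be sketched as follows. This assumes that averaging over all J×5 folds matches averaging the per-partition means (true for equal-sized folds); the names are hypothetical and `fit` again stands for any classifier-training routine.

```python
import numpy as np
from sklearn.model_selection import KFold

def relative_accuracy_reduction(fit, X, y, gene, J=100, l=30, rng=None):
    """R_i = (A_cv - A_cv_i) / A_cv for one gene: compare the
    cross-validated accuracy on intact data with the accuracy after
    permuting the gene's expressions in each training fold
    (l permutations per fold, J random 5-fold partitions)."""
    rng = np.random.default_rng(rng)
    true_accs, perm_accs = [], []
    for j in range(J):
        kf = KFold(n_splits=5, shuffle=True, random_state=j)
        for tr, te in kf.split(X):
            # unpermuted accuracy for this fold
            model = fit(X[tr], y[tr])
            true_accs.append(np.mean(model.predict(X[te]) == y[te]))
            # permuted accuracy, averaged over l permutations
            for _ in range(l):
                Xp = X[tr].copy()
                Xp[:, gene] = rng.permutation(Xp[:, gene])
                m = fit(Xp, y[tr])
                perm_accs.append(np.mean(m.predict(X[te]) == y[te]))
    A_cv, A_cv_i = np.mean(true_accs), np.mean(perm_accs)
    return (A_cv - A_cv_i) / A_cv
```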