Inter-platform concordance of gene expression data for the prediction of chemical mode of action

Background It is interesting to study the consistency of outcomes between two genomic platforms, microarray and RNASeq, which are established on fundamentally different technologies. This topic has frequently been discussed from the perspective of comparing differentially expressed genes (DEGs). In this study, we explore the inter-platform concordance between microarray and RNASeq in their ability to classify samples based on genomic information. We use a set of 7 standard multi-class classifiers and an adaptive ensemble classifier developed around them to predict chemical modes of action (MOA) of data profiled by the microarray and RNASeq platforms from rat liver samples exposed to a variety of chemical compounds. We study the concordance between microarray and RNASeq data in various forms, based on the classifiers' performances on the two platforms. Results Using the ensemble classifier we observe improved prediction performance compared to the set of standard classifiers. We discover a clear concordance between each individual classifier's performance on the two genomic platforms. Additionally, we identify a set of important genes that specify the MOAs, by focusing on their impact on the classification, and we find that some of these top genes have direct associations with the presence of toxic compounds in the liver. Conclusion Overall, there appears to be a fair amount of concordance between the two platforms as far as classification is concerned. We observe widely different classification performances among individual classifiers, which reflects the unreliability of restricting to a single classifier in high dimensional classification problems. Reviewers An extended abstract of this research paper was selected for the CAMDA Satellite Meeting to ISMB 2015 by the CAMDA Programme Committee.
The full research paper then underwent two rounds of Open Peer Review overseen by a responsible CAMDA Programme Committee member, Lan Hu, PhD (Bio-Rad Laboratories, Digital Biology Center, Cambridge). Open Peer Review was provided by Yiyi Liu and Partha Dey. The Reviewers' comments section shows the full reviews and author responses.


Background
For more than a decade, microarray technology has provided enormous momentum to modern genomic research. The ability to quantify the expression of thousands of genes simultaneously has led to remarkable achievements in a wide range of biological studies. An abundance of microarray assays has been published worldwide in various databases. However, microarray technology has some limitations, such as the accuracy of expression measurements being limited by hybridization levels and by the variable hybridization properties of probes [1]. RNASeq is a version of next generation sequencing technology that has recently become popular owing to several advances over microarray technology. Notably, RNASeq has a potential advantage over the microarray technique in measuring absolute expression levels [2, 3]. Since the two methods differ fundamentally in their underlying technologies, it is interesting to know whether this disparity results in inconsistent experimental outcomes.
*Correspondence: somnath.datta@ufl.edu. 2 Department of Biostatistics, University of Florida, Gainesville, FL 32603, USA. Full list of author information is available at the end of the article.
Classifiers are among the most widely used statistical tools in genomics-oriented biomedical studies, for example, identifying individuals at risk for a certain disease, such as cancer, based on their genetic profiles. In this work, we explore the concordance between the microarray and RNASeq genomic platforms in the context of classification, based on a set of comparative classification experiments carried out using these two platforms.
In recent years, a number of authors have discussed the agreement between scientific conclusions drawn from the microarray and RNASeq platforms, based on comparative analyses. A common focus of these studies was the concordance of differentially expressed genes (DEGs). A previous study describing a large scale comparison of the microarray and RNASeq platforms based on The Cancer Genome Atlas (TCGA) reported a high correlation between the expression levels produced by the two platforms and suggested reasonable concordance between DEGs when comparing tumors with normal tissues [4]. Another study compared the two platforms using data obtained from a colon cancer study and concluded that RNASeq had an advantage over microarray in detecting DEGs [5]. A recent article provided a comprehensive assessment of the microarray and RNASeq methods, comparing DEGs using gene expressions obtained from a rat liver experiment [6]. They further described the concordance in terms of classification, assessing the predictability of classes defined by the chemical mode of action (MOA) using a set of classifiers trained on the two genomic platforms. Their study revealed weak classification accuracies for the set of classifiers applied to these platforms.
Our work is based on the previously described rat liver data [6]; we primarily focus on developing a common classifier that works reasonably well across platforms, providing better predictability. Next, we discuss the concordance between the microarray and RNASeq platforms in various forms from the perspective of classification. Furthermore, we identify a set of genes important for specifying the classes given by the MOAs, by focusing on their effects on classifier accuracy. We use seven standard classifiers and an adaptive ensemble classifier built around them to achieve these goals. This study is part of the 2015 annual Critical Assessment of Massive Data Analysis (CAMDA) conference challenges. The rat liver experiment was conducted by the FDA SEQC consortium to assess the performance of modern gene transcript expression profiling methods; it is a comparative analysis designed for developing predictive models of the chemical mode of action (MOA).
The remainder of the article is organized as follows. In Section "Results", we provide the results and conclusions of the study. Section "Methods" explains all the underlying procedures applied. The main body of the paper ends with a discussion in Section "Discussion".

Results

Classification in individual platforms
We first describe the outcomes of Analysis 1, which was performed using two basic strategies: the adjusted and the originally given test sets described in Section "Methods". We provide a detailed summary of these results in Tables 1, 2, 3 and 4, where each table presents each classifier's overall prediction accuracy, class-specific sensitivities, and the corresponding specificities. Graphical representations of the summarized results are provided in Figs. 1 and 2.
We first discuss the classification results obtained using the set of genes represented in both platforms. For the adjusted test set, the left panel of Fig. 1 shows that the performance of each classifier is similar in the two platforms, since all the data points lie fairly close to the diagonal line (Pearson's r = 0.92). The accuracy of the individual classifiers varies from 17 to 75%, and, as expected, the performance of the ensemble classifier is the best in both platforms. The overall accuracy of the optimal classification method is slightly better in microarray than in RNASeq (75% vs. 67%). In particular, we observe a lower prediction accuracy for the class "PPARA" in RNASeq (56%) than in microarray (89%). Overall, the class "CAR/PXR", with a maximum sensitivity of only 56%, seems to be the MOA that is hardest to predict. Some individual classifiers show widely different prediction sensitivities for the same class in the two platforms; for example, the sensitivity for "PPARA" with RPART is 100% in microarray, whereas it is as low as 22% in RNASeq.
When the original (i.e., unadjusted) test set is used, we again observe matching performance of the classifiers in the two platforms (Table 2), similar to the case of the adjusted test set; in fact, the agreement is even higher (Pearson's r = 0.94), as shown in the right panel of Fig. 1. The overall accuracy ranges from 12 to 60%, indicating a drop in classification performance compared to the previous scenario (for example, 75% vs. 50% in microarray and 67% vs. 50% in RNASeq for the ensemble classifier). Comparing Tables 1 and 2, we also notice a decline in the sensitivities for predicting three known classes, namely "PPARA", "CAR/PXR", and "Control". Since this analysis was carried out using the alternative approach described in Section "Methods", this decline could result from classifying several samples belonging to the above known classes as "OTHER", the "true" class probability being depressed below 0.5 when these class attributes are close to one another. In this case, a few individual classifiers, such as SVM and RF, outperform the ensemble classifier in terms of overall accuracy. Nevertheless, the ensemble classifier still performs best overall with regard to the full set of performance measures.
Even with the complete set of genes, we observe a similar conformity of classifiers' performance between the two platforms (Fig. 2), as described above. Specifically, for the ensemble classifier, the overall accuracy is identical in the two platforms in each case. According to Tables 3 and 4, the overall accuracy ranges from 8 to 67% and from 10 to 55% for the adjusted and original test sets, respectively. Even though we used bigger gene sets, there is no additional improvement in predicting the MOAs; indeed, the performance gets worse, which is quite evident for the adjusted test set. However, some classifiers surprisingly maintain equal performance with both sets of genes. For example, RPART shows identical performance in the microarray platform under the bigger and smaller sets of genes.

Classification in cross platforms
Results of the 2nd analysis, namely classification across platforms, are summarized in Table 5 and Fig. 3. We performed this study using only the common set of genes, since both platforms are involved together throughout the analysis. Compared to all the classifications discussed in Analysis 1, this result shows even greater agreement between the prediction accuracies of the classifiers, trained on the full data of one platform and used to predict classes on the full data of the other platform (Pearson's r = 0.99). Remarkably, the ensemble classifier provided 100% accurate predictions in both directions, despite the additional complexity introduced by having 8 classes. The component classifier PLS+LDA performed similarly to the ensemble classifier in both cases, also yielding 100% accurate class predictions. Apart from these two classifiers, SVM, RF, and PLS+RF also attained substantially high prediction accuracies. Examining the outcomes of Analyses 1 and 2 (Tables 1, 2, 3, 4 and 5), we clearly notice that, of the two dimension reduction methods, PLS performs far better than PCA throughout this study. The performances of the classifiers integrated with PCA are clearly the weakest among all individual classifiers in every scenario.

Importance of genes
We summarize the results of the 3rd analysis in Tables 6, 7, 8 and 9, where each table lists the top 20 important gene names and the overall accuracy obtained by cross-validation. As described in the Methods section, this analysis was performed in two experiments: (i) using the adjusted test set and (ii) using the full dataset. Furthermore, we used the common and complete sets of genes as additional sub-analyses within the above primary experiments.
Referring to Table 6, we observe that five of the ten most important genes for classification (Cyp1a1, Fam111a, Ugt2b, Akr1b8, and Hbb) are in common between the two platforms when the adjusted test set is used with the common set of genes. From a literature search, we found that Cyp1a1 encodes a member of the cytochrome P450 superfamily of enzymes, which catalyze many reactions involved in drug metabolism [7]. Likewise, Ugt2b belongs to a large family of proteins capable of detoxifying a wide variety of both endogenous and exogenous substrates, such as biogenic amines, steroids, bile acids, phenolic compounds, and various other pharmacologically relevant compounds, including numerous carcinogens, toxic environmental pollutants, and prescription drugs [8]. The function of Akr1b8 is implicated in the pathogenesis of diabetic complications [9]. Mutations in Hbb have been implicated in a number of blood disorders [10], while mutations of Fam111a are strongly associated with type 2 Kenny-Caffey syndrome [11]. Table 7 presents the top 20 genes detected from the complete gene sets of the two platforms. We notice that 6 genes (Fam111a, Cyp1a1, Hbb, Aldh1a7, Psat1, and Obp3) for microarray and 5 genes (Fam111a, Hbb, Cyp1a1, Ugt2b, and Dhrs7) for RNASeq are in common with the top 20 of the previous analysis (Table 6).
Although the main goal of detecting important genes with the full data (Analysis 3.2) was to identify sets of genes making a considerable impact on classifying all eight MOAs, interestingly, the outcome of this study (Tables 8 and 9) reveals high average (unpermuted) prediction accuracies (close to 100%) for both platforms under the 5-fold cross-validation technique. Tables 8 and 9 list the top genes ranked by the relative reduction in accuracy (R) for microarray and RNASeq, respectively. Clearly, no single gene makes a substantial contribution to the accuracy. However, we identified two genes (Cyp1a1, Abcc3) that are present in both lists when the common set of genes was used; based on the same analysis performed with the complete sets of genes, we observe only one gene, Id1, that is important for both platforms. Abcc3 is a member of the superfamily of ATP-binding cassette (ABC) transporters, which is involved in multidrug resistance [12]. The Id1 gene plays a crucial role in activating hepatic stellate cells (HSCs) in response to liver damage [13].

Methods

The performances of standard classifiers are highly variable across problems; thus, no single standard classifier can be considered the best for all classification settings. In complex situations, such as the classification of high dimensional genomic data, a more meaningful approach is to use an ensemble classifier, which combines many standard classification algorithms to develop an improved classifier. The ensemble classifier we use builds a number of individual models on randomly selected subsets of the data, which can then be combined or averaged in some meaningful fashion; majority voting is a popular choice. By allowing data-based utilization of a multitude of classification algorithms, such a classifier upholds consistent performance on various types of data and classification problems.
In this work, we use the adaptive optimal ensemble classifier developed via bagging and rank aggregation [14]. In this approach, several user-specified classifiers are trained on bootstrap samples drawn from the original data by simple random sampling. Since the sampling is done with replacement, some samples are repeated multiple times while others are left out of the bootstrap sample (the out-of-bag (OOB) samples). Focusing on prediction performance on the OOB samples, a best classifier is selected on the basis of various performance measures. For example, in a binary classification problem, sensitivity, specificity, and the area under the Receiver Operating Characteristic (ROC) curve are legitimate performance measures. The method is equipped with rank aggregation [15, 16], which provides great flexibility in selecting the optimal classifier with respect to multiple performance measures. The predicted class for a given test sample is the highest-voted class, as predicted by the set of "best" classifiers over all bootstrap resamples. Datta et al. [14] demonstrated the performance of the ensemble classifier using various numerical studies and real applications to gene expression data; similar concepts have been developed in the context of regression [17]. The algorithm below describes the step-by-step procedure for developing the ensemble classifier [14]. Suppose the dataset consists of n samples with p-dimensional covariates, in the form {X_{n×p}, Y_{n×1}}, where X corresponds to the independent variables and Y represents the dependent categorical variable specifying the class label. Assume the ensemble classifier is to be built from M classification algorithms.

1. Draw a bootstrap sample of size n from the original data, and note the corresponding out-of-bag (OOB) samples.
2. Train each of the M classification algorithms on the bootstrap sample.
3. Evaluate each trained algorithm on the OOB samples with respect to the K performance measures, producing K ordered lists (L_1, . . . , L_K) of size M.
4. Rank-aggregate these lists using weighted rank aggregation to determine the best algorithm C^(1) overall.

Ensemble classifier
Repeat the above procedure (steps 1-4) B times, where B is a large integer usually chosen according to the available computational capacity.

5. Prediction for a new sample: predict the class variable Y for a new sample X using the B prediction models C_1^(1), . . . , C_B^(1), and take the highest-voted class as the final class prediction Ŷ.
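As a concrete illustration, the bagging-and-selection procedure above can be sketched in a few lines of Python. This is a toy sketch, not the authors' implementation: the two stand-in component classifiers (nearest centroid and 1-nearest-neighbour) and the single OOB accuracy measure are hypothetical simplifications of the paper's seven classifiers and K rank-aggregated performance measures.

```python
import random
from collections import Counter

# Toy component classifiers (hypothetical stand-ins for SVM, RF, PLS+LDA, ...);
# each maps (train_X, train_y, test_X) -> list of predicted labels.
def nearest_centroid(train_X, train_y, test_X):
    cents = {}
    for lab in sorted(set(train_y)):
        pts = [x for x, yy in zip(train_X, train_y) if yy == lab]
        cents[lab] = [sum(c) / len(pts) for c in zip(*pts)]
    d = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    return [min(cents, key=lambda l: d(x, cents[l])) for x in test_X]

def one_nn(train_X, train_y, test_X):
    d = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    return [train_y[min(range(len(train_X)), key=lambda i: d(x, train_X[i]))]
            for x in test_X]

def ensemble_predict(X, y, test_X, classifiers, B=25, seed=1):
    n = len(X)
    rng = random.Random(seed)
    votes = [Counter() for _ in test_X]
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]        # step 1: bootstrap sample
        oob = [i for i in range(n) if i not in set(idx)]  # out-of-bag samples
        if not oob:
            continue
        bX, bY = [X[i] for i in idx], [y[i] for i in idx]
        # steps 2-4: train every algorithm on the bootstrap sample and pick the
        # best by OOB accuracy (one measure here instead of K aggregated ones).
        def oob_acc(clf):
            pred = clf(bX, bY, [X[i] for i in oob])
            return sum(p == y[i] for p, i in zip(pred, oob)) / len(oob)
        best = max(classifiers, key=oob_acc)
        for j, p in enumerate(best(bX, bY, test_X)):
            votes[j][p] += 1
    # step 5: majority vote over the B selected classifiers
    return [v.most_common(1)[0][0] for v in votes]
```

For two well-separated toy classes, `ensemble_predict` recovers the correct labels regardless of which component classifier wins each bootstrap round.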

Rank aggregation
Suppose the performances of M classifiers are evaluated on the basis of K performance measures. Assume we have ordered lists L_1, . . . , L_K, where the ith ordered list L_i, i = 1, . . . , K, ranks the M algorithms by their performance on the ith measure. The rank aggregation procedure [15, 16] finds the single ranked list of the M classifiers that minimizes the weighted sum of distances from all the individual lists, given by the objective function

Φ(L) = Σ_{i=1}^{K} w_i d(L, L_i),   (1)

where L is any possible ordered list of the M classifiers, the w_i's are weights representing the user-specified importance of the individual performance measures [18, 19], and d(·, ·) is a distance between two ordered lists, such as Spearman's footrule. For genomic data characterized by high dimensionality, an ensemble classifier developed on such a set of improved component classifiers represents an ideal choice.
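Eq. (1) can be made concrete with a brute-force minimizer. The sketch below is an illustration rather than the authors' code: it uses Spearman's footrule as the distance d and searches all M! orderings, which is feasible only because M, the number of classifiers, is small.

```python
from itertools import permutations

def footrule(l1, l2):
    # Spearman's footrule: sum over items of |rank in l1 - rank in l2|.
    pos2 = {item: r for r, item in enumerate(l2)}
    return sum(abs(r - pos2[item]) for r, item in enumerate(l1))

def rank_aggregate(lists, weights):
    # Minimize Phi(L) = sum_i w_i * d(L, L_i) over all candidate orderings L.
    items = lists[0]
    def phi(cand):
        return sum(w * footrule(cand, l) for w, l in zip(weights, lists))
    return list(min(permutations(items), key=phi))
```

With equal weights (w_i = 1, as used in Analysis 1), three ordered lists of three classifiers aggregate to the ordering closest to all of them.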

Rat liver data
Our data for this study were released by the 2015 CAMDA competition. The microarray and RNASeq platforms contain gene expression measurements for nearly 31,000 and 46,000 genes, respectively. The dataset consists of gene expression responses profiled by Affymetrix microarrays and an Illumina RNASeq sequencer in liver tissue from 105 male Sprague-Dawley rats exposed to 27 different chemicals representing 9 different MOAs. In the original experiment, a training set was formed with 45 rats treated with 15 chemicals corresponding to the MOAs "PPARA", "CAR/PXR", "AhR", "Cytotoxic", and "DNA damage", plus 18 controls. The test set contains data on 36 rats treated with 12 chemicals corresponding to "PPARA", "CAR/PXR", "ER", and "HMGCOA", plus 6 controls. We found that two MOAs, "ER" and "HMGCOA", are present only in the test set. We further noticed that, on average, approximately 22,253 expression values per sample in the RNASeq data were recorded as "NA", indicating that an insufficient number of reads mapped onto the gene to provide a reliable expression estimate. We retained gene sets of sizes 13,686 and 16,133 for the microarray and RNASeq platforms, respectively, after (i) removing unnamed genes, (ii) removing genes with unobserved expressions, and (iii) averaging multiple expression values reported for the same gene name.
In this work, we used normalized expression levels: the microarray data were normalized using Robust Multi-Array Average (RMA) expression measures [20], whereas the RNASeq data were already normalized via Magic normalization [6, 21]. We decided it would be reasonable to perform separate analyses with the common set of genes (8336) represented in both platforms and with the complete sets of genes, for a comparative study.
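The preprocessing described above (dropping unnamed genes, dropping genes with missing values, averaging duplicate rows for the same gene name, and intersecting the two platforms' gene sets) can be sketched as follows. The record format and the `None`-for-"NA" convention are assumptions for illustration, not the authors' data format.

```python
def clean_platform(records):
    """records: list of (gene_name, [expression per sample]) rows from one
    platform. Drops unnamed genes and genes with missing ("NA") values, then
    averages multiple rows sharing a gene name."""
    by_gene = {}
    for name, vals in records:
        if not name or any(v is None for v in vals):  # unnamed or has "NA"
            continue
        by_gene.setdefault(name, []).append(vals)
    # average duplicate rows with the same gene name, position by position
    return {g: [sum(col) / len(rows) for col in zip(*rows)]
            for g, rows in by_gene.items()}

def common_genes(platform_a, platform_b):
    # genes represented in both platforms (the "common set of genes")
    return sorted(set(platform_a) & set(platform_b))
```

Applying `clean_platform` to each platform and then `common_genes` to the two results yields the shared gene list used in the comparative analyses.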

Concordance experiments
We conducted three types of investigations to study the performance of the proposed classifiers.
1. Train classifiers and make predictions on the individual platforms. 2. Train classifiers on one platform to make predictions on the other platform. 3. Identify important variables (genes) for accurate classification.
In the 1st analysis, we explore the predictability of the MOAs using various classifiers developed on the given training data. To our knowledge, there is no established criterion for defining a prediction for an unknown class that is not represented in the training data. Thus, we selected an adjusted test set by eliminating all test samples belonging to the two classes "ER" and "HMGCOA"; this new test set was used in parts of the 1st and 3rd analyses. However, we also considered the originally given test set as part of the 1st analysis, adopting the following alternative classification approach. First, we relabeled the "ER" and "HMGCOA" samples in the original test set as "OTHER". For each classifier, we then determined the maximum class probability for a given test sample; if this probability was less than 0.5, we set the predicted class to "OTHER", and otherwise kept the originally predicted class. For this purpose, class probabilities for the ensemble classifier were calculated as the predicted class proportions observed over the B bootstrap samples.
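The "OTHER"-class rule can be written down directly. The sketch below is a minimal illustration of the thresholding step, with class probabilities for the ensemble taken as vote proportions over the B bootstrap predictions, as described above; the function names are ours.

```python
from collections import Counter

def ensemble_class_probs(bootstrap_preds):
    # Vote proportions across the B bootstrap predictions for one test sample.
    counts = Counter(bootstrap_preds)
    total = sum(counts.values())
    return {lab: c / total for lab, c in counts.items()}

def predict_with_other(class_probs, threshold=0.5):
    # Keep the top class if its probability reaches the threshold;
    # otherwise declare the sample "OTHER" (an unseen MOA).
    best = max(class_probs, key=class_probs.get)
    return best if class_probs[best] >= threshold else "OTHER"
```

A sample whose votes are split across several MOAs thus falls below the 0.5 cutoff and is labelled "OTHER", while a confidently predicted sample keeps its class.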
Our objective in the 2nd analysis was to examine the inter-platform concordance between the microarray and RNASeq platforms. Thus, we trained classifiers on one platform using the full dataset, comprising both the given training and test sets, to make predictions on the other platform. However, since the classifier needed to operate on both platforms, each gene expression measurement was standardized separately within each platform prior to the analysis. For Analyses 1 and 2, we used an ensemble classifier developed with a set of M = 7 standard classifiers: SVM, RF, LDA, PLS+RF, PLS+LDA, PCA+RF, PCA+LDA, and Recursive Partitioning (RPART). The classifiers were selected primarily on the basis of prior knowledge of their suitability for high dimensional data classification. Based on the accuracies of the predicted classes, each classifier was ranked on K performance measures (for example, overall accuracy, class-specific accuracies, etc.). Since the selection of performance measures for a multi-class classification problem depends highly on the aim of the study, for the 1st analysis we optimized the overall prediction accuracy and the class-specific accuracy of each group, treating these performance measures as equally important (i.e., we used equal weights of w_i = 1 in Eq. (1)); in the 2nd analysis, across platforms, we focused only on the overall accuracy, without optimizing multiple group-specific performances. For these analyses, we chose B = 300. We performed 10-fold cross-validation for each individual classifier to select the number of components for the PLS and PCA methods, separately for the two platforms. Assuming performance on bootstrap samples consistent with that on the original training data, we used the same numbers of components to develop the ensemble classifier.
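The per-platform standardization used before the cross-platform analysis is a column-wise z-score over the samples of one platform. The sketch below is an illustration under our own assumptions (samples × genes matrix; sample standard deviation with denominator n − 1), not the authors' code.

```python
import math

def standardize_genes(matrix):
    """matrix: samples x genes (list of lists). Standardizes each gene
    (column) to mean 0 and sd 1 within a platform, so that a classifier
    trained on one platform can be applied to the other."""
    cols = list(zip(*matrix))
    out = []
    for col in cols:
        m = sum(col) / len(col)
        sd = math.sqrt(sum((v - m) ** 2 for v in col) / (len(col) - 1))
        out.append([(v - m) / sd for v in col])
    return [list(row) for row in zip(*out)]
```

After this step, a gene's expression values are on the same scale in both platforms regardless of the platforms' different measurement units.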
The 3rd analysis, on identifying important variables, is subdivided into the following two parts.
1. Detecting important genes with the adjusted test set. 2. Detecting important genes with the full data using cross-validation.
We applied a classifier to perturbed training data, obtained by randomly permuting the expressions of a given gene, to quantify that gene's impact on the predictability of the MOAs in a test set. Each gene was then ranked by the magnitude of the accuracy reduction relative to the true accuracy (on unpermuted data), such that rank 1 corresponds to the gene with the highest negative impact on the overall prediction accuracy. To reduce the computational burden, we did not use the ensemble classifier for this purpose; instead, we used the component classifier PLS+LDA, whose overall accuracy was close to that of the ensemble classifier. We performed these analyses separately for the two platforms to determine the set of genes appearing among the top 20 in both platforms.
For Analysis 3.1, we randomly permuted a gene's expressions in the training set and then made predictions for the (adjusted) test set using the classifier trained on the permuted training data. The permutation procedure was repeated l times for each gene to calculate an average overall prediction accuracy (A). Finally, genes were ordered by A in ascending order. We chose l = 30 to achieve a reasonably stable approximation while keeping the computational cost in check.
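Analysis 3.1 can be sketched as a generic permutation-importance loop. This is an illustration under our own assumptions, not the authors' code; `fit_predict` stands in for the PLS+LDA classifier and maps (train_X, train_y, test_X) to predicted labels.

```python
import random

def permutation_importance(train_X, train_y, test_X, test_y,
                           fit_predict, l=30, seed=0):
    """Average test accuracy per gene after permuting that gene's training
    expressions l times. Returns (baseline accuracy, per-gene average
    accuracies, gene indices ordered so that rank 1 = biggest drop)."""
    rng = random.Random(seed)
    def acc(pred):
        return sum(p == t for p, t in zip(pred, test_y)) / len(test_y)
    base = acc(fit_predict(train_X, train_y, test_X))   # unpermuted accuracy
    n_genes = len(train_X[0])
    avg_acc = []
    for g in range(n_genes):
        total = 0.0
        for _ in range(l):
            col = [row[g] for row in train_X]
            rng.shuffle(col)                             # permute gene g
            perm_X = [row[:g] + [v] + row[g + 1:]
                      for row, v in zip(train_X, col)]
            total += acc(fit_predict(perm_X, train_y, test_X))
        avg_acc.append(total / l)
    order = sorted(range(n_genes), key=lambda g: avg_acc[g])  # ascending A
    return base, avg_acc, order
```

Permuting an uninformative (here, constant) gene leaves the accuracy at the baseline, while permuting the gene that carries the class signal pushes it down.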
Analysis 3.2 was performed using the full data, containing both the originally given training and test sets. Here we applied a 5-fold cross-validation technique to evaluate the effect of each gene on classifying the MOAs. Our approach consisted of two layers of randomization. For the jth (j = 1, . . . , J) outer randomization, we randomly partitioned the dataset into 5 folds, selected 4 folds as the training set, and kept the remaining fold as the test set. After randomly permuting the expressions of a given gene i across this training set, a classifier was trained and used to predict on the selected test set. Using the same approach as in Analysis 3.1, we obtained an average overall prediction accuracy A_ij^cv by repeating the permutation l times. The whole procedure was then repeated over the J random partitions to obtain an average overall prediction accuracy A_i^cv for the ith gene, based on all J scenarios.
Suppose A^cv is the corresponding average true accuracy (on unpermuted data) based on the J random partition sets. Since the magnitude of A^cv can vary, a better measure is the relative accuracy reduction for the ith gene,

R_i = (A^cv − A_i^cv) / A^cv,

where large values of R_i indicate a high impact on the classification. For Analysis 3.2, we used l = 30 and J = 100, which stabilized the calculations without being computationally burdensome.
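The two-layer randomization of Analysis 3.2 (J outer random partitions, l inner permutations per gene) can be sketched as follows; the fold construction and the `fit_predict` interface are illustrative assumptions on our part. Dividing the accuracy drop A^cv − A_i^cv by the unpermuted baseline A^cv then gives the relative reduction R_i.

```python
import random

def cv_permutation_accuracy(X, y, fit_predict, gene, l=3, J=4, k=5, seed=0):
    """Average accuracy A_i^cv for one gene: over J random k-fold
    partitions, permute the gene's expressions in the training folds
    l times each and average the held-out-fold accuracy."""
    rng = random.Random(seed)
    n = len(X)
    total, count = 0.0, 0
    for _ in range(J):                      # outer randomization
        idx = list(range(n))
        rng.shuffle(idx)
        test = idx[0::k]                    # one of the k folds held out
        train = [i for i in idx if i not in set(test)]
        for _ in range(l):                  # inner permutations
            col = [X[i][gene] for i in train]
            rng.shuffle(col)                # permute gene i in training folds
            trX = [X[i][:gene] + [v] + X[i][gene + 1:]
                   for i, v in zip(train, col)]
            trY = [y[i] for i in train]
            pred = fit_predict(trX, trY, [X[i] for i in test])
            total += sum(p == yy for p, yy in
                         zip(pred, [y[i] for i in test])) / len(test)
            count += 1
    return total / count
```

With a classifier that ignores the permuted gene entirely, the cross-validated accuracy stays at its baseline, confirming that only genes the classifier actually uses can produce a positive R_i.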

Discussion
In this study, we used an ensemble classifier built on a set of standard classifiers to predict the MOAs in rat liver experiment data profiled by both the microarray and RNASeq platforms. The ensemble classifier performed reasonably well on each platform individually. Using the adjusted test set and the set of genes present in both platforms, we observed comparable overall predictability of the MOAs in the two platforms, with 75% and 67% accuracies for microarray and RNASeq, respectively. Similarly, we observed well-matched accuracies of 50% on both platforms for the full test sets, based on the alternative approach. An earlier classification study [6] on the same data reported average overall accuracies of 58% and 61% for microarray and RNASeq, suggesting slightly better predictability in RNASeq. However, the outcomes of the two studies are not directly comparable, owing to differences in the training and test sets used; for example, we considered the controls as another class, whereas their analysis did not treat the controls as a separate class. Interestingly, once we trained classifiers to make predictions across platforms, the ensemble classifier provided 100% accurate predictions for all 8 classes present in the whole experiment. This result exhibits perfect cross-platform concordance for the purpose of classification. Our study also clearly demonstrates a high agreement between the individual classifiers' performances on the two genomic platforms. Except for a few scenarios, the ensemble classifier performed best in all experiments with respect to the overall accuracy and the other class-specific measures. We observed widely different classification performances among the standard classifiers, which reflects the unreliability of restricting to a single classifier in high dimensional classification problems.
On the other hand, this also demonstrates the utility of the adaptive ensemble classifier, which is expected to perform as well as or better than the individual classifiers with respect to multiple performance measures.

Conclusion
In this study, we explored the inter-platform concordance between microarray and RNASeq in their ability to classify samples based on genomic information, using data from a rat liver experiment. We used an ensemble classifier built on a set of seven standard classifiers to predict the MOAs. The ensemble classifier performed reasonably well on both platforms individually, yielding 75% and 67% accuracies for microarray and RNASeq, respectively, on a selected test set. When we trained classifiers to make predictions across platforms, the ensemble classifier provided remarkable 100% accurate predictions. This study demonstrates a high agreement between the individual classifiers' performances on the two genomic platforms. Additionally, we identified a set of important genes that specify the MOAs, by focusing on their impact on the classification.

Reviewers' comments
Reviewer's report 1: Yiyi Liu (yiyi.liu@yale.edu), Yale University In this manuscript, the authors investigated concordance between microarray and RNA-seq in classifying samples based on gene expression profiles. They tested the performances of eight classifiers, including one ensemble method, and obtained very interesting results. Overall the reviewer is positive about the work. There are several minor concerns that the authors need to address.

1. I suggest the authors add descriptions of the weights (w_i's) they used in the rank aggregation of the ensemble classifier. The authors explained the main idea of the aggregation method, but explicitly stating all the parameters could improve the readability of the paper.
2. The authors mentioned RNA-seq data are "normalized via the Magic normalization". I suggest citing the normalization method paper for reference.