Diverse approaches to predicting drug-induced liver injury using gene-expression profiles

Biology Direct

Table 2 Phase I cross-validation results

Algorithm	Accuracy (PC3)	Accuracy (MCF7)	Sensitivity (PC3)	Sensitivity (MCF7)	Specificity (PC3)	Specificity (MCF7)	MCC (PC3)	MCC (MCF7)
Multilayer Perceptron	0.63	0.65	0.69	0.69	0.32	0.35	0.01	0.03
Gradient Boosting	0.67	0.60	0.69	0.67	0.39	0.27	0.04	−0.05
K-nearest Neighbor	0.68	0.64	0.70	0.72	0.50	0.41	0.11	0.12
Logistic Regression	0.70	0.62	0.72	0.68	0.57	0.27	0.20	−0.04
Gaussian Naïve Bayes	0.35	0.35	0.71	0.73	0.32	0.32	0.02	0.03
Random Forest	0.66	0.70	0.69	0.72	0.33	0.54	0.01	0.19
Support Vector Machines	0.68	0.68	1.00	1.00	–	–	–	–
Voting-based Ensemble	0.68	0.67	0.69	0.69	0.44	0.33	0.06	0.01

These results indicate how each classification algorithm performed on the training set after hyperparameter tuning. Overall, the Logistic Regression and Random Forest algorithms performed best, so we selected these for submission to the challenge. The voting-based ensemble never outperformed all of the individual algorithms, yet it never performed worse than all of them either, so we also constructed a submission for the challenge based on this classifier. PC3 and MCF7 are the names of prostate- and breast-cancer cell lines, respectively. Bolded values indicate relatively strong performance for the three algorithms we selected in Phase I. MCC = Matthews Correlation Coefficient. We were unable to calculate specificity or MCC for the Support Vector Machines algorithm because it predicted all cell lines to have the same class label.
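
As an illustration of how the metrics in this table relate to one another, the following minimal sketch computes accuracy, sensitivity, specificity, and MCC from binary labels using the standard confusion-matrix definitions. The source does not state the exact formulas or software used, so these definitions, the function names (`confusion_counts`, `summarize`), and the toy labels are assumptions for illustration only. The sketch also shows why MCC has no value when a classifier assigns every sample the same label: the denominator of the MCC formula becomes zero.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def safe_div(numerator, denominator):
    """Return None instead of raising when a metric is undefined."""
    return numerator / denominator if denominator else None


def summarize(y_true, y_pred):
    """Standard-definition metrics (assumed, not taken from the source)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "sensitivity": safe_div(tp, tp + fn),
        "specificity": safe_div(tn, tn + fp),
        "mcc": safe_div(tp * tn - fp * fn, mcc_denominator),
    }


# Hypothetical toy labels, not data from the study.
mixed = summarize([1, 1, 0, 0], [1, 0, 0, 0])
single_class = summarize([1, 1, 0, 0, 1, 0], [1, 1, 1, 1, 1, 1])
```

In this sketch, `single_class["mcc"]` is `None` because no sample was predicted negative. Which other metrics are undefined in such a case depends on the definitions used, so the dashes in the table should be read in light of the authors' own calculations.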

ISSN: 1745-6150