Prediction and mechanistic analysis of drug-induced liver injury (DILI) based on chemical structure

Background Drug-induced liver injury (DILI) is a major safety concern characterized by a complex and diverse pathogenesis. In order to identify DILI early in drug development, a better understanding of the injury and models with better predictivity are urgently needed. One approach in this regard are in silico models which aim at predicting the risk of DILI based on the compound structure. However, these models do not yet show sufficient predictive performance or interpretability to be useful for decision making by themselves, the former partially stemming from the underlying problem of labeling the in vivo DILI risk of compounds in a meaningful way for generating machine learning models. Results As part of the Critical Assessment of Massive Data Analysis (CAMDA) “CMap Drug Safety Challenge” 2019 (http://camda2019.bioinf.jku.at), chemical structure-based models were generated using the binarized DILIrank annotations. Support Vector Machine (SVM) and Random Forest (RF) classifiers showed comparable performance to previously published models with a mean balanced accuracy over models generated using 5-fold LOCO-CV inside a 10-fold training scheme of 0.759 ± 0.027 when predicting an external test set. In the models which used predicted protein targets as compound descriptors, we identified the most information-rich proteins which agreed with the mechanisms of action and toxicity of nonsteroidal anti-inflammatory drugs (NSAIDs), one of the most important drug classes causing DILI, stress response via TP53 and biotransformation. In addition, we identified multiple proteins involved in xenobiotic metabolism which could be novel DILI-related off-targets, such as CLK1 and DYRK2. Moreover, we derived potential structural alerts for DILI with high precision, including furan and hydrazine derivatives; however, all derived alerts were present in approved drugs and were over specific indicating the need to consider quantitative variables such as dose. Conclusion Using chemical structure-based descriptors such as structural fingerprints and predicted protein targets, DILI prediction models were built with a predictive performance comparable to previous literature. In addition, we derived insights on proteins and pathways statistically (and potentially causally) linked to DILI from these models and inferred new structural alerts related to this adverse endpoint. Supplementary Information The online version contains supplementary material available at 10.1186/s13062-020-00285-0.


Conclusion:
Using chemical structure-based descriptors such as structural fingerprints and predicted protein targets, DILI prediction models were built with a predictive performance comparable to previous literature. In addition, we derived insights on proteins and pathways statistically (and potentially causally) linked to DILI from these models and inferred new structural alerts related to this adverse endpoint.

Background
Drug-induced liver injury (DILI) is a major safety concern and one of the leading causes of drug failure in clinical drug development and market withdrawal, which can be found across nearly all classes of medication (1). DILI may occur either as a hepatitis or cholestatic injury or a mixed form of both and can be further distinguished between intrinsic and idiosyncratic DILI (1). If a drug is hepatotoxic in a dose-dependent manner both in preclinical models and humans (e.g. acetaminophen) it is considered to cause intrinsic DILI. Idiosyncratic DILI, on the other hand, is characterized by the lack of a clear dose-dependency and its rarity (usually less than 1 of 10,000 treated patients develops DILI symptoms). In contrast to intrinsic DILI, idiosyncratic DILI is the result of a patient's rare combination of genetic and non-genetic risk factors, which is responsible for their susceptibility towards the drug (2). Consequently, in most cases idiosyncratic DILI cannot be detected in preclinical studies (3). The idiosyncratic nature of DILI also impedes its prediction with quantitative structure-activity relationship (QSAR) models, as idiosyncrasy implies that the underlying cause lies beyond inherent compound properties. Due to the low incidence of DILI, revealing causal links between the use of a drug and an observed liver injury is a difficult task (4), which decreases the confidence in provided DILI labels and further complicates the building of QSAR models with high predictivity.
The limited capability of animal models to detect hepatotoxic compounds raises the need for alternative testing strategies including in vitro and in silico models, as well as a better understanding of the underlying biology. Major challenges associated with the prediction of DILI using in vitro approaches lie in identifying relevant assays (5) and extrapolating from assay concentrations to in vivo blood concentrations associated with a hepatotoxic risk (6). Numerous in silico models have been generated based on molecular structure (7)(8)(9)(10)(11)(12) and in vitro readouts, such as bioactivity (13) and gene expression (14) in cell culture, which are able to predict DILI better than random, but with a performance not yet sufficient for decision making in practice.
In the case of computational predictions, DILI is often simplified to a classification problem, i.e. separating compounds with or without this annotation in a data set (7)(8)(9)11,13). These labels, however, do not provide information on important factors such as dose-dependency or affected patient population, and consequently the practical applicability of such models is limited. While more detailed information on quantitative compound toxicity is difficult to retrieve, the strength of evidence for DILI is often provided in the datasets available. Paying attention to the quality of the data used for model generation has previously been shown to be relevant; for example, Kotsampasakou et al. (2017) (9) demonstrated that better models can be derived with smaller, but higher quality datasets.
The present work is derived from participation in the Critical Assessment of Massive Data Analysis (CAMDA) 'CMap Drug Safety Challenge' 2019 (http://papers.camda.info/) where the aim was to develop more predictive models for DILI from various different descriptor spaces. In this study, we retrieved compound hepatotoxicity annotations from the DILIrank (15) and SIDER (16) databases which were used as labels to generate compound-based DILI classifiers. The annotations in DILIrank were assigned by considering DILI-related market withdrawals and warnings in drug labels in combination with assessing causal links between the use of the drug and the occurrence of DILI. The drug is annotated as 'DILI positive' in two different severity classes ('vMost-DILI concern' and 'vLess-DILI concern') only if casual links to DILI could be confirmed. Drugs with existing concern but lack of causal proof were annotated as 'Ambiguous DILI concern', whereas drugs without concern were annotated as 'vNo-DILI Concern'. The task set by the CAMDA challenge was to predict the labels of 55 drugs, which were previously annotated as 'Ambiguous DILI concern' and recently re-classified by the FDA. To this end, multiple descriptors were derived of which were used to build classification models for DILI: chemical fingerprints (17) describing the 2D compound structure, as well as Mordred molecular descriptors (18), and predicted protein targets inferred from chemical structure with PIDGIN (19-21). The predictivity of the resulting models was evaluated using two different external test sets.
In addition to predictive performance, we also focused on two practically relevant aspects of DILI prediction, namely the ability of models to extrapolate in chemical space, as well as the interpretation of relevant molecular and biological factors underlying DILI since interpretable models are more trusted, for example by regulatory agencies (22). To gain insights into biological processes, the protein targets with significantly higher binding probability in DILI compounds and the highest information for DILI classification were extracted from the protein target-based machine learning models. Based on these, we identified biological processes associated with DILI labels in the current dataset using the Reactome pathway maps (23) to show that mechanistic understanding of the biology underlying DILI can be obtained from this chemical structure-based feature space.
From the purely chemical side, we derived interpretable structural alerts related to DILI with the Molecular Substructure miner algorithm (MoSS) implementation of graph-based Molecular Fragment miner algorithm (MoFa) (24) and the fragment-based SARpy package (25), which could guide lead optimization to reduce the risk of DILI, as is currently standard practice for other toxicities (26). We then compared the quality of the derived structural alerts against the recent review of DILI related structural alerts by Liu et al.

Predictive modeling
We first compared the performance of Support Vector Machine (SVM) and Random Forest (RF) models trained using different input descriptors to predict DILI positive compounds.
A significant drop in performance is seen for the majority of models on the FDA validation set with a balanced accuracy of below 0.6 ( Fig. 1), which indicated that the models were less capable of generalizing to the FDA validation set than to the external test set. These findings were also observed for models trained using Mordred molecular descriptors (18) and protein target descriptors (19,20) indicating the limited generalization of models occurred irrespective of descriptor space. The best performing model across the most metrics was the SVM model trained using the DILIrank (-vLessConcern) dataset which utilized a linear kernel, a C parameter of 0.1, and a 'one vs. rest' decision function. This model achieved a mean balanced accuracy of 0.759 ± 0.03 and 0.655 ± 0.00 on the external test set and FDA validation set, respectively, thus demonstrating relatively high predictive power across the two independent test sets compared to all other models generated ( Fig. 1 and Table S1).
We next investigated the relationship between compounds' Tanimoto similarity to 5 nearest neighbors in the training set and classification performance for the external test set (Fig. 2a). This was achieved by generating an SVM model (with the same hyperparameters as the best model noted previously) where within each fold of a Leave-One-Out cross-validation (LOO-CV) scheme a compound's predicted DILI label and Tanimoto similarity to the training set were retrieved (see Methods). Note that such an analysis for the FDA validation set was not possible as the DILIrank labels for this set of compounds were withheld. It was found that 65% of the compounds with a Tanimoto similarity to their 5 nearest neighbors in the training set between 0.0-0.2 were correctly classified (already comparable to the predictive performance on the FDA validation set for the same model -mean accuracy 0.673 ± 0.000), and this increased to 89% for compounds with 0.4-0.5 Tanimoto Similarity, and subsequently to 100% for compounds with a Tanimoto similarity greater than 0.5 (Fig. 2a). As similar inter-similarity distributions were found between the training dataset and both the external test and FDA validation sets ( Fig. 2b), one could have naively anticipated a higher predictive performance (in line with the external test set) for the FDA validation compounds than seen in practice ( Fig. 1).

Biological interpretation of protein targets
We next compared the median feature importance of proteins in the RF and SVM model based on the DILIrank (-vLessConcern) dataset which showed the best classification performance with protein target descriptors across LOCO-CV, external test set and FDA validation set ( Figure S2). The Pearson correlation between the absolute respective feature importance is low (0.29) indicating that overall they identify different protein targets as being important for DILI classification.
Given that the focus is set on proteins with bioactivity related to DILI risk, we only further examined those that were significantly enriched in DILI-related compounds as determined by a Wilcoxon rank test. This included aldose reductase AKR1B1 which has been linked to APAP-induced oxidative stress and hepatotoxicity (29), the CYP enzymes CYP1A2 and CYP2C9 which are involved in xenobiotic metabolism in the liver (30), and the p38 kinase MAPK11 which is known to mediate stress-related signals in hepatotoxicity (31). Moreover, aldo-keto reductase family 1 member C (AKR1C3) is essential for Phase II drug metabolism pathways and Hypoxia-inducible factor prolyl 4-hydroxylase (P4HTM) inactivation has reported to have a protective role against DILI (32)(33)(34).
However, novel proteins were also identified such as Dual specificity protein kinase (CLK1). Interestingly, two of the identified novel proteins, namely Discoidin domain receptor Tyrosine Kinase 1 (DDR1) and Adenosine A1 Receptor (ADORA1), are members of the same protein families as DDR2 and ADORA2A, which are known in liver damage (35,36). In fact, DDR1 and DDR2 possess similar structure and shared functions in collagen-mediated signaling (31), and also the adenosine receptors ADORA1 and ADORA2 share physiologic functions (37,38). Furthermore, ADORA1 has been found to contribute to renal dysfunction associated with acute liver injury in rats, supporting a plausible involvement of this target in DILI (39). A full list of the proteins identified as containing the highest feature importance for classification of the current dataset with the RF and SVM methods and their known or potential links to hepatotoxicity are shown in Table S2.
In a next step, over-represented Reactome pathways (21) were determined among the top protein targets, which were significantly enriched in the DILI positive compounds (false discovery rate (FDR) < 0.05) and showed high feature importance in either the RF or SVM models. While results across different importance thresholds are shown in Figures S3 and   S4, representative results of this analysis (with thresholds chosen to be an RF feature importance > 0.0018 and an SVM feature importance > 0.23) are shown in Fig. 4. From both the RF and SVM models, arachidonic acid metabolism and metabolism of lipids were identified as significantly over-represented processes (see Table S3), and the involvement of these two pathways in liver damage has been extensively characterized, especially for injuries induced by acetaminophen (40,41). Moreover, p53 signaling and mitochondrial biogenesis were retrieved by the RF models, which are key regulators of cellular stress response, and which also play a well-established role in DILI (42)(43)(44). The only process which is not known in DILI is related to RUNX1, which is involved in hematopoiesis (45) and was over-represented due to links to three proteins (EP300, CSNK2A2, CSNK2B) which are shared with TP53 signaling and lipid metabolism. Although not known in DILI specifically, increased expression of RUNX1 correlates with inflammation, fibrosis and NASH activity score in patients suffering from Non-Alcoholic Steatohepatitis (NASH) (46).
In contrast, SVM identifies the Phase I metabolism and biological oxidation pathways as characteristic for DILI from the data (Fig. 4), which has previously been shown to be involved in the generation of toxic metabolites causing DILI (47). Hence, both algorithms prioritize proteins known to be involved in key processes in DILI. The same analysis with the lower-performing models based on the DILIrank and DILIrank (+ SIDER) datasets did not retrieve as many relevant proteins and pathways (results not shown).  (Table S4), with a particular focus on precision and coverage among compounds labelled as DILI positive (Fig. 5), with a summary of metrics per structural alert source in Table 2. Furthermore, an analysis of the occurrences of SAs in DrugBank (48) approved compounds was conducted.

Structural alerts
Overall, for both data-and literature-derived structural alerts a common trade-off between precision and coverage was observed i.e. if a substructure had a high precision it rarely had a high coverage (Fig. 5). For example, the SAs benzene derivative (SARpy) and aniline    Table 3) and translated to a more rigorous evaluation of internal model performance. and accordingly also to confidently compare models (Fig. 1).
Larger datasets would be required to allow for enhanced fine-grained sampling of chemical space and the establishment of a model applicability domain. In the present study, the poor generalization to the FDA validation demonstrated that the relationship between chemical structure and the propensity to cause DILI is too complex for the model to learn from the small training dataset used (401 compounds). However, it must be noted that even if larger and higher quality datasets were acquired, model predictivity would still be limited as relevant information that may relate to the manifestation of DILI such as the influence of metabolism in the formation of hepatotoxic prodrugs were not considered in the descriptors used in the present study.

Protein targets
From the models which used predicted protein targets as features, we extracted biological processes by incorporating prior knowledge on bioactivity using PIDGIN and the functional contexts of proteins based on Reactome pathway maps. SVM and RF both identified arachidonic acid metabolism which is the mechanism of action and toxicity of NSAIDs, one of the most common causes for DILI (40,49,50). In addition, cellular stress response was identified as over-represented by Random Forest which is mediated via known pathways in DILI namely p53 signaling, and mitochondrial biogenesis which is initiated upon stresswhich was in this case represented by the p38 kinases MAPK11 and MAPK12 (31)(32)(33)(34)(35)(36)(40)(41)(42)(43)(44). In contrast, SVM identified processes involved in xenobiotic metabolism, biological oxidations and phase I metabolism, which could contribute to the generation of toxic metabolites (47).
While the inferred biological processes have been known to be associated with DILI, this is not true for many of the proteins identified by feature importance themselves (Table S2, S3) such as CLK1 and DDR2. Given that the analysis was based on target binding probabilities, it can be hypothesized that these proteins might be off-targets directly (or indirectly) involved in the pathogenesis of DILI. The described workflow hence was able to derive functional hypotheses on biological processes from compound DILI annotations, which can subsequently be investigated experimentally.

Structural alerts
In this study, structural alerts related to DILI were derived using the SARpy (25) (Table 4) and this obtained a precision of 1. However, a DrugBank (48) database search of the significant structural alerts showed that all of the significant structural alerts derived using MoSS occurred in at least 3 approved drugs, and those from SARpy and by  occurred in at least 10 approved drugs (Table S4). For example, aniline derivative (SARpy) and carbamide derivative (SARpy), were present in 422 and 80 marketed drugs, respectively (Table S4) 2015)) can increase muscle, neural, kidney, liver, blood and spleen toxicity (52), however it is present in e.g. procarbazin, which is a registered antineoplastic agent used in Hodgkin's disease treatment and is an orphan drug for glioma (53). This example demonstrates that it can be beneficial to accept an increased toxicity risk in favor of prolonging the patient's life.

Conclusions
In this study, DILI classifiers were trained using data from the DILIrank and SIDER

Data Preparation
Compounds' SMILES strings were retrieved from the DILIrank database (1036 compounds) (15), and the SIDER 4.1 database (1430 compounds) (16). Bioactivity for 1,673 human protein targets was predicted using the PIDGINv3 software (19-21). 10 µM was chosen as the bioactivity cut-off to consider highly and marginally active compounds. To get a prediction for every compound-target pair, no threshold for the applicability domain was applied. For 6 out of the 923 drugs (4 of them from SIDER) no protein target prediction was made, since their structures could not be standardized internally in the PIDGINv3 software.
In order to implement a OCO-CV scheme (to ensure similar compounds were not in different folds), we performed hierarchical clustering of compounds. Based on pairwise Tanimoto similarities calculated using ECFP4, a tree was generated using hierarchical clustering with the Nearest Point Algorithm implemented in SciPy (version 1.2.1) (57).
Clusters were generated by cutting the hierarchical tree at a distance of 0.5, which resulted in compounds with a Tanimoto similarity of at least 0.5 being in the same cluster.

Overview
We used the SVM and RF implementations in the scikit-learn Python library (version 0.21.2) to train binary classification models for DILI. Models were developed for all three input feature spaces (fingerprints, protein targets, and Mordred molecular descriptors). In addition, we generated models using different subsets of data considering different DILIrank classes as well as additional inactives from the SIDER database (Table 1).

Model Hyperparameter Grid Search
Firstly, for SVMs (58)  Balanced accuracy (equation 1) was primarily used to assess the predictive performance of the models. We also utilized specificity to compare models to those previously published in the literature (Table 3)

Interpretation of protein targets
Proteins targets with higher binding probability in DILI positive compounds were determined using a one-sided Wilcoxon rank-sum test with a FDR of <0.05. Among those, proteins features with high median importance across the 10 train-test splits were identified for RF and SVM models, respectively. The feature importance for RF models implemented in scikit-learn (56) describes the decrease of node impurity achieved by a feature, averaged over all trees in the forest, as a fraction, so that the importance of all features included in the model sum up to 1 (59). In SVM models using linear kernels the importance of features is reflected by the magnitude of their coefficients describing the hyperplane (58). The sign indicates which class is favored by the presence of a given feature. The values used for further analysis were the median importance of a feature across the 10 train-test splits.
Over-enrichment analysis was performed using the ReactomePA R package (version In order to find a discriminative fragment, two thresholds should be defined by the user.  Availability of data and material The datasets generated and/or analyzed during the current study are available in the GitHub repository https://github.com/anikaliu/CAMDA-DILI which will be released upon publication.

Competing interests
The authors declare that they have no competing interests. Author contributions MW, PW, AL and AE analyzed the performance of the predictive models generated by MW.
AL analyzed protein targets and derived pathways, results of which were supported by literature research performed by AL and DD. The structural alerts analysis was performed by AMB and PW, and was revised by HY. AL, PW, MW, AMB, DD, HY wrote the manuscript.
AL and AB supervised the study. All authors read and approved the manuscript.

Figure 1
DILI label prediction performance (balanced accuracy) of RF and SVM models trained using ECFP4 descriptors. Models were trained using the datasets described in Table 1. Performance is stable between the 5-fold LOCO-CV and external test set, but a distinct drop in predictive accuracy is observed when predicting the FDA validation set. Hence, despite demonstrating a capability to generalize to new compounds (not seen during training) in the external test set, models lacked the capability to generalize to the new compounds in the FDA validation set.

Figure 2
Analysis of the link between chemical similarity and classification performance.