DrugMint: a webserver for predicting and designing drug-like molecules

Background: Identification of drug-like molecules is one of the major challenges in the field of drug discovery. Existing approaches such as the Lipinski rule of five (Ro5) and Oprea's rules have their own limitations. Thus, there is a need to develop a computational method that can predict the drug-likeness of a molecule with precision. In addition, there is a need to develop algorithms for screening chemical libraries for drug-like properties.

Results: In this study, we used 1347 approved and 3206 experimental drugs to develop a knowledge-based computational model for predicting the drug-likeness of a molecule. We used the freely available PaDEL software to compute molecular fingerprints/descriptors of the molecules, and the Weka software for feature selection in order to identify the best fingerprints. We developed various classification models using different types of fingerprints: EState, PubChem, Extended, Fingerprinter, MACCS keys, GraphsOnlyFP, SubstructureFP, SubstructureFPCount, Klekota-RothFP, and Klekota-RothFPCount. We observed that the models developed using MACCS keys based fingerprints discriminated approved and experimental drugs with the highest precision. Our model based on 159 MACCS keys predicted the drug-likeness of molecules with 89.96% accuracy and an MCC of 0.77. Our analysis indicated that MACCS keys (ISIS keys) 112, 122, 144, and 150 were highly prevalent in the approved drugs. Screening of the ZINC (drug-like) and ChEMBL databases showed that around 78.33% and 72.43% of the compounds in these databases, respectively, had drug-like potential.

Conclusion: It was apparent from the above study that binary fingerprints can be used to discriminate approved from experimental drugs with high accuracy.
To facilitate researchers working in the field of drug discovery, we have developed a webserver for predicting, designing, and screening novel drug-like molecules (http://crdd.osdd.net/oscadd/drugmint/).

Reviewers: This article was reviewed by Robert Murphy, Difei Wang (nominated by Yuriy Gusev), and Ahmet Bakan (nominated by James Faeder).


Background
High-throughput screening techniques and combinatorial chemistry have provided a substantial boost to our efforts towards discovering new therapeutic molecules [1][2][3]. Despite tremendous progress in the field of drug discovery, there is a high failure rate of drug molecules in the advanced stages of clinical trials [4,5]. Therefore, more innovative approaches are required in the process of developing new drug molecules. Among the billions of compounds that have been synthesized and tested to date, only a fraction have the potential to pass FDA approval. A recent estimate suggested that it would take more than 300 years to double the number of available drugs at the current rate of drug discovery [6]. Therefore, prior knowledge that could discriminate drug-like molecules from the rest would be a welcome step for drug discovery and design.
In the past, several attempts have been made to shrink the chemical space of molecules with potential drug-like properties [7]. The Lipinski Rule of Five (Ro5) is the most widely accepted drug-likeness filter; it is based on a simple analysis of four important properties of drug molecules, i.e. the number of hydrogen bond donors, the number of hydrogen bond acceptors, molecular weight, and lipophilicity [8]. Although Ro5 has been used as a major guideline in drug discovery efforts, it also has several limitations [9]. The method is not universally applicable, and many compounds, particularly those of natural origin such as antibiotics, are not recognized by it as drug-like [10]. Recently, it has also been reported that among the two hundred best-selling branded drugs in 2008, twenty-one violated Ro5 [11]. Previously, it has been reported that real drugs are ~20-fold more soluble than the drug-like molecules present in the ZINC database; specifically, oral drugs are about 16-fold more soluble, while injectable drugs are 50-60-fold more soluble [12]. A comparison of two molecular properties, molecular weight and ClogP, across different families of FDA-approved drugs suggested that modified drug-likeness rules should be adopted for certain target classes [13]. In 2008, Vistoli et al. summarized the various kinds of pharmacokinetic and pharmaceutical properties of molecules that play an important role in estimating drug-likeness [14]. Recently, Bickerton et al. developed a simple computational approach for predicting the oral drug-likeness of unknown molecules [11]; this very simple approach is applicable only to oral drugs.
To overcome these problems, several models based on machine-learning techniques have been developed in the past. An early computational model developed in 1998 for predicting drug-like compounds was based on simple 1D/2D descriptors and showed a maximum accuracy of 80% [15]. In the same year, another study tried to predict drug-like molecules based on common structures that were absent in non-drug molecules [16]. Genetic algorithm, decision tree, and neural network based approaches have also been attempted to distinguish drug-like compounds from non-drug-like compounds [17][18][19]. Although these approaches used large datasets, they showed a maximum accuracy of only 83%. In comparison, some recent studies showed better success in predicting drug-like molecules. In 2009, Mishra et al. classified drug-like small molecules from the ZINC database based on "Molinspiration MiTools" descriptors using a neural network approach [20]. Other reports that appeared promising in predicting the potential of a compound to be approved were based on DrugBank data [21,22].
The main problem with the existing models is their non-availability to the scientific community. Moreover, commercial software packages were used to develop these models, so these studies are of limited use to the scientific community. To address these problems and to complement previous methods, we have made a systematic attempt to develop a prediction model. The performance of our models is comparable to or better than that of existing methods.

Results and discussion
Analysis of dataset

Principal component analysis (PCA)
We used principal component analysis (PCA) to compute the variance among the experimental and approved drugs [23]. As shown in Figure 1, the variance decreased significantly up to PC-15; afterwards, it remained more or less constant. PC-1 and PC-2 explained 15.76% and 8.91% of the variance of the whole dataset, respectively [Figure 2]. These results clearly indicated that the dataset was highly diverse, and thus suitable for developing a prediction model.

Substructure fragment analysis
To explore hidden information, the dataset was further analyzed using SubFP and MACCS keys based fingerprints, using the formula given below:

Fragment frequency = (N_fragment_class × N_total) / (N_fragment_total × N_class)

where N_fragment_class is the number of fragments present in that class (approved/experimental); N_total is the total number of molecules studied (approved + experimental); N_fragment_total is the total number of fragments in all molecules (approved + experimental); and N_class is the number of molecules in that class (approved/experimental).
Our analysis suggested that some substructure fragments were not preferred in the approved drugs. The substructure-based analysis suggested that primary alcohol, phosphoric monoester, diester, and mixed anhydride were non-preferable functional groups, present in the experimental drugs with higher frequency [Table 1]. Similarly, MACCS keys 66, 112, 122, 138, 144, and 150 were highly desirable and present with higher frequency in the approved drugs [Table 2, Additional file 1: Table-S1 and Figure 3]. Therefore, while designing new drug-like molecules in the future, excluding these SubFP fingerprints and including certain MACCS keys might increase the probability of designing a better molecule.
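The frequency calculation above can be sketched in a few lines of Python. The counts in the example are hypothetical, and the formula is our reading of the variable definitions given in the text (a dimensionless enrichment ratio):

```python
def fragment_frequency(n_fragment_class, n_class, n_fragment_total, n_total):
    """Relative frequency of a fragment in one class (approved or
    experimental), normalized by its overall frequency: a value > 1
    means the fragment is over-represented in that class."""
    return (n_fragment_class * n_total) / (n_fragment_total * n_class)

# Hypothetical counts: a fragment occurring 300 times among the 1347
# approved drugs and 500 times across all 4553 molecules.
ratio = fragment_frequency(300, 1347, 500, 4553)
```

A ratio well above 1 for a MACCS key in the approved class would mark it, in the sense of Table 2, as a preferred key.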

Classification models
To evaluate the performance of different fingerprints, we developed various models on different sets of descriptors calculated by the PaDEL software. Separate models were developed on fingerprints selected using the attribute selection modules rm-useless and CfsSubsetEval of Weka.

Fingerprints based models
The initial models based on EState, PubChem, Extended, Fingerprinter, GraphsOnly, Substructure fingerprint, Substructure count, Klekota-Roth count, and Klekota-Roth fingerprint descriptors showed nearly equal performance, with MCC values in the range of 0.5 to 0.6 [Table 3]. However, the model developed using 159 MACCS keys achieved a maximum MCC of 0.77 with an accuracy of 89.96% [Table 3, Figure 4]. In addition, we applied a Monte-Carlo (MC) approach, generating training and testing datasets 30 times for five-fold cross-validation. These results were more or less the same as the previous five-fold cross-validation results, with an average sensitivity/specificity of 87.88%/90.36% and an accuracy of 89.63% with an MCC value of 0.76 (Additional file 1: Table-S2).

PCA based model
In the previous section, we observed that the models developed using MACCS keys based fingerprints performed better than the models developed using other fingerprints. We therefore used this class of fingerprint for developing a PCA based model. The first model, developed on all 166 components, achieved a maximum MCC of 0.79 and an ROC of 0.96 [Table 4]. The model developed using the top-20 components [Figure 1] achieved a maximum MCC of 0.72, with a marginal decrease in ROC to 0.94. Furthermore, the models developed using the top-15 and top-10 components resulted in MCC values of 0.68 and 0.61, respectively. A slight decrease in MCC was observed on further reducing the number of components to 5.

Hybrid models
In this section, we describe hybrid models developed by combining the descriptors selected in Table 3. First, a hybrid model (Hybrid-1) was developed using the top-5 positively correlated fingerprints from each of the 10 fingerprint classes; this model obtained an MCC of up to 0.7. A second hybrid model (Hybrid-2), based on the top-5 negatively correlated descriptors, achieved an MCC value of 0.36 [Table 5]. A third hybrid model (Hybrid-3) was developed by combining the top-5 positively and top-5 negatively correlated fingerprints; it resulted in a slight increase in performance compared to the individual models, with an MCC value of 0.77 [Table 5]. Next, by combining the descriptors selected by the CfsSubsetEval module for each fingerprint, a hybrid model (Hybrid-4) was developed, which showed an accuracy of up to 90.07% with an MCC value of 0.78 [Table 5]. Finally, a hybrid model (Hybrid-5) on 22 descriptors was obtained by further reducing these descriptors (296) with the CfsSubsetEval module; it showed a slight decrease in MCC to 0.7 with a significant reduction in the number of descriptors.

Performance on validation dataset
We evaluated the performance of our three models, i) rm-useless, ii) PCA based, and iii) CfsSubsetEval based, using a validation dataset created from MACCS fingerprints (see details in the Materials and methods section). Each model was trained and validated by internal five-fold cross-validation [Table 6]. The best-selected models were then used to estimate performance on the validation dataset. The first model, based on 159 (rm-useless) fingerprints, showed a sensitivity/specificity of 90.37%/87.21% with an MCC value of 0.77 on the validation dataset. The next model, built on the top 20 PCs, showed a sensitivity/specificity of 81.85%/87.21% with an MCC value of 0.67 [Table 6]. However, the CfsSubsetEval based model, developed on 10 fingerprints, showed a maximum MCC of 0.62 on the validation dataset. This decrease in MCC on the validation dataset might be due to the reduction in the number of descriptors.

Performance on independent dataset
We tested our MACCS (ISIS) keys based model on the independent dataset and achieved 84% sensitivity and 38.92% specificity, with an accuracy of 41.15%. These results also indicated that ~61% of the molecules present in our independent dataset have the potential to be in the approved category in the future. Recently, twenty-one drugs were marked as approved in DrugBank v3.0 that were not classified as approved in the earlier release. Interestingly, all of these compounds were classified in the 'drug-like' class by our model, and this result clearly exemplifies its performance. Together, these results also indicate that our model could be very useful for predicting the drug-like properties of a given compound in advance.

Screening of databases
We predicted the drug-like potential of molecules in three major databases: ChEMBL, ZINC, and the directory of useful decoys (DUD). Around 78.33% of the compounds in the ZINC (drug-like) subset and 72.43% of the compounds in ChEMBL were predicted to have drug-like potential.

Conclusions
This study showed that a better predictive model for discriminating the approved drug from the experimental drugs could be developed using simple binary fingerprints.
In terms of sensitivity, specificity, accuracy, as well as MCC values, the performance of our model was better than those described earlier in the literature. Moreover, this could be achieved with a ~50% reduction in the number of descriptors, which is highly significant. Our study also suggested that the CfsSubsetEval algorithm could be used to select informative descriptors, increasing the speed of calculation without compromising the efficiency of the model. From the PCA based models, we observed that 20 PCs were sufficient to develop a prediction model. We also evaluated the performance of the QED method on the datasets used in this study: QED correctly classified 44.8% of the approved and 81.28% of the experimental drugs from the training dataset, and 40% of the approved and 52.5% of the experimental drugs from the independent dataset. The performance of QED, particularly its sensitivity, was very poor; this might be because the QED approach was specifically developed for oral drugs, whereas our datasets contained all types of drugs. Among the various selected fingerprints, some were preferable in the approved drugs while others were preferable in the experimental drugs. In addition, our MACCS keys based model correctly predicted the twenty-one drugs recently listed by the FDA in the approved category. Similarly, on the independent dataset, our model performed with sensitivity values of up to 84%. Our analysis suggested that primary alcohol, phosphoric monoester, diester, and mixed anhydride were non-preferable functional groups. The efficiency of the freely available software was quite similar to that of the commercially available software. We expect that this webserver will be useful in the future for selecting drug-like molecules.

Web server
The major drawback of most chemoinformatics studies is that they are mainly based on commercial software packages. This is the reason most of the predictive studies described in the literature are not available for public use in the form of software or a web server. To overcome this drawback, we have used freely available software and achieved results comparable to those obtained with commercial software. Our study is implemented in the form of a webserver without any restriction. In this server, we provide facilities to design, screen, and predict the drug-likeness score of chemical compounds. The screening results for the ZINC and ChEMBL libraries are also provided under the database search option. To provide this free service to the community, we have developed "drugmint" (http://crdd.osdd.net/oscadd/drugmint), a user-friendly webserver for discriminating approved drugs from experimental drugs. The server allows users to interactively draw/modify a molecule using a Marvin applet [24]. The server is installed on a Linux (Red Hat) operating system, and the common gateway interface (CGI) scripts of "drugmint" are written in Perl version 5.03.

Dataset source

Training dataset
The dataset used in this study was taken from Tang et al. [22] and contained 1348 approved and 3206 experimental drugs derived from DrugBank v2.5. The PaDEL software was unable to calculate the descriptors of one approved drug (DrugBank ID DB06149); therefore, we did not include this molecule, and our final dataset comprises 1347 approved and 3206 experimental drugs.

Validation dataset
We also created a validation dataset by randomly taking 20% of the data from the whole dataset. Thus, our new training dataset consists of 1077 approved and 2565 experimental drugs, and the validation dataset comprises 270 approved and 641 experimental drugs.

Independent dataset
We also created an independent dataset from DrugBank v3.0. Initially, all 1424 approved and 5040 experimental drugs were extracted from DrugBank v3.0. All molecules used in our main (training) dataset were removed, leaving 237 approved and 1963 experimental drugs. After excluding compounds for which no structure was available in the database, our final independent dataset comprises 100 approved and 1925 experimental drugs.

Selection of descriptors
It has been shown in previous studies that not all descriptors are relevant [27]. Thus, the selection of descriptors is a crucial step in developing any kind of prediction model [28,29]. In this study, we used two modules of Weka: i) Remove Useless (rm-useless) and ii) CfsSubsetEval with the best-first search algorithm [30]. In the rm-useless step, all descriptors whose values either vary too much (e.g. are distinct for nearly every molecule) or show negligible variation are removed. The CfsSubsetEval module of Weka is a more rigorous algorithm; it selects only those features or descriptors that have a high correlation with the class/activity and low inter-correlation with each other.
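As a rough sketch of what such a "remove useless" filter does (the variance and distinctness cut-offs below are illustrative, not Weka's actual defaults):

```python
import statistics

def remove_useless(matrix, max_distinct_frac=0.99, min_variance=1e-8):
    """Return indices of descriptor columns worth keeping: drop columns
    that are nearly constant or whose values are distinct for almost
    every molecule."""
    n = len(matrix)
    keep = []
    for j in range(len(matrix[0])):
        col = [row[j] for row in matrix]
        if len(set(col)) / n > max_distinct_frac:
            continue  # varies too much: unique for nearly every molecule
        if statistics.pvariance(col) < min_variance:
            continue  # effectively constant across the dataset
        keep.append(j)
    return keep

# Column 0 is constant, column 2 is unique per row; only column 1 survives.
descriptors = [[0, 1, 0.1], [0, 0, 0.2], [0, 1, 0.3], [0, 0, 0.4]]
print(remove_useless(descriptors))   # [1]
```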

Cross-validation techniques
Leave-one-out cross-validation (LOOCV) is a preferred technique for evaluating the performance of a model, but it is time consuming and CPU intensive, particularly when the dataset is large. In this study, we used the five-fold cross-validation technique to reduce the computational time for developing and evaluating our models. In this technique, the whole dataset is randomly divided into five sets of similar size; four sets are used for training and the remaining set for testing. This process is repeated five times in such a way that each set is used exactly once for testing, and the overall performance is computed on the whole dataset.
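The five-fold split described above can be sketched in plain Python (the fixed seed is only for reproducibility of the example):

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Randomly partition sample indices into k folds of near-equal
    size; each fold serves exactly once as the test set."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = k_fold_indices(4553)          # 1347 approved + 3206 experimental
for i, test_set in enumerate(folds):
    train_set = [j for m, fold in enumerate(folds) if m != i for j in fold]
    # ...train a model on train_set, evaluate it on test_set...
```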

Model development
In this study, we developed Support Vector Machine (SVM) based models for the prediction of drug-like molecules using the SVM light software package. SVM is based on statistical and optimization theory; it handles complex structural features and allows users to choose a number of parameters and kernels (e.g. linear, polynomial, radial basis function, and sigmoid) or any user-defined kernel. The software can be downloaded freely from http://www.cs.cornell.edu/People/tj/svm_light/.
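SVM light reads training examples as plain-text lines in a sparse "label index:value" format with 1-based feature indices, so a binary fingerprint can be encoded as below. The function name is our own:

```python
def to_svmlight_line(label, fingerprint_bits):
    """Encode one molecule for SVM-light: the class label (+1 approved,
    -1 experimental) followed by index:value pairs for the set bits of
    a binary fingerprint (feature indices are 1-based)."""
    pairs = [f"{i + 1}:1" for i, bit in enumerate(fingerprint_bits) if bit]
    return " ".join([f"{label:+d}"] + pairs)

# A toy 8-bit fingerprint of an "approved" molecule:
print(to_svmlight_line(+1, [1, 0, 0, 1, 1, 0, 0, 0]))   # +1 1:1 4:1 5:1
```

Writing one such line per molecule produces a file that can be passed directly to svm_learn.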

Evaluation parameters
All the models developed in this study were evaluated using standard parameters: i) sensitivity (percentage of correctly predicted approved drugs), ii) specificity (percentage of correctly predicted experimental drugs), iii) accuracy (percentage of correctly predicted drugs), and iv) Matthews correlation coefficient (MCC). These parameters can be calculated using the following equations 1 to 4:

Sensitivity = TP / (TP + FN) × 100    (1)

Specificity = TN / (TN + FP) × 100    (2)

Accuracy = (TP + TN) / (TP + TN + FP + FN) × 100    (3)

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))    (4)

where TP and TN are the numbers of correctly predicted positive (approved) and negative (experimental) drugs, respectively, and FP and FN are the numbers of wrongly predicted approved and experimental drugs, respectively. The Matthews correlation coefficient (MCC) is considered to be the most robust parameter of any class prediction method. We also used a threshold-independent parameter, the receiver operating characteristic (ROC) curve, for evaluating the performance of our models.
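These four parameters can be computed directly from the confusion-matrix counts; a minimal sketch:

```python
import math

def evaluate(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy (all in %) and MCC from the
    confusion-matrix counts of a two-class prediction."""
    sensitivity = tp / (tp + fn) * 100
    specificity = tn / (tn + fp) * 100
    accuracy = (tp + tn) / (tp + tn + fp + fn) * 100
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return sensitivity, specificity, accuracy, mcc

# A perfect predictor scores 100% on all three rates and MCC = 1:
print(evaluate(50, 50, 0, 0))   # (100.0, 100.0, 100.0, 1.0)
```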

Reviewer number 1: Dr Robert Murphy
Comment-1: This manuscript describes a fairly simple design of a machine learning system for predicting whether a chemical structure is similar to previously approved drugs. It describes a web server to provide predictions about new structures.
The manuscript does not provide sufficient discussion of relevant prior work and quantitative comparison with other published approaches for which code is available (e.g., Bickerton et al. 2012).

Response: In the revised version, we have discussed the previous studies as suggested by the reviewer. Following the reviewer's comments, we evaluated the performance of the QED model on our datasets: QED correctly predicted 44.8% of the approved (sensitivity) and 81.28% of the experimental (specificity) drugs, while on the independent dataset it showed only 40% sensitivity and 52.5% specificity. QED (Bickerton et al. 2012) performs poorly on our dataset because it was developed for predicting the oral drug-likeness of a molecule. The high sensitivity and specificity of the models described in this study imply their usefulness in predicting the drug-likeness of a molecule.

Comment-2: There is a potentially serious concern with the validity of the results due to the fact that the experimental design may result in overfitting. Even though cross-validation was used internally for combinations of features and learners to evaluate predictive accuracies, when these results are subsequently used to make decisions (such as which features to use), it compromises any conclusions from further analysis of the same training and testing data. A related problem may also arise from maximization of ROC area when some of the experimental drugs may indeed be drug-like. These concerns were shown to be warranted because the final evaluation using an independent dataset showed much lower accuracy. However, it is somewhat encouraging that twenty-one molecules in the test set that were recently approved as drugs were classified as "drug-like" by the authors' model.
Response: We are thankful to the reviewer for this valuable comment. To further validate our prediction model, we used a Monte-Carlo approach in which we randomly created training and testing datasets 30 times and computed the average performance. We achieved 87.88% sensitivity, 90.36% specificity, and 89.63% accuracy when evaluated using this approach. The result for every set is provided in the supplementary document (Additional file 1: Table-S2) in the form of sensitivity, specificity, accuracy, and MCC, along with their mean and standard deviation. These results were more or less the same as the previous five-fold results, indicating that our models are not over-fitted and will be useful in a real scenario.
Comment-3: The web server model does not seem appropriate for the primary use case, which is envisaged to be making predictions for users with novel structures. Since users may wish to keep their structures private, an open source approach would be strongly preferable to a public server. This would secure use of the system and also permit inspection and modification of the methods used.
Response: We are thankful for this suggestion and understand the limitations of a webserver for prediction. To facilitate users and for the sake of user privacy, we have developed a standalone version of this software, available for download from http://osddlinux.osdd.net; users can now run our software on their local machines.
Additional comment-1: The author list contains "Open Source Drug Discovery Consortium" which is not a person and is not mentioned elsewhere in the manuscript.
Response: We are thankful for this comment. In the revised version, we have acknowledged the Open Source Drug Discovery Consortium instead of including it in the author list.
Additional comment-2: The abstract refers to screening but the manuscript does not describe any screening results.
Response: The authors are thankful for this suggestion. In the revised manuscript, we have provided the details of the chemical libraries and their screening results in the paragraph "Screening of databases".
Quality of written English: Needs some language corrections before being published.
Response: We have corrected the language in the revised manuscript.

Reviewer number 2: Prof Difei Wang (nominated by Dr Yuriy Gusev)
In general, this is an interesting work and it is important to predict drug-like molecules using various types of molecular fingerprints. However, I do have some questions about the manuscript.
Comment-1: On page 7, the authors stated that "Similarly, MACCS fingerprint elements 112, 122, 144, and 150 were highly desirable and present with higher frequency in the approved drugs [Table 2, Figure 3]", yet MACCS-66 also appears in the Table. Is there any reason to exclude MACCS-66 here?
Response: We are thankful to the reviewer for this nice suggestion. We now provide descriptions of the selected MACCS keys, which should be useful for interpreting the results [Additional file 1: Table-S1].

Comment-2: What is the score cutoff value for drug-like and non-drug-like molecules in the database screening results? What is the meaning of "drug like, low", "drug like, high", and "non drug like, low"? What false-positive rate do we expect here?
Response: The authors are thankful for this comment. In this study, we have used a threshold value 0 for discrimination of the approved and experimental drugs.
The SVM score is categorized into three groups: a) Very High: when the score is >1.0 (drug-like) or < −1.0 (non-drug-like); b) High: when the score is between 0.5 and 1.0 (drug-like) or between −1.0 and −0.5 (non-drug-like); c) Low: when the score lies between 0 and 0.5 (drug-like) or between −0.5 and 0 (non-drug-like).
The false positive rate was calculated by shuffling the dataset 30 times in five-fold cross-validation; the average FPR is 9.64% (Additional file 2: Table-S2).
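The score categorization described in this response can be expressed as a small helper; the handling of scores falling exactly on ±0.5 and ±1.0 is our own choice, since the response leaves those boundaries ambiguous:

```python
def categorize(score):
    """Map an SVM score to the server's class and confidence label,
    using the decision threshold of 0."""
    cls = "drug-like" if score >= 0 else "non drug-like"
    magnitude = abs(score)
    if magnitude > 1.0:
        level = "Very High"
    elif magnitude >= 0.5:
        level = "High"
    else:
        level = "Low"
    return cls, level

print(categorize(0.3))    # ('drug-like', 'Low')
print(categorize(-1.2))   # ('non drug-like', 'Very High')
```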
Comment-3: How many distinct structural families in drugbank3.0? How structurally diverse of this dataset? Are there many drugs having similar structures? If the answer is yes, will it bias the fingerprint selection and model creation?
Response: We are thankful for this valuable comment. Following this comment, we analyzed the structural families of the drugs in DrugBank 3.0 and found that they are currently classified into 233 different families (http://www.drugbank.ca/drug_classes). This clearly shows that the dataset is highly diverse and suitable for model development.
Comment-4: I tried the example on the web server. But it seems slow and could not give me the result. Is this server really functional?
Response: We are thankful to the reviewer for this comment. Now, the server is completely functional.
Comment-5: Would it be possible to have a standalone version of the web server? It would be great if a standalone version were available to the community.
Response: We are thankful for such a nice suggestion. To improve the visibility of this work, we have developed a standalone version of this software, available to users at http://osddlinux.osdd.net.

Comment-6: On page 1, "can predict drug-likeness of molecules with precession." Is "precession" a typo?
Response: We are thankful to the reviewer for pointing out this typo. In the revised version, we have corrected this mistake and have also taken care of other grammatical errors.

Comment-7: I am not sure if this topic is suitable for this computational biology-centric journal. Maybe this work is more suitable for publication in journals like BMC.
Response: We are thankful for this suggestion and we think this kind of work is well suited for this journal.
Quality of written English: Acceptable

Reviewer number 3: Mr Ahmet Bakan (nominated by Prof James Faeder)
Comment-1: The authors developed various classification models using an exhaustive set of chemical fingerprints for discriminating approved drugs from experimental drugs and made these models available via a web server. In recent years, many newly approved drug molecules have been breaking the widely accepted rule of 5 for drug-likeness, so improving and updating methods for calculating drug-likeness is an important problem. However, I do not understand why the authors developed models that discriminate "approved" drugs from "experimental" drugs. Experimental drugs are molecules that are under investigation; being experimental does not mean the compound is not drug-like, so any model that discriminates approved from experimental does not have any value. The exhaustive approach would be valuable if models were developed to discriminate drug-like, safe compounds from potentially toxic, non-drug-like compounds.

Response: We agree with the reviewer's comment. Studies have previously focused on discriminating drug-like molecules from non-drug-like ones, but most were based on commercial datasets, using MDDR or CMC as the drug-like dataset and ACD as the non-drug-like dataset; thus, the availability of the data is a major issue. In contrast, our method is an attempt to discriminate two closely related classes of drug-like molecules. This is an advance in the drug design process because, despite having in vitro drug-like properties, many drugs fail in clinical trials (the experimental stage). Thus, it is very important to discriminate these two classes of molecules. This is the only dataset available for public use and will be an excellent asset for the development of public-domain servers.
Quality of written English: Not suitable for publication unless extensively edited.

Response: We are thankful to the reviewer for this comment. In the revised version, we have tried our best to improve the quality of the English. We hope the revised version will be suitable for publication.

Reviewer number 1: Dr Robert Murphy
The authors did not respond adequately to my concern about overfitting. By using the results from cross-validation to make choices (such as which features to use), the expected accuracy of the system so configured is no longer the cross-validation accuracy for that configuration. Simply adding more cross-validation trials does not address the issue. The problem may be clarified by considering that some combination of features and model parameters will optimize performance on any finite dataset, but the same combination may not be optimal for another finite dataset even if chosen from the same underlying distribution. Optimization of these choices does not allow the accuracy to be estimated for the new dataset. The point is that in order for cross-validation to be used to estimate future performance, all choices must be made using the training set only. The observation that the performance on the independent dataset (from DrugBank v3.0) was significantly worse suggests that the two datasets may have been drawn from different distributions (likely), but also that the cross-validation accuracy from the original dataset was an overestimate.
Response: After receiving the above comments on our revised version, we rechecked the reviewer's comments and our previous response. We realized that we had misunderstood the comment, which is why we had simply performed more cross-validation trials. We agree with the reviewer that we performed feature selection on the whole dataset, so there was bias in the feature selection. In this version of the manuscript, we have re-evaluated our models to avoid this bias. We randomly picked 20% of the data from the whole dataset and called it the validation dataset (for details, see the Methods section). The remaining 80% of the data, called the new training dataset, was used for training, testing, and evaluation of our models using five-fold cross-validation. Everything, including parameter optimization, feature selection, and model building, was done on the new training dataset. The final model, with optimized parameters and features, was used to evaluate performance on the validation dataset, which was never used in any kind of training or feature selection. The performance of our models on the training and validation datasets is shown in Table 6; the results on the validation dataset are in agreement with those on the training dataset. We also observed that the prediction performance of the model based on the 159 MACCS keys is the same for the new training and validation datasets as for the model developed on the whole training dataset. However, a slight decrease in MCC, from 0.72 to 0.67 for the PCA based model and from 0.67 to 0.62 for the CfsSubsetEval based model, was observed between the new training and validation datasets. This implies that the model developed on 159 MACCS keys is suitable for further prediction, because its accuracy is highly similar on both the new training and validation datasets. These results suggest that the models developed in this study are not over-optimized.
Quality of written English: Acceptable

Reviewer number 2: Prof Difei Wang (nominated by Dr Yuriy Gusev)
The authors' responses to my questions are acceptable. However, it seems the server still has some problems running the examples for virtual screening and designing analogs. If possible, it would be better to give an estimate of the running time, so that users could decide whether to wait for the results. The output of the database search is somewhat confusing: the first column gives the molecule number. What is this for? Why did the example give the same