A high-performance approach for predicting donor splice sites based on short window size and imbalanced large samples

Abstract

Background

Splice site prediction has been a long-standing problem in bioinformatics. Although many computational approaches developed for splice site prediction have achieved satisfactory accuracy, further improvement in predictive accuracy remains important because it contributes to predicting gene structure more accurately. Determining a proper window size before prediction is necessary. An overly long window size may introduce irrelevant features, which would reduce predictive accuracy, while a short window size that retains the maximum information may perform better in terms of both predictive accuracy and time cost. Furthermore, the number of false splice sites following the GT–AG rule far exceeds that of true splice sites, so accurate and rapid prediction of splice sites from imbalanced large samples has always been a challenge. Therefore, based on a short window size and imbalanced large samples, we developed a new computational method named chi-square decision table (χ2-DT) for donor splice site prediction.

Results

Using a short window size of 11 bp, χ2-DT extracts improved positional features and compositional features based on the chi-square test, introduces features one by one based on information gain, and constructs a balanced decision table to handle imbalanced pattern classification. With a 2000:271,132 (true sites:false sites) training set, χ2-DT achieves the highest independent test accuracy (93.34%) when compared with three classifiers (random forest, artificial neural network, and relaxed variable kernel density estimator) and takes a short computation time (89 s). χ2-DT also exhibits good independent test accuracy (92.40%) when validated on BG-570 mutated sequences containing frameshift errors (nucleotide insertions and deletions). Moreover, χ2-DT is compared with both long-window-size-based and short-window-size-based methods and is found to outperform all of them in terms of predictive accuracy.

Conclusions

Based on short window size and imbalanced large samples, the proposed method not only achieves higher predictive accuracy than some existing methods, but also has high computational speed and good robustness against nucleotide insertions and deletions.

Reviewers

This article was reviewed by Ryan McGinty, Ph.D. and Dirk Walther.

Background

The amount of genomic sequence data has increased exponentially as a result of the advancement in sequencing technology. Therefore, there is an urgent need to complete genome annotation quickly and reliably. Gene identification is an important task in genome annotation. Most eukaryotic genes consist of protein-coding regions (exons) and noncoding regions (introns), with the exons being separated by intervening introns [1]. The boundaries between exons and introns are called splice sites and are the locations where RNA splicing occurs. The 5′ end of an intron is a donor splice site and the 3′ end is an acceptor splice site. If we can accurately detect splice sites, the coding regions of DNA sequences can be located, so splice site prediction plays a key role in gene identification. Almost 99% of splice sites are canonical GT–AG pairs [2], that is, dinucleotides GT and AG for donor and acceptor splice sites, respectively. However, this strong conservation observed in splice sites is not sufficient to accurately identify them, due to the abundance of dinucleotides GT and AG appearing at non-splice site positions. We therefore face an extremely imbalanced classification task, namely, the discrimination of small numbers of true splice sites from much larger volumes of decoy positions with the dinucleotides GT and AG [3].

For splice site prediction based on machine learning approaches, the main steps are feature extraction and classifier selection or design. The extracted features are usually based on nucleotide position information [4,5,6,7,8,9], the frequency of k-mers [4, 6, 10], dependence between adjacent and nonadjacent nucleotides [1, 6, 11,12,13], RNA secondary structure information [14,15,16,17,18], DNA structural properties [19], and some other attributes that can be calculated directly from sequence information [20,21,22]. The commonly used classifiers include support vector machine (SVM) [1, 3, 5, 6, 10, 18, 23,24,25], artificial neural network (ANN) [26,27,28,29], random forest (RF) [13], and decision tree [30].

Although relatively high accuracy has been achieved with the methods currently available (e.g., the accuracy of most donor splice site predictions based on the HS3D dataset has exceeded 90% [6, 10, 12, 13, 19, 24, 31]), further study is still necessary for the following reasons: 1) Determining a suitable window size prior to applying any prediction method is essential [32]. An overly long window size may introduce irrelevant features that reduce predictive accuracy, and may take more computational time and memory space. 2) The HS3D dataset contains 2796/271,937 true/false donor sites (i.e., the ratio of true sites to false sites is almost 1:100). If all negative samples (false sites) are employed for building the prediction model, the huge number of training samples will greatly increase the training time of some classifiers (e.g., SVM and ANN) [3, 33], and an extremely imbalanced class distribution will lead to poor predictive accuracy for some methods, for example, the weighted matrix model (WMM) [9] and maximal dependency decomposition (MDD) [34]. If only a part of the negative samples is employed (e.g., 2796 negative samples [20]), predictive accuracy may be lost due to the underutilization of negative samples. 3) There are three billion DNA base pairs in the human genome, so the expected number of GT/AG dinucleotides is over 187 million. Given this abundance, even a subtle improvement in total predictive accuracy would drastically increase the absolute number of correctly detected real splice sites.

In this study, we developed a computational approach to predict donor splice sites based on short window size and extremely imbalanced large samples. Our method, named chi-square decision table (χ2-DT), extracts the improved positional features based on chi-square tests, combines them with the frequencies of dinucleotides, and then designs a balanced decision table to predict the test samples, which can effectively resolve the imbalanced pattern classification problem. The results show that χ2-DT can achieve high predictive accuracy, high computational speed, and relatively good robustness against DNA sequencing errors (nucleotide insertions and deletions).

Datasets and methods

Datasets

We collected 2796/271,928 true/false donor splice sites from the publicly available HS3D dataset [35] (http://www.sci.unisannio.it/docenti/rampone/) for the experiments, and named them HS3Dall. Each true/false donor splice site-containing sequence has 140 nucleotides, with the conserved dinucleotide GT at the 71st and 72nd positions, and does not contain non-ACGT bases. Setting the positions of the conserved GT as 00, the upstream positions were successively labeled as − 1, − 2, …, − 70, whereas the downstream positions were successively labeled as 1, 2, …, 68. From HS3Dall, we randomly selected 796 true sites and 796 false sites to constitute a balanced testing set, named HS3D-test1:1, and then used the remaining sites to construct the training sets with different ratios of true sites to false sites. Additionally, to compare the performance of χ2-DT with that of other methods, we selected 2796 true sites and different numbers of false sites from HS3Dall to construct four datasets, namely, HS3DI, HS3DII, HS3DIII, and HS3DIV.

The BG-570 dataset [36] (http://genome.crg.es/datasets/genomics96/) contains 570 human genomic DNA sequences and 570 corresponding mutated sequences. The mutated sequences were generated by introducing 1% random frameshift errors (nucleotide insertions and deletions) into the original DNA sequences. Using the BG-570 dataset, we constructed two testing sets (BG-570orig and BG-570muta) to evaluate the robustness of χ2-DT against frameshift errors. The extraction process for the true/false sites in these two testing sets is described in the “Results and discussion” section.
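
To make the nature of these frameshift errors concrete, the following Python sketch shows one way to introduce roughly 1% random insertions and deletions into a DNA sequence. It is illustrative only: the exact procedure used to build the BG-570 mutated set may differ, and the function name and the 50/50 insertion/deletion split are our assumptions.

```python
# Hedged sketch: introduce ~1% random frameshift errors (insertions/deletions)
# into a DNA sequence. The original BG-570 mutation protocol is not reproduced here.
import random

def mutate(seq, error_rate=0.01, seed=0):
    rng = random.Random(seed)
    out = []
    for base in seq:
        if rng.random() < error_rate:
            if rng.random() < 0.5:
                out.append(base + rng.choice("ACGT"))  # insertion after this base
            # else: deletion -> the base is simply dropped
        else:
            out.append(base)
    return "".join(out)

print(mutate("ATGGTGAGTCCTTAGGCTAA" * 5))
```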

The numbers of true/false sites in the datasets described above are given in Table 1.

Table 1 Descriptions of various datasets

Compressing 2 × 4 contingency table of each position with chi-square test

Just like the Pearson correlation coefficient [37] and mutual information estimators [38], which are used for identifying relationships between variables, the maximal information coefficient (MIC) [39] is a novel measure proposed to capture dependences between paired variables. For a pair of data series x and y, to calculate their MIC value, the ApproxMaxMI algorithm [39] sets nx × ny < n^0.6 as the maximal grid size restriction; here, n is the sample size, and nx and ny are the numbers of partition bins on x and y, respectively. Given n = 100, the MIC score for independent paired variables should be zero, and the corresponding partition should be a 2 × 2 grid. However, the ApproxMaxMI algorithm tends to fall into the maximal grid size (100^0.6 ≈ 16); the corresponding partition is a 2 × 8 grid and the corresponding MIC score is 0.24, which leads to a nontrivial MIC score for independent paired variables under finite samples [40]. Recently, Chen et al. [40] presented the ChiMIC algorithm, which can control the excessive grid partitions of the ApproxMaxMI algorithm. Removing the maximal grid size limitation of ApproxMaxMI, ChiMIC uses a chi-square test based on a local r × 2 grid to determine whether a new endpoint should be introduced. If the p-value of the chi-square test is lower than a given threshold, the new endpoint is introduced for partition and ChiMIC continues searching for the next optimal endpoint. If the p-value of the chi-square test is greater than the given threshold, the new endpoint is discarded and the partition process is terminated. For paired independent variables with n = 100, the MIC score calculated by ChiMIC is only about 0.06, and the corresponding partition is a 2 × 2 or 2 × 3 grid; clearly, the grid partition produced by ChiMIC is more reasonable.

Similarly, for each position in donor splice site-containing sequences, we can build a 2 × 4 contingency table to count the frequencies of the four bases in positive and negative samples. Figure 1a shows the 2 × 4 table (2 × 4 grid) of position 6 based on HS3D-train1:135. Is the 2 × 4 table reasonable? Could it be compressed into a 2 × 3 table, or even a 2 × 2 table? For the local 2 × 2 contingency table (the light gray area in Fig. 1a), the p-value of the chi-square test is 0.8933 (> 0.01). This indicates that the endpoint between A and T should not be introduced according to the ChiMIC algorithm. In other words, whether the base at position 6 is A or T provides no valuable information for distinguishing positive and negative samples. Similarly, the endpoint between C and G should not be introduced. Finally, the 2 × 4 contingency table of position 6 is compressed into a 2 × 2 contingency table (see Fig. 1b).

Fig. 1 Compressing the 2 × 4 contingency table of position 6. a: 2 × 4 contingency table of position 6. b: 2 × 2 contingency table of position 6 after compression

The process of compressing the 2 × 4 contingency table of each position is described below. First, compress the 2 × 4 contingency table into six 2 × 3 contingency tables by merging any two different bases, and pick out the 2 × 3 contingency table with the maximum chi-square value, denoted max2 × 3. Next, reconstruct a local 2 × 2 contingency table from the two bases merged in max2 × 3 and perform a chi-square test. If the p-value is lower than a given threshold, max2 × 3 is unreasonable and should be backtracked to the 2 × 4 contingency table; the compression process is then terminated. If the p-value is greater than the given threshold, max2 × 3 is reasonable; then, try to compress max2 × 3 into a 2 × 2 contingency table following the same two steps. Figure 2 further illustrates the compression procedure in detail.

Fig. 2 Illustration of the compression procedure (position 6 in HS3D-train1:135)
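
The compression procedure above can be sketched as follows. This is a minimal illustration, assuming each contingency table is stored as a dictionary mapping a status value to its [positive, negative] counts; the function names and the example counts for position 6 are ours, and the chi-square tests rely on scipy.stats.chi2_contingency.

```python
# Minimal sketch of the column-merging (compression) procedure, assuming a
# 2 x r contingency table stored as {status: [positive_count, negative_count]}.
# Function names and example counts are illustrative, not from the paper's code.
from itertools import combinations
import numpy as np
from scipy.stats import chi2_contingency

def chi2_of(table):
    """Chi-square statistic and p-value of a 2 x r table."""
    observed = np.array(list(table.values())).T            # shape (2, r)
    chi2, p, _, _ = chi2_contingency(observed, correction=False)
    return chi2, p

def compress_table(table, alpha=0.01):
    """Greedily merge the pair of columns whose merged table keeps the maximum
    chi-square value, as long as the local 2 x 2 test says the merged pair does
    not differ significantly between classes (p > alpha)."""
    while len(table) > 2:
        best = None
        for a, b in combinations(list(table.keys()), 2):
            merged = {k: v for k, v in table.items() if k not in (a, b)}
            merged[a + b] = [table[a][0] + table[b][0], table[a][1] + table[b][1]]
            chi2, _ = chi2_of(merged)
            if best is None or chi2 > best[0]:
                best = (chi2, (a, b), merged)
        a, b = best[1]
        _, p_local = chi2_of({a: table[a], b: table[b]})    # local 2 x 2 test
        if p_local < alpha:
            break              # merging would hide a real difference: backtrack and stop
        table = best[2]        # accept the merge and try to compress further
    return table

# Hypothetical counts for position 6 (positives, negatives)
pos6 = {"A": [520, 70210], "C": [110, 15470], "G": [105, 14880], "T": [515, 69800]}
print(compress_table(pos6))    # expected to end up as two merged columns, e.g. AT and CG
```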

Window size determination

For each position in the 140-bp sequences, we obtain a 2 × r contingency table (2 ≤ r ≤ 4) after compression based on HS3D-train1:135; then, we perform a chi-square test on the 2 × r contingency table and calculate the logarithm of the reciprocal of the p-value, denoted here as log(p^−1) (see Fig. 3). Higher log(p^−1) values mean that the corresponding positions are more important for discriminating positives from negatives. Therefore, we set 11 bp (positions −3 to +8, excluding the GT at positions 00) as the window size for donor splice site prediction. In the following text, the study is based on the 11-bp window size unless otherwise specified.

Fig. 3 log(p^−1) values for different positions. The columns marked with arrows indicate positions whose log(p^−1) values are higher than that of position −2. For simplicity, only the log(p^−1) values of positions −15 to +15 are shown.
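
As a rough sketch of the ranking used for window-size determination, the snippet below runs a chi-square test on each position's (already compressed) 2 × r table and sorts positions by log10 of the reciprocal p-value; the counts and the helper name are hypothetical.

```python
# Rank positions by log10(1/p) of a chi-square test on their compressed 2 x r
# contingency tables (rows: positive/negative counts). Counts below are made up.
import math
import numpy as np
from scipy.stats import chi2_contingency

def rank_positions(tables):
    scores = {}
    for pos, tab in tables.items():
        _, p, _, _ = chi2_contingency(tab, correction=False)
        scores[pos] = math.log10(1.0 / max(p, 1e-300))   # log(p^-1), guarded against underflow
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

tables = {
    -3: np.array([[900, 300, 400, 400],             # positives per base A, C, G, T
                  [95000, 88000, 45000, 43000]]),   # negatives per base
    +5: np.array([[1500, 500],                      # a position already compressed to 2 columns
                  [60000, 211000]]),
}
print(rank_positions(tables))
```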

Feature extraction

From each sample (a sequence of 11 bp in length), we extract 11 positional features and 16 compositional features. The compositional features are the frequencies of dinucleotides, which range from 0 to 10 because each sequence sample is only 11 bp long. For each feature, we obtain a 2 × r contingency table (2 ≤ r ≤ 11) after compression; the status values of the feature then correspond to the r columns of the 2 × r table. For example, based on HS3D-train1:135, position 6, whose original status values are {A, C, G, T}, corresponds to a 2 × 2 contingency table after compression, so it has two status values, {AT, CG}; the frequency of dinucleotide AA, whose original status values are {0, 1, …, 10}, corresponds to a 2 × 3 contingency table after compression, so it has three status values, {0, 1, 2–10}. It should be noted that, for the compositional features, only adjacent original status values can be merged during compression because their values are ordered.
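
A minimal sketch of this feature extraction step is given below, assuming each sample is an 11-bp string with the conserved GT already removed; the function name and the example sequence are ours.

```python
# Extract 11 positional features and 16 dinucleotide-frequency features from an
# 11-bp sample; the 10 overlapping dinucleotides give counts in the range 0..10.
from itertools import product

DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]   # 16 compositional features

def extract_features(seq11):
    assert len(seq11) == 11 and set(seq11) <= set("ACGT")
    positional = list(seq11)                                      # 11 positional features
    counts = {d: 0 for d in DINUCLEOTIDES}
    for i in range(len(seq11) - 1):
        counts[seq11[i:i + 2]] += 1
    compositional = [counts[d] for d in DINUCLEOTIDES]
    return positional + compositional                             # 27 raw features in total

print(extract_features("CAGAAGTATGC"))   # hypothetical 11-bp sample
```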

Feature introduction

Suppose the proportion of the kth class samples in sample set D is pk (k = 1,2); then, the information entropy of D is defined as:

$$ H(D)=-\sum \limits_{k=1}^2{p}_k{\log}_2{p}_k $$
(1)

Given a feature Xi (1 ≤ i ≤ 27) that has r (2 ≤ r ≤ 11) status values {s1, s2, …, sj, …, sr}, the information gain [41] that Xi brings to D can be calculated by:

$$ Gain\left(D,{X}_i\right)=H(D)-\sum \limits_{j=1}^r\frac{\mid {D}^j\mid }{\mid D\mid }H\left({D}^j\right) $$
(2)

where Dj denotes the samples in D for which Xi takes the status value sj (1 ≤ j ≤ r), and H(Dj) is the information entropy of Dj.

From the features whose information gains are above the average level, we pick out the one that has the highest gain ratio to be the first introduced feature. Here, the gain ratio of Xi is defined as:

$$ GainRatio\left(D,{X}_i\right)=\frac{Gain\left(D,{X}_i\right)}{IV\left({X}_i\right)} $$
(3)

where

$$ IV\left({X}_i\right)=-\sum \limits_{j=1}^r\frac{\left|{D}^j\right|}{\left|D\right|}{\log}_2\frac{\left|{D}^j\right|}{\left|D\right|} $$
(4)

and IV(Xi) is the intrinsic value of Xi.

Next, we introduce the remaining features one by one as follows.

  • Step 1: Conditioned on the features that have already been introduced, further compress the 2 × r contingency table of each remaining feature according to the compression process described previously. If the r columns are compressed into one column, the remaining feature cannot be introduced. If the r columns are not compressed into one column, the remaining feature is a candidate for introduction.

  • Step 2: Calculate the information gain of every candidate feature. Then, from the candidate features whose information gains are above the average level, pick out the one with the highest gain ratio to be the next introduced feature.

  • Step 3: Repeat steps 1 and 2 until no feature can be introduced.
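
The entropy, information gain, and gain-ratio computations of Eqs. (1)–(4) can be sketched as follows; feature values and class labels are assumed to be given as parallel lists, and the function names are illustrative.

```python
# Information gain and gain ratio (Eqs. 1-4) for one feature over a sample set.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain_and_ratio(feature_values, labels):
    n = len(labels)
    gain, iv = entropy(labels), 0.0
    for value in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        w = len(subset) / n
        gain -= w * entropy(subset)          # subtract weighted entropy of D^j
        iv -= w * math.log2(w)               # intrinsic value IV(X_i)
    return gain, (gain / iv if iv > 0 else 0.0)

# Toy example: a binary feature over 6 samples labeled 1 (true site) / 0 (false site)
print(info_gain_and_ratio(["AT", "AT", "CG", "CG", "AT", "CG"], [1, 1, 1, 0, 0, 0]))
```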

Decision table design

The introduced features with their status values will form various decision rules. Taking HS3D-train1:135 as an example, 27 introduced features (including 11 positional features and 16 compositional features) have formed 201 decision rules (see Additional file 1: Table S1). We separately count the numbers of positive and negative samples that conform to the decision rules, and then construct a 2 × 201 imbalanced decision table (Table 2). In Table 2, the decision rule “(P3 = A)(P− 1 = ACT)(0 ≤ fGT ≤ 2)” represents position 3 taking a value of A and position − 1 taking a value of ACT, while the frequency of dinucleotide GT takes values from 0 to 2. Other decision rules have similar representations. Given that the number of negative samples far exceeds that of positive samples, to resolve the imbalanced pattern classification problem, we adjust the decision weight of negative samples in each column of Table 2, i.e., multiply the number of negative samples in each column by θ (here, θ = 2000/271,132), and then get a 2 × 201 balanced decision table (Table 3).

Table 2 Imbalanced decision table based on HS3D-train1:135
Table 3 Balanced decision table based on HS3D-train1:135

When using the balanced decision table for making decisions, suppose a test sample meets the decision rule “(P3 = A)(P− 1 = ACT)(0 ≤ fGT ≤ 2)”, first we assume that it is positive, replace 5 with 5 + 1, and calculate the corresponding chi-square value \( {\upchi}_{i+}^2 \). Then, we assume that it is negative, replace 350.5 with 350.5 + 1, and calculate the corresponding chi-square value \( {\upchi}_{i-}^2 \). If \( {\upchi}_{i+}^2>{\upchi}_{i-}^2 \), the test sample is predicted to be positive; otherwise, it is predicted to be negative. The decision process based on an imbalanced decision table is similar.
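
A hedged sketch of this decision step is shown below. It assumes the chi-square statistic is computed over the full 2 × m decision table (consistent with the worked example given in the reviewers' section), and that the row of negative counts has already been scaled by θ for the balanced table; the function name and toy counts are ours.

```python
# Decide the class of a test sample that matches decision rule `rule_index`,
# by comparing the chi-square value of the table under the "positive" and
# "negative" hypotheses. `table` is 2 x m: row 0 = positive counts,
# row 1 = negative counts (already multiplied by theta for a balanced table).
import numpy as np
from scipy.stats import chi2_contingency

def predict(table, rule_index):
    as_pos = table.astype(float).copy()
    as_neg = table.astype(float).copy()
    as_pos[0, rule_index] += 1          # hypothesis: the test sample is positive
    as_neg[1, rule_index] += 1          # hypothesis: the test sample is negative
    chi2_pos = chi2_contingency(as_pos, correction=False)[0]
    chi2_neg = chi2_contingency(as_neg, correction=False)[0]
    return "positive" if chi2_pos > chi2_neg else "negative"

# Toy balanced table with 4 decision rules (hypothetical counts)
toy = np.array([[46.0, 11.0, 18.0, 12.0],
                [3.1,  9.7, 25.4, 48.8]])
print(predict(toy, rule_index=1))
```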

Performance evaluation

Sensitivity (SN), specificity (SP), and the Matthew correlation coefficient (MCC) as common measures for evaluating binary classifications are defined as follows:

$$ SN=\frac{TP}{TP+ FN} $$
(5)
$$ SP=\frac{TN}{TN+ FP} $$
(6)
$$ MCC=\frac{TP\times TN- FN\times FP}{\sqrt{\left( TP+ FN\right)\times \left( TP+ FP\right)\times \left( TN+ FP\right)\times \left( TN+ FN\right)}} $$
(7)

Here, TP, FP, TN, and FN denote the numbers of true positives, false positives, true negatives, and false negatives, respectively. SN represents the percentage of positive samples correctly predicted as true. SP represents the percentage of negative samples correctly predicted as false. MCC takes into account true and false positives and negatives and is generally regarded as a balanced measure. However, when the class distribution of the testing set is imbalanced, the MCC value becomes relatively small and may not truly reflect the performance of a classification model.

The global accuracy index Q9 [42] is independent of the class distribution and has been used by some researchers to evaluate the classifier performance in splice site prediction. Therefore, in this study, we choose Q9 as the measure of global accuracy to assess predictive performance in the case of an imbalanced testing set. Q9 is defined as follows:

$$ {Q}^9=\left(1+{q}^9\right)/2 $$
(8)

where

$$ {q}^9=\begin{cases}\left( TN- FP\right)/\left( TN+ FP\right), & \text{if}\ TP+ FN=0\\ \left( TP- FN\right)/\left( TP+ FN\right), & \text{if}\ TN+ FP=0\\ 1-\sqrt{2}\sqrt{{\left[ FN/\left( TP+ FN\right)\right]}^2+{\left[ FP/\left( TN+ FP\right)\right]}^2}, & \text{if}\ TP+ FN\ne 0\ \text{and}\ TN+ FP\ne 0\end{cases} $$

The receiver operating characteristic (ROC) curve, which is widely used in evaluating the predictive accuracy of statistical predictors, is given by SN against 1 − SP. When dealing with highly skewed datasets, the Precision–Recall (PR) curve can provide better insight into an algorithm’s performance [43]. The areas under ROC and PR curves are denoted by AUC-ROC and AUC-PR, respectively. AUC-ROC and AUC-PR are estimated using the Davis–Goadrich method [43]. The closer the values of AUC-ROC and AUC-PR get to 1, the better the prediction model.
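
For reference, a small helper for the measures above (MCC from Eq. 7 and Q9 from Eq. 8 with the piecewise q9) might look as follows; the confusion counts in the example are invented.

```python
# MCC (Eq. 7) and the global accuracy index Q9 (Eq. 8).
import math

def q9_score(tp, fp, tn, fn):
    if tp + fn == 0:
        q9 = (tn - fp) / (tn + fp)
    elif tn + fp == 0:
        q9 = (tp - fn) / (tp + fn)
    else:
        q9 = 1 - math.sqrt(2) * math.sqrt((fn / (tp + fn)) ** 2 + (fp / (tn + fp)) ** 2)
    return (1 + q9) / 2

def mcc(tp, fp, tn, fn):
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    return (tp * tn - fn * fp) / denom if denom else 0.0

tp, fp, tn, fn = 743, 62, 734, 53      # hypothetical confusion counts
print(q9_score(tp, fp, tn, fn), mcc(tp, fp, tn, fn))
```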

Results and discussion

Advantage with the short window size of 11 bp

Based on HS3D-train1:1 and HS3D-test1:1, independent tests were performed to compare the performance of χ2-DT using various window sizes. The results (Table 4) show the following: 1) Compared with longer window sizes (e.g., 20 bp, 40 bp, 138 bp), χ2-DT with the window size of 11 bp achieves higher independent test accuracy. This indicates that overly long window sizes may introduce irrelevant sequence information and thereby reduce prediction accuracy. 2) A short window size reduces the feature dimension and saves computational time. For example, with the window size of 138 bp, the feature dimension is 154 (138 positional features and 16 compositional features); with the window size of 11 bp, the feature dimension drops to 27 (11 positional features and 16 compositional features), and the elapsed time decreases by about 96% on the same computer system (Intel Core i5-3320M 2.6 GHz/8 GB RAM). Therefore, we are more confident in the short window size of 11 bp. The follow-up results are all based on the 11-bp window size.

Table 4 Independent test accuracy based on various window sizes

Superior performance with large extremely imbalanced dataset

For HS3D-test1:1, we respectively used imbalanced and balanced decision tables that were built based on various HS3D training sets to make decisions. The independent test results are given in Table 5.

Table 5 Independent test accuracy based on imbalanced and balanced decision tables

The results indicate the following: 1) When training sets are imbalanced, a balanced decision table can accurately predict donor splice sites. For the balanced decision table, MCC remains stable (0.847–0.867) across training sets with different positive-to-negative ratios. By contrast, for the imbalanced decision table, MCC continually drops as the number of negative training samples increases, declining from 0.847 (HS3D-train1:1) to 0.694 (HS3D-train1:135). Therefore, the follow-up results are produced using balanced decision tables. 2) Taking full advantage of the training samples can improve predictive accuracy. Using a balanced decision table, MCC keeps growing as the number of negative samples increases, and when the negative sample quantity peaks (271,132), MCC is at its highest (0.867).

Based on the same input features (11 positional features and 16 compositional features), χ2-DT was compared with traditional classifiers, including RF, ANN, and the relaxed variable kernel density estimator (RVKDE) [44]. We selected RVKDE as a comparison classifier because it can deliver the same level of accuracy as SVM with lower time complexity when the training set is very large. We used Weka 3.8.1 software (https://www.cs.waikato.ac.nz/ml/weka/index.html) and the neural network toolbox [45] of Matlab R2015a to build the RF and ANN classifiers, respectively, and all parameters took default values. The performance comparisons again employed independent tests based on HS3D-test1:1, HS3D-train1:1, and HS3D-train1:135; the corresponding results are given in Table 6.

Table 6 Independent test accuracy based on different classifiers

The results indicate the following: 1) Using the extremely imbalanced training set, χ2-DT outperforms all of the other classifiers. As Table 6 shows, based on HS3D-train1:1, MCC of χ2-DT is 0.847, which is comparable to those of RF, ANN, and RVKDE. In contrast, based on HS3D-train1:135, MCC of χ2-DT rises to 0.867, and is significantly higher than those of the other classifiers (0.248–0.353). 2) With the large training set, χ2-DT has an advantage with regard to computational speed. We ran all of the simulations on an Intel Core i5-3320M 2.6 GHz/8 GB RAM system. For HS3D-train1:135, the elapsed time of χ2-DT was just 89 s, while RVKDE took more than 32 h. This speed of χ2-DT is due to the fact that no parameters need to be optimized.

Good robustness against DNA sequencing errors

In BG-570 dataset, setting the window size as 11 bp (including positions − 3 to − 1 upstream of the conserved GT and positions + 1 to + 8 downstream of it, but excluding the conserved GT), we can extract 2127/149,039 true/false donor splice site-containing sequences from 570 original DNA sequences to constitute a testing set called BG-570orig, and extract 2081/149,572 true/false donor splice site-containing sequences from 570 mutated DNA sequences to constitute another testing set called BG-570muta. Based on HS3D-train1:135, the independent test results respectively employing the positional features and the combination of positional and compositional features are shown in Table 7.

Table 7 Independent test accuracy based on different features

The MCC values in Table 7 are low (0.329–0.352) due to the highly imbalanced testing sets. To effectively assess predictive performance, the global accuracy index Q9, which is invariant to class skew, was added for evaluation purposes. The comparative results demonstrate the following: 1) The compositional features are tolerant of frameshift errors in DNA sequencing. Based on the positional features alone, the Q9 obtained on BG-570muta is 0.9114, lower than that on BG-570orig (0.9258). However, after adding the compositional features, Q9 rises back to 0.9239 when still tested on BG-570muta. 2) Whether or not there are frameshift errors in the testing sets, χ2-DT achieves satisfactory performance (Q9 ≥ 0.92).

Better performance in comparison with existing methods

10-fold cross validation was applied to assess the predictive performance of χ2-DT, with the aim of comparing it with existing methods. To perform 10-fold cross validation, the dataset was randomly divided into ten non-overlapping subsets of equal size. In each repetition, one subset was used as a testing set and the remaining nine subsets were used as a training set. Based on each training set, we built a balanced decision table independently. The average of ten values of predictive accuracy was used as the final accuracy. All comparisons were carried out in the HS3D datasets, and the 10-fold cross accuracy values of the methods for comparison were obtained directly from the corresponding references.
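
A minimal sketch of the 10-fold splitting protocol described above (ten non-overlapping subsets, one held out per repetition) is shown below; the index handling and naming are ours.

```python
# Manual 10-fold cross-validation split: shuffle indices once, slice into ten
# non-overlapping folds, and yield (train, test) index lists per repetition.
import random

def ten_fold_indices(n_samples, seed=0):
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::10] for i in range(10)]
    for k in range(10):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

for train_idx, test_idx in ten_fold_indices(100):
    pass   # build a balanced decision table on train_idx, evaluate on test_idx
```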

On the one hand, χ2-DT was compared with the methods using longer window size (≥100 bp), including a first-order Markov model combined with a dinucleotide-based hidden Markov model (MM1-H2MM) [31], SVM with a Bayes kernel (SVM-B) [25], and Meher’s method [13]. The web server (MaLDoSS) based on Meher’s method is available at http://cabgrid.res.in:8080/maldoss. The 10-fold cross accuracy of χ2-DT was calculated based on HS3Dall. Table 8 shows that χ2-DT with much shorter window size can achieve better predictive performance, despite the degree of imbalance of the training set being higher.

Table 8 10-fold cross accuracy based on comparisons with the long-window size-based methods

On the other hand, χ2-DT was compared with methods using a short window size (9 bp). The maximum entropy model (MEM) [46] and SAE [12] are typical methods for predicting donor splice sites with a short window size. The web server (MaxEntScan) based on MEM is available at http://genes.mit.edu/burgelab/maxent/Xmaxentscan_scoreseq.html. The web server based on SAE is available at http://cabgrid.res.in:8080/sspred. Based on the HS3D datasets with different ratios of positive-to-negative samples (i.e., 2796:2796, 2796:5000, 2796:10,000, 2796:15,000), the AUC-ROC and AUC-PR values of SAE, MEM, WMM, MDD, and the first-order Markov model (MM1) were calculated with a 9-bp window size (the MEM, WMM, MDD, and MM1 scores were obtained by running MaxEntScan). For comparison, we also calculated the AUC-ROC and AUC-PR values of χ2-DT based on HS3DI, HS3DII, HS3DIII, and HS3DIV. The results (Table 9) show that the predictive performance of χ2-DT is clearly superior to that of all the other methods. As the degree of imbalance of the dataset increased, the AUC-PRs of all methods continuously declined, partly because the evaluation indicator AUC-PR is sensitive to class skew. However, the AUC-PRs of the methods other than χ2-DT declined more dramatically. For example, when the degree of imbalance peaked (2796:15,000), the AUC-PRs of the other methods were around 0.68, a decline of up to 28%, while the AUC-PR of χ2-DT was 0.85, a decline of only about 10%.

Table 9 10-fold cross accuracy based on comparisons with the short-window size-based methods

Conclusions

Based on the short window size of 11 bp, a high-performance method for predicting donor splice sites, called χ2-DT, was proposed. In terms of accuracy, χ2-DT is clearly superior to the methods for comparison. With regard to computational speed, χ2-DT is fast, even when using a large training set with more than 270,000 samples, because no parameters need to be optimized during model training. Furthermore, the independent test results based on the BG-570 dataset indicate that χ2-DT has relatively good robustness against frameshift errors in DNA sequencing, due to the addition of compositional features.

In future research, we plan to focus on the following: 1) We will attempt to combine more valuable features (e.g., DNA structural properties) for characterizing the candidate splice sites, in pursuit of better predictive performance. 2) When χ2-DT is applied to predicting acceptor splice sites, it does not further improve the predictive accuracy of existing methods, so it is necessary to devise another optimal model for acceptor sites. 3) The detection of splice sites ultimately involves identifying genes, so our overall goal is to constantly improve the proposed splice site predictor, and then use it to find genes.

Reviewers’ comments

Reviewer’s report 1

Ryan McGinty, Ph.D.

Reviewers’ comments

Reviewer summary:

Zeng, et al. have created a new computational method, X2-DT, for predicting gene splice donor sites that uses a very small window size (11 bp), is robust to a very low true/false ratio in their training data set, and runs efficiently. This method appears to perform as well or better than previous methods which were compared in this study. It would benefit the manuscript for the authors to make a clearer case for the usefulness and applicability of their tool.

Reviewer recommendations to authors:

Major suggestions: Clarify how X2-DT could be used by others and why it would be useful. The authors state that X2-DT can be used for the “prediction of splice sites in short reads generated by next-generation sequencing.” However, it is never stated whether it should be applied to short reads generated from genomic sequencing or transcriptome sequencing. From the context, it would appear to be the latter, as genomic short read sequences are assembled into longer fragments and therefore the read length is irrelevant to the window size. Incidentally, there also exists a field dedicated to predicting splice site strength from genomic sequences, rather than RNA sequences. Assuming the short reads in question here are from transcriptome sequencing, the issue that the authors propose to solve can be described more clearly. In this case, the short reads should contain spliced mRNA sequences, and so the issue becomes whether there is enough sequence context on either side of the splice site to unambiguously map the splice junction without the need for computational prediction. The authors suggest that “high-throughput DNA sequencing technologies produce billions of short reads with lengths of about 50 bp [32], while most splice site prediction methods need long sequences (≥100 bp).”In fact, the read length of the most commonly-used platform varies from 50 to 150 bp or more, and paired-ends can be utilized to increase the likelihood of capturing a splice junction. A 2015 study [“The impact of read length on quantification of differentially expressed genes and splice junction detection.” Chhangawala, et al.] finds that “there is little difference for the detection of differential expression regardless of the read length,” however, “splice junction detection significantly improves as the read length increases.” From this, we can assume that someone designing a sequencing study to discover splice junctions and other features of the transcriptome would have generated > 100 bp paired-end reads from the outset, and would not benefit greatly from the new tool presented here. However, the authors could highlight the usefulness of X2-DT in discovering splice junctions from sequencing studies where differential expression rather than transcriptome profiling was the initial aim of the study, and thus shorter reads were generated. To this end, I would suggest the authors conduct the following analyses: First, perform a parallel analysis of the same data used in Chhangawala, et al. 2015 (see above), showing the ability of X2-DT to augment splice site detection from RNA-seq data of various read lengths. For instance, does running X2-DT on 50 bp reads find as many splice sites as 100 bp reads without X2-DT? Does X2-DT improve on 100 bp reads at all? The authors could thus make the case for using their tool in very practical terms by showing that it is the equivalent of adding N bp to the sequencing read length. Next, perform a meta-analysis of the read lengths used across all RNA-seq studies, to show the magnitude of the untapped source of new splice junctions in existing RNA-seq data, which can now be found due to the unique short-window analysis of X2-DT. Combined with the above new analysis, it may be possible to estimate how many novel splice junctions can be found per transcriptome, how many transcriptomes currently exist to be analyzed, and therefore some rough estimate of the potential biological impact of this study.

Authors’ response: We appreciate the detailed recommendations made by the reviewer. This study is limited to DNA sequence data generated from genomic sequencing. As mentioned by the reviewer, genomic short read sequences are assembled into longer fragments before splice site prediction, so it is inappropriate to highlight the argument that χ2-DT can predict the splice sites in short reads generated by next-generation sequencing. In the revised manuscript, we removed this inappropriate argument.

However, for our method, the use of a short window size (11 bp) is necessary for improving prediction accuracy and simplifying the prediction model. In the revised manuscript, we discussed the benefits that the 11-bp window size brings. Based on HS3D-train1:1 and HS3D-test1:1, independent tests were performed to compare the performance of χ2-DT using various window sizes. The results (Table 10) show the following: 1) Compared with longer window sizes (e.g., 20 bp, 40 bp, 138 bp), χ2-DT with the 11-bp window size can achieve higher independent test accuracy. This indicates that overly long window sizes may introduce irrelevant sequence information and thereby reduce prediction accuracy. 2) A short window size reduces the feature dimension and saves computational time. For example, with the 138-bp window size, the feature dimension is 154 (138 positional features and 16 compositional features); with the 11-bp window size, the feature dimension drops to 27 (11 positional features and 16 compositional features), and the elapsed time decreases by about 96% on the same computer system.

Table 10 Independent test accuracy based on various window sizes

Additionally, as described in the results and discussion part of the manuscript, χ2-DT using the 11-bp window size was compared with several existing approaches that used longer window sizes (e.g., 140 bp). The results (Table 8) indicate that χ2-DT obtains better predictive performance.

From the results above, we are more confident in the short window size we used. Even so, we still believe that correct identification depends more on the proposed method itself. As shown in the results and discussion part of the manuscript, when compared with three traditional classifiers (RF, ANN, RVKDE) that are given the same input feature vectors, our method obtains higher prediction accuracy (see Table 6); when compared with other splice site prediction approaches that also use a short window size (e.g., 9 bp), our method is found to perform better (see Table 9). Therefore, although many computational methods have already been developed for predicting splice sites, our method provides a supplement to the commonly used splice site prediction methods because of its high performance, and is believed to contribute to the prediction of eukaryotic gene structure.

Necessary changes have been made in the revised manuscript: Title of manuscript, Abstract, Keywords, page 4 (lines 108–110), page 5 (line 123), page 8 (lines 200–201), page 12 (lines 294–307), page 15 (lines 373,378,381–383), page 16 (line 401), titles of Tables 8 and 9; we add Table 4 and reference [32].

Minor issues:

The authors compare their work to several existing methods. While these methods are categorized and listed by their method or strategy, it would be of some practical use to know the name of each tool being compared in each table. Stylistically, I would prefer more detailed explanations of the methods that might help the study be understood by a broader audience. As written, there is a heavy prerequisite for knowledge of statistical and computational methods, including much undefined terminology. This knowledge is likely not shared by many readers interested in the biology of splicing, or with a practical need to employ the best splicing prediction program.

Authors’ response: SVM-B, WMM, MDD, MM1 and MEM are conventional models for predicting splice sites, and they are often used for comparison with newly presented methods. In Table 9, the results of MEM, MDD, WMM and MM1 were obtained by executing MaxEntScan (a web server), which is available at http://genes.mit.edu/burgelab/maxent/Xmaxentscan_scoreseq.html. Meher’s method and SAE are recently developed methods for donor splice site prediction. Based on Meher’s method, a web server (MaLDoSS) has been developed and is available at http://cabgrid.res.in:8080/maldoss. The web server based on SAE is available at http://cabgrid.res.in:8080/sspred. As per the suggestion, we have supplemented the above contents in the revised manuscript for the convenience of practical application. As for the writing, we have checked and revised the manuscript carefully to avoid undefined terminology, in the hope that readers interested in this study can understand it.

Necessary changes have been made in the revised manuscript: page 15 (lines 364, 375–377, 383), page 16 (lines 384–388).

Reviewer’s report 2

Dirk Walther

Reviewers’ comments

Reviewer summary:

Prediction of splice-sites has been a long-standing problem in Bioinformatics and many algorithms have been developed, essentially exhausting all possible ways to formulate and solve the computational problem. Despite the many methods and their reasonable success, and despite the increased availability of transcript sequencing data which allow determining splices sites based on experimental information, this reviewer is willing to be open to new in-silico methods. Clearly, correct splice site prediction would help tremendously for genome annotation purposes.

Reviewer recommendations to authors:

(1) The authors highlight as an advantage and pose as a need to base predictions on short sequence motifs (11mers) as necessitated by the short available sequence reads from DNAseq data. Though, I would think, splices site predictions would always be applied to assembled genomes or genes, not individual reads. So for me, this is not an argument at all. The length of the k-mer should reflect what is truly necessary for correct identifications. That aside, I still believe it is interesting to see how well methods based on short k-mers can work.

Authors’ response: We agree with the recommendations given by the reviewer. In the revised manuscript, we removed the inappropriate argument that χ2-DT can predict the splice sites in short reads generated by next-generation sequencing. We also changed “short sequence” in the title to “short window size”, which we think is more appropriate.

In this study, the use of a short window size (11 bp) is necessary for improving prediction accuracy and simplifying the prediction model. In the revised manuscript, we discussed the benefits that the 11-bp window size brings. Based on HS3D-train1:1 and HS3D-test1:1, independent tests were performed to compare the performance of χ2-DT using various window sizes. The results (Table 11) show the following: 1) Compared with longer window sizes (e.g., 20 bp, 40 bp, 138 bp), χ2-DT with the 11-bp window size can achieve the highest independent test accuracy. This indicates that overly long window sizes may introduce irrelevant sequence information and thereby reduce prediction accuracy. 2) A short window size reduces the feature dimension and saves computational time. For example, with the 138-bp window size, the feature dimension is 154 (138 positional features and 16 compositional features); with the 11-bp window size, the feature dimension drops to 27 (11 positional features and 16 compositional features), and the elapsed time decreases by about 96% when running on the same computer system.

Table 11 Independent test accuracy based on various window sizes

Additionally, as described in the results and discussion part of the manuscript, χ2-DT using the 11-bp window size was compared with several existing approaches that used long window sizes (e.g., 140 bp). The results (Table 8) indicate that χ2-DT obtains better predictive performance.

Necessary changes have been made in the revised manuscript: Title of manuscript, Abstract, Keywords, page 4 (lines 108–110), page 5 (line 123), page 8 (lines 200–201), page 12 (lines 294–307), page 15 (lines 373,378,381–383), page 16 (line 401), titles of Tables 8 and 9; we add Table 4 and reference [32].

(2) The study reports results on donor sites only. The authors state that with regard to acceptor sites, no performance gain has been achieved, leading me to believe that performance was at least comparable. This should be discussed more - why gain for donor sites, not acceptor sites. Also, this point should be mentioned much sooner in the manuscript than in the very last paragraph. Furthermore, the equal performance of their method relative to others should be documented.

Authors’ response: We determined an 18-bp window size (−17 to +1) by chi-square test for predicting acceptor splice sites. Using 2880/28,800 true/false acceptor splice sites from the HS3D dataset, 10-fold cross validation was applied to assess the performance of χ2-DT, and the predictive accuracy is: SN = 0.8901, SP = 0.8751, Q9 = 0.8826. Based on the same dataset, the Q9 values achieved by SVM-B and MM1-H2MM are 0.8951 and 0.9057, respectively, which are slightly higher than that of our method.

χ2-DT employs positional features and compositional features. However, for acceptor sites, we found that positional and compositional features were not enough to characterize the candidate samples; some other valuable features, such as DNA structural properties [19], may need to be involved. We are working on a new model for predicting acceptor splice sites with improved prediction accuracy, and the related research will be reported in a forthcoming paper.

(3) The method section needs a better introduction/motivation. I had difficulties grasping the basic rationale of the method. In fact, I am not sure, I did. I could not follow the arguments with regard to “compressing that tables” at all. More explanation is needed.

Authors’ response: Let’s begin with the maximal information coefficient (MIC) [39]. Just like the Pearson correlation coefficient [37] and mutual information estimators [38], which are used for identifying relationships between variables, MIC is a novel measure proposed to capture dependences between paired variables. Given independent paired variables {xi, yi} (i = 1, 2, …, 20), with xi, yi ∈ (0, 1), as shown in the following:

To calculate the MIC value of x and y, a maximum grid solution (a 2 × 9 grid, i.e., y and x are partitioned into 2 bins and 9 bins, respectively) with the highest induced mutual information will be searched, and a 2 × 9 table (Table 12) is generated for counting the number of samples falling into each grid cell.

Table 12 2 × 9 table for counting the number of the samples in each grid

The MIC value of x and y calculated based on the 2 × 9 grid (2 × 9 table) will reach 1, which is clearly illogical, because the MIC value should tend to 0 for statistically independent variables. Thus, to avoid producing nontrivial MIC values due to excessive grid partitions, the ApproxMaxMI algorithm [39] sets n^0.6 as the maximal grid size restriction; here, n is the sample size. Then, a 2 × 3 grid would be generated to partition the data, and the corresponding MIC value falls to 0.31. So the 2 × 9 table is compressed into a 2 × 3 table (Table 13).

Table 13 2 × 3 table for counting the number of the samples in each grid

Recently, our research group presented the ChiMIC algorithm [40] for calculating the MIC value. ChiMIC uses a chi-square test based on a local r × 2 grid to determine whether a new endpoint should be introduced, and removes the maximal grid size limitation of ApproxMaxMI. For the example above, the grid partition generated by ChiMIC is a 2 × 2 grid, and the corresponding MIC value is only about 0.11, which is much closer to 0. This means that further compressing the 2 × 3 grid (2 × 3 table) is reasonable.

Similarly, for each position in donor splice site-containing sequences, we can build a 2 × 4 contingency table to respectively count the frequencies of four bases in positive and negative samples. Is the 2 × 4 table reasonable? Could it be compressed into a 2 × 3 table, or even a 2 × 2 table? Taking position 6 as an example, its 2 × 4 contingency table is finally compressed into a 2 × 2 contingency table, according to the ChiMIC algorithm (see Fig. 1).

Moreover, if we do not compress the 2 × 4 contingency table of each position, we will get a 2 × 4^11 decision table after introducing the 11 positional features; with the further introduction of features, the number of columns in the decision table will grow exponentially, and the decision table would become quite sparse.

Therefore, we compressed the 2 × r contingency table of each feature, including both positional and compositional features. The results indicate that the compression strategy is effective for correct prediction.

Necessary changes have been made in the revised manuscript: page 6 (lines 155–157), page 7 (lines 172–176); we add references [37, 38].

(4) Despite trying, I had difficulties understanding, where and how the imbalance was tested (during training or during testing or both?) Try to be more clear about it. So, in essence, I was not able to assess whether the claimed improved performance on this imbalanced problem was, in fact, achieved.

Authors’ response: The number of false donor sites far exceeds that of true donor sites, e.g., the HS3D dataset contains 2796/271,937 true/false donor sites. If all negative samples (false sites) are employed for building the prediction model, the extremely imbalanced large training samples will lead to poor predictive results for many methods.

We give an example to explain how we resolve the imbalanced pattern classification problem. Suppose there are 87/1687 positive/negative training samples. If only 2 positional features (positions −1 and 3) are introduced and form 4 decision rules, we separately count the numbers of positive and negative samples that conform to the decision rules and get a 2 × 4 imbalanced decision table (Table 14).

Table 14 Imbalanced decision table

Given a positive testing sample, suppose its positions −1 and 3 both take a value of G; the testing sample will then conform to the decision rule “(P− 1 = G)(P3 = GC)”. In Table 14, replace 11 with 11 + 1 and calculate the corresponding chi-square value \( {\upchi}_{i+}^2 \) (109.2); similarly, replace 188 with 188 + 1 and calculate the corresponding chi-square value \( {\upchi}_{i-}^2 \) (110.1). Here, \( {\upchi}_{i+}^2<{\upchi}_{i-}^2 \), so this testing sample is wrongly predicted to be negative according to the imbalanced decision table.

Now, we adjust the decision weight of the negative samples in each column, i.e., multiply the number of negative samples in each column by 87/1687, and then get a balanced decision table (Table 15). In Table 15, replace 11 with 11 + 1 and calculate the corresponding chi-square value \( {\upchi}_{i+}^2 \) (46.2); replace 9.7 with 9.7 + 1 and calculate the corresponding chi-square value \( {\upchi}_{i-}^2 \) (45.9). Here, \( {\upchi}_{i+}^2 \) > \( {\upchi}_{i-}^2 \), so the testing sample is predicted to be positive. Therefore, in the case of an imbalanced training set, the use of a balanced decision table can still make correct decisions.

Table 15 Balanced decision table

In this study, we use 2000/271,132 positive/negative samples to generate an extremely imbalanced training set (HS3D-train1:135), and use 796/796 positive/negative samples to generate a balanced testing set (HS3D-test1:1). The independent test results (Table 6) based on HS3D-train1:135 and HS3D-test1:1 show that when the training set is imbalanced, the MCC value of our method is 0.867, while the MCC values of the other traditional classifiers (RF, ANN, and RVKDE) are about 0.25–0.35. Furthermore, Table 5 shows that the MCC values obtained by the balanced decision table keep growing as the number of negative training samples increases, i.e., rise from 0.847 to 0.867, which indicates that taking full advantage of the training samples can improve predictive accuracy.

Necessary changes have been made in the revised manuscript: page 10 (lines 251–253).

Minor issues

Generally, the article is well written (English-wise). Some minor mistakes need correcting. For example, use present tense in the Abstract when talking about your method and results (“The proposed method presents” (not “presented”). Check use of the definite articles.

Authors’ response: Following the suggestions of reviewer, we have made language corrections, including tense and use of the definite articles.

Necessary changes have been made in the revised manuscript: Abstract, page 14 (line 346).

Abbreviations

χ2-DT:

Chi-square decision table

ANN:

Artificial neural network

AUC-PR:

Area under the PR curve

AUC-ROC:

Area under the ROC curve

HS3D:

Homo Sapiens Splice Sites Dataset

MCC:

Matthew correlation coefficient

MDD:

Maximal dependency decomposition

MEM:

Maximum entropy model

MIC:

Maximal information coefficient

MM1:

first-order Markov model

MM1-H2MM:

First-order Markov model combined with a dinucleotide-based hidden Markov model

PR:

Precision–Recall

RF:

Random forest

ROC:

Receiver operating characteristic

RVKDE:

Relaxed variable kernel density estimator

SAE:

The sum of absolute error

SN:

Sensitivity

SP:

Specificity

SVM:

Support vector machine

SVM-B:

SVM with a Bayes kernel

WMM:

weighted matrix model

References

  1. Baten AKMA, Chang BCH, Halgamuge SK, Li J. Splice site identification using probabilistic parameters and SVM classification. BMC Bioinformatics. 2006;7:15.

  2. Burset M, Seledtsov IA, Solovyev VV. Analysis of canonical and non-canonical splice sites in mammalian genomes. Nucleic Acids Res. 2000;28(21):4364–75.

  3. Sören S, Schweikert G, Philips P, Behr J, Rätsch G. Accurate splice site prediction using support vector machines. BMC Bioinformatics. 2007;8(Suppl 10):7.

  4. Degroeve S, Saeys Y, Baets BD, Rouzé P, Peer YVD. SpliceMachine: predicting splice sites from high-dimensional local context representations. Bioinformatics. 2005;21(8):1332–8.

  5. Huang J, Li T, Chen K, Wu J. An approach of encoding for prediction of splice sites using SVM. Biochimie. 2006;88(7):929.

  6. Li JL, Wang LF, Wang HY, Bai LY, Yuan ZM. High-accuracy splice site prediction based on sequence component and position features. Genet Mol Res. 2012;11(3):3432–51.

  7. Nasibov E, Tunaboylu S. Classification of splice-junction sequences via weighted position specific scoring approach. Comput Biol Chem. 2010;34(5–6):293–9.

  8. Pertea M, Lin XY, Salzberg SL. GeneSplicer: a new computational method for splice site prediction. Nucleic Acids Res. 2001;29(5):1185–90.

  9. Staden R. Computer methods to locate signals in nucleic acid sequences. Nucleic Acids Res. 1984;12(2):505.

  10. Wei D, Zhang HL, Wei YJ, Jiang QS. A novel splice site prediction method using Support Vector Machine. J Comput Inf Syst. 2013;20:8053–60.

  11. Arita M, Tsuda K, Asai K. Modeling splicing sites with pairwise correlations. Bioinformatics. 2002;18(Suppl 1):27–34.

  12. Meher PK, Sahu TK, Rao AR, Wahi SD. A statistical approach for 5’splice site prediction using short sequence motifs and without encoding sequence data. BMC Bioinformatics. 2014;15(1):362.

  13. Meher PK, Sahu TK, Rao AR. Prediction of donor splice sites using random forest with a new sequence encoding approach. Biodata Min. 2016;9(1):4.

  14. Marashi SA, Goodarzi H, Sadeghi M, Eslahchi C, Pezeshk H. Importance of RNA secondary structure information for yeast donor and acceptor splice site predictions by neural networks. Comput Biol Chem. 2006;30(1):50–7.

  15. Patterson DJ, Yasuhara K, Ruzzo WL. Pre-mRNA secondary structure prediction aids splice site prediction. Pac Symp Biocomput. 2002;7:223–34.

  16. Buratti E, Baralle FE. Influence of RNA secondary structure on the pre-mRNA splicing process. Mol Cell Biol. 2004;24:10505–14.

  17. Mareshi S, Eslahchi C, Pezechk H. Impact of RNA structure on the prediction of donor and acceptor splice sites. BMC Bioinformatics. 2008;7:297.

  18. Sun YF, Fan XD, Li YD. Identifying splicing sites in eukaryotic RNA: support vector machine approach. Comput Biol Med. 2003;33(1):17–29.

  19. Zuo YC, Zhang PF, Li L. Sequence-specific flexibility organization of splicing flanking sequence and prediction of splice sites in the human genome. Chromosom Res. 2014;22(3):321–34.

  20. Chen W, Feng PM, Lin H, Chou KC. iSS-PseDNC: identifying splicing sites using Pseudo dinucleotide composition. Biomed Res Int. 2014;2014:623149.

  21. Hebsgaard SM, Korning P, Brunak S. Splice site prediction in Arabidopsis thaliana pre-mRNA by combining local and global sequence information. Nucleic Acids Res. 1996;24(17):3439–52.

  22. Zhang XH, Heller KA, Hefter I, Leslie CS, Chasin AL. Sequence information for the splicing of human pre-mRNA identified by support vector machine classification. Genome Res. 2003;13(12):2637.

  23. Baten A, Halgamuge S, Chang B. Fast splice site detection using information content and feature reduction. BMC Bioinformatics. 2008;9(Suppl 12):8.

  24. Maji S, Garg D. Hybrid approach using SVM and MM2 in splice site junction identification. Curr Bioinforma. 2014;9:76–85.

  25. Zhang Y, Chu CH, Chen YX, Zha HY, Ji X. Splice site prediction using support vector machines with a Bayes kernel. Expert Syst Appl. 2006;30(1):73–81.

  26. Ho LS, Rajapakse JC. Splice site detection with a higher-order Markov model implemented on a neural network. Genome Inform. 2003;14:64–72.

  27. Rajapakse JC, Ho LS. Markov encoding for detecting signals in genomic sequences. IEEE/ACM Trans Comput Biol Bioinformatics. 2005;2(2):131.

  28. Liu L, Ho YK, Yau S. Prediction of primate splice site using inhomogeneous Markov chain and neural network. DNA Cell Biol. 2007;26(7):477–83.

  29. Tripti N, Shailendra S, Neelam G. Splice site detection in DNA sequences using probabilistic neural network. Int J Comput Appl. 2013;76(4):1–4.

  30. Huang YF, Liang CP, Liou SW. Intron identification approaches based on weighted features and fuzzy decision trees. Comput Biol Med. 2011;42:112–22.

  31. Zhang Q, Peng Q, Li K, Kang X, Li J. Splice sites detection by combining Markov and hidden Markov model. In: The 2nd international conference on biomedical engineering and informatics; Tianjin, China; 2009. p. 1–5.

  32. Meher PK, Sahu TK, Rao AR, Wahi SD. Determination of window size and identification of suitable method for prediction of donor splice sites in rice (Oryza sativa) genome. J Plant Biochem Biotechnol. 2015;24(4):385–92.

  33. Zhang Q, Peng Q, Zhang Q. Splice sites prediction of human genome using length-variable Markov model and feature selection. Expert Syst Appl. 2010;37:2771–82.

  34. Burge C, Karlin S. Prediction of complete gene structure in human genomic DNA. J Mol Biol. 1997;268(1):78–94.

  35. Pollastro P, Rampone S. HS3D, a dataset of homo sapiens splice regions, and its extraction procedure from a major public database. International Journal of Modern Physics C. 2002;13(8):1105–17.

  36. Burset M, Guigó R. Evaluation of gene structure prediction programs. Genomics. 1996;34(3):367.

  37. Pearson K. Notes on the history of correlation. Biometrika. 1920;13(1):25–45.

  38. Moon YI, Rajagopalan B, Lall U. Estimation of mutual information using kernel density estimators. Phys Rev E Stat Phys Plasmas Fluids Relat Interdiscip Topics. 1995;52(3):2318–21.

  39. Reshef DN, Reshef YA, Finucane HK, Grossman SR, McVean G, Turnbaugh PJ. Detecting novel associations in large data sets. Science. 2011;334:1518–24.

  40. Chen Y, Zeng Y, Luo F, Yuan Z. A new algorithm to optimize maximal information coefficient. PLoS One. 2016;11(6):e0157567.

  41. Shang C, Li M, Feng S, Jiang Q, Fan J. Feature selection via maximizing global information gain for text classification. Knowl-Based Syst. 2013;54:298–309.

  42. Zhang CT, Zhang R. Evaluation of gene-finding algorithms by a content-balancing accuracy index. J Biomol Struct Dyn. 2002;19(6):1045–52.

  43. Davis J, Goadrich M. The relationship between Precision-Recall and ROC curves. In: Proceedings of the 23rd international conference on machine learning. Pittsburgh, Pennsylvania, United States, ACM. pp 233–240. http://dx.doi.org/10.1145/1143844.1143874.

  44. Oyang YJ, Hwang SC, Ou YY, Chen CY, Chen ZW. Data classification with radial basis function networks based on a novel kernel density estimation algorithm. IEEE Trans Neural Netw. 2005;16(1):225–36.

  45. Raida Z. Modeling EM structures in the neural network toolbox of MATLAB. IEEE Antennas Propagation Mag. 2002;44(6):46–67.

  46. Yeo G, Burge CB. Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. J Comput Biol. 2004;11:377–94.

Acknowledgments

We appreciate the thorough reading and constructive comments given by the reviewers that have significantly improved the manuscript.

Funding

This research was supported by Scientific Research Foundation of Education Office of Hunan Province, China (No. 17A096, ZY); National Natural Science Foundation of China (No. 61701177, YC); Hunan Provincial Natural Science Foundation of China (2018JJ3225, YC); Science foundation open project of Hunan Provincial Key Laboratory of Crop Germplasm Innovation and Utilization (18KFXM08, YC).

Availability of data and materials

All data generated or analyzed during this study are included in this published article and its supplementary information files.

Author information

Authors and Affiliations

Authors

Contributions

ZY conceived and designed the experiments. YZ performed the experiments. YZ, YC and HY analyzed the data. YZ and ZY wrote the manuscript. YC contributed software coding. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Zheming Yuan or Yuan Chen.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

All authors have read and approved the manuscript.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional file

Additional file 1:

Table S1. The table lists 201 decision rules obtained based on HS3D-train1:135, and lists the number of positive and negative training samples conforming to the decision rules. (XLSX 38 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Cite this article

Zeng, Y., Yuan, H., Yuan, Z. et al. A high-performance approach for predicting donor splice sites based on short window size and imbalanced large samples. Biol Direct 14, 6 (2019). https://doi.org/10.1186/s13062-019-0236-y
