Reviewer's report 1
Manyuan Long, Department of Ecology and Evolution, The University of Chicago, Chicago, United States
The authors attempted to develop a simple sensor to detect splice-site (SS) signals. A 7-mer was designed to scan a sequence. The ROC diagrams (Fig. 3) show its obvious advantage: significantly higher specificity and sensitivity than other methods. In addition, the authors also used an MHMM to detect ISE and ESE signals and used the found signals to improve SS prediction. I think the authors have developed useful new methods for SS detection and I favor publication in Biology Direct. However, I also have the following minor concerns and hope they are fixed in revision.
Page 1: "Figure 1...": this is not original; some sources should be cited (for example, the early work of Tom Schneider of NCI in 1992?). Originality is something that a paper in bioinformatics wants to emphasize.
For graphical representation of the splicing motif consensuses we extracted multiple splicing motifs from our database and used the WebLogo tool to build the logos.
Page 1: "the human transcribed region have plenty of motif...": it should be pointed out how these motifs are defined and why they are mentioned here. Are they relevant to intron splicing?
Many oligonucleotides have a composition identical to known potent splicing signals and at the same time are not supported by spliced alignment. Ab initio SS prediction has to filter out such signals to predict the correct gene structure(s).
Page 2, second paragraph: the caveat of current methods to detect SSs is pointed out: non-coding exons do not have three-periodic coding components. The idea used is signal interaction: SSs, ISE, ESE, ESS and ISS. A new gene structural annotation tool, SpliceScan, is developed and reported in this paper. The SS sensor is the key, and several major SS sensors are reviewed.
Page 3: in the proposal of a new sensor and the computation of P(7-mer and SS), why the 7-mer was chosen rather than an 8-mer or 6-mer should be explained. In addition, the sign (which I guess means "non-SS") should be defined. If my guess is correct, this equation makes sense. Biology Direct is a journal for a general biology audience, not only for computational biologists, so jargon and special signs should be avoided, or, if they have to be used, explanations should be given.
The 7-mer is the size of the donor consensus minus the GT dinucleotide, which is always the same, as can be seen in Figure 1(a). For modelling the acceptor signal the 7-mer appears to be optimal: a shorter oligonucleotide has limited capability of representing long-range positional correlations, while longer oligonucleotides produce a large combinatorial table that is difficult to learn.
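As an illustration of the count-based Bayesian weighting and the table-size tradeoff discussed in this exchange, here is a minimal sketch (the function name, smoothing scheme, and data are hypothetical; the actual sensor adds positional block structure):

```python
from collections import Counter

def heptamer_posterior(positives, negatives, prior_ss=0.5):
    """Estimate P(SS | 7-mer) from labelled training oligonucleotides.

    positives/negatives: lists of 7-mer strings observed at true splice
    sites and at decoy positions.  A count-based Bayes rule:
    P(SS|w) = P(w|SS) P(SS) / (P(w|SS) P(SS) + P(w|non-SS) P(non-SS)).
    """
    pos = Counter(positives)
    neg = Counter(negatives)
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    table = {}
    for w in set(pos) | set(neg):
        # add-one smoothing so an unseen 7-mer never gets probability 0
        p_w_ss = (pos[w] + 1) / (n_pos + 4 ** 7)
        p_w_bg = (neg[w] + 1) / (n_neg + 4 ** 7)
        num = p_w_ss * prior_ss
        table[w] = num / (num + p_w_bg * (1 - prior_ss))
    return table

# the count table grows as 4^k, which motivates stopping at k = 7:
sizes = {k: 4 ** k for k in (6, 7, 8)}
```

The `sizes` dictionary makes the combinatorial argument concrete: each extra position quadruples the number of oligonucleotides whose counts must be learned.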
Page 7: I am not sure how they identify ISEs; section 3.2 is unclear. It seems conservation is the only criterion. This might be reasonable on a narrow evolutionary scale. But given the high evolutionary rate of intronic sequence, with many insertions and deletions (indels), I am suspicious of its feasibility because of the difficulty of aligning short homologous sequences. Although I do not oppose the approach, a cautionary note should be given in the discussion, which I think will be useful to colleagues.
ISE signals are predicted using EM learning of an MHMM model on intronic fragments of human genes. The main detection criteria used are:
1. Close localization of putative signals to the intronic boundaries,
2. Constant size of putative enhancer,
3. Affinity of putative enhancing element to a certain HMM profile.
To test the hypothesis of their higher conservation compared to other oligonucleotides, we use mouse-rat intronic alignments that have substantial conserved domains.
Typo on page 12: the reference should be put in a different place to keep the reference numbering continuous.
Reviewer's report 2
Arcady Mushegian, Stowers Institute, Kansas City, United States
Good: 1. Part 2: the main idea appears to be to trade a more complex model for a larger training set. This seems to improve specificity of the splicing site detection.
Two relevant issues that are not discussed but should be: a. In Figure 3, all ROC curves are still below the non-discrimination line; is this acceptable?
We use an ROC curve different from the common True Positive Fraction vs. False Positive Fraction plot, where the diagonal is a non-discriminant test result. In our test we know the total number of positive cases, so we can build an Sn vs. 1 - Sp curve, which is more informative for application comparison purposes.
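A minimal sketch of how such an Sn vs. 1 - Sp curve can be built when the total number of true sites is known (the function name and data format are hypothetical, not from the paper):

```python
def roc_points(scored, n_positive_total):
    """Build (1 - Sp, Sn) points by sweeping a score threshold.

    scored: list of (score, is_true_site) pairs for every candidate.
    n_positive_total: known number of true sites, so sensitivity is
    measured against all of them, not just the scored candidates.
    """
    ranked = sorted(scored, key=lambda p: p[0], reverse=True)
    n_neg = sum(1 for _, is_pos in ranked if not is_pos)
    tp = fp = 0
    points = []
    for score, is_pos in ranked:
        if is_pos:
            tp += 1
        else:
            fp += 1
        sn = tp / n_positive_total          # sensitivity
        one_minus_sp = fp / n_neg if n_neg else 0.0
        points.append((one_minus_sp, sn))
    return points
```

Because `n_positive_total` is fixed externally, a method that never scores some true sites cannot reach 100% sensitivity, which is the property the response relies on.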
b. Gain of the current method is more evident in the lower FP zone, where sensitivity is also low.
All application ROC curves converge to one point with 100% sensitivity and 0% specificity. The curves differ for lower sensitivity values, where we can speculate about prediction quality. Some applications, like NetUTR and ExonScan, have sensitivity artificially limited to ~50%. Performance analysis for such applications makes sense only in the lower sensitivity quarters.
2. A repertoire of intronic splicing enhancers was detected, which is interesting.

Not so good: 1. Very unclear writing at different levels:
a. Various inconsistencies and poorly defined terms; for example, on pp. 3–4 the authors say that they compiled two test sets, and then describe three. Or on p. 4, line 8 and further: what is "cross-correlating"?
Cross-correlation means that the genes in the learning and test sets have extensive homologous regions, which favorably affects sensor performance on the test set and should be avoided for a rigorous comparison.
b. Section 3.1: MHMM is not described well: we see a mix of introductory references on general HMMs, more specialized references that may be telling us something relevant (but we do not know what), and cat's-cradle pictures that are not self-explanatory (and what about these mu parameters?).
Here we try to reach a reasonable compromise between a complete system definition and skipping the details of well-known results from the artificial intelligence community, which we reference. Please refer to the MHMMotif application source code for more details.
2. The Results section mentions the programs that work less well than SpliceScan. But we do not hear about comparison between SpliceScan (which barely gets over the non-discrimination line) and half a dozen other, more successful methods represented on the same plots. If the goal of the work was to improve the ab initio approach (cf a line in the abstract), this has to be maintained as the message throughout the paper.
Our method has a clear advantage in the case of 5' UTR gene fragment structural prediction, according to the ROC curves shown in Figures 9(e) and 9(f). For gene structural prediction in the CDS area one should use a different application, such as GenScan, since SpliceScan does not have a frame-consistent synchronization component.
Overall, this manuscript reads more like the technical report on the ongoing project than a stand-alone paper.
I declare that I have no competing interests.
Reviewer's report 3
Mikhail Gelfand, Institute of Information Transfer Problems, Moscow, Russian Federation
On "Method of predicting splice sites based on signal interactions" by A. Tchourbanov et al., submitted to "Biology Direct"
The problem of identification of donor and acceptor splicing sites is not new, but far from solved, whereas identification of sites regulating splicing (exonic/intronic splicing enhancers/silencers) has emerged relatively recently. Given the importance of both these problems for gene recognition and understanding alternative splicing, any progress in this area is most welcome.
The authors attempt to address both problems in one framework of Bayesian analysis. They apply Bayesian sensors to detection of donor and acceptor splicing sites. The exposition in this part (section 2, pp. 2–3) contains several gaps. It is not clear how well the described approach of 7-mer counting with subsequent Bayesian weighting generalizes; in particular, it seems that the sensors will not accept a completely new 7-mer as a site. If the authors implicitly claim that all possible 7-mers have already been observed in the training set, and the only problem is proper weighting, this needs to be substantiated. A helpful piece of data would be the rank distribution of 7-mers in the positive and negative sets. How many 7-mers have been observed only once in the positive set (and would be missed if only half of that set were used for training)?
With cross-correlation between the learning set and the test set removed, when testing on the set of 250 human genes we had a miss rate of 0.52% for our 5'SS sensor, which is an acceptably low value. For the 3'SS sensor the overall miss rate is negligibly low, since the sensor topology is a composite of several blocks. We show the top 40 ranking 5'SS nonamers in Table 2. We added a discussion of sensor performance related to learning set size [see Subsection "Learning set size study"], where we show that the Bayesian sensor has a preference for large learning sets. For example, the sensor could be successfully applied to recognition of the Translation Initiation Site (TIS) against upstream AUGs, where we can collect a large learning set. In the TIS sensor design we used three strategically located heptamers, so that they can catch both long-range dependencies and the initial codon bias, as shown below:
We used 42,883 TIS and 77,140 TIS-like signals from the human, mouse and rat RefSeq databases to learn our TIS Bayesian sensor, which demonstrated, in our preliminary experiments, superior performance compared to the simple Kozak consensus rule (GCC)GCCRCCAUGG (where R = G or A) and the corresponding weight matrix. However, the sensor design does not generalize well to recognition of other signal types, such as transcription factor binding sites, with very thin learning sets.
Another missing part of the exposition is a formula for combining several sensors for acceptor site analysis. Is the final score (probability) obtained by multiplying the probabilities assigned by the sensors?
The acceptor sensor uses the product of block probabilities.
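As a sketch, the combination step could look like this (the block layout and probability functions are hypothetical placeholders, not the actual sensor topology):

```python
def acceptor_score(seq, blocks):
    """Combine independent block sensors by multiplying probabilities.

    blocks: list of ((start, end), prob_fn) pairs; each prob_fn maps
    the subsequence seq[start:end] to a probability in (0, 1].
    """
    score = 1.0
    for (start, end), prob_fn in blocks:
        score *= prob_fn(seq[start:end])
    return score
```

Multiplying block probabilities assumes the blocks are conditionally independent given the site; a single weak block then drives the composite score down, which is why the overall miss rate of a composite sensor stays low.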
Given the possibility of over-fitting, the testing procedure should be designed very carefully. Description in section 2.1 (pp. 3–4) does not address several issues, the most important of which is the influence of homologous sites in the training and testing data. The authors mention that they have removed homologs from the human sets, but it is not clear whether only human paralogs have been considered, or mouse homologs as well. It is not clear also whether the rat set has been purged from homologs to sites used in training. A minor note is that the text (end of p.3) mentions two datasets, whereas three sets are listed.
The authors completely ignore the problem of alternative splicing.
Both human and mouse homologs were removed from the learning set in our experiments. Domains paralogous to the rat test set were not specifically purged from the learning set. In the case of the rat test set we were interested in a performance test on a similar but substantially diverged organism, i.e. a simulation of practical sensor application. We considered prediction of genomic structures the way they are annotated in GenBank. Indeed, some of the predicted SSs could be alternatively committed, but this is another topic for study.
The behavior of the ROC curves (Fig. 3) seems to be somewhat erratic. In particular, they are not even monotonic. Probably that means that the distribution of scores on the positive and negative sets is not unimodal. In any case, these distributions should be presented in addition to the ROC curve data. On a technical side, it would be most helpful if the data were plotted using uniform scales; otherwise it is difficult to compare curves on different plots. The authors should also explain how they produced the ROC curves for other methods: whether these had been re-programmed or some existing programs (stand-alone or internet servers) were used, what versions, etc. Otherwise these pieces of data are not easily reproducible.
The distribution of scores returned by different methods is multimodal, as shown in Figures 3(a) and 3(b). We used a large number of possible intermediate points to reproduce fine features of the curves and to avoid possible graph extrapolation between distant points. The curves were obtained using a Java web application, which sends queries of genomic structures to different online tools, collects statistics and outputs data points for ROC curve reconstruction.
The last sections of the manuscript are somewhat fuzzy. The authors identify a number of likely splicing enhancers/silencers, and then use these signals to improve site detection (section 4). However, it is absolutely not clear how this improvement is implemented, nor whether the results become stronger: the entire "Results" section (4.1) consists of two short paragraphs and a huge figure featuring the ROC curves. The test sets are not described: what portions of adjacent exons and/or introns were considered? Again, the behavior of the ROC curves in many cases looks absolutely erratic: they are convex, concave, and even zigzagging. The reasons for that are not discussed.
The results of SpliceScan become much stronger compared to the simple Bayesian SS sensor. In our algorithm we try to guess the boundaries of the region eligible for LOD scoring by looking at the surrounding putative complementing SSs. For example, for a 5' SS we consider the nearest 3' SS downstream as the beginning of the next exon, and the first upstream 3' SS as the opposite exonic boundary. Weak signals are abundant, which results in unnecessarily tight region boundaries. By requiring the region boundary candidates to be stronger than 1 (i.e. ignoring weaker signals), we substantially extend the region boundaries and count additional enhancing signals, which improves performance. However, further relaxing of the boundaries will put many signals in the wrong spot (signals assumed to be within the intron region might reside in exons, with corresponding LOD score miscalculation), which worsens the ROC characteristic. The maximum allowed distance of the region expansion is -200...+300 bp for the 5' SS and -300...+200 bp for the 3' SS. Many applications tend to produce multimodal score distributions for splice and splice-like signals, which causes ROC curve jitter.
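The boundary-guessing step described in this response can be sketched as follows (the window caps and strength threshold come from the text; the function, its inputs, and the fallback behavior when no strong 3' SS is found are hypothetical):

```python
def scoring_region_5ss(pos, acceptor_sites, min_strength=1.0,
                       upstream_cap=200, downstream_cap=300):
    """Guess the region eligible for LOD scoring around a putative 5' SS.

    pos: position of the candidate 5' SS.
    acceptor_sites: list of (position, strength) for putative 3' SSs.
    Boundaries come from the nearest sufficiently strong 3' SS on each
    side, capped at -200...+300 bp around the 5' SS.
    """
    strong = [p for p, s in acceptor_sites if s >= min_strength]
    upstream = [p for p in strong if p < pos]
    downstream = [p for p in strong if p > pos]
    # nearest strong 3' SS on each side, or the cap if none exists
    left = max(upstream) if upstream else pos - upstream_cap
    right = min(downstream) if downstream else pos + downstream_cap
    left = max(left, pos - upstream_cap)
    right = min(right, pos + downstream_cap)
    return left, right
```

Raising `min_strength` discards weak boundary candidates and widens the region, which is exactly the relaxation/extension tradeoff the response describes.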
- P. 2. Splicing silencers are mentioned in the introduction, but not addressed in the analysis. Moreover, it is not clear how the authors assign activation/repression function to their identified motifs: they could well function as silencers.
- P. 7. The claim that ISEs have never been systematically analyzed (section 3.2) is not correct.
- P. 7. What are the definitions of the parameters in the formula (conserved/non-conserved)?
- P. 9. The first sentence in the last paragraph on this page is obscure. What are "SSs of different strengths"? That is, what groups of sites, or what strength intervals, or whatever, have been used?
- P. 9. Definition of D: is it a competing SS or a splicing enhancer?
- Ref. 9 = Ref. 11.
- The use of capitals in the reference list is erratic. "DNA", "Markov", "Bayesian" need consistent capitals.
Overall, I believe that, although the study has produced some interesting observations, and the authors' approach seems promising, the manuscript in the present form is rather raw and badly structured (it really looks like several independent papers half-written and stitched together), and several important points are not addressed at all.
I declare that I have no competing interests.