Reviewer's report 1
Dr. Alexey V. Kochetov (nominated by Mikhail Gelfand), Institute of Cytology and Genetics, Novosibirsk, Russia
The manuscript concerns the interesting and important problem of predicting the 5'-terminal part and the translation start site of eukaryotic mRNAs. The authors prepared a tool for automatic comparison of EST and cDNA data sets. Additional information from EST clones makes it possible to reveal 5'-end incomplete cDNAs and to correct them. Some examples of predicted 5'-end extended Danio rerio cDNAs were selected and verified experimentally, which demonstrated the usefulness of the method.
In my opinion, the problem of correctly predicting both the 5'-end of an mRNA and its translation start site is far from being solved. The accuracy and sensitivity of available computational tools are limited (e.g., Nadershahi et al. Comparison of computational methods for identifying translation initiation sites in EST data. BMC Bioinformatics. 2004. 5:14). Additional analysis of full-size mRNAs is important for revealing new mRNA and protein variants (e.g., Casadei et al. mRNA 5'-region sequence incompleteness: a potential source of systematic errors in translation initiation codon assignment in human mRNAs. Gene 2003 32 185–193; Porcel et al. Numerous novel annotations of the human genome sequence supported by a 5'-end-enriched cDNA collection. Genome Res. 2004. 14. 463–471). Thus, new information and software resources in this field are valuable.
1) EST data are very frequently used to map gene structure. As I understand it, the approach of Frabetti et al. places less strict limitations on EST usage, which allowed them to obtain more information than other investigations. For example, Kitagawa et al. (Bioinformatics. 2005. 21. 1758–1763) have also used EST multiple alignment to map mRNA 5'-ends (transcription start sites). They removed 10% of 5'-farthest ESTs as outliers because "...they did not need a large quantity of TSS (transcription start site) datasets but rather accurate ones". Probably, applying stricter limitations increases accuracy but decreases sensitivity. Frabetti et al. proved the efficiency of the applied criteria experimentally. However, I would like to know their opinion. This question could also be discussed in more detail in the manuscript to address the difference between the Authors' approach and other available tools. On this point, a brief comparative review of different prediction methods and tools might increase the manuscript quality considerably. It would also be interesting to know how many zebrafish gene models corresponding to these 285 potentially 5'-extended mRNAs (and available in databanks) coincide with the mRNA model predicted by the Authors.
Author's response: The aim of this work is not to predict the mRNA 5'-end and translation start site, but to exploit the currently available EST dataset to improve the present knowledge about the most complete ORF which can be assigned to a known mRNA. In this respect, our tool does not start from EST sequences, as do the tools compared in Nadershahi et al., which work independently of the known reference translation start site. On the contrary, it examines known mRNAs to verify the possibility of building more extended models at the 5'-end that include an extension of the currently accepted ORF. Similarly, the approach described in rice by Kitagawa et al. aims first to create EST clusters and then to identify ORFs, rather than to check the completeness of known ORFs. In addition, these Authors do not release a package for general use of their method. The work by Kitagawa et al., 2005, has now been briefly discussed and cited in the third paragraph of the Discussion.
To our knowledge, available software to compute mRNA ORF models with the best possible ORF 5'-end is included in the pipelines of the major genome browsers (NCBI MapViewer, UCSC Genome Browser, Ensembl). We have already shown in the article that our predicted and experimentally validated extended models were not computed by these reference map browsers. As the reviewer suggests, we believe that this is due to differences in the approaches used. While these programs use complex computations including statistical analysis to build their models, our software is more sensitive in the specific task of mechanically finding any EST which may allow the extension of a known mRNA ORF, considering its already known translation frame. When we searched NCBI sequences, both finished and predicted, for the potentially 5'-extended mRNAs, only 29 out of the 285 were available to date (August 2007). In addition, in several cases the extension was made available only after we had conducted our analysis, due to the release of new finished (non-EST) sequences in the databases, as in the case of the three mRNAs that we have experimentally validated in this work and that we ourselves have submitted to GenBank. A more detailed discussion of this item would require a systematic comparison of the details of the pipelines used by the genome browsers to build their gene models, which is beyond the aim of this work. Moreover, the originality of our software also lies in allowing simple large-scale analysis of BLAST results in a database framework, e.g. producing the subset of all mRNAs extended by the analysis for a whole transcriptome of an organism.
The work by Porcel et al., 2004, has been briefly discussed and cited in the first paragraph of the Discussion.
2) Many eukaryotic genes produce several mRNA variants with different 5'-ends because of the usage of alternative promoters and alternative splicing. It was recently estimated that in the mouse transcriptome there were about 1.32 5' start sites for each 3'-end (The FANTOM consortium et al. The transcriptional landscape of the mammalian genome. Science. 2005. 309. 1559–1563). Indeed, comparative analysis of available EST data can be used to reveal mRNA 5'-end heterogeneity. For example, multiple alignment of 5'-EST sequences revealed numerous multiple transcription start sites and alternative first exons in rice and mouse (Kitagawa et al. Computational analysis suggests that alternative first exons are involved in tissue-specific transcription in rice. Bioinformatics. 2005. 21. 1758–1763). However, Frabetti et al. considered only the situation of 5'-end incomplete mRNAs rather than the possibility that alternative mRNA forms producing different protein isoforms are synthesized. What is the reason for such a limitation? Might some "artifacts" simply be additionally produced alternative forms?
Author's response: Our research simply aimed to identify cDNA ORF sequences longer than those previously suspected. The systematic analysis of transcription start sites and/or alternative splicing would obviously require the incorporation of the whole-genome sequence, provided that it is available for a given organism, and the development of heavy computations to build mRNA models accounting for alternative transcription and/or splicing. This is beyond the aim of this work, and is also probably beyond the capability of the FileMaker Pro software running on present personal computers.
Regarding the possibility that an mRNA model extended by our approach could represent an isoform due to alternative transcription and/or splicing, we cannot formally exclude this possibility. As in the case of any other computer prediction, further investigation is required, in silico but especially in vitro, for a fine characterization of the putative model. However, we would underline that our program adds new sequence information starting exactly from the first base of the 5' end of a known mRNA form, revealing coding bases that were previously considered to be untranslated. Actually, if the more complete sequence had been cloned in the original work, the Authors would certainly have registered the upstream in-frame AUG as the start codon. We have added a brief discussion of this item in the fifth paragraph of the Discussion section.
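The idea of extending a known ORF upstream in its already known frame can be illustrated with a minimal Python sketch. This is not the Authors' FileMaker Pro implementation; the function name, its arguments and the rule of stopping at the first in-frame stop codon are assumptions made only for illustration.

```python
from typing import Optional

STOP_CODONS = {"TAA", "TAG", "TGA"}


def extend_orf_upstream(upstream: str, mrna: str, start: int) -> Optional[int]:
    """Hypothetical helper: look for an in-frame ATG upstream of a known start.

    upstream -- sequence contributed by an EST, lying 5' of the known mRNA
    mrna     -- the known mRNA sequence
    start    -- 0-based index of the annotated start codon within mrna

    Returns the index (in upstream + mrna) of the most 5' in-frame ATG that
    can be reached from the annotated start without crossing an in-frame
    stop codon, or None if there is no such ATG.
    """
    seq = (upstream + mrna).upper()
    annotated = len(upstream) + start  # annotated start in joined coordinates
    best = None
    pos = annotated - 3
    # Walk codon by codon towards the 5' end, staying in the known frame.
    while pos >= 0:
        codon = seq[pos:pos + 3]
        if codon in STOP_CODONS:
            break  # the reading frame is closed; no further extension
        if codon == "ATG":
            best = pos  # keep the most upstream candidate found so far
        pos -= 3
    return best
```

For example, `extend_orf_upstream("CCATGGCA", "GCTATGAAA", 3)` finds an in-frame ATG at index 2 of the joined sequence, whereas an in-frame stop codon between the two ATGs makes the function return `None`. A candidate found this way would still need the in silico and in vitro checks mentioned above.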
3) I also have one comment concerning the prediction of the translation start site (TSS). The Authors used Kozak's rules to predict the position of the start AUG codon (context and 5'-proximal position). This is a common method and it may be used. However, I would like to note that there is a discrepancy between the experimental data and bioinformatics approaches. According to the scanning model, 40S ribosomal subunits are recruited to the 5'-terminal cap structure, scan in a 5'-to-3' direction, and can initiate translation at the first AUG they encounter (Kozak. Regulation of translation via mRNA structure in prokaryotes and eukaryotes. Gene. 2005. 361. 13–37.). If the context of the 5'-proximal AUG codon is suboptimal, some 40S ribosomal subunits recognize it as a translational start site, but others will miss it, continue scanning in the 3' direction, and initiate translation at a downstream AUG (leaky scanning). The usage of alternative translation start sites is quite possible (some examples were demonstrated experimentally). This opinion was also recently supported by analysis of the interdependency between the TSS context and the presence of AUG codons at the CDS beginning (Kochetov. AUG codons at the beginning of protein coding sequences are frequent in eukaryotic mRNAs with a suboptimal start codon context. Bioinformatics. 2005 21. 837–840; Kochetov et al. The role of alternative translation start sites in generation of human protein diversity. Mol. Genet. Genomics. 2005. 273. 491–496). Thus, a suboptimal AUG context does not necessarily mean that this AUG is not used as a start site. This is not a problem of this particular manuscript: currently most gene prediction programs do not take into account alternative translation start sites. This comment does not need a reply.
Author's response: We agree that a brief comment about alternative translation is appropriate, and we have added a sentence and two references in the fifth paragraph of the Discussion about this item.
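The context rule and leaky scanning discussed in point 3 can be made concrete with a short Python sketch. It is a deliberately simplified reading of the scanning model and is not taken from the manuscript: the function names are hypothetical, and the context is reduced to the two commonly cited positions (a purine at -3 and a G at +4, counting the A of AUG as +1).

```python
def kozak_context(seq: str, aug: int) -> str:
    """Classify the context of the AUG at 0-based index `aug`.

    Simplified rule (an assumption of this sketch): 'strong' if both -3 is a
    purine and +4 is G, 'adequate' if only one holds, 'weak' if neither.
    """
    seq = seq.upper().replace("U", "T")
    if seq[aug:aug + 3] != "ATG":
        raise ValueError("no AUG at the given position")
    minus3 = seq[aug - 3] if aug >= 3 else ""
    plus4 = seq[aug + 3] if aug + 3 < len(seq) else ""
    hits = int(minus3 in ("A", "G")) + int(plus4 == "G")
    return ("weak", "adequate", "strong")[hits]


def candidate_starts(seq: str) -> list:
    """Scan 5' to 3' and collect AUGs a scanning ribosome might use.

    Every AUG is kept until one in a strong context is reached, which is
    assumed here to capture all remaining scanning subunits.
    """
    seq = seq.upper().replace("U", "T")
    starts = []
    pos = seq.find("ATG")
    while pos != -1:
        context = kozak_context(seq, pos)
        starts.append((pos, context))
        if context == "strong":
            break
        pos = seq.find("ATG", pos + 1)
    return starts
```

In this sketch a 5'-proximal AUG in a weak context is reported together with the next downstream AUG, rather than being treated as the only possible start site.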
Actually, I am not sure that the Authors need to draw so much attention to the importance of the problem of cDNA 5'-end incompleteness: it is quite clear that many cDNAs are incomplete. It may be more reasonable to concentrate on the methods (and software) for solving this well-known problem rather than to emphasize the importance of the problem itself.
English language should be improved.
Author's response: In the Introduction and Discussion we have concentrated our attention on the method to solve the problem rather than on the known importance of the problem, as suggested. The English has been revised.
Reviewer's report 2
Dr. Shamil Sunyaev, Harvard Medical School, Boston (MA)
This manuscript contributes to the problem of accurate annotation of sequenced genomes. The authors argue that inaccurate determination of transcription starts may lead to false annotation of translation starts. Using analysis of EST sequences they identified translation start sites located 5' of the annotated sites for a few percent of zebrafish genes. They experimentally prove the existence of these larger transcripts for three zebrafish genes.
I have a few comments on this version of the manuscript:
1) Eukaryotic transcription start sites are known to be frequently wobbly and sometimes there are alternative sites which may be tissue or developmental stage specific. Is it possible that the authors detect alternative starting sites rather than 5' artifacts? The manuscript would greatly benefit from a discussion of this point.
Author's response: The same point was also raised by Reviewer 1; please see the response to Reviewer 1, point 2, as far as alternative transcription is concerned.
2) The authors use a 97% sequence identity threshold for EST alignments (it seems without constraining alignment length). Is this sufficient to avoid aligning ESTs belonging to paralogous genes? In other words, is it possible that ESTs used by the authors correspond to transcripts of different genes?
Author's response: As we note in the Methods section, the 97% sequence identity threshold is stringent, but it may be modified by the user if desired. It was chosen considering the known recurrence of sequencing errors in the EST entries. We also point out that we allow the option of a constraint on alignment length and that we used 49% of the total EST length, a value determined empirically.
Regarding the possibility of detecting ESTs related to paralogous genes in the same query, we first observe that the extent of coding-gene paralogy may differ from one species to another. However, in general, the percentage of identity between paralogous genes in a gene family is well below 90%. For example, in the case of the organism we study in this work (zebrafish), a simple analysis of two classic gene families known to have a very high degree of conservation among their members (homeobox and histone gene families) reveals that, at the nucleotide level, values of sequence identity for each pairwise comparison were up to 80% (e.g., typically in the range of 71–80% in the case of homeobox).
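The two thresholds mentioned in this response (97% identity and an alignment covering 49% of the EST length) can be sketched in Python as a simple filter. The record type, field names and function name are hypothetical, and reading the 49% value as a minimum aligned fraction of the EST is an assumption of this sketch, not a description of the released software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EstHit:
    """Hypothetical record for one EST-to-mRNA alignment."""
    est_id: str
    identity_pct: float  # percent identity over the aligned region
    aligned_len: int     # number of EST bases in the alignment
    est_len: int         # total length of the EST


def passes_thresholds(hit: EstHit,
                      min_identity: float = 97.0,
                      min_fraction: float = 0.49) -> bool:
    """Keep a hit only if it meets both user-adjustable thresholds."""
    if hit.est_len <= 0:
        return False
    fraction = hit.aligned_len / hit.est_len
    return hit.identity_pct >= min_identity and fraction >= min_fraction


hits = [
    EstHit("est_same_gene", 98.5, 300, 500),  # kept
    EstHit("est_paralog", 80.0, 480, 500),    # identity too low
    EstHit("est_short", 99.0, 100, 500),      # alignment too short
]
kept = [h.est_id for h in hits if passes_thresholds(h)]
```

With identities between paralogs of up to about 80%, as reported above for the two zebrafish families, such hits fall well short of the 97% cut-off in this sketch.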
3) It is unclear why the software is currently limited to plus/plus alignments and why so many manual steps are needed. Are there any serious hurdles for development of a fully functioning software?
Author's response: We have limited the software to plus/plus alignments for two reasons. First, it is implemented in a general-purpose database software for common personal computers, and computations to reverse the sequence in order to resolve plus/minus alignments would currently be very heavy and slow in this situation. However, we explicitly chose this graphical interface to make the software easier for biologists with limited IT skills to use. Second, as we point out in the Discussion, most ESTs are directionally cloned cDNAs, in particular those deriving from more recently obtained libraries, which are also, due to experimental improvements, those with a better representation of the cDNA 5' end. In the case of directional cloning, the reverse sequence of the EST is not expected to add data about the 5' end.
We agree that some manual work is at present necessary to make the software run. However, due to the potentially long time required to import data, we prefer to keep the execution of each main step separate. In addition, it is difficult to incorporate the procedure in a single command, because it exploits different programs and web sites to be used on a personal computer. AppleScript could be useful on Macs to combine the different required instructions; however, many programs are not fully scriptable, and in any case this system scripting language is not readily available in Windows OS. We look forward to improvements in personal computer OS scripting languages and software to make the procedure more compact in future versions of the software. In the meantime, we have added an example test to the software distribution to make the instructions clearer.
4) I would recommend omitting straightforward computational details of the procedure from the manuscript.
Author's response: The details about the procedure have been removed and they remain available in the Guide provided along with the software.