Reviewer 1: Dr. Daniel Haft
Wood and coauthors describe a study that seems to have two separate and complementary purposes – to showcase how standardizing on GLIMMER3 used with properly set parameters might improve gene finding for prokaryotes, and to showcase how BLAST searching vs. the COMBREX database of genes with experimental evidence allows better verification of coding region predictions than searching unfiltered databases. Unfortunately, each purpose is somewhat compromised by bundling the two tasks without benchmarking each phase of the project independently.
Because GLIMMER3 relies on analyses of k-mer frequencies in trusted sets of coding regions from a single genome, it models what is typical across a single genome. Consequently, islands present in that genome because of lateral gene transfer (LGT) are handled much more poorly than the rest of the genome. If the point is to find large numbers of genes that may have been missed, in a computationally efficient manner, where those candidates are slated to be filtered through a subsequent validation process anyway, then the pipeline probably should use GLIMMER3 and MetaGene in combination, taking the union, and the pipeline should not filter out plasmid replicons (an area of relative weakness for GLIMMER3, which needs a large replicon to generate good statistical models).
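As a rough illustration of the union the reviewer proposes, the sketch below merges two lists of predicted genes. It rests on assumptions that are not from the paper: each prediction is reduced to coordinates and strand, two calls that share a strand and a stop position are treated as the same ORF, and the longer call is kept. The `Gene` type and the function names are hypothetical, and parsing of real GLIMMER3 or MetaGene output is omitted.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    start: int   # 1-based leftmost coordinate on the replicon
    end: int     # 1-based rightmost coordinate on the replicon
    strand: str  # "+" or "-"


def stop_key(g: Gene) -> tuple[str, int]:
    """Identify an ORF by its strand and the coordinate of its 3' (stop) end."""
    return (g.strand, g.end if g.strand == "+" else g.start)


def union_predictions(a: list[Gene], b: list[Gene]) -> list[Gene]:
    """Union of two prediction sets, collapsing calls that share a stop codon."""
    merged: dict[tuple[str, int], Gene] = {}
    for g in a + b:
        key = stop_key(g)
        prev = merged.get(key)
        # Assumption: when both tools call the same ORF, keep the longer call.
        if prev is None or (g.end - g.start) > (prev.end - prev.start):
            merged[key] = g
    return sorted(merged.values())


# Toy coordinates for illustration only; not taken from any real genome.
glimmer_calls = [Gene(100, 400, "+"), Gene(900, 1200, "-")]
metagene_calls = [Gene(130, 400, "+"), Gene(2000, 2300, "+")]
combined = union_predictions(glimmer_calls, metagene_calls)
```

Keying on the stop position means that two tools disagreeing only on the start codon contribute one candidate rather than two, which keeps the union from inflating the candidate count.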
It would be good to know from this study how severe the issue of missed genes in public databases actually is. In fact, the average number of genes picked up per archival genome submission to GenBank is small in this paper, on the order of 10 new genes for each 4000-gene genome, suggesting accuracy has actually been very good all along, other than the conspicuous exceptions noted in the paper. A previous study looked at every intergenic ORF, and found similar numbers, but that study would not have seen genes overshadowed by spurious genes and spurious extensions to genes. The real problem of missed genes may run closer to 3% (by count, not by length, since missed genes tend to be small). The current paper uses GLIMMER3 as the only gene-finding method, and to a large extent finds genes missed by GLIMMER in earlier incarnations or run with inappropriate parameters. I recommend that the ab initio gene-finding phase of this study be redone, using the union of GLIMMER and MetaGene predictions instead of GLIMMER only (after which I would amend my remarks here).
Authors’ response: The use of multiple gene-finding programs, and taking the union of their results, would certainly be a worthwhile step were one seeking to build a full annotation pipeline, or even to find as many missed genes as possible; as we stated in our introduction, however, this is not what we intended with this study. We sought to provide a simple pipeline that would quantify the number of new missed genes that could be found using just a few freely available tools and resources (Glimmer3, BLAST, and RefSeq). With few exceptions, genes found by such a pipeline should simply not be missing from existing annotations, and if they are absent from the annotation, the reason for their absence should be stated in the annotation itself (e.g., the annotators believe such genes to be pseudogenes). Although there are almost certainly genes that we did not find as part of this study, we believe that by focusing attention on these unannotated genes and the reasons for their absence, we can bring attention to the wide variety of annotation methods currently in use and the need for standardization.
While we agree with the reviewer that average sensitivity appears to be quite high, the “conspicuous exceptions” we noted are quite important, and a large part of what we aimed to highlight with this study. The lack of review given to GenBank submissions is not necessarily well-understood by the entire genomics community, and as we mention later in reply to the reviewer’s point about RefSeq, errors and omissions from a GenBank annotation can often be propagated into the respective RefSeq annotation. Bringing these errors in the public genomic records to the community’s attention is one of the main goals of our paper.
The second phase of the study attempts to demonstrate how validating predicted gene calls is improved by use of a BLAST-searchable database of proteins in which direct experimental evidence is distinguished from transitive evidence of function, and both are distinguished from sequences with no evidence of representing real genes. While it is natural to pair this part of the study with the data stream of predicted missed gene calls from the ab initio work with GLIMMER, it is regrettable that there is no benchmarking of the COMBLAST pipeline and COMBREX data set using more typical data than the less-than-1-percent of atypical genes found in phase 1. How does COMBREX do with the complete genome of an endosymbiont such as Blochmannia floridanus? How does it do with a large, GC-rich genome such as Mycobacterium smegmatis? The paper does not describe proteomics as a source of evidence for confirmation in COMBREX that a gene is translated into protein, so perhaps the best test is a set of proteomics-verified polypeptides from, say, Yersinia.
It is difficult for the reader/reviewer to evaluate whether or not there may be a problem in COMBREX – that some of the evidence in COMBREX may point precisely to a genomic region and yet not necessarily indicate the correct reading frame for that gene. Experiments based on antisense RNA, or transposon mutagenesis, do not point to a specific reading frame. The test I recommend is to select some GC-rich genome, make a test set by GLIMMER ab initio gene predictions, make a negative control set from all the longest ORFs that GLIMMER rejected from the genome, run both sets through the COMBLAST pipeline, and compare those results.
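The comparison the reviewer recommends can be sketched as two support rates, one for the predicted genes and one for the negative control. This is only an outline under stated assumptions: the identifiers are placeholders, `support_rate` is a hypothetical helper, and the set of supported IDs stands in for whatever the COMBLAST pipeline would report, which is not run here.

```python
def support_rate(candidates: set[str], supported: set[str]) -> float:
    """Fraction of candidate ORFs that received supporting evidence."""
    if not candidates:
        raise ValueError("candidate set is empty")
    return len(candidates & supported) / len(candidates)


# Placeholder IDs for illustration only; no real pipeline output is shown.
predicted_genes = {"g1", "g2", "g3", "g4"}   # ab initio predictions (test set)
rejected_orfs = {"r1", "r2", "r3", "r4"}     # longest rejected ORFs (negative control)
supported_ids = {"g1", "g2", "g3", "r1"}     # ORFs the pipeline linked to evidence

test_rate = support_rate(predicted_genes, supported_ids)
control_rate = support_rate(rejected_orfs, supported_ids)
```

A control rate close to the test rate would suggest that the evidence does not discriminate reading frames well, which is the concern raised above.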
Authors’ response: The types of analysis suggested by the reviewer are worth doing in order to evaluate the advantages of using the newly available ComBlast tool and the COMBREX database. However, this was not the focus of our work. Currently we do not consider ComBlast as a tool for discovering missed genes. However, in the context of this work, COMBREX becomes a useful resource as it provides fully traceable annotation (whenever possible) to the experimentally determined evidence. As such we use it to provide sources of evidence for some of the missed genes, which we hope will help to focus the attention of the community on that topic. We demonstrated here a simple yet important use of that type of information, which we believe should be an integral part of any annotation pipeline. It is important to make clear that we do not intend to classify the missed genes as true and false positives based only on COMBREX, since there might be many missed genes that are indeed protein-coding genes that COMBREX cannot currently associate with any experimental evidence. In this study we also used COMBREX to identify biologically interesting potential genes, which others might want to further investigate. In the future, we do plan to continue developing the abilities of ComBlast and publish a more complete study evaluating its capabilities. We hope such a study will provide some answers to the important questions pointed out by the reviewer.
Here are a number of additional points about the paper.
One mechanism to evaluate COMBREX for spurious gene calls is to search it with a database of HMMs built to detect some of the most popular spurious gene calls – AntiFam. These models should be used to check both COMBREX genes with evidence and GLIMMER’s candidate missed genes.
Authors’ response: We thank the reviewer for bringing AntiFam to our attention. We used AntiFam to check both the named missed genes as well as the hypothetical missed genes; 8 of the 13614 (0.06%) named missed genes were found to be spurious genes, as were 141 of the 39003 (0.36%) hypothetical missed genes. We have added discussion of this to the manuscript and a new table summarizing the results.
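The percentages quoted above follow directly from the counts. The short check below reproduces them; the variable and function names are illustrative, and the counts are the ones given in the response.

```python
def percent(hits: int, total: int) -> float:
    """Share of `hits` in `total`, expressed as a percentage."""
    return 100.0 * hits / total


named_pct = percent(8, 13614)           # named missed genes flagged by AntiFam
hypothetical_pct = percent(141, 39003)  # hypothetical missed genes flagged by AntiFam

print(f"named: {named_pct:.2f}%")
print(f"hypothetical: {hypothetical_pct:.2f}%")
```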
The statement that a gene has evidence is very different from the statement that the full length of a gene has evidence. Thus, a spurious ORF from one genome could be compared to a real gene, with an improper 5’-extension, from another genome. The COMBLAST pipeline would provide evidence in support for adding the gene, and naming it, when that action might be inappropriate.
Authors’ response: Although this is possible, ComBlast does attempt to prevent such occurrences by only considering genes to be similar if the alignment between the two covers at least 80% of the gene in COMBREX. As the reviewer mentioned earlier, 5’ annotation remains a more challenging problem than gene identification, and the lower accuracy of start codon annotation compared to gene annotation should be taken into account by anyone using ComBlast, or any gene finding/annotation tool. We do not claim that every missed gene identified is necessarily correctly called as a gene and associated with a name, but only that it has good evidence to support that it is a true gene. Further analysis and checking may need to be performed to confirm each gene's validity.
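A minimal sketch of an 80% coverage criterion of the kind described above is given here. It is not ComBlast's actual implementation, which the response does not detail: the sketch assumes a single alignment with 1-based, inclusive coordinates on the COMBREX (subject) protein, and the function names and the `min_coverage` parameter are hypothetical.

```python
def subject_coverage(s_start: int, s_end: int, subject_len: int) -> float:
    """Fraction of the subject sequence spanned by one alignment."""
    if subject_len <= 0:
        raise ValueError("subject_len must be positive")
    lo, hi = sorted((s_start, s_end))
    return (hi - lo + 1) / subject_len


def passes_coverage(s_start: int, s_end: int, subject_len: int,
                    min_coverage: float = 0.80) -> bool:
    """True if the alignment spans at least `min_coverage` of the subject."""
    return subject_coverage(s_start, s_end, subject_len) >= min_coverage


# Toy example: a 300-residue subject protein.
near_full_length = passes_coverage(41, 300, 300)   # 260 of 300 residues aligned
partial_only = passes_coverage(150, 300, 300)      # 151 of 300 residues aligned
```

Measuring coverage on the subject rather than the query is the design choice that matters here: a short candidate ORF that matches only part of a longer database protein is rejected even if the candidate itself is fully aligned.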
The paper makes no mention that NCBI provides RefSeq versions of genomes, produced by a pipeline that is applied pretty consistently. The purpose of RefSeq is to compensate for the fact that submitted genomes are largely archival documents, not maintained with respect to functional annotation and only infrequently modified to correct previous gene-calling errors. Thus, genes missing from GenBank are not equivalent to genes missing from the accessible world of searchable protein sequences.
Authors’ response: While annotations of genomes in RefSeq are provided by NCBI, many of these are still listed as “provisional”, meaning that they have not been manually reviewed by NCBI staff. For prokaryotic genomes, the annotation present in RefSeq is based heavily on the GenBank annotation if one is provided. We have added a discussion on this, as well as a comparison of RefSeq and GenBank annotations for ten genomes. In summary, although genes missing from GenBank are not necessarily missing from RefSeq, we found that a sample of the genes missed in GenBank were also missing from RefSeq.
Reviewer 2: Dr. Arcady Mushegian
The study by Wood et al. reports the results of re-prediction, by Glimmer3, of the open reading frames in the majority of finished bacterial genomes and the identification of those genes that have been missed by earlier genome annotation efforts. Protein-level sequence similarity to database proteins is used as a criterion of a gene's reality, which is further elaborated by taking into account the experimental knowledge about the homologs of the predicted gene products. The COMBREX database, which holds and curates this knowledge, is discussed, and examples of database queries that have to do with antibiotic resistance and gene essentiality and help guide experiments are given. The study also collected statistics concerning missed genes in genome annotations. For example, the average number of missed conserved genes per annotated genome appears to be about 10, and this average varies between 2 and 13 depending on the sequencing center. Large sequencing centers are better than small teams, presumably because of more robust annotation pipelines and better-staffed bioinformatics departments, and genes missed by all centers tend to be on the shorter side, evidently in part because of the arbitrary length cutoffs imposed by many genome annotation efforts.
COMBREX is a useful resource, set up in such a way as to involve the scientific community in improving annotation of bacterial gene function. This is worth highlighting. On the other hand, my perfunctory attempts to find new predictions (i.e., genes previously missed but now restored) by querying http://combrex.bu.edu/ failed. Is there a way to do it, and should it not be provided simultaneously with the submission of a paper that talks about such genes?
Authors’ response: It is one of our goals to make sure that the new set of missed genes will be accessible to the community. To ensure that, we first provide a webpage on COMBREX’s database website with all the missed genes, including the ones that can be related to COMBREX genes. One can also query the COMBREX database with a specific sequence using the ComBlast server currently under development at
. We plan to develop both the user interface and the tool itself so it will be as useful as possible to general users. Finally, it is our intention to include the missed genes as part of the COMBREX database. This way every missed gene will be available to queries to the database.
A technical comment: on p. 4, we read: "For those candidate missed genes with homology only to hypothetical proteins, we needed additional information to determine if they were indeed genes." and again on p. 7: "it is likely that a significant fraction of such genes [i.e., those passing three reasonably restrictive filters, after sequence similarity has been established in the first place - AM] are not true genes". In both cases, it is not clear to me what the alternative to these ORFs being true genes may be - they could be pseudogenes of course, but the authors address this separately. Then, on p. 8, the authors suggest an alternative for the special case of very closely related strains, when the conservation of a spurious ORF can be an artifact of similar k-mer distribution (I suppose, even more trivially in this case, this could be a result of very high overall nucleotide-level sequence similarity). Right away, however, the authors remark that if an ORF is conserved in more genomes, and, better yet, in several relatively diverse evolutionary lineages, then it is most likely not spurious. But since the ORFs in question were identified by sequence similarity in the first place, one would think that a simple filter on the taxonomic closeness or even percent identity (not too high in either case) would take care of the problem? More generally, and in line with the authors' approach, it would be useful to have a rule of thumb such as "an ORF with a homolog separated by evolutionary distance X has a Y percent chance not to be spurious" - most likely, the authors already have data on hand to address this?
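The kind of filter suggested here could look like the sketch below. Everything specific in it is an assumption made for illustration: the `Hit` record, the use of genus as the measure of taxonomic closeness, the decision to require both conditions at once, and the 90% identity cutoff, which is a placeholder rather than a recommended value.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    subject_genus: str       # genus of the genome containing the homolog
    percent_identity: float  # protein-level identity of the alignment


def has_divergent_support(query_genus: str, hits: list[Hit],
                          max_identity: float = 90.0) -> bool:
    """True if at least one homolog is neither too close taxonomically
    (same genus) nor too similar (identity above `max_identity`)."""
    return any(
        h.subject_genus != query_genus and h.percent_identity <= max_identity
        for h in hits
    )


# Toy hits for illustration only.
close_only = [Hit("Escherichia", 99.5), Hit("Shigella", 98.0)]
with_distant = close_only + [Hit("Vibrio", 62.0)]

kept = has_divergent_support("Escherichia", with_distant)
dropped = has_divergent_support("Escherichia", close_only)
```

An ORF supported only by near-identical sequences from close relatives is set aside, while one with a more distant homolog is kept, which mirrors the reasoning about conservation across diverse lineages.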
Authors’ response: Although such filters would very likely solve the problem, we wanted to be as sure as possible that the missed genes we considered as part of our annotation center analysis were indeed true genes. To that end, we elected to require the presence of a functional assignment to the gene, rather than attempting to discover a percent identity threshold that was “conservative enough” to give us a similar confidence in a gene’s coding nature.
p. 8 and Table 1: replace "significant homology" with "significant similarity".
Authors’ response: We have made these changes, and thank the reviewer for bringing them to our attention.
Quality of written English: Acceptable
Reviewer 3: Dr. M. Pilar Francino (nominated by Prof. David Ardell)
This work reports an interesting reanalysis of gene annotation in bacterial genomes, revealing that, although the great majority of genes are found in every genome, a large number of very likely genes have been missed overall. Many of these misses are due to overstringent cutoffs in terms of minimum gene length. The analysis also reveals that large genome centers that rely on well-established annotation pipelines miss fewer genes than smaller centers and individual laboratories, suggesting that bioinformatics expertise is another crucial factor in this issue. The missed genes are separated by the authors into “named” and “hypothetical” groups and further analysed using the new COMBREX database, which contains functional and phenotypic gene information that has been gathered from the experimental literature. This provides further support for the coding nature of the candidate missed genes in the “named” group and for a fraction of those in the “hypothetical” group. Moreover, a specific level of support is assigned to every gene annotation depending on the type of COMBREX information associated with it. Overall, the paper is an important call for attention to the need to homogenize annotation procedures as well as a demonstration of how knowledge bases such as COMBREX can facilitate and improve gene annotation. It is extremely well written and easy to follow.
Quality of written English: Acceptable
Authors’ response: We thank the reviewer for her kind comments.