Reviewer's report 1
Eugene Koonin, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health.
1. Overlapping genes are an old and appealing topic, tracing back to the startling discovery of 'genes within genes' in small bacteriophages by Sanger et al. back in 1977. However, the subsequent history of the study of this phenomenon has been somewhat anticlimactic because it turns out that long overlaps are, after all, not that common in cellular life forms (or even in viruses with larger genomes), and the longest ones reported have the unpleasant habit of going away as artifacts. However, the apparent lack of a truly essential biological role of overlaps – beyond very short overlaps involved in regulatory compression as discussed in the present paper – does not diminish their theoretical interest. Even if the sequences of overlap are rather short, they do carry two messages in the same string of nucleotides, and an extensive analysis of such sequences on genome scale has the potential to reveal aspects of selection and neutrality that escape our attention in the study of "normal" genes.
2. The paper of Lillo and Krakauer is, to my knowledge, the most comprehensive and nuanced analysis of this kind to date, and beyond any doubt, will be a useful addition to the literature. Of course, I do have a variety of comments that might be of some use for revision or could just help the interested reader to become better oriented in this rather complex tangle of problems.
2. First a couple of very general issues. The work would greatly benefit from a more extensive and more explicit analysis of the evolutionary conservation of overlapping regions (ORs). This would make a lot of sense both methodologically, by increasing the reliability of the results, and substantively. Indeed, it is interesting to see how the evolutionary conservation varies among different orientations and phases of ORs, whether there are some ORs that are conserved in a broad range of prokaryotes, and more questions like that.
More or less along similar lines, it would make sense to present more information on possible lineage-specific trends in ORs. Is there some interesting biology here or are the characteristics of ORs just a function of genome size and gene density? If it is the latter, it is worth illustrating and stating explicitly. All the more so if it is the former.
As we show in this paper, using some careful controls, long overlaps are in fact far more common than is thought. While short overlaps can be reasonably easily explained in terms of an ongoing legacy of neutral mutations, modest genome minimization and regulatory compression, long overlaps make a strong case for replicative compression. As is shown in figure 2, the length distribution function indicates clearly how the longest overlap lengths exceed the expectations of an exponential distribution with the same mean.
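The comparison with an exponential distribution of the same mean can be illustrated with a minimal sketch. It is not the authors' procedure: the function name `exponential_tail_excess` and the list of lengths are hypothetical, chosen only to show how an observed tail fraction is set against the exponential expectation exp(-L/mean).

```python
import math

def exponential_tail_excess(lengths, threshold):
    """Return (observed, expected) fractions of overlaps longer than `threshold`.

    `expected` is the survival function of an exponential distribution
    whose mean equals the sample mean of `lengths`.
    """
    mean = sum(lengths) / len(lengths)
    observed = sum(1 for x in lengths if x > threshold) / len(lengths)
    expected = math.exp(-threshold / mean)
    return observed, expected

# Made-up overlap lengths in bp (illustrative only, not data from the paper):
# many short overlaps plus a handful of long ones.
lengths = [4] * 60 + [1] * 20 + [8] * 10 + [11] * 5 + [50, 80, 120, 200, 300]
observed, expected = exponential_tail_excess(lengths, 75)
print(f"observed fraction > 75 bp: {observed:.3f}")
print(f"exponential expectation:   {expected:.5f}")
```

An observed fraction well above the exponential expectation is the kind of heavy right tail described here. With only a few points in the tail, such a comparison is sensitive to annotation artifacts, as the reviewer notes in point 6.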
With regard to lineage-specific trends, we include an explicit phylogenetic component by tracing several features of overlapping genes through a consensus prokaryotic phylogeny making use of squared change parsimony. This allows us to track varying degrees of conservation, and to identify lineages in which changes to the genes have been more recent or derived. We also consider a more restricted dataset including only overlaps formed by pairs of genes that are conserved in at least two species and in this way seek to minimize curatorial artifacts.
3. In the Background section, the authors discuss the mutational origins of overlaps and the interplay of neutral or adaptive processes in their evolution. However, I think it is rather important to explain right away the range of the phenomenon and to make a distinction between viral overlapping genes, where there are many instances of actual "genes within genes", and prokaryotic overlaps, which are (predominantly) very short.
We elected to not treat cases of embedded overlapping genes in this study. This is because we sought those cases where controls based on non-overlapping orthologues could be used in the analysis. Embedded genes are a fascinating topic as they often make additional demands on the post-transcriptional machinery, and as stated, are used extensively in viruses.
4. Variational benefits...this is an old, somewhat tired issue on the reality of "evolution of evolvability", evolution having no foresight etc etc. The authors present this as a fully legitimate, regular evolutionary force. Perhaps, some extra caution and more discussion are due.
We have pointed out that variation in phase cannot be explained exclusively in terms of either regulatory or immediate benefits. There is an ongoing effect of mutational origin on the statistics of phase usage, but this is also unable to explain a significant portion of the variation. What does seem more likely is that phases that are preferred have mutational properties that are either conservative (more redundant) or amplifying (increasing the amino acid replacement probability). Both of these strategies are consistent with robustness arguments, in which genomes are either buffered from genetic variation or more efficiently purged in a clonal quasispecies. It is also possible, of course, that increased variation can come under positive selection, in which case this would constitute an evolvability hypothesis. With the current evidence we are unable to discriminate between the previous possibilities and we do not use the term evolvable, although the hypothesis remains very much in play. We think that evolvability could be particularly interesting in microbes, where population sizes are typically large and where indirect selective effects are thus able to exert a significant influence on adaptation. The existence of mutator genotypes provides compelling evidence for this possibility.
5. The issue of selection for genome compression – not regulatory compression (with which I have no problem) but replication rate. This is very obvious and one of the first things that comes to mind when one considers the raison d'être for overlaps. But is it real or, at least, is it particularly important and general? How much sequence can be actually saved through overlaps? Fig. 2a shows values > 10 kb for 5 genomes; whether this is a lot or not really depends on the size of the respective genomes (by the way, is it worth showing the same data after normalizing by genome size?). One can sort of get the hang of it by comparing Fig. 2a and Fig. 2b but it is not, exactly, straightforward. In any case, for the great majority of the genomes, the total length of the overlaps is much less. I understand it is no easy question but could there be any way to assess the selective advantages conferred by this amount of compression against the obvious disadvantages of overlaps (assuming they are not subject to other types of selection)? Also, if compression is so important, why no genes within genes? We know from viruses that this is not impossible.
This is now dealt with fairly explicitly in the text and the conclusions. The question of the magnitude of replicative benefit, ideally measured in terms of replication rates in culture, is difficult to quantify using only published sequence data. Having said this, it is instructive to note that around 3% of all genes are involved in overlap with a mean overlap length of 26 base pairs. In some species this percentage can be over 50%. The sum of all overlapping regions is on average 0.2% of the total genome length.
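The quantities quoted in this response (fraction of genes involved in an overlap, mean overlap length, and overlapped fraction of the genome) can be computed from gene coordinates along the following lines. This is a hedged sketch rather than the authors' pipeline: the function `overlap_summary`, the coordinate convention (1-based, inclusive) and the toy gene list are assumptions, and only overlaps between neighbouring genes in sorted order are examined, with embedded genes skipped as in the study.

```python
def overlap_summary(genes, genome_length):
    """Summarise non-embedded overlaps between neighbouring genes.

    genes: list of (start, end) pairs, 1-based inclusive coordinates.
    Returns (fraction of genes in an overlap,
             mean overlap length in bp,
             fraction of the genome covered by overlaps).
    Simplification: only consecutive genes (after sorting) are compared.
    """
    genes = sorted(genes)
    overlap_lengths = []
    involved = set()
    for i in range(len(genes) - 1):
        (s1, e1), (s2, e2) = genes[i], genes[i + 1]
        # Partial overlap only: the second gene starts inside the first
        # and ends beyond it. Embedded genes (e2 <= e1) are skipped.
        if s1 < s2 <= e1 < e2:
            overlap_lengths.append(e1 - s2 + 1)
            involved.update((i, i + 1))
    if not overlap_lengths:
        return 0.0, 0.0, 0.0
    total = sum(overlap_lengths)
    return (len(involved) / len(genes),
            total / len(overlap_lengths),
            total / genome_length)

# Toy annotation (hypothetical coordinates): one 4-bp overlap, one embedded gene.
genes = [(1, 100), (97, 200), (250, 400), (300, 350), (500, 600)]
frac_genes, mean_len, frac_genome = overlap_summary(genes, genome_length=1000)
print(frac_genes, mean_len, frac_genome)
```

Dividing the summed overlap length by genome length gives the normalized quantity the reviewer asks about, which makes genomes of different sizes directly comparable.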
6. The rather notorious issue of long overlaps. It is possible that I am overly cautious but I am worried over the right tail of the distribution in Fig. 1b. Clearly, there are only a few points in this area, and even a small number of artifacts would sway the curve away from the exponent. At least, I think this issue should be given more attention.
In the revised version we give more attention to this issue by considering a set of highly conserved ORs. See point 4 above and the new panel in Fig. 2.
7. The explanation of the striking under-representation of divergent overlaps given on p. 11 (and in ref. [25]) is, probably, correct. The point, I believe is that the constellation of regulatory elements that are required for the initiation of both transcription and translation is much more demanding than that required for termination. Hence the strong purifying selection against divergent but not so much against convergent overlaps. By the way, an interesting thing to check: are convergent overlaps in prokaryotes seen primarily in genes with rho-dependent or rho-independent termination? Sequence requirements for the two are very different.
We discuss this interesting problem in relation to the rho-independent terminators and convergent overlaps in E. coli. As the text indicates, at least for E. coli there remain a relatively large number of sequence-dependent termination motifs even in overlapping sequences.
8. In the discussion of phase frequencies – relating to the issue of long overlaps once again. I am very worried about the "crossing" at length 75. How many points there, after all?
In the revised version we make this number clear and include the sentence 'There are more than 300 codirectional ORs longer than 75 bp'.
9. In the conclusions it would be desirable to indicate that, alas, there is no good way to distinguish between the adaptationist and neutral explanations or any combination thereof. Under these circumstances, is it not prudent to take the neutral explanation as the null hypothesis?
As the paper shows there seems to be evidence for all three kinds of evolutionary explanation. While we agree that neutral hypotheses constitute an appropriate null model, we do find many patterns at odds with neutrality, from the most modest 4-base pair overlaps in operons, through to variation in phase preference in long overlapping sequences. We have tried to make these distinctions as clear as possible.
Reviewer's report 2
Martijn A. Huynen, Ph.D. Center for Molecular and Biomolecular Informatics Nijmegen Centre for Molecular Life Sciences Radboud University Nijmegen Medical Centre
1. taxonomy: The position of the hyperthermophilic bacteria is, in my opinion, not resolved. Gene content (Dutilh, JME 2004) and indel analyses (Gupta) put Aquifex with the Proteobacteria, or at least at their root, and Thermotoga with the Firmicutes. That position of Aquifex would fit "better" their high level of overlapping genes which they share with the Proteobacteria. In line with this, I would be careful with the remark about the primitiveness of the overlapping organization, as you are, as far as I understand, also referring to your results on the Archaea (not Archaeota) here. And there is no reason to assume that they are primitive.
We have endeavored to be cautious in the interpretation of the phylogenetically reconstructed patterns, largely because the status of the phylogeny remains somewhat ambiguous. Our tree is simply an 'all the evidence' super-tree which at the very least tells us that overlapping genes have been around as long as some of the most ancestral clades in the prokaryotic group.
2. with respect to Figure 8: please give a histogram, not a "longer than" plot. The former would give us a better impression of the strength of the signal and the amount of data supporting it.
We have added a panel to figure 8 containing the histogram of the number of occurrences.
3. with respect to the phylogeny: do you observe any conservation of overlaps: i.e. rather than counting them, do the same genes overlap in phylogenetically "close" species. I guess it goes beyond the analyses in this paper, but there have been a lot of analyses done on the rate at which gene order is "randomized" in Bacteria and Archaea, also with respect to gene order. Phylogenetic conservation is always a strong argument for selection.
While a detailed study of all 58 genomes for conserved sequences would be beyond the scope of this project, the analysis of pairs of closely related genomes, such as in the study [10], does support conservation.
4. For table 5: did you take the codon bias in E. coli into account?
Yes, the expected fractions are computed by using the codon bias observed in E. coli.
5. Conclusion 2: Are in all cases of overlapping co-directional gene pairs both genes indeed in the same operon? Either rephrase, or examine the available operon data, e.g. for E. coli.
F is this the case?
6. page 21: what does population size have to do with the rate of replication? Do know that Archaea are quite a bit slower in their replication than e.g. E. coli, and also in the Bacteria the differences are huge with some Bacteria (e.g. planctomycetes) having lower replication rates than some eukaryotes like yeast.
The point here is simply that prokaryote effective population sizes tend to be large and that this will make selection more efficient.
Reviewer's report 3
Han Liang, Department of Ecology and Evolution, University of Chicago, USA (Nominated by Laura Landweber, Department of Ecology and Evolution, Princeton University)
1. The study by Lillo and Krakauer represents a very comprehensive analysis of overlapping genes in prokaryotes. In particular, the carefully designed statistical analyses on the length and strand of overlapping sequences provide important insights into how different selective forces (i.e. genome minimization and co-regulation efficiency) shaped the evolution of overlapping genes. Overall, I think this is a valuable study that advances our understanding of the evolution of prokaryote genomes.
2. The authors presented the observation that no codirectional (123:123) phase was found as one of major results. Is it due to the bias in our current genome annotation? When a shorter ORF is embedded in a longer ORF with the same reading frame, only the longer one is reported. I also noticed that the embedded genes were not included in the dataset from the beginning. Thus, by definition, the codirectional (123:123) phase was excluded.
This is correct. We chose to study only non-embedded overlaps in order to arrive at a better understanding of differential patterns of phase usage.
3. The study specifically tested the stop codon mutation mechanism, where a mutation in a stop codon leads to read-through, thereby making two genes overlapped. But there is another alternative mechanism: a novel start codon can be created by mutations at the upstream of the second gene, leading to overlapping sequences. Discussion or further analysis on this aspect would be very helpful. The thing is that mutations occur without knowledge of translation orientation, and only their effects are evaluated by selection.
We have wrestled with this problem from the very start of this project. In one earlier version of the manuscript we had a section dealing explicitly with predictions derived from a model of ribosomal frame-shifting. This was eventually removed as there are numerous different motifs that are able to promote a ribosomal skip, and we found that we were unable to calculate the null expectation for overlapping genes under this mechanism without a full look-up table of these sequences. We share the logic that you describe, and this process is often observed in differential gene translation in RNA viruses such as HIV. See the section on stochastic origination.
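The stop-codon read-through mechanism raised in point 3 can be sketched in a few lines. This is an illustration under stated assumptions, not the model used in the paper: the helper names `first_stop` and `readthrough_extension` and the toy sequence are hypothetical, and the standard stop codons TAA, TAG and TGA are assumed.

```python
STOPS = {"TAA", "TAG", "TGA"}  # standard stop codons (assumed genetic code)

def first_stop(seq, start):
    """Index of the first in-frame stop codon at or after `start`, or None."""
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOPS:
            return i
    return None

def readthrough_extension(seq, start):
    """Nucleotides added to an ORF if its stop codon is lost.

    Translation is assumed to continue in the same frame until the next
    in-frame stop codon. Returns None if either stop is missing.
    """
    stop = first_stop(seq, start)
    if stop is None:
        return None
    next_stop = first_stop(seq, stop + 3)
    if next_stop is None:
        return None
    return next_stop - stop

# Toy sequence: ATG AAA TAA CCC GGG TGA TTT
seq = "ATGAAATAACCCGGGTGATTT"
print(readthrough_extension(seq, 0))  # 9: the ORF now ends at TGA instead of TAA
```

If a downstream gene begins within the extended region, the extension creates an overlap whose length depends on where the next in-frame stop happens to fall. The alternative route mentioned by the reviewer, a new upstream start codon in the second gene, would need the analogous scan in the opposite direction.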
4. It is well known that there is a strong bias on stop codon usage and flanking nucleotide composition in most genomes. The comparison on stop codon usage (also flanking nucleotide bias) between overlapping and normal genes may generate some interesting results.
This is an interesting observation, and we have analyzed patterns of stop codon usage in both overlapping and non-overlapping genes. There are some differences, but we are not able to provide an explanation in this paper.