Reviewer's report 1
Scott Roy, Allan Wilson Centre for Molecular Ecology and Evolution, Massey University, Palmerston North, New Zealand. Nominated by Anthony Poole.
"Conserved intron positions in ancient protein modules" by de Roos. The paper takes an ingenious approach to the attempt to distinguish between the introns-late and introns-early perspectives. Much previous evidence for the introns-early theory has relied on the relationship of intron positions to the coding frame of the flanking exonic sequence, or to the three-dimensional structure of the corresponding protein, findings whose existence and relevance have not always been accepted by proponents of introns-late. Avoiding these mine-strewn landscapes, de Roos investigates another prediction of the presence of introns in LUCA – if introns were (primarily) created in ancient times, they should preferentially fall within ancient sequences. However, I am very concerned about various methodological issues, and therefore suspect that the results do not inform the debate.
Author's response
In concordance with the comments of other reviewers, I have focused less on discriminating between introns-late and introns-early and more on support for the Exon theory of genes. I also discussed the potential bias in the results and the way this was investigated.
The author presents three related lines of evidence for pre-LUCA introns. First, conserved introns tend to be found in conserved protein domains. Second, among conserved sequences, those that contain introns are more conserved (i.e. sequence conservation extends further along the protein sequence). Third, conserved introns are preferentially found in genes common to all three domains of life.
An alternative explanation for the first finding, that introns tend to be found in conserved protein domains, is that there is a bias in the finding of 'conserved' introns. Here, conserved introns means those introns that share positions across eukaryotic groups in very highly conserved sequences (for instance 6 residues exactly conserved). Similarly, domains are currently generally defined by sequence conservation across long evolutionary distances. Thus, this finding reduces to: introns in short highly conserved sequences are preferentially found in longer highly conserved sequences. This is not particularly surprising. Without a subtle negative control, it is hard to discern a signal of ancientness above this potential bias.
Author's response
I am aware that the finding of conserved introns in conserved domains does not prove an ancient origin, since the search method could not discriminate directly between an ancient origin and an insertion early in the eukaryotic tree. However, the Exon theory of genes expects conserved introns in ancient modules and the first part of the article set out to find those potential conserved introns. In this way it is a confirmation of the hypothesis, although it cannot be considered evidence. I designed the experiments in Figs 5 and 6 to address this point (see below). The nature of the bias and how this has been investigated is now stated explicitly.
The second finding is curious, but I fear may also be explained by limitations in the methodology. Here, the central finding is that short highly-conserved sequences containing conserved introns are more likely to lie in long highly-conserved sequences than are those that do not contain introns. Given the enormity of the sequence database, it is likely that some highly-conserved sixmers will be false positives – i.e. they will not reflect true homology. The probability of a false positive is greatly reduced by the presence of a conserved intron (the vast majority of conserved-sequence, conserved-intron sequences are likely homologous). Thus, the conserved intron-containing set is likely to contain far fewer false positives than the control set. The higher false positive rate among the control set predicts a lower extent of sequence conservation, just as seen. Thus it is again difficult to discern a signal of ancientness.
Author's response
In order to investigate a potential bias, the experiments in Fig. 5 were designed, where a control situation was created that used exactly the same methodology, but now without an intron. It was shown that sequences without an intron are less often found within a conserved domain or sequence. In other words, if a sample is taken from a genomic database that selects short, highly similar sequences, the gene it is in is more ancient if an intron is also shared between these sequences. This would not be expected when the sequences with an intron are just a subset of a larger set with conserved sequences regardless of an intron.
The assumption that the probability of a false positive is greatly reduced by the presence of a conserved intron was actually tested in the experiments. Although not evidence for the ancient origins, this does strengthen the finding that introns at similar positions in conserved genes are homologous and would be in line with an Exon theory of genes.
The final finding, that intron-containing matching sequences are preferentially found across domains of life, is not statistically significant: the fraction of intron-containing sequences in 'ancient' genes (8/19) is not statistically different from the fraction of intronless sequences (10/51; P = 0.07 by a Fisher Exact Test). Thus while the difference in fractions of ancient genes is suggestive, we cannot currently conclude anything from this finding. In addition, I am concerned that this test may also suffer from problems similar to those discussed above.
Author's response
These sets consisted of sequences where only the splice site sequence matched (2 × 3 residues), but did not show any further substantial gene similarity. It is true that this value is not statistically different, and this is now shown (I found p = 0.053 using Student's t-test). However, there was more than one test that suggested the ancientness: a) overall sequence similarity, b) representatives of ancient protein modules, and c) representatives of protein modules that are shared between prokaryotes as well as eukaryotes. Although these tests cannot be considered truly independent, the set of controls does suggest that the introns are preferentially located in ancient sequences.
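For readers who wish to re-check the significance test discussed in this exchange, the following is a minimal sketch of a two-sided Fisher exact test applied to the counts quoted by the reviewer (8 of 19 intron-containing sequences and 10 of 51 intronless sequences in 'ancient' genes). The function name and the layout of the 2 × 2 table are illustrative assumptions, not part of the manuscript, and the two-sided rule used here (summing all tables no more probable than the observed one) is one common convention.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    Illustrative helper (hypothetical name): rows are intron-containing vs
    intronless sequences, columns are 'ancient' vs other genes.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum the probabilities of all tables no more probable than the observed one.
    return sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Counts quoted in the report: 8/19 with an intron, 10/51 without.
p_value = fisher_exact_two_sided(8, 11, 10, 41)
print(f"two-sided P = {p_value:.3f}")
```

A t-test and an exact test on the same counts need not give identical values, which may account for the small difference between the two P values quoted above.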
In addition, the unorthodox methods used (complete sequence conservation over short sequence fragments) are hard to evaluate given the lack of a history for such methods. It is not clear why these methods have been chosen over more traditional tools, which are abundant. This makes it difficult to interpret the data.
Author's response
I have included a paragraph that discusses the method used, and added the result set of potential conserved introns.
Assuming an 'Exon theory of genes'-scenario (de Roos, 2005), I set out to look for conserved introns. I defined a potential conserved intron just by a shared intron in a short sequence, without any preselection of the type of genes. Since SQL is a powerful tool to extract specific information from large relational data sources, this quickly led to introns that could be considered conserved. Since sequence matching was done using a simple algorithm that compared sequence similarity, it was easy to control the variables and do the experiments with control sequences as in Fig. 5. I believe this is one of the main advantages: the straightforward way of querying makes it possible to design predictive experiments by changing one parameter at a time and use a control situation.
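To illustrate the kind of query described above, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, gene identifiers and residues are all hypothetical and invented for this example; the actual database schema and queries used in the article are not given in this report. The sketch only shows the general idea of a self-join that finds introns flanked by identical short sequences (2 × 3 residues) in a plant and an animal gene.

```python
import sqlite3

# In-memory toy database; schema and rows are made up for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE splice_sites (
        gene       TEXT,  -- hypothetical gene identifier
        kingdom    TEXT,  -- e.g. 'plant' or 'animal'
        upstream   TEXT,  -- three residues immediately before the intron
        downstream TEXT   -- three residues immediately after the intron
    )
    """
)
rows = [
    ("plant_gene_1", "plant", "GDV", "HGQ"),
    ("animal_gene_1", "animal", "GDV", "HGQ"),
    ("animal_gene_2", "animal", "KLM", "HGQ"),
]
conn.executemany("INSERT INTO splice_sites VALUES (?, ?, ?, ?)", rows)

# Self-join: introns whose flanking residues are identical across kingdoms.
query = """
SELECT a.gene, b.gene, a.upstream || a.downstream AS site
FROM splice_sites AS a
JOIN splice_sites AS b
  ON a.upstream = b.upstream
 AND a.downstream = b.downstream
WHERE a.kingdom = 'plant' AND b.kingdom = 'animal'
"""
shared = conn.execute(query).fetchall()
print(shared)
```

Changing one condition at a time in such a query (for example, dropping the requirement of a shared intron) is what makes a matched control set straightforward to construct.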
One further consideration is the usage of the word 'ancient.' This word is traditionally reserved in the context of the intron debate for the period before the divergence of eukaryotes from prokaryotes, and this distinction has become all the more important with the mounting evidence for significant intron presence in early eukaryotes. One way to distinguish the two epochs is between 'ancient' and 'ancient/early eukaryotic'.
Author's response
This is an important distinction and has been made clearer in the article. Although it is difficult to make a distinction between ancient and ancient/early eukaryotic with the experiments presented, the data are in line with an ancient origin (before the pro/eukaryotic split). I assume, however, that eukaryotes were not derived from prokaryotes, but that they both share an ancestor whose genome was eukaryotic-like.
Following a long line of previous results, de Roos identifies intron positions that are shared between broad eukaryotic groups. Previous statistical analyses of the numbers and distributions of these patterns have concluded that a substantial majority of these represent ancestral positions, indicating that there were already substantial intron numbers by the time of divergence. While the eukaryotic groups used here (and elsewhere) may not allow for inferring an intron's presence in the last ancestor of all extant eukaryotes, we can now be confident that there were already a very large number of introns in the last ancestor of eukaryotes, since the spliceosome is broadly shared among eukaryotes, and large numbers of intron positions are shared between potentially early diverging eukaryotes and others (Slamovits and Keeling this year solidified the last link, showing conservation of more than 0.5 introns per gene between an excavate species and 'later-diverging' species, confirming Archibald, O'Kelly and Doolittle's 2002 work).
I bring this up in order to point out two things. First, the debate over the timing of introns is now over whether introns arose in a universal ancestor or between the eukaryotic-archaeal ancestor and the last eukaryotic ancestor. Although findings of conservation of introns or splicing-associated features across eukaryotes underscore the fact that introns are very old within eukaryotes, they do not help us to distinguish the timing of origin of spliceosomal introns. Second, it is important to point out that the findings of large numbers of introns in early eukaryotes are not fundamentally transformative for the introns early/late debate. Though substantial intron presence is clearly resonant with introns-early (and perhaps nearly necessary for the model), it does not contradict the introns-late model. The fundamental tenet of introns-late is that introns arose after the last common universal ancestor, thus intron presence in early eukaryotes only refines the timing of the origin on that model.
Author's response
I rewrote sentences that implied a contradiction of introns-late. Instead I focused on results that are in line with an Exon theory of genes. On the other hand, the patterns of conserved introns in ancient protein modules, phase distributions and correlation with ancientness of genes as seen in this article may also make a mechanistic model for introns-late more difficult and are therefore relevant to the debate.
I think that the most important question in unravelling the origins of introns is whether introns were inserted into preformed genes, or were at the basis of the genes themselves. Insertion of introns at the DNA level seems extremely difficult given the fact that an intron should be excised at the RNA level perfectly in order to be functional. I have not seen a mechanism for gradual spliceosome evolution and intron insertion that could address the negative fitness effects of such a scenario. Moreover, intron insertion models should explain how complex genes arose in the first place without the help of exons. So, as long as the mechanistic puzzles have not been solved, I consider the debate still open.
In light of this, we can ask whether tests such as those undertaken here are likely to shed light on the introns early/late debate. The presence of shared introns in shared eukaryotic genes does not shed light on whether these introns date to earlier epochs. One possible approach would be to compare intron densities between ancient (shared eukaryotic-prokaryotic) genes and ancient eukaryotic genes. However, even in this case a difference would not be conclusive support for ancient introns, since the origin of many ancestral eukaryotic genes may postdate the origin of introns. And this is leaving aside the differential rates of intron loss across genes, the fact that genes that arise by gene duplication (and then diverge beyond recognition) may retain ancestral introns, and difficulties with distinguishing truly eukaryotic-specific genes from the inability of programs like BLAST to detect homology.
Author's response
Another approach may be to first unravel the mechanistic steps in an Exon theory of genes for building a genome, and then compare the construction of eukaryotic genes with those of prokaryotes and see if they are assembled in a similar way. In this respect, the proto splice-sites as proposed by Dibb and Newman [49] could as well represent the remnants of the ancient exon concatenations (see [18]).
Reviewer's report 2
Sandro de Souza, Ludwig Institute for Cancer Research, Sao Paulo, Brazil. Nominated by Manyuan Long
Albert de Roos developed a new way to identify conserved intron positions between taxa from different kingdoms of life. Instead of looking for conserved intron positions in known related genes, de Roos searches for conserved "splice sites" in an intron database. By "splice site", he means a segment of protein sequence (10 aa residues) flanking an intron position. He considered an intron position conserved when this "splice site" was conserved (6 out of 10 residues in that window) between two distantly related protein sequences. The strategy may be interesting since it can identify cases missed by other approaches. Although I recognize the efforts of the author in trying to look at this problem with a different, creative strategy, I found the manuscript hard to follow and highly speculative with many assumptions not supported by the data. Furthermore, since this is a new strategy, the author should give more details about the methodology as well as provide more analyses for his dataset. I will list some issues below on a point-by-point basis to make it easier for the author to reply.
Author's response
I have changed the article to focus more on the data that support a 'very early' origin of introns, but that do not prove introns-early and cannot by themselves disprove introns-late. The article was too strong in that aspect.
Overall, I have many problems with the "Introduction" section, in which many important references were misquoted or ignored.
• For instance, the original quotation for the introns-early theory is given to Gilbert's 86 paper on RNA world and to a review from Gilbert and colleagues in Cell (also from 86). Here the papers cited should be the Nature 1978 paper from Ford Doolittle and the "Exon Theory of Genes" paper from Gilbert in 1987, which best represent the early phases of the introns-early theory.
• Furthermore, the synthetic theory of intron evolution, developed by myself, Scott Roy and Wally Gilbert is completely ignored.
• On a particular point, the papers cited as evidence for a correlation between intron positions and module boundaries are certainly outdated. Papers from de Souza et al (1996), de Souza et al (1998) and Fedorov et al (2001) were ignored but best represent the conceptual aspect explored by the author.
Author's response
I changed the references and rewrote the introduction.
1) In the last paragraph of the "Introduction" section, the author states that the rationale for this manuscript was based on the assumption that "....one would expect to find conserved introns specifically in ancient proteins". In the way it is presented, this sounds wrong. An intron conserved between vertebrates and invertebrates is not necessarily located in an ancient gene. I think the author should be more explicit by saying that this conservation has to be deep among the four kingdoms that he later will mention.
Author's response
I rewrote the introduction and the article focuses on the identification of possible ancient introns and their characteristics.
2) It would be nice to have a more detailed description of the 251 (or 250?) conserved intron positions. The entire list of positions should be available. The level of stringency used in the analysis was not clear either. As far as I understood, a simple match involving proteins from, for example, human and Arabidopsis sufficed to flag that as conserved.
Author's response
I included the entire result set of the potentially conserved introns, as well as the larger set with strongly homologous genes that was cleaned to obtain the 251 potential ancient introns. The flag 'shared between major eukaryotic kingdoms' was indeed based on the occurrence in plants as well as animals. In order to further qualify as 'conserved', the splice site sequence and intron position should be conserved, and the sequences should show more homology further away from the splice site. The stringency is also discussed in the article.
3) Since this is a new method, the author should compare his dataset with other datasets. For example, the dataset reported by Rogozin et al (Curr. Biol 13:1512–1517, 2003) was analyzed by these authors and by Roy and Gilbert (PNAS 102:1986–1991, 2005). The datasets are apparently significantly different. They should be, since the methodology is different, but the author needs to explore these differences. For instance, I am surprised by the low number of conserved intron positions identified by de Roos. In the dataset mentioned above, there were almost a thousand introns conserved between animals and A. thaliana or P. falciparum.
Author's response
I included a comparison of the different methods used and the results that were obtained in both mentioned articles. The method used in my article is quite stringent in that it requires high sequence similarity in a short region. The basic concept of an ancient intron, i.e. an intron at the same position within a similar sequence, was used, so not only the position but also the splice site sequence had to match. I expect that a more BLAST-like search with gapped alignment of homologous amino acids would yield more conserved introns. The scope of the first part of the article was not to get an exhaustive set, but to confirm that they exist and to study some characteristics.
4) Many of the interpretations seem to me quite speculative and not supported by the data. For example, by looking at so many introns, it is expected that some matches will occur just by chance. Some simulations aimed at drawing a threshold for these numbers would be welcome.
Author's response
As is now mentioned more explicitly in the article, there is a potential bias in the search method, which was investigated and shown in Fig. 5 and quantified in Fig. 6. The results supported the hypothesis of the Exon theory, and I have been careful not to suggest that they prove introns-early. See also the discussion of Dr. Roy's review.
5) The analysis on the occurrence of conserved intron positions in ancient genes is presented in a confusing way. I strongly recommend that the author rewrite this. Nevertheless, I have serious concerns about the analysis. First, it seems that de Roos calls a gene ancient when there is some conservation in a window of 10 residues. This window size should be increased to at least the average size of a protein domain. Furthermore, it does not seem that the CDD database was used here.
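As a rough illustration of the chance-match threshold raised in point 4, the sketch below computes the probability that two unrelated 10-residue windows share at least 6 identical positions. It assumes, purely for simplicity, that all 20 amino acids are equally frequent and that positions are independent; real amino acid frequencies are uneven, so this simple model would tend to underestimate the true chance rate. The function name and defaults are hypothetical and this calculation is not part of the manuscript.

```python
from math import comb


def chance_match_probability(window=10, min_identical=6, p_identity=1 / 20):
    """Binomial tail: P(at least min_identical of window positions match).

    Assumes independent positions and a uniform per-position identity
    probability (an idealisation, see the surrounding text).
    """
    return sum(
        comb(window, k) * p_identity**k * (1 - p_identity) ** (window - k)
        for k in range(min_identical, window + 1)
    )


p_chance = chance_match_probability()
print(f"P(>= 6 of 10 identical by chance) = {p_chance:.2e}")
```

Multiplying this probability by the number of pairwise comparisons actually performed would give an expected count of chance matches against which the observed number could be judged.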
Author's response
I rewrote the analysis. I call an intron ancient when it is located in a conserved sequence, i.e. there is splice site similarity (6/10) as well as overall gene similarity as measured over an additional 20 (2 × 10) amino acids, giving an overall similarity of at least 12/30. In my approach this yielded a strong selection of conserved genes, and most matched sequences were members of a conserved protein domain and shared a similar phase.
Author's response
There was no a priori assumption of gene relatedness and the CDD database was indeed not used to find the conserved set, but was used to confirm the ancientness of the set. As Fig. 6 shows, when only an 8/10 splice site similarity was used, most sequences were conserved as shown by the sequence similarity further up and downstream. Lowering this to 6/10 but requiring an additional 6/20 (total 12/30) selected only conserved genes. Even with only a similar splice site (2 × 3 identical) and no further similarity, 42.1% of the sequences were in genes shared with prokaryotes and 52.6% were members of a conserved domain (see text).
6) The interpretation derived from the analyses on the ancientness of the genes identified is kind of circular. Of course, if a condition was established in the early stages of the pipeline that a match would be considered a match only if it involved kingdoms that split a long time ago, I would expect to have my dataset enriched with ancient proteins.
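The similarity thresholds described in this response can be sketched as a simple identity count. The layout assumed here (a 30-residue window in which the central 10 residues are the splice site and the outer 2 × 10 residues are the flanks) and the reading of the criterion as "at least 6/10 in the splice site plus at least 6/20 in the flanks" are assumptions made for illustration; the function names and sequences are invented and the article's actual implementation may differ.

```python
def count_identical(a, b):
    """Number of positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b))


def is_candidate_conserved(seq_a, seq_b, site_min=6, flank_min=6):
    """Apply the 6/10 splice-site and additional 6/20 flank thresholds.

    Assumed layout: residues 0-9 upstream flank, 10-19 splice-site window,
    20-29 downstream flank (ungapped, position-by-position comparison).
    """
    if len(seq_a) != 30 or len(seq_b) != 30:
        raise ValueError("expected two 30-residue windows")
    site = count_identical(seq_a[10:20], seq_b[10:20])
    flanks = count_identical(seq_a[:10], seq_b[:10]) + count_identical(
        seq_a[20:], seq_b[20:]
    )
    return site >= site_min and flanks >= flank_min


# Made-up sequences: identical splice sites, with and without similar flanks.
reference = "ACDEFGHIKL" + "MNPQRSTVWY" + "ACDEFGHIKL"
same_site_only = "WWWWWWWWWW" + "MNPQRSTVWY" + "WWWWWWWWWW"
print(is_candidate_conserved(reference, reference))
print(is_candidate_conserved(reference, same_site_only))
```

Because the comparison is ungapped and position-specific, each threshold can be varied independently, which is the property the control experiments rely on.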
Author's response
The article consists, in a way, of two parts. In the first, I tried to get a set of conserved introns by specifically looking for them. Since I query for similar intron positions between sequences that diverged in the eukaryotic tree, the dataset will be enriched for ancient proteins. I then looked at whether this enrichment is due to the presence of an intron, or solely intrinsic to the method used. As can be seen in the controls of Figs 5, 6 and 7, the presence of an intron makes a difference both in conservation of phase and overall sequence. I conclude that the enrichment of these sequences is due to the presence of the introns, and not a result of the method. In the Results as well as in the Discussion section, I made this clearer.
Reviewer's report 3
Gáspár Jékely, European Molecular Biology Laboratory, Heidelberg, Germany
This paper describes a new method to identify conserved intron positions in distantly related eukaryotic taxa. It is based on looking at short protein sequence stretches around a splice site and finding similar sequences having an intron at the corresponding position, rather than defining and aligning orthologs and then mapping intron positions onto the alignment. Using this method the author identifies 218 introns with the same phase at homologous sites that probably trace back to early eukaryote evolution. The method is novel and interesting. I have some comments on how to improve the presentation and analysis of the results.
1) One problem is that the phylogenetic context is not defined explicitly. It is not enough to refer to kingdoms, especially in the case of protists. If an intron is shared by animals, fungi (and in addition e.g. by choanoflagellate protists), it does not mean that it was there in the eukaryotic common ancestor. The phyletic distribution of each identified intron shared by two or more kingdoms should be checked against a simplified but rooted eukaryote tree including the species under study (best would be to take a tree rooted between Unikonts and Bikonts). Only this way can the author reliably reconstruct conserved intron positions that have most likely been already present in the last eukaryotic common ancestor.
Author's response
In line with the comments of the other reviewers, I have weakened the conclusions about the timing of the origin of introns based on the results presented here. Instead, the article is placed more as additional support for the Exon theory of genes, by the demonstration of conserved introns in ancient genes such as phosphatases.
2) All the identified conserved introns should be shown in a Word or Excel sheet as supplemental information.
Author's response
I have now made the entire set available for download, including a set that still includes homologous genes (e.g. Arac10 and Arac13, or gpc2 and GapC, represent strongly homologous sequences).
3) "Shared introns in ancient protein domains"
When the results of the CDD searches are presented, they should be presented along with a control, preferably as a table or graph. Great care should be taken when designing a control set (or sets). It should be selected based on the same criteria, including the phyletic distribution. It is quite tricky, given that the 'result set' is heterogeneous in this regard. The control should be of similar sample size as well, and given the heterogeneity and the small sample the best would be to sample a control many times independently (e.g. to have five control sets of ~ 200 sequences that show the same degree of sequence similarity across the same phyletic distance as the 'result set'). The selection for highly conserved sequences will obviously result in an enrichment of conserved domains. The design of the controls is crucial for the correct interpretation of the results, and it should be explained in detail in the text.
Author's response
I have added paragraphs explaining the method and results in more detail. The sample size was not correctly given and the sample size for the control was not included, and this was changed. The results in Figs. 3 and 4 do not have a direct control; they represent potential conserved introns and are a different set from that used in Figs. 6 and 7. It was investigated whether this was the result of a bias in the query and just represents a subset of conserved introns in conserved genes. The quantitative experiments in Figs. 6 and 7 show this potential bias in a controlled situation, where the only difference between the sets is the presence of an intron. The positive correlation between presence of an intron and ancientness of the sequence is taken as support for the hypothesis that introns were indeed ancient.
4) "Preferred occurrence of conserved introns in ancient genes"
In the analysis to identify prokaryotic hits, only sequences having "a number of identical amino acids residues of 6 (2 × 3) around a (virtual) splice site" were used. This means 19 sequences. The analysis should best be repeated for the whole set (218 sequences), with the appropriate control set (see above).
Author's response
As can be seen in the graph of Fig. 6, there is a small window where a difference can be seen in ancientness of the genes. A lower similarity (2 or 4 identical residues) would mainly select false positives, while a higher number will select only conserved sequences in both sets. If the whole set is taken, then any effect will disappear in the noise. This represents one of the problems in analyzing intron data: given the enormous intron loss and possibly gain during evolution, together with possible intron sliding and the continuous protein diversification, it is difficult to discern an ancient signal in this noise.
On page 2 the author writes: "The strongest prediction of an introns-early scenario is the presence of conserved introns between orthologous proteins that diverged early in the eukaryotic lineage." I don't think it is correct. If introns originated early during eukaryote evolution, but still later than prokaryotic genes, i.e. not as building blocks of the first genes to be assembled, one would also expect to find many conserved introns between eukaryotic orthologs.
Author's response
I changed the introduction. Although expected in introns-early, conserved introns are compatible with introns-late. As now mentioned in the article in the last paragraph of the Discussion, I believe that the ancestral genome of prokaryotes and eukaryotes was a eukaryotic-like genome with characteristics such as introns and the RNA relics (e.g. the ribosome and the spliceosome). In that respect, early eukaryotic and ancient would be synonymous.
Page 2/3: "It was found that conserved introns are frequently and specifically found in ancient protein domains": the word 'specifically' is too strong here; it would mean that conserved introns are only found in ancient domains, which is not the case (it is 53%).
Author's response
Changed to 'positively correlated with sequence ancientness'.
The word 'ancient' is often used as a synonym of 'conserved' but it is of course not the same thing. E.g. page 5: "Thus, based on these results, introns seem to be preferentially located into ancient genes, in line with an introns-early scenario". These are conserved, rather than necessarily 'ancient', genes.
Author's response
Checked for inappropriate references to ancient and conserved.
page 7: "In conclusion, the data presented here indicate a high occurrence of introns in ancient genes, followed by a massive loss of introns later in evolution. " High occurrence sounds too strong, a total of 218 introns were identified. Intron loss was not addressed in the paper, this would require mapping the distribution of conserved introns onto a phylogenetic tree.
Author's response
I changed this. My interpretation of the data, from an Exon theory point of view, is that the ancient domains contained many introns, since multiple were found in phosphatases for example (Fig. 4). The relatively low number of conserved introns found (251) indicates that either many were lost during evolution, or the sequences diverged and were not picked up by the current method. In the article there are two sets shown. The first consists of a set of conserved introns, based on the requirement that any 6 out of 10 residues should be identical; the second set consists of 5 separate queries in which the exact position of the identical residues was determined.
Page 7: "These results support the idea of ancient introns and are difficult to reconcile with an origin of introns late in evolution, although an insertion directly after the eukaryote-prokaryote split cannot be excluded based on the current results." I would rather say that the results nicely support the ancestral presence of introns in the eukaryote common ancestor. This is interesting enough, and the whole paper and discussion would be much better if the author didn't try to argue too strongly for a model, that is not strongly supported by the data, but rather discuss what one can really conclude from the results.
Author's response
I have shifted the focus from the distinction between introns-late and introns-early (which was not well supported by the data presented) towards support for the Exon theory of genes. I made more explicit that, seen in the light of introns-late, these data would not exclude an introns-late scenario.