Reviewer 1: Dr. Purificacion Lopez-Garcia, Centre National de la Recherche Scientifique, France
This manuscript presents an update of the archaeal cluster of orthologous genes (arCOGs) based on 120 archaeal genome sequences which, although still limited face to the wide archaeal diversity, is a large improvement compared to the initial set of genomes used to define the first arCOGs. The arCOGs become therefore much more comprehensive and, on this sole basis, the work is already important and useful. In addition, the authors take advantage of this analysis to make some inferences about archaeal genome evolution. The authors estimate in this way the number of genes at ancestral nodes (including the last common archaeal ancestor, LACA) and the estimated number of gene gains at different nodes on a guide phylogenetic tree based on highly conserved ribosomal proteins. The number of genes shared by distant archaeal groups is used to evaluate the level of intra-domain gene transfer and compared with that observed in a (smaller) set of bacterial genomes. Overall, I find the article interesting and useful.
I have a few comments that the authors might want to address:
-Wolf et al reconstruct an archaeal phylogeny based on the arCOGs of ribosomal proteins as guide for subsequent inferences of genome evolution. According to the authors the tree agrees with known taxonomic schemes and points to the monophyly of the “TACK” supergroup (Thaumarchaeota –Aigarchaeota – Crenarchaeota - Korarchaeota). This is difficult to say from an unrooted tree, because the root might lie somewhere else in the tree, e.g. between the Korarchaeota and the remaining “TAC” or elsewhere, in the presence of an appropriate outgroup. At any rate, the TACK monophyly appears to be supported by a large number of shared genes (431). At this point, it may be timely to consider a more systematics-oriented comparison in terms of the scale of gene loss and gain and genome evolution between the Euryarchaeota and the “TACK” super-phylum. Are we in front of comparable evolutionary scales in terms of phylogenomic breadth? If that is the case, could the TACK super-phylum be assimilated with Crenarchaeota sensu lato? In other words, could we define a less arbitrary barrier for the phylum level within the archaea based on phylogenomic information? A comment along those lines would be welcome.
Authors’ response: We cannot and should not address issues of formal taxonomy in this article. What really matters is the cladistic pattern; which clades are given the status of “phylum” and which are subsumed within other phyla matters only operationally.
-The authors detect a higher intra-domain transfer in bacteria than in archaea, and do not find it “surprising given the typical greater complexity of bacterial compared to archaeal communities”. However, it is not self-evident that greater community complexity necessarily correlates with more horizontal gene transfer. Most likely, there is an important phylogenetic effect, with some lineages being more prone than others even within bacteria. The observation of a higher intra-domain transfer in bacteria could be biased by the taxonomic sampling and also the more limited sampling size for bacteria. Can these possibilities be discarded?
Authors’ response: The possibility that the results of any comparative genomic analysis are affected by taxonomic sampling bias effectively can never be discarded given that the existing samples represent a tiny slice of the entire existing biodiversity. Nevertheless, some of the sampling bias effects can be mitigated by using the tree-based probabilistic inference of ancestral states and evolutionary events (e.g. the potential effect of the sampling bias introduced by inclusion of 7 strains of Sulfolobus islandicus in our data set is reduced by the fact that their gene complements are not counted independently but are largely shared by the common branch of S. islandicus and S. solfataricus). Regardless, we trust that this comment and response will remind the reader of inevitable incompleteness of comparative genomic analyses.
-The number of inferred ancestral genes for LACA, but also for the last common bacterial ancestor is relatively high and the number of gene losses outnumbers gene gains by at least a factor of 4. Given the importance of horizontal gene transfer in prokaryotic evolution and the fact that many genes evolve by duplication and neofunctionalization, the idea of a gene-rich ancestral genome evolving by important gene loss seems a bit odd. Is this real or does this result from our incapacity to identify gene transfer events and/or failure to identify some type of gene coalescent processes in the past? What are the potential biases of this inferred value?
Authors’ response: The idea of a deep ancestor that was more complex than an average modern member of the descendant clade probably is “a bit odd” (or even more than a bit) to many biologists. We submit, however, that this oddity is primarily a vestige of the traditional belief in the increase of complexity with evolutionary time, a notion that might be intuitive but is not actually borne by any substantial evidence[54]. Given the importance of genome streamlining, especially in the evolution of prokaryotes, there is no reason to a priori dismiss the possibility that gene loss on average outweighs gene gain caused by HGT as well as duplication followed by neo/subfunctionalization (the latter process being relatively less important in prokaryotes). The inference of the complex ancestor is “real” in the sense that this is the maximum likelihood estimate based on the best possible estimation of the parameters of the evolutionary process (gene gain and loss rates) from the available data on phyletic patterns. Thus, HGT is taken into account implicitly here, under the gene gain rate. What is meant by “some type of gene coalescent processes in the past” is not quite clear to us. Admittedly, the current inference methodology is somewhat crude, being based on phyletic patterns and not taking into account the phylogenies of individual genes. Such a comprehensive analysis is beyond the capabilities of the currently accessible maximum likelihood models. However, given the apparent (approximate) randomness of the HGT processes, it appears highly unlikely that if and when such analysis becomes feasible, it changes the conclusions qualitatively. With regard to the possible biases, this again goes back to the taxonomic sampling and its potential incompleteness. If tomorrow a deep branch of Archaea with very small genomes is discovered, this will result in a downward reassessment of the complexity of LACA.
The inference of complex deep ancestors is neither entirely new nor limited to the evolution of Archaea. Many evolutionary reconstructions and theoretical models of genome evolution point to extensive gene loss as one of the key evolutionary processes[55–57]. Perhaps particularly impressive are reconstructions of the ancestral gene structure in eukaryotes that suggest highly complex, intron-rich ancestral genomes[58, 59]. Generally, it appears to be a distinct possibility that genome degradation and streamlining is one of the quantitatively dominant modalities of evolution.
Minor spelling errors:
Page 7, bottom, Thaumarchaeota (not Taumarchaeota), Candidatus (only Candidatus in italics) Caldiarchaeum subterraneum (not subterreneum)
Authors’ response: Errors fixed.
Reviewer’s response: Dear Editorial Team,
I appreciate the response of Wolf et al. and find their manuscript suitable for publication. However I have an additional comment as follows: It seems to me that the authors are avoiding answering my previous question "Are we in front of comparable evolutionary scales [between Euryarchaeota and the TACK superphylum] in terms of phylogenomic breadth?" This is not a taxonomy question. It is a comparative genomics question whose answer might have implications for Systematics (it would correspond then to taxonomists to take this information into account for their work), but not exclusively. It would also be important to determine patterns and eventually unravel processes of archaeal evolution within the two major clades put forward by the authors. Actually, Wolf and colleagues insist in the number of shared arCOGs as a signature for the TACK superphylum, but then refrain themselves from going further and explicitly make a direct comparison with the situation in the Euryarchaeota.
The authors may choose not to comment on this and stay on the simple description of arCOGs and their distribution (which is already very useful). However, this is a pity in when they have the data at hand. It is also at odds with the more general ambition of the manuscript to extract evolutionary information about archaeal genome evolution (which they do when focusing e.g. on horizontal gene transfer and gene gain/loss).
Authors’ response to reviewer’s response: We appreciate the importance of the reviewer’s question but continue to believe that this is outside of the immediate scope of our work.
Reviewer 2: Prof. Patrick Forterre, University Paris XI, France
The archaeal domain is presently the only one for which a robust evolutionary history based on phylogenies of large concatenations of protein markers is available [60]. This is probably because the number of currently sequenced genomes is not yet very large, or because the different archaeal phyla and orders diverged less rapidly than bacterial and eukaryal ones, reducing the “radiation” effect that makes difficult the resolution of the most basal nodes in the global bacterial and eukaryal phylogenies. Therefore, I was really interested to read this paper describing comparative genomic results obtained from an updated version of the arCOGs database. The new database contains 120 archaeal genomes, a significant improvement compared to the 41 genomes used in the first version (2007). The results obtained are in line with previous ones [9, 30], suggesting a rather complex Last Archaeal Common Ancestor (LACA) and an important streamlining component in archaeal evolution, since gene losses outnumber gene gains by at least four to one.
An essential starting point of all analyses that seek to estimate ancestral gene contents, and subsequent gene losses and gene gains is the availability of a solid reference phylogeny. Therefore, I have two important concerns on this analysis:
1) Concatenation of ribosomal proteins has been widely used over the past ten years to reconstruct a robust reference phylogeny of the Archaea [60]. Therefore, it seems of no use to recalculate it, all the more if this is done by using a fast and approximate tree-building algorithm. The authors should rather use a sketched phylogeny that makes a consensus of previous analyses. The tree presented here is probably not the best one, as suggested by the branching of Nanoarchaeota between Euryarchaeota and other Archaea. Nanoarchaeum equitans is more likely a fast evolving Euryarchaeote probably related to Thermococcales [60, 61]. In addition to its position in the updated ribosomal protein trees of Brochier et al. [60], the grouping of Nanoarchaeota with Thermococcales is supported by several other phylogenies, as well as a clear-cut synapomorphy corresponding to the transfer of a bacterial tRNA modification enzyme to the Nanoarchaea/Thermococcales clade [61, 62].
Authors’ response: We fail to see good reasons to use a “sketched phylogeny” from previous studies. Our phylogenetic reconstruction utilizes the largest and most thoroughly curated data set available to date and employs a powerful and robust phylogenetic method[51, 52]that, in our experience, produces better results than most allegedly “precise” ML implementations. The results described here fully agree with our previously published reconstructions[39]. We believe that, although many archaeal branches are well established so that the results of the reconstruction should not dramatically depend on the tree topology, the strategy that we used is the best current choice to optimize the reconstruction.
2) Even more problematic, the authors have artificially puts the root at the junction point between Nanoarchaeota, Euryarchaeota and the recently proposed “TACK superphylum”, i.e. a group encompassing Thaumarchaeota, Aigarchaeota, Crenarchaeaota, and Korarchaeota [40]. However, because the rooting of the archaeal tree has not been firmly established, the very existence of the “TACK superphylum”, -since it implies to root the archaeal tree in the euryarchaeal branch- remains to be confirmed [63]. Therefore, the choice of rooting the archaeal tree between the “TACK superphylum” and Euryarchaeota appears arbitrary, and the authors should definitely test alternative rootings, i.e. in all branches leading to the major archaeal phyla (T-C-K-E).
Authors’ response: Given the high computational demands on the ancestral reconstruction method (more than a week of wall-clock time), we find it impractical to test multiple root positions. See below the argument for our choice of the root position for the purpose of reconstruction.
I would be especially interested by comparative analyses based on rooting the archaeal tree in the branch leading to Thaumarchaeota (including Aigarchaeota, see below). This rooting was obtained for the ribosomal protein tree using a eukaryotic outgroup [64]). Such rooting makes sense, considering that Thaumarchaeota exhibits several important eukaryotic features that are not present in other Archaea (Topo IB, a monomeric RNA polymerase A subunit, the presence of both RPA and “cren” SSB proteins) [64, 65]. The use of eukaryotes as an outgroup to root the archaeal tree is reasonable since eukaryotes most likely emerged either from a deeply branching archaea or from a proto-eukaryote, but not within lineages leading to modern Archaea. Indeed, although the situation is not yet settled, it is difficult to imagine the emergence of eukaryotes from a reduced organism, such as a modern Archaea or even LACA. I know that some phylogenetic analyses suggest that Archaea emerged from within Archaea. However, I don’t trust these analyses. The use of universal proteins to root the archaeal tree is problematic because archaeal and eukaryotic proteins are quite divergent from their bacterial homologues. As a consequence, archaeal rootings obtained with universal trees are often contradictory and cannot be taken for granted [66].
Authors’ response: The reviewer briefly but impressively describes the multiple controversies around the topology of the archaeal tree and the position of eukaryotes. Being well aware of these problems, we deliberately chose to avoid adhering to any particular hypothesis and instead to root the archaeal tree in the least controversial position. Both Euryarchaeota and the TACK members have a large arCOG repertoire specific to the respective group and hence representing likely derived shared characters (32 arCOGs are present in >90% of Euryarchaeota to the exclusion of TACK and 73 arCOGs are present in >90% of Crenarchaeota and all Thaum-, Aig-and Korarchaeota to the exclusion of Euryarchaeota). In our reconstruction, Nanoarchaeon is derived directly from LACA, minimizing the effect of its highly reduced gene repertoire on the reconstruction of the other clades.
In any case, since their phylogeny is arbitrarily rooted, the authors cannot claim that their analysis confirms the existence of a monophyletic “TACK” superphylum!!
Authors’ response: Indeed, we point out in this article that “the inference of gene gain depends on tree topology and therefore cannot be construed as direct evidence of the monophyly of any group”. Nevertheless, we indicate that the large number of shared gene gains makes the monophyly of the TACK a plausible hypothesis (in a weak Popperian sense, the results of our analysis fail to provide any evidence against the TACK monophyly).
The name TACK itself raises problem because, in my opinion, Aigarchaeota (A) should be better considered as an order of the phylum Thaumarchaeota. Caldarchaeum subterraneum, the only member of this new order indeed forms a robust monophyletic group and share similar genomic signatures with other Thaumarchaeota [67]. If the status of phylum is delivered to “Aigarchaeota”, Crenarchaeota and Euryarchaeota should be divided into several new phyla, since Thaumarchaeota and “Aigarchaeota” are more closely related in term of distance than several internal groups of Crenarchaeota and Euryarchaeota in ribosomal protein trees [60].
Authors’ response: This is an issue of formal taxonomy. We have neither intent nor authority in this paper to offer any judgment on the appropriate taxonomic level of any of the (putative) major archaeal clades discussed here. We refer to “Aigarchaeota” (following the literature[40, 68]) for convenience of discussion of the hypothetical TACK superphylum.
Finally, I think a bit confusing to talk of gene gain in the case of LACA (the obvious overall winner). Of course, starting from the origin of life, gene gain obviously occurred before any streamlining. However, talking about gene gain for LACA seems implying that LACA was positioned exactly at the transition point between gene gains and gene losses in the evolution of the archaeal (and pre-archaeal) lineages. This would not be correct. In fact, LACA was no more a transition point in archaeal evolution than the African eve in Homo sapiens evolution. The streamlining tendency observed by Wolf and colleagues (and others) in archaeal evolution probably taken place not only between LACA and modern Archaea, but between the last common ancestor of Archaea and Eukarya (LCAAE) and modern Archaea. For me, this fits well with the idea that reductive evolution was instrumental in shaping the archaeal (and bacterial genomes) by streamlining starting from a more complex LCAAE, possibly via thermoreduction [69, 70].
Authors’ response: The gene repertoires of both LUCA and LAECA/LCAAE (if such an organism existed outside of the diversity of modern Archaea) are outside the scope of the present work. We do not assign “transition point” status to any of the common ancestors of the observed clades; the ancestors simply are operationally defined by the relationships between these clades themselves. Formally, the genes that “appear” in LACA under our reconstruction have to come from somewhere; thus, they are considered gains in LACA in a purely formal sense, regardless of their history outside of the modern Archaea.
Finally, the work of Wolf and colleagues reveals that HGTs appears to have occurred largely randomly during the evolution of Archaea, with few exceptions. This again shows that HGTs are not a major problem for tree reconstruction. This is refreshing in the framework of the debate between tree-thinkers and web-thinkers [71]. However, the existence of a few exceptions” reminds us that one should be careful (taking these exceptions into account) in performing global phylogenetic analyses based on whole genome trees.
Reviewer’s response: I understand that it would be impractical for this particular work to test multiple root positions and I suspect that the result should not be very different. I hope that for the next update, it will be possible to use a well supported rooting, beyond the “TACK superphylum”.
I also can understand that the authors prefer to use their phylogenetic reconstruction. However, in that case, I would like to remind that recovering the best tree is not only the problem of the phylogenetic method used, but of the critical incorporation in a consensus phylogeny of different information (especially removing genes affected by HGT and fast evolving lineages at once to test different positions). For instance, to recover the “most likely” correct position of Nanoarchaea, it was first necessary to analyze all individual ribosomal protein phylogenies to realize that the correct result was probably the grouping of nanoarchaeota + Thermococcales [61]. This grouping was then support by additional phylogenies (reverse gyrase, Topo VI, elongation factors) and a synapomorphy based on a tRNA modification protein. It’s the reason why I think that a consensus phylogeny based on the work of several groups should be preferred for the kind of work presented here. In any case, the update arcCOG database will be a powerful addition to genomic tools available to study archaeal evolution and I am fine with the revised version.
Authors’ response to reviewer’s response: We will definitely revisit the question of the archaeal core phylogeny when working on the next arCOGs update. We would like to add a word of caution regarding consensus phylogenies: there is a body of evidence indicating that the “supermatrix” approach (concatenated alignments) provides more robust phylogenetic reconstructions compared to the “supertree” (consensus) methods[72–75].
Reviewer 3: Dr. Pascal Lapierre, University of Connecticut, USA (nominated by Prof. J Peter Gogarten, University of Connecticut, USA)
Review for the manuscript titled “The updated clusters of orthologous genes for Archaea: a complex ancestor of the archaea and the byways of horizontal gene transfers”
This manuscript by Wolf et al. is presenting the updated version of the clusters of orthologous genes for Archaea (arCOGs), as well as a detailed analysis of the genomic evolutionary history of the archaeal domain. This new version of arCOGs now includes data from 120 archaeal genomes (up from 41), segregated into 10,335 clusters of orthologous genes. Based on the genomic distributions of these gene families, Wolf et al. were able to determine that the last common archaeal ancestor was more complex and that genome streamlining lead to the smaller genomes found in most of the modern archaea. They also found that gene gain through horizontal gene transfers across archaeal species did not shown any preferred highway of gene sharing. Cluster of orthologous gene databases have been proven to be a valuable tool for genomic analysis. This updated version will help increase the accuracy and reliability of genome annotation and functional and comparative genomics. There are however, a few questions and concerns about the analysis and conclusions in this paper.
1) Regarding the arCOGs core and shell compositions, the authors are saying that the archaeal “core” genome went from 230 arCOGs in 2007 to 220 arCOGS, while the number of arCOGs present in the “shell” stayed unchanged to about 2000 arCOGs. What happened to the 10 extra core arCOGs that were present in 2007? Shouldn’t they have move to the “shell” category if they were no longer highly conserved amongst the archaeal genomes?
Authors’ response: The numbers are approximate (see below on the nature of the estimates) and are given to the 2 significant digits. Although the 10 arCOGs that disappeared from the core, indeed, most likely moved to the shell category, the size of the latter partition remains ~2200 because reporting it as 2210 would be excessively precise and beyond the power of the present analysis.
2) On the estimation of LACA genome size, I found conclusions based on phyletic patterns are difficult to accept, mainly because ancestral genomes reconstruction always tends to be larger than the extant genomes. If HGTs are rampant and probably were so in the past, the phyletic patterns that we see today for most of the proteins impacted by frequent transfers will fool any attempt to accurately reconstruct ancestral genome history. Can the authors comment on this?
Authors’ response: It is not the case that “ancestral genomes reconstruction always tends to be larger than the extant genomes”. This is observed only when gene loss is arbitrarily assumed to be more common than gene gain (in particular, via HGT)[8]. Quite the contrary, simple reconstruction methods such as maximum parsimony tend to yield simple ancestors. That said, theoretically, the situation where rampant HGT totally erases the historical signal and makes any attempt to “accurately reconstruct ancestral genome history” moot, is possible although the main effect of HGT is shrinking rather than expansion of the ancestral gene set. However, observation of real phyletic patterns makes this possibility highly unlikely. In particular, the prevalence of single-gain patterns implies an uncanny match between the pattern-inferred and sequence-inferred histories which is improbable under the HGT-saturated model.
Reviewer response: In my application of ml ancestral state reconstruction, gene presence in the ancestral state often is favored, because the gene gain and loss are assumed to be constant throughout the tree. However, this often made assumption is almost certainly false. One possible reason is that genes, which were invented later in evolution, cannot have been gained early on. I did not yet test the program used by the authors; however, I suspect that the use of too simple a model assuming uniformity in rates might be the reason for the observed complex archaeal ancestor. It certainly is possible that evolution went from complex to simple inside the archaeal domain; however, analysis of phyletic patterns alone in the absence of a test for the impact of model misspecification, and without corroborating phylogenetic evidence, only results in a preliminary finding that remains highly questionable.
Authors’ response to reviewer’s response: Fortunately, the model behind the COUNT program is more sophisticated than that. First, the gain and loss rates are not assumed to be constant, but are estimated for the tree branches separately. Second, the deep ancestral state is always set to zero and the gain on the branch leading to the LCA is no more favored than the data dictates. Taking into account the data beyond the phyletic patterns (individual gene phylogenies) might, in principle, provide even better results; unfortunately, this is currently beyond practical state of the art as we understand it.
3) In Figure 1, the boundary between which proteins belong to the shell and which belong to the variable cloud is somewhat fuzzy. The authors should comment on the following. Depending upon where one sets this boundary, how would it affect the size estimation of the ancestral archaeal genome (LACA)? Similarly, would lumping together non-identical patterns artificially increase the size of LACA?
Authors’ response: The boundary between the shell and the cloud, as well as that between the core and the shell, includes uncertainty (is fuzzy) by definition. The reported numbers are the integrals of the “core”, “shell” and “cloud” exponents, not a result of counting families within arbitrarily defined boundaries. Redefining these boundaries would have no effect on LACA size and content estimates because the estimates are derived from specific phyletic patterns regardless of the arbitrary “core”, “shell” and “cloud” status.
We are not sure about the meaning of the rest of this comment. We do not lump patterns together whether they are identical or non-identical; neither can we see any rationale for doing so. Therefore we cannot venture to guess how this procedure would affect the LACA size estimate.
Reviewer response: In lumping patterns together, I was referring to the sentence in the text where you say “The rapidly increasing proportion of unique phyletic patterns calls for a more coarse-grained comparison whereby non-identical but similar patterns are treated as members of the same group”. My concern is that depending on which cutoff you use to determine if patterns are similar enough to be considered as part of the same group, there is a possibility that those patterns, if analyzed individually, may yield different results, or at least have different probabilities of being present in LACA.
Authors’ response to reviewer’s response: There was no “lumping” with arbitrarily set cutoffs to “determine if patterns are similar enough to be considered as part of the same group”. All patterns were analyzed individually. The required “coarse-graining” of the analysis emerged naturally by concentrating on the more biologically interesting patterns of inferred gains and disregarding the less interesting pattern of losses.
4) The final claim of a LACA with ~2600 genes is a little optimistic. This number is based on the ML estimated genome size of ~1725 arCOGs, of which only about half have a p-value > 90%, with arbitrarily added numbers of paralogs and transient genes. There should be a better explanation of how these numbers were determined and how confident the authors about the conclusions.
Authors’ response: In the extant genomes, on average, the ratio between the number of genes in the genome and the number of detected arCOGs is approximately 1.5. Because we do not have any reason to believe that LACA was qualitatively different from modern Archaea, we used this ratio to arrive to ~2600 genes from ~1725 ancestral arCOGs. The latter number does not depend on the confidence level assigned to each particular gene family but rather represent the sum of posterior probabilities. For example, 8 genes each with a 25% chance of being present in LACA, together would contribute ~2 genes to the estimate of the LACA family set.
Reviewer response: I think one should be cautious in making assumptions about the state of LACA based on observation from modern genomes without corroboration using other methods than phyletic patterns alone. You arrive at a number of about 1725 ancestral arCOGs, of which only about 850 arCOGs can be traced back to the ancestral genome with good confidence level. The other halves are not well substantiated extrapolation. In addition, if you are overestimating the number of arCOGs in LACA because of flaws in the model used, the size of LACA would be even smaller. The best thing to do in my opinion would be to build phylogenetic trees from these arCOGs and to compare them to the reference phylogeny to determine if they follow a vertical line of descent or not. Only then could one make reasonable conclusions about ancestral genome content.
Authors’ response to reviewer’s response: Not “making assumptions about the state of LACA” was exactly our motivation. As mentioned earlier, we have no reason to believe that LACA was qualitatively different from modern Archaea; thus, the natural assumption about LACA is that it is similar. The notion that only 850 genes are “confidently” (i.e. with posterior probability exceeding some arbitrary cutoff) traced to LACA is not quite relevant. In fact, 1725 families is the maximum likelihood estimate of LACA size regardless of which (and how many) individual gene families contributed what fraction of posterior probability to the final estimate. Again, detailed analysis of individual gene histories might improve the estimates for individual genes but unless the collective shift in these estimates would turn out to be massively asymmetric, the overall estimate would change very little.
5) On the phylogenetic tree in Figure 2, it would be advisable to add support values to the tree to have a better idea of the reliability of the reference phylogeny since any misplaced branches can greatly influence the final results.
Authors’ response: First, fortunately, misplaced branches tend to have only a minor effect on reconstructions because of the tendency of reconstruction errors to be confined to short internal branches where the number evolutionary events is usually small. Rearrangement of such branches affects the inference of events only minimally and locally. Second, given the size of the concatenated ribosomal protein alignment, confidence levels for the tree branches are mostly inflated. In the tree in Figure2, all branches have reported bootstrap support values >0.9 which is most likely overly optimistic. The sole exception is the branch of Methanococcales whose position relative to its presumed sister group and to Methanobacteriales is effectively unresolved (support value of 0.23).
6) Figure 4. Do the different box sizes on the tips of the tree having any meaning? If so, please explain.
Authors’ response: Single-size (square) boxes are used for branches ending in single genomes and is always colored uniformly. Double-size (rectangular) boxes are used for compressed clades and usually contain the color transition between the smallest and the largest genome in the clade. We added this information to the figure legend.
7) Figure 6. Is there any relevance to the different line colors depicting the byways of gene sharing? There are a few byways that seem to defy logic. One example is the byways linking the Methanomicrobiales to their own ancestor near the base of the tree. The authors should explain the cause of a set of genes travelling back or forward in time.
Authors’ response: The line colors in Figure6are not informative and are used for better visual separation only (we added the explanation to the figure legend). It should be noted that byways shown in this picture connect the nodes where gains have occurred; they do not depict the actual transfer paths (the latter cannot be derived from phyletic patterns in principle). A connection between an ancestor and its descendant is inferred when a gene family appears to be gained in some deep clade, lost subsequently, and then re-gained by a shallow descendant of the same clade. Any member of the donor clade, extant or extinct, that is contemporary to the ancestor of the acceptor clade, can be the source of the transfer. Again, the exact path cannot be derived from phyletic patterns only.