Clusters of orthologous genes for 41 archaeal genomes and implications for evolutionary genomics of archaea
© Makarova et al. 2007
Received: 02 November 2007
Accepted: 27 November 2007
Published: 27 November 2007
An evolutionary classification of genes from sequenced genomes that distinguishes between orthologs and paralogs is indispensable for genome annotation and evolutionary reconstruction. Shortly after multiple genome sequences of bacteria, archaea, and unicellular eukaryotes became available, an attempt at such a classification was implemented in Clusters of Orthologous Groups of proteins (COGs). Rapid accumulation of genome sequences creates opportunities for refining COGs but also represents a challenge because of error amplification. One of the practical strategies involves construction of refined COGs for phylogenetically compact subsets of genomes.
New Archaeal Clusters of Orthologous Genes (arCOGs) were constructed for 41 archaeal genomes (13 Crenarchaeota, 27 Euryarchaeota and one Nanoarchaeon) using an improved procedure that employs a similarity tree between smaller, group-specific clusters, semi-automatically partitions orthology domains in multidomain proteins, and uses profile searches for identification of remote orthologs. The annotation of arCOGs is a consensus between three assignments based on the COGs, the CDD database, and the annotations of homologs in the NR database. The 7538 arCOGs, on average, cover ~88% of the genes in a genome compared to a ~76% coverage in COGs. The finer granularity of ortholog identification in the arCOGs is apparent from the fact that 4538 arCOGs correspond to 2362 COGs; ~40% of the arCOGs are new. The archaeal gene core (protein-coding genes found in all 41 genomes) consists of 166 arCOGs. The arCOGs were used to reconstruct gene loss and gene gain events during archaeal evolution and gene sets of ancestral forms. The Last Archaeal Common Ancestor (LACA) is conservatively estimated to possess 996 genes compared to 1245 and 1335 genes for the last common ancestors of Crenarchaeota and Euryarchaeota, respectively. It is inferred that LACA was a chemoautotrophic hyperthermophile that, in addition to the core archaeal functions, encoded more idiosyncratic systems, e.g., the CASS systems of antivirus defense and some toxin-antitoxin systems.
The arCOGs provide a convenient, flexible framework for functional annotation of archaeal genomes, comparative genomics and evolutionary reconstructions. Genomic reconstructions suggest that the last common ancestor of archaea might have been (nearly) as advanced as the modern archaeal hyperthermophiles. ArCOGs and related information are available at: ftp://ftp.ncbi.nih.gov/pub/koonin/arCOGs/.
This article was reviewed by Peer Bork, Patrick Forterre, and Purificacion Lopez-Garcia.
A robust classification of genes based on accurately deciphered evolutionary relationships is the cornerstone of comparative and evolutionary genomics. Such a classification is indispensable both for the functional annotation of sequenced genomes and for any genome-wide evolutionary reconstruction. The construction of an evolutionary classification of genes is a non-trivial task because of the complexity of homologous relationships between genes. The two principal classes of homologs are orthologs and paralogs. Orthologs are homologous genes that evolved via vertical descent from a single ancestral gene in the last common ancestor of the compared species. Paralogs are homologous genes, which, at some stage of evolution, have evolved by duplication of an ancestral gene [1, 2]. Orthology and paralogy are intimately linked because, if a duplication (or a series of duplications) occurs after the speciation event that separated the compared species, orthology becomes a relationship between sets of paralogs, rather than individual genes (in which case, such genes are called co-orthologs).
Correct identification of orthologs and paralogs is of central importance for both the functional and the evolutionary aspects of comparative genomics. Orthologs typically occupy the same functional niche in different organisms; by contrast, paralogs evolve to functional diversification as they diverge after the duplication [3, 4]. Therefore, the accuracy of genome annotation critically depends on the accurate identification of orthologs. A clear demarcation of orthologs and paralogs is also required for constructing evolutionary scenarios which include, along with vertical inheritance, lineage-specific gene loss and horizontal gene transfer (HGT) [6–8].
In principle, orthologs, including co-orthologs, should be identified by means of phylogenetic analysis of entire families of homologous proteins in the compared genomes, which is expected to define orthologous protein sets as clades. However, for genome-wide protein sets, such analysis remains extremely labor-intensive, and error-prone as well. Accordingly, procedures have been developed for identification of sets of likely orthologs without an explicit referral to phylogenetic analysis. These procedures are based on the notion of a genome-specific best hit (BeT), i.e., the protein from a target genome that is most similar (typically, in terms of similarity scores computed using BLAST or another sequence comparison method) to a given protein from the query genome [10, 11]. The assumption central to this approach is that orthologs have a greater similarity to each other than to any other protein from the respective genomes. When multiple genomes are analyzed, pairs of probable orthologs detected on the basis of BeTs are combined into orthologous clusters represented in all or a subset of the analyzed genomes. This approach, amended with additional procedures for detecting co-orthologous protein sets and for treating multidomain proteins, was implemented in the database of Clusters of Orthologous Groups (COGs) of proteins [11, 12]. The latest COG set released in 2003 includes ~70% of the proteins encoded in 69 genomes of prokaryotes and unicellular eukaryotes. The COGs have been employed for functional annotation of newly sequenced genomes (e.g., [14, 15]), comparative analysis of gene neighborhoods [16–18] and other types of connections between genes, as implemented in the widely used STRING tool, target selection in structural genomics, and various genome-wide evolutionary analyses [7, 8]. Independently, other groups have developed similar methodologies for identification of orthologs and paralogs in pairwise or multiple genome comparisons [21, 22].
Very recently, a major effort on automatic construction of sets of orthologous genes has culminated in the EggNOG database, which employed the COGs as a prototype and a seed.
The methods for the construction of COGs were developed and originally applied to small sets of genomes; these and other related methods do not guarantee correct identification of the paralogous and orthologous relationships, due to the variability of domain architectures of proteins, differential loss of paralogs in different lineages, extreme divergence of some orthologous and paralogous genes, and other complications [2, 12, 13]. The computational cost of exhaustive genome comparisons also grows almost prohibitively with the steep increase in the number of sequenced genomes, which approached 500 in the beginning of 2007. Thus, several smaller scale studies have been conducted in which COGs were constructed for compact groups of bacteria including the Thermus-Deinococcus group, Cyanobacteria, and Lactobacillales. In each of these analyses, a considerably better resolution of the homologous relationship than in the overall COG set has been achieved.
In the previous comparative-genomic analyses of archaea, we delineated COGs for this domain of life and used them to partition archaeal genes into the evolutionarily stable, conserved core and the "shell" of genes that are often lost during evolution or are characteristic of a narrow group of species; we further traced how the number of core genes dropped as additional archaeal genomes were sequenced [28, 29].
Here we present the updated set of COGs that includes 41 sequenced archaeal genomes and delineate the core sets of genes that are represented in all archaea or in the major archaeal divisions, Euryarchaeota and Crenarchaeota. We further describe evolutionary reconstructions aimed at inferring the nature of the Last Archaeal Common Ancestor (LACA) and other ancestral forms, and uncovering the trends of gene loss and gain during archaeal evolution.
The 41 archaeal genomes included in the arCOGs
Genome size, Mb
Number of annotated protein-coding genes
Life style and other features
Aerobic chemorganotroph, sulfur enhances growth
Caldivirga maquilingensis IC-167
Moderate acidophile, heterotroph, anaerobe or microaerophile
Moderate psychrophile, uncultivated symbiont of sponges
Hyperthermophilic neutrophile, anaerobe
Facultative nitrate-reducing anaerobe
Pyrobaculum calidifontis JCM 11548
Same as Pyrae
Pyrobaculum islandicum DSM 4184
Same as Pyrae
Staphylothermus marinus F1
Anaerobic submarine heterotroph
Sulfolobus acidocaldarius DSM 639
Sulfur-metabolizing chemorganotroph, thermoacidophilic, motile aerobe
Same as Sulso
Thermofilum pendens Hrk 5
Facultative hydrogen-sulfur autotroph, anaerobe
Motile, anaerobic, sulfate-reducing chemolitho- or chemoorgano-autotroph
Haloarcula marismortui ATCC 43049
Chemoorganotrophic obligate halophile
Aerobic chemorganotroph, obligate halophile, proteolytic, motile, with cell envelope; 2 extrachromosomal elements
Halophilic, aerobic heterotroph
Chemolithoautotroph, strict anaerobe, nitrogen-fixing methanogen
Methanococcoides burtonii DSM 6242
Psychrotolerant, strictly anaerobic, slightly halophilic methylotroph
Chemolithoautotrophic, strictly anaerobic, motile methanogen, 2 extrachromosomal elements
Methanococcus maripaludis C5
Mesophilic hydrogenotrophic, nitrogen-fixing methanogen
Methanococcus maripaludis S2
same as MetmC
Methanocorpusculum labreanum Z
Strictly anaerobic, CO2 fixing methanogen
Methanoculleus marisnigri JR1
Strictly anaerobic methanogen
Chemolithoautotrophic, strictly anaerobic methanogen, high intracellular salt concentration
Methanosaeta thermophila PT
Strictly anaerobic methanogen
Chemolithoautotrophic, anaerobic, nitrogen-fixing, versatile methanogen, motile, forms multicellular structures
Methanosarcina barkeri fusaro
Same as Mac
Same as Mac
Methanogen, human intestinal inhabitant
Methanospirillum hungatei JF-1
Strictly anaerobic methanogen
Picrophilus torridus DSM 9790
Extremely acidophilic moderate thermophile
Same as Pho
Same as Pho
Anaerobic, motile heterotroph
Thermococcus kodakaraensis KOD1
Chemorganotrophic, thermoacidophilic, motile facultative anaerobe
Same as Tac
Uncultured methanogenic archaeon
Methanogen isolated from rice rhizosphere
Obligate symbiont of the crenarchaeon Ignicoccus
The 10 most common phyletic patterns in the arCOGs
Number of arCOGs
Metac, Metba, Metma
Halma, Halsp, Halwa, Natph
Sulac, Sulso, Sulto
Pyrae, Pyrca, Pyris, Thete
Pyrab, Pyrfu, Pyrho, Theko
Picto, Theac, Thevo
From the inception of the COG methodology, it had been realized that COGs have potential for straightforward evolutionary-genomic applications. One of these is the construction of gene-content trees whereby the phyletic patterns of COGs are converted into a distance matrix between the analyzed genomes, with an appropriate normalization for genome size [37, 38, 40] (see Materials and Methods).
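The conversion of phyletic patterns into a distance matrix can be illustrated with a minimal sketch. The normalization used for the arCOG gene-content tree is not given in this passage, so the Jaccard distance below (shared clusters divided by the union of clusters) is an assumption chosen only because it is a common way to correct for genome size. The function name, the cluster identifiers and the toy patterns are hypothetical.

```python
from itertools import combinations

def gene_content_distances(patterns):
    """Pairwise gene-content distances between genomes.

    patterns: dict mapping a cluster id to the set of genome codes in which
    the cluster is present (its phyletic pattern).
    Returns a dict {(genome_a, genome_b): distance} with genome_a < genome_b.
    ASSUMPTION: Jaccard distance is used as the size normalization.
    """
    genomes = sorted(set().union(*patterns.values()))
    # Invert the patterns: for each genome, the set of clusters it contains.
    content = {g: {c for c, members in patterns.items() if g in members}
               for g in genomes}
    dist = {}
    for a, b in combinations(genomes, 2):
        shared = len(content[a] & content[b])
        union = len(content[a] | content[b])
        dist[(a, b)] = 1.0 - shared / union
    return dist

# Toy phyletic patterns (hypothetical cluster ids, genome codes as in the tables).
toy_patterns = {
    "cluster_A": {"Sulso", "Sulac", "Pyrfu"},
    "cluster_B": {"Sulso", "Sulac"},
    "cluster_C": {"Pyrfu"},
}
toy_dist = gene_content_distances(toy_patterns)
```

The resulting dictionary can be passed to any distance-based tree-building method (for example neighbor-joining) to obtain a gene-content tree.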
Major features of the reconstructed gene set of LACA
No. of arCOGs
Implication for LACA
Complete translation system and essentially complete set of enzymes for tRNA and rRNA modification
aaRS and related enzymes
Moderately sophisticated transcription control
RNA polymerase subunits
Replication, recombination and repair
Advanced DNA replication and repair system
DNA polymerase subunits
Energy production and conversion
Membrane-based redox bioenergetics; partial TCA cycle
NADH dehydrogenase or Na+/H+ antiporter
V-type ATPase-ATP synthase
Carbohydrate transport and metabolism
Moderately sophisticated sugar metabolism
Amino acid transport and metabolism
Enzymes for the biosynthesis of all amino acids
Amino acid biosynthesis
Nucleotide transport and metabolism
Enzymes for the biosynthesis of all nucleotides
Coenzyme transport and metabolism
Enzymes for the biosynthesis of all essential cofactors
Lipid transport and metabolism
Fully developed membrane
Cell wall, membrane and envelope biogenesis
Fully developed cell wall
Inorganic ion transport and metabolism
Sophisticated ion uptake system
Secondary metabolites biosynthesis, transport and catabolism
Limited or unknown
Limited motility and/or conjugation
Posttranslational modification, protein turnover, chaperones
Sophisticated system of protein fate control
Cell cycle control
Limited or unknown
Signal transduction mechanisms
Limited use of bacterial type signal transduction system; original signal transduction
Intracellular trafficking and secretion
Fully developed secretion system
Viruses abundant at LACA times
Poorly characterized or unknown
Comparing these observations with those presented in Figs. 3 and 5, one comes to the conclusion that, quantitatively, archaeal genomes are dominated by the relatively mobile "shell" genes that belong to the common prokaryotic gene pool and encode the overwhelming majority of metabolic, structural, and signal transduction functions; a sharp contrast is presented by the stable, archaeo-eukaryotic core of information-processing genes. These quantitative conclusions, even if based on a crude analysis, are in good agreement with the early observations on the bimodal distribution of the taxonomic affinities of archaeal genes, the subsequent observations on the affinities of eukaryotic genes [51, 53], and the complexity hypothesis which posited distinct evolutionary fates of information and operational genes.
The arCOGs, which are expected to be updated as genome sequencing progresses, are a resource for genome annotation of the newly sequenced archaeal genomes and the refinement of the existing annotations, as well as evolutionary reconstructions. Crude reconstructions presented here indicate that the ancestral archaeal forms, including LACA, probably, were full-fledged prokaryotes, of approximately the same level of complexity as the simplest of the modern free-living archaea.
All-against-all BLAST search was used to establish the similarity relationships between the archaeal proteins. Lineage-specific expansions of paralogs were identified essentially as described previously [57, 58]. Initial clusters based on triangles of symmetrical best hits were constructed using a modified COG algorithm [11, 13]; the major difference in the current implementation was the strict symmetry requirement for the "best hit" relationship between proteins. This constraint lowers the number of false-positives but, in the presence of paralogs, leads to substantial underclustering; this was rectified on the subsequent steps.
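The strict symmetry requirement and the triangle seeds can be sketched as follows. This is a minimal illustration rather than the modified COG algorithm itself: it assumes the genome-specific best hits have already been extracted from the BLAST results into a dictionary, and all function names, protein ids and genome labels are hypothetical.

```python
def symmetric_best_hits(best_hit, genome_of):
    """Unordered protein pairs that are each other's genome-specific best hit.

    best_hit: dict {(query_protein, target_genome): best-scoring protein}
    genome_of: dict {protein: genome}
    """
    pairs = set()
    for (query, _target_genome), hit in best_hit.items():
        # Strict symmetry: the hit must point back to the query.
        if best_hit.get((hit, genome_of[query])) == query:
            pairs.add(frozenset((query, hit)))
    return pairs

def sym_bet_triangles(pairs, genome_of):
    """Triples of proteins from three genomes, all pairwise symmetric best hits."""
    neighbours = {}
    for pair in pairs:
        a, b = tuple(pair)
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    triangles = set()
    for pair in pairs:
        a, b = tuple(pair)
        for c in neighbours[a] & neighbours[b]:
            if len({genome_of[a], genome_of[b], genome_of[c]}) == 3:
                triangles.add(frozenset((a, b, c)))
    return triangles

# Toy data: c2 is a paralog of c1 in genome C.
toy_genome_of = {"a1": "A", "b1": "B", "c1": "C", "c2": "C"}
toy_best_hit = {
    ("a1", "B"): "b1", ("b1", "A"): "a1",
    ("a1", "C"): "c1", ("c1", "A"): "a1",
    ("b1", "C"): "c1", ("c1", "B"): "b1",
    ("c2", "A"): "a1", ("c2", "B"): "b1",  # one-directional hits only
}
toy_pairs = symmetric_best_hits(toy_best_hit, toy_genome_of)
toy_triangles = sym_bet_triangles(toy_pairs, toy_genome_of)
```

In the toy data the paralog c2 hits a1 and b1 but is not their best hit in genome C, so it is left out of the initial cluster. This is the underclustering effect that the later profile-based steps are meant to correct.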
Multiple alignments of the initial cluster members were constructed using the MUSCLE program; alignments were used to construct PSSMs for a PSI-BLAST search against the database of archaeal proteins with the e-value threshold of 0.01; proteins (domains) were added to the corresponding best-scoring original clusters, resulting in a set of expanded clusters.
Sequences of expanded cluster members were aligned using MUSCLE, and the PSSMs constructed from these alignments were used for a second round of PSI-BLAST search against the database of archaeal proteins. The search results were used to construct a similarity graph for the relationships between the expanded clusters. Formally, all statistically significant (e<0.01) hits in a search with the PSSM for a particular cluster were classified according to the cluster they belong to; clusters in the hit list were ranked according to the mean score across their members (members missing from the hit list were assigned an arbitrary score 2 bits below the significance threshold). An edge between the i-th and the j-th clusters was given weight equal to the lowest rank among the i→j and j→i relationships (i.e., if cluster j is the top-ranking hit when cluster i is the query but cluster i is the third-ranking hit for cluster j, then the edge connecting i and j is given the rank of 3). Connected components were extracted from the graph; pairs of nodes within a connected component were assigned an edge with a rank of infinity if they were not connected directly. A minimum-linkage clustering procedure was applied to the connected sets of clusters (if cluster i and j are merged, the edge between cluster k and the node, representing the merged clusters, is given the rank equal to the lowest rank of k-i and k-j edges), resulting in a rooted dendrogram of relationships between the clusters. Then each node on the tree was labeled with the number of species that were present in all descendant clusters.
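The ranking and edge-weighting step can be sketched in a few lines. The sketch makes several assumptions that the text does not settle: the significance threshold is expressed as a bit score (`threshold_bits`, a hypothetical parameter), the query cluster is excluded from its own ranking, ties are broken by cluster id, and an edge is created only when each cluster appears in the other's hit list. Following the worked example in the text, the "lowest" of the two ranks is taken to be the numerically larger one.

```python
def rank_hit_clusters(query, hits, members, threshold_bits):
    """Rank the clusters hit by one query PSSM by mean bit score.

    hits: dict {protein: bit score} of significant hits for the query cluster.
    members: dict {cluster: list of member proteins}.
    Members missing from the hit list get a score 2 bits below the threshold.
    """
    floor = threshold_bits - 2.0
    hit_clusters = [c for c, prots in members.items()
                    if c != query and any(p in hits for p in prots)]
    mean = {c: sum(hits.get(p, floor) for p in members[c]) / len(members[c])
            for c in hit_clusters}
    ordered = sorted(mean, key=lambda c: (-mean[c], c))
    return {c: rank for rank, c in enumerate(ordered, start=1)}

def edge_ranks(all_hits, members, threshold_bits):
    """Edge rank = the worse (numerically larger) of the two directional ranks."""
    ranks = {q: rank_hit_clusters(q, h, members, threshold_bits)
             for q, h in all_hits.items()}
    edges = {}
    for i in ranks:
        for j, r_ij in ranks[i].items():
            r_ji = ranks.get(j, {}).get(i)
            if r_ji is not None and i < j:
                edges[(i, j)] = max(r_ij, r_ji)
    return edges

# Toy data (hypothetical clusters, proteins and scores).
toy_members = {"c1": ["p1", "p2"], "c2": ["p3"], "c3": ["p4"], "c4": ["p5"]}
toy_hits = {
    "c1": {"p3": 80.0, "p4": 50.0},
    "c2": {"p5": 90.0, "p4": 70.0, "p1": 60.0},  # p2 missing: scored at 28 bits
    "c3": {"p1": 55.0, "p2": 45.0, "p3": 40.0},
    "c4": {"p3": 85.0},
}
toy_edges = edge_ranks(toy_hits, toy_members, threshold_bits=30.0)
```

Here c2 is the top hit for c1, while c1 is only the third-ranking hit for c2, so the c1–c2 edge receives rank 3, mirroring the example given in the text.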
Two rules were used to determine if the descendant clusters should be merged: i) if the species coverage of the node is at least 50% greater than that of any of the descendant nodes and ii) if, among the descendants of a node, one is species-rich and the other one is species-poor (formally, if s_i > 20s_j/(10 - s_j), where s_i and s_j stand for the species coverage of the species-rich and species-poor descendant nodes, respectively).
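The two merge rules can be written as a small predicate. This is a sketch under stated assumptions, not the authors' code: a merge is assumed to be triggered when either rule holds; "any of the descendant nodes" is read as "each descendant", so the node is compared with its best-covered descendant; and rule ii is treated as not applicable when the species-poor descendant covers 10 or more species, where the denominator is no longer positive. The function name is hypothetical.

```python
def should_merge(node_coverage, child_coverages):
    """Decide whether the descendant clusters of a dendrogram node are merged.

    node_coverage: number of species covered by the node.
    child_coverages: species coverage of each descendant node.
    ASSUMPTIONS: either rule suffices; rule ii is skipped when s_j >= 10.
    """
    s_rich = max(child_coverages)
    s_poor = min(child_coverages)
    # Rule i: the node covers at least 50% more species than its descendants.
    rule_i = node_coverage >= 1.5 * s_rich
    # Rule ii: one species-rich and one species-poor descendant,
    # s_i > 20 * s_j / (10 - s_j).
    rule_ii = s_poor < 10 and s_rich > 20.0 * s_poor / (10 - s_poor)
    return rule_i or rule_ii
```

For example, a node covering 20 species with descendants covering 18 and 2 species satisfies rule ii (18 > 40/8 = 5), whereas descendants covering 15 and 12 species satisfy neither rule.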
In parallel to the above procedures, a BLAST search against the COG 2003 database was performed, followed by using a modified COGNITOR program [11, 13] to assign archaeal proteins to prokaryotic COGs. Merged clusters with proteins assigned to different COGs were split into COG-specific clusters to avoid clustering of paralogous proteins that previously have been assigned to different curated COGs.
Reconstruction of gene gain and loss during the evolution of Archaea was performed using a modified weighted parsimony approach  implemented in a two-pass algorithm. First, a coarse-resolution multifurcating species tree was compiled from several single-gene phylogenetic reconstructions and taxonomic data. For each arCOG, the phyletic pattern indicating the presence/absence of the respective gene in each analyzed species was mapped onto the leaves of the tree. The first pass is performed in the leaves-to-root direction, and the number of descendant nodes containing the given gene is counted for each internal tree node. If this number is greater than or equal to the first (generally, more stringent) threshold, which is set for each node individually, the node is assigned state "1" (presence of the gene), otherwise it is assigned state "0" (absence of the gene). In the second pass, which is performed in the opposite, root-to-leaves direction, if the gene is absent in the given node (state "0") but present in its ancestor and the number of descendant nodes carrying this gene is greater than or equal to the second (generally more relaxed) threshold, the node is assigned state "1". For the guide tree and the thresholds, see .
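The two-pass procedure can be sketched for a single arCOG as shown below. The actual guide tree and per-node thresholds are not given in this passage, so the tree, the threshold values and all names are hypothetical. The sketch also assumes that "descendant nodes" means the immediate children of a node and that the child counts from the first pass are reused in the second pass.

```python
def reconstruct_states(children, root, leaf_states, first_threshold, second_threshold):
    """Two-pass presence/absence reconstruction for one gene.

    children: dict {internal node: list of child nodes}
    leaf_states: dict {leaf: 0 or 1}, the phyletic pattern
    first_threshold / second_threshold: dict {internal node: minimum number
    of children carrying the gene}, set for each node individually.
    """
    state = dict(leaf_states)
    count = {}

    def leaves_to_root(node):
        kids = children.get(node, [])
        if not kids:
            return  # leaf: state comes from the phyletic pattern
        for kid in kids:
            leaves_to_root(kid)
        count[node] = sum(state[kid] for kid in kids)
        state[node] = 1 if count[node] >= first_threshold[node] else 0

    def root_to_leaves(node, parent_state):
        kids = children.get(node, [])
        if (kids and state[node] == 0 and parent_state == 1
                and count[node] >= second_threshold[node]):
            state[node] = 1  # rescued by the more relaxed threshold
        for kid in kids:
            root_to_leaves(kid, state[node])

    leaves_to_root(root)
    root_to_leaves(root, 0)  # the root has no ancestor
    return state

# Toy multifurcating tree with hypothetical thresholds.
toy_children = {"root": ["X", "Y"], "X": ["a", "b", "c"], "Y": ["d", "e", "f"]}
toy_first = {"root": 1, "X": 2, "Y": 2}
toy_second = {"root": 1, "X": 1, "Y": 1}
toy_pattern = {"a": 1, "b": 1, "c": 1, "d": 1, "e": 0, "f": 0}
toy_states = reconstruct_states(toy_children, "root", toy_pattern, toy_first, toy_second)
```

In the toy example node Y fails the stringent first threshold (one child out of three carries the gene) but is assigned state "1" in the second pass because its ancestor carries the gene and the relaxed threshold is met. Gains and losses along each branch then follow from comparing the state of a node with that of its parent.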
The paper describes the construction of orthologous groups for archaea.
Given the success of the COGs and KOGs (a subset for eukaryotes with higher resolution) and the inability of current purely automatic procedures to produce reliable orthologous groups and, very importantly, their reliable functional annotation, I see this as an important resource for various studies. Furthermore, it uses a semi-automatic procedure that includes some clever guiding principles, e.g., it takes into account phylogenetic gene presence patterns. The average coverage of 88% at a higher resolution than the current 76% COG coverage of genes in archaeal genomes is another noteworthy and useful feature. As far as I can see, the arCOGs are of high quality and I look forward to using them.
There is no comparison to more recent orthology-built procedures, but I assume that this semi-automatic procedure presented here provides a more accurate picture than purely automatic methods.
The only concerns I have are availability/format issues and some minimalistic Figure captions. Both should be easy to solve.
Taken together, I congratulate the authors for this nice, important and very useful piece of work.
Authors' response: The formats of the files on the ftp site were modified to increase transparency, and an extended README file was added. We hope this improves accessibility which is, indeed, crucial. The figure captions were amended.
The «easy to use» COG database has been especially useful for the biological community. It has helped to improve the quality of genome annotation and has been widely adopted by non-bioinformatics experts to perform preliminary rounds of comparative genomic analysis. The main problem with such popular databases is the delay in their updating, a daunting task considering the current avalanche of completely sequenced genomes. The present paper by Kira Makarova and colleagues reports a much welcome update of the COG database that focuses on archaea (arCOGs). The number of completely sequenced archaeal genomes remains quite low (compared to the situation with bacteria), allowing an exhaustive analysis that remains to be done for bacteria and eukarya. The arCOGs database will surely be an extremely important source of information for the community working on archaea and for all scientists interested in comparative genomics and microbial evolution. The new analysis corresponds to a substantial increase in information compared to the previous one, since around 40% of arCOGs are new.
In addition to the description of the arCOGs database, the paper by Kira Makarova and co-workers presents several analyses that bring new (or updated) data and raise several interesting evolutionary questions. In particular, they have built a gene-content tree based on the presence-absence of arCOGs in archaeal genomes and estimated the evolution of the archaeal genome content along the evolutionary tree based on a gene loss and gain analysis. They reported several intriguing observations that are worth discussing in the framework of current debates on archaeal phylogeny and on the nature of the last universal archaeal ancestor.
Makarova and co-workers noticed that the number of strictly specific euryarchaeal and crenarchaeal proteins is very low (one and three, respectively). This seems to strongly argue in favour of the monophyly of Archaea (against the «eocyte» hypothesis). However, it would be interesting to present a slightly «relaxed» version of these cores, by allowing for the possibility for a protein to be missing in a group of related archaea (something quite frequently observed, for instance the lack of the euryarchaeal histone in Thermoplasmatales). More generally, it could be interesting in the future to define a category of conserved arCOGs (carCOGs?) present in all members of at least two archaeal orders in order to discriminate between ORFan arCOGs that are only present in one order (probably «recently» introduced by lateral gene transfer) and arCOGs of probable ancient origin that can tell us something about the evolutionary relationships between the diverse archaeal orders. It would then be interesting to determine whether the distribution of such carCOGs correlates with the archaeal phylogeny based on various evolutionary markers.
The parasitic archaeon Nanoarchaeum equitans lacks the largest number (50) of universal arCOGs, confirming that this archaeon probably evolved by «genome reduction». Some authors have suggested that N. equitans is a primitive organism. I suspect that there is a relatively high percentage of these 50 proteins that have homologues in Bacteria or Eukarya. This could be indicated as an argument in favour of the reduction scenario versus the "old nano" hypothesis! Interestingly, the gene content tree based on arCOGs groups N. equitans with Thermococcales among Euryarchaeota. Although gene-content trees can sometimes be highly biased by lateral gene transfer, this observation is in good agreement with a preliminary global analysis based on best BLAST-hits and refined phylogenies based on proteins of the small ribosomal subunits, reverse gyrase, Topo VI and elongation factors (Brochier et al., 2005). This confirms that N. equitans should not be considered as a member of a new archaeal phylum (as already widely found in text-books!!) but as an odd member of the Euryarchaeota, probably, distantly related to Thermococcales.
Another puzzling observation is the grouping of Cenarchaeum symbiosum with euryarchaea in the gene-content tree. Interestingly, the COG coverage is quite similar for all archaeal genomes (around 88%) except for C. symbiosum and N. equitans. This can be explained by genome reduction in the case of N. equitans, but not in the case of C. symbiosum whose genome has a «normal» size. Significantly, the authors reported that the coverage of C. symbiosum genome with the old COGs was greater than with the new arCOGs! This indicates that this genome contains COGs present in Bacteria or Saccharomyces cerevisiae but not in any other archaeon. The proposed explanation is that C. symbiosum is a symbiotic crenarchaeon that has acquired lots of bacterial genes. An alternative hypothesis is that C. symbiosum is not a crenarchaeon after all, but represents an early branching archaeal phylum that contains bacterial and archaeal homologues that have been lost in other archaea.
From their reconstruction of gene loss and gain events, Makarova and co-workers suggest that the last Universal archaeal ancestor (LACA) was a hyperthermophile and a chemolithoautotroph with a minimal number of genes around 1000. They conclude that LACA might have been (nearly) as advanced as modern archaeal hyperthermophiles and found this conclusion quite «unexpected». I am not so surprised. It's a prejudice to think that ancestors are always simpler than present-day organisms and that ancient evolution always occurred toward more "complexity". There is no reason why reductive evolution, which has occurred so often in the evolution of modern cells, was not as pervasive in ancient time (Forterre and Philippe, 1999). In fact, an in-depth analysis of ribosomal protein distribution by Poch and co-workers already suggested a few years ago that the ribosome of LACA was probably more complex than the ribosome of any modern archaea (Lecompte et al., 2002).
Authors' response: We do not, exactly, disagree and certainly realize the importance of reductive evolution. Still, whether or not we should consider the reconstruction of a complex LACA surprising depends on the perspective. Considering that LACA is supposed to be the common ancestor of one of the 3 domains of life, there might be some element of surprise in this observation. After all, at the earliest stages of the evolution of life, there must have been a dramatic increase in complexity. That this complexification stage, apparently, was over by the time the domain of life became distinct (very likely, the same will hold for bacteria) is, certainly, of note. Alternatively, it is conceivable that LACA is actually not as ancient as one might think but represents a more recent bottleneck in archaeal evolution such that there was a complexification stage after the onset of the archaeal domain but it is inaccessible by comparative genomics.
My only criticism of this paper is that the authors have taken a quite conservative view of archaeal phylogeny (only based on 16S rRNA) to analyse gene loss and gain along the archaeal history and to estimate the genome content of LACA. Indeed, several features of their unresolved multifurcation tree are dubious.
N. equitans appears as an isolated lineage (a third phylum)
C. symbiosum is grouped with hyperthermophilic Crenarchaeota.
Methanopyrus kandleri is shown as an isolated branch
In all these cases, the authors have chosen to follow the 16S rRNA tree, whereas careful analyses based on ribosomal proteins have shown that Methanopyrus kandleri most likely groups with Methanococcales and Methanomicrobiales (Brochier et al., 2004) and that N. equitans is at least a sister group of euryarchaea (if not of Thermococcales). As previously indicated, the grouping of C. symbiosum with crenarchaea could also be highly misleading. It would have been interesting to compare the genome content of LACA based on the 16S rRNA phylogeny and the more robust phylogeny based on ribosomal proteins. My feeling is that the nature of LACA (chemolithoautotroph or not, hyperthermophile or not?) is still a pending question.
Authors' response: We have not really followed the 16S RNA tree but rather deliberately chose a poorly resolved topology so as not to subscribe to any particular phylogenetic hypothesis with respect to issues that are still considered unresolved. We are well aware of the published work on archaeal phylogenies and the two important papers by Brochier et al. are cited. Out of fairness, the likely position of Methanopyrus with Methanococcales and Methanobacteriales was first reported in Slesarev et al. in 2002, and this is cited as well. The wording on Methanopyrus in the text was modified to reflect these reports but we did not modify the tree in Fig. 7. One has to keep in mind that the reconstruction here is by no means supposed to be the final word on the scenario of archaeal evolution but more of an exercise showcasing the utility of the arCOGs. We expect that there will be many more iterations with more genomes, better resolved trees, and better methods of reconstruction, and we certainly hope to be involved.
Finally, in the discussion of the gene-content tree, the authors wrote «methanogenesis which are spread both vertically and horizontally». In fact, a detailed phylogenetic analysis of genes involved in methanogenesis by Bapteste and co-workers has shown that, surprisingly, although these proteins can be considered as «operational» they have been only transmitted by vertical inheritance in the archaeal domain (Bapteste et al., 2005).
Authors' response: We believe that the issue is not quite resolved yet. The wording in the paper was softened, nevertheless.
Bapteste, E., Brochier, C. and Boucher, Y.
Higher-level classification of the Archaea: evolution of methanogenesis and methanogens.
Archaea, 1, 353–363 (2005).
Brochier, C., Forterre, P. and Gribaldo, S.
Archaeal phylogeny based on proteins of the transcription and translation machineries: tackling the Methanopyrus kandleri paradox.
Genome Biology, 5, R17 (2004).
Brochier, C., Gribaldo, S., Zivanovic, Y., Confalonieri, F. and Forterre, P.
Nanoarchaea: representative of a novel archaeal phylum or a fast evolving euryarchaeal lineage related to Thermococcales?
Genome Biology, 6, R42 (2005).
Forterre, P. and Philippe, H.
Where is the root of the universal tree of life?
Bioessays, 21, 871–879 (1999).
Lecompte, O., Ripp, R., Thierry, J.C., Moras, D. and Poch, O.
Comparative analysis of ribosomal proteins in complete genomes: an example of reductive evolution at the domain scale.
Nucleic Acids Res., 30, 5382–5390 (2002).
This article describes the analysis of genes present in most of the currently available archaeal genome sequences in view of their classification in clusters of orthologous genes specific to the archaea (arCOG). It represents an updated extension of previous comparative genomic analyses of COGs though exclusively devoted to the archaea. As a consequence, the arCOG database produced is more refined, resulting in an increased coverage and resolution. The latter is reflected in the numerical increase of specific archaeal COGs and the accompanying decrease in the number of clusters containing paralogs. The comparison of arCOGs thus defined allows to infer the presence of ~166 core arCOGs, which were likely present in the last archaeal common ancestor (LACA), while 282 and 336 arCOGs appear ancestral to the euryarchaeotal and crenarchaeotal branches, respectively. From the nature of the core arCOGs, the authors conclude that the LACA was a rather complex hyperthermophilic chemoautotroph possessing ~1000 genes. Differential gene gain and loss are predicted to have occurred in the two major archaeal branches. The pattern of arCOG distribution in the different archaeal genomes is used to reconstruct a gene-content tree. Despite biases that may be associated to this approach, which are cautiously recognized by the authors, the tree obtained is largely congruent with widely accepted archaeal molecular phylogenies. Interestingly, Nanoarchaeum equitans is placed within the Thermococcales in agreement with recent detailed phylogenetic analyses, reinforcing the idea that the basal placement of N. equitans in some trees was due to long-branch attraction artifacts. 
The two major differences of this gene-content tree with respect to previously accepted molecular phylogenies for the archaea are that all methanogenic euryarchaeota, normally split into at least two large groups in molecular phylogenies, cluster together as they share a large number of methanogenesis-related genes, and that Cenarchaeum symbiosum is placed within the Euryarchaeota, in disagreement with its expected position within the Crenarchaeota. Although the type of analyses carried out is not innovative, the new arCOG database presented here will certainly be very useful to improve future genome annotations.
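To make the notion of a gene-content tree concrete, the following minimal sketch computes pairwise distances between genomes from presence/absence patterns of gene clusters. It is an illustration only and not the procedure used by the authors: the Jaccard distance is an assumed choice of metric, and the genome names, cluster identifiers, and function names are hypothetical.

```python
from itertools import combinations


def jaccard_distance(a: set, b: set) -> float:
    """Return 1 - |A & B| / |A | B|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def gene_content_distances(profiles: dict) -> dict:
    """Map each unordered genome pair to its gene-content distance."""
    return {
        (g1, g2): jaccard_distance(profiles[g1], profiles[g2])
        for g1, g2 in combinations(sorted(profiles), 2)
    }


# Hypothetical toy data: each genome is the set of clusters it contains.
profiles = {
    "genome_A": {"arCOG_1", "arCOG_2", "arCOG_3"},
    "genome_B": {"arCOG_1", "arCOG_2", "arCOG_4"},
    "genome_C": {"arCOG_1", "arCOG_5"},
}

distances = gene_content_distances(profiles)
for pair, d in distances.items():
    print(pair, round(d, 3))
```

A distance matrix of this kind can be passed to any standard distance-based tree-building method. Because genomes that share many clusters end up close together regardless of their ancestry, a large shared set of pathway-specific genes (such as the methanogenesis-related genes mentioned above) can pull otherwise distinct lineages into one group.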
I have only a few minor comments or suggestions, as follows:
-First, it has to be noticed that the euryarchaeal core (282 arCOGs) and the crenarchaeal core (336 arCOGs) are not dramatically larger than the pan-archaeal core, emphasizing the general volatility of archaeal genomes.
The affirmation that 282 and 336 arCOGs are not dramatically larger than the 166 core arCOGs appears quite subjective. It is roughly twice the size. How does this compare with the situation in bacteria? It would be nice to include this information here, and even better, to relate/normalize this information to the average genetic distance in a reference conserved genetic marker, such as the 16S rRNA gene.
Authors' response: "Dramatic", certainly, is in the eye of the beholder. We believe the reader will see it that way, so no changes were made. A comparison with bacteria is dubious because there are no two major groups of bacteria emulating Euryarchaeota and Crenarchaeota. Calibration is a complex exercise that goes beyond the scope of this paper.
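For reference, the relative sizes under discussion follow directly from the counts quoted above. The snippet below is plain arithmetic on those three numbers; the variable names are introduced here for illustration only.

```python
# Core sizes as quoted in the exchange above (numbers of arCOGs).
pan_archaeal_core = 166
lineage_cores = {"Euryarchaeota": 282, "Crenarchaeota": 336}

# Size of each lineage-specific core relative to the pan-archaeal core.
ratios = {name: size / pan_archaeal_core for name, size in lineage_cores.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x the pan-archaeal core")
```

The two lineage-specific cores are thus about 1.7 and 2.0 times the size of the pan-archaeal core.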
Defining genome volatility would also be useful. Genome volatility has been defined in the literature as the mean volatility of all codons weighted by their frequency within the genome, codon volatility being a measurement related to the non-synonymous versus synonymous mutations (e.g. Dagan and Graur, Mol Biol Evol 2004, 22:496). I believe the meaning is more informal and vague here, and also subjective. Can you provide a reference showing that archaeal genomes are "volatile"?
Authors' response: Good point. We changed the wording to avoid any wrong connotations; "volatility" is not used anymore.
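The formal definition referred to in the comment above can be sketched as follows. This is a minimal illustration under stated assumptions, not code from any of the cited works: codon volatility is taken to be the fraction of a codon's single-nucleotide sense neighbours that encode a different amino acid, the standard genetic code is assumed, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_volatility(codon: str) -> float:
    """Fraction of one-step sense neighbours that change the amino acid."""
    aa = CODON_TABLE[codon]
    if aa == "*":
        raise ValueError("volatility is defined here for sense codons only")
    neighbours = [
        codon[:i] + b + codon[i + 1:]
        for i in range(3)
        for b in BASES
        if b != codon[i]
    ]
    # Neighbours that are stop codons are excluded from the count.
    sense = [n for n in neighbours if CODON_TABLE[n] != "*"]
    return sum(CODON_TABLE[n] != aa for n in sense) / len(sense)


def genome_volatility(coding_seq: str) -> float:
    """Mean codon volatility weighted by codon frequency in the sequence.

    Expects an upper-case, in-frame coding sequence; stop codons and
    codons with ambiguous bases are skipped.
    """
    codons = [coding_seq[i:i + 3] for i in range(0, len(coding_seq) - 2, 3)]
    counts = Counter(c for c in codons if CODON_TABLE.get(c, "*") != "*")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no sense codons found")
    return sum(codon_volatility(c) * n for c, n in counts.items()) / total


print(codon_volatility("ATG"), codon_volatility("GGG"))
print(genome_volatility("ATGGGG"))
```

In this sense "volatility" is a property of codon usage, which is quite different from the informal meaning of frequent gene gain and loss that the comment questions.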
Horizontal gene transfer (HGT) from bacteria has apparently contributed to shaping the C. symbiosum genome. On page 14, it is mentioned that C. symbiosum falls within the euryarchaeotal part of the gene-content tree. Would you predict that HGT from euryarchaeota may partly explain this observation, as some (although very limited) environmental genomic studies appear to suggest (Lopez-Garcia, Brochier et al, Environ Microbiol 2004, 6:19)?
Authors' response: Yes, a valid point; we included this possibility in the revision and cite the paper.
This work was supported by the Intramural Research Program of the National Institutes of Health, National Library of Medicine.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.