The evolution of domain-content in bacterial genomes
© Molina and van Nimwegen; licensee BioMed Central Ltd. 2008
Received: 05 December 2008
Accepted: 11 December 2008
Published: 11 December 2008
Across all sequenced bacterial genomes, the number of domains n c in different functional categories c scales as a power-law in the total number of domains n, i.e. , with exponents α c that vary across functional categories. Here we investigate the implications of these scaling laws for the evolution of domain-content in bacterial genomes and derive the simplest evolutionary model consistent with these scaling laws.
We show that, using only an assumption of time invariance, the scaling laws uniquely determine the relative rates of domain additions and deletions across all functional categories and evolutionary lineages. In particular, the model predicts that the rate of additions and deletions of domains of category c is proportional to the number of domains n c currently in the genome and we discuss the implications of this observation for the role of horizontal transfer in genome evolution. Second, in addition to being proportional to n c , the rate of additions and deletions of domains of category c is proportional to a category-dependent constant ρ c , which is the same for all evolutionary lineages. This 'evolutionary potential' ρ c represents the relative probability for additions/deletions of domains of category c to be fixed in the population by selection and is predicted to equal the scaling exponent α c . By comparing the domain content of 93 pairs of closely-related genomes from all over the phylogenetic tree of bacteria, we demonstrate that the model's predictions are supported by available genome-sequence data.
Our results establish a direct quantitative connection between the scaling of domain numbers with genome size, and the rate of addition and deletions of domains during short evolutionary time intervals.
This article was reviewed by Eugene V. Koonin, Martijn A. Huynen, and Sergei Maslov.
When the first gene sequences became available in the 1960s some striking and unexpected patterns were observed. For example, comparison of the fossil record with the number of amino acid substitutions separating orthologous proteins in mammals  suggested a constant rate of amino acid substitutions. In addition, the inferred rate of amino acid substitutions was so high that it was hard to imagine how all of these substitutions could have been fixed by the action of natural selection . This famously lead Kimura to propose the neutral theory of molecular evolution . Neutral evolution became the de facto null model of sequence evolution and the availability of such a null model was crucial for the development of rigorous methods for reconstructing evolutionary phylogenies (e.g. ) and methods for detecting selection acting on gene sequences (e.g. [5, 6]).
Evolution of course also takes place at higher levels of organization than substitutions within protein-coding genes. In particular, large genomic segments containing one or more genes can be duplicated or deleted, and segments can be 'horizontally transfered', i.e. taken from one organism's genome and inserted into another organism's genome. Through such events organisms can vary the gene content of their genomes, acquiring genes with new functions, sub-functionalizing existing functions, or deleting genes whose functions are no longer required. Now that the sequences of several hundred of whole microbial genomes have become available over the last decade it has become possible to investigate variation in gene-content across genomes in a quantitative manner.
Studies of gene content have uncovered several striking quantitative 'laws'. First of all, it was noticed [7–9] that a number of key genomic quantities show power-law distributions. In particular, the distribution of gene family sizes is a power-law in each genome, whose exponent appears to depend mostly on the size of the genome. Several theoretical models have been put forward for explaining these power-law distributions, which all include gene duplications and deletions as key ingredients. Another striking observation  is that the numbers of genes in different functional categories scale as power-laws in the total number of genes in the genome. For example, whereas the numbers of genes involved in different types of metabolism scale approximately linear with genome size, the number of genes involved with regulatory processes such as transcription regulation and signal transduction scales roughly quadratically with genome size, and the number of genes involved with basic processes such as DNA replication or cell division scales with an exponent less than 1. Such scaling laws are observed for the large majority of high-level functional categories of genes and appear to apply to all bacterial genomes.
As we have argued previously [10, 11], these scaling laws have important implications for the evolutionary dynamics of gene duplications and deletions and we will here investigate these implications in detail. The organization of the paper is as follows. We study genome evolution at the level of protein domains and we start by demonstrating that scaling laws are also observed at the level of the number of protein-domains.
We re-estimate the scaling exponents α c using all 630 currently available genomes. Next, using the assumption that the scaling laws are time invariant, we derive a 'null model' for genome evolution that accounts for the observed scaling laws. In this model the exponents of the scaling laws are identified as universal constants of the evolutionary process.
We collected 93 pairs of closely-related bacterial genomes and tested the model's predictions by analyzing the protein-domain content of these genomes and estimating, for each pair, the rates at which additions and deletions of domains from different categories have occurred since their common ancestor. We show that essentially all of the model's predictions are supported by the available genome data. Finally, we also discuss the important implications of our results for the role of horizontal gene transfer in genome evolution.
Results and Discussion
Scaling laws in protein domain occurrences
Although genes are natural units in genome analysis there are some disadvantages to using genes as the central units in the analysis of the evolution of genome content. For example, apart from being able to mutate, duplicate, and be deleted, it is well-known that, not infrequently, two genes can fuse into one, single genes can split into two , and genes can evolve de novo from non-coding sequence. Such events significantly complicate the analysis of the evolution of gene content.
Protein domains form more natural units for the study of the evolution of gene-content for several reasons. It can be argued that protein domains act like 'evolutionary atoms' to a certain extent ; Protein domains form functional units  that cannot be split into smaller units, and a single protein domain can, in general, not be constructed by fusing multiple occurrences of other protein domains. Therefore, we can safely assume that almost all changes in the number of occurrences in the genome of a given protein domain are due to deletions, duplications, or the horizontal transfer of a domain from another organism's genome. We thus decided to study the evolution of functional gene content in terms of the number of occurrences of different protein domains. Among databases of protein domains Pfam  is attractive because the Pfam domain families are disjoint, i.e. at the default settings it is guaranteed that any given DNA sequence segment will be classified to belong to at most one domain family. We thus used Pfam domains as our evolutionary 'atoms'.
We counted the number of occurrences of each Pfam domain in each fully sequenced bacterial genome (Methods). Using a mapping from Pfam to Gene Ontology categories  we determined, for each genome g, the total number of domains n(g) that can be associated with any GO category and, for each GO category c, the number of domains n c (g) occurring in the genome.
with both the pre-factors β c and the exponents α c varying between categories. These power-laws are observed for the large majority of high-level functional categories. For each GO category we fitted a power-law of the form (1) using a Bayesian procedure which in particular provides a posterior probability distribution for the exponent α c (Methods). We selected 156 GO categories that occur in at least 95% of all genomes and that show good power-law fits (Methods). The inferred exponents match what we found previously based on the gene-number analysis of a much smaller number of genomes [10, 11], i.e. for basic processes such as translation and DNA repair exponents are low, whereas exponents for regulatory functions such a regulation of transcription and signal transduction are largest. The inferred exponents for all 156 selected categories are listed in Additional file 1.
We want to investigate the implications of the scaling laws (1) for evolutionary dynamics. That is, we want to infer what the scaling laws imply for the behavior of the domain number counts n c (t) as a function of time t. It is important to define precisely what we mean by n c (t). A sequenced genome g represents a particular bacterial strain and can idealistically be thought of as representing the genome of a single bacterial organism living today with domain counts n c (g). Since bacteria reproduce clonally we can imagine tracing this individual back through time, back to its mother cell, its grandmother, and eventually all the way back until the common ancestor of all currently sequenced genomes. We denote by n c (g, t) the number of domains of category c that were present in the ancestor organism of genome g that was living at time t.
Let tnow denote today and let x c (g, t) denote the logarithm of the domain-number, i.e. x c (g, t) = log[n c (g, t)], and similarly x(g, t) = log[n(g, t)]. In these variables the scaling laws are just straight lines, i.e all genomes g (approximately) obey the linear relation
x c (g, ttoday) = α c x(g, ttoday) + β c ∀ g .
Since (5) must hold for all genomes g, this equation first of all implies a relation between the offsets β c and the domain counts in the last common ancestor:
β c = x c (0) - α c x(0).
The equations (10) reflect the constraints on domain-count dynamics implied by the scaling laws but they don't uniquely determine an evolutionary model. To derive a unique evolutionary null model we will assume time invariance of the scaling laws. We assume that, if we had collected genomes of bacteria living several tens or even hundreds of million years ago, as opposed to the bacteria living today, we would have observed the same scaling laws as we observe today. That is, we assume that there is nothing particularly special about our current time, and that the same scaling laws have held since the last common ancestor, or at least since the origin of the clades from which our current genome sequences derive. We feel that this is by far the simplest assumption that can be made about the evolutionary dynamics and will here analyze its implications.
That is, the assumption of time invariance implies that, for each genome g, and for each short time interval in its evolution, the ratio between the change dn c (g, t) in the domain-count of category c and the total change dn(g, t) in domain-count is given by the product of the exponent α c and the fraction n c (g, t)/n(g, t) of all domains that are of category c. In particular, equation (12) will apply to the domain-count changes that occurred since the common ancestors of pairs of closely-related species, as illustrated on the right-hand side of Fig. 2. Therefore, we can test the validity of the null model by comparing the domain-counts in the genomes of closely-related bacteria.
Implications for closely-related pairs of genomes
We now discuss how the prediction (12) can be tested with data from closely-related genomes. Note that, strictly speaking, (12) holds only in the limit of infinitesimally small dn(g, t) and that we have so far implicitly assumed that the n c (g, t) are continuous variables, whereas in reality the smallest possible change is dn(g, t) = 1. For the integer-valued quantities n c (g, t) equation (12) can be interpreted as follows: whenever a single domain is added to the genome, i.e. dn = 1, then the probability that this domain is of category c is given by α c n c /n. Similarly, whenever a single domain is removed, i.e. dn = -1, then the probability that this domain is of category c is also given by α c n c /n.
Since this interpretation is of key conceptual importance we briefly expand on its meaning.
For pairs of closely-related genomes the number of domain-count changes that occurred since they diverged from a common ancestor is generally very small compared to the total number of domains. Therefore, the fractions n c /n have generally changed little during the time since the two genomes diverged from their ancestor and we will make the assumption that the fraction n c /n can be considered constant. Under this approximation equation (13) predicts that, if during the time interval since the pair's common ancestor, a total of Δn domain-count changes occurred, i.e. the sum of all additions and deletions, then the expected number of domain-count changes Δn c in category c (which is again the sum of all additions and deletions in this category) should equal
We collected 93 pairs of fully-sequenced genomes that are evolutionary relatively closely related, using the tree of life that was inferred by Ciccarelli et al.  as a guide (Methods). For each pair of genomes i we counted the numbers of domain occurrences for each Pfam family and used these (Methods) to estimate the number of domain-count changes for each category c and the total number of domain-count changes Δn i . Again, we stress that the are the estimated total number of changes, adding additions and deletions together. For example, if we denote by the difference in the number of domains in category c occurring in the two genomes of the pair, then we typically find that the estimated is larger than (see Additional file 1). Apart from estimating we estimated, for each genome pair i, the fractions by averaging the domain counts over the two genomes in the pair (Methods). Our model thus predicts that, for each pair i, the ratio should be proportional both to the fraction and to scaling law exponent α c .
The fraction of domain-count changes is proportional to the number of existing domains
The left panel of Fig. 3 demonstrates two points. First, comparing the three categories with each other, we see that most domain-count changes occur in the most abundant category and least domain-count changes occur in the least abundant category, with the fraction of domain-count changes indeed scaling roughly linearly with (compare with the dotted guide lines showing linear scaling). Beyond that, if we compare the numbers of domain-count changes across the different genomes within each category we see that, in those genomes where the domains of the category are most abundant domain-count changes in that category are also most abundant. That is, although the data is quite noisy, it is clear that all three clouds of points show a close to linear increase of with .
The estimated slopes γ c for all selected GO categories are shown in the right panel of Fig. 3 (and listed in Additional file 1). The estimated γ c are very roughly symmetrically distributed around 1 with a median γ c of 1.16. For almost 75% of the categories a slope of γ c = 1 is within the 99% posterior probability interval.
This thus supports the prediction of our evolutionary null model that the fraction of all domain-count changes that involve domains of category c is proportional to the fraction n c /n of all domains in the genome that belong to category c.
For about 25% of the categories we infer slopes significantly deviating from 1. It should be noted, however, that the least-squares fitting assumes simple Gaussian noise in log[Δn c /Δn], whereas in reality the size of the noise in log[Δn c /Δn] increases as Δn decreases. Moreover, whereas the fitting assumes that the numbers of domain-count changes are given, in reality these are estimated (see Methods) and thus themselves subject to uncertainty. We therefore are significantly underestimating the uncertainty in the fitted slope for many categories, and it is reasonable to conclude that for most if not all categories the data is consistent with the predicted linear dependence of Δn c /Δn on n c /n.
which is a generalization of equation (13). The proportionality constants defined by this equation quantify the extent to which domain-count changes of category c are more or less frequent in the lineages of pair i than expected based on their frequency . For this reason we will refer to these proportionality constants as evolutionary potentials. That is, when is high it indicates that, apparently, domain additions and deletions involving domains of category c are fixed in evolution at a higher rate in the evolutionary lineages of pair i.
Our evolutionary null model predicts that the evolutionary potentials are the same for all evolutionary lineages, and in addition that the evolutionary potentials are equal to the scaling law exponents α c . We will check these two predictions in turn.
The evolutionary potentials are constant across evolutionary lineages
Since the total number of domain-count changes Δn i is often small, it is not surprising that the posterior probability intervals are often rather wide. In spite of this, it can be clearly seen that, consistent with the scaling exponents α c , is largest for the category 'regulation of transcription', and smallest for the category 'translation'. Moreover, Fig. 4 shows that the data by and large support the prediction that the potentials are the same for all evolutionary lineages. That is, for each of the three categories the posterior probability intervals for are consistent with a common underlying potential ρ c for the majority of genome pairs i. This is a further piece of support for the evolutionary null model.
Evolutionary potentials ρ c correlate with scaling exponents α c
Note that, since the evolutionary potential ρ c is a measure of the relative frequency of domain-count changes between closely-related species, and α c is a measure of the scaling of the number of domains with genome size, there is a priori no reason why these two quantities should be strongly correlated. However, as predicted by our evolutionary null model, there is clear evidence of a linear dependency between the exponents α c and the evolutionary potentials ρ c .
Rather than a simple relation ρ c = α c we find that ρ c varies over a somewhat smaller range, i.e. the 99% posterior probability interval for the slope of the correlation runs from 0.65 to 0.79. One possible explanation is that, because the estimation of the numbers of domain-count changes Δn c is the same for all categories, we might underestimate the numbers of domain-count changes more for categories with large ρ c than for categories with low ρ c .
Implications for the rates of horizontal transfer
In general, the rate at which additions/deletions occur is the product of two independent factors. First, the rate at which domain additions and deletions are introduced into individuals of the population, and second the fraction of the time that such mutations are being fixed into the population. There are likely three main mechanisms through which domain additions or deletions are introduced: duplications, deletions, and horizontal transfers. To a first approximation, the rates at which duplications, deletions, and horizontal transfers are being introduced into individuals will be determined by the biases inherent in the mechanisms underlying these processes and not by selection. In contrast, the fraction of the time that such mutations are fixed in evolution will strongly depend on selection.
It is clear that, for duplications and deletions, the rate at which such mutations are introduced is naturally proportional to the number of existing domains n c . That is, when the number of domains n c doubles, the total rate at which duplications and deletions are introduced within this category also doubles. Moreover, since selection is not involved, the rate of introduction of duplications and deletions can be expected to be the same for all functional categories c (except of course for transposable elements which are duplicated through a separate mechanism). Therefore, as the rate of introduction is proportional to n c , with the same proportionality constant for each category, and the total rate must be proportional to ρ c n c , this implies that the relative rate of fixation through selection must be proportional to the evolutionary potential ρ c .
Thus, the evolutionary potentials ρ c (and the scaling exponents α c ) have a particularly simple interpretation: they give the average relative rate with which additions and deletions of domains in category c are fixed by selection.
Evidence has accumulated over recent years that horizontal transfers occur in essentially all evolutionary lineages and gene families, see e.g. [17–22] and the extensive discussion in the recent review . There is less consensus in the literature, however, regarding the precise amount that horizontal transfer contributes to gene-content evolution. Discussion of the reasons for this lack of consensus are beyond the scope of this article but it is clear that the apparent disparity between the conclusions reached by different authors is a combination of: the fact that different authors ask different questions, i.e. asking what fraction of gene families are affected by horizontal transfer at least once in evolutionary history  is very different from asking, for example, the relative rates of gene loss to horizontal transfer [25, 26]. The fact that very different types of evidence are used, such as presence/absence of members of a gene family across leafs of the species tree [19, 20, 26], comparison of gene trees with species trees [21, 27], or the presence of genes without known homologs . And finally, the fact that there are technical issues (like the way gene families are build and tree topologies are inferred) which may affect the results quantitatively if not qualitatively.
To the present authors the evidence currently in the literature does suggest that horizontal gene transfer accounts for a non-negligible and maybe even a large fraction of changes in gene content, at least among closely-related genomes. For example, it was found in  that up to 20% of the gene-content of proteo-γ bacteria consists of genes that have no homology with any of the other genes among all currently sequenced proteo-γ bacteria, but that do have homology with genes found outside of the proteo-γ clade. It is hard to see how this statistic could result from any process other than a high rate of horizontal transfer. It is thus worthwhile to investigate the implications of our current findings under the assumption that many of the domain additions are due to horizontal transfer.
Although we have no direct evidence, it is attractive to assume that the probability that a domain addition will be fixed in the population does not depend on the mechanism by which it was introduced. That is, the relative rate of fixation of domain additions in category c should be proportional to ρ c for both duplicated domains as well as horizontally transfered domains. If this is indeed the case, it follows immediately from the fact that the overall rate should be proportional to ρ c n c , that the rate at which horizontal transfers are introduced must be proportional to the number of domains n c present in the genome. However, whereas this is naturally the case for gene duplications, it is not clear at all why this should also hold for horizontal transfers. Therefore, our results put rather strong constraints on the rate of horizontal transfer.
One possibility is that horizontal transfer is negligible and that domain additions are dominated by duplications. However, as we have just discussed, this assumption, which we have made in previous work [10, 11], appears at odds with recent work. It should be noted, however, that some studies that investigate evolution of gene content over long time scales find that horizontal transfer is only responsible for a minor fraction of all events on a long time scale, i.e. . One hypothesis that might be worthwhile to entertain is that most horizontal transfers are only transient. It is conceivable that horizontally transfered genes consist mostly of 'accessory' genes that are involved with adaptations to the local environment that are easily taken up by genomes moving into a certain environment, but which are also easily lost again when the environment changes, so that the horizontal transfers of these accessory genes contributes relatively little to the gene-content dynamics on long time scales. However, at least to these authors this hypothesis does not seem particularly plausible a priori.
Alternatively, there are several hypotheses that could explain why the rate at which horizontal transfers of domains of category c are introduced is proportional to the number of domains n c already in the genome. First, it is possible that horizontal transfer is highly biased to occur predominantly between genomes that are closely-related phylogenetically. One mechanism of horizontal gene transfer, conjugation, does indeed occur preferentially between related organisms. Since closely-related species are likely to have highly correlated domain counts, it is likely that the fraction n c /n of category c domains in the donor genome is close to the fraction of domains of category c in the receiver genome. However, many of the horizontal transfers detected through sequence analysis involve transfers between distally related species.
Another possible explanation is that bacterial habitats naturally separate into different genome-size classes. That is, it is conceivable that bacteria tend to be surrounded mostly by other bacteria of roughly the same genome size. Because the scaling laws apply to all genomes, the fractions n c /n are similar for similarly sized genomes and one would naturally have that the rate at which horizontal transfers of domains of category c occur is proportional to n c . As far as these authors are aware, currently there seems to be no evidence suggesting that there is a characteristic genome size for each bacterial habitat, but it appears that this hypothesis should in principle be testable using metagenomics data.
Finally, it is possible that, even though a given bacterium would generally be surrounded by other bacteria of very different sizes, that horizontal transfer is highly biased to occur predominantly between organisms that have genomes with similar sizes. In fact, there is some evidence in the literature that bacteria can recognize and silence horizontally transfered genes that have an AT-content which is significantly higher than the AT-content of the genome itself . In addition, there is a good correlation between genome size and GC-content . It is therefore conceivable that horizontal transfers between genomes of similar size are much more common than horizontal transfers between genomes of significantly different sizes.
In any case, whatever the underlying mechanism, if horizontal transfers account for a significant fraction of domain additions through evolution, then something must ensure that the rate of introduction of such horizontal transfers is proportional to the number of existing domains n c in the receiving genome.
We have shown that, across all bacteria and for most high-level GO categories c, the number of domain occurrences n c scales as a power-law in the total number of domains n, with scaling exponents α c varying from close to zero to a bit larger than 2. We have derived what we believe is the simplest evolutionary model that can account for the observed scaling laws. This 'null model' assumes that, across all evolutionary lineages and all evolutionary times, the relative rate r c /r at which additions and deletions of domains of category c are fixed in evolution is proportional to the current fraction n c /n of domains in category c and a characteristic evolutionary potential ρ c which equals the scaling exponent α c .
By comparing genome-wide domain-counts n f for each Pfam family f across 93 pairs of closely-related species we have estimated the rates at which domain additions and deletions occur across GO categories and across different evolutionary lineages. The results of this analysis support the predictions made by the evolutionary null model. First, we have shown that, for most categories c, the relative rate r c /r of domain additions and deletions is proportional to the fraction of domains n c /n already occurring in the genome.
Second, we estimated the relative rates of domain additions and deletions independently for different evolutionary lineages i and used these to estimate lineage-dependent evolutionary potentials . We found that, whereas the evolutionary potentials clearly vary between categories c, the data support the null model's prediction that for a given category c the potentials are the same across all evolutionary lineages i. Finally, by combining data from all lineages we estimated average evolutionary potentials ρ c and found that, as predicted by the model, there is a good correlation between these evolutionary potentials and the scaling law exponents α c . Importantly, this result establishes that there is a direct relation between the scaling of domain-counts with genome size and the rates with which domains are added and removed during short evolutionary time intervals. This reinforces our proposal that the evolutionary potentials ρ c are fundamental constants of the evolutionary process.
An interesting question is if our simple null model can also explain the observed power-law distribution [7–9] of genome-family sizes in each genome. In previous work  one of us has suggested that the simplest explanation for the power-law distribution of gene family sizes is a multiplicative noise process. Although we will defer a detailed analysis of the gene-family size distributions implied by our null model to future work, it is clear that the basic ingredients for such a multiplicative noise process are already present. Since the model only constrains the relative rates of domains in different functional categories, the overall rate of genome growth/shrinkage can fluctuate randomly, and the rates of different families within a functional category can also fluctuate around a common mean. It is interesting to note that our null model implies that categories with large exponents, such as transcription factors, should show larger fluctuations in gene family sizes than categories with small exponents. Since the categories with large exponents are more abundant in larger genomes this in turn implies that the exponent of the gene-family size distribution should increase (i.e. be less negative) for larger genomes. This is indeed what is observed .
If, as recent work suggests, horizontal transfer is an important force in shaping the gene-content of genomes, then our results put strong constraints on the rates r c at which horizontal transfers of domains of different functional categories c can occur. In particular, we find that the rate at which domains of category c are horizontally transfered into a genome must be proportional to the number of domains n c already existing in the receiving genome. An important avenue for future research is to clarify the underlying mechanism that is responsible for this surprising fact.
As our results have made plausible that the evolutionary potentials ρ c (and the corresponding scaling exponents α c ) are fundamental constants of the evolutionary process that apply across all time and all evolutionary lineages, the major challenge is now to elucidate what determines these numbers. In this respect it is important to note that the functional categories c that we consider are taken directly from the human-defined Gene Ontology hierarchy and are thus rather subjective. A first challenge for future work is therefore to identify a procedure that divides domain families into functional groups in a more objective manner. Although difficult with the current amount of available data, one possible approach is to estimate evolutionary potentials ρ f for individual domain families and to investigate if these fall into a small number of natural classes. That is, it is conceivable that on some more fundamental level there are only a small number of distinct exponents, for example α = 0, α = 1, and α = 2, and that the observed scaling laws with more complex exponents are different mixtures of these more fundamental scaling laws. Finally, we believe that the exponents α c reflect fundamental design principles of bacterial life, maybe similar to the way geometry and architectural design principles demand that the number of windows in a building scales as the 2/3 power of the building's volume. Seen from this point of view the exponents α c encode crucial information about the basic design that is shared by all bacterial life.
We obtained all 630 currently available bacterial genomes from the NCBI database . To count the number of occurrences of each Pfam domain in each fully sequenced bacterial genome we ran HMMer  using all Pfam models on all proteins encoded in each genome, as annotated in the NCBI reference file. We thus assume that there are no significant fluctuations in the quality of gene prediction across the genomes. A hit was considered a valid domain if its score was equal or bigger than the so-called gathering score of the model provided by the Pfam web site, and it did not overlap with any other hit of lower E-value. There were 4,732 Pfam domain families with at least one occurrence across the 630 bacterial genomes. To count the number of domain occurrences per functional category we used a mapping from Pfam domains to Gene Ontolology terms  which is available at http://www.geneontology.org/. If a domain-family f maps to category c it will be associated with c and all parent categories of c in the Gene Ontology hierarchy.
Bayesian fitting of exponents
where G is the number of genomes, σ xx is the variance of x values, σ yy the variance of y values, and σ xy the covariance of x and y values. Note that the optimal line in this procedure corresponds roughly, i.e. up to the effects of the rotationally-invariant prior, to the line that minimizes the sum of the squared orthogonal distances of the data points to the line. The latter also corresponds to the first principal component of the data.
We selected all GO categories that have nonzero count in at least 95% of the genomes (600 out of 630), where the fraction of the variance explained by the fit is at least 0.9375 (this corresponds to the average distance to the data-points from the fitted line being 0.25 or less of the average distance of the data-points to the center of mass of the scatter), and where the average number of domains (averaged over all genomes) is at least 5. This led to 156 categories listed in Additional file 1.
where P is the number of genome pairs, the x-values are now given by the log-fractions, i.e. , and the y-values are the log-fractions of changes, i.e. .
Extracting closely-related pairs of bacteria
We extracted the phylogenetic tree of bacteria from the tree of life that was produced by Ciccarelli et al.  based on the concatenated protein sequences of 31 protein families. As shown previously , even strains that are so close that they traditionally would be considered the same species, i.e. more than 94% nucleotide identity between orthologous genes, can have substantial differences in their gene content. In selecting 'close' pairs of organisms we want, on the one hand, to be able to estimate relative rates, for which we need a large enough number of domain additions and deletions to have taken place. On the other hand, the further apart the organisms, the harder it is to accurately estimate the total number of addition and deletion events that have taken place (see below). We decided to select all pairs of species for which the average identity at the amino acid level of orthologous proteins was at least 0.75, i.e. distance less than 0.25. With this definition one of the most distant pairs considered was Escherichia coli and Vibrio Cholerae. To avoid redundancy and pairs with too few events, we clustered all genomes whose distances were 0.01 or less and took a single representative genome from each cluster. With these cutoffs we obtained 93 pairs of bacterial genomes which are listed in Additional file 1.
Estimating domain-count changes Δn c
To estimate the number of additions and deletions for each family f we maximize the probability (29) with respect to λ, μ, the fractions x f , and the number of extra moves e f . To do this we use an iterative procedure. Note that, given the numbers of extra moves e f , the optimal λ, μ, and x f are given by
λ = A + E,
μ = D + E,
and we can numerically solve for the ef that maximizes this likelihood. We start by setting all e f = 0 and use the above equations to, iteratively, solve for λ, μ and the x f given the e f , and then the e f given the x f . This is repeated until a fixed point is reached. Finally, the estimated total number of events Δn f for family f equals δ n f + 2e f . In this way we estimate the number of events separately for each of the genome pairs i we analyze.
We originally performed this procedure including all Pfam domains. However, doing this we found that the number of extra moves e f estimated for categories associated with transposons and bacteriophages was many times larger than for all other families. This is of course to be expected as both transposons and phages actively multiply their domains. However, in equations (33) and (34) all domain families are treated equally, and therefore the high rates of additions and deletions for transposon and phage related categories significantly increase the estimated total rates for all families. Therefore, recognizing that the mechanisms of domain additions in transposon and phage related families are different from all other domain families, we excluded those Pfams associated with transposons and bacteriophages. In particular, we excluded all 22 Pfam families that map to the GO categories transposition (GO:0032196) or viral reproduction (GO:0016032).
The estimated total number of changes in category c is given by , where the sum is over all Pfam domain families f associated with category c. The estimated total number of changes is given by , where the sum is over all Pfam domain families. To calculate the fractions for a given closely-related pair i we calculate the average number of domains associated with category c as and the average total number of domains .
Eugene V. Koonin
This is an important and welcome development of Van Nimwegen's 2003 classic on the scaling laws for different functional categories of genes in prokaryotic genomes. That classic study established that different functional classes of genes all scale according to power laws but with different, function-specific exponents that are (at least, approximately) the same in all prokaryotic lineages and, supposedly, the same throughout the course of prokaryotic evolution. This paper underpins the scaling laws with the (apparently) simplest conceivable evolutionary model. Under this model, the dynamics of protein domains in each functional category depends on just two variables, the number (fraction) of domains of the given category that are already present in the genome and the intrinsic evolutionary potential of the category. It is shown that the observations on the actual counts of domains in genomes are well explained if the evolutionary potentials are category-specific but invariant across bacterial lineages. All of the above had to be done in order to obtain a concrete evolutionary mechanism yielding the observed scaling laws but the above results are not at all unexpected. As I see it, the interesting things start coming up when one starts considering mechanism of domain gain in specific terms. For the model to work it is necessary that the rate at which horizontally transferred genes are acquired by a prokaryotes is proportional to the number of domains of the given category that are already present in the genome. Why that would be the case remains unclear. I believe the possibility that horizontal gene transfer is negligible compared to duplication as the source of new genes (domains) can be dismissed with confidence. In all, likelihood, in prokaryotes, horizontal gene transfer is actually a more important source of new genes than duplication. It is hard to think of a way for the domain content in the recipient organism could directly affect the rate of horizontal transfer. So the explanation should be indirect, that is, should include a connection between the domain composition of the donor genome with that of the recipient. Molina and Van Nimwegen consider three possibilities, and to me, the one that horizontal gene transfer predominantly occurs between genomes with similar AT-content (horizontally transferred genes coming from organisms with substantially different AT-content being rapidly destroyed or silenced) is highly attractive considering the strong correlation between AT-content and genome size. These hypotheses are testable by comparative-genomic methods although the analysis will not be easy. Of course, in the face of the rather counter-intuitive finding that the rate of horizontal gene transfer should depend on the fraction of domains of the given category already present, one has to consider the possibility that the proposed evolutionary model is too simple to be true. As far as I can see, more specifically, that would imply that the evolutionary potentials are not time-invariant and/or lineage-independent. The results of the present paper do not seem to point in this direction but I suspect that this is not the last word on the subject, more detailed analyses are necessary. On the whole, this is an enormously interesting subject, and the present paper is a useful stepping stone toward understanding the scaling laws. I am particularly intrigued by the final proposition that there could be only three fundamental exponents, the intermediate values currently observed depending on mixing of genes from the three classes in different proportions. Philosophically, this seems to smack of essentialism but... should there be a mechanistic explanation(s) of the 0, 1, and 2 scaling (and I can think of some), this would be a real step ahead in our understanding of how genomes evolve.
Martijn A. Huynen
The manuscript by Molina and van Nimwegen is the culmination of an observation that was originally made by Erik van Nimwegen and on which there has been follow op from several corners: that the variation in the number of proteins in a specific functional class across species scales as a power-law with the total number of proteins encoded in the species, and that the exponent of that power-law varies between the various classes.
Molina and van Nimwegen analyse their model further to show that the number of additions and losses within each category is proportional to the number of genes of that category already in the genome. They show that this prediction is borne out by comparing the number of genes in closely related genomes. Questions: In work of this referee and van Nimwegen we showed that the frequency distribution of gene family sizes in complete genomes follows a power-law, and we argued that this was only possible under a model in which the variations in the number of genes per gene family was proportional to the gene family size. Do I understand correctly that the model and observations in the Molina and van Nimwegen manuscript are consistent with this model, and more importantly, that these two observations about gene family size distributions 1) the size distribution of one family over genomes and 2) the size distribution of all families within one genome, can now be explained by one single model?
The authors argue for domains as the evolutionary unit. This may well be, but such a lower resolution does run the risk of mixing functional categories for domains that function in multiple categories. How many domains did map to multiple categories? And how did that affect the results?
With respect to Horizontal gene transfer the authors do not analyze this process per se, but rather argue that if it is frequent it should also be proportional to gene family size. I do not want to get into a whole HGT debate here, but, although over the complete history of life, along all evolutionary branches, few gene families appear to escape HGT, or at least escape evidence for HGT, compared to processes like gene duplication and gene loss, the quantitative contribution from any generation to the next appears to be small (Snel Bork and Huynen Genome Res 2002). There are other references (BG Mirkin, TI Fenner, MY Galperin, EV Koonin 2004) that do give higher estimates however. In any case it would be worthwhile to mention that relative to gene duplication and gene loss the amount of HGT that actually happens is not necessarily as large as is sometimes implicitly suggested.
With respect to the HGT mainly occurring between closely related species: there is evidence for that (e.g. by conjugation).
With respect to the closely related genomes: can the authors check whether the protein prediction in a pair of genomes was done with the same programs? or did they run the HMMs directly against the DNA? This is a bit nitpicking I know, but comparisons of closely related species have been confounded by inconsistent genome annotations in the past.
The manuscript presents an interesting study of evolutionary implications of previously reported scaling laws in the functional content of bacterial genomes. While it does not answer the ultimate question of why this scaling exists in the first place, it methodically explores all its logical consequences reflected in genomes' evolutionary history.
An important result of this study is that the overall rate of gene (or domain as used in this manuscript) additions AND deletions scales linearly with the number of genes (domains) in a given functional category. This statement is in principle separate and independent from the scaling law itself since it counts the SUM of the rates of domain additions and deletions and not the DIFFERENCE between them.
Unfortunately, when this quantity (Δn c proportional to [rate of additions+rate of deletions]c) is first introduced on page 7, readers could easily confuse it with just the net change in n c (denoted dnc and proportional to [rate of additions-rate of deletions] c ). As a result they would miss one of the central points of the manuscript. I recommend that authors spend some extra time upfront explaining the differences between Δn c and dn c and emphasizing that, a priory, these two quantities are not at all close to each other. To quantify this difference authors might quote the average value of – [rate of additions-rate of deletions] c –/[rate of additions+rate of deletions] c for their 93 pairs of genomes.
My other comment concerns the proposed "superuniversality" of evolutionary potentials (ρ c ) of a individual functional categories. In general this study indicates that ρ c remains nearly the same for all species and at all timescales of evolution. I have previously observed (S. Maslov, unpublished) that in a group of VERY CLOSELY related genomes (28 fully sequences E. coli and Salmonella strains) the number of transcription factors violates the N2 scaling in spite of a considerable range of genome sizes (from 4300 to 5800 genes). The best fit to the scaling exponent gives 0.3 instead of 2. This might indicate that evolutionary dynamics might in fact be rather different on very short timescale. This does not contradict the results of this study since (as explained in the Methods) authors have grouped together all very closely related species (AA substitution rate below 1%). However, I believe this observation deserves future scrutiny since it may shed an additional light on elementary evolutionary steps shaping functional contents of bacterial genomes.
Finally, I would like to offer another possibility of how the results of this study could be reconciled with the evidence of widespread Horizontal Gene Transfer (HGT) among bacteria (see section 2.9). One way to explain the linear correlation between the rate of fixed horizontal gene transfers and the number of genes in host's genome, is to assume that a SUCCESSFUL group of HGT-acquired genes needs to be functionally integrated with the rest of the genome. An example would be a HGT-transferred metabolic pathway that in order to contribute to the biomass production needs to be connected with the rest of the metabolic network of its host. Genomes with larger number of genes n have more places where a HGT-transferred pathway could be successfully connected and hence would be characterized by a proportionally larger Δn HGT . In fact, my collaborators and I (S. Maslov, S. Krishna, K. Sneppen (2008) under review) have recently proposed a model of such pathway-by-pathway evolution to explain the quadratic scaling of the number of transcription factors with genome size.
EvN thanks Eugene Koonin, Paulien Hogeweg, and Sergei Maslov for helpful discussions.
- Zuckerkandl E, Pauling LB: Molecular disease, evolution, and genetic heterogeneity. Horizons in Biochemistry. Edited by: Kasha M, Pulman B. 1962, New York: Academic Press, 189-225.Google Scholar
- Kimura M: Evolutionary rate at the molecular level. Nature. 1968, 217: 624-626. 10.1038/217624a0.PubMedView ArticleGoogle Scholar
- Kimura M: The Neutral Theory of Molecular Evolution. 1983, Cambridge University PressView ArticleGoogle Scholar
- Felsenstein J: Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol. 1981, 17: 368-376. 10.1007/BF01734359.PubMedView ArticleGoogle Scholar
- Nei M, Gojobori T: Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Molecular Biology and Evolution. 1986, 3 (5): 418-426.PubMedGoogle Scholar
- Hughes AL, Nei M: Pattern of nucleotide substitution at major histocompatibility complex class I loci reveals overdominant selection. Nature. 1988, 335: 167-170. 10.1038/335167a0.PubMedView ArticleGoogle Scholar
- Huynen M, van Nimwegen E: The frequency distribution of gene family sizes in complete genomes. Mol Biol Evol. 1998, 15 (5): 583-589.PubMedView ArticleGoogle Scholar
- Luscombe NM, Qian J, Zhang Z, Johnson T, Gerstein M: The dominance of the population by a selected few: power-law behavior applies to a wide variety of genomic properties. Genome biology. 2002, 3 (8):Google Scholar
- Koonin EV, Wolf YI, Karev GP: The structure of the protein universe and genome evolution. Nature. 2002, 420: 218-222. 10.1038/nature01256.PubMedView ArticleGoogle Scholar
- van Nimwegen E: Scaling Laws in the functional content of genomes. Trends in Genet. 2003, 19 (9): 479-484. 10.1016/S0168-9525(03)00203-8.View ArticleGoogle Scholar
- van Nimwegen E: Scaling laws in the functional content of genomes: Fundamental constants of evolution?. Power Laws, Scale-free Networks and Genome Biology. Edited by: Koonin E, Karev G, Wolf Y. 2004, Landes Bioscience, 236-253.Google Scholar
- Snel B, Bork P, Huynen M: Genome evolution. Gene fusion versus gene fission. Trends in genetics. 2000, 16: 9-11. 10.1016/S0168-9525(99)01924-1.PubMedView ArticleGoogle Scholar
- Branden C, Tooze J: Introduction to Protein Structrue. 1999, Garland PublishingGoogle Scholar
- Bateman A, Coin L, Durbin R, Finn R, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer E, Studholme D, Yeats C, Eddy S: The Pfam protein families database. Nucl Acids Res. 2004, 32: D138-D141. 10.1093/nar/gkh121.PubMedPubMed CentralView ArticleGoogle Scholar
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene Ontology: tool for the unification of bioloy. Nature Genetics. 2000, 25: 25-29. 10.1038/75556.PubMedPubMed CentralView ArticleGoogle Scholar
- Ciccarelli FD, Doerks T, von Mering C, Creevey CJ, Snel B, Bork P: Toward Automatic Reconstruction of a Highly Resolved Tree of Life. Science. 2006, 311 (5765): 1283-1287. 10.1126/science.1123061.PubMedView ArticleGoogle Scholar
- Ochman H, Lawrence JG, Groisman EA: Lateral gene transfer and the nature of bacterial innovation. Nature. 2000, 405 (6784): 299-304. 10.1038/35012500.PubMedView ArticleGoogle Scholar
- Gogarten JP, Townsend JP: Horizontal gene transfer, genome innovation and evolution. Nat Rev Micro. 2005, 3 (9): 679-687. 10.1038/nrmicro1204.View ArticleGoogle Scholar
- Pal C, Papp B, Lercher MJ: Adaptive evolution of bacterial metabolic networks by horizontal gene transfer. Nat Genet. 2005, 37 (12): 1372-1375. 10.1038/ng1686.PubMedView ArticleGoogle Scholar
- Lerat E, Daubin V, ad NA, Moran HO: Evolutionary Origins of Genomic Repertoires in Bacteria. PLoS Biol. 2005, 3 (5): e130-10.1371/journal.pbio.0030130.PubMedPubMed CentralView ArticleGoogle Scholar
- Beiko RG, Harlow TJ, Ragan MA: Highways of gene sharing in prokaryotes. PNAS. 2005, 102 (40): 14332-14337. 10.1073/pnas.0504068102.PubMedPubMed CentralView ArticleGoogle Scholar
- Farmer JD: Physicists Attempt to Scale the Ivory Towers of Finance. Computing in Science and Engineering. 1999Google Scholar
- Koonin EV, Wolf YI: Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world. Nucleic Acids Res. 2008, 36 (21): 6688-6719. 10.1093/nar/gkn668.PubMedPubMed CentralView ArticleGoogle Scholar
- Dagan T, Martin W: Ancestral genome sizes specify the minimum rate of lateral gene transfer during prokaryote evolution. PNAS. 2007, 104 (3): 870-875. 10.1073/pnas.0606318104.PubMedPubMed CentralView ArticleGoogle Scholar
- Snel B, Bork P, Huynen MA: Genomes in Flux: The Evolution of Archaeal and Proteobacterial Gene Content. Genome Res. 2002, 12: 17-25. 10.1101/gr.176501.PubMedView ArticleGoogle Scholar
- Mirkin BG, Fenner TI, Galperin MY, Koonin EV: Algorithms for computing parsimonious evolutionary scenarios for genome evolution, the last universal common ancestor and dominance of horizontal gene transfer in the evolution of prokaryotes. BMC Evolutionary Biology. 2003, 3: 2-10.1186/1471-2148-3-2.PubMedPubMed CentralView ArticleGoogle Scholar
- Linz S, Radtke A, von Haeseler A: A Likelihood Framework to Measure Horizontal Gene Transfer. Mol Biol Evol. 2007, 24 (6): 1312-1319. 10.1093/molbev/msm052.PubMedView ArticleGoogle Scholar
- van Passel MWJ, Marri PR, Ochman H: The Emergence and Fate of Horizontally Acquired Genes in Escherichia coli. PLoS Comput Biol. 2008, 4 (4): e1000059-10.1371/journal.pcbi.1000059. [http://dx.doi.org/10.1371%2Fjournal.pcbi.1000059]PubMedPubMed CentralView ArticleGoogle Scholar
- Navarre WW, M M, Libby SJ, Fang FC: Silencing of xenogeneic DNA by H-NS – facilitation of lateral gene transfer in bacteria by a defense system that recognizes foreign DNA. Genes and Dev. 2007, 21: 1456-1471. 10.1101/gad.1543107.PubMedView ArticleGoogle Scholar
- Bentley SD, Parkhill J: Comparative genomic structure of prokaryotes. Annual Review of Genetics. 2004, 38: 771-791. 10.1146/annurev.genet.38.072902.094318.PubMedView ArticleGoogle Scholar
- Eddy SR: Profile hidden Markov models. Bioinformatics. 1998, 14: 755-763. 10.1093/bioinformatics/14.9.755.PubMedView ArticleGoogle Scholar
- Konstantinidis KT, Tiedje JM: Genomic insights that advance the species definition for prokaryotes. PNAS. 2005, 102 (7): 2567-2572. 10.1073/pnas.0409727102.PubMedPubMed CentralView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.