The evolution of domain-content in bacterial genomes

Background Across all sequenced bacterial genomes, the number of domains nc in different functional categories c scales as a power-law in the total number of domains n, i.e. nc∝nαc, with exponents αc that vary across functional categories. Here we investigate the implications of these scaling laws for the evolution of domain-content in bacterial genomes and derive the simplest evolutionary model consistent with these scaling laws. Results We show that, using only an assumption of time invariance, the scaling laws uniquely determine the relative rates of domain additions and deletions across all functional categories and evolutionary lineages. In particular, the model predicts that the rate of additions and deletions of domains of category c is proportional to the number of domains nc currently in the genome and we discuss the implications of this observation for the role of horizontal transfer in genome evolution. Second, in addition to being proportional to nc, the rate of additions and deletions of domains of category c is proportional to a category-dependent constant ρc, which is the same for all evolutionary lineages. This 'evolutionary potential' ρc represents the relative probability for additions/deletions of domains of category c to be fixed in the population by selection and is predicted to equal the scaling exponent αc. By comparing the domain content of 93 pairs of closely-related genomes from all over the phylogenetic tree of bacteria, we demonstrate that the model's predictions are supported by available genome-sequence data. Conclusion Our results establish a direct quantitative connection between the scaling of domain numbers with genome size, and the rate of addition and deletions of domains during short evolutionary time intervals. Reviewers This article was reviewed by Eugene V. Koonin, Martijn A. Huynen, and Sergei Maslov.


Background
When the first gene sequences became available in the 1960s some striking and unexpected patterns were observed. For example, comparison of the fossil record with the number of amino acid substitutions separating orthologous proteins in mammals [1] suggested a con-stant rate of amino acid substitutions. In addition, the inferred rate of amino acid substitutions was so high that it was hard to imagine how all of these substitutions could have been fixed by the action of natural selection [2]. This famously lead Kimura to propose the neutral theory of molecular evolution [3]. Neutral evolution became the de facto null model of sequence evolution and the availability of such a null model was crucial for the development of rigorous methods for reconstructing evolutionary phylogenies (e.g. [4]) and methods for detecting selection acting on gene sequences (e.g. [5,6]).
Evolution of course also takes place at higher levels of organization than substitutions within protein-coding genes. In particular, large genomic segments containing one or more genes can be duplicated or deleted, and segments can be 'horizontally transfered', i.e. taken from one organism's genome and inserted into another organism's genome. Through such events organisms can vary the gene content of their genomes, acquiring genes with new functions, sub-functionalizing existing functions, or deleting genes whose functions are no longer required. Now that the sequences of several hundred of whole microbial genomes have become available over the last decade it has become possible to investigate variation in gene-content across genomes in a quantitative manner.
Studies of gene content have uncovered several striking quantitative 'laws'. First of all, it was noticed [7][8][9]] that a number of key genomic quantities show power-law distributions. In particular, the distribution of gene family sizes is a power-law in each genome, whose exponent appears to depend mostly on the size of the genome. Several theoretical models have been put forward for explaining these power-law distributions, which all include gene duplications and deletions as key ingredients. Another striking observation [10] is that the numbers of genes in different functional categories scale as power-laws in the total number of genes in the genome. For example, whereas the numbers of genes involved in different types of metabolism scale approximately linear with genome size, the number of genes involved with regulatory processes such as transcription regulation and signal transduction scales roughly quadratically with genome size, and the number of genes involved with basic processes such as DNA replication or cell division scales with an exponent less than 1. Such scaling laws are observed for the large majority of high-level functional categories of genes and appear to apply to all bacterial genomes.
As we have argued previously [10,11], these scaling laws have important implications for the evolutionary dynamics of gene duplications and deletions and we will here investigate these implications in detail. The organization of the paper is as follows. We study genome evolution at the level of protein domains and we start by demonstrating that scaling laws are also observed at the level of the number of protein-domains.
We re-estimate the scaling exponents α c using all 630 currently available genomes. Next, using the assumption that the scaling laws are time invariant, we derive a 'null model' for genome evolution that accounts for the observed scaling laws. In this model the exponents of the scaling laws are identified as universal constants of the evolutionary process.
We collected 93 pairs of closely-related bacterial genomes and tested the model's predictions by analyzing the protein-domain content of these genomes and estimating, for each pair, the rates at which additions and deletions of domains from different categories have occurred since their common ancestor. We show that essentially all of the model's predictions are supported by the available genome data. Finally, we also discuss the important implications of our results for the role of horizontal gene transfer in genome evolution.

Scaling laws in protein domain occurrences
Although genes are natural units in genome analysis there are some disadvantages to using genes as the central units in the analysis of the evolution of genome content. For example, apart from being able to mutate, duplicate, and be deleted, it is well-known that, not infrequently, two genes can fuse into one, single genes can split into two [12], and genes can evolve de novo from non-coding sequence. Such events significantly complicate the analysis of the evolution of gene content.
Protein domains form more natural units for the study of the evolution of gene-content for several reasons. It can be argued that protein domains act like 'evolutionary atoms' to a certain extent [9]; Protein domains form functional units [13] that cannot be split into smaller units, and a single protein domain can, in general, not be constructed by fusing multiple occurrences of other protein domains. Therefore, we can safely assume that almost all changes in the number of occurrences in the genome of a given protein domain are due to deletions, duplications, or the horizontal transfer of a domain from another organism's genome. We thus decided to study the evolution of functional gene content in terms of the number of occurrences of different protein domains. Among databases of protein domains Pfam [14] is attractive because the Pfam domain families are disjoint, i.e. at the default settings it is guaranteed that any given DNA sequence segment will be classified to belong to at most one domain family. We thus used Pfam domains as our evolutionary 'atoms'.
We counted the number of occurrences of each Pfam domain in each fully sequenced bacterial genome (Methods). Using a mapping from Pfam to Gene Ontology categories [15] we determined, for each genome g, the total number of domains n(g) that can be associated with any GO category and, for each GO category c, the number of domains n c (g) occurring in the genome. Figure 1 shows, for 3 example categories, the number of domains in that category as a function of the total number of domains in the genome (that can be mapped to a GO category). As the figure shows, for all three categories the number of genes in the category n c scales as a power-law in the total number of domains in the genome n, i.e.
with both the pre-factors β c and the exponents α c varying between categories. These power-laws are observed for the large majority of high-level functional categories. For each GO category we fitted a power-law of the form (1) using a Bayesian procedure which in particular provides a posterior probability distribution for the exponent α c (Methods). We selected 156 GO categories that occur in at least 95% of all genomes and that show good power-law fits (Methods). The inferred exponents match what we found previously based on the gene-number analysis of a much smaller number of genomes [10,11], i.e. for basic processes such as translation and DNA repair exponents are low, whereas exponents for regulatory functions such a regulation of transcription and signal transduction are largest. The inferred exponents for all 156 selected categories are listed in Additional file 1.

Evolutionary Model
We want to investigate the implications of the scaling laws (1) for evolutionary dynamics. That is, we want to infer what the scaling laws imply for the behavior of the domain number counts n c (t) as a function of time t. It is important to define precisely what we mean by n c (t). A sequenced genome g represents a particular bacterial strain and can idealistically be thought of as representing the genome of a single bacterial organism living today with domain counts n c (g). Since bacteria reproduce clonally we can imagine tracing this individual back through n e n c c c Scaling laws Figure 1 Scaling laws. The number of protein-domains associated with functional categories 'translation' (green), metabolic process' (blue), and 'regulation of transcription' (red) as a function of the total number of domains in the genome for which a functional annotation is available. Each dot corresponds to a fully-sequenced microbial genome, with the total number of domains on the horizontal axis and the number of domains in a particular functional category on the vertical axis. Both axes are shown on a logarithmic scale. The straight lines show power-law fits. Domains in functional category time, back to its mother cell, its grandmother, and eventually all the way back until the common ancestor of all currently sequenced genomes. We denote by n c (g, t) the number of domains of category c that were present in the ancestor organism of genome g that was living at time t.
Let t now denote today and let x c (g, t) denote the logarithm of the domain-number, i.e. x c (g, t) = log[n c (g, t)], and similarly x(g, t) = log[n(g, t)]. In these variables the scaling laws are just straight lines, i.e all genomes g (approximately) obey the linear relation We will now derive how these scaling laws constrain the changes in domain-numbers that have occurred throughout time. Let t = 0 denote the time at which the last common ancestor of all sequenced bacterial genomes was alive. Note that, since the GO categories that we consider occur in almost all genomes, it is reasonable to assume that they all had nonzero count in the last common ancestor. We let x c (0) denote the log-domain counts in this common ancestor and x(0) the logarithm of the total domain count. Further, we denote by dx c (g, t) the change in the log domain-count for category c, that occurred in a small interval of time centered around time t in the evolutionary history of genome g. The log domain-counts x c (g, t) and x(g, t) are then by definition given by the integrals and Comparing equations (3) and (4) More importantly, we find that all genomes must obey For short time intervals in which the changes in n c are small relative to n c itself, the changes in x c are related to the changes in n c through and similarly Substituting these in (7) we obtain Equation (10) summarizes the implications for domaincount dynamics implied by the scaling laws. It states that, independent of which evolutionary history we take, the ratio of the integrals of dn c /n c and dn/n over all evolutionary time must match the scaling exponent α c . This is illustrated on the left-hand side of figure 2, i.e. equation (10) implies that the ratio of integrals is the same for each of the evolutionary histories indicated as colored lines.

Time Invariance
The equations (10) reflect the constraints on domaincount dynamics implied by the scaling laws but they don't uniquely determine an evolutionary model. To derive a unique evolutionary null model we will assume time invariance of the scaling laws. We assume that, if we had collected genomes of bacteria living several tens or even hundreds of million years ago, as opposed to the bacteria living today, we would have observed the same scaling laws as we observe today. That is, we assume that there is nothing particularly special about our current time, and that the same scaling laws have held since the last common ancestor, or at least since the origin of the clades from which our current genome sequences derive. We feel that this is by far the simplest assumption that can be made about the evolutionary dynamics and will here analyze its implications.
Given that the scaling laws are invariant in time, we immediately obtain that (10) should hold for each short time interval, i.e. we have that or dx g t dn c g t n c g t c ( , ) That is, the assumption of time invariance implies that, for each genome g, and for each short time interval in its evolution, the ratio between the change dn c (g, t) in the domain-count of category c and the total change dn(g, t) in domain-count is given by the product of the exponent α c and the fraction n c (g, t)/n(g, t) of all domains that are of category c. In particular, equation (12) will apply to the domain-count changes that occurred since the common ancestors of pairs of closely-related species, as illustrated on the right-hand side of Fig. 2. Therefore, we can test the validity of the null model by comparing the domaincounts in the genomes of closely-related bacteria.

Implications for closely-related pairs of genomes
We now discuss how the prediction (12) can be tested with data from closely-related genomes. Note that, strictly speaking, (12) holds only in the limit of infinitesimally small dn(g, t) and that we have so far implicitly assumed that the n c (g, t) are continuous variables, whereas in reality the smallest possible change is dn(g, t) = 1. For the integer-valued quantities n c (g, t) equation (12) can be interpreted as follows: whenever a single domain is added to the genome, i.e. dn = 1, then the probability that this domain is of category c is given by α c n c /n. Similarly, whenever a single domain is removed, i.e. dn = -1, then the probability that this domain is of category c is also given by α c n c /n.
Since this interpretation is of key conceptual importance we briefly expand on its meaning.
Mathematically, equation (10) makes a statement about the total overall changes in domain counts that happen over some finite time interval. In particular, the total change dn c that occurs over some time interval is the difference between the number of additions and deletions that occurred during that time interval. From a mathematical point of view, equation (11) is a differential equation that makes a statement about the relative rates at which changes in domain-count number occur, i.e. including both additions and deletions. To put it differently, the assumption of time invariance allows us to make statements about time intervals so short that at most one 'event' can occur during such intervals, so that there is roughly speaking no room left for additions and deletions to cancel each other out, i.e. the relation (11) must hold for both of them. The clearest interpretation is in terms of a model where the key quantities are the rates, i.e. probability of an event per unit time, at which domain-count changes (either additions or deletions) take place. That is, if r denotes the overall rate at which additions or deletions occur, and r c the rate at which additions/deletions of domains of category c occur, then the model predicts from a common ancestor is generally very small compared to the total number of domains. Therefore, the fractions n c /n have generally changed little during the time since the two genomes diverged from their ancestor and we will make the assumption that the fraction n c /n can be considered constant. Under this approximation equation (13)

The fraction of domain-count changes is proportional to the number of existing domains
Equation ( This thus supports the prediction of our evolutionary null model that the fraction of all domain-count changes that involve domains of category c is proportional to the fraction n c /n of all domains in the genome that belong to category c. For about 25% of the categories we infer slopes significantly deviating from 1. It should be noted, however, that the least-squares fitting assumes simple Gaussian noise in log[Δn c /Δn], whereas in reality the size of the noise in log[Δn c /Δn] increases as Δn decreases. Moreover, whereas the fitting assumes that the numbers of domain-count changes are given, in reality these are estimated (see Methods) and thus themselves subject to uncertainty. We therefore are significantly underestimating the uncertainty in the fitted slope for many categories, and it is reasonable to conclude that for most if not all categories the data is consistent with the predicted linear dependence of Δn c /Δn on n c /n.

Evolutionary Potentials
The results of the previous section support that the rate r c where we used the definition (14). Using a uniform prior over we and for the posterior probability of given the estimated domain-count changes Using (16) we determined posterior probability intervals defined by and for each category c and each genome pair i. Figure 4 shows these posterior probability intervals, for all genome pairs i, for the categories 'translation', 'metabolic process', and 'regulation of transcription'.
Since the total number of domain-count changes Δn i is often small, it is not surprising that the posterior probability intervals are often rather wide. In spite of this, it can be clearly seen that, consistent with the scaling exponents α c , is largest for the category 'regulation of transcription', and smallest for the category 'translation'. Moreover, Fig.  4 shows that the data by and large support the prediction that the potentials are the same for all evolutionary lineages. That is, for each of the three categories the posterior probability intervals for are consistent with a common underlying potential ρ c for the majority of genome pairs i.
This is a further piece of support for the evolutionary null model.

Evolutionary potentials ρ c correlate with scaling exponents α c
The previous section has shown that the data are mostly consistent with constant evolutionary potentials across the genome pairs. We will now assume that the evolutionary potentials all equal a common potential ρ c and estimate it by combining data from all genome pairs. We Evolutionary potentials across different lineages  Figure 5 shows a scatter plot of α c against the estimated ρ c .
Note that, since the evolutionary potential ρ c is a measure of the relative frequency of domain-count changes between closely-related species, and α c is a measure of the scaling of the number of domains with genome size, there is a priori no reason why these two quantities should be strongly correlated. However, as predicted by our evolutionary null model, there is clear evidence of a linear dependency between the exponents α c and the evolutionary potentials ρ c .
Rather than a simple relation ρ c = α c we find that ρ c varies over a somewhat smaller range, i.e. the 99% posterior probability interval for the slope of the correlation runs from 0.65 to 0.79. One possible explanation is that, because the estimation of the numbers of domain-count changes Δn c is the same for all categories, we might under-estimate the numbers of domain-count changes more for categories with large ρ c than for categories with low ρ c .

Implications for the rates of horizontal transfer
In general, the rate at which additions/deletions occur is the product of two independent factors. First, the rate at which domain additions and deletions are introduced into individuals of the population, and second the fraction of the time that such mutations are being fixed into the population. There are likely three main mechanisms through which domain additions or deletions are introduced: duplications, deletions, and horizontal transfers. To a first approximation, the rates at which duplications, deletions, and horizontal transfers are being introduced into individuals will be determined by the biases inherent in the mechanisms underlying these processes and not by selection. In contrast, the fraction of the time that such mutations are fixed in evolution will strongly depend on selection.
It is clear that, for duplications and deletions, the rate at which such mutations are introduced is naturally propor-Correlation between exponents α c and evolutionary potentials ρ c  Evolutionary potentials tional to the number of existing domains n c . That is, when the number of domains n c doubles, the total rate at which duplications and deletions are introduced within this category also doubles. Moreover, since selection is not involved, the rate of introduction of duplications and deletions can be expected to be the same for all functional categories c (except of course for transposable elements which are duplicated through a separate mechanism). Therefore, as the rate of introduction is proportional to n c , with the same proportionality constant for each category, and the total rate must be proportional to ρ c n c , this implies that the relative rate of fixation through selection must be proportional to the evolutionary potential ρ c .
Thus, the evolutionary potentials ρ c (and the scaling exponents α c ) have a particularly simple interpretation: they give the average relative rate with which additions and deletions of domains in category c are fixed by selection.
Evidence has accumulated over recent years that horizontal transfers occur in essentially all evolutionary lineages and gene families, see e.g. [17][18][19][20][21][22] and the extensive discussion in the recent review [23]. There is less consensus in the literature, however, regarding the precise amount that horizontal transfer contributes to gene-content evolution. Discussion of the reasons for this lack of consensus are beyond the scope of this article but it is clear that the apparent disparity between the conclusions reached by different authors is a combination of: the fact that different authors ask different questions, i.e. asking what fraction of gene families are affected by horizontal transfer at least once in evolutionary history [24] is very different from asking, for example, the relative rates of gene loss to horizontal transfer [25,26]. The fact that very different types of evidence are used, such as presence/absence of members of a gene family across leafs of the species tree [19,20,26], comparison of gene trees with species trees [21,27], or the presence of genes without known homologs [28]. And finally, the fact that there are technical issues (like the way gene families are build and tree topologies are inferred) which may affect the results quantitatively if not qualitatively.
To the present authors the evidence currently in the literature does suggest that horizontal gene transfer accounts for a non-negligible and maybe even a large fraction of changes in gene content, at least among closely-related genomes. For example, it was found in [20] that up to 20% of the gene-content of proteo-γ bacteria consists of genes that have no homology with any of the other genes among all currently sequenced proteo-γ bacteria, but that do have homology with genes found outside of the proteo-γ clade. It is hard to see how this statistic could result from any process other than a high rate of horizontal transfer. It is thus worthwhile to investigate the implica-tions of our current findings under the assumption that many of the domain additions are due to horizontal transfer.
Although we have no direct evidence, it is attractive to assume that the probability that a domain addition will be fixed in the population does not depend on the mechanism by which it was introduced. That is, the relative rate of fixation of domain additions in category c should be proportional to ρ c for both duplicated domains as well as horizontally transfered domains. If this is indeed the case, it follows immediately from the fact that the overall rate should be proportional to ρ c n c , that the rate at which horizontal transfers are introduced must be proportional to the number of domains n c present in the genome. However, whereas this is naturally the case for gene duplications, it is not clear at all why this should also hold for horizontal transfers. Therefore, our results put rather strong constraints on the rate of horizontal transfer.
One possibility is that horizontal transfer is negligible and that domain additions are dominated by duplications. However, as we have just discussed, this assumption, which we have made in previous work [10,11], appears at odds with recent work. It should be noted, however, that some studies that investigate evolution of gene content over long time scales find that horizontal transfer is only responsible for a minor fraction of all events on a long time scale, i.e. [21]. One hypothesis that might be worthwhile to entertain is that most horizontal transfers are only transient. It is conceivable that horizontally transfered genes consist mostly of 'accessory' genes that are involved with adaptations to the local environment that are easily taken up by genomes moving into a certain environment, but which are also easily lost again when the environment changes, so that the horizontal transfers of these accessory genes contributes relatively little to the gene-content dynamics on long time scales. However, at least to these authors this hypothesis does not seem particularly plausible a priori.
Alternatively, there are several hypotheses that could explain why the rate at which horizontal transfers of domains of category c are introduced is proportional to the number of domains n c already in the genome. First, it is possible that horizontal transfer is highly biased to occur predominantly between genomes that are closelyrelated phylogenetically. One mechanism of horizontal gene transfer, conjugation, does indeed occur preferentially between related organisms. Since closely-related species are likely to have highly correlated domain counts, it is likely that the fraction n c /n of category c domains in the donor genome is close to the fraction of domains of category c in the receiver genome. However, many of the horizontal transfers detected through sequence analysis involve transfers between distally related species.
Another possible explanation is that bacterial habitats naturally separate into different genome-size classes. That is, it is conceivable that bacteria tend to be surrounded mostly by other bacteria of roughly the same genome size. Because the scaling laws apply to all genomes, the fractions n c /n are similar for similarly sized genomes and one would naturally have that the rate at which horizontal transfers of domains of category c occur is proportional to n c . As far as these authors are aware, currently there seems to be no evidence suggesting that there is a characteristic genome size for each bacterial habitat, but it appears that this hypothesis should in principle be testable using metagenomics data.
Finally, it is possible that, even though a given bacterium would generally be surrounded by other bacteria of very different sizes, that horizontal transfer is highly biased to occur predominantly between organisms that have genomes with similar sizes. In fact, there is some evidence in the literature that bacteria can recognize and silence horizontally transfered genes that have an AT-content which is significantly higher than the AT-content of the genome itself [29]. In addition, there is a good correlation between genome size and GC-content [30]. It is therefore conceivable that horizontal transfers between genomes of similar size are much more common than horizontal transfers between genomes of significantly different sizes.
In any case, whatever the underlying mechanism, if horizontal transfers account for a significant fraction of domain additions through evolution, then something must ensure that the rate of introduction of such horizontal transfers is proportional to the number of existing domains n c in the receiving genome.

Conclusion
We have shown that, across all bacteria and for most highlevel GO categories c, the number of domain occurrences n c scales as a power-law in the total number of domains n, with scaling exponents α c varying from close to zero to a bit larger than 2. We have derived what we believe is the simplest evolutionary model that can account for the observed scaling laws. This 'null model' assumes that, across all evolutionary lineages and all evolutionary times, the relative rate r c /r at which additions and dele- An interesting question is if our simple null model can also explain the observed power-law distribution [7][8][9] of genome-family sizes in each genome. In previous work [7] one of us has suggested that the simplest explanation for the power-law distribution of gene family sizes is a multiplicative noise process. Although we will defer a detailed analysis of the gene-family size distributions implied by our null model to future work, it is clear that the basic ingredients for such a multiplicative noise process are already present. Since the model only constrains the relative rates of domains in different functional categories, the overall rate of genome growth/shrinkage can fluctuate randomly, and the rates of different families within a functional category can also fluctuate around a common mean. It is interesting to note that our null model implies that categories with large exponents, such as transcription factors, should show larger fluctuations in gene family sizes than categories with small exponents. Since the categories with large exponents are more abundant in larger genomes this in turn implies that the exponent of the gene-family size distribution should increase (i.e. be less negative) for larger genomes. This is indeed what is observed [7].
If, as recent work suggests, horizontal transfer is an important force in shaping the gene-content of genomes, then our results put strong constraints on the rates r c at which horizontal transfers of domains of different functional categories c can occur. In particular, we find that the rate at which domains of category c are horizontally transfered into a genome must be proportional to the number of domains n c already existing in the receiving genome. An important avenue for future research is to clarify the underlying mechanism that is responsible for this surprising fact.
As our results have made plausible that the evolutionary potentials ρ c (and the corresponding scaling exponents α c ) are fundamental constants of the evolutionary process that apply across all time and all evolutionary lineages, the major challenge is now to elucidate what determines these numbers. In this respect it is important to note that the functional categories c that we consider are taken directly from the human-defined Gene Ontology hierarchy and are thus rather subjective. A first challenge for future work is therefore to identify a procedure that divides domain families into functional groups in a more objective manner. Although difficult with the current amount of available data, one possible approach is to estimate evolutionary potentials ρ f for individual domain families and to investigate if these fall into a small number of natural classes. That is, it is conceivable that on some more fundamental level there are only a small number of distinct exponents, for example α = 0, α = 1, and α = 2, and that the observed scaling laws with more complex exponents are different mixtures of these more fundamental scaling laws. Finally, we believe that the exponents α c reflect fundamental design principles of bacterial life, maybe similar to the way geometry and architectural design principles demand that the number of windows in a building scales as the 2/3 power of the building's volume. Seen from this point of view the exponents α c encode crucial information about the basic design that is shared by all bacterial life.

Domain counts
We obtained all 630 currently available bacterial genomes from the NCBI database [31]. To count the number of occurrences of each Pfam domain in each fully sequenced bacterial genome we ran HMMer [32] using all Pfam models on all proteins encoded in each genome, as annotated in the NCBI reference file. We thus assume that there are no significant fluctuations in the quality of gene prediction across the genomes. A hit was considered a valid domain if its score was equal or bigger than the so-called gathering score of the model provided by the Pfam web site, and it did not overlap with any other hit of lower Evalue. There were 4,732 Pfam domain families with at least one occurrence across the 630 bacterial genomes. To count the number of domain occurrences per functional category we used a mapping from Pfam domains to Gene Ontolology terms [15] which is available at http:// www.geneontology.org/. If a domain-family f maps to category c it will be associated with c and all parent categories of c in the Gene Ontology hierarchy.

Bayesian fitting of exponents
We used a Bayesian model to fit a power-law of the form for each category c. We discard all genomes with zero counts, i.e. n c (g) = 0, for each category c and logtransform the remaining domain-counts, i.e. (x g , y g ) = (log[n c (g)], log[n(g)]). We assume that the pairs (x g , y g ) derive from a line y g = αx g + β plus noise of unknown size in both x-and y-direction. In addition we assume a rotationally invariant prior for the slope α. Under these assumptions the posterior probability density for the slope α given the data D is given by where G is the number of genomes, σ xx is the variance of x values, σ yy the variance of y values, and σ xy the covariance of x and y values. Note that the optimal line in this procedure corresponds roughly, i.e. up to the effects of the rotationally-invariant prior, to the line that minimizes the sum of the squared orthogonal distances of the data points to the line. The latter also corresponds to the first principal component of the data.
We selected all GO categories that have nonzero count in at least 95% of the genomes (600 out of 630), where the fraction of the variance explained by the fit is at least 0.9375 (this corresponds to the average distance to the data-points from the fitted line being 0.25 or less of the average distance of the data-points to the center of mass of the scatter), and where the average number of domains (averaged over all genomes) is at least 5. This led to 156 categories listed in Additional file 1.
To estimate the exponents γ c we make use of the additional information that the noise in the fraction f c is almost certainly much smaller than the noise in dn c /dn. .

Extracting closely-related pairs of bacteria
We extracted the phylogenetic tree of bacteria from the tree of life that was produced by Ciccarelli et al. [16] based on the concatenated protein sequences of 31 protein families. As shown previously [33], even strains that are so close that they traditionally would be considered the same species, i.e. more than 94% nucleotide identity between orthologous genes, can have substantial differences in their gene content. In selecting 'close' pairs of organisms we want, on the one hand, to be able to estimate relative rates, for which we need a large enough number of domain additions and deletions to have taken place. On the other hand, the further apart the organisms, the harder it is to accurately estimate the total number of addition and deletion events that have taken place (see below). We decided to select all pairs of species for which the average identity at the amino acid level of orthologous proteins was at least 0.75, i.e. distance less than 0.25. With this definition one of the most distant pairs considered was Escherichia coli and Vibrio Cholerae. To avoid redundancy and pairs with too few events, we clustered all genomes whose distances were 0.01 or less and took a single representative genome from each cluster. With these cutoffs we obtained 93 pairs of bacterial genomes which are listed in Additional file 1.

Estimating domain-count changes Δn c
We estimate the number of domain-count changes Δn and and Similarly, when the x f are given, the probability of e f conditioned on these variables is given by and we can numerically solve for the ef that maximizes this likelihood. We start by setting all e f = 0 and use the above equations to, iteratively, solve for λ, μ and the x f given the e f , and then the e f given the x f . This is repeated until a fixed point is reached. Finally, the estimated total number of events Δn f for family f equals δ n f + 2e f . In this way we estimate the number of events separately for each of the genome pairs i we analyze.
We originally performed this procedure including all Pfam domains. However, doing this we found that the number of extra moves e f estimated for categories associated with transposons and bacteriophages was many times larger than for all other families. This is of course to be expected as both transposons and phages actively multiply their domains. However, in equations (33)

Eugene V. Koonin
This is an important and welcome development of Van Nimwegen's 2003 classic on the scaling laws for different functional categories of genes in prokaryotic genomes. That classic study established that different functional classes of genes all scale according to power laws but with different, function-specific exponents that are (at least, approximately) the same in all prokaryotic lineages and, supposedly, the same throughout the course of prokaryotic evolution. This paper underpins the scaling laws with the (apparently) simplest conceivable evolutionary model. Under this model, the dynamics of protein domains in each functional category depends on just two variables, the number (fraction) of domains of the given category that are already present in the genome and the intrinsic evolutionary potential of the category. It is shown that the observations on the actual counts of domains in genomes are well explained if the evolutionary potentials are category-specific but invariant across bacterial lineages. All of the above had to be done in order to obtain a concrete evolutionary mechanism yielding the observed scaling laws but the above results are not at all unexpected. As I see it, the interesting things start coming up when one starts considering mechanism of domain gain in specific terms. For the model to work it is necessary that the rate at which horizontally transferred genes are acquired by a prokaryotes is proportional to the number of domains of the given category that are already present in the genome. Why that would be the case remains unclear. I believe the possibility that horizontal gene transfer is negligible compared to duplication as the source of new genes (domains) can be dismissed with confidence. In all, likelihood, in prokaryotes, horizontal gene transfer is actually a more important source of new genes than duplication. It is hard to think of a way for the domain content in the recipient organism could directly affect the rate of horizontal transfer. So the explanation should be indirect, that is, should include a connection between the domain composition of the donor genome with that of the recipient. Molina and Van Nimwegen consider three possibilities, and to me, the one that horizontal gene transfer predominantly occurs between genomes with similar AT-content (horizontally transferred genes coming from organisms with substantially different AT-content being rapidly destroyed or silenced) is highly attractive considering the strong correlation between AT-content and genome size. These hypotheses are testable by comparative-genomic methods although the analysis will not be easy. Of course, in the face of the rather counter-intuitive finding that the rate of horizontal gene transfer should depend on the fraction of domains of the given category already present, one has to consider the possibility that the proposed evolutionary model is too simple to be true. As far as I can see, more specifically, that would imply that the evolutionary potentials are not time-invariant and/or lineage-independent. The results of the present paper do not seem to point in this direction but I suspect that this is not the last word on the subject, more detailed analyses are necessary. On the whole, this is an enormously interesting subject, and the present paper is a useful stepping stone toward understanding the scaling laws. I am particularly intrigued by the final proposition that there could be only three fundamental exponents, the intermediate values currently observed depending on mixing of genes from the three classes in different proportions. Philosophically, this seems to smack of essentialism but... should there be a mechanistic explanation(s) of the 0, 1, and 2 scaling (and I can think of some), this would be a real step ahead in our understanding of how genomes evolve.

Martijn A. Huynen
The manuscript by Molina and van Nimwegen is the culmination of an observation that was originally made by Erik van Nimwegen and on which there has been follow op from several corners: that the variation in the number of proteins in a specific functional class across species scales as a power-law with the total number of proteins encoded in the species, and that the exponent of that power-law varies between the various classes.
Molina and van Nimwegen analyse their model further to show that the number of additions and losses within each category is proportional to the number of genes of that category already in the genome. They show that this prediction is borne out by comparing the number of genes in closely related genomes. Questions: In work of this referee and van Nimwegen we showed that the frequency distribution of gene family sizes in complete genomes follows a power-law, and we argued that this was consistent with this model, and more importantly, that these two observations about gene family size distributions 1) the size distribution of one family over genomes and 2) the size distribution of all families within one genome, can now be explained by one single model?
The authors argue for domains as the evolutionary unit. This may well be, but such a lower resolution does run the risk of mixing functional categories for domains that function in multiple categories. How many domains did map to multiple categories? And how did that affect the results?
With respect to Horizontal gene transfer the authors do not analyze this process per se, but rather argue that if it is frequent it should also be proportional to gene family size. I do not want to get into a whole HGT debate here, but, although over the complete history of life, along all evolutionary branches, few gene families appear to escape HGT, or at least escape evidence for HGT, compared to processes like gene duplication and gene loss, the quantitative contribution from any generation to the next appears to be small (Snel Bork and Huynen Genome Res 2002). There are other references (BG Mirkin, TI Fenner, MY Galperin, EV Koonin 2004) that do give higher estimates however. In any case it would be worthwhile to mention that relative to gene duplication and gene loss the amount of HGT that actually happens is not necessarily as large as is sometimes implicitly suggested.
With respect to the HGT mainly occurring between closely related species: there is evidence for that (e.g. by conjugation).
With respect to the closely related genomes: can the authors check whether the protein prediction in a pair of genomes was done with the same programs? or did they run the HMMs directly against the DNA? This is a bit nitpicking I know, but comparisons of closely related species have been confounded by inconsistent genome annotations in the past.

Sergei Maslov
The manuscript presents an interesting study of evolutionary implications of previously reported scaling laws in the functional content of bacterial genomes. While it does not answer the ultimate question of why this scaling exists in the first place, it methodically explores all its logical consequences reflected in genomes' evolutionary history.
An important result of this study is that the overall rate of gene (or domain as used in this manuscript) additions AND deletions scales linearly with the number of genes (domains) in a given functional category. This statement is in principle separate and independent from the scaling law itself since it counts the SUM of the rates of domain additions and deletions and not the DIFFERENCE between them.
Unfortunately, when this quantity (Δn c proportional to [rate of additions+rate of deletions]c) is first introduced on page 7, readers could easily confuse it with just the net change in n c (denoted dnc and proportional to [rate of additions-rate of deletions] c ). As a result they would miss one of the central points of the manuscript. I recommend that authors spend some extra time upfront explaining the differences between Δn c and dn c and emphasizing that, a priory, these two quantities are not at all close to each other. To quantify this difference authors might quote the average value of -[rate of additions-rate of deletions] c -/ [rate of additions+rate of deletions] c for their 93 pairs of genomes.
My other comment concerns the proposed "superuniversality" of evolutionary potentials (ρ c ) of a individual functional categories. In general this study indicates that ρ c remains nearly the same for all species and at all timescales of evolution. I have previously observed (S. Maslov, unpublished) that in a group of VERY CLOSELY related genomes (28 fully sequences E. coli and Salmonella strains) the number of transcription factors violates the N 2 scaling in spite of a considerable range of genome sizes (from 4300 to 5800 genes). The best fit to the scaling exponent gives 0.3 instead of 2. This might indicate that evolutionary dynamics might in fact be rather different on very short timescale. This does not contradict the results of this study since (as explained in the Methods) authors have grouped together all very closely related species (AA substitution rate below 1%). However, I believe this observation deserves future scrutiny since it may shed an additional light on elementary evolutionary steps shaping functional contents of bacterial genomes.
Finally, I would like to offer another possibility of how the results of this study could be reconciled with the evidence of widespread Horizontal Gene Transfer (HGT) among bacteria (see section 2.9). One way to explain the linear correlation between the rate of fixed horizontal gene transfers and the number of genes in host's genome, is to assume that a SUCCESSFUL group of HGT-acquired genes needs to be functionally integrated with the rest of the genome. An example would be a HGT-transferred metabolic pathway that in order to contribute to the biomass production needs to be connected with the rest of the metabolic network of its host. Genomes with larger number of genes n have more places where a HGT-transferred Additional material