### Scaling laws in protein domain occurrences

Although genes are natural units in genome analysis there are some disadvantages to using genes as the central units in the analysis of the evolution of genome content. For example, apart from being able to mutate, duplicate, and be deleted, it is well-known that, not infrequently, two genes can fuse into one, single genes can split into two [12], and genes can evolve *de novo* from non-coding sequence. Such events significantly complicate the analysis of the evolution of gene content.

Protein domains form more natural units for the study of the evolution of gene-content for several reasons. It can be argued that protein domains act like 'evolutionary atoms' to a certain extent [9]; Protein domains form functional units [13] that cannot be split into smaller units, and a single protein domain can, in general, not be constructed by fusing multiple occurrences of other protein domains. Therefore, we can safely assume that almost all changes in the number of occurrences in the genome of a given protein domain are due to deletions, duplications, or the horizontal transfer of a domain from another organism's genome. We thus decided to study the evolution of functional gene content in terms of the number of occurrences of different protein domains. Among databases of protein domains Pfam [14] is attractive because the Pfam domain families are disjoint, i.e. at the default settings it is guaranteed that any given DNA sequence segment will be classified to belong to at most one domain family. We thus used Pfam domains as our evolutionary 'atoms'.

We counted the number of occurrences of each Pfam domain in each fully sequenced bacterial genome (Methods). Using a mapping from Pfam to Gene Ontology categories [15] we determined, for each genome *g*, the total number of domains *n*(*g*) that can be associated with *any* GO category and, for each GO category *c*, the number of domains *n*_{
c
}(*g*) occurring in the genome.

Figure 1 shows, for 3 example categories, the number of domains in that category as a function of the total number of domains in the genome (that can be mapped to a GO category). As the figure shows, for all three categories the number of genes in the category *n*_{
c
}scales as a power-law in the total number of domains in the genome *n*, i.e.

{n}_{c}={e}^{{\beta}_{c}}{n}^{{\alpha}_{c}},

(1)

with both the pre-factors *β*_{
c
}and the exponents *α*_{
c
}varying between categories. These power-laws are observed for the large majority of high-level functional categories. For each GO category we fitted a power-law of the form (1) using a Bayesian procedure which in particular provides a posterior probability distribution for the exponent *α*_{
c
}(Methods). We selected 156 GO categories that occur in at least 95% of all genomes and that show good power-law fits (Methods). The inferred exponents match what we found previously based on the gene-number analysis of a much smaller number of genomes [10, 11], i.e. for basic processes such as translation and DNA repair exponents are low, whereas exponents for regulatory functions such a regulation of transcription and signal transduction are largest. The inferred exponents for all 156 selected categories are listed in Additional file 1.

### Evolutionary Model

We want to investigate the implications of the scaling laws (1) for evolutionary dynamics. That is, we want to infer what the scaling laws imply for the behavior of the domain number counts *n*_{
c
}(*t*) as a function of time *t*. It is important to define precisely what we mean by *n*_{
c
}(*t*). A sequenced genome *g* represents a particular bacterial strain and can idealistically be thought of as representing the genome of a single bacterial organism living today with domain counts *n*_{
c
}(*g*). Since bacteria reproduce clonally we can imagine tracing this individual back through time, back to its mother cell, its grandmother, and eventually all the way back until the common ancestor of all currently sequenced genomes. We denote by *n*_{
c
}(*g*, *t*) the number of domains of category *c* that were present in the ancestor organism of genome *g* that was living at time *t*.

Let *t*_{now} denote today and let *x*_{
c
}(*g*, *t*) denote the logarithm of the domain-number, i.e. *x*_{
c
}(*g*, *t*) = log[*n*_{
c
}(*g*, *t*)], and similarly *x*(*g*, *t*) = log[*n*(*g*, *t*)]. In these variables the scaling laws are just straight lines, i.e all genomes *g* (approximately) obey the linear relation

*x*_{
c
}(*g*, *t*_{today}) = *α*_{
c
}*x*(*g*, *t*_{today}) + *β*_{
c
}∀_{
g
}.

We will now derive how these scaling laws constrain the changes in domain-numbers that have occurred throughout time. Let *t* = 0 denote the time at which the last common ancestor of all sequenced bacterial genomes was alive. Note that, since the GO categories that we consider occur in almost all genomes, it is reasonable to assume that they all had nonzero count in the last common ancestor. We let *x*_{
c
}(0) denote the log-domain counts in this common ancestor and *x*(0) the logarithm of the total domain count. Further, we denote by *dx*_{
c
}(*g*, *t*) the change in the log domain-count for category *c*, that occurred in a small interval of time centered around time *t* in the evolutionary history of genome *g*. The log domain-counts *x*_{
c
}(*g*, *t*) and *x*(*g*, *t*) are then by definition given by the integrals

{x}_{c}(g,{t}_{\text{now}})={x}_{c}(0)+{\displaystyle {\int}_{0}^{{t}_{\text{now}}}d{x}_{c}(g,t)},

(3)

and

x(g,{t}_{\text{now}})=x(0)+{\displaystyle {\int}_{0}^{{t}_{\text{now}}}dx(g,t)}.

(4)

Comparing equations (3) and (4) with equation (2) the scaling laws thus imply that we have

{x}_{c}(0)+{\displaystyle {\int}_{0}^{{t}_{\text{now}}}d{x}_{c}(g,t)}={\beta}_{c}+{\alpha}_{c}\left[x(0)+{\displaystyle {\int}_{0}^{{t}_{\text{now}}}dx(g,t)}\right]\forall g.

(5)

Since (5) must hold for *all* genomes *g*, this equation first of all implies a relation between the offsets *β*_{
c
}and the domain counts in the last common ancestor:

*β*_{
c
}= *x*_{
c
}(0) - *α*_{
c
}*x*(0).

More importantly, we find that all genomes must obey

{\displaystyle {\int}_{0}^{{t}_{\text{now}}}d{x}_{c}(g,t)}={\alpha}_{c}{\displaystyle {\int}_{0}^{{t}_{\text{now}}}dx(g,t)\forall g}.

(7)

For short time intervals in which the changes in *n*_{
c
}are small relative to *n*_{
c
}itself, the changes in *x*_{
c
}are related to the changes in *n*_{
c
}through

d{x}_{c}(g,t)=\frac{d{n}_{c}(g,t)}{{n}_{c}(g,t)},

(8)

and similarly

dx(g,t)=\frac{dn(g,t)}{n(g,t)}.

(9)

Substituting these in (7) we obtain

{\alpha}_{c}=\frac{{\displaystyle {\int}_{0}^{{t}_{\text{now}}}\frac{d{n}_{c}(g,t)}{{n}_{c}(t)}}}{{\displaystyle {\int}_{0}^{{t}_{\text{now}}}\frac{dn(g,t)}{n(g,t)}}}\forall g.

(10)

Equation (10) summarizes the implications for domain-count dynamics implied by the scaling laws. It states that, *independent* of which evolutionary history we take, the ratio of the integrals of *dn*_{
c
}/*n*_{
c
}and *dn*/*n* over all evolutionary time must match the scaling exponent *α*_{
c
}. This is illustrated on the left-hand side of figure 2, i.e. equation (10) implies that the ratio of integrals is the same for each of the evolutionary histories indicated as colored lines.

### Time Invariance

The equations (10) reflect the constraints on domain-count dynamics implied by the scaling laws but they don't uniquely determine an evolutionary model. To derive a unique evolutionary *null model* we will assume *time invariance* of the scaling laws. We assume that, if we had collected genomes of bacteria living several tens or even hundreds of million years ago, as opposed to the bacteria living today, we would have observed the *same* scaling laws as we observe today. That is, we assume that there is nothing particularly special about our current time, and that the same scaling laws have held since the last common ancestor, or at least since the origin of the clades from which our current genome sequences derive. We feel that this is by far the simplest assumption that can be made about the evolutionary dynamics and will here analyze its implications.

Given that the scaling laws are invariant in time, we immediately obtain that (10) should hold for each *short* time interval, i.e. we have that

\frac{d{n}_{c}(g,t)}{{n}_{c}(g,t)}={\alpha}_{c}\frac{dn(g,t)}{n(g,t)}\forall g,t,

(11)

or

\frac{d{n}_{c}(g,t)}{dn(g,t)}={\alpha}_{c}\frac{{n}_{c}(g,t)}{n(g,t)}\forall g,t.

(12)

That is, the assumption of time invariance implies that, for each genome *g*, and for each short time interval in its evolution, the ratio between the change *dn*_{
c
}(*g*, *t*) in the domain-count of category *c* and the total change *dn*(*g*, *t*) in domain-count is given by the product of the exponent *α*_{
c
}and the fraction *n*_{
c
}(*g*, *t*)/*n*(*g*, *t*) of all domains that are of category *c*. In particular, equation (12) will apply to the domain-count changes that occurred since the common ancestors of pairs of closely-related species, as illustrated on the right-hand side of Fig. 2. Therefore, we can test the validity of the null model by comparing the domain-counts in the genomes of closely-related bacteria.

### Implications for closely-related pairs of genomes

We now discuss how the prediction (12) can be tested with data from closely-related genomes. Note that, strictly speaking, (12) holds only in the limit of infinitesimally small *dn*(*g*, *t*) and that we have so far implicitly assumed that the *n*_{
c
}(*g*, *t*) are continuous variables, whereas in reality the smallest possible change is *dn*(*g*, *t*) = 1. For the integer-valued quantities *n*_{
c
}(*g*, *t*) equation (12) can be interpreted as follows: whenever a single domain is added to the genome, i.e. *dn* = 1, then the *probability* that this domain is of category *c* is given by *α*_{
c
}*n*_{
c
}/*n*. Similarly, whenever a single domain is removed, i.e. *dn* = -1, then the probability that this domain is of category *c* is also given by *α*_{
c
}*n*_{
c
}/*n*.

Since this interpretation is of key conceptual importance we briefly expand on its meaning.

Mathematically, equation (10) makes a statement about the total overall changes in domain counts that happen over some finite time interval. In particular, the total change *dn*_{
c
}that occurs over some time interval is the *difference* between the number of additions and deletions that occurred during that time interval. From a mathematical point of view, equation (11) is a differential equation that makes a statement about the relative *rates* at which changes in domain-count number occur, i.e. including both additions and deletions. To put it differently, the assumption of time invariance allows us to make statements about time intervals so short that at most one 'event' can occur during such intervals, so that there is roughly speaking no room left for additions and deletions to cancel each other out, i.e. the relation (11) must hold for both of them. The clearest interpretation is in terms of a model where the key quantities are the *rates*, i.e. probability of an event per unit time, at which domain-count changes (either additions or deletions) take place. That is, if *r* denotes the overall rate at which additions or deletions occur, and *r*_{
c
}the rate at which additions/deletions of domains of category *c* occur, then the model predicts

\frac{{r}_{c}}{r}={\alpha}_{c}\frac{{n}_{c}}{n}.

(13)

For pairs of closely-related genomes the number of domain-count changes that occurred since they diverged from a common ancestor is generally very small compared to the total number of domains. Therefore, the fractions *n*_{
c
}/*n* have generally changed little during the time since the two genomes diverged from their ancestor and we will make the assumption that the fraction *n*_{
c
}/*n* can be considered constant. Under this approximation equation (13) predicts that, if during the time interval since the pair's common ancestor, a total of Δ*n* domain-count changes occurred, i.e. the *sum* of all additions and deletions, then the expected number of domain-count changes Δ*n*_{
c
}in category *c* (which is again the *sum* of all additions and deletions in this category) should equal {\alpha}_{c}\frac{{n}_{c}}{n}\Delta n

We collected 93 pairs of fully-sequenced genomes that are evolutionary relatively closely related, using the tree of life that was inferred by Ciccarelli et al. [16] as a guide (Methods). For each pair of genomes *i* we counted the numbers of domain occurrences for each Pfam family and used these (Methods) to estimate the number of domain-count changes \Delta {n}_{c}^{i} for each category *c* and the total number of domain-count changes Δ*n*^{i}. Again, we stress that the \Delta {n}_{c}^{i} are the estimated total number of changes, adding additions and deletions together. For example, if we denote by d{n}_{c}^{i} the *difference* in the number of domains in category *c* occurring in the two genomes of the pair, then we typically find that the estimated \Delta {n}_{c}^{i} is larger than d{n}_{c}^{i} (see Additional file 1). Apart from estimating \Delta {n}_{c}^{i} we estimated, for each genome pair *i*, the fractions {n}_{c}^{i}/{n}^{i} by averaging the domain counts over the two genomes in the pair (Methods). Our model thus predicts that, for each pair *i*, the ratio \Delta {n}_{c}^{i}/\Delta {n}^{i} should be proportional both to the fraction {n}_{c}^{i}/{n}^{i} and to scaling law exponent *α*_{
c
}.

### The fraction of domain-count changes is proportional to the number of existing domains

Equation (13) puts very strong constraints on the dynamics of domain-counts which we will check in three steps. First, we check that, for each category *c*, the estimated fractions Δ*n*_{
c
}/Δ*n* of domain-count changes grow linearly with the fractions *n*_{
c
}/*n*. The left panel of figure 3 shows scatter plots of \Delta {n}_{c}^{i}/\Delta {n}^{i} as a function of {n}_{c}^{i}/{n}^{i} for three selected categories. The axes are shown on logarithmic scales and the straight lines show least-squares linear fits of the form \mathrm{log}[\Delta {n}_{c}^{i}/\Delta {n}^{i}]={\gamma}_{c}\mathrm{log}[{n}_{c}^{i}/{n}^{i}]+{\delta}_{c}.

The left panel of Fig. 3 demonstrates two points. First, comparing the three categories with each other, we see that most domain-count changes occur in the most abundant category and least domain-count changes occur in the least abundant category, with the fraction of domain-count changes \Delta {n}_{c}^{i}/\Delta {n}^{i} indeed scaling roughly linearly with {n}_{c}^{i}/{n}^{i} (compare with the dotted guide lines showing linear scaling). Beyond that, if we compare the numbers of domain-count changes across the different genomes *within* each category we see that, in those genomes where the domains of the category are most abundant domain-count changes in that category are also most abundant. That is, although the data is quite noisy, it is clear that all three clouds of points show a close to linear increase of \Delta {n}_{c}^{i}/\Delta {n}^{i} with {n}_{c}^{i}/{n}^{i}.

The estimated slopes *γ*_{
c
}for all selected GO categories are shown in the right panel of Fig. 3 (and listed in Additional file 1). The estimated *γ*_{
c
}are very roughly symmetrically distributed around 1 with a median *γ*_{
c
}of 1.16. For almost 75% of the categories a slope of *γ*_{
c
}= 1 is within the 99% posterior probability interval.

This thus supports the prediction of our evolutionary null model that the fraction of all domain-count changes that involve domains of category *c* is proportional to the fraction *n*_{
c
}/*n* of all domains in the genome that belong to category *c*.

For about 25% of the categories we infer slopes significantly deviating from 1. It should be noted, however, that the least-squares fitting assumes simple Gaussian noise in log[Δ*n*_{
c
}/Δ*n*], whereas in reality the size of the noise in log[Δ*n*_{
c
}/Δ*n*] increases as Δ*n* decreases. Moreover, whereas the fitting assumes that the numbers of domain-count changes are given, in reality these are estimated (see Methods) and thus themselves subject to uncertainty. We therefore are significantly underestimating the uncertainty in the fitted slope for many categories, and it is reasonable to conclude that for most if not all categories the data is consistent with the predicted linear dependence of Δ*n*_{
c
}/Δ*n* on *n*_{
c
}/*n*.

### Evolutionary Potentials

The results of the previous section support that the rate *r*_{
c
}of domain-count changes involving domains of category *c* is proportional to the number of domains *n*_{
c
}currently present in the genome. Let {r}_{c}^{i} denote the rate of addition/deletion of domains of category *c* for genome pair *i* and let *r*^{i}denote the overall rate of addition/deletion of domains for genome pair *i*. Assuming only that {r}_{c}^{i} is proportional to {n}_{c}^{i} we can generally write for the relative rates

\frac{{r}_{c}^{i}}{{r}^{i}}={\rho}_{c}^{i}\frac{{n}_{c}^{i}}{{n}^{i}},

(14)

which is a generalization of equation (13). The proportionality constants {\rho}_{c}^{i} defined by this equation quantify the extent to which domain-count changes of category *c* are more or less frequent in the lineages of pair *i* than expected based on their frequency {n}_{c}^{i}/{n}^{i}. For this reason we will refer to these proportionality constants as *evolutionary potentials*. That is, when {\rho}_{c}^{i} is high it indicates that, apparently, domain additions and deletions involving domains of category *c* are fixed in evolution at a higher rate in the evolutionary lineages of pair *i*.

Our evolutionary null model predicts that the evolutionary potentials {\rho}_{c}^{i} are the same for all evolutionary lineages, and in addition that the evolutionary potentials {\rho}_{c}^{i} are equal to the scaling law exponents *α*_{
c
}. We will check these two predictions in turn.

### The evolutionary potentials {\rho}_{c}^{i}are constant across evolutionary lineages

Given the estimated numbers of domain-count changes \Delta {n}_{c}^{i}, and the total number of domain-count changes Δ*n*^{i}we can estimate the lineage-specific evolutionary potentials {\rho}_{c}^{i} as follows. For every domain-count change that occurs, the probability that it will involve a domain of category *c* is simply given by the relative rate {r}_{c}^{i}/{r}^{i}. Therefore, if Δ*n*^{i}domain-count changes occur in total, the probability that \Delta {n}_{c}^{i} involve domains of category *c* is simply given by

\begin{array}{r}\hfill P(\Delta {n}_{c}^{i}|\Delta {n}^{i},{\rho}_{c}^{i})=\left(\begin{array}{c}\Delta {n}^{i}\\ \Delta {n}_{c}^{i}\end{array}\right){\left({\rho}_{c}^{i}\frac{{n}_{c}^{i}}{{n}^{i}}\right)}^{\Delta {n}_{c}^{i}}\\ \hfill {\left(1-{\rho}_{c}^{i}\frac{{n}_{c}^{i}}{{n}^{i}}\right)}^{\Delta {n}^{i}-\Delta {n}_{c}^{i}},\end{array}

(15)

where we used the definition (14). Using a uniform prior over {\rho}_{c}^{i} we and for the posterior probability of {\rho}_{c}^{i} given the estimated domain-count changes

\begin{array}{r}\hfill P({\rho}_{c}^{i}|\Delta {n}^{i},\Delta {n}_{c}^{i})d{\rho}_{c}^{i}=\frac{{n}_{c}^{i}}{{n}^{i}}\frac{(\Delta {n}^{i}+1)!}{\Delta {n}_{c}^{i}!(\Delta {n}^{i}-\Delta {n}_{c}^{i})!}\\ \hfill {\left({\rho}_{c}^{i}\frac{{n}_{c}^{i}}{{n}^{i}}\right)}^{\Delta {n}_{c}^{i}}{\left(1-{\rho}_{c}^{i}\frac{{n}_{c}^{i}}{{n}^{i}}\right)}^{\Delta {n}^{i}-\Delta {n}_{c}^{i}}d{\rho}_{c}^{i}\end{array}

(16)

Using (16) we determined posterior probability intervals [{l}_{c}^{i},{h}_{c}^{i}] defined by

{\displaystyle {\int}_{0}^{{l}_{c}^{i}}P(\rho |\Delta {n}^{i},\Delta {n}_{c}^{i})d\rho =0.01,}

(17)

and

{\displaystyle {\int}_{0}^{{h}_{c}^{i}}P(\rho |\Delta {n}^{i},\Delta {n}_{c}^{i})d\rho =0.99,}

(18)

for each category *c* and each genome pair *i*. Figure 4 shows these posterior probability intervals, for all genome pairs *i*, for the categories 'translation', 'metabolic process', and 'regulation of transcription'.

Since the total number of domain-count changes Δ*n*^{i}is often small, it is not surprising that the posterior probability intervals are often rather wide. In spite of this, it can be clearly seen that, consistent with the scaling exponents *α*_{
c
}, {\rho}_{c}^{i} is largest for the category 'regulation of transcription', and smallest for the category 'translation'. Moreover, Fig. 4 shows that the data by and large support the prediction that the potentials {\rho}_{c}^{i} are *the same* for all evolutionary lineages. That is, for each of the three categories the posterior probability intervals for {\rho}_{c}^{i} are consistent with a common underlying potential *ρ*_{
c
}for the majority of genome pairs *i*. This is a further piece of support for the evolutionary null model.

### Evolutionary potentials *ρ*_{
c
}correlate with scaling exponents *α*_{
c
}

The previous section has shown that the data are mostly consistent with constant evolutionary potentials across the genome pairs. We will now assume that the evolutionary potentials {\rho}_{c}^{i} all equal a common potential *ρ*_{
c
}and estimate it by combining data from all genome pairs. We find for the probability of *ρ*_{
c
}given the observed domain-count changes {\Delta {n}_{c}^{i}} and {Δ*n*^{i}}

\begin{array}{r}\hfill P({\rho}_{c}|\left\{\Delta {n}_{c}^{i}\right\},\left\{\Delta {n}^{i}\right\})\propto {\displaystyle \prod _{i}{\left({\rho}_{c}\frac{{n}_{c}^{i}}{{n}^{i}}\right)}^{\Delta {n}_{c}^{i}}}\\ \hfill {\left(1-{\rho}_{c}\frac{{n}_{c}^{i}}{{n}^{i}}\right)}^{\Delta {n}^{i}-\Delta {n}_{c}^{i}}.\end{array}

(19)

Using this equation we estimated *ρ*_{
c
}for each selected category *c*. Equation (13) predicts that the evolutionary potentials *ρ*_{
c
}equal the scaling exponents *α*_{
c
}. Figure 5 shows a scatter plot of *α*_{
c
}against the estimated *ρ*_{
c
}.

Note that, since the evolutionary potential *ρ*_{
c
}is a measure of the relative frequency of domain-count changes between closely-related species, and *α*_{
c
}is a measure of the scaling of the number of domains with genome size, there is *a priori* no reason why these two quantities should be strongly correlated. However, as predicted by our evolutionary null model, there is clear evidence of a linear dependency between the exponents *α*_{
c
}and the evolutionary potentials *ρ*_{
c
}.

Rather than a simple relation *ρ*_{
c
}= *α*_{
c
}we find that *ρ*_{
c
}varies over a somewhat smaller range, i.e. the 99% posterior probability interval for the slope of the correlation runs from 0.65 to 0.79. One possible explanation is that, because the estimation of the numbers of domain-count changes Δ*n*_{
c
}is the same for all categories, we might underestimate the numbers of domain-count changes more for categories with large *ρ*_{
c
}than for categories with low *ρ*_{
c
}.

### Implications for the rates of horizontal transfer

In general, the rate at which additions/deletions occur is the product of two independent factors. First, the rate at which domain additions and deletions are *introduced* into individuals of the population, and second the fraction of the time that such mutations are being fixed into the population. There are likely three main mechanisms through which domain additions or deletions are introduced: duplications, deletions, and horizontal transfers. To a first approximation, the rates at which duplications, deletions, and horizontal transfers are being introduced into individuals will be determined by the biases inherent in the mechanisms underlying these processes and not by selection. In contrast, the fraction of the time that such mutations are fixed in evolution will strongly depend on selection.

It is clear that, for duplications and deletions, the rate at which such mutations are introduced is naturally proportional to the number of existing domains *n*_{
c
}. That is, when the number of domains *n*_{
c
}doubles, the total rate at which duplications and deletions are introduced within this category also doubles. Moreover, since selection is not involved, the rate of introduction of duplications and deletions can be expected to be the *same* for all functional categories *c* (except of course for transposable elements which are duplicated through a separate mechanism). Therefore, as the rate of introduction is proportional to *n*_{
c
}, with the same proportionality constant for each category, and the total rate must be proportional to *ρ*_{
c
}*n*_{
c
}, this implies that the relative rate of *fixation* through selection must be proportional to the evolutionary potential *ρ*_{
c
}.

Thus, the evolutionary potentials *ρ*_{
c
}(and the scaling exponents *α*_{
c
}) have a particularly simple interpretation: they give the average relative rate with which additions and deletions of domains in category *c* are fixed by selection.

Evidence has accumulated over recent years that horizontal transfers occur in essentially all evolutionary lineages and gene families, see e.g. [17–22] and the extensive discussion in the recent review [23]. There is less consensus in the literature, however, regarding the precise amount that horizontal transfer contributes to gene-content evolution. Discussion of the reasons for this lack of consensus are beyond the scope of this article but it is clear that the apparent disparity between the conclusions reached by different authors is a combination of: the fact that different authors ask different questions, i.e. asking what fraction of gene families are affected by horizontal transfer at least once in evolutionary history [24] is very different from asking, for example, the relative rates of gene loss to horizontal transfer [25, 26]. The fact that very different types of evidence are used, such as presence/absence of members of a gene family across leafs of the species tree [19, 20, 26], comparison of gene trees with species trees [21, 27], or the presence of genes without known homologs [28]. And finally, the fact that there are technical issues (like the way gene families are build and tree topologies are inferred) which may affect the results quantitatively if not qualitatively.

To the present authors the evidence currently in the literature does suggest that horizontal gene transfer accounts for a non-negligible and maybe even a large fraction of changes in gene content, at least among closely-related genomes. For example, it was found in [20] that up to 20% of the gene-content of proteo-*γ* bacteria consists of genes that have no homology with any of the other genes among all currently sequenced proteo-*γ* bacteria, but that *do* have homology with genes found outside of the proteo-*γ* clade. It is hard to see how this statistic could result from any process other than a high rate of horizontal transfer. It is thus worthwhile to investigate the implications of our current findings under the assumption that many of the domain additions are due to horizontal transfer.

Although we have no direct evidence, it is attractive to assume that the probability that a domain addition will be fixed in the population does *not* depend on the mechanism by which it was introduced. That is, the relative rate of fixation of domain additions in category *c* should be proportional to *ρ*_{
c
}for both duplicated domains as well as horizontally transfered domains. If this is indeed the case, it follows immediately from the fact that the overall rate should be proportional to *ρ*_{
c
}*n*_{
c
}, that the rate at which horizontal transfers are *introduced* must be proportional to the number of domains *n*_{
c
}present in the genome. However, whereas this is naturally the case for gene duplications, it is not clear at all why this should also hold for horizontal transfers. Therefore, our results put rather strong constraints on the rate of horizontal transfer.

One possibility is that horizontal transfer is negligible and that domain additions are dominated by duplications. However, as we have just discussed, this assumption, which we have made in previous work [10, 11], appears at odds with recent work. It should be noted, however, that some studies that investigate evolution of gene content over long time scales find that horizontal transfer is only responsible for a minor fraction of all events on a long time scale, i.e. [21]. One hypothesis that might be worthwhile to entertain is that most horizontal transfers are only transient. It is conceivable that horizontally transfered genes consist mostly of 'accessory' genes that are involved with adaptations to the local environment that are easily taken up by genomes moving into a certain environment, but which are also easily lost again when the environment changes, so that the horizontal transfers of these accessory genes contributes relatively little to the gene-content dynamics on long time scales. However, at least to these authors this hypothesis does not seem particularly plausible a priori.

Alternatively, there are several hypotheses that could explain why the rate at which horizontal transfers of domains of category *c* are introduced is proportional to the number of domains *n*_{
c
}already in the genome. First, it is possible that horizontal transfer is highly biased to occur predominantly between genomes that are closely-related phylogenetically. One mechanism of horizontal gene transfer, conjugation, does indeed occur preferentially between related organisms. Since closely-related species are likely to have highly correlated domain counts, it is likely that the fraction *n*_{
c
}/*n* of category *c* domains in the donor genome is close to the fraction of domains of category *c* in the receiver genome. However, many of the horizontal transfers detected through sequence analysis involve transfers between distally related species.

Another possible explanation is that bacterial habitats naturally separate into different genome-size classes. That is, it is conceivable that bacteria tend to be surrounded mostly by other bacteria of roughly the same genome size. Because the scaling laws apply to all genomes, the fractions *n*_{
c
}/*n* are similar for similarly sized genomes and one would naturally have that the rate at which horizontal transfers of domains of category *c* occur is proportional to *n*_{
c
}. As far as these authors are aware, currently there seems to be no evidence suggesting that there is a characteristic genome size for each bacterial habitat, but it appears that this hypothesis should in principle be testable using metagenomics data.

Finally, it is possible that, even though a given bacterium would generally be surrounded by other bacteria of very different sizes, that horizontal transfer is highly biased to occur predominantly between organisms that have genomes with similar sizes. In fact, there is some evidence in the literature that bacteria can recognize and silence horizontally transfered genes that have an AT-content which is significantly higher than the AT-content of the genome itself [29]. In addition, there is a good correlation between genome size and GC-content [30]. It is therefore conceivable that horizontal transfers between genomes of similar size are much more common than horizontal transfers between genomes of significantly different sizes.

In any case, whatever the underlying mechanism, if horizontal transfers account for a significant fraction of domain additions through evolution, then something must ensure that the rate of introduction of such horizontal transfers is proportional to the number of existing domains *n*_{
c
}in the receiving genome.