Parameters of proteome evolution from histograms of amino-acid sequence identities of paralogous proteins

Background The evolution of the full repertoire of proteins encoded in a given genome is mostly driven by gene duplications, deletions, and sequence modifications of existing proteins. Indirect information about relative rates and other intrinsic parameters of these three basic processes is contained in the proteome-wide distribution of sequence identities of pairs of paralogous proteins. Results We introduce a simple mathematical framework based on a stochastic birth-and-death model that allows one to extract some of this information and apply it to the set of all pairs of paralogous proteins in H. pylori, E. coli, S. cerevisiae, C. elegans, D. melanogaster, and H. sapiens. It was found that the histogram of sequence identities p generated by an all-to-all alignment of all protein sequences encoded in a genome is well fitted with a power-law form ~ p-γ with the value of the exponent γ around 4 for the majority of organisms used in this study. This implies that the intra-protein variability of substitution rates is best described by the Gamma-distribution with the exponent α ≈ 0.33. Different features of the shape of such histograms allow us to quantify the ratio between the genome-wide average deletion/duplication rates and the amino-acid substitution rate. Conclusion We separately measure the short-term ("raw") duplication and deletion rates rdup∗, rdel∗ which include gene copies that will be removed soon after the duplication event and their dramatically reduced long-term counterparts rdup, rdel. High deletion rate among recently duplicated proteins is consistent with a scenario in which they didn't have enough time to significantly change their functional roles and thus are to a large degree disposable. Systematic trends of each of the four duplication/deletion rates with the total number of genes in the genome were analyzed. All but the deletion rate of recent duplicates rdel∗ were shown to systematically increase with Ngenes. Abnormally flat shapes of sequence identity histograms observed for yeast and human are consistent with lineages leading to these organisms undergoing one or more whole-genome duplications. This interpretation is corroborated by our analysis of the genome of Paramecium tetraurelia where the p-4 profile of the histogram is gradually restored by the successive removal of paralogs generated in its four known whole-genome duplication events.

allows one to extract some of this information and apply it to the set of all pairs of paralogous proteins in H. pylori, E. coli, S. cerevisiae, C. elegans, D. melanogaster, and H. sapiens.It was found that the histogram of sequence identities p generated by an all-to-all alignment of all protein sequences encoded in a genome is well fitted with a power-law form ∼ p −γ with the value of the exponent γ around 4 for the majority of organisms used in this study.This implies that the intra-protein variability of substitution rates is best described by the Gamma-distribution with the exponent α ≈ 0.33.Different features of the shape of such histograms allow us to quantify the ratio between the genome-wide average deletion/duplication rates and the amino-acid substitution rate.

Background
The recent availability of complete genomic sequences of a diverse group of living organisms allows one to quantify basic mechanisms of molecular evolution on an unprecedented scale.The part of the genome consisting of all protein-coding genes (the full repertoire of its proteome) is at the heart of all processes taking place in a given organism.Therefore, it is very important to understand and quantify the rates and other parameters of basic evolutionary processes shaping thus defined proteome.The most important of those processes are: • Gene duplications that give rise to new protein-coding regions in the genome.The two initially identical proteins encoded by a pair of duplicated genes subsequently diverge from each other in both their sequences and functions.
• Gene deletions in which genes that are no longer required for the functioning of the organism are either explicitly deleted from the genome or stop being transcribed and become pseudogenes whose homology to the existing functional genes is rapidly obliterated by mutations.
• Changes in amino-acid sequences of proteins encoded by already existing genes.This includes a broad spectrum of processes including point substitutions, insertions and deletions (indels), and transfers of whole domains either from other genes in the same genome or even from genomes of other species.
The BLAST (blastp) algorithm [1] allows one to quickly obtain the list of pairs of paralogous proteins encoded in a given genome whose amino-acid sequences haven't diverged beyond recognition.The set of their percentage identities (PIDs) is a dynamic entity that changes due to gene duplications, deletions, and local changes of sequences.Duplication events constantly create new pairs of paralogous proteins with PID=100% , while subsequent substitutions, insertions and deletions result in their PID drifting down towards lower values.A paralogous pair disappears from this dataset if one of its constituent genes is deleted from the genome, becomes a pseudogene, or when the PID of the pair becomes too low for it to pass the E-value cutoff of the algorithm.Thus the PID histogram contains a valuable if indirect information about past duplications, deletions, and sequence divergence events that took place in the genome.In what follows we propose a mathematical framework allowing one to extract some of this information and quantify the average rates and other parameters of the basic evolutionary processes shaping protein-coding contents of a genome.
The list of all paralogous pairs generated by the all-to-all alignment of protein sequences encoded in a given genome is generally much larger than the list of pairs of sibling proteins created by individual duplication events.For example, a family consisting of F paralogous proteins contributes up to F (F − 1)/2 pairs to the all-to-all BLAST output, while not more than F − 1 of these pairs connect the actual siblings to each other.The identification of the most likely candidates for these "true" duplicates is in general a rather complicated task which involves reconstructing the actual phylogenetic tree for every family in a genome.This goes beyond the scope of this study, where we employ a much simpler (yet less precise) Minimum Spanning Tree algorithm to extract a putative non-redundant subset of true duplicated (sibling) pairs.The idea of quantifying evolutionary parameters using the histogram of some measure of sequence similarity of duplicated genes in itself is not new.It was already discussed by Gillespie (see [2] and references therein) and later applied [3] to measure the deletion rate of recent duplicates.There are two important differences between our methods and those of the Ref. [3]: • We use relatively slow changes in amino-acid sequences of proteins as opposed to much faster silent substitutions of nucleotides used in the Ref. [3].This allows to dramatically extend the range of evolutionary times amenable to this type of analysis.
• In addition to PID distributions in the non-redundant set of true duplicated pairs used in the Ref. [3] we also study that in the highly redundant set of all paralogous pairs detected by BLAST.It turned out that both these distributions contain important and often complimentary information about the quantitative dynamics of the underlying evolutionary process.The shape of the latter (all-to-all) histogram is to a first approximation independent of duplication and deletion rates and thus it allows us to concentrate on fine properties of amino-acid substitution.
The central results of our analysis are: • The middle part of the PID histogram of all paralogous pairs detected by BLAST is well described by a powerlaw functional form with a nearly universal value of the exponent γ ≃ −4 observed in a broad variety of genomes.Our mathematical model relates this exponent to parameters of intra-protein variability of sequence divergence rates.
• The upper part of the PID histogram corresponding to recently duplicated pairs (PID>90%) deviates from this powerlaw form.It is exactly this subset of paralogous pairs that was extensively analyzed in Ref. [3].This feature is consistent with the picture of frequent deletion of recent duplicates proposed in Ref. [3].
• The analysis of various features of the PID histogram of all paralogous pairs and that of a subset consisting of true duplicated (sibling) pairs allows us to quantify both the long-term average duplication and deletion rates in a given genome as well as a dramatic increase in those rates for recently duplicated genes.
• Abnormally flat PID histograms observed for yeast and human are consistent with lineages leading to these organisms undergoing one or more Whole-Genome Duplications (WGD).This interpretation is corroborated by the genome of Paramecium tetraurelia where the PID −4 profile of the sequence identity histogram is gradually restored by the successive removal of paralogs generated in its four known WGD events.
• Applying the same methods to large individual families of paralogous proteins allows one to study the variability of evolutionary parameters within a given genome.It is shown that larger or slower evolving families are characterized by higher inter-protein variability of amino-acid substitution rates.

Distribution of sequence identities of all paralogous pairs in a genome
We studied the distribution of Percent Identity (PID) of amino acid sequences of all pairs of paralogous proteins in complete genomes of bacteria H. pylori, E. coli, a single-celled eukaryote S. cerevisiae, and multi-cellular eukaryotes C. elegans, D. melanogaster, and H. sapiens.Every protein sequence contained in a given genome was attempted to be aligned with all other sequences in the same genome using the blastp algorithm [1].To avoid including pairs of multidomain proteins homologous over only one of their domains we have filtered the data by only keeping the pairs in which the length of the aligned region constitutes at least 80% of the length of the longer protein.Infrequent spurious alignments between different splicing variants of the same gene or between proteins listed in the database under several different names were dropped from our final dataset.The exact details of our procedure are described in the Methods section.• Region II: This region covers the widest interval of PIDs 30% < p < 90%.N a (p) in this region can be approximated by a power-law form of p −γ with γ ≈ 4 (shown as a dashed line in Fig. 1A.)The best fits to the power-law form in the Region II are listed in Table 2 and (with the exception of yeast and human ) they fall in the 3 − 5 range.The near-universality of the shape of the PID histogram is perhaps best illustrated by an approximate collapse of PID histograms in different genomes when they are normalized by the total number N genes (N genes − 1)/2, of all (both paralogous and non-paralogous) gene pairs in the genome (Fig. 1B).
• Region III: In this region p < 25 − 30% the histogram N a (p) starts to deviate down from the p −γ powerlaw behavior.This decline is an artifact of the inability of sequence-based algorithms such as BLAST to detect some of the bona fide paralogous pairs with low sequence identity.This explanation is corroborated by the observation that the exact position of the downturn of N a (p) in the region III is determined by the E-value cutoff (see Fig. 1S in the supplementary materials).

Birth-and-death model of the proteome evolution
In an attempt to interpret the empirical features of the PID distribution described above we propose a simple stochastic birth and death model of the proteome evolution.It consists of a sequence of random gene duplications, deletions, and changes in amino-acid sequences of proteins they encode.Several versions of such models were previously studied [4][5][6][7] most recently in the context of powerlaw distribution of family sizes.Our model extends these previous attempts by concentrating on evolution of sequence identities as opposed to just the number of proteins in different families.Amino acid substitutions, insertions and deletions cause the sequence identity of any given pair of paralogous proteins to decay with time.Consider two paralogous proteins with PID=p×100% aligned against each other.In the simplest possible case changes in their sequences happen uniformly at all amino acid positions at a constant rate µ = const.The effective "substitution" rate µ combines the effects of actual substitutions and short indels.The PID of this paralogous pair changes according to the equation dp/dt = −2µp.The factor two in the right hand side of this equation comes from the fact that substitutions can happen in any of the two proteins involved, while the factor p -from the observation that only changes in parts of the two sequences that remain identical at the time of the given change lead to a further decrease of the PID.This equation results in an exponentially decaying PID: p(t) ∼ exp(−2µt).
More generally the drift of PID could be described by the equation dp/dt = −v(p).When substitution rate varies for different amino acids within the same protein the relationship between v(p) and p would in general be non-linear.For our immediate purposes we will leave it unspecified.The negative drift of PIDs generates a p-dependent flux of paralogous pairs down the PID axis given by v(p)N a (p).The net flux into the PID bin of the width ∆p centered around p is given by ∆p ∂ ∂p [v(p)N a (p)] (see Fig. 2A) Our model also involves random gene duplication and deletion events (birth and death of new protein-coding genes) which happen at rates r dup and r del correspondingly.In Fig. 2B  In our model the total number of genes N genes in the genome exponentially grows (or decays) according to dN genes /dt = (r dup − r del )N genes .When the genome size of an organism remains (approximately) constant with time one can find the stationary asymptotic solution of the previous equation.In this case one must have r dup = r del so that the second term in the right hand side is equal to zero. In In the steady state solution one has ∂n a (p, t) The conjecture that the normalized PID histogram n a (p) = N a (p)/[N genes (N genes − 1)/2] indeed is nearly stationary during the course of evolution is corroborated by the fact that all six n a (p) curves in various genomes used in our study approximately lie on top of each other in Fig. 1B (compared to unnormalized N a (p) shown in Fig. 1A).
Comparing the Eq. 2 with the empirical form of N a (p) ∼ 1/p 4 in the region II of Fig. 1 one concludes that the drift velocity in real genomes must obey v(p) ∼ p 4 .Such a non-linear dependence of v(p) could be explained by the variability of the effective substitution rate within proteins (intra-protein variability).
Robustness of the functional form of N a (p) with respect to assumptions used in the model.
The birth-and-death model described above is based on a simplified picture of genome evolution.In particular it implicitly assumes: • The neutrality of individual gene duplication and deletion events resulting in identical rates of these two processes in all paralogous families in the genome.
• Identical average amino-acid substitution rates µ 0 in all individual proteins.
Both of these assumptions are known to be not, strictly speaking, true.Sequences of some "important" proteins (e.g.constituents of the ribosome) are known to evolve very slowly.Also, the families containing essential (lethal knockout) genes were recently shown [11] to be characterized by higher average duplication and deletion rates than those lacking such genes.
However, the validity of our main results goes well beyond the validity of the approximations that went into our birth-and-death model.The advantage of using the histogram of sequence identities generated by the all-to-all alignment (N a (p)) lies in its remarkable universality and robustness.When the Eq. 2 is applied to individual families one can see that family-to-family variation of (and correlations between) the duplication rate r dup , the deletion rate r del , and the average substitution rate µ 0 affect only the prefactor in the powerlaw form of N a (p).Thus the exponent γ = 1 + 1/α describing this powerlaw is very robust with respect to assumptions of the model and depends only to the exponent α quantifying the intra-protein variability of amino-acid substitution rates.
The exact mechanisms behind this apparent universality of α are not entirely clear.Chances are that it is dictated more by the protein physics rather than by organism-specific evolutionary mechanisms.A possible path towards derivation of the exponent α from purely biophysical principles starts with the results of Ref. [12], which models the effects of (correlated) multiple amino-acid substitutions on stability of the native state of a protein.However, such analysis goes beyond the scope of the present work and will be reserved for a future study.

Distribution of sequence identities of true duplicated pairs
A highly redundant dataset consisting of all paralogous pairs present in the genome enabled us to quantify the variability of intra-protein substitution rates.Another set of important parameters describing proteome evolution are average deletion/inactivation and duplication rates r del and r dup .As will be shown in the following, the reduced non-redundant dataset consisting only of protein pairs directly produced in duplication events allows us to estimate these rates.
To better understand the difference between those two datasets we illustrate it with a simple example.The Nothing in the BLAST output for a given paralogous pair contains any information if it should or should not be included in the N d (p).However, using the set of all sequence identities of proteins for a given family one could tentatively reconstruct the course of duplication events that led to the appearance of this family.
Generally speaking, this is a rather complicated task involving reconstructing the actual phylogenetic tree for every family in a genome.In this study we use a much simpler alternative based on the Minimum Spanning Tree (MST) algorithm (see Methods for more details).For each protein family this algorithm generates a tentative set of duplication events in its past history.Numbers of pairs included in N a (p) and N d (p) distributions in different organisms are listed in the Table 1.
The dynamics of the distribution of duplicated pairs N d (p, t) is described by simply excluding the duplication term 2r dup N (p, t) from the equation for N a (p, t).Indeed, this term is caused by PIDs of non-duplicated paralogs A'-B and A'-C (dashed lines in Fig. 2 (B)) generated when a gene A was duplicated.However, only the actual duplicated pair A-A' with initial PID of 100% (dotted line in Fig. 2 (B)) is included in the distribution of duplicated pairs N d (p).Thus the dynamics of N d is described by Once again the stationary solution exists for the normalized distribution.However, in this case the correct normalization factor is given by N genes and not N genes (N genes − 1)/2 as for N a (p).Indeed, every duplication event increasing the number of genes by one adds just one duplicated pair to N d (p) but up to According to our empirical findings the average rate of sequence divergence of paralogous proteins in most organisms is described v(p) = 2μp γ , where μ is the substitution rate averaged over all amino-acid positions in all proteins, and γ ≈ 4 is the exponent related to the intra-protein variability of µ.The steady state of Eq. 5: ∂n d /∂t = 0 is satisfied by:

Numerical test of analytical predictions
The analytical results derived above were confirmed by a numerical simulation.An artificial "proteome" used in our numerical model consists of a fixed number of "proteins" of identical lengths.At every timestep one takes the sequence of a randomly selected protein and uses it to overwrite the sequence of another randomly selected protein.This corresponds to a stationary genome case when r dup = r del .Each combined duplication/deletion event is followed by random substitutions of several "amino-acids" (see Methods for details of our simulation).In the beginning of the simulation each amino-acid position within every protein was randomly assigned a substitution rate drawn from the Gamma-distribution with Fitting evolutionary parameters of real proteomes: the long-term deletion rate.
To estimate the average of deletion and duplication rates we performed a two-parameter fit to the ) in the 30% < p < 90% interval (region II in Fig. 1 ).Here A and B = (r dup + r del )/(2μ) are the two free fitting parameters.The exponent γ used in the fitting formula itself was obtained from the best fit to N a (p) in the same region with the power-law form p −γ (see column 3 in Table 1).The ratio r del /μ was extracted from the best-fit value of B and the independently calculated duplication rate ratio r dup /μ (see subsection below).It is listed in the sixth column of the Table 2.
Fitting evolutionary parameters of real proteomes: the short-term deletion rate of recent duplicates.
A very pronounced and reproducible feature in all organism-wide histograms is an abrupt drop as is lowered from 100% down to about 90-95% (region I in Fig. 1.)The drop is as large as 30-fold in prokaryotes and is around 3-to-10 fold in eukaryotes.It is subsequently followed by a increase of in the region II which at low 25% (region III) turns down again only due to limitations of our ability to detect evolutionary related sequences.There exists several possible explanations for this initial drop in the region I : • The gene conversion process.In a gene conversion process a part or the whole sequence of one of the paralogous genes is used as a template to modify the sequence of another.It happens with a reasonable frequency only if those two genes are sufficiently close to each other in their sequences so that DNA repair mechanisms might mistakenly assume that one of them is the corrupted version of the other.If gene conversion events are sufficiently common, the initial separation of a pair of freshly duplicated genes may take a long time, as one of them would be getting constantly converted back to the other.This would result in an abnormally small drift velocity v(p) for p close to 100% and hence to an abnormally high N a (p) ∼ 1/v(p).
• Another, more plausible explanation is that freshly duplicated genes are characterized by a much higher deletion rate r * del ≫ r del [3].Functional roles of such genes have not had enough time to diverge from each other making each of them more disposable than an average gene in the genome.Indeed, for S. cerevisiae and C. elegans it was empirically demonstrated [13,14] that the deletion or inactivation of genes with a highly similar paralogous partner in the genome is up to 4 times more likely to have no consequences for the survival of the organism than the deletion/inactivation of genes lacking such a partner.
The N a dynamics in the region I (p ≃ 100%) is then described by .
while the normalized distribution n a = N a /[N genes (N genes − 1)/2] obeys Here 2μ = v(100%) is the average substitution rate in freshly duplicated pairs and r * del is the deletion rate inside region I.The equation has an exponentially decaying stationary solution which for r * del ≫ r del is simply given by n a (p) ∼ exp(r * del p/μ) This functional form is consistent with the empirical data for p just below 100% and the best fits to r * del /μ are listed in the seventh column of the Table 2. Ref. [3] analyzed the distribution of silent substitution numbers per silent site K s between pairs of recently duplicated genes.
Under the same "drift and deletion" hypothesis used to derive the Eq.7 such K s -distribution N d (K s ) should also have an exponential decaying form N d (K s ) ∼ exp(−r * del K s /μ s ), where μs is the average drift velocity of K s immediately following the duplication event.Fits to this exponential functional form performed in Ref. [3] resulted in r * del /μ s ∼ 7 − 24.Our estimates r * del /μ ∼ 20 − 70 are consistent with those of ( [3] ) provided that the μ/μ s ratio is in 0.1 − 1 interval.
Fitting evolutionary parameters of real proteomes: long-and short-term duplication rates.
The number of paralogous pairs with PID≃100% also contains information about the raw duplication rate r * dup in the genome.This rate is subsequently trimmed down to its long-term stationary value r dup by the removal of a large fraction of freshly created pairs as described in the previous subsection.New pairs with PID=100% are created at a rate r * dup N genes , while they leave the bin containing PID=100% at a rate 2μN a (100%)/∆p .Here ∆p is the width of the bin and N a (100%) is the number of pairs in this last bin.
The width of the bin is assumed to be small enough so that the removal of genes from the bin due to deletion is negligible in comparison to that due to the drift in their sequences.Thus r * dup /μ = 2N a (100%)/(N genes ∆p) .The average duplication rates calculated this way are presented in the fourth column of Table 2 .They are compatible with r * dup /μ s calculated in [15], where the same idea was applied to N d (K s ).
The rate r * dup includes the creation of some extra duplicated pairs which are then quickly (on an evolutionary timescale) eliminated from the genome during a "trial period" while their PID>90%.We have already demonstrated that such a deletion happens at a very high rate r * del and thus has to be treated separately from the background deletion rate r del .The duplication rapidly followed by a deletion does not change the overall distribution of paralogous pairs.Therefore, the long-term average duplication rate r dup used in Eqs. 2, 6 is in fact considerably lower than the raw duplication rate r * dup .An approximate way to calculate it is to use power-law fits to N a (p) in the region II to extrapolate it up to 100%.Such extrapolated value N ext a (100%) could then be used to calculate the long-term average duplication rate as r dup /μ = 2N ext a (100%)/(N genes ∆p).(see the fifth column of Table 2).

Discussion and Conclusions
An estimate of the number of superfamilies in different genomes Any sequence-based method is bound to miss similarities between some of the distant paralogs.The situation could be somewhat improved [7] if one compares proteins' three-dimensional structures which are conserved over longer evolutionary times.The addition of previously undetected paralogous pairs results in some of the sequence-based families merging together to form larger superfamilies.Our empirical observations allow us to estimate the number of such superfamilies contained in a given genome.Indeed, the fraction of paralogous pairs among all gene pairs in a genome consisting of N F mutually unrelated superfamilies is given by 1/N F .A rough estimate of N F is provided by extrapolation of the p −4 powerlaw into the Region III down to some cutoff PID min : Here A is the best fit to N a (p) with the p −4 in the region II.Remarkably, the results of such calculation are roughly genome independent.Using the lowest theoretical limit PID min =5% (the sequence identity of two unrelated sequences composed of 20 amino-acids) results in the effective number of superfamilies N F ranging between 4.7 in C. elegans and 9.9 in D. melanogaster.A more realistic limit PID min =8% [7], which takes into account the non-uniform frequency among 20 amino-acids, somewhat increases the number of superfamilies to 36, 28, 31, 19, 40, and 35 for H. pylori, E. coli, S. cerevisiae, C. elegans, D. melanogaster, and H. sapiens correspondingly.These numbers are still respectably small compared to N F ≃ 1000 one gets by using the cutoff PID min =25% imposed by the inadequacy of sequence-based methods to detect similarity of remote paralogs.

The exponent α in large individual families
The Gamma-distribution ∼ µ α−1 exp(−2µ/µ 0 ) was traditionally used to model and fit the distribution of substitution rates in individual families of proteins (this tradition goes back to [16]).Our approach extends this approach to a proteome-wide scale and demonstrates that beyond its role as a ad hoc fitting function the Gamma-distribution indeed provides an excellent quantitative description of variability of intra-protein substitution rates.
The genome-wide value of the exponent α ≃ 0.33 obtained in our analysis is consistent with its previous estimates in large protein families.For example, the best fits with the Gamma-distribution performed in Refs. [9,10] resulted in the exponent α in the 0.2 − 0.4 range.The authors of Ref. [8] quantified the variability of substitution rates using a large set of orthologous proteins in different genomes (this should be contrasted with paralogous proteins used in our analysis).The fits with Gamma-distribution resulted in a broad range of exponents α for individual proteins.Still the distribution of exponents α peaks around 0.25 (solid histograms in the Fig. 4 of Ref. [8]) .
In principle, our methods could be also applied to large individual families of paralogous proteins.This corresponds to the exponent α ≃ 0.17 − 0.25, which is a somewhat lower than the exponent α = 0.33 that we observed for genomes as a whole.
Smaller individual families do not hold sufficient statistical power to analyze the shape of N a (p).One approach would be to group them together by some shared characteristic (e.g by their size, or by whether or not they contain an essential gene as described in Ref. [11]).However, the N a (p) histogram in such a group would depend on additional parameters such as the rate of creation and removal of families of a given type and thus will not be amenable to our type of analysis.For example the collection of families binned by their size would have additional birth-and-death events due to whole families entering or leaving the selected bin.The rates of these processes would have a non-trivial dependence on the age of a family and thus cannot be easily incorporated into our mathematical framework.
It is important to emphasize once again that the exponent α quantifies only the intra-protein variability of substitution rates at different amino-acid positions within individual proteins.Such variability should not be confused with a much larger protein-to-protein variability of average substitutions rates.Indeed, sequences of different proteins encoded in the same genome are known to evolve at vastly different rates (see [8,10] and references therein).Some sequences, such as e.g.those of ribosomal proteins, remain virtually unchanged over billions of years of evolution, while others change at a much faster pace.In fact, the very importance of a protein is sometimes quantified by its average rate of evolution as more essential proteins involved in core cellular processes tend to evolve at slower than average rates.

Genome size dependence and other properties of long-and short-term duplication and deletion rates
Our data indicate that the long-term duplication rate r dup is of the same order of magnitude as the long-term deletion rate r del (see columns 5 and 6 in the Table 2).This is to be expected since any large discrepancy in these rates would generate much greater differences in genome sizes than actually observed in these model organisms.However, as was proposed by [3], both of these rates are considerably smaller than their short-term ("raw") counterparts r * dup and r * del that include recently duplicated proteins.
Our results for the fruit fly D. melanogaster are consistent with an earlier observation [15] of an abnormally low average duplication rate in this organism.According to our data r * dup /μ is about nine times lower than that in the genome of C. elegans.The long-term stationary duplication rate r dup /μ in the fly is also the lowest in all eukaryotic genomes used in this study but is only three times lower than that in the worm.
Intriguingly, r del /μ, r dup /μ, and r * dup /μ ratios are all positively correlated with the complexity of the organism quantified by the total number of genes in its genome (see correspondingly filled circles, open diamonds, and open squares in Fig. 4).This means that either the per-gene duplication rate in more complex organisms is consistently higher than in their simpler counterparts or that their average amino-acid substitution rate is lower.It is likely that both above trends operate simultaneously.One possible explanation for the latter trend is that the more sophisticated mechanisms of DNA copying and repair of higher organisms lead to lower average amino-acid substitution rates.
On the other hand, we find that the deletion rate of recent duplicates, r * del /μ, (filled triangles in Fig. 4) is negatively correlated with the number of genes in the genome.This result is in agreement with Ref. [17] where this trend was attributed to the decrease in effective population size in more complex organisms.

The effects of whole genome duplications on the histogram of sequence identities
Two of the organisms used in our study (S.cerevisiae and H. sapiens) are characterized by a dramatically lower value of the power-law exponent γ (1.8 for yeast and 2.4 for human) and the overall poor quality of the power law fit to N a (p).One plausible explanation of this anomaly is in terms of Whole Genome Duplications (WGD) in lineages leading to these genomes.It is well established [18] that baker's yeast underwent a WGD event, which most likely occurred about 100 Myrs ago.While the subject remains controversial, it is also commonly believed that the vertebrate lineage leading to H. sapiens (among many other vertebrate genomes) also underwent one or several large-scale duplication events [19,20].In the immediate aftermath of a WGD event the PID distribution changes as follows: N a (p) → 4N a (p) for p<100%, while N a (100%) → 4N a (100%) + N genes .Indeed, every ancestral paralogous pair A-B would give rise to 3 new pairs with the same PID: A-B', A'-B, and A'-B'.At the same time the bin containing the PID=100% would in addition get N genes (or fewer for a large segmental duplication) of freshly created duplicated pairs of the type A-A' and B-B'.The subsequent spread of this sharp peak at PID=100% towards lower values of PID accompanied by a rapid deletion of redundant copies of duplicated genes would result in an effective flattening of the N a (p) histogram in its upper range and thus in lower effective value of the exponent γ.
To further test this hypothesis we analyzed the recently sequenced genome [21] of a ciliate Paramecium tetraurelia.This organism underwent as many as four separately identifiable WGD events [21].We used our standard methods to construct the PID histogram N a (p) from the all-to-all alignment of its nearly 40,000 genes.Due to the sheer size of this proteome we employed the same conservative 10 −30 E-value cutoff we used for H. sapiens and C. elegans.Solid diamonds in the Fig. 6 correspond to the full PID histogram in Paramecium tetraurelia consisting of all 103,828 paralogous pairs detected by our methods.Authors of Ref. [21] identified the lists of putative pairs of duplicated genes generated in each of the four WGD events in the lineage leading to this genome.By dropping one randomly-selected gene from these WGD pairs we generated the set of four progressively more narrow PID histograms.These histograms are also shown in Fig. 6: 41,890 pairs excluding the genes generated in the latest WGD event (solid squares), 25,342 pairs excluding the genes generated in the last two WGD events (solid circles), 22287 pairs excluding the genes generated in the latest three WGD events (open triangles), and 21,417 pairs excluding the genes generated in all four WGD events (red stars).For comparison, the Fig. 6 also reproduces the histogram of 31,078 pairs in the N a (p) of H. sapiens (blue ×-es).One can see that progressive elimination of pairs generated in WGD events gives rise to the N a (p) histogram approaching the universal scaling form: N a (p) ∼ p −4 (the dashed line in Fig. 6).Furthermore, the PID distribution of gene pairs generated in each of the WGD events has a shape that is qualitatively consistent with the predictions of our birth-and-death model.In particular, the gene pairs from the latest round of WGD did not have time to sufficiently diverge.As a result, their PID-distribution (shown in black in the Fig. 3a of the Ref. [21]) has a peak around 95% sequence identity with a half-maximum at 75%.The analysis of Paramecium tetraurelia genome provides an additional strong support to our conjecture that the unusually flat PID histograms in human and baker's yeast are caused by WGD events in lineages leading to these two organims.

The details of generating lists of paralogous proteins
The proteomes of H. pylori strain 26695 and E. coli strain K12-MG1655 were downloaded from the Comprehensive Microbial Resource (CMR) [22] version 1.0.Sequences of S. cerevisiae proteins are from the Saccharomyces Genome Database (SGD) [23] version number 20031001.The D. melanogaster's sequences are from the Berkeley Drosophila Genome Project [24], release 3.1.C. elegans -Wormbase [25], release WS127.H. sapiens -the NCBI database, [26], build 34.1.The initial set of paralogous pairs for each of the organisms was identified by an all-to-all alignment of sequences of its proteins to each other using the BLASTP program [1].For H. pylori, E. coli, S. cerevisiae, and D. melanogaster genomes, the E-value threshold of 10 −10 was employed.This corresponds to p-values of the order of 10 −12 (for H. pylori) and lower.Due to larger genome sizes of C. elegans and H. sapiens an even more conservative E-value of 10 −30 was used to reduce the number of hits generated by the algorithm.
The "raw" datasets for worm, fly and human often contain multiple overlapping protein sequences predicted by different gene models of the same gene (including but not limited to different splicing variants).To avoid spurious hits we first mapped entries in raw datasets to unique gene IDs.This was easy to accomplish in the fly and worm datasets, where names of different gene models differ from each other by the last letter.In human genome, this was done by mapping the gi numbers of sequences in the raw dataset to unique GeneID (LocusLink) identifiers from the Entrez Gene database [27].Subsequently, if multiple BLAST hits were connecting the same pair of gene IDs we kept the one with the longest aligned region.This way we were guaranteed that one and only one pair of splicing (or gene model) variants per pair of gene IDs would contribute to the PID histogram.
In all genomes, only pairs in which the aligned region constituted at least 80% of the length of the longer protein were kept [15].This excludes contribution from pairs of multi-domain proteins paralogous over only one of their domains.
Initially, the PID histogram in S. cerevisiae had two very sharp peaks at 51% and 70%.A close inspection revealed that these peaks are produced by evolutionary related subfamilies of nearly identical transposable elements.To correct for this obvious artifact in S. cerevisiae we removed 108 proteins encoded by known transposable elements listed in the Saccharomyces Genome Database [23] and their homologs.
The overall shape of the PID histogram in regions I and II is not sensitive to the E-value cutoff chosen.In Fig. 1S we show that when the E-value cutoff in the fly dataset was changed from a less conservative 10 −10 to a more conservative 10 −30 value, the shape of the histogram above 40% remained virtually unchanged.
Similarly, the results are nearly independent on the type of the BLOSUM substitution matrix used (in the end we opted for the BLOSUM45.)Finally, we verified that our results are independent of the alignment algorithm utilized to calculate PIDs.Indeed, in the fly dataset we have recalculated PIDs for all paralogous pairs detected by BLAST using much more sophisticated Smith-Waterman algorithm [28].The resulting histogram (shown as blue stars in Fig. 1S) is virtually indistinguishable from that based on the blastp output.

Numerical model of the proteome evolution
We numerically simulated a birth and death model mimicking the evolution of a fixed-size proteome by duplication, deletion and substitutions.We first randomly fill a 2,000×100 matrix with integer numbers ranging from 1 and 20 (20 types of "amino-acids").This constitutes the initial state of our artificial genome/proteome, encoding 2,000 "proteins" of 100 "amino-acids" each.Every amino-acid position in each of the proteins is randomly assigned the substitution rate µ drawn from a Gamma-distribution with α = 1/3.One evolutionary timestep consists of: 1. Duplicate a randomly selected gene in the genome and use this duplicated copy to replace another randomly selected gene (deletion).Thus in this model the deletion rate is exactly equal to the duplication rate.
2. Randomly pick 400 amino-acid positions in the whole genome and substitute amino-acids at those positions to a randomly selected new value.The probability of a particular amino-acid position to be picked is proportional to its substitution rate µ.
This choice of parameters in our model corresponds to r del /μ = r * del /μ = r dup /μ = r * dup /μ = 0.25.Indeed, the average substitution rate per amino acid during one timestep is given by 400/(100 × 2000) = 1/500.It is equal to 0.25 of the per-gene per timestep duplication/deletion rate of 1/2000.In this artificial evolutionary process we have the advantage of keeping track of all the duplicated pairs.Thus, after each duplication event the list of all duplicated pairs is updated and can be directly read off.After repeating the above steps for 20,000 times the full genome alignment of all proteins is produced and stored.The distributions of duplicated and all paralogous pairs shown in Fig. 3 are generated by averaging over 20 such samples.

Identification of true duplicated (sibling) pairs by the Minimum Spanning Tree algorithm
We are naturally not in possession of the set of protein pairs that actually underwent duplication in the course of evolution of a given genome.The identification of the most likely candidates for these "true" duplicates is in general a rather complicated task which involves reconstructing the actual phylogenetic tree for every family in a genome.However, we could make a much simpler educated guess about past duplication events by connecting paralogous proteins in a given family with the Minimum Spanning Tree (MST) that is the tree maximizing the sum of PIDs along its edges (or, to agree with its name, minimizing its opposite sign value).For a family consisting of F proteins such tree has exactly F − 1 edges representing our best guess about the actual duplication events.One can prove the truth of this by induction: when a freshly duplicated pair is created with PID=100% it extends the previously existing Minimum Spanning Tree of a family by one edge.Assuming a constant rate of divergence for all paralogous pairs in a given family, the set of duplicated pairs would continue to form the Minimum Spanning Tree at all times.We used the Kruskal algorithm [29] to approximately detect the MST.

Detection of families of paralogous genes
Families of paralogous proteins used in Figure 5 are defined as mutually isolated clusters of proteins in the network in which paralogous pairs are connected by a link.Every two nodes within a family are either directly or indirectly connected to each other by at least one chain of paralogous links, while different clusters (families) are completely disconnected from each other.Because of our requirement for the length of the aligned region to be >80% of the length of the longest protein in a pair, all proteins within such legacy of whole-genome duplications but I find that explanation dubious.Traces of this duplication in yeast and, especially, in vertebrates are very weak.Including more genomes would help to clarify this issue.
The reported genome analysis is very simple, it cannot be computationally prohibitive.

Authors response:
We agree that extending our analysis to include more genomes is fairly straightforward.However, we want to save the subject of lineage-dependence of the exponent γ (apart from that related to Whole Genome Duplications (WGD)) for future studies and to report it in a separate publication.To check our hypothesis that WGD are responsible for unusual profile of the PID-histograms in human and yeast we analyzed the genome of Paramecium tetraurelia.The lineage leading to this organism underwent as many as four separately identifiable WGD events.The results of our analysis presented in Fig. 6 and the accompanying section of the manuscript have beautifully confirmed our initial hypothesis: while the all-to-all N a (p) histogram in the whole proteome of Paramecium tetraurelia (solid diamonds in Fig. 6) has an unusually flat profile similar to the one we saw in human (blue ×-es in Fig. 6), the removal of proteins generated in WGD events results in a stepper PID-histogram (red stars in Fig. 6) which is in excellent agreement with the universal p −4 functional form (dashed line in Fig. 6).This provides necessary support to our original conjecture that unusually flat PID histograms in human and baker's yeast are also due to (possibly less obvious) duplicated pairs of proteins generated in WGD events in these two lineages.
2. The mathematical model developed in the paper is a typical birth-and-death model.I wonder why the phrase is not used (it would immediately clarify the matter to those familiar with the field) and some of the relevant literature is not cited.

Authors response:
We have modified our notation to incorporate this comment.We also cited the appropriate literature on birth-and-death models ( [4][5][6]) 3. The "true" duplicates are identified using minimum spanning tree under the constant rate assumption.This is quite a crude method and an unrealistic assumption, too.Building actual phylogenetic trees, certainly, would be more appropriate.This might be too hard technically for this amount of material but, at least, the issues should be acknowledged, I think.
Authors response: In the Background section of the manuscript we now explicitly mention that the minimum spanning tree is just a simpler (yet less precise) alternative to reconstructing the actual phylogenetic tree for every family in a genome.
(Clauset et al 2007) presents maximum likelihood estimators for scaling parameters free from these biases.
How do the authors establish confidence in their estimates of the exponent?Authors response: Due to an unavoidably narrow range of our power law fits along the x-axis (The region II includes 0.3 < p < 0.95 or half a decade) we didn't use any sophisticated techniques in our fitting protocols.In fact, in this interval the exponential fits look only marginally worse than those with a power law.Thus for us the exponent γ of a power law fit is just a convenient single parameter quantifying the distribution which is A) consistent with the Gamma-distribution traditionally used to describe substitution rates; B) nearly universal in a broad variety of organisms ranging from H. pylori to D. melanogaster.
2. I was somewhat confused by the renormalization procedure for the raw distance histograms.While I understand that this is required in order to ensure a stationary distribution, I do not see clearly what the biological implications or assumptions of this step are.Perhaps this could be clarified?Authors response: A standard mathematical approach to describing stationary probability distributions in growing systems is to normalize the histogram in question by the sum of its elements ( p N a (p) in our case).However, we noticed that the total number of paralogous pairs p N a (p) detected by all-to-all alignment of all protein sequences grows at the same exponential rate 2(r dup − r del ) as the square of the number of genes -N 2 genes .Moreover, in real genomes the relationship p N a (p) ∼ N 2 genes holds very well (see the newly created Supplementary Figure S2).Thus we decided to use the total number of gene pairs N genes (N genes − 1)/2 ∼ N 2 genes (the theoretical upper bound to the number of paralogous pairs) to normalize the N a (p).The biological implication of this result is that the fraction of paralogous pairs among all gene pairs is roughly the same in all genomes (as manifested by the collapse of normalized distributions in Fig. 1B).
We have also added the new subsection "An estimate of the number of superfamilies in different genomes" in which we speculate that the p −4 could be extended well into the region III down to its theoretical minimum of 5 − 8%.This could be (at least partially) accomplished if in addition to sequence similarity one would use structural similarity to define superfamilies of paralogous proteins.These results indicate that normalizing N a (p) by N genes (N genes − 1)/2 is a remarkably close approximation to normalizing it by the actual (presently unknown) number of paralogous pairs contained in the genome (including the evolutionary relationships missed by sequence-alignment algorithms).
3. I worried a little about the uniqueness of the Gamma-distribution in generating the scaling behavior given the absence of a robust test for the scaling exponent.How many different distributions have been tested, and how robust is the result to departure from the Gamma?
Authors response: As we explained in our response to your question #1, due to an inherently narrow range of the Region II (half a decade) we never state that the power law functional form resulting from Gamma-distributed substitution rates µ is the unique way to mathematically describe the N a (p) histogram in real genomes.Thus we didn't test multiple µ-distributions.Our only claim is that the observed shape of N a (p) is consistent with Gamma-distributed (α = 0.33) substitution rates of individual amino-acids within a protein.

4.
A little more could have gone into the discussion on the mutational processes and the role of selection.
Should we assume rate variation to be the outcome of selection (as in the example of the slow rate of evolution of ribosomal proteins) where perhaps an active site remains more highly conserved, or is it the contention of the authors that some purely stochastic process at the mutational level accounts for this variation?A similar argument could be made for the duplication and deletion equilibrium.As the paper reads now, I am not sure what processes the authors have in mind.
Authors response: These and other points are now explained in the new subsection "Robustness of the functional form of N a (p) with respect to assumptions used in the model" added to the Results section of the manuscript.
5. Surely the study of evolution through the properties of genetic distance histograms is older than Lynch and Conery (2000).I am aware of earlier work by Gillespie and many others.
Authors response: Thank you for pointing this out.We have added the book by Gillespie and references therein to our citation list.
6.The paper needs to be spell checked and grammar checked.
Authors response: Done.
Reviewer 4: Eugene Shakhnovich, Harvard University, Cambridge, Massachusetts, USA Power-law distributions are ubiquitous in Protein Universe and were reported to describe distribution of sizes of gene families, fold families, structural similarity relationships and other properties (1)(2)(3)(4).It had been widely accepted that the underlying reason for their emergence is in evolutionary dynamics of creation of new genes and proteins and several dynamics models have been proposed to describe it (1,3,5,6).Here the authors study the distribution of amino acid sequence similarities in paralogous families and also find a regime where power-law describes the observed histogram well.They proposed a mathematical model akin to master equation approach which essentially assumes that the rate of divergence is non-uniform it depends on the sequence ID of genes in question.The model and the analysis are interesting and make a valuable contribution to the literature on evolutionary dynamics.The major strength of this study is in its highly quantitative character which provides interesting insights about duplication/deletion rates which are apparently dependent on past history.However I would like the authors to address the following questions: 1) The authors consider all gene families in various organisms regardless of their functional distributions.
However recent work of B.Shakhnovich and Koonin (7)  2) The major weakness of this analysis (and other phenomenological approaches) is that the "explanation" for power-law regime comes from an assumption of a certain form of the distribution of substitution rates in the form of Gamma-function.While assuming Gamma-function may result in good fits it is entirely mysterious why does it emerge.The authors make a very interesting hint that Gamma emerges from intra-protein variability of substitution rates but they do not dwell much further on that.In fact such variability does exist.It was quantitatively studied in a microscopic protein evolution model by Dokholyan and myself in 2001 (8).It would be highly instructive to check whether distributions of substitution rates observed in (8) can provide additional insights into the microscopic origin of empirical fits used in this work.
Authors response: It would be indeed extremely exciting to find a truly "microscopic" explanation of the universal parameters of the Gamma-distribution reported in our manuscript along the lines of the Ref. (8).
We feel however, that this goes well beyond the scope of this article.In the new subsection of our manuscript "Robustness of the functional form of N a (p) with respect to assumptions used in the model" we now cite the Ref. (8) and mention that its results lead to a biophysical explanation to the remarkable universality of the exponent α reported in our manuscript.
3) The effective distribution of intra-protein variability seems to depend on family size.Why? Can it be related to functional constrains (e.g.E-and N-families).A comment or further analysis will be helpful.
Authors response: Soon after sending our manuscript for review we realized that our mathematical model is applicable only to whole genomes or large individual families and does not describe PID histograms in collections of many families grouped by their size.Indeed, such collections would have additional birth-and-death events due to whole families entering or leaving the selected bin of family sizes.The rates of these processes would have a non-trivial dependence on the age of a family and thus cannot be easily incorporated into our mathematical framework.Thus we removed the Fig. 5B showing the apparent systematic variation of the exponent γ with the family size, which, in the hindsight, is likely caused by these extra birth-and-death terms.For the discussion of E-and N-families see our response to your question 2).
4) The power-law regime is observed only at sufficient level of divergence.Why?How can current model be modified to account for the full histogram, not only its power-law part?sapiens.As explained in the text, more complex organisms (those with larger N genes ) tend to be characterized by higher values of the first three ratios but lower values of the last ratio.Tables Table 1 -Statistics of datasets used in this study.
The first column is the name of the organism, the second column -the number of protein-coding genes in its genome, N genes , the third column -the number of proteins for which we found at least one paralogous partner, the fourth column is the percentage of proteins with at least one paralog, the fifth column -the total number of distinct BLAST hits generated before we applied subsequent filtering, the sixth columnthe number of paralogous pairs included in N a (p), and the seventh column -in N d (p).

Organism
Proteome The first column contains the name of the organism, the second column -N genes , the number of genes in its genome, the third column is the value of the exponent γ in the best fit with p −γ to N a (p) in the region II.The fourth, fifth, sixth and seventh columns are correspondingly the ratios r * dup /μ, r dup /μ, r del /μ, and r * del /μ defined and measured as described in the text.

Fig. 1A shows
Fig. 1A shows the histogram N a (p) of amino-acid sequence identities (PIDs) p of all pairs of paralogous proteins encoded in different genomes.The p-dependence of these histograms has three distinct regions I,II,III.
family of four evolutionary related proteins A,B,C,D contributes six paralogous pairs to N a (p).This family was actually created by three subsequent duplication events: first A duplicated to give rise to B, then B duplicated to C and finally C duplicated to D. Thus only three out of total six paralogous pairs are directly produced in gene duplication events.The actual number of duplicated pairs could be even smaller if some intermediate genes were deleted in the course of the evolution.In general a family consisting of F proteins contributes at or around F (F − 1)/2 paralogous pairs to N a (p), but only F − 1 duplicated pairs to N d (p).

3 .
The proteome generated by this dynamical process is periodically analyzed in terms of sequence identity of all pairs of its proteins.The resulting distributions N a (p) (filled circles) and N d (p) (open circles) are shown in Fig. 3.They agree quite well with our theoretical predictions: N a (p) ∼ 1/p 4 (solid line) and N d (p) ∼ 1/p 4 exp(−r del /[3µp 3 ]) (dashed line).The best fit to N d (p)/N a (p) generated in our numerical simulation with A exp(−r del /[3µp 3 ]) gives r del /µ = 0.237 in excellent agreement with the actual value of r del /µ = 0.25 used in our simulation.This demonstrates that evolutionary parameters can be successfully reconstructed from the shapes of N d (p) and N a (p).This is especially encouraging in case of N d (p) because of the approximations that went into identifying true duplicated (sibling) pairs by the Minimum Spanning Tree algorithm.
However, only a few of the largest families in any genome contain sufficient number of paralogous pairs (up to F (F − 1)/2 in a family of size F ) to have a meaningful individual N a (p) histogram.Fig.5shows the results of such analysis in C. elegans with individual curves corresponding to N a (p) in the 5 largest families of paralogous proteins.Without exception all these histograms are well fitted by Cp −γ with γ LF ≃ 5 − 6.

Figure 2 -
Figure 2 -Illustration of dynamical processes changing the PID histogram.

Figure 3 -
Figure 3 -Histogram of sequence identities in a numerical simulation of proteome dynamics.

Figure 4 -
Figure 4  -Correlation between the number of genes in an organism and its duplication/deletion rates.

Figure 5 -
Figure 5 -Histogram of sequence identities of individual families in the genome of C. elegans.

Figure 6 -
Figure 6 -Histogram N a (p) of sequence identities and four rounds Whole Genome Duplications (WGD) in Paramecium tetraurelia.

Table 2 -
Deletion and duplication rates