# Rate of sequence divergence under constant selection

- Alexey S Kondrashov
^{1}, - Inna S Povolotskaya
^{2}, - Dmitry N Ivankov
^{2, 3}and - Fyodor A Kondrashov
^{2}Email author

**5**:5

https://doi.org/10.1186/1745-6150-5-5

© Kondrashov et al; licensee BioMed Central Ltd. 2010

**Received: **29 December 2009

**Accepted: **21 January 2010

**Published: **21 January 2010

## Abstract

### Background

Divergence of two independently evolving sequences that originated from a common ancestor can be described by two parameters, the asymptotic level of divergence *E* and the rate *r* at which this level of divergence is approached. Constant negative selection impedes allele replacements and, therefore, is routinely assumed to decelerate sequence divergence. However, its impact on *E* and on *r* has not been formally investigated.

### Results

Strong selection that favors only one allele can make *E* arbitrarily small and *r* arbitrarily large. In contrast, in the case of 4 possible alleles and equal mutation rates, the lowest value of *r*, attained when two alleles confer equal fitnesses and the other two are strongly deleterious, is only two times lower than its value under selective neutrality.

### Conclusions

Constant selection can strongly constrain the level of sequence divergence, but cannot reduce substantially the rate at which this level is approached. In particular, under any constant selection the divergence of sequences that accumulated one substitution per neutral site since their origin from the common ancestor must already constitute at least one half of the asymptotic divergence at sites under such selection.

### Reviewers

This article was reviewed by Drs. Nicolas Galtier, Sergei Maslov, and Nick Grishin.

## Background

Constant selection mostly manifests itself as negative selection, because permanently beneficial alleles tend to become common. Negative selection constrains and decelerates evolution. Slower evolution of more functionally important nucleotide sites, such as nonsynonymous coding sites, compared to less important sites, such as synonymous sites, is one of the most pervasive patterns in evolution at the sequence level and reflects ubiquity of negative selection and relative rarity of positive selection. Constant selection can accelerate evolution when it favors more mutable alleles [1], but the magnitude of this effect is substantial only when a number of conditions are met [2].

Still, the impact of constant selection on the rate of evolution, manifested in the dynamics of divergence of two independently evolving sequences that originated from a common ancestor, is not as simple as one might think, and should be viewed from two complementary perspectives. Indeed, constant selection affects both 1) the asymptotic value of sequence divergence *E*, reached by two sequences after a long time, and 2) the rate *r* at which the divergence between two sequences approaches *E*. Here, we separately study the impact of constant selection on *E* and on *r*.

## Model

Consider one locus (site) with *I* alleles *A*_{1}, ... *A*_{
I
}with frequencies *x*_{1}, ..., *x*_{
I
}, and fitnesses *1, 1-s*_{2}, ... *1-s*_{
I
}, where *s*_{
i
}are coefficients of selection (without loss of generality, the fitness of one of the alleles can be assumed to be 1). A population is either haploid or diploid with intermediate dominance. The rate of mutation of the *i*th allele into the *j*th allele is *μ*_{i, j}, and the effective size of the population is *N*_{
e
}. We assume that *N*_{
e
}*μ*_{i, j}*<<*1 for all *i*, *j*, so that most of the time one of the alleles is fixed in the population [3], and the dynamics of the model consists of allele replacements occurring at a site at random moments. We consider dynamics of divergence of two long sequences of sites, all with the same properties, that were identical at the moment when they originated from their last common ancestor and evolved independently after this moment. The level of divergence *D* will be characterized by the fraction of homologous sites occupied by different alleles.

## Results

### Asymptotic value of divergence *E*

*D*reaches its asymptotic value

*E*, given by

where *X*_{
i
}is the equilibrium value of *x*_{
i
}under given mutation and selection. *X*_{
i
}can be found from a set of *I* linear equations (eq. 10 from [3]; eq. (11) below presents this set in the case of *I* = 4). A strong enough constant selection favoring one allele over all others can make *E* arbitrarily close to 0.

### The rate r of approaching E: selective neutrality

Let us now concentrate on how *D* approaches *E*. We will characterize the rate of this approach by *r*, the power of an exponential decline of the deviation of *D* from *E*, and, equivalently, by the half-approach period *τ* = *ln2/r*. Indeed, we will see that *D* approached *E* exactly (in special cases) or approximately (in the general case) exponentially.

*s*

_{ i }= 0 for all

*i*. Let us also assume that

*μ*

_{i, j}=

*μ*for all

*i*,

*j*. Then,

*dt*is

*2 μdt*, and the probability of a match becoming a mismatch is

*2(I-1)μdt*(see Chapters 3 and 4 from [4]). The solution of this equation with

*D*= 0 at time 0 can be obtained from

*E*= 1/2, 2/3, and 3/4; and

*r*=

*4 μ*,

*6 μ*, and

*8 μ*, with 2, 3, and 4 selectively neutral alleles (Figure 1). We will mostly assume that our loci are nucleotide sites, so that these three cases are particularly relevant. Note, that nucleotide sites where strong selection acts against one or two "forbidden" nucleotides but fitnesses of the remaining three or two "permitted" nucleotides are the same are described by the model with 3 or 2 selectively neutral alleles.

### The rate r of approaching *E*: two alleles with selection

Let us now consider the case of only two alleles *A*_{1} and *A*_{2}, under an arbitrary selection, described by *s*_{2} = *s*. We will keep the assumption that *μ*_{i, j}= *μ* for all *i*, *j*, which in this case simply means that the rates of reciprocal mutations are the same. The dynamics of *D* are described by (5)

where *x*_{i, j}is the frequency of pairs of sites occupied by allele *A*_{
i
}in one sequence and allele *A*_{
j
}in the other sequence (due to symmetry of the two sequences, we will use this notation assuming that *i* ≤ *j*), and *f*_{i, j}is the rate at which allele *i* is replaced with allele *j*. Then, *D = x*_{1,2} and *x*_{1,1} = *x*_{1} - 0.5*D*, because *x*_{1} = 0.5*x*_{1,2}*+ x*_{1,1}. Analogously, *x*_{2,2} = 1-*x*_{1} - 0.5*D*, because *x*_{2}*+ x*_{1} = 1.

*x*

_{1}=

*X*

_{1}is invariant and is equal to

*e*

^{ S }/

*(e*

^{ S }

*+*1) (eq. 6 from [3]). According to eq. 7 from [3],

*f*

_{1,2}= -

*μS/(*1-

*e*

^{ S }) and

*f*

_{2,1}=

*μS/(*1-

*e*

^{-S}), where

*S = 4N*

_{ e }

*s*, in the case of diploidy. Thus, the equation (5) can be written as:

*r(S)*has a single minimum at

*S*= 0. Therefore, with only two permitted alleles the lowest rate

*r(0)*= 4

*μ*of approaching E, and the longest half-approach period

*τ = ln*2/

*(*4

*μ)*is achieved under neutrality (4), where the value of

*E*= 1/2 is the highest, because

*X*

_{1}=

*X*

_{2}. When

*S*increases,

*E*rapidly approaches zero, and

*r(S)*slowly increases without a limit (Figure 2). A more general case of unequal rates of reciprocal mutations can also be considered using eq. 7 from [3].

### The rate r of approaching E: general 4-allele case

*I*= 4. Here it will be more convenient to use letters

*A*,

*T*,

*G*, and

*C*to denote the four alleles. With 4 alleles under selection,

*D*becomes dynamically insufficient, and the divergence of sequences needs to be described in terms of frequencies of the six possible mismatches

*x*

_{i, j}([5], eq. 1):

*f*

_{i, j}, as before are given by eq. 7 from [3] with the coefficient of selection reflecting the difference between fitnesses of the

*i*th and the

*j*th allele and

*f*

_{i, noti}is the sum of rates of replacement of

*i*th allele into all other alleles. Frequencies of the four matches are inferred from the dynamical variables:

*x*

_{A, A}=

*X*

_{ A }- 0.5

*(x*

_{A, T}

*+ x*

_{A, G}

*+ x*

_{A, C});

*x*

_{T, T}=

*X*

_{ T }- 0.5

*(x*

_{A, T}

*+ x*

_{T, G}

*+ x*

_{T, C});

*x*

_{G, G}=

*X*

_{ G }- 0.5

*(x*

_{A, G}

*+ x*

_{T, G}

*+ x*

_{G, C});

*x*

_{C, C}=

*X*

_{ C }- 0.5

*(x*

_{A, C}

*+ x*

_{T, C}

*+ x*

_{G, C}), where

*X*

_{ i }, are obtainable from [3] (eq 10)

under the assumption that *x*_{
A
}*+ x*_{
T
}*+ x*_{
G
}*+ x*_{
C
}= 1. *D* is the sum of all the six variables in (10).

Numerical investigation of (10) revealed the following facts:

1)*D* approaches *E* strictly exponentially only in the case of just two permitted alleles. Otherwise the approach is decelerating: the time it takes for *D* to decline from 0 to *E*/2, *t*_{0,1/2} is shorter than the time *t*_{1/2,3/4} during which D declines from *E*/2 to 3*E*/4. However, this effect is weak: *t*_{1/2,3/4} never exceeds *t*_{0,1/2} by more than 10%. Thus, we can safely regard *t*_{0,1/2} as the half-approach period *τ*.

2) *D* approaches *E* at a rate that cannot be below *r* = 4 *μ*, which is only 2 times less than *r* = 8 *μ*, the rate for the case of 4 neutral alleles (*S*_{2} = *S*_{3} = *S*_{4} = 0) (eq 4). This lowest *r* = 4 *μ* was observed only in the case of 2 neutral alleles (*S*_{2} = *0; S*_{3} = *S*_{4}-*> 8*) (eq 9). In contrast, *r* can be arbitrarily large, but this happens only when just one allele is strongly favored (*S*_{2}, *S*_{3}, *S*_{4}-*> 8*) (eq 9). When selection is moderate (*S* ≤ 2), 4 *μ < r <*8 *μ* (Figure 3).

3) If the assumption of equal mutation rates is relaxed, the expected value of *r* under a particular mode of selection does not change much (Figure 4).

## Discussion

We have found a contrast between how constant selection affects the asymptotic level of divergence of independently evolving sequences *E* and the rate *r* at which this level is approached. While *E* always declines and can be affected arbitrarily strongly, *r* can both increase and decrease. The increase of *r* can be arbitrarily large, and occurs under conditions when *E* approaches 0, *i. e*. when selection strongly favors one allele over all others. However, *r* cannot decline by more than a factor of two (assuming equal rates of all mutations) relative to the case of 4 selectively neutral alleles, with the lowest *r* attained when two alleles have equal high fitnesses and the other alleles are strongly selected against. Unequal mutation rates apparently do not affect this conclusion. Of course, if selection coincidently permits only those two alleles that rarely mutate into each other, it can lead to an arbitrary decrease of *r*. However, in a general situation, unequal mutation rates do not substantially affect the impact of selection on *r*.

In other words, the highest, for the case of 4 alleles, asymptotic divergence *E* = 3/4 is approached with a half-approach period *τ = ln*2/*(*8 *μ)* ≈ 0.087 *μ*^{-1}, which is neither the shortest nor the longest possible under constant selection. However, the longest possible half-approach period *τ*_{
max
}= *ln*2/(4 *μ) ≈* 0.173 *μ*^{-1}, is just two times longer. Thus, the assertion that constant selection impedes evolution mostly describes its impact on the asymptotic level of divergence, and not on the rate at which this level is attained.

*D*travels half of its way to

*E*in the case of 4 neutral alleles,

*D*= 3/8. For this

*D*, application of Jukes-Cantor formula [4], (eq. 4.2) produces the following value of the per-site number of nucleotide substitutions per a selectively neutral site:

Indeed, under selective neutrality the rate of evolution is equal to mutation rate, so that *K*_{1/2, neutral}= 6 *μτ*_{
neutral
}with *I* = 4, and *τ*_{
neutral
}= *ln*2/(8 *μ)* in this case. Thus, because constant selection cannot reduce the rate of approaching E by more that a factor of 2, at all kinds of sites under constant selection divergence between two species with *K*_{
neutral
}= 1.04 must already represent, at the very least, 50% of their equilibrium divergence, as long as the allelic composition of the diverging sequences was at equilibrium from the very beginning. And divergence of all orthologous sequences that are under constant selection in species with *K*_{
neutral
}*>* 5 must be already very close to its asymptotic level.

It did not escape our attention that the dynamics of divergence of functionally important sequences are usually inconsistent with this key property of the model considered here: orthologous sequences keep diverging even between very evolutionarily remote genomes. There would be little hope to reconstruct the history of a molecule from extant sequences if selection coefficients were constant in time. In this case, sequences having diverged long enough would essentially include only invariant (strongly selected) and saturated (weakly selected) sites. Since we know that protein sequences can actually help in measuring divergences and building deep phylogenies, this suggests that variations in time of N_{
e
}and/or of selection coefficients at individual sites are common, and that their rate of change is what controls in the first place the amount of amino-acid differences between distantly related extant sequences.

## Reviewers' comments

### Reviewer 1

Referee 1:

Nicolas Galtier

CNRS-Université Montpellier II

Montpellier, France

This manuscript implements a very good idea, namely injecting population genetic parameters, typically used to model microevolution, in substitution matrices, typically used to model long-term sequence evolution. The two must be connected in principle, yet classical Markov models used in phylogenetics are made of arbitrary parameters.

These theoretical results are good to know and keep in mind, even though they perhaps do not apply so frequently to real data, as correctly acknowledged, given the assumptions of constant in time population size and selection coefficients. At any rate, this piece might stimulate other studies trying to understand long-term sequence evolution in the light of population genetics. Here are a couple of more specific comments.

1. I am not sure if the model is haploid or diploid. Diploidy is apparently assumed at some stage, but since there is no description of the fitness of a diploid genotype, I guess codominance is assumed. In this case, the diploid model is equivalent to the haploid one, so I would suggest to keep it haploid throughout for simplicity. Or if you make it diploid, please discuss dominance effects.

**Indeed, this needed to be clarified. Bulmer's framework we used can handle either haploidy or diploidy with codonimance. We assumed the latter case, but the only difference this made is that**
S = 4N
_{
e
}
s
**, instead of**
S = 2N
_{
e
}
s
**with haploidy. This is now stated.**

2. I feel like a better job could be made in accounting for, and referring to, the 80's/90's literature on Markov models for pairwise distance calculation between sequences. The papers by Rodriguez et al [5] and Lockhart et al [6], for instance, include equations very close to equation 10 of this manuscript. The matrix expressions and calculations used in these papers might help, by the way - I suspect that the numerical exploration of equation 10 performed by the authors could be done analytically. Once you have expressed the substitution rate from allele i to allele j in terms of mutation, selection and drift, you are within Rodriguez's formalism.

**Indeed, we needed to cite Rodriguez et al.**[5], **because our eq. 10 is essentially identical to theirs basic equation 1. However, as Dr. Galtier noted, Rodriguez** et al **.** [5]**did not study the dependence of solution to this equation on mutation and selection parameters separately; instead they only considered fluxes of allele substitutions. In contrast, we do not see how our results are related to that of Lockhart** et al **.**[6], **because they were primarily concerned with taking into account different nucleotide compositions of different sequences, and we, as well as Rodriguez** et al **.**[6]**, assumed these to be invariant. Of course, our eq. 10 can, as a system of linear ordinary differential equation, be solved analytically using matrix exponents. However, it will probably still be necessary to investigate this solution numerically, to understand its properties.**

3. If I understand correctly, this manuscript demonstrates that there would be little hope to reconstruct the history of a molecule from extant sequences if selection coefficients were constant in time. In this case, sequences having diverged long enough would essentially include only invariant (strongly selected) and saturated (weakly selected) sites. Since we know that protein sequences can actually help in measuring divergences and building deep phylogenies, this suggests that variations in time of Ne and s are common, and that their rate of change is what controls in the first place the amount of amino-acid differences between extant sequences - at least as far as nonadaptive changes are concerned. This conclusion is reminiscent of Gillespie's thoughts about the overdispersed molecular clock. It makes the more or less clock-like behavior of many known molecules even more amazing to me.

**We are very thankful for this interpretation of our main conclusion, and we plagiarize it for Discussion with minor changes, placing it at the very end of the paper. In contrast, we do not see a direct correspondence between our results and Gillespie's overdispersed molecular clock. Indeed, changes of selection, which can lead to prolonged divergence, do not necessarily need to occur at random moments of time.**

### Reviewer 2

Referee 2:

Sergei Maslov

Brookhaven National Laboratory Upton, NY, USA

The conclusions of this study can be summarized in the following statement: the strength of selection has a relatively small effect of the rate at which the evolutionary steady state is approached. Authors offer detailed derivations for the case of a locus consisting of a single nucleotide site. Hence their number of alleles is limited to 4 (A, C, G, and T). In cases where the selection indeed operates at the level of a single nucleotide the derivation presented in the manuscript is through and exhaustive. The mathematically transparent summary of the central results is contained in Eqs. (8) and (9).

I found the discussion part of this manuscript to be somewhat cryptic. For example, I didn't get the point authors were trying to make by using the celebrated Watson and Crick opening sentence in: "It did not escape our attention that the dynamics of divergence of functionally important sequences are usually inconsistent with this key property of the model considered here". What key property of their model authors are talking about?

**We expanded this place a bit, using insightful comments of Dr. Galtier. We believe that the main reason why our analysis may be interesting is that it provides a proof by contradiction to the assertion that selection at individual sites keeps constantly changing, which has long been suspected but never demonstrated convincingly. However, this is a separate story.**

Another confusing part of the discussion is that after deriving K_1/2,neutral = 0.52 in Eq. (12) the authors state that sequences diverged to the level of

K_neutral = 2K_1/2,neutral = 1.04 represent at least 50% of their equilibrium divergence. I thought that is exactly how K_1/2,neutral was defined. Why double it? In short, I feel the manuscript would benefit from giving a bit more details in the discussion session.

**Double it because constant selection can, in the extreme situation, half the rate at which the asymptotic level of divergence is approached. This is now explained.**

Another thing, I would like authors to better describe what practical questions their mathematical observations help to understand? I am shooting in the dark here but do these results might explain a puzzlingly uniform rate at which non-neutral substitutions accumulate in the long-term evolutionary experiment by Lenski and collaborators [7]? Having a few concrete examples where the mathematical treatment presented in the manuscript is important would greatly improve the readability of the manuscript.

**One cannot be sure that Lenski's lab data are relevant to our analysis. Indeed, they placed their bacteria in a novel environment, so their system reflects a dynamics of adaptation and may not be at equilibrium, in contrast to our assumptions. We believe that our analysis is important because it demonstrates that, in the long term, evolution of even conservative orthologs does not proceed under constant selection.**

### Reviewer 3

Referee 3:

Nick Grishin

University of Texas Southwestern Medical Center

Dallas, TX, USA

This reviewer provided no comments for publication.

## Declarations

## Authors’ Affiliations

## References

- Eyre-Walker A: The effect of constraint on the rate of evolution in neutral models with biased mutation. Genetics. 1992, 131: 233-234.PubMedPubMed CentralGoogle Scholar
- Kondrashov FA, Ogurtsov AY, Kondrashov AS: Selection in favor of nucleotides G and C diversifies evolution rates and levels of polymorphism at mammalian synonymous sites. J Theor Biol. 2006, 240: 616-626. 10.1016/j.jtbi.2005.10.020.PubMedView ArticleGoogle Scholar
- Bulmer M: The selection-mutation-drift theory of synonymous codon usage. Genetics. 1991, 129: 897-907.PubMedPubMed CentralGoogle Scholar
- Li WH: Molecular Evolution. 1997, Sunderland, MA, SinauerGoogle Scholar
- Rodríguez F, Oliver JL, Marín A, Medina JR: The general stochastic model of nucleotide substitution. Journal of Theoretical Biology. 1990, 142: 485-501. 10.1016/S0022-5193(05)80104-3.PubMedView ArticleGoogle Scholar
- Lockhart PJ, Steel MA, Hendy MD, Penny D: Recovering evolutionary trees under a more realistic model of sequence evolution. Mol Biol Evol. 2004, 11 (4): 605-12.Google Scholar
- Barrick JE, Yu DS, Yoon SH, Jeong H, Oh TK, Schneider D, Lenski RE, Kim JF: Genome evolution and adaptation in a long-term experiment with Escherichia coli. Nature. 2009, 461: 1243-1247. 10.1038/nature08480.PubMedView ArticleGoogle Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.