γ-MYN: a new algorithm for estimating Ka and Ks with consideration of variable substitution rates

Background Over the past two decades, there have been several approximate methods that adopt different mutation models and used for estimating nonsynonymous and synonymous substitution rates (Ka and Ks) based on protein-coding sequences across species or even different evolutionary lineages. Among them, MYN method (a Modified version of Yang-Nielsen method) considers three major dynamic features of evolving DNA sequences–bias in transition/transversion rate, nucleotide frequency, and unequal transitional substitution but leaves out another important feature: unequal substitution rates among different sites or nucleotide positions. Results We incorporated a new feature for analyzing evolving DNA sequences–unequal substitution rates among different sites–into MYN method, and proposed a modified version, namely γ (gamma)-MYN, based on an assumption that the evolutionary rate at each site follows a mode of γ-distribution. We applied γ-MYN to analyze the key estimator of selective pressure ω (Ka/Ks) and other relevant parameters in comparison to two other related methods, YN and MYN, and found that neglecting the variation of substitution rates among different sites may lead to biased estimations of ω. Our new method appears to have minimal deviations when relevant parameters vary within normal ranges defined by empirical data. Conclusion Our results indicate that unequal substitution rates among different sites have variable influences on ω under different evolutionary rates while both transition/transversion rate ratio and unequal nucleotide frequencies affect Ka and Ks thus selective pressure ω. Reviewers This paper was reviewed by Kateryna Makova, David A. Liberles (nominated by David H Ardell), Zhaolei Zhang (nominated by Mark Gerstein), and Shamil Sunyaev.


Background
Comparative sequence analysis is a powerful tool for biologists to study evolutionary relationship among animals and plants across diverse taxonomic lineages [1][2][3]. Pairwise sequence comparison is perhaps the simplest com-parative analysis for phylogeny for two reasons [4]. First, calculating pair-wise distances is the initial step for distance-matrix methods of phylogeny reconstruction. Second, Markov-process models of nucleotide substitution used in distance calculations lay a foundation for likeli-hood and Bayesian analyses. One of the sophisticated methods is to estimate nonsynonymous and synonymous substitution rates for interrogating sequence dynamics and constructing phylogenetic trees. Since Ka and Ks represent the number of substitutions per nonsynonymous and synonymous site, respectively, these parameters (or often their ratio Ka/Ks or ω) are important for the estimation of evolutionary rates. The indications of Ka < Ks (ω < 1), Ka > Ks (ω > 1), and Ka = Ks (ω = 1) on evolutionary trends are negative (purifying), positive (adaptive), and neutral mutations, respectively. Ka and Ks can be estimated based on approximate methods, which typically involve three essential steps [4]: (1) counting the number of synonymous (S) and nonsynonymous (N) sites among targeted sequences, (2) counting the number of synonymous (S d ) and nonsynonymous (N d ) substitutions between two orthologous sequences, and (3) calculating the number of synonymous (d s ) and nonsynonymous (d n ) substitutions per site after correcting for multiple substitutions. Most of the methods assume simplified nucleotide substitution paths and involve ad hoc data treatments that are not well-justified [5,6]. For instance, NG (Nei-Gojobori) method, a commonly-used approximate method in the early days, considers all possible evolutionary courses among compared DNA sequences and assumes that each nucleotide can be substituted with any of three other nucleotides at equal rate when it counts both sites and substitutions [7]. It adopts Jukes-Cantor's one parameter formula only to correct for multiple substitutions. Another example, LWL (Li-Wu-Luo) method, classifies sites and substitutions as i-fold degenerate sites (i = 0, 2, 4) and considers unequal rates between transitional and transversional changes only when it counts substitutions, but considers equal rates when counting sites [8]. A modified LWL, LPB (Li-Pamilo-Bianchi) method corrects for bias in counting sites by using different formulas for Ka and Ks estimation, which differentiate LPB from LWL method [9,10]. Versions of LWL and LPB methods were also proposed by distinguishing two-fold degenerate sites and substitutions, taking the account of the transition/transversion rate bias when counting sites and correcting for arginine codons [11,12].
Among approximate methods, YN (Yang-Neilsen) method made significant progress through consideration of transition/transversion rate and nucleotide frequency biases [13]. Based on YN method, we recently proposed a modified YN method (MYN) to distinguish substitutions between purines (A/G) and between pyrimidines (T/C) [13,14]. MYN incorporates most of the major features of sequence evolution but assumes that different sites in sequences evolve the same way and at the same rate. This assumption is somewhat less thorough, and accumulating evidence of rate variation over sites is rather overwhelming [15][16][17][18][19][20]. Since mutation rates certainly vary among sites, and mutations at different sites may be fixed or drifting at different rates due to their versatile roles in the structure and function of gene products (mostly proteins albeit RNAs also fold into different conformations), unequal nucleotide frequencies, different codon usage among species, and variation of substitution rates among different sites should all be taken into account, allowing for significant yet maybe incremental improvements on various parameter estimations. Some sixteen years ago, one of the pioneers of this field, Ziheng Yang suggested γ-distribution (gamma-distribution) as an adequate approximation based on his intensive comparative analysis on several continuous distributions leveraging on sequence data from the globin genes [21]. As γ-distribution has been frequently used in estimating sequence divergence [7,[22][23][24][25][26][27], we adopt it to formulate an improved approximate method, denoted as γ-MYN, based on MYN method [14].
In this method, we assume that nucleotide substitutions follow γ-distribution because negative binomial distribution is known to be generated when Poisson parameter γ varies according to a particular γ distribution among sites [28]. We would like to emphasize that the γ distribution here refers to raw mutation rate rather than γ distribution of ω itself. It has been proposed that nucleotide substitution in coding region is context-dependent [29], and therefore, substitution rates depend on not only the neighboring sequences but also their functional constraints and models that allow for the correlation of substitution rates at adjacent sites were also developed [30,31]. However, as these models tend to produce results similar to the simple gamma model and variations of α can make the distribution suitable for accommodating different levels of rate variations in various datasets [31], we chose the simple gamma distribution as the depiction of raw various mutation rates. Since YN and MYN methods perform better as compared to numerous other methods [12] and MYN improves the performance of YN for most parameter combinations [14], we focus on evaluating the performance of γ-MYN by comparing it to YN and MYN under variable conditions. The definitions of symbols used in Ka and Ks estimations are listed in Table 1.

Computer simulation
Computer simulation is a routine approach for evaluating computational procedures of different algorithms. In molecular phylogeny, one major approach for simulating DNA sequence evolution is to generate an ancestral sequence for the root of a tree and "evolve" it along the tree building process according to substitution models, branch lengths, and substitution parameters [11,[32][33][34][35]. This approach can be implemented in the evolver program in the PAML (Phylogenetic Analysis by Maximum Likelihood [36]) package, which usually uses nucleotide or amino acid sequence data to simulate evolving protein-coding sequences. To assess the advantages of γ-MYN in comparison with YN and MYN, we generated three groups of simulated sequences with the PAML package: (1) equal codon frequencies, (2) human frequencies (based on human protein-coding genes from the ENSEMBL database) [37] and (3) rice frequencies (based on rice proteincoding genes) [38]. We also generated 2,000 sequence pairs with 1,200 bp in length for examining the effect of different parameters.

Consistency analysis and effect of codon frequencies
In general, a better method should have relatively minimal deviations from real values with near infinite amount of data and within a reasonable range of all relevant parameters. In reality, we have to define both data in a limited way and parameter ranges within reasonable boundaries. In this exercise, we use ω = 0.3, 1, and 3 to represent negative, neutral, and positive mutations, respectively [3], and fix parameter t to 0.6 for initial assessment. Since genuine values for κ often range from 1.5 to 5, we always fix κ = 3.75 as typical. Considering that γ-MYN differentiates κ Y from κ R , we always fix one of them to 3.75 and allow the other varying from 1 to 10. We then analyze ω among data generated with YN, MYN, and γ-MYN against κ R (fixing κ Y = 3.75), using the three codon frequencies under different selective pressures ( Figure 1A-I). We observed that γ-MYN produces less deviated ω from the standard data under negative selection as we perform analyses for different species. Although γ-MYN performs in a very similar way as MYN does, it is obviously better than YN under either positive or neutral selections. Since biased codon frequencies often have opposite effects as compared to the bias of transition/transversion rate ratio, ignoring codon frequency bias can lead to an overestimation of ω [13]. Using empirical data from human and rice, which represent distinct codon usages, we also did not detect any effect among different codon frequencies (Figure 1A-I). Since most of the evolutionary studies tend to calculate evolutionary rates between closely-related species, future research should focus more on the effect of different parameters and the improvement of calculations under negative selection.
Effect of γ-distribution MYN method assumes that different sites in a sequence evolve in the same way and at the same rate. It is obvious that such an assumption does not happen in the real world for most proteins and their genes. For instance, mutation rates are not the same in nuclear and organellar genomes among different species [39]. In addition, sequence variations among portions or domains of proteins mutate differently from a fixed mutation rate due to their specific structural and functional constraints for different genes under different selective pressures. Therefore, we introduced a parameter α in MYN method so that each substitution rate across sites is assumed to follow γ-distribution.
Since α is an unknown random variable and its variations may lead to changes of probability density of γ-distribution as well as deviations of γ-MYN method, we chose different parameters to force it to deviate from real values under different selective pressures (Figure 2). For a qualitative survey, the order of estimated values of ω, in the cases of κ R = 1, 2, and 3, is: YN <γ-MYN < MYN; the order of estimated values of ω for the rest cases, κ R = 4, 5, 6, 7, 8, 9, and 10, is: γ-MYN<MYN <YN. Furthermore, we observed that estimated ω do not change much as α varies when expected ω = 1 or 3, and γ-MYN again performs better when ω = 0.3 than it does when ω = 1 or 3. Because most calculated ω values indicate negative selection, and variation of α has stronger influence under negative selec- Estimated ω based on YN, MYN, and γ-MYN Estimated ω based on YN, MYN, and γ-MYN. We plotted ω values estimated by YN, MYN, and γ-MYN when κ Y = 3.75, considering κ R varying from 1 to10. We used the canonical genetic code for simulated sequences with 1.6 million codons and three sets of codon frequencies: equal (A to C), human (D to F) calculated from human protein-coding genes, and rice (G to I) calculated from rice protein-coding genes. ω Estimated ω when κ Y = 3.75 and κ R varies from 1 to 10 under negative selection Figure 2 Estimated ω when κ Y = 3.75 and κ R varies from 1 to 10 under negative selection. We obtained better ω estimates by introducing parameter α when orthologous genes are under negative selection with ω varying from 0.1 to 0.9. The canonical genetic code was used for simulated sequences with 1.6 million human codons. tion, we analyzed the variation of ω in a range of 0.1 to 0.9 to evaluate the effect of α on ω. We obtained different optimal α values when ω varies from 0.1 to 0.9, and plotted different γ distribution densities ( Figure 3). Each curve appears reaching its maximum and goes down with an increasing substitution rate. The peaks of the curves shift to the left and become lower in density when optimal α values decrease from 4.8 to 1.5; the decrease is attributable to the increase of ω (selective pressure) from 0.1 to 0.9. Furthermore, selective pressure shows significant effects on α, with an increase in the probability at the lowest and highest substitution rates across all sites. Because γ-MYN produces less biases than both YN and MYN do when ω varies from 0.1 to 0.9 under α = 4 (data not shown), we chose α = 4 as a typical value for our analyses.

Effect of t
The parameter t represents divergence time between two sequences. To test the effect of t on our method, we use human codon frequency (2,000 pairs of sequences with 400 codons for each case), and vary t from 0.1 to 1. γ distribution density as a function of substitution rates at optimal α values. We plotted different γ distribution densities as a function of substitution rates at optimal α values: 0.8, and 0.9, α = 1.5. Note that each curve reaches its maximum and goes down with increasing substitution rates.  Effects of κ R and κ Y We used the same data (2,000 pairs of human codon sequences with 400 codons for each case) and methods to test the effects of κ R and κ Y . We plotted the average estimates of ω from YN, MYN, and γ-MYN methods against κ Y = κ R for the parameter combinations: the expected ω values vary as 0.2 and 0.3 when α = 4 ( Figure 5). While the curves produce from YN and MYN methods superimpose each other when κ R (=κ Y ) varies from 1 to 10, γ-MYN deviates clearly less from the expected ω. We found that γ-MYN still performs better than the other two methods, whereas MYN is degraded to YN when κ Y is equal to κ R . The result suggests that the assumption of variable substitution rates among different sites is necessary to Ka and Ks calculations.

Effect of S%
Usually the effect of S% (the fraction of synonymous sites in a sequence) is considered as a factor of method evaluation. Changes in ω in relation to S% are often evaluated based on the effect of S% on the deviation of Ka and Ks. Therefore, an overestimated S% may give rise to underestimation of Ks and overestimation of Ka, resulting in overestimation of ω. Likewise, underestimation of S% may also lead to overestimation of Ks and underestimation of ω. It has been reported that S% has enormous influence on Ka and Ks under negative selection but has neglectable effect under positive selection [40]. We used human sequences to examine the effect of S% on γ-MYN (Table   2), fixing κ Y to 3.75. As κ R increases, the value of S% generated from our method exhibits minor fluctuations under different negative selections, when compared to that from YN. But the difference between γ-MYN and MYN is minute under this condition. In more details, the order of estimated values of S%, in the cases of κ R = 1, 2, and 3, is: YN < MYN <γ-MYN; in the rest cases, the order of estimated values of S%, when κ R = 5, 6, 7, 8, 9 and 10, is: MYN <γ-MYN < YN. We did not observe any obvious trend under the condition of κ R = 4. Therefore, γ-MYN is deemed insensitive to S% changes.

Effect of sequence lengths
The length of homologous genes subjected to an analysis usually varies in actual calculation. In order to evaluate the effect of variable sequence lengths, we use two groups of simulated rice sequences under the conditions of (1) ω = 0.2, κ R = 10, κ Y = 1, t = 0.6, and α = 4; and (2) ω = 0.3, The effects of κ R and κ Y based on YN, MYN, and γ-MYN   (Table 3). It appears that all three methods overestimate ω regardless the number of codons in the datasets. In particular, despite the fact that all three methods give rise greater biases for shorter sequences (<300 codons), γ-MYN performs better than the other two methods. We also found that the performance of γ-MYN is getting better faster than the other two methods as the number of codon increases.

Testing real data
We used three ortholog datasets for the test, 14,323 from human-dog, 16,066 from human-mouse, and 12,351 from human-chimp. For a more comprehensive display, we examined the cumulative percentage of κ R -κ Y ( Figure  6), showing different transitional substitutions with unequal frequencies. For example, the cumulative percentages for κ R -κ Y > 0.4 for human-dog, human-mouse and human-chimp orthologs are 52.27%, 52.66%, and 24.47% and those for κ R -κ Y < -0.4 are 25.36%, 24.31%, and 21.87%, respectively. In the rest cases, for |κ R -κ Y | ≤ 0.4, they are 22.37%, 23.02%, and 53.66% for the three ortholog groups. We found that the value for humanchimp is more than twice as much as that of human-dog (or human-mouse), and the reasons are attributable to a close evolutionary relationship between human and chimpanzee [41].   (Table 4). We chose the value of 0.4 as a threshold so that the three cases can stand for three groups of κ R under the condition of κ Y = 3.75: (1) κ R = 5, 6, 7, 8, 9 and 10; (2) κ R = 1, 2, and 3; (3) κ R = 4. Other than YN and MYN, we also used a maximum likelihood method proposed by Goldman and Yang (denoted as GY) [33].
The results showed a few interesting trends. First, GY performs in a similar way as YN does as compared to MYN and γ-MYN; it is consistent with our previous simulation results, as they share a common consideration of transition/transversion rate bias and nucleotide frequencies bias [12,42]. failed. This gene has been studied previously based on a population genetics analysis of extended-haplotypehomozygosity in Northeast Asians, and a possible positive selection scheme was proposed for it [43]. This result is in accordance with the result of large-scale scanning on positively selected genes between human and chimpanzee genomes [44,45].

Rice Codon Frequencies (α = 4)
Number Cumulative percentage of κ R -κ Y for human-dog, human-mouse and human-chimp orthologs at a bin size of 0.2 Figure 6 Cumulative percentage of κ R -κ Y for human-dog, human-mouse and human-chimp orthologs at a bin size of 0.2. We divided the x-axis into 100 bins and plotted the cumulative percentage of κ R -κ Y from the orthologous genes of human-dog, human-mouse and human-chimp.

Why should we continue developing Ka/Ks methods?
A major limitation of Ka/Ks methods, mentioned in literatures, is their poor ability for detecting positive selection (adaptive selection) [6,[46][47][48][49][50][51][52]. To detect positive selection at sites requires that ω value averaged over all branches is >1 and to detect positive selection along lineages requires ω value averaged over all sites is >1 [6]. Therefore, Ka/Ks methods are only useful to weight average selection pressure over sites and branches. They may not be able to detect positive selection for some highly conserved proteins that are mostly invariable but become fragile when a single site alters. Other detrimental cases include transmembrane domains where high variability may not change its physiochemical property. To overcome the weakness, there have been methods developed, such as Likelihood Ratio Test (LRT) [53][54][55][56] implemented in PAML [57] and Hyphy software [58], to identify positive selection [45,[59][60][61], which tend to be qualitative. An obvious pitfall of these methods is that they do not weigh the relative degree of two genes under negative selection. Ka/Ks methods can perform better in this regard [38,[62][63][64] as they tend to be more quantative. In addition, Ka/Ks methods can be readily extended from our current work to detect the sequence alternations that lead to protein structure changes and positive selection, in combination with other techniques, such as ancestral sequence reconstruction [65][66][67] and primary [68,69] or tertiary windowing [70,71]. Therefore, we believe that the two different lines of methods (LRT-like methods and Ka/Ks methods) should also be useful under appropriate conditions.

Why should we introduce new parameters?
With the introduction of the parameter α, our method γ-MYN shows significant improvements when compared with the other two related methods in both simulation and tests on real data. As γ-MYN assumes that evolutionary rate at each site follows γ-distribution, we found that the parameter α has observable effects under different evolution rates. For instance, when ω >= 1, γ-MYN  remains stable. We also observed that selective pressure can overwhelm the variable substitution rates across sites and it becomes the most influential factor when increasing dramatically. Therefore, when we consider strong positive selection and neutral evolution, the effect of variable substitution rates across sites can be somewhat neglected.
In addition, more parameters often lead to increase of complexity of an algorithm, resulting in the decrease of efficiency. However, we hold the view that proper introduction of parameters is worthwhile.
It has been noticed that the majority of the evolutionary selections are actually negative in nature, and the statement is confirmed by our analyses on real data. When ω varies from 0.1 to 1, we selected optimal α to minimize biases and found that γ-MYN is very sensitive when α changes under negative selection. Furthermore, the optimal α becomes smaller when ω becomes larger under negative selection, and the effects of various substitution rates across sites become evident under negative selection, emphasizing the importance of our method in the calculation under negative selection.
As to γ-Tamura-Nei model, it usually leads to higher variations, especially in phylogeny analyses. However, we found some multi-level observations in both simulation and real data testing (see in additional file 1). The difference of variations of ω between YN and γ-MYN (or MYN) seems to be correlated to the value of κ R -κ Y , when results in Tables S2-S6 were examined together. For instance, when κ R -κ Y < 0, ω values vary more when γ-MYN is used than YN is. As κ R -κ Y increases, the ω variation values from γ-MYN decrease, leading to lower numbers than those of YN. The variations of ω calculated with γ-MYN are slight higher than those yielded from MYN in most cases but not all. These results reflect the distinction between the usage of γ-Tamura-Nei model (or Tamura-Nei model) in KaKs computation and that of phylogeny reconstruction.

How the variable substitution rates influence the Ka/Ks calculations?
We begin our discussion with how α parameter in γ-MYN improves MYN method. It is known that ignoring rate variation among sites leads to underestimation of both the sequence distance and the transition/transversion rate ratio κ (both κ R and κ Y ) [4]. κ is used not only in estimating S and N but also in generating a transition probability matrix for estimating S d and N d . If we derive an approximate formula for ω = Ka/Ks ≈ (N d /N)/(S d /S) (the symbol of "≈" is used to emphasize the absence of correction for multiple substitutions), ω is composed of two parts: N d /N and S d /S. For purifying selection, synonymous substitutions occur more frequently than nonsynonymous ones so we should only focus on Ks (S d /S). Since κ decrease is related to the reduction of substitution rate between two codons, underestimation of κ leads to underestimation of S d . In addition, nucleotide transitions between two codons are more likely to be synonymous especially at the third codon positions, underestimation of κ leads to underestimation of S. However, the influence of κ on S d is significantly stronger than that on S. As a consequence, an underestimated κ, when used in MYN, may give rise to underestimation of S d /S, resulting in overestimation of ω as compared with γ-MYN. Our theoretical deductions are consistent with both simulation (Figure 2) and real data ( Table 4).
In addition, we found the optimal α values fall between 1 and 5 ( Figure 3). In these cases, the distribution of gamma values is bell-shaped, meaning that most sites have intermediate rates around 1 whereas a few sites have either very low or very high rates [4]. When selection pressure increases, the number of sites of intermediate rates decreases (Figure 3). In particular, when α approaches infinity, the distribution diminishes into the model of a single rate for all sites, which is used in MYN method. If α ≤ 1, the distribution has a highly skewed L-shape, suggesting that most sites have either very low rates of substitution or are nearly "invariable" with possible substitution hotspots. Furthermore, estimates of α from real data in many species over multiple sequences show increases from 0.26 to 3.0, and this relatively wide window allows us to explore the spectrum of different substitution rates over different sites [4] Applications of our new method Divergence time (t) is another parameter important for the estimation of Ka and Ks. When divergence time reaches the extremes, the compared sequences among genes often vary considerably and their corresponding protein structures may changed over greater evolutionary time scale. Therefore, under such conditions it may become meaningless to calculate the substitution rates of such genes. However, most methods for calculating Ka and Ks use homologous genes for estimating substitution rates among closely-related species or within close lineages, and observable selections are mostly negative. Since our method γ-MYN has better performance than other methods when ω < 1, it provides a useful alternative for more comprehensive Ka and Ks calculations.
Our past work has testified that methods for estimating Ka and Ks should be used cautiously and one should not draw simple conclusions on gene evolution from Ka and Ks analyses based on a single method. Therefore, we recommend a method based on model selection and model averaging [12,42], and γ-MYN has just brought a new choice into such endeavours. Our method does not challenge other methods such as GY (Goldman-Yang) method, a typical maximum likelihood (ML) that has been always considered to be the method of choice [13,33]. It has been suggested in the literature that GY and YN both give rise to similar estimates on Ka and Ks primarily due to the fact that they both take account of major dynamic features of DNA sequence evolution, including transition/transversion rate and nucleotide/codon frequency biases [12,42]. As γ-MYN performs better than other methods under certain conditions, despite the fact that its advantages seemed less obvious under other conditions, we believe that γ-MYN may become a useful tool for large-scale sequence analysis when ML-based methods are deemed time-consuming.

Conclusion
We compared γ-MYN with two other methods, YN and MYN, by examining long sequences, performing computer simulations, and analyzing real datasets. Since neglecting the variation of substitution rates among different sites may lead to biased estimates, our new method has minimal deviations when parameters vary within normal ranges defined by empirical data. γ-MYN performs better when genes are under strong purifying selection and comparable to the other two methods when genes are under positive selection or remain neutral. In addition, we showed that biased estimates of Ka and Ks primarily originate not only from biased estimates of κ-or both κ R and κ Y -but also from the neglect of variable substitution rates.

Mutation model
In Markov-chain models of codon substitution, the codon triplet is considered as the unit of evolution, and a Markov chain is used to describe substitutions from one codon to another codon [4]. In detail, the state space of the chain is the sense codons with regard to the canonical genetic code. Stop codons are not allowed inside a functional protein and are not considered. Although there are several mutation (substitution) models that take different sequence variation features into account, in this report we limit our discussions to the Tamura-Nei Models [24] (see Table S1 in additional file 2 for details).
γ-MYN also needs a transition probability matrix similar to YN and MYN. We assigned the substitution rate q ij from any codon i to j (i ≠ j) to generate a transition probability matrix as follows: The diagonal elements of the transition probability matrix, Q = {q ij }, are determined based on the mathematical requirement that the row sums equal to zero. The matrix is normalized with the result that the sum over non-diagonal terms is 1.
Estimating κ R and κ Y To generate the transition probability matrix, we need to estimate κ R and κ Y . Similar to YN and MYN, we calculated four nucleotide frequencies (g T , g C , g A , g G ), proportions of transitional differences between purines (T R ), and between pyrimidines (T Y ), and the proportion of transversional differences (V) from compared sequences: where g R = g A + g G and g Y = g T + g C . Note that α is the square of the inverse of the variation coefficient in the gamma function.
We then used equation 3 to estimate κ R and κ Y .
The detailed procedures for deducing κ R and κ Y were summarized in additional file 2. We also made other modifications accordingly, such as using κ R and κ Y to estimate S and N, generating relevant transition probability matrix (Equation 1), considering different transitional evolution pathways to count S d and N d , and correcting for multiple substitutions when estimating Ka and Ks (Equation 4; [24]).

The algorithm
When compared to MYN and YN, our γ-MYN method considers that the rate of nucleotide substitution λ approximately follows γ-distribution. In fact, if the rate of nucleotide substitution λ is the same for all sites considered, the model becomes the model that is used in MYN.
γ-MYN uses an iterative approach to estimate Ka and Ks.
Before iteration, γ-MYN computes nucleotide frequencies (regarding to the three codon positions), κ R , κ Y , S, and N from the sequences to be analyzed. Based on the F3 × 4 model [36], codon frequencies are calculated by multiplying each nucleotide frequencies. κ R and κ Y are estimated from four-fold degenerate sites at the third codon position and non-degenerate sites. S and N are calculated by using κ R , κ Y , and codon frequencies. γ-MYN chooses initial values for t and ω as starting point for iteration. It generates a transition probability matrix that represents substitution probabilities from one codon to another by using ω, t, κ, and codon frequencies. This transition probability matrix is subsequently used to deduce S d and N d and for new estimates of ω and t. γ-MYN repeats the calculation for another transition probability matrix, until the algorithm converges.

Comparative analysis on Ka and Ks estimations
We used simulated sequences generated from hypothetical common ancestral sequences for our comparative analysis by randomly choosing codons (61 excluding stop codons) from the ancestral sequences according to codon frequencies that were derived from three empirical datasets: (1) equal codon frequencies [13], (2) human codon frequencies deduced from 39,420 human protein-coding genes from ENSEMBL database (Release 35; [37]) and (3) rice codon frequencies deduced from 19,079 rice proteincoding genes [38] (see in additional file 1).
In addition to codon frequencies, we also have to fix or choose ranges of other parameters for the simulation, including sequence length, divergence time (t), two ratios of transitional rate between purines (κ R ) and between pyrimidines (κ Y ) to transversional rate, and selective pressure ω. Although ω varies from gene to gene, ω = 1 and 3 can be regarded as "typical values" for neutral mutation and positive selection, respectively [3,13,67], which are observable from real datasets. Since most calculated ω values indicate negative selection and variation of parameter α has stronger influence under negative selection, we analyzed the variation of ω in a range of 0.1 to 0.9 for the evaluation of effects of α on ω. To accurately examine the effect of one parameter and to avoid stochastic errors arising from other factors, we generated 2,000 pairs of sequences. Three orthologous gene sets were downloaded from NCBI's HomoloGene database (Build 61), which contained 14,725 human-dog, 16,368 human-mouse, and 15,646 human-chimp gene pairs [72]. We considered "NA" occurrence (in any of Ka, Ks or ω) as unreliable data and filtered the orthologous pairs (extremes in sequence homology) that have such labels, and 14,323 human-dog, 16 [13]. Previous work added parameters to differentiate between purine transitions and pyrimidine transitions [14]. The current work adds a gamma distribution on top of the previously described work.
The stepwise addition of parameters to the Yang and Nielsen approach reflects an attempt to add increasing layers of biological realism. Differentiating between purine and pyrimidine transitions is driven by potential underlying forces like codon bias to the extent that it is correlated across codons in a gene and the chemistry (specificity) of DNA damaging agents, DNA polymerase, and DNA repair enzymes (see [73] for example). The biological link to the gamma distribution is somewhat less clear in the way that it has been applied. Nucleotide sequences as well as amino acid sequences typically show support for a gamma distribution characterizing rates across sites. At the nucleotide level, this is typically related to two components: differences in substitution rate in the different codon positions due to the nature of the genetic code as well as amino acid level constraint on the protein. The former category is already modeled with ω, potentially creating some degree of redundancy between the ω and α parameters. Modeling α at the amino acid level (translated codons) would not suffer from the redundancy and likely accounts for the improvement in performance by γ-MYN.

Authors' response
In our previous work, we did incorporate AIC to KaKs calculations [42]and found that the selected models in the calculation did not depend on combinations of various parameters. We speculate that γ-MYN perhaps may not be the best choice under certain conditions, when the smallest AIC is considered as the criteria. We will incorporate AIC into our new model and the updated KaKs Calculator (the software through model selection and model averaging).
A more minor point is that the authors suggest that approximate methods need to average over all sites or all branches. Based upon earlier work using ancestral sequence reconstruction coupled with counting methods [65][66][67], approximate methods have been developed or can easily be extended from current work based upon primary windowing to detect selective sweeps [68,69] and tertiary windowing to detect structural covariation leading to positive selection [70,71]. This should probably be discussed when discussing the power of these methods.

Authors' response
We expanded our discussion into this issue. The power of detecting positive selection in KaKs methods certainly can be enhanced by introductions of ancestral sequence reconstruction and sliding windows. However, it is still an interesting question as which one is better when compared to the LRT methods.
Further development of models based upon mechanistic molecular and biological underpinnings is always a welcome addition to the literature. A number of problems from multiple sequence alignment to amino acid-based phylogeny to problems in detecting positive selection suf-fer from the divorce of common models from underlying processes. Well-performing mechanistic models will be broadly applicable across bioinformatics.

Authors' response
We fully agree with this comment. In the real world, sequence analysis can be complex and difficult. Therefore, models considering more biological parameters lay foundations for broader applications, especially in the field of molecular evolution (phylogeny tree reconstruction and mechanics of evolution dynamics).

General comments
This manuscript describes a new method to estimate the ratio of Ka/Ks taking into account the evolutionary rate variation. Ka/Ks ratio is commonly used as an indicator of selective pressure acting on protein-coding genes. Current methods mostly use simplified substitution models, which may have effect on the estimation of Ka/Ks. Here, based on their previous work, the authors present a new method that the evolutionary rates across sites are modeled by a gamma distribution. Using both simulated and real data, the authors show that the new method performs better than current methods under some conditions. The novelty of this manuscript is that this is the first Ka/Ks estimation method that considers the rate variation among sites. It is an important contribution to the scientific communities that use Ka/Ks in their research, and likely will open avenues for new researches in this area.
Specific comments I found the overall writing is clear, albeit a little verbose at some places.

Authors' response
We revised the manuscript again for clarity.

Concern of overfitting
Can the authors address the concern of over-fitting by introducing additional parameters?