Pitfalls of the most commonly used models of context dependent substitution

Correction to Lindsay H, Yap VB, Ying H, Huttley GA: Pitfalls of the most commonly used models of context dependent substitution. Biology Direct 2008, 3:52


Background
Sequence neighborhoods have been identified as exerting a strong influence on mutation with the most striking illustration being the elevated mutation rate affecting C within the dinucleotide CpG. This elevated mutation rate putatively arises because C within the CpG dinucleotide is the preferred target for modification by DNA methylases [1] and the resulting modified base, 5-methyl-cytosine (henceforth 5 mC), has an elevated mutation rate [2][3][4]. Analyses of sequence neighborhoods of human mutations have established, for this species at least, that a mutagenic influence of neighboring nucleotides on mutation rates is not restricted to CpG dinucleotides [5]. As different mutagenic and repair processes target different classes of sequence, determining how the substitution rate of a nucleotide is affected by the identity of neighboring nucleotides can assist in identifying the metabolic origins of a substitution. Establishing the relative importance of neighboring nucleotides and the etiology of these effects therefore has important implications for relating specific metabolic processes to existing genetic variation [6,7]. The contributions of factors that affect mutation rates can be identified using analyses of substitution rates. In the absence of natural selection, mutation and substitution rates are equal [8], allowing mutation rates to be inferred from neutral substitution rates. Since variation in substitution rates among sequence residues strongly influences estimation of evolutionary relationships [9][10][11], improved understanding of the origins of these effects, and how to correctly model them, will further benefit the reconstruction of accurate phylogenies.
Estimation of neighborhood effects from comparative genomic sequence data has been performed using socalled context dependent substitution models. A number of approaches have been developed which can be broadly classified according to whether they treat sequences as a series of independent tuples [12][13][14] or not [15][16][17]. We restrict our attention here to the independent tuple model class. All subsequent statements concerning context dependent models refer to the independent tuple case. Early models of context dependent substitution focused on measuring effects arising from RNA secondary structure [18] or from the genetic code in protein coding sequences [19,20]. (While employed for analysis of nonneighboring nucleotides involved in RNA stem structures, the model of Schöninger and von Haeseler [ [18], hereafter SvH] is also a context dependent model.) The codon models constitute a special case of independent tuple models where sequence states that correspond to stop codons are excluded from the state-space, e.g. a codon model for the standard genetic code has 61 states corresponding to the sense codons, 3 states less than the full trinucleotide statespace of 64. While the originally defined codon models of Muse and Gaut [ [19], hereafter MG] and Goldman and Yang [[20], hereafter GY] differed considerably in how they represented the influence of natural selection, with the model of GY in particular seeking to employ aspects of amino-acid chemistry, subsequent refinements [21] have resulted in the MG and GY models being more similar with respect to their exchangeability parameters.
The primary difference between these model forms of interest here is their different weighting of exchanges in the instantaneous rate matrix. In the MG codon model, the rate of nucleotide substitution within a codon context is weighted by the resulting nucleotide frequency. In the SvH and GY models, the rates are proportional to the frequency of the resulting tuple (doublet or codon respectively). We therefore classify models according to whether rates are weighted by nucleotide frequencies (hereafter NF models) or tuple frequencies (hereafter TF models). The effect of these differing definitions is on the expected equilibrium frequencies of tuples, with those under NF being the product of the independent monomer frequencies [22] (for the codon case, these are normalized for omission of the stop codons). The TF model formulation has proved more popular and several studies of context dependent substitution in non-coding sequences have used derivatives of the TF rate matrix form. Of particular interest has been assessment of the contribution of specific processes such as methylation to the overall rate of substitution [13,16,23]. A broader examination of the properties of a series of fully parameterized models both for independent and overlapping-tuple cases has also been done [16], demonstrating a substantial improvement in fit conferred by context dependent models over independent nucleotide models.
The distinct equilibrium frequencies expected under the TF and NF models indicate they differ in their relationship to an independent monomer process. Previous efforts at understanding the distinct properties of these model forms were performed on the more complex case of understanding protein coding sequence evolution and were based on contrasting statistical properties derived from analysis of codon alignments simulated under a TF variant [24], an approach acknowledged by the original authors as biased towards the TF form. The exact nature and significance of the difference between these model classes for general independent tuple models remains to be explored.
We suggest that a context dependent substitution model should specify the null model of context independence with fewer parameters than required to specify alternate hypotheses of context dependence. Here we show that the NF form satisfies this condition, that aside from very special cases the TF form does not and that naïve use of TF leads to incorrect detection of context effects. In light of the latter finding, we consider the consistency of the two model forms when applied to real data. Specifically, we reconsider the importance of nearest neighbor sequence context using reversible, stationary and homogeneous models of dinucleotide substitution.

The relationship of NF and TF to an independent nucleotide process
We present the NF and TF models for dinucleotides but note that essentially the same treatment applies to trinucleotides and larger units. We also note here that different model forms will be referenced using the following notation: NF/TF refers to the component of a substitution rate matrix that stems from the motif probabilities while the subscript (e.g. nuGTR, explained below) refers to the exchangeability terms (r).
Let q denote a general time-reversible (GTR) substitution rate matrix [25,26] on nucleotides, i.e., for i ≠ j, with elements q(i, j) = r(i, j) π (j) for some symmetric matrix r and equilibrium distribution frequencies π. For t ≥ 0, let p t be the transition matrix across a time interval of length t. The element p t (i, j) is the conditional probability that the state at time t is j given that it is i at time 0. We have where I is the identity matrix. Furthermore, q is the derivative of p t at t = 0: View two independent nucleotides undergoing the process defined by q as a unit, so that we have a reversible process on the dinucleotides with transition matrix P t and rate matrix Q. We now derive an expression for Q. By (2), for distinct dinucleotides a = i 1 i 2 and b = j 1 j 2 , There are two cases to consider. (i) a and b differ in both positions. By (1) and (2), (ii) a and b differ in exactly one position. In the case where they differ at the first position, the same calculation gives The calculation for the second position case is analogous. Thus, Now we generalise by defining a dinucleotide rate matrix as follows, where the 16 × 16 r NF is symmetric. This new process is an NF (nucleotide frequency weighted) model. It will be shown in the next paragraph that its equilibrium frequencies are homogeneous multiplicative, i.e., π (a) = π (i 1 ) π (i 2 ), and that it is reversible. (As an aside, we note that it is possible to generalize the definition of NF to allow for position-specific differences in π.) Within NF, the nucleotide GTR process is an appropriate null model for investigating context effects. The joint contribution of various specific contexts can be easily specified and estimated from data by exploiting a whole spectrum of intermediate models. For this reason, the baseline nucleotide GTR process is called NF nuGTR .
The more widely used TF models are specified in a slightly, but crucially, different way from the above. In a TF (tuple frequency weighted) dinucleotide rate matrix, if a ≠ b, then where r TF is symmetric and π are the equilibrium frequencies. As π is not necessarily multiplicative, then the TF form is the most general reversible rate matrix for dinucleotides in the class of models where each substitution event involves only one nucleotide. The TF models with multiplicative π are precisely the NF models. To see this, first note that every NF model can be written in TF form by letting π be multiplicative. Then, for a and b differing in exactly one position, Conversely, if in a TF model π is multiplicative, then using the above relationship gives the same model in NF form. As a consequence, NF models have multiplicative equilibrium frequencies and are reversible. As for NF, the TF model where the r TF terms depend on the corresponding nucleotide substitution and nothing else is called TF nuGTR . When the r TF terms are symmetric and otherwise unconstrained, the model is called TF diGTR .
The fact that TF diGTR is a general reversible process makes it intuitive to expect it to be the "right" generalization of the nucleotide GTR process, i.e., that TF nuGTR with multiplicative equilibrium frequencies is the nucleotide GTR process, but this is false unless the nucleotides are equally frequent. For example, q TF (AA, AC) = q TF (CA, CC) implies r TF (AA, AC) π (A) = r TF (CA, CC) π (C), so that the two r TF terms are equal only if π (A) = π (C). This property predisposes TF to misinterpretations. Suppose that data are generated from the nucleotide GTR process, or equivalently, NF nuGTR , with unequal nucleotide frequencies. The maximized log likelihood under TF nuGTR will be appreciably less than that under NF nuGTR , which will result in a large gain in maximized log likelihood when TF nuGTR is compared with TF diGTR (which contains the nucleotide GTR process), leading to a spurious detection of context effects; but the NF models will behave as expected. We demonstrated this by simulating 100 50 kbp long alignments under NF nuGTR , using parameters estimated from the genomic alignment of orthologous intron sequences from human, chimpanzee and macaque for ENSG00000003147. The likelihood ratio statistic LR between NF nuGTR and NF diGTR , i.e., twice the difference in the maximized log likelihoods, has an approximate χ 2 distribution with 42 degrees of freedom, as predicted by theory; in particular, their average and standard deviation (SD) were 42.2 and 9.2. However, between TF nuGTR and TF diGTR the LR was typically very large, averaging to 161.0 with an SD of 25.1.
The TF form can also be unsatisfactory for detecting real context effects. Suppose that data are generated from TF nuGTR with non-multiplicative π, which is not a nucleotide GTR model. Then the LR between TF nuGTR and TF diGTR can be so small as to obscure the context effects. Somewhat unexpectedly, comparing NF nuGTR to NF diGTR can still succeed. This is illustrated by a similar simulation as in the previous paragraph, except that the data come from TF nuGTR . The TF diGTR LR averaged to 43.3 with an SD of 10.1, like a χ 2 distribution with 42 degrees of freedom, but for NF diGTR the average and SD were 313.9 and 30.4.

Parameters estimated under a TF form can be misleading
We now show more clearly how TF can be misleading. Let π be a probability distribution on states {R, Y}. Consider an analogue of NF on {RR, RY, YR, YY} with rate matrix The diagonal entries are omitted since they are determined by the off-diagonal entries. κ measures the effect of a neighboring R on substitution rates, with κ = 1 corresponding to no context effect, analogous to a nucleotide process. Let a long pairwise alignment be generated by this process. Since it is nested in TF, fitting TF to the data will give Q NF (κ) up to a constant multiple, but in this form where If κ = 1, then the MLE of κ will be close to 1 by the consistency property. However, the MLE of will be close to π (R)/π (Y), an erroneous detection of a context effect if π (R) ≠ π (Y). More generally, and κ will be consistent if π (R) = π (Y), but will be larger or smaller according to whether π (R) > π (Y) or π (R) <π (Y).
We illustrate the impact of this confounding of exchangeability parameters in TF with nucleotide composition by estimation of a context parameter from simulated data. We simulated 1000 alignments under NF nuGTR using the same parameter estimates as for the previous simulation (see Methods for details). For each alignment we fit variants of the NF nuGTR and TF nuGTR models that had been extended to include a single additional rate term for CpG substitutions (CG ⇔ NN). Frequency histograms of the resulting maximum-likelihood estimates (MLEs) of the context parameter ( ) obtained under each model form are presented in Figure 1. As the alignments were simulated without any sequence context effects on CpG substitution, the null hypothesis of context independence is true and the expected value of CG ⇔ NN is therefore 1. The distribution obtained against the NF nuGTR baseline spans this expected value, whilst the distribution of estimated under TF nuGTR is NN ⇔ strongly right-shifted. This result is consistent with the proofs above showing the relationship of NF model forms to an independent nucleotide process, and the confounding (relative to this process) of exchangeability parameters under TF.

The fit of TF and NF to primate intron alignments differs substantially
The analytical results indicate that the TF and NF forms will exhibit distinct model likelihoods and parameter MLEs -the critical statistics used for drawing inference on the importance of parameters and the nature of their effects. We contrasted the practical impact of the differences in the two model forms by analysis of aligned orthologous intron sequences from the human, chimpanzee and macaque primate lineages. Intronic sequences were used due to increased confidence in their orthology resulting from comparisons of their exons and the relatively low fraction of sequence likely to be subjected to the scrutiny of natural selection [27]. Introns were sampled from all human autosomes so as to capture the genomic diversity of mutation processes. Sequence regions likely to evolve by a non-point mutation process, including lowcomplexity sequence and indels, were excluded in a manner that preserved the integrity of naturally occurring dinucleotides. The intron sequence alignments from the resulting sampled genes were broken into exactly 50 kbp long blocks to facilitate comparisons of parameters estimated from different alignments. Each alignment block was derived from a single gene but multiple alignment blocks may derive from the same gene. There were a total of 470 such alignments. (The full sampling protocol is presented in Methods.) We contrasted comparable variants of the TF and NF forms for analysis of the real biological sequences by comparison to the corresponding nuGTR baseline parameterization. Given their relationships to an independent nucleotide process, the NF nuGTR baseline is a nucleotide GTR model while the TF nuGTR baseline is not. The only difference between the models being their differing weighting of elements of the instantaneous rate matrix by monomer and tuple frequencies respectively. The richest rate matrix parameterization considered was the diGTR, a fully general time reversible dinucleotide model. The diGTR model includes a parameter for each of the 48 instantaneous dinucleotide exchanges (conventionally one is omitted to calibrate the model, resulting in 47 free parameters). Because of considerable variation in both composition and substitution process across the genome of primates, we fit each model to each alignment independently and determined the support for the model across the entire data set as a cumulative ln L obtained as the sum of corresponding ln L from all alignments. See Methods for the complete model definitions, model implementation in software and procedure used for parameter estimation.
The magnitude of improvement over the nucleotide GTR model differed substantially between the TF and NF forms. Even the TF nuGTR model was 'better' than the fully parameterized NF diGTR model ( Figure 2), despite the fact that for some alignments TF nuGTR was worse than the nucleotide GTR (results not shown). This property originates from the intrinsically context dependent nature of TF. The likelihood for the NF diGTR model still confers, however, a massive improvement over an independent nucleotide process, supporting the importance of a sequence neighborhood effect.

Parameter estimates differ substantially between NF and TF forms
The MLE of a model parameter has particular significance since its relative position to the value 1 influences the interpretation concerning enhanced or suppressive influence of a context on substitution rate. We first examined the consistency in improvement over the nuGTR likelihood conferred by individual dinucleotide parameters. Each of the 48 distinct reversible dinucleotide exchanges was added to nuGTR for the TF and NF forms (see Methods). For example, we measured the influence of CpG to TpG substitutions, represented by the parameter CG ⇔ TG, in models that also included the nucleotide GTR exchangeability parameters, denoting the resulting models NF nuGTR+CG⇔TG and TF nuGTR+CG⇔TG . The improvement Context effects are detected using TF when none exist Figure 1 Context effects are detected using TF when none exist. Sequences were simulated under a nucleotide GTR model with π (A) + π (T) ≈ 0.6. A single dinucleotide parameter for CG ⇔ NN was added to the TF nuGTR and NF nuGTR dinucleotide baseline models. The red vertical line represents the expected value under the null hypothesis (the parameter has no effect). x-axis is the MLE of CG ⇔ NN, y-axis is the number of simulated alignments.
conferred by the CG ⇔ TG parameter was determined by contrasting the likelihood against that from the corresponding nuGTR model: NF nuGTR vs NF nuGTR+CG⇔TG ; and, TF nuGTR vs TF nuGTR+CG⇔TG . As these comparisons are between nested models, the differences were measured using the standard likelihood ratio test statistic LRT. To facilitate comparisons between the two model forms, we further expressed the relative improvement conferred by each individual term as the proportion of that obtained for the diGTR. For example, the relative improvement over a nuGTR model conferred by the CG ⇔ TG term is expressed as LR(CG ⇔ TG)/LR(diGTR), where the ratio is between models belonging to the same model form.
The contribution of each dinucleotide pair in improving model fit over nuGTR was assessed under both NF and TF forms. Dinucleotide parameters are listed in Table 1 by order of influence in an NF model. Whilst the seven most influential parameters were consistent between the NF and TF model forms, the order of importance was not preserved. The relative importance of dinucleotide parameters was reasonably consistent between the two forms, with CpG transition substitutions accounting for ~50% of the diGTR improvement over nuGTR. The degree of concordance between the mean MLEs obtained for the 48 parameters was considerable, but with striking outliers corresponding to the CpG terms ( Figure 3). Of particular note is the observation that MLEs for CpG transversions were strongly discordant between the NF and TF forms. Under TF, CpG transversion MLEs are predominantly > 1, indicative of a greater rate of substitution than the back-ground substitution rate ( Figure 4). In contrast, under NF, the distribution of MLEs spans 1 ( Figure 4) and the relatively small mean relative improvement percentage (≤ 0.5) along with a mean (per alignment) LR < 2 (Table 1) suggests CpG transversion rates are in fact not elevated at all, contradicting previous reports [13,16].
More generally, the results from the NF model indicate that (after normalization) the 10 top-ranked parameters account for ~85% of the improvement conferred by the diGTR models 48 free parameters. Each parameter in the top 10 list was a transition substitution whose strandcomplement was also in the top 10 list. Indicating that substitution processes were strand-asymmetric, the strand-complementary parameters were not always immediately adjacent in rank, e.g. AT ⇔ GT was 3rd, while AT ⇔ AC was 10th (Table 1, Figure 5).

Discussion
The different rate matrix weighting of the TF and NF model forms have considerable influence on their statistical properties. Relative to an independent nucleotide process, when nucleotide frequencies are asymmetric, all exchangeability parameters under TF are confounded with sequence composition parameters. Our analysis of real biological data indicate that this difference is affecting inference. Although the relative importance of exchangeability parameters determined under either model form were related, striking discrepancies between the models were evident for important processes, indicating the parameters estimated under TF can be strongly misleading.
A key property of TF, that the stationary motif distribution readily becomes the observed distribution, is more difficult to achieve with NF. In real biological sequences, the frequencies of dinucleotide and higher order tuples are not products of their monomer frequencies. Accounting for this feature is a natural motivator in the TF model design where, by definition, the evolutionary process arrives at the observed distribution of sequence states. In contrast, specifying an NF model where the tuple frequencies are not just the product of monomer frequencies is more complex. One way it could be achieved is by specifying a non-reversible process, although the suitability of such a model for the commonly employed eigendecomposition matrix exponentiation algorithm remains to be established. Across a range of genomes, codon frequencies are reasonably approximated by position-specific nucleotide frequencies [28]. This suggests using position-specific nucleotide frequencies as a reasonable alternate approach, at least for protein coding sequences.
That the likelihoods achieved under TF are enormously improved compared to NF ( Figure 2) provides a cautionary note against focussing solely on the likelihood as the Contrasting improvement of the dinucleotide GTR model over a nucleotide GTR process Figure 2 Contrasting improvement of the dinucleotide GTR model over a nucleotide GTR process. nuGTR -for NF this is the nucleotide GTR model, for TF it is a comparable parameterization; diGTR -fully general time reversible dinucleotide model, with 48 parameters in the rate matrix; Δ ln L -is the difference in the cumulative ln L from the alternate model compared to the cumulative ln L for the nuGTR model.

LR % Rank LR % Rank
LR % Rank basis for contrasting models. In the current case, a naive interpretation of the different likelihood magnitudes would lead to the conclusion that the TF forms were superior. The same conclusions are reached even when an information theoretic transformation, such as AIC, that takes account of the larger number of free parameters in TF is used (result not shown). That this apparent superiority over a simple model does not hold universally (e.g. ln L from TF nuGTR can be worse than that from nucleotide GTR) illustrates the challenge in using TF. It also highlights the necessity that consideration be given to whether the parameters estimated under a model make sense.
Using the latter benchmark, however, requires knowing a priori the expected values of the parameters in the model.
One benchmark for establishing the suitability of a model form is how the desired null hypothesis is nested within it. We suggest that as context dependent models are intended to measure departures from independence, they should conveniently include an independent monomer process. As we have shown, for the TF form this is far from convenient. Comparable rate parameters between the TF and NF forms are only identical when the tuple frequencies are identical (see equation 3). For homogenous multiplicative π [π (a) = π (i 1 ) π (i 2 )], at least 4 more exchangeability parameters are required for the TF variant to include the nucleotide GTR. From this it follows that simpler TF models, such as TF nuGTR , do not always contain the independent process but instead can be nested within the TF variant that does contain the independent process, substantially complicating inference of LRTs and the interpretability of exchangeability parameter MLEs. Contrast this with the NF form where NF nuGTR is the nucleotide GTR, NF diGTR is always measuring departures from independence and the context rate parameters (e.g. CG ⇔ TG) in all NF models have, under the null of no effect, the expected value of 1 regardless of sequence composition. On this basis we suggest the NF form as a superior framework for the examination of context dependent effects.
Our analyses of real, substantively neutrally evolving biological sequences suggests that the use of nested models for measuring support ensures a modest consistency between the models forms. The ranking of likelihood ratios for comparable models were largely consistent between the two model forms (Table 1). However, the MLEs of some equivalently specified exchangeability parameters are worryingly inconsistent. This discordance was most apparent for CpG transversions (Figures 3 and  4), whose substitution rate was consistently greater than 1 for the TF model form, an observation previously reported from application of TF based models [13,16] which has been interpreted as indicating CpG transversions occur at a rate greater than background. This effect indicates that under TF, CpG transversion parameters are compensating for the deficit in CpG dinucleotides relative to their expected frequency. In contrast, under NF the MLE distributions ( Figure 4) and relative importance (≤ 0.5%, Table  1) of CpG transversions indicate they are occurring in a context-independent manner. This fits with current biochemical knowledge which indicates deamination of 5 mC converts a 5 mC·G base pair into a T-G mismatch that, contingent on a repair failure, results in a C⇒T transition mutation.
The analysis of context dependent effects supports a strong influence of sequence neighborhood on mutation, evidenced by the significance of NF diGTR (Figure 2). Our assessment of the contributions to this fit from individual neighborhood under NF distinguishes nucleotide from neighborhood influences. Even after removing the hypermutable CpG effects, the NF model indicates that sequence neighborhoods are important. When we normalized the relative importance statistic (% column, Table 1) so they summed to 100, the top 10 ranked parameters account for ~85% of the fit achieved by the NF diGTR model, with CpG transitions alone accounting for 46%. The etiology for these neighborhood influences is known for CpG transitions [2]. Six of the remaining eight parameters involve transition substitutions affecting the dipyrimidines CpC, CpT, TpC and TpT (and their strand complements). A candidate for the dipyrimidine effects is the dedicated repair system for repair of lesions arising at dipyrimidines [29,30] consistent with the slower than Relationship between dinucleotide MLEs obtained under TF and NF from primate data Figure 3 Relationship between dinucleotide MLEs obtained under TF and NF from primate data. Plotted are the mean MLEs for rate parameters obtained under NF nuGTR+Parameter (x-axis) and TF nuGTR+Parameter (y-axis) forms. Means and their standard errors were estimated using the jackknife procedure. Standard errors are not shown as they were smaller than the plotted marker sizes. Coordinates corresponding to the CpG related rate parameters are labelled.
CpG substitution rate MLEs were discordant between TF and NF Figure 4 CpG substitution rate MLEs were discordant between TF and NF. The MLEs were those used to calculate the means in Figure 3. The vertical red line corresponds to the expected value under the null hypothesis (the parameter has no effect).
background rates of substitution (i.e. MLEs < 1, Figure 5) at these contexts. The strand asymmetry in support and MLEs of context effects is consistent with previous reports [31] and likely reflects the influence of transcription coupled DNA repair (TCR) processes, which selectively repair the transcribed strand [32]. Thus, although a specific can-didate process for the remaining neighborhood influence (AT ⇔ GT/AT ⇔ AC) is not clear, the strong strand asymmetry of the MLEs indicate it is strongly affected by TCR and thus a target of either the base excision or nucleotide excision repair systems. x-axis -MLEs, these were the same as those used to calculate the means in Figure 3; y-axis -the number of alignments. The vertical red line corresponds to the expected value under the null hypothesis (the parameter has no effect). Strand complementary substitutions are plotted on the same chart.

Distribution of MLEs for non-CpG top 10 ranked dinucleotide parameters under NF
The substantial number of published independent tuple models which adopt the TF equilibrium distribution will display similar properties to those demonstrated here. Noteworthy, but by no means the only, examples of software implementing TF based models include the codon models of PAML [33], the doublet and codon models of Mr Bayes [14] and the dinucleotide and codon implementations in PyEvolve [23] and PyCogent versions up to 1.2 [34]. Software implementing the NF model forms include HyPhy [35] and PyCogent versions after 1.2.

Conclusion
We have shown that models with NF form measure the effect of sequence neighborhood as departure from an independent nucleotide process, and that estimates of parameters within this model are robust to changes in sequence composition. In contrast, measurement of context dependent substitution influences with parameters from models with the TF form are confounded with sequence composition. We suggest, therefore, that results from models with the latter form be re-evaluated with a comparable NF model. Our application of NF dinucleotide models to primate introns confirms that sequence neighborhoods exert a strong influence on the rates of substitution and that transitions affecting CpG and dipyrimidine dinucleotides account for ~75% of the effect of immediate neighbors measured by the diGTR model.

Data
Aligned introns from the human, chimpanzee and macaque genomes were obtained from Ensembl release 49 [36]. Using coordinates of human genes we sampled a maximum of 100 alignments from each of 5 blocks of human autosomes (1-3, 4-6, 7-10, 11-15, and 16-22) to ensure representation of substitution processes across the primate genomes. As the substitution models represent the dynamics of point mutations only, any sequence that may have evolved by a non-point mutation process was replaced by an equivalent length run of '?'. Given their susceptibility to mutation by slipped strand mispairing, we specifically masked simple tandem-repeat sequences, including di-to tetra-nucleotide repeats with 5 or more repeat units. Mono-nucleotide repeats ≥ 10 nucleotides long were also masked. Sequence masking was done on the raw sequences (without gaps) while the locations of gaps were retained. Gaps were also masked as they arise by a non-point mutation process. The resulting masked alignments were then split into columns of non-overlapping dinucleotides which were filtered such that any column containing a dinucleotide made up of one or more of the characters 'N,?,-' was eliminated and the remaining columns were then joined to form the filtered alignments. Only alignments ≥ 50 kbp long were retained and these alignments were sliced into exactly 50 kbp long aligned blocks with any remaining sequence discarded. There were 470 such alignments.

Model definitions
The following conditions were applied to all model definitions: all exchangeability matrices (r) were symmetric such that r(a, b) = r(b, a); diagonal elements of instantaneous rate matrices are determined by the constraint that the row sums are 0; and, the instantaneous rate matrices were scaled such that their trace = -1.
The nucleotide substitution models employed constitute variants of the rate matrix defined in the Results. Consider the nucleotide substitution rate matrix q, which has elements q(i, j) = r(i, j) π (j), where i, j ∈ {A, C, G, T}. When we have the F81 substitution model [37], otherwise when r(i, j) ≠ 1, we have the general time reversible nucleotide substitution model [25,26]. For dinucleotides a, b made up of nucleotides i 1 i 2 and j 1 j 2 respectively, the NF nuGTR model rate matrix Q nuGTR has elements Q nuGTR (a, b) = r(i 1 , j 1 ) π (j 1 ) when dinucleotides a, b differ at the first position, Q nuGTR (a, b) = r(i 2 , j 2 ) π (j 2 ) when dinucleotides a, b differ at the second position, and Q nuGTR (a, b) = 0 when dinucleotides a, b differ at both positions. We illustrate the nuGTR + Param substitution models for the case of CpG/TpG exchanges. Under a nuGTR + CG ⇔ TG model, Q nuGTR+CG⇔TG (a, b) = r(CG, TG)Q nuGTR (a, b) when {a, b} is {CG, TG} and Q nuGTR+CG⇔TG (a, b) = Q nuGTR (a, b) otherwise. The NF diGTR model rate matrix Q diGTR has elements Q diGTR (a, b) = r diGTR (a, b) π (j 1 ) when dinucleotides a, b differ at the first position and Q diGTR (a, b) = r diGTR (a, b) π (j 2 ) when they differ at the second position. The TF form dinucleotide models were defined similarly to the comparable NF ones, but replacing the π (j 1/2 ) with π (b).

Validation of Model Implementation
The accuracy of all software implementations of the different substitution model forms were subject to stringent validation. Three of the authors (HL, VBY and GAH) reimplemented both the TF and NF model classes substantively independently from the core PyCogent classes; only PyCogent's matrix exponentiation routines were common to the implementations. The correctness of this code was established by comparison against hand-calculated examples for species triplets. We relied on the theoretical relation between the independent nucleotide process and NF in the development of these calculations. In particular, when the null hypothesis of independence between nucleotides is true, that the matrix of dinucleotide substitution probabilities for time t is equal to the Kronecker product of the nested nucleotide substitution matrix for time t, i.e.. P dinuc = P nuc ^ P nuc . We also relied on the fact that a TF model is equal to a NF model when the motif states are equi-frequent, and different otherwise. The results from a modest number of examples were also compared to those computed using R [38]. We then modified the existing PyCogent substitution model classes to allow specification of instantaneous rate matrices with the NF model form -where the π elements included in the rate matrix correspond to the probability of the ending nucleotide. The different implementations and indicated tests were used to validate the accuracy of the implementation in the modified PyCogent library.

Simulations
We used a genomic alignment, processed as described above, of introns from the arbitrarily selected human gene ENSG00000003147 to estimate branch lengths on the unrooted tree relating human, chimpanzee and macaque. For Figure 1 a nucleotide GTR substitution model [25,26] was Simulations under the TF form were also done using parameters obtained from fitting ENSG00000003147. Specifically, the TF nuGTR model was fitted to the alignment and the resulting parameter MLEs used to simulate 100, 50 kbp long alignments. An important difference to the simulations was that dinucleotide (not nucleotide) frequencies were estimated from the alignment and used for simulation. All simulations were performed using PyCogent [34].

Statistical testing
The substitution model classes of PyCogent [34] were modified to allow specification of these NF model form and this version of PyCogent is included as Additional data file 1. Maximum likelihood estimates of parameters were obtained for each alignment separately by numerical optimization using the built-in Powell optimization routine [39]. A tolerance of 10 -6 was applied with a maximum of 5 restarts. All LRT tests were executed using PyCogent.

Availability of algorithms and data
The modifications to PyCogent implementing the NF model form will be available (on acceptance) as part of the standard open source PyCogent distribution [34]. They will appear in PyCogent versions greater than 1.2. (A modification of PyCogent version 1.1 with these new capabilities is included as Additional data file 1.) Details of the scripts used to simulate alignments and for model fitting are included in Additional data file 2. The simulated alignments and the masked and filtered primate intron alignments are available on request from the authors.

Reviewer's report 1 Rob Knight, University of Colorado, Boulder
Models of molecular evolution are critical for providing the background against which adaptive evolution can be detected, and for understanding how mutational and selectional processes differ in different genomes (or parts of the same genome). In this manuscript, Lindsay et al. compare two widely used methods of incorporating context effects (i.e. the effects of neighboring nucleotides) in the standard Markovian model of molecular evolution: weighting by nucleotide frequencies (NF) and by tuple frequencies (TF). TF is very widely used in software such as the popular PAML and MrBayes phylogenetics packages. They are able to show both analytically and through simulations that the TF form introduces subtle biases due to sequence composition that could lead to incorrect biological conclusions: for example, the effect of methylation at CpG islands and subsequent deamination of the C to T is thought to be a key mutational pressure in mammals, but the TF form "identifies" this effect even under simulation conditions where it cannot exist. The study thus has important implications for other studies of molecular evolution (in particular, recommendation of NF rather than TF models), and may result in widespread evaluations of the degree of adaptation and the key evolutionary parameters in genes and genomes throughout the tree of life.
One interesting feature of this work is that the key parameters (as measured by order of importance) important for nucleotide substitution in mammalian introns differ substantially between the NF and TF forms on the same data. For example, the AT ↔ GT rate is third in importance with the NF model but seventh in importance with the TF model; the CG ↔ AG rate is tenth in importance with the TF model but 26th in importance with NF. These sorts of differences would lead to very different understandings of the relative contributions of different kinds of mutations and inferences about the molecular mechanisms involved. One addition that would make this table easier to interpret would be the addition of errors on the % importance and rank importance obtained by bootstrapping the data set if this would take an acceptable amount of computation time -for example, how large a change in rank is meaningful?
Author's response: We thank the reviewer for this suggestion and have modified the Table to include estimates of the variability in both relative importance (%) and rank. These were obtained using a jackknife resampling procedure. The associated text in the manuscript has also been revised.
Two other additions that would be useful, but perhaps best left for future work, would be (i) more detail about the kinds of molecular mechanisms (deamination, G oxidation, repair pathways, etc.) that are most likely to be prone to over-or under-representation using the TF model, and (ii) extension of the work beyond mammals. Mutational processes are of intense medical interest in studies of retroviral adaptation, for example. I also note that my 2001 Genome Biology paper (PMID 11305938) indicates that the product of position-specific nucleotide frequencies does in fact recapture codon frequencies extremely well across a wide range of genomes, so the authors might want to use that argument to bolster the applicability of their modeling assumptions. Overall, I believe that this paper is an important contribution and is likely to lead to much wider development and application of NF models in future.

Reviewer's report 2 Joshua Cherry, NCBI (nominated by David Lipman)
This article makes an important point: some models of the TF form, including some widely used models, are inadequate for many purposes and can yield biased results. However, it unfairly and incorrectly impugns TF in several places. It seems to claim that TF, unlike NF, cannot represent independence of sites in the face of unequal base frequencies, but this is only true of special cases of TF.
In fact the most general TF model contains every NF case as a submodel, so TF can model any case that NF can model (and more). Furthermore, a simplified example is incorrectly alleged to demonstrate bias of estimates based on TF. Also, NF has its own weaknesses, and the best model for a particular purpose might conform to neither NF nor TF. The sweeping conclusions that "models with the TF form are systematically biased by sequence composition" (Conclusions) and "the NF form should be used" (Abstract) are unjustified.

Author's response:
We agree with the reviewer that the TF form can also represent independence of sites when base frequencies are unequal and have now revised substantial sections of the results and discussion to better explain the relationship between these two model forms. We also agree that the NF form has its own weaknesses, and detail these in the discussion. We have also toned down statements concerning systematic bias. We dispute, however, that we are not justified in claiming the NF form should be used. We elaborate on each of these issues in our response to the reviewer's detailed comments below.
Paragraph 3 of Results seems to argue that when base frequencies are unequal, TF cannot represent cases where changes are independent of sequence context. This is not true of all TF models. The argument that π (T) π (G) is in general different from π (C) π (G) ignores the r TF terms, which may be such that the rates are equal. This null hypothesis can be tested against the alternative hypothesis TF diGTR . This procedure will detect context effects when they are present but not when they are absent, subject to the usual statistical uncertainties.
It is unimportant whether the null hypothesis is expressed in TF form: the NF form or the single-nucleotide form will yield the same likelihood. The important point is that TF diGTR is a legitimate and possibly very useful non-null model. Indeed, as I noted above, it appears to be quite useful for modelling the intron sequence data analyzed in the article. Any extra effort required to interpret the parameters is the price that one pays for using a richer model that is capable of representing biological reality. Sometimes a less general model will be preferable, but the same can be said of any general model (e.g., single-nucleotide GTR). I see no valid argument for banishing this useful tool from our arsenal of models.
Author's response:Regarding the reviewer's statement concerning our specification of an "inappropriate TF model", we agree that is inappropriate but point out that Cherry's first review continues: Paragraph 2 of the Discussion states that "...specifying an NF model where the tuple frequencies are not just the product of monomer frequencies is more complex. One way it could be achieved is by specifying a non-reversible process...." I would expand on this. I believe the following to be true, but not necessarily obvious: 1. If r NF (a, b) is symmetric, the equilibrium tuple frequencies equal the products of the nucleotide equilibrium frequencies, but 2. Symmetry of r NF (a, b) is not necessary (though it is sufficient) for time reversibility.
I would also note that NF, unlike TF, cannot represent different equilibrium frequencies at different tuple positions. This would seem to be a serious deficit for codon-based models. One can imagine an extension to NF in which there are distinct frequency parameters for each tuple position. I am unsure of the relationship of this extension to TF. All that prevents any NF model from qualifying as a TF model is the restriction that r TF does not depend on π dinuc .
Perhaps people tend to create this sort of model, which would make it worth analyzing. However, there is no reason for models to have this restriction. Even in single-base models, the analogous restriction is not always observed: in the F84 (Felsenstein, 1984) model, for example, some of the r values depend on π values. Perhaps the best models for some purposes are TF-like models that violate this restriction.
Author's response:We concede that there may be some purposes for which the TF-like models may be better suited. Our focus has been on establishing how to measure departures from independence, and in particular the identification of primary influences in departure from independence in the evolution of DNA sequences. A motivator for this is the intra-genomic diversity in composition evident for many organisms, including the primate species used for the intron analysis. In such a case, using a model in which the critical parameters are independent of composition seems the best choice. Under the NF form, parameters estimated from genomic regions that differ in composition can be readily compared. As pointed out by the reviewer, similarly defined parameters estimated under the TF form from the same regions can also be compared. However, the failure to appreciate the influence of composition on the latter estimates can strongly mislead their interpretation.

Reviewer's report 3 Stephen Altschul, NCBI (nominated by David Lipman)
This paper compares two different, widely used models used for studying the evolution of nucleic acids. A specific question is whether specific nucleotide substitutions depend upon the context in which they occur, and in what way. This is addressed by modeling the evolution of nonoverlapping nucleotide "tuples", and comparing the results to a model of independent nucleotide evolution.
A structure imposed on such models assumes an instantaneous a→b tuple mutation rate expressible as a symmetric matrix on a and b, multiplied by a factor dependent upon the equilibrium frequency distribution of either the resulting tuple (b) or the resulting nucleotide in the mutation that occurs. The nucleotide-frequency normalized model is called NF, the tuple-frequency normalized model TF.
A central point of the paper is that for NF, a null model is exactly the same as a model of independent nucleotide evolution, whereas this is true for TF only when the equilibrium distributions of the nucleotides are identical. When there are non-equal nucleotide frequencies, TF can be seen to imply context-dependent factors in nucleic acid evolution when there are none, and sometimes to miss detecting such factors when they exist. The NF model does not have equivalent difficulties.
Author's response:We note here that the last point relates to an earlier version of the manuscript.
Applied to real, non-protein-coding sequences, the NF and TF models both recognize very significant contextdependent effects on nucleic acid evolution, and agree on the most important dinucleotide pairs implicated. However, the models can diverge substantially concerning more subtle issues. Specifically, the NF model does not recognize a significant context-dependent influence on transversions involving CpG dinucleotides, whereas the TF model does.
In conclusion, this paper presents a provocative critique of the widely used TF model, and of some of the conclusions that have been derived therefrom. Its arguments will need to be taken into account in further modeling of nucleic acid evolution.