### Reviewer's report 1

*Rob Knight, University of Colorado, Boulder*

Models of molecular evolution are critical for providing the background against which adaptive evolution can be detected, and for understanding how mutational and selectional processes differ in different genomes (or parts of the same genome). In this manuscript, Lindsay et al. compare two widely used methods of incorporating context effects (i.e. the effects of neighboring nucleotides) in the standard Markovian model of molecular evolution: weighting by nucleotide frequencies (NF) and by tuple frequencies (TF). TF is very widely used in software such as the popular PAML and MrBayes phylogenetics packages. They are able to show both analytically and through simulations that the TF form introduces subtle biases due to sequence composition that could lead to incorrect biological conclusions: for example, the effect of methylation at CpG islands and subsequent deamination of the C to T is thought to be a key mutational pressure in mammals, but the TF form "identifies" this effect even under simulation conditions where it cannot exist. The study thus has important implications for other studies of molecular evolution (in particular, recommendation of NF rather than TF models), and may result in widespread evaluations of the degree of adaptation and the key evolutionary parameters in genes and genomes throughout the tree of life.

One interesting feature of this work is that the key parameters (as measured by order of importance) important for nucleotide substitution in mammalian introns differ substantially between the NF and TF forms on the same data. For example, the AT ↔ GT rate is third in importance with the NF model but seventh in importance with the TF model; the CG ↔ AG rate is tenth in importance with the TF model but 26th in importance with NF. These sorts of differences would lead to very different understandings of the relative contributions of different kinds of mutations and inferences about the molecular mechanisms involved. One addition that would make this table easier to interpret would be the addition of errors on the % importance and rank importance obtained by bootstrapping the data set if this would take an acceptable amount of computation time – for example, how large a change in rank is meaningful?

*Author's response*: *We thank the reviewer for this suggestion and have modified the Table to include estimates of the variability in both relative importance (%) and rank. These were obtained using a jackknife resampling procedure. The associated text in the manuscript has also been revised*.

Two other additions that would be useful, but perhaps best left for future work, would be (i) more detail about the kinds of molecular mechanisms (deamination, G oxidation, repair pathways, etc.) that are most likely to be prone to over- or under-representation using the TF model, and (ii) extension of the work beyond mammals. Mutational processes are of intense medical interest in studies of retroviral adaptation, for example. I also note that my 2001 Genome Biology paper (PMID 11305938) indicates that the product of position-specific nucleotide frequencies does in fact recapture codon frequencies extremely well across a wide range of genomes, so the authors might want to use that argument to bolster the applicability of their modeling assumptions.

*Author's response*: *We can address the first point only in those cases where a strong candidate mechanism has been identified for a context. The dominant influence of deamination of 5-methyl-cytosine on genomic composition (depletion of CpG) appears responsible for the high ranking of CpG transversions under the TF form. Aside from this obvious example of cytosine deamination, the association between the context effects and their causative mutagenic processes is poor, limiting our ability to comment further*.

*We thank the reviewer for bringing to our attention this finding from his Genome Biology paper. Codon models are certainly the most popular application of context dependent models and the result concerning the consistency of observed codon frequencies with those calculated from position-specific nucleotide frequencies provides strong guidance for model design. We now include this reference in our discussion of the different properties of the NF/TF models*.

Overall, I believe that this paper is an important contribution and is likely to lead to much wider development and application of NF models in future.

### Reviewer's report 2

*Joshua Cherry, NCBI (nominated by David Lipman)*

This article makes an important point: some models of the TF form, including some widely used models, are inadequate for many purposes and can yield biased results. However, it unfairly and incorrectly impugns TF in several places. It seems to claim that TF, unlike NF, cannot represent independence of sites in the face of unequal base frequencies, but this is only true of special cases of TF. In fact the most general TF model contains every NF case as a submodel, so TF can model any case that NF can model (and more). Furthermore, a simplified example is incorrectly alleged to demonstrate bias of estimates based on TF. Also, NF has its own weaknesses, and the best model for a particular purpose might conform to neither NF nor TF. The sweeping conclusions that "models with the TF form are systematically biased by sequence composition" (Conclusions) and "the NF form should be used" (Abstract) are unjustified.

*Author's response*: *We agree with the reviewer that the TF form can also represent independence of sites when base frequencies are unequal and have now revised substantial sections of the results and discussion to better explain the relationship between these two model forms. We also agree that the NF form has its own weaknesses, and detail these in the discussion. We have also toned down statements concerning systematic bias. We dispute, however, that we are not justified in claiming the NF form should be used. We elaborate on each of these issues in our response to the reviewer's detailed comments below*.

Paragraph 3 of Results seems to argue that when base frequencies are unequal, TF cannot represent cases where changes are independent of sequence context. This is not true of all TF models. The argument that *π* (T) *π* (G) is in general different from *π* (C) *π* (G) ignores the *r*_{TF} terms, which may be such that the rates are equal. The most general TF model contains the model with independence as a submodel, so it clearly can represent that case and would give consistent estimates of deviation from independence. In short, some TF models can model independence and are suitable for measuring context effects, even when there is compositional bias.

*Author's response*: *We agree that TF can represent cases where changes are independent of sequence context and there are unequal base frequencies. We have revised the manuscript to clarify this and the role of the r*_{TF} *terms. Our point is, while* NF_{nuGTR} *is the nucleotide GTR process*, TF_{nuGTR} *is not, although one would expect it to be so. As a consequence, comparing* NF_{nuGTR} *and* NF_{diGTR} *is correct and clearly interpretable, at least for models with multiplicative* *π* *, and to a lesser extent for others, whereas TF is cumbersome: one needs to spend quite a lot of effort to specify the nucleotide GTR process, even after the unintuitive realization that* TF_{nuGTR} *doesn't do the job*.

The above illustrates a more general point. A model need not be a case of TF in order to be nested in a case of TF. Every NF case is nested in the most general TF model, and may be nested in more restricted TF models as well. A model that only allows independence of sites may not be a TF model, but it is nested in a TF model. Since we are interested in models that allow context dependence, we are interested in more general models anyway.

*Author's response*: *We disagree with the last point. We are interested in context dependence in order to understand what contributes to it, not simply identifying its existence. We also suggest that models whose parameters do not affect the likelihood when the null of independence is true are superior to the alternative. This condition is satisfied by the NF form, not the TF form*.

Paragraphs 4–6 of Results claim to prove, using a simplified case with a two-letter alphabet, that TF leads to biased estimates when equilibrium frequencies are unequal. No bias is actually demonstrated, even for the special case of TF used. The intent in constructing the NF model seems to be that *κ* is a measure of a certain type of context effect, with *κ* = 1 corresponding to independence (zero context effect). For the TF model, the same *r* matrix is used. The values of *κ* estimated for the TF model are indeed quite different from 1 despite independence, but this is not an indication of bias. Rather, the *κ* in the TF model is a different parameter than the *κ* in the NF model, and is not expected to be 1 in this case. The *Q*_{TF} matrix is capable of perfectly modeling the independent case. It does so with *κ* ≠ 1 when frequencies are unequal. The estimated *κ* values are not biased, but are (in expectation, asymptotically) precisely what is required to yield a *Q*_{TF} matrix that models the data. Modulo multiplication by a constant, the NF and TF models produce the same matrix values, but using different parameterizations.

*Author's response*: *We agree that the term 'bias' is not correctly applied here and have revised this portion of the manuscript. We also emphatically agree that* *κ* *in the two models is a different parameter when the sequence states* *are* *unequal, but exactly the same parameter when the sequence states are equal. This is the point of the section, and a major point of the manuscript. In most applications, it is the r terms from NF models that are being interpreted. Thus, even if an identical likelihood were to result from application of NF and TF models, the biological inference drawn from those models could be markedly different based on whether an r term was* < 1, = 1, > 1.

This would all be clearer if the parameter in the *Q*_{TF} matrix were given a different name, e.g., *λ*. The estimates of *λ* do not cluster around 1, but these are not biased estimates of *κ*: they are asymptotically unbiased estimates of *λ*. To interpret *λ* ≠ 1 as an indication of context effects is a human error. Whether there are context effects can be determined from consideration of the estimates of *λ* and *π* (R). This takes a bit more effort, but inconvenience is quite different from bias.

*Author's response*: *We have renamed the parameter to* \mathcal{K}.

I am unsure why the *Q*_{TF} matrix was given in terms of *π* (X) *π* (Y) rather than *π*_{dinuc} (XY). The latter corresponds to eq. 2, and yields a different, more general model (modulo multiplication by a constant, it contains the *Q*_{NF} matrix as a submodel).

*Author's response:* *Our point was to demonstrate with a simple case the conditions where κ* = \mathcal{K} *could occur. We have clarified this section of the manuscript*.

I am not sure that the other analyses are entirely fair to TF. The authors apparently design an NF model that is appropriate and then transfer its *r* values to TF to yield a model that is inappropriate. Perhaps it is equally possible to design an appropriate TF model, and transfer of its *r* values to NF would yield an inappropriate model (this must be true for some underlying true models). Along these lines, for the case of intron sequence evolution the authors consider the most general time reversible dinucleotide model. This model contains the others, and appears to justify its extra parameters with sufficient likelihood improvement. Why do the authors not simply take the results of this model as their best estimates and recommend the use of this TF model for this application?

*Author's response:* *We accept the reviewer's suggestion of defining a TF model and simulating under this and then transferring the parameters to an NF model. This is now presented in the last paragraph of Results section on 'The relationship of NF and TF to an independent nucleotide process'*.

*Regarding our analysis of intron sequence evolution, we do not take the fully general model purely on the face value of its likelihood improvement because we are not interested in just identifying the best fitting model. Instead, we seek to understand the contributions of the different terms to the fit of this model, and the consistency between the models in interpreting their effects. We demonstrate there is broad consensus between the two forms, but that they differ with respect to key parameters that strongly influence our inference concerning the significance of a mutagenic process – namely the rate of transversions at CpG dinucleotides. We also point out that some potential applications of these models, such as phylogenetic reconstruction, would substantially benefit from using parameterizations that are less computationally intensive to fit than diGTR but succinctly capture the dominant effects*.

*Cherry responds in a second review*: This new paragraph, which mainly illustrates that a true null hypothesis is not often rejected, does not address the point that I was trying to make. In fact it is another example of the phenomenon that I was criticizing. Rather than using a properly designed TF model for assessing context effects, here and elsewhere the authors use an inappropriate TF model, TF_{nuGTR}, obtained by blind transfer of the exchangeability matrix from a working NF model. They then blame TF in general for the result, and conclude, incorrectly, that TF cannot be used for the intended purpose.

It is possible, reasonable, and likely often desirable to use TF models to study context effects. The null hypothesis of independent nucleotide GTR can be expressed in TF form (as can anything that can be expressed in NF form, according to the definitions in the revised manuscript). This TF form is different from what the authors misleadingly call TF_{nuGTR}. That may be surprising, but nonetheless it is possible to use a correct null hypothesis rather than TF_{nuGTR}. This null hypothesis can be tested against the alternative hypothesis TF_{diGTR}. This procedure will detect context effects when they are present but not when they are absent, subject to the usual statistical uncertainties.

It is unimportant whether the null hypothesis is expressed in TF form: the NF form or the single-nucleotide form will yield the same likelihood. The important point is that TF_{diGTR} is a legitimate and possibly very useful non-null model. Indeed, as I noted above, it appears to be quite useful for modelling the intron sequence data analyzed in the article. Any extra effort required to interpret the parameters is the price that one pays for using a richer model that is capable of representing biological reality. Sometimes a less general model will be preferable, but the same can be said of any general model (e.g., single-nucleotide GTR). I see no valid argument for banishing this useful tool from our arsenal of models.

*Author's response:* *Regarding the reviewer's statement concerning our specification of an "inappropriate TF model", we agree that is inappropriate but point out that this form was not used as a soft target. This parameterization was motivated by analogous TF form codon model parameterizations, specifically those of the Goldman and Yang substitution model family and subsequent refinements. Thus, the TF*_{nuGTR}*model choice is pertinent to current applications*.

*Regarding the reviewers suggestion that TF models are "reasonable, and likely often desirable" for the study of context effects, we illustrate again the additional complexity necessary to relate a TF model form to the relatively simple F81 substitution model. To parameterize F81 in TF, we need to set*

{r}_{\text{TF}}({i}_{1}{i}_{2},{j}_{1}{j}_{2})=\{\begin{array}{ll}1/\pi ({j}_{1})\hfill & {i}_{1}={j}_{1}\hfill \\ 1/\pi ({j}_{2})\hfill & {i}_{2}={j}_{2}\hfill \end{array}

*thus apparently requiring 4 extra parameters, which are in fact functions of π, and the form of Q is harder to comprehend than the analogous expression for the NF model form, where there are no additional parameters and hence the r*_{NF} *all equal 1*.

*Concerning the unimportance of the form of the null, we suggest it is enormously relevant to know which model actually corresponds to the desired null. We have argued that modelling context dependence can be done sensibly when an appropriate null can be identified, and that this null is obvious under the NF form. We claim for TF that even if a more restricted TF model than TF*_{diGTR}*nests the independent process, identifying this model as the null for measuring context effects from the myriad possibilities is decidedly non-obvious*.

*Cherry's first review continues*: Paragraph 2 of the Discussion states that "...specifying an NF model where the tuple frequencies are not just the product of monomer frequencies is more complex. One way it could be achieved is by specifying a non-reversible process...." I would expand on this. I believe the following to be true, but not necessarily obvious:

1. If *r*_{NF}(*a*, *b*) is symmetric, the equilibrium tuple frequencies equal the products of the nucleotide equilibrium frequencies, but

2. Symmetry of *r*_{NF}(*a*, *b*) is not necessary (though it is sufficient) for time reversibility.

I would also note that NF, unlike TF, cannot represent different equilibrium frequencies at different tuple positions. This would seem to be a serious deficit for codon-based models. One can imagine an extension to NF in which there are distinct frequency parameters for each tuple position. I am unsure of the relationship of this extension to TF.

*Author's response:* *We thank the reviewer for their interesting suggestion regarding asymmetric r. We have not modified the Discussion as we feel the existing text is adequate*.

*We agree that the NF form presented in the manuscript cannot represent position-specific equilibrium frequencies. However, as suggested by the reviewer, specifying independent and non-identical nucleotide frequencies is easy. Such models have already been described (e.g. Pond SK, Muse SV: Site-to-site variation of synonymous substitution rates. Mol Biol Evol 2005, 22:2375–2385). An argument for the suitability of this model form for the codon case was made by the first reviewer (R Knight, see above)*.

All that prevents any NF model from qualifying as a TF model is the restriction that *r*_{TF} does not depend on *π*_{dinuc}. Perhaps people tend to create this sort of model, which would make it worth analyzing. However, there is no reason for models to have this restriction. Even in single-base models, the analogous restriction is not always observed: in the F84 (Felsenstein, 1984) model, for example, some of the *r* values depend on *π* values. Perhaps the best models for some purposes are TF-like models that violate this restriction.

*Author's response:* *We concede that there may be some purposes for which the TF-like models may be better suited. Our focus has been on establishing how to measure departures from independence, and in particular the identification of primary influences in departure from independence in the evolution of DNA sequences. A motivator for this is the intra-genomic diversity in composition evident for many organisms, including the primate species used for the intron analysis. In such a case, using a model in which the critical parameters are independent of composition seems the best choice. Under the NF form, parameters estimated from genomic regions that differ in composition can be readily compared. As pointed out by the reviewer, similarly defined parameters estimated under the TF form from the same regions can also be compared. However, the failure to appreciate the influence of composition on the latter estimates can strongly mislead their interpretation*.

### Reviewer's report 3

*Stephen Altschul, NCBI (nominated by David Lipman)*

This paper compares two different, widely used models used for studying the evolution of nucleic acids. A specific question is whether specific nucleotide substitutions depend upon the context in which they occur, and in what way. This is addressed by modeling the evolution of non-overlapping nucleotide "tuples", and comparing the results to a model of independent nucleotide evolution.

A structure imposed on such models assumes an instantaneous a→b tuple mutation rate expressible as a symmetric matrix on a and b, multiplied by a factor dependent upon the equilibrium frequency distribution of either the resulting tuple (b) or the resulting nucleotide in the mutation that occurs. The nucleotide-frequency normalized model is called NF, the tuple-frequency normalized model TF.

A central point of the paper is that for NF, a null model is exactly the same as a model of independent nucleotide evolution, whereas this is true for TF only when the equilibrium distributions of the nucleotides are identical. When there are non-equal nucleotide frequencies, TF can be seen to imply context-dependent factors in nucleic acid evolution when there are none, and sometimes to miss detecting such factors when they exist. The NF model does not have equivalent difficulties.

*Author's response:* *We note here that the last point relates to an earlier version of the manuscript*.

Applied to real, non-protein-coding sequences, the NF and TF models both recognize very significant context-dependent effects on nucleic acid evolution, and agree on the most important dinucleotide pairs implicated. However, the models can diverge substantially concerning more subtle issues. Specifically, the NF model does not recognize a significant context-dependent influence on transversions involving CpG dinucleotides, whereas the TF model does.

In conclusion, this paper presents a provocative critique of the widely used TF model, and of some of the conclusions that have been derived therefrom. Its arguments will need to be taken into account in further modeling of nucleic acid evolution.