Distinguishing HIV-1 drug resistance, accessory, and viral fitness mutations using conditional selection pressure analysis of treated versus untreated patient samples

Background HIV can evolve drug resistance rapidly in response to new drug treatments, often through a combination of multiple mutations [1-3]. It would be useful to develop automated analyses of HIV sequence polymorphism that are able to predict drug resistance mutations, and to distinguish different types of functional roles among such mutations, for example, those that directly cause drug resistance, versus those that play an accessory role. Detecting functional interactions between mutations is essential for this classification. We have adapted a well-known measure of evolutionary selection pressure (Ka/Ks) and developed a conditional Ka/Ks approach to detect important interactions. Results We have applied this analysis to four independent HIV protease sequencing datasets: 50,000 clinical samples sequenced by Specialty Laboratories, Inc.; 1800 samples from patients treated with protease inhibitors; 2600 samples from untreated patients; 400 samples from untreated African patients. We have identified 428 mutation interactions in Specialty dataset with statistical significance and we were able to distinguish primary vs. accessory mutations for many well-studied examples. Amino acid interactions identified by conditional Ka/Ks matched 80 of 92 pair wise interactions found by a completely independent study of HIV protease (p-value for this match is significant: 10-70). Furthermore, Ka/Ks selection pressure results were highly reproducible among these independent datasets, both qualitatively and quantitatively, suggesting that they are detecting real drug-resistance and viral fitness mutations in the wild HIV-1 population. Conclusion Conditional Ka/Ks analysis can detect mutation interactions and distinguish primary vs. accessory mutations in HIV-1. Ka/Ks analysis of treated vs. untreated patient data can distinguish drug-resistance vs. viral fitness mutations. Verification of these results would require longitudinal studies. The result provides a valuable resource for AIDS research and will be available for open access upon publication at Reviewers This article was reviewed by Wen-Hsiung Li (nominated by Eugene V. Koonin), Robert Shafer (nominated by Eugene V. Koonin), and Shamil Sunyaev.


Background
During the past two decades, researchers and clinicians have made enormous efforts to identify drug resistance mutations in HIV-1 protease, a major target of anti-retroviral therapy. Discovery of a new drug resistance mutation typically requires a combination of clinical research (e.g. AIDS patients displaying drug resistance), molecular biology (e.g. sequencing), and biochemistry (e.g. obtaining viral samples and performing phenotypic assays). Over the last 20 years, many mutations and combinations of mutations in protease have been reported to play a role in drug resistance [4,5]. But this discovery process is laborious, expensive, and far from automatic -indeed, it is research. Meanwhile, HIV is constantly evolving new drug resistance mutations, often within weeks of introduction of a new drug [6][7][8]. Thus, a completely automated approach for identifying drug resistance mutations would be valuable [9,10].
Differences in reproductive success (such as the enhanced survival of a drug-resistant mutant) are reflected in selection pressure, which can be measured directly from sequence data. For example, K a /K s is a well-known metric of selection pressure for amino acid mutations, measured relative to the observed frequency of synonymous mutations (which cause no change to protein sequence or function) [11,12]. Ordinarily K a /K s values much less than one are observed, indicating that most amino acid mutations are deleterious. This is referred to as "negative selection". By contrast, K a /K s values greater than one indicate that amino acid mutations improve reproductive success (positive selection), a very unusual condition. By screening for individual amino acid mutations in HIV protease that displayed positive selection, this metric was able to correctly predict 19 out of 23 known drug resistance mutation sites in HIV protease [10]. This approach has several advantages: it is completely automatic; it is based on a large and well-understood body of evolutionary theory; it only requires raw sequence data to perform this analysis.
However, not all positively selected mutations are created equal. It would be very useful to be able to distinguish different types of mutations on the basis of their interactions with other sites. For example, a binding site mutation that interacts directly with the drug molecule might confer primary drug resistance [13][14][15]. Alternatively, a mutation that interacts with another site within HIV protease might compensate for destabilizing effects of mutations at the other site [16,17]. Unfortunately, selection pressure met-rics such as K a /K s provide no explicit mechanism for taking such interactions into account.
To address this problem, we have extended the K a /K s concept to consider interactions with other sites, by formulating a conditional K a /K s that measures the effects of mutations at one site on the selection pressure for mutations at another site. We have applied it to four completely independent datasets: Specialty: 50,126 HIV-1 positive patient plasma samples sequenced by Specialty Laboratories, Inc. between October 1999 and October 2003 [10]. This dataset represents a mix of treated and untreated patient samples from the U.S. Treated: 1797 samples collected by Shafer and colleagues, from U.S. patients undergoing specific drug treatments [9]. Untreated: 2628 samples collected by Shafer and colleagues specifically from untreated patients [9]. Africa: 399 African HIV-1 subtype C samples downloaded from Los Alamos HIV Sequence Database [18][19][20]. These samples predominantly represent untreated patients. These sequences cover the whole protease region (codons 1 -99) of HIV-1. The conditional Ka/Ks results are highly reproducible among these independent datasets. They reveal complex interactions between amino acid sites, and distinguish different types of mutations such as primary drug-resistance mutations vs. accessory mutations.

HIV-1 sequence data
We filtered all four datasets to exclude contamination by other HIV-1 subtypes. We compared all of the sequences in these datasets against HIV-1 subtype reference sequences from the Los Alamos database by using the program BLAST and assigned each sample to the subtype with the highest identify. All non-subtype B sequences were removed from the U.S. datasets and all non-subtype C sequences were removed from the African dataset. There might be multiple samples from the same individual in Specialty dataset. To eliminate the phylogenetic relationships among sequences, we also excluded samples that have 98% or greater nucleotide sequence similarity to each other.
Multiple sequence alignment and mutation detection were performed as previously described [10]. We used the widely accepted nucleotide reference sequence from the Los Alamos HIV Sequence Database as the "wildtype" for HIV-1 subtype B [21,22] and used the nucleotide consensus of all the African subtype C samples as the "wildtype" for HIV-1 subtype C. In this paper, we considered only single nucleotide substitutions; all other mutations were excluded.

Unconditional Ka/Ks measurement for individual codon positions
Amino acid selection pressure can be calculated for an individual codon [23,24]; we will refer to this as the "unconditional K a /K s " to distinguish it from the conditional approach presented in the next section. We first measured the transition and transversion frequencies f t and f v from the entire dataset, according to where S is the total number of samples, N t and N v are the number of observed transition and transversion mutations respectively, n t is the number of possible transitions in the region that was sequenced (simply equal to its length L in nucleotides), and n v is the number of possible transversions (equal to 2L). In this calculation (and all others below) we only counted single nucleotide substitutions; all other mutations were excluded. We performed the calculation of unconditional K a /K s as previously described [10], as the ratio of N a , the count of samples with amino acid mutations observed at that codon, over N s , the count of samples with synonymous mutations observed at that codon. This N a /N s ratio is then normalized by the ratio expected under a random mutation model (i.e., in the absence of any selection pressure), according to the following formula: where n a,t is the number of possible transition mutations in the codon that would change the wildtype amino acid, n s,t is the number of possible transition mutations in the codon that are synonymous, and n a,v and n s,v are the equivalent numbers for transversions. When the observed ratio (N a /N s ) is greater than the expected ratio (in the denominator of the expression above), the selection pressure K a / K s is greater than 1 and we say that the codon is under positive selection pressure.
We calculated an LOD confidence score for a codon to be under positive selection pressure according to the following formula: where N is the total number of mutations observed in the codon, and q is calculated as follows:

Conditional selection pressure analysis
To measure how mutation at one site alters the selection pressure at another site, we define the "conditional selection pressure" for a site Y in the presence of an amino acid mutation at another site X as: where N YaXa is the number of samples with an amino acid mutation at codon Y and an amino acid mutation on codon X; and N YsXa is the number of samples with a synonymous mutation at codon Y and an amino acid mutation on codon X. We will refer to (K a /K s ) Y|Xa as the "conditional K a /K s " at Y given an amino acid mutation at X.
We define the "conditional selection ratio" as the ratio of the conditional Ka/Ks divided by the selection pressure at Y measured in the absence of any mutation at X: where N YaXo and N YsXo are the numbers of samples containing either an amino acid mutation or synonymous mutation at Y and no mutation at codon X.

Confidence scoring for conditional selection pressure analysis
We screened for codon pairs with conditional Ka/Ks and conditional selection ratios above 1. To evaluate the statistical significance of apparent positive conditional selection pressure, we defined a LOD score based on the likelihood of observing at least N YaXa samples with amino acid mutations at codon Y and X, assuming neutral selection at codon Y: where N = N YaXa + N YsXa ; and q as defined above.
To evaluate the statistical significance of the conditional selection ratio, we used a one-sided Fisher's exact test to measure the p-value for the null hypothesis that ( , , We performed Fisher's exact test on a 2-by-2 contingency table containing the counts N YaXa -1, N YsXa , N YaXo and N YsXo , using the statistical software package R. We deducted a single count from N YaXa to make the test stricter against cases with small numbers of counts. Throughout this study, we used the following criteria for high confidence positive conditional selection pressure: (K a /K s ) Y|X > 1, (K a /K s ) Y|Xa > 1, LOD > 3, and p < 10 -3 .

Comparing K a /K s values between datasets
To test whether the difference in K a /K s value computed from two datasets D and D' was statistically significant, we calculated a p-value using a one-sided Fisher's exact test, using the 2 × 2 contingency table representing the counts of amino acid mutations and synonymous mutations from the two datasets: for unconditional Ka/Ks, (N a , N s , N' a , N' s ); for conditional K a /K s , (N YaXa , N YsXa , N' YaXa , N' YsXa ). We used a p-value threshold of 0.01.

Detection of selection pressure interactions via conditional K a /K s
To dissect the functional roles of amino acid mutations in HIV protease, we analyzed the Specialty dataset of 48,387 HIV-positive samples for conditional K a /K s relationships. We screened all possible codon pairs using several criteria: positive conditional selection pressure (with a statistical significance test of LOD >3), and a conditional selection ratio greater than 1 (with a statistical significance test of p < 10 -3 ). 400 codon pairs showed strong conditional relationships by these criteria (Table 1). These effects ranged from relatively mild increases (for example, the presence of an amino acid mutation at position 88 increased the K a /K s value at codon 10 by about 27%) to very large (at position 30, the K a /K s value increased from 0.4 to 71.1 depending on the absence or presence of an amino acid mutation at codon 88).
These data reveal a basic distinction between two types of positive selection effects: those that depend on the presence of another amino acid mutation at a specific site, versus those that do not. For example, amino acid mutations at codon 90 are strongly selected for, regardless of the presence or absence of a mutation at codon 10 ( Fig. 1). By contrast, codon 10 displayed negative selection in samples with no mutation at 90, but strongly positive selection in samples containing an amino acid mutation at 90. These data show that mutations at 90 are directly selected by drug treatment, that mutations at 10 are not favorable by themselves, but that they become advantageous in viruses bearing a 90 mutation. These results closely match previous experimental studies showing that mutations at 90 cause drug resistance, while mutations at 10 have an accessory effect of compensating for the destabilizing effect of mutations at 90 [25]. We have observed similar asymmetric effects at other sites, for example 24/82 (Fig.  1a). In general, primary drug resistance mutations showed consistent positive selection, whereas accessory mutations often showed only conditional positive selection, depend- Due to the long list of positive conditional selection pressure detected in the Specialty dataset, only the 15 positive selections with the strongest increase in (K a /K s ) Y|Xa value after conditioning were showed here. Please see the Additional File for the complete list. Codon X: the codon site where the first mutation appears. Codon Y: the codon site where the second mutation appears.
(K a /K s ) Y|Xa : conditional selection pressure of codon Y in the presence of an amino acid mutation at codon X.
(K a /K s ) Y|Xo : conditional selection pressure of codon Y in the absence of any mutation at codon X. LOD: confidence score for (K a /K s ) Y|Xa > 1. p: p-value for (K a /K s ) Y|X > 1 to be random.
Conditional Ka/Ks measurements for primary vs. accessory drug-resistance mutations   Specialty ent on the presence of a primary drug resistance mutation. By exposing such asymmetries, the conditional K a /K s analysis may be able to predict whether mutations have a primary drug resistance vs. accessory effect. For example, mutations at position 54 are positively selected, which in turn induce positive selection for mutations at 89, which by themselves are unfavorable (Fig. 1b). These data suggest that 89 is an accessory position, whose mutations can stabilize mutations at 54 (which are known to cause drug resistance [26]).

Comparison with independent drug treatment studies of HIV protease
One weakness of the Specialty dataset is that it includes no information about individual drug treatments. To assess the significance of our conditional K a /K s results, we have compared them with the independent experimental study of Wu et al., who identified correlated mutation pairs, i.e. pairs of codons where mutations co-occurred much more frequently in patients with specific drug treatments, than in patients who received no drug treatment [9]. By comparing 1,004 HIV isolates from untreated patients with 1,240 patients from patients treated with one or more protease inhibitors, Wu et al. identified 92 protease codon pairs with significant correlated mutations associated with drug treatment. Our conditional K a /K s analysis of the Specialty dataset matched these results closely, identifying 80 of these 92 codon pairs as having positive conditional selection pressure (LOD>2 and p < 0.01). This result had strong statistical significance (p-value = 10 -70 ). Thus conditional K a /K s appears to robustly detect mutational interactions that are genuinely associated with drug treatment, even from a dataset (Specialty) lacking any drug treatment information.

Distinguishing drug-resistance vs. fitness mutations by comparison of K a /K s for treated vs. untreated datasets
We have sought to distinguish mutations that are selected specifically in response to drug treatment, from fitness mutations that are positively selected even in the absence of drug treatment [27]. To do so, we first compared unconditional K a /K s values from the Treated dataset with those from the Untreated dataset (Table 2). Nine codons underwent a confident shift from negative selection in Untreated, to positive selection in Treated. Of these, five (30, 53, 54, 82, and 90) are known primary drug-resistance mutations, and three (10, 24, 73) are known accessory drug-resistance mutation codons [28]. Thus, this shift appears to be an accurate predictor of sites involved in drug-resistance. Intriguingly, the one remaining codon (74) is not currently known to cause drug-resistance, but showed confident positive selection both in the Treated (K a /K s = 3.75) and Specialty (K a /K s = 5.87) datasets.
A total of fifteen codons were positively selected in both Treated and Untreated dataset (Table 3; codon 71 has K a / K s value of 1.11 but a LOD = 0.63 in the Untreated dataset). The fact that these mutations were positively selected even in the absence of drug treatment suggests that they contribute directly to viral fitness. Of these, five (19, 62, 63, 71 and 77) displayed a statistically significant increase in K a /K s from Untreated to Treated (Table 2). Codons 63, 71 and 77 are known accessory drug-resistance mutations [28]. The functions of the remaining two codons are unknown. The significant increase of K a /K s value in the Treated dataset predicts an accessory role in drug-resistance.
Strikingly, all of the known primary drug resistance mutations were observed to be negatively selected in the Untreated dataset, suggesting that they reduce fitness in the absence of drug treatment [29][30][31]. These negative selection effects were strongest for mutations at codons 66 (K a /K s < 0.01), 84(< 0.01), 32 (0.01), 48 (0.02) and 54 (0.03), and weakest at codons 82 (K a /K s = 0.46), 53 (0.31), 24(0.33), and 90 (0.25). Thus, in the absence of continuing drug treatment, these mutations would be expected to be gradually lost from the evolving viral population due to their greatly reduced fitness [29][30][31]. By contrast, while most accessory drug-resistance mutations were negatively selected in Untreated, some were neutral (e.g. 71), while others were strongly positively selected (e.g. 63, 77) even in the absence of drug treatment. This may explain why mutations at position 63 appear to be becoming dominant in the HIV population [26,32] It is interesting to note that at no codon did we ever observe the pattern of positive selection in Untreated but negative selection in Treated (Table 3). Thus, drug-treatment appears to preserve positive selection pressure for fitness mutations that are found in the untreated dataset, merely adding new positive selection effects for drugresistance.
Although the Treated and Untreated datasets are much smaller than Specialty, we were able to calculate conditional K a /K s values for many sites. Fig. 1e-h shows a comparison of conditional K a /K s results from the Treated and Untreated data for positions 10/90 and 63/90, versus the results from the Specialty Data ( Fig. 1c-d). The Specialty and Treated results display good qualitative agreement, in that mutations at 90 (a primary drug-resistance mutation site) are positively selected regardless of whether an accessory mutation is present at 10 or 63, and each mutation reproducibly increases selection for the other mutation [25,31]. By contrast, in the Untreated dataset, mutations at codon 90 become unfavorable (negative selection pressure), while selection pressure for the accessory mutation (10 or 63) drops much less.
Although estimating conditional K a /K s accurately from the Treated dataset is difficult due to insufficient counts, we identified 274 codon pairs that showed statistically significant increases in conditional K a /K s for Treated relative to the Untreated dataset (Table 4). Of these shifts, eleven exhibited positive conditional selection pressure in Treated, but not in Untreated: 10→54, 71→54, 71→82, 63→90 (primary drug-resistance mutation codons); 20→10, 82→10, 71→10, 24→10, 10→63, 36→71, 63→73 (accessory drug-resistance mutation codons). And two codon pairs (93→63, an accessory drug-resistance mutation; 12→19, function unknown) are positively selected in both datasets. It is striking that almost all of these increases in conditional K a /K s involved known drugresistance mutation sites.

K a /K s comparison between the specialty and treated datasets
To evaluate the robustness of our approach, we compared its results from two independent datasets (Specialty; Treated), both from U.S. patients who received drug treatment. We first computed unconditional K a /K s values from the two datasets, and compared the results (Fig. 2). These independent data displayed a significant quantitative agreement, with a correlation coefficient of 0.887. Out of the 97 protease codons analyzed, only two significant differences were observed between the datasets. At codon 24, the Treated data showed mild positive selection (K a /K s = 1.59), but the Specialty dataset indicated slightly negative selection (K a /K s = 0.55). At codon 15, the situation was reversed (Treated K a /K s = 0.88; Specialty K a /K s = 1.30), but since both values were very close to neutral (K a /K s = 1.0), the difference was not statistically significant. The almost exact overlap in positive selection between the two datasets was highly significant (p-value = 10 -15.9 ).
Overall, the positively selected K a /K s values computed from the Treated dataset tended to be slightly higher than those from the Specialty dataset. This is reflected in the ratios of K a /K s values from Treated vs. Specialty data, whose average (1.37) indicates that the Treated K a /K s values were 37% higher than the Specialty values. This is consistent with the fact that the Treated dataset consists exclusively of patients under drug treatment, whereas the Specialty dataset contains a mixture of samples from treated and untreated patients. Such mixing is expected to reduce the observed selection pressure arising from drugtreatment.
We next compared conditional K a /K s values from the two datasets (Fig. 3). These results display considerably more variation, but this appears to be due to insufficient counts in the Treated dataset for accurate estimate of conditional K a /K s values for many codon pairs. Whereas the unconditional K a /K s uses the whole dataset to estimate the rates of amino acid mutations (K a ) and synonymous mutations (K s ), the conditional K a /K s for a mutation Y conditioned on another mutation X is by definition limited to the small subset of sequences that contain mutation X. Since the Treated dataset contains only 1,797 samples, this drops many sites below the number of samples needed for accurate estimation of conditional K a /K s . To filter the dataset to codon pairs where (K a /K s ) Y|Xa could be estimated with some accuracy, we restricted the analysis to cases where N YaXa + N YsXa > 100. These results showed a strong match between conditional K a /K s values calculated from  None a positively selected. For selection pressure, K a /K s > 1 and LOD > 2; for conditional selection pressure, (K a /K s ) Y|Xa > 1, (K a /K s ) Y|X > 1, LOD > 3 and p < 0.001. b negatively selected (K a /K s < 1 and LOD > 2) c codon 71 is positively selected in Treated samples (K a /K s > 1 and LOD > 2), and it has a K a /K s value greater than 1 but a LOD score less than 2 in Untreated samples. d codon 53 is positively selected in Treated samples (K a /K s > 1 and LOD > 2), and it has a K a /K s value less than 1 but a LOD score less than 2 in Untreated samples. e In Treated dataset, codon 93 is positively selected with (K a /K s ) Y|Xa > 1, (K a /K s ) Y|X > 1, LOD > 3 but p > 0.001.
the two independent datasets, with a correlation coefficient of 0.495. None of the codon pairs with positive conditional selection in the Specialty dataset showed negative selection in the Treated dataset.
Again, the positively selected conditional K a /K s values computed from the Treated dataset were slightly higher than those from the Specialty dataset (by 17.9% on average). This is consistent with the fact that the Specialty dataset is a mixture of treated and untreated samples, whereas the Treated dataset was collected exclusively from treated patients.

K a /K s between the untreated and Africa datasets
We computed K a /K s values from two untreated datasets: 2,628 HIV-positive samples from U.S. patients who had received no drug treatment (Untreated); and 399 HIVpositive samples from African patients (predominantly untreated [18][19][20]). While the African dataset is too small for accurate estimation of K a /K s values, the two datasets show strong qualitative agreement (Fig. 4). Of the 97 protease codons analyzed, seventeen codons showed positive selection (K a /K s >1, LOD>2) in at least one of the two datasets. In twelve cases, both datasets indicated positive selection (K a /K s >1); in four cases, the Untreated data detected positive selection but the African dataset did not; and in one case (codon 39), the reverse was observed. This high level of agreement was statistically significant (Pvalue = 10 -3.75 ). Despite the fact that these data were collected on different continents and represent different HIV-1 subtypes (B in Untreated, and C in African), they indicate that fitness mutations are found mostly at the same positions in the two populations.

Assessing the effects of mixing treated and untreated data on selection pressure calculations
In principle, the K a /K s analysis approach differs from associational studies in that it does not require an untreated "reference" dataset in order to detect drug-resistance signals, and should even be tolerant of significant levels of Positive conditional selection pressure in the Specialty and Treated datasets Figure 3 Positive conditional selection pressure in the Specialty and Treated datasets. (K a /K s ) Y|Xa values for positive conditional selection pressures in the Specialty and Treated datasets are plotted. To filter the dataset to sites where (K a / K s ) Y|Xa could be estimated with accuracy, only selection pressures with N XaYa + N XaYs > 100 in both datasets were showed. An empty triangle indicated that we were not confident about the conditional selection value in the Treated dataset. The correlation coefficient of the conditional selection pressures in these two datasets is 0.495.  Specialty Treated contamination by untreated samples. To test its robustness to such contamination, we deliberately mixed the Treated and Untreated datasets, and analyzed K a /K s from the mixed data. We tested several mixtures: adding 25% contamination of Untreated samples (i.e. N untreated = 0.25 N treated ); 50% contamination; and 146% contamination (i.e. all the available Untreated samples). The 25% and 50% mixtures caused only minor effects on the measured Ka/Ks values (data not shown). The maximum contamination (146%) caused up to two-fold decreases in positively selected K a /K s values (Fig. 5), with shifts averaging of 32.5%. Out of the 24 codons detected to be positively selected in the Treated dataset, only two became negatively selected in the mixture dataset. These data show that the K a /K s approach is relatively robust even to very high levels of contamination by untreated samples.

Discussion
Naïve K a /K s analysis involves several assumptions that can undercut the robustness of its results. In particular, the assumption of a single consensus sequence as the common ancestor of all sequences in the data sample ignores the possibility of important phylogenetic relationships among these sequences. Phylogenetic structure in the dataset thus can cause artifacts in K a /K s analysis, and some methods perform K a /K s analysis in the context of an explicit phylogeny [33]. Linkage disequilibrium between sites can also cause artifacts in conditional K a /K s analysis, by creating correlations between sites that are purely historical, and not due to functional interaction [34,35]. These problems are likely to be most severe in genomes with a low mutation rate and low recombination rate, and  To assess these problems, we have used HIV mutation data to test the utility and robustness of conditional K a /K s analysis, by comparison with known functional interactions involved in drug resistance, and comparison of the K a /K s results for multiple independent datasets. Several lines of evidence indicate that, at least for HIV protease, conditional K a /K s analysis provides robust detection of real functional interactions that are not artifacts: 1) K a /K s identification of sites with strong positive selection correctly identified known drug resistance mutations in HIV protease (p-value = 10 -3.3 ; [10]); 2) K a /K s identification of mutations with strong positive selection matched independent experimental phenotypic assays [10,36]; 3) conditional K a /K s identification of pairs of sites with strongly positive conditional selection ratios ((K a /K s ) Y|X >>1) strongly matched independent experimental studies iden-tifying pairs of mutations that are correlated in treated vs. untreated patient datasets [9]; 4) detailed functional predictions from conditional K a /K s (such as distinguishing primary vs. accessory drug resistance mutations) matched well-known examples in HIV protease where such detailed functional information is available; 5) conditional K a /K s values computed from independent datasets (Specialty vs. Stanford-Treated) yielded highly similar results, indicating that these values are not artifacts, but broadly reproducible characteristics of HIV evolution in the wild population; 6) the strong positive selection patterns identified by conditional K a /K s in treated patients (Specialty; Stanford-Treated) vanished in the untreated patient dataset (Stanford-Untreated), indicating that these K a /K s patterns genuinely reflect selection pressure associated with drug treatment. Considered as a whole, all of these assessments suggest that conditional K a /K s robustly identifies genuine functional selection pressure in HIV protease, not artifacts, and thus may also be useful in other genes with similar characteristics, such as other HIV genes, or other rapidly-evolving viruses. However, it is likely that a more sophisticated approach, incorporating phylogenetic structure, will be required for other types of problems.
The high level of reproducibility of K a /K s results across four independent datasets in this study (Fig. 4) merits further consideration. These datasets were collected entirely independently, and come from diverse sources [9,10,[18][19][20]. The quantitative agreement between K a /K s values obtained from these datasets is surprisingly consistent considering that the input data were collected by different investigators, from different patient groups, over different time periods, under differing drug treatment regimens (and in the case of the Specialty dataset, no treatment information available whatsoever), and even from different continents and HIV-1 subtypes. Even the weakest level of agreement between datasets (Untreated vs. Africa, by far the smallest dataset) was significant (p-value = 10 -3.75 ). Moreover, comparison of the K a /K s results versus known drug-resistance mutations demonstrates that this approach predicts true drug-resistance mutation codons with a high level of accuracy (13 drug-resistance mutation codons in K a /K s analysis, and 19 in the conditional K a /K s analysis). These results also show that comparison of K a / K s values between treated and untreated datasets is effective at distinguishing primary and accessory drug-resistance mutations (only observed in the treated dataset) from viral fitness mutations (observed in the untreated dataset). This suggests that this approach could be applied successfully to more specific dataset comparisons (such as two different drug treatments).
These four datasets also suggest that viral fitness mutations in HIV protease are surprisingly consistent through-Effects of mixing Treated and Untreated data on selection pressure calculation Treated Mixture out HIV-1 populations. In general, mutations that were positively selected in untreated samples (i.e. the Untreated and African datasets) were also observed to be positively selected in treated samples (i.e. in both the Specialty and Treated datasets). To a first approximation, these mutations seem to represent a uniform background of positive selection pressures that are consistently observed in any set of samples we have analyzed. Furthermore, these "viral fitness" mutation codons matched strongly between the Africa and Untreated datasets, indicating that these fitness pressures are even consistent across different HIV-1 subtypes (B in Untreated vs. C in Africa) [37,38].
These comparisons suggest that it may be useful to classify mutations according to their selection pressure values in treated vs. untreated patient populations. For example, whereas mutations that are positively selected even in untreated samples may be classed as improving viral fitness [37], mutations that are positively selected in treated samples but not untreated samples are likely to be drugresistance mutations [13][14][15]. It is also interesting to classify mutations according to whether their positive selection effects are conditional or unconditional. A mutation that is positively selected by drug treatment, with no conditional dependence on any other mutation, evidently makes a primary (i.e. direct) contribution to drug-resistance [13][14][15]. By contrast, a mutation that is positively selected by drug treatment only in the presence of a primary mutation, evidently does not make a direct contribution to drug-resistance, but rather a secondary (or "accessory") contribution mediated by the primary mutation [16,17]. Finally, a mutation that compensates for a deleterious fitness effect caused by a primary drug-resistance mutation can also be distinguished by K a /K s analysis [31]. Such a mutation should not exhibit unconditional positive selection in either untreated or treated samples, but should show positive conditional selection (only in the presence of the primary mutation) in both types of samples.
How much "mixing" of sample types can this approach tolerate? We have tested the effects of deliberately mixing treated and untreated datasets upon positive selection mapping by K a /K s . These data show that K a /K s can robustly detect drug resistance mutations even when the level of mixing equals or even exceeds 100%. Such mixing reduced K a /K s estimates somewhat, but qualitatively the results were robust. Thus, not only can K a /K s detect drug resistance mutations in a single dataset (with no requirement for a reference dataset of untreated samples), it can do so even when the primary dataset is contaminated with a large fraction of untreated samples, or samples with different treatments. This can be very useful in situations where treatment information is incomplete or unavailable.
Since different amino acid mutations can have different functional effects [36], an obvious extension of this analysis to consider the conditional selection pressure effects of individual amino acid mutations, rather than treating all amino acid mutations at one site as equivalent, as was done in this analysis. Such an extension would yield detailed information about the amino acid interactions between two codon sites [10].
Conditional Ka/Ks represents a slight extension of existing work on co-evolution of amino acid residues in protein sequences. One well-established approach to this problem is the use of correlation analysis to look for correlations between amino acid residues in a set of related sequences [17,[39][40][41][42][43][44]. For example, by analyzing mutation correlation in an alignment of 266 non-redundant SH3 domain sequences, Larson et al. were able to predict residue-residue contacts in the structure of the SH3 domain [40]. Such correlated mutation analysis tests for non-additive effects of a pair of mutations on reproductive fitness, such as epistasis [45] or compensation [43]. One important question not addressed by correlation analysis is whether an interaction between a pair of mutations is symmetric or asymmetric; that is, whether the influence of one mutation on the fitness effect of the second mutation is the same as the influence of the second mutation upon the fitness effect of the first. Since the definition of correlation is inherently symmetric (i.e. the correlation coefficient ρ(x,y) = ρ(y,x) is invariant to exchange of x and y), it does not answer this question. Conditional Ka/Ks combines the benefits of correlation analysis (detection of interactions between sites) with the benefits of standard Ka/Ks (a measurement of functional selection pressure). Moreover, this method can distinguish symmetric vs. asymmetric functional interactions.
We emphasize that in the simple form presented in this paper, conditional Ka/Ks is probably not applicable in datasets with a low mutation rate or strong linkage disequilibrium. If the number of mutation observations at a site is small, results will not be statistically significant. Worse, if individual observations of a mutation are not independent mutation events, but instead arose by common descent from a single ancestral mutation event, the results will be skewed by such "double-counting" bias within the data. In sequence datasets with clear phylogenetic structure, it will be essential to compute conditional Ka/Ks for specific branches of the phylogenetic tree, as has been done for standard Ka/Ks calculations [33,46]. This improved approach control for departures from conditional independence in the data sample, by explicitly analyzing the conditional dependence structure (i.e. the phylogenetic tree). Thus conditional Ka/Ks need not be limited to cases that fit a "star topology", but can be generalized to work with phylogenetic trees.

Conclusion
Conditional K a /K s analysis can detect mutation interactions between different sites. The analysis results also reveal important effects on fitness that are asymmetric and are very useful to predict primary vs. accessory drug resistant mutations. Comparison of K a /K s values between treated and untreated datasets is effective at distinguishing drug-resistance associated mutations from viral fitness mutations. Analysis results from four completely independent datasets are highly consistent both qualitatively and quantitatively. These results can be very helpful to understand the functional roles of different mutations in the development of drug resistance.

Authors' contributions
L. Chen carried out the sequence alignment, mutation detection, selection pressure and conditional selection pressure analysis. C. Lee proposed the conditional Ka/Ks metric, participated in the design of the study, and drafted the manuscript. My main suggestion is that the authors should explain in more detail the methodological parts, so that readers can understand them better.

Reviewers' comments
Minor comments: Give an explanation of Ka and Ks in the Introduction.
Explain the definition of q on page 7.
In many places "amino acid mutations" should be "amino acid substitutions or replacements". The manuscript has been greatly improved by addressing the larger specific issues raised in the initial review and also by improving the presentation of the results and clarity of the writing. Below I've summarized some of the specific improvements and do raise two additional pointsthe second of these points should definitely be addressed because it will be raised by others knowledgeable about the specific patterns of mutations observed in drug-resistant isolates.

Author response
1. Phylogenetic relationships among the sequences. The manuscript has been improved by discussing the relevance of this issue and by adopting a heuristic approachexcluding samples with >98% sequence identity. The authors should add to the methods that this step was taken.
2. Inherent variability of HIV-1 sequences. The authors have explained in more detail how conditional Ka/Ks was determined. Although their approach may remain controversial, it is certainly novel and worthy of publication and broader discussion.

Author response
As suggested by the reviewer, we have added a sentence in the Method Section to explain that we have excluded samples with >98% sequence identity.
3. Availability of the data. Although this may be partly beyond their control, the authors' attempts to make the data available -ideally the complete nucleotide sequence -is laudatory.  This would not  influence the protease because, as far as I am aware, the 2bp changes I47A, I54A, V82S, I84A/C are each uncommon mutations at positions at which 1-bp changes also  cause resistance (I47V, I54V, V82A, I84V). But it would influence a similar analysis on RT as T215Y and T215F are each 2-bp changes and are the only drug-resistance mutations at these positions.

Author response
I believe the authors need to add further explanation as to the implications of conditional Ka/Ks because of the following: In table 1, several of the Codon X's are actually known to follow the Codon Y. Therefore the statistical dependence of Y on X, does not necessarily mean that Y follows X. The most obvious example of this is the relationship between position 30 and 88. There is only one mutation at position 30 (D30N). There are several at position 88 (N88D >> N88S >>> N88T/G). N88S occurs completely independent of D30N but is much less common than N88D. However, N88D always follows and never precedes D30N. Therefore there are two paths to resistance involving these mutations: D30N followed by D30N+N88D and N88S alone. I believe that this is also the case for codons 73 and 90 (73 nearly always follows 90) and 24 and 82 (24 nearly always follows 82). This is an interesting work and the manuscript has improved upon revision. The revised manuscript includes discussion of the existing literature on co-evolution of amino acid residues. However, I would still like to make a few comments, some of them being critical.

1) A few comments on the method
a) The authors make a series of simplifications including assumptions of star-like phylogeny and absence of multiple substitutions in individual codons. Therefore, the method assumes that all sequences are descendants of a single sequence and the estimation of Ka and Ks is done by simple counting. I agree with the authors that currently available methods are not applicable to the dataset of 50,000 sequences. Thus, the authors have no other choice rather than to rely on simpler computationally efficient methods and making these assumptions is a necessity. The authors also point that the number of codons with multiple substitutions is not expected to be high. However, if the above assumptions are not valid (and it is not clear why the phylogeny is expected to be star-like), the results of the analysis have to be received with caution. Inference of positive selection can be erroneous. Analysis of statistical significance which relies on the binomial distribution and application of the Fisher exact test are not justified if these assumptions are not satisfied.

Author response
We have identified all mutations (both single nucleotide polymorphisms and multiple nucleotide polymorphisms) in this dataset. The results show that only a very small fraction of the mutations involve multiple substitutions in a codon (about 10%). They are excluded from this analysis because their evolution patterns are different from single nucleotide polymorphism.
HIV has extremely high mutation rate and high recombination rate. And it is under strong selection pressure, especially when the patient is under drug treatment. These same factors tend to make HIV evolution within a specific subtype follow the star topology model.
b) It is my understanding that frequencies of transition and transversion rates were estimated using both synonymous and non-synonymous changes. It may be better to use synonymous changes only.

Author response
As suggested by the reviewer, we have estimated the transition and transversion rates only using synonymous changes (F t /F v = 23.75). This does not alter the results of selection pressure and conditional selection pressure analysis much. For the purposes of this paper (selection pressure analysis) we prefer to use the more conservative assumption of measuring F t /F v from all nucleotides (F t /F v = 8.90), since this matches what has been published previously (Wilson, J. W., J Clin Microbiol, 2000). c) Testing all pairs of codons creates an obvious multiple test problem. The manuscript would probably benefit from the discussion on how the rate of false-positive predictions is controlled given this multiple test problem.
2) It is of interest whether amino acid residue pairs which display conditional selection located close to each other in protein structure. The structural analysis can serve as a confirmation of the conclusions made by the authors. However, I agree with the authors's response that this should be a subject of a separate study.
3) Most importantly, I do not find the explanation of the observed "conditional selection" satisfactory. The authors argue that mutations with significant "conditional Ka/Ks values" are not directly involved in the drug resistance and play accessory role. The first mutation which is directly involved in the drug resistance is destabilizing (in other words, is expected to be deleterious in the absence of the drug treatment) and the second mutation is needed to compensate the destabilizing effect of the first mutation. Although, this is certainly a possibility, it is feasible to imagine many other scenarios. For example, is it possible that the first mutation enables the effect of the second mutation on the drug response, i.e. the first mutation is not destabilizing but the second mutation would not have an effect on drug resistance without the first mutation? Alternatively, is it possible that the first mutation is not destabilizing, whereas the second mutation would be destabilizing (and deleterious) in the absence of the first mutation (Dobzhansky-Muller incompatibility)? In this case also the second mutation may be directly involved in drug resistance. Importantly, the latter scenarios do not involve initial increase in frequency of a destabilizing variant.