Skip to main content


Distinguishing HIV-1 drug resistance, accessory, and viral fitness mutations using conditional selection pressure analysis of treated versus untreated patient samples



HIV can evolve drug resistance rapidly in response to new drug treatments, often through a combination of multiple mutations [13]. It would be useful to develop automated analyses of HIV sequence polymorphism that are able to predict drug resistance mutations, and to distinguish different types of functional roles among such mutations, for example, those that directly cause drug resistance, versus those that play an accessory role. Detecting functional interactions between mutations is essential for this classification. We have adapted a well-known measure of evolutionary selection pressure (K a /K s ) and developed a conditional K a /K s approach to detect important interactions.


We have applied this analysis to four independent HIV protease sequencing datasets: 50,000 clinical samples sequenced by Specialty Laboratories, Inc.; 1800 samples from patients treated with protease inhibitors; 2600 samples from untreated patients; 400 samples from untreated African patients. We have identified 428 mutation interactions in Specialty dataset with statistical significance and we were able to distinguish primary vs. accessory mutations for many well-studied examples. Amino acid interactions identified by conditional K a /K s matched 80 of 92 pair wise interactions found by a completely independent study of HIV protease (p-value for this match is significant: 10-70). Furthermore, K a /K s selection pressure results were highly reproducible among these independent datasets, both qualitatively and quantitatively, suggesting that they are detecting real drug-resistance and viral fitness mutations in the wild HIV-1 population.


Conditional K a /K s analysis can detect mutation interactions and distinguish primary vs. accessory mutations in HIV-1. K a /K s analysis of treated vs. untreated patient data can distinguish drug-resistance vs. viral fitness mutations. Verification of these results would require longitudinal studies. The result provides a valuable resource for AIDS research and will be available for open access upon publication at


This article was reviewed by Wen-Hsiung Li (nominated by Eugene V. Koonin), Robert Shafer (nominated by Eugene V. Koonin), and Shamil Sunyaev.

Open peer review

Reviewed by Wen-Hsiung Li (nominated by Eugene V. Koonin), Robert Shafer (nominated by Eugene V. Koonin), and Shamil Sunyaev. For the full reviews, please go to the Reviewers’ comments section.


During the past two decades, researchers and clinicians have made enormous efforts to identify drug resistance mutations in HIV-1 protease, a major target of anti-retroviral therapy. Discovery of a new drug resistance mutation typically requires a combination of clinical research (e.g. AIDS patients displaying drug resistance), molecular biology (e.g. sequencing), and biochemistry (e.g. obtaining viral samples and performing phenotypic assays). Over the last 20 years, many mutations and combinations of mutations in protease have been reported to play a role in drug resistance [4, 5]. But this discovery process is laborious, expensive, and far from automatic – indeed, it is research. Meanwhile, HIV is constantly evolving new drug resistance mutations, often within weeks of introduction of a new drug [68]. Thus, a completely automated approach for identifying drug resistance mutations would be valuable [9, 10].

Differences in reproductive success (such as the enhanced survival of a drug-resistant mutant) are reflected in selection pressure, which can be measured directly from sequence data. For example, K a /K s is a well-known metric of selection pressure for amino acid mutations, measured relative to the observed frequency of synonymous mutations (which cause no change to protein sequence or function) [11, 12]. Ordinarily K a /K s values much less than one are observed, indicating that most amino acid mutations are deleterious. This is referred to as "negative selection". By contrast, K a /K s values greater than one indicate that amino acid mutations improve reproductive success (positive selection), a very unusual condition. By screening for individual amino acid mutations in HIV protease that displayed positive selection, this metric was able to correctly predict 19 out of 23 known drug resistance mutation sites in HIV protease [10]. This approach has several advantages: it is completely automatic; it is based on a large and well-understood body of evolutionary theory; it only requires raw sequence data to perform this analysis.

However, not all positively selected mutations are created equal. It would be very useful to be able to distinguish different types of mutations on the basis of their interactions with other sites. For example, a binding site mutation that interacts directly with the drug molecule might confer primary drug resistance [1315]. Alternatively, a mutation that interacts with another site within HIV protease might compensate for destabilizing effects of mutations at the other site [16, 17]. Unfortunately, selection pressure metrics such as K a /K s provide no explicit mechanism for taking such interactions into account.

To address this problem, we have extended the K a /K s concept to consider interactions with other sites, by formulating a conditional K a /K s that measures the effects of mutations at one site on the selection pressure for mutations at another site. We have applied it to four completely independent datasets: Specialty: 50,126 HIV-1 positive patient plasma samples sequenced by Specialty Laboratories, Inc. between October 1999 and October 2003 [10]. This dataset represents a mix of treated and untreated patient samples from the U.S. Treated: 1797 samples collected by Shafer and colleagues, from U.S. patients undergoing specific drug treatments [9]. Untreated: 2628 samples collected by Shafer and colleagues specifically from untreated patients [9]. Africa: 399 African HIV-1 subtype C samples downloaded from Los Alamos HIV Sequence Database [1820]. These samples predominantly represent untreated patients. These sequences cover the whole protease region (codons 1 – 99) of HIV-1. The conditional Ka/Ks results are highly reproducible among these independent datasets. They reveal complex interactions between amino acid sites, and distinguish different types of mutations such as primary drug-resistance mutations vs. accessory mutations.


HIV-1 sequence data

We filtered all four datasets to exclude contamination by other HIV-1 subtypes. We compared all of the sequences in these datasets against HIV-1 subtype reference sequences from the Los Alamos database by using the program BLAST and assigned each sample to the subtype with the highest identify. All non-subtype B sequences were removed from the U.S. datasets and all non-subtype C sequences were removed from the African dataset. There might be multiple samples from the same individual in Specialty dataset. To eliminate the phylogenetic relationships among sequences, we also excluded samples that have 98% or greater nucleotide sequence similarity to each other.

Multiple sequence alignment and mutation detection were performed as previously described [10]. We used the widely accepted nucleotide reference sequence from the Los Alamos HIV Sequence Database as the "wildtype" for HIV-1 subtype B [21, 22] and used the nucleotide consensus of all the African subtype C samples as the "wildtype" for HIV-1 subtype C. In this paper, we considered only single nucleotide substitutions; all other mutations were excluded.

Unconditional Ka/Ks measurement for individual codon positions

Amino acid selection pressure can be calculated for an individual codon [23, 24]; we will refer to this as the "unconditional K a /K s " to distinguish it from the conditional approach presented in the next section. We first measured the transition and transversion frequencies f t and f v from the entire dataset, according to

where S is the total number of samples, N t and N v are the number of observed transition and transversion mutations respectively, n t is the number of possible transitions in the region that was sequenced (simply equal to its length L in nucleotides), and n v is the number of possible transversions (equal to 2L). In this calculation (and all others below) we only counted single nucleotide substitutions; all other mutations were excluded. We performed the calculation of unconditional K a /K s as previously described [10], as the ratio of N a , the count of samples with amino acid mutations observed at that codon, over N s , the count of samples with synonymous mutations observed at that codon. This N a /N s ratio is then normalized by the ratio expected under a random mutation model (i.e., in the absence of any selection pressure), according to the following formula:

where na,tis the number of possible transition mutations in the codon that would change the wildtype amino acid, ns,tis the number of possible transition mutations in the codon that are synonymous, and na,vand ns,vare the equivalent numbers for transversions. When the observed ratio (N a /N s ) is greater than the expected ratio (in the denominator of the expression above), the selection pressure K a /K s is greater than 1 and we say that the codon is under positive selection pressure.

We calculated an LOD confidence score for a codon to be under positive selection pressure according to the following formula:

where N is the total number of mutations observed in the codon, and q is calculated as follows:

Conditional selection pressure analysis

To measure how mutation at one site alters the selection pressure at another site, we define the "conditional selection pressure" for a site Y in the presence of an amino acid mutation at another site X as:

where N YaXa is the number of samples with an amino acid mutation at codon Y and an amino acid mutation on codon X; and N YsXa is the number of samples with a synonymous mutation at codon Y and an amino acid mutation on codon X. We will refer to (K a /K s )Y|Xaas the "conditional K a /K s " at Y given an amino acid mutation at X.

We define the "conditional selection ratio" as the ratio of the conditional Ka/Ks divided by the selection pressure at Y measured in the absence of any mutation at X:

where N YaXo and N YsXo are the numbers of samples containing either an amino acid mutation or synonymous mutation at Y and no mutation at codon X.

Confidence scoring for conditional selection pressure analysis

We screened for codon pairs with conditional Ka/Ks and conditional selection ratios above 1. To evaluate the statistical significance of apparent positive conditional selection pressure, we defined a LOD score based on the likelihood of observing at least N YaXa samples with amino acid mutations at codon Y and X, assuming neutral selection at codon Y:

where N = N YaXa + N YsXa ; and q as defined above.

To evaluate the statistical significance of the conditional selection ratio, we used a one-sided Fisher's exact test to measure the p-value for the null hypothesis that (K a /K s )Y|Xa≤ (K a /K s )Y|Xo. We performed Fisher's exact test on a 2-by-2 contingency table containing the counts N YaXa -1, N YsXa , N YaXo and N YsXo , using the statistical software package R. We deducted a single count from N YaXa to make the test stricter against cases with small numbers of counts.

Throughout this study, we used the following criteria for high confidence positive conditional selection pressure: (K a /K s )Y|X> 1, (K a /K s )Y|Xa> 1, LOD > 3, and p < 10-3.

Comparing Ka/K s values between datasets

To test whether the difference in K a /K s value computed from two datasets D and D' was statistically significant, we calculated a p-value using a one-sided Fisher's exact test, using the 2 × 2 contingency table representing the counts of amino acid mutations and synonymous mutations from the two datasets: for unconditional Ka/Ks, (N a , N s , N' a , N' s ); for conditional K a /K s , (N YaXa , N YsXa , N' YaXa , N' YsXa ). We used a p-value threshold of 0.01.


Detection of selection pressure interactions via conditional K a /K s

To dissect the functional roles of amino acid mutations in HIV protease, we analyzed the Specialty dataset of 48,387 HIV-positive samples for conditional K a /K s relationships. We screened all possible codon pairs using several criteria: positive conditional selection pressure (with a statistical significance test of LOD >3), and a conditional selection ratio greater than 1 (with a statistical significance test of p < 10-3). 400 codon pairs showed strong conditional relationships by these criteria (Table 1). These effects ranged from relatively mild increases (for example, the presence of an amino acid mutation at position 88 increased the K a /K s value at codon 10 by about 27%) to very large (at position 30, the K a /K s value increased from 0.4 to 71.1 depending on the absence or presence of an amino acid mutation at codon 88).

Table 1 Positive conditional selection pressure in HIV-1 protease

These data reveal a basic distinction between two types of positive selection effects: those that depend on the presence of another amino acid mutation at a specific site, versus those that do not. For example, amino acid mutations at codon 90 are strongly selected for, regardless of the presence or absence of a mutation at codon 10 (Fig. 1). By contrast, codon 10 displayed negative selection in samples with no mutation at 90, but strongly positive selection in samples containing an amino acid mutation at 90. These data show that mutations at 90 are directly selected by drug treatment, that mutations at 10 are not favorable by themselves, but that they become advantageous in viruses bearing a 90 mutation. These results closely match previous experimental studies showing that mutations at 90 cause drug resistance, while mutations at 10 have an accessory effect of compensating for the destabilizing effect of mutations at 90 [25]. We have observed similar asymmetric effects at other sites, for example 24/82 (Fig. 1a). In general, primary drug resistance mutations showed consistent positive selection, whereas accessory mutations often showed only conditional positive selection, dependent on the presence of a primary drug resistance mutation. By exposing such asymmetries, the conditional K a /K s analysis may be able to predict whether mutations have a primary drug resistance vs. accessory effect. For example, mutations at position 54 are positively selected, which in turn induce positive selection for mutations at 89, which by themselves are unfavorable (Fig. 1b). These data suggest that 89 is an accessory position, whose mutations can stabilize mutations at 54 (which are known to cause drug resistance [26]).

Figure 1

Conditional Ka/Ks measurements for primary vs. accessory drug-resistance mutations. For the two possible pathways from wildtype to a double mutant, we computed the conditional K a /K s values for each mutation conditioned on the presence or absence of the other mutation (shown as numbers next to each edge in the figure). Primary drug resistance codons are highlighted in bold. a) 24/82; b) 54/89; c, e, g) 10/90; d, f, h) 63/90. a-d are from the Specialty dataset; e-f from the Treated dataset; and g-h from the Untreated dataset (see text).

Comparison with independent drug treatment studies of HIV protease

One weakness of the Specialty dataset is that it includes no information about individual drug treatments. To assess the significance of our conditional K a /K s results, we have compared them with the independent experimental study of Wu et al., who identified correlated mutation pairs, i.e. pairs of codons where mutations co-occurred much more frequently in patients with specific drug treatments, than in patients who received no drug treatment [9]. By comparing 1,004 HIV isolates from untreated patients with 1,240 patients from patients treated with one or more protease inhibitors, Wu et al. identified 92 protease codon pairs with significant correlated mutations associated with drug treatment. Our conditional K a /K s analysis of the Specialty dataset matched these results closely, identifying 80 of these 92 codon pairs as having positive conditional selection pressure (LOD>2 and p < 0.01). This result had strong statistical significance (p-value = 10-70). Thus conditional K a /K s appears to robustly detect mutational interactions that are genuinely associated with drug treatment, even from a dataset (Specialty) lacking any drug treatment information.

Distinguishing drug-resistance vs. fitness mutations by comparison of K a /K s for treated vs. untreated datasets

We have sought to distinguish mutations that are selected specifically in response to drug treatment, from fitness mutations that are positively selected even in the absence of drug treatment [27]. To do so, we first compared unconditional K a /K s values from the Treated dataset with those from the Untreated dataset (Table 2). Nine codons underwent a confident shift from negative selection in Untreated, to positive selection in Treated. Of these, five (30, 53, 54, 82, and 90) are known primary drug-resistance mutations, and three (10, 24, 73) are known accessory drug-resistance mutation codons [28]. Thus, this shift appears to be an accurate predictor of sites involved in drug-resistance. Intriguingly, the one remaining codon (74) is not currently known to cause drug-resistance, but showed confident positive selection both in the Treated (K a /K s = 3.75) and Specialty (K a /K s = 5.87) datasets.

Table 2 Significant differences of selection pressure between Treated and Untreated dataset

A total of fifteen codons were positively selected in both Treated and Untreated dataset (Table 3; codon 71 has K a /K s value of 1.11 but a LOD = 0.63 in the Untreated dataset). The fact that these mutations were positively selected even in the absence of drug treatment suggests that they contribute directly to viral fitness. Of these, five (19, 62, 63, 71 and 77) displayed a statistically significant increase in K a /K s from Untreated to Treated (Table 2). Codons 63, 71 and 77 are known accessory drug-resistance mutations [28]. The functions of the remaining two codons are unknown. The significant increase of K a /K s value in the Treated dataset predicts an accessory role in drug-resistance.

Table 3 Positive selection pressure in Treated and Untreated datasets

Strikingly, all of the known primary drug resistance mutations were observed to be negatively selected in the Untreated dataset, suggesting that they reduce fitness in the absence of drug treatment [2931]. These negative selection effects were strongest for mutations at codons 66 (K a /K s < 0.01), 84(< 0.01), 32 (0.01), 48 (0.02) and 54 (0.03), and weakest at codons 82 (K a /K s = 0.46), 53 (0.31), 24(0.33), and 90 (0.25). Thus, in the absence of continuing drug treatment, these mutations would be expected to be gradually lost from the evolving viral population due to their greatly reduced fitness [2931]. By contrast, while most accessory drug-resistance mutations were negatively selected in Untreated, some were neutral (e.g. 71), while others were strongly positively selected (e.g. 63, 77) even in the absence of drug treatment. This may explain why mutations at position 63 appear to be becoming dominant in the HIV population [26, 32]

It is interesting to note that at no codon did we ever observe the pattern of positive selection in Untreated but negative selection in Treated (Table 3). Thus, drug-treatment appears to preserve positive selection pressure for fitness mutations that are found in the untreated dataset, merely adding new positive selection effects for drug-resistance.

Although the Treated and Untreated datasets are much smaller than Specialty, we were able to calculate conditional K a /K s values for many sites. Fig. 1e–h shows a comparison of conditional K a /K s results from the Treated and Untreated data for positions 10/90 and 63/90, versus the results from the Specialty Data (Fig. 1c–d). The Specialty and Treated results display good qualitative agreement, in that mutations at 90 (a primary drug-resistance mutation site) are positively selected regardless of whether an accessory mutation is present at 10 or 63, and each mutation reproducibly increases selection for the other mutation [25, 31]. By contrast, in the Untreated dataset, mutations at codon 90 become unfavorable (negative selection pressure), while selection pressure for the accessory mutation (10 or 63) drops much less.

Although estimating conditional K a /K s accurately from the Treated dataset is difficult due to insufficient counts, we identified 274 codon pairs that showed statistically significant increases in conditional K a /K s for Treated relative to the Untreated dataset (Table 4). Of these shifts, eleven exhibited positive conditional selection pressure in Treated, but not in Untreated: 10→54, 71→54, 71→82, 63→90 (primary drug-resistance mutation codons); 20→10, 82→10, 71→10, 24→10, 10→63, 36→71, 63→73 (accessory drug-resistance mutation codons). And two codon pairs (93→63, an accessory drug-resistance mutation; 12→19, function unknown) are positively selected in both datasets. It is striking that almost all of these increases in conditional K a /K s involved known drug-resistance mutation sites.

Table 4 Significant differences of conditional selection pressure between Treated and Untreated dataset

K a /K s comparison between the specialty and treated datasets

To evaluate the robustness of our approach, we compared its results from two independent datasets (Specialty; Treated), both from U.S. patients who received drug treatment. We first computed unconditional K a /K s values from the two datasets, and compared the results (Fig. 2). These independent data displayed a significant quantitative agreement, with a correlation coefficient of 0.887. Out of the 97 protease codons analyzed, only two significant differences were observed between the datasets. At codon 24, the Treated data showed mild positive selection (K a /K s = 1.59), but the Specialty dataset indicated slightly negative selection (K a /K s = 0.55). At codon 15, the situation was reversed (Treated K a /K s = 0.88; Specialty K a /K s = 1.30), but since both values were very close to neutral (K a /K s = 1.0), the difference was not statistically significant. The almost exact overlap in positive selection between the two datasets was highly significant (p-value = 10-15.9).

Figure 2

Positive selection pressure in the Specialty andTreated datasets. K a /K s values for positive selection pressures in the Specialty and Treated datasets are plotted. K a /K s values with a LOD < 2 in the Treated dataset were plotted as empty triangles. Different colors represent codons with different functions on drug resistance: Red: primary drug resistance mutation codon; Blue: accessory drug resistance mutation codon; Light blue: function unknown codon. The correlation coefficient of the selection pressures in these two datasets is 0.887, suggesting a significant quantitative agreement.

Overall, the positively selected K a /K s values computed from the Treated dataset tended to be slightly higher than those from the Specialty dataset. This is reflected in the ratios of K a /K s values from Treated vs. Specialty data, whose average (1.37) indicates that the Treated K a /K s values were 37% higher than the Specialty values. This is consistent with the fact that the Treated dataset consists exclusively of patients under drug treatment, whereas the Specialty dataset contains a mixture of samples from treated and untreated patients. Such mixing is expected to reduce the observed selection pressure arising from drug-treatment.

We next compared conditional K a /K s values from the two datasets (Fig. 3). These results display considerably more variation, but this appears to be due to insufficient counts in the Treated dataset for accurate estimate of conditional K a /K s values for many codon pairs. Whereas the unconditional K a /K s uses the whole dataset to estimate the rates of amino acid mutations (K a ) and synonymous mutations (K s ), the conditional K a /K s for a mutation Y conditioned on another mutation X is by definition limited to the small subset of sequences that contain mutation X. Since the Treated dataset contains only 1,797 samples, this drops many sites below the number of samples needed for accurate estimation of conditional K a /K s . To filter the dataset to codon pairs where (K a /K s )Y|Xa could be estimated with some accuracy, we restricted the analysis to cases where N YaXa + N YsXa > 100. These results showed a strong match between conditional K a /K s values calculated from the two independent datasets, with a correlation coefficient of 0.495. None of the codon pairs with positive conditional selection in the Specialty dataset showed negative selection in the Treated dataset.

Figure 3

Positive conditional selection pressure in the Specialty and Treated datasets. (K a /K s )Y|Xa values for positive conditional selection pressures in the Specialty and Treated datasets are plotted. To filter the dataset to sites where (K a /K s )Y|Xa could be estimated with accuracy, only selection pressures with NXaYa + NXaYs > 100 in both datasets were showed. An empty triangle indicated that we were not confident about the conditional selection value in the Treated dataset. The correlation coefficient of the conditional selection pressures in these two datasets is 0.495.

Again, the positively selected conditional K a /K s values computed from the Treated dataset were slightly higher than those from the Specialty dataset (by 17.9% on average). This is consistent with the fact that the Specialty dataset is a mixture of treated and untreated samples, whereas the Treated dataset was collected exclusively from treated patients.

K a /K s between the untreated and Africa datasets

We computed K a /K s values from two untreated datasets: 2,628 HIV-positive samples from U.S. patients who had received no drug treatment (Untreated); and 399 HIV-positive samples from African patients (predominantly untreated [1820]). While the African dataset is too small for accurate estimation of K a /K s values, the two datasets show strong qualitative agreement (Fig. 4). Of the 97 protease codons analyzed, seventeen codons showed positive selection (K a /K s >1, LOD>2) in at least one of the two datasets. In twelve cases, both datasets indicated positive selection (K a /K s >1); in four cases, the Untreated data detected positive selection but the African dataset did not; and in one case (codon 39), the reverse was observed. This high level of agreement was statistically significant (P-value = 10-3.75). Despite the fact that these data were collected on different continents and represent different HIV-1 subtypes (B in Untreated, and C in African), they indicate that fitness mutations are found mostly at the same positions in the two populations.

Figure 4

Independent HIV-1 datasets yield reproducible positive selection results. Our method identified positively selected sites (marked with different symbols depending on whether the site is a known drug resistance mutation, accessory mutation, etc.) in independent analyses of four different datasets: Specialty: 48,387 U.S. patient samples, a mix of treated and untreated patients; Stanford-Treated: 1797 U.S. patient samples with specific drug treatments; Stanford-Untreated: 2628 U.S. untreated patient samples; African: 399 African patient samples, HIV-1 subtype C (predominantly untreated patients).

Assessing the effects of mixing treated and untreated data on selection pressure calculations

In principle, the K a /K s analysis approach differs from associational studies in that it does not require an untreated "reference" dataset in order to detect drug-resistance signals, and should even be tolerant of significant levels of contamination by untreated samples. To test its robustness to such contamination, we deliberately mixed the Treated and Untreated datasets, and analyzed K a /K s from the mixed data. We tested several mixtures: adding 25% contamination of Untreated samples (i.e. N untreated = 0.25 N treated ); 50% contamination; and 146% contamination (i.e. all the available Untreated samples). The 25% and 50% mixtures caused only minor effects on the measured Ka/Ks values (data not shown). The maximum contamination (146%) caused up to two-fold decreases in positively selected K a /K s values (Fig. 5), with shifts averaging of 32.5%. Out of the 24 codons detected to be positively selected in the Treated dataset, only two became negatively selected in the mixture dataset. These data show that the K a /K s approach is relatively robust even to very high levels of contamination by untreated samples.

Figure 5

Effects of mixing Treated and Untreated data on selection pressure calculation. We have mixed the Treated (1797 samples) with the Untreated (2628 samples) to assess the effects of a mixed data on selection pressure analysis. This mixing caused up to two-fold decreases in positively selected Ka/Ks values (compared with the Treated dataset), with shifts averaging of 32.7%. Out of the 24 positively selected codons in the Treated dataset, only two became negatively selected in the Mixture dataset. K a /K s values with a LOD < 2 in the Mixture dataset were plotted as empty triangles.


Naïve K a /K s analysis involves several assumptions that can undercut the robustness of its results. In particular, the assumption of a single consensus sequence as the common ancestor of all sequences in the data sample ignores the possibility of important phylogenetic relationships among these sequences. Phylogenetic structure in the dataset thus can cause artifacts in K a /K s analysis, and some methods perform K a /K s analysis in the context of an explicit phylogeny [33]. Linkage disequilibrium between sites can also cause artifacts in conditional K a /K s analysis, by creating correlations between sites that are purely historical, and not due to functional interaction [34, 35]. These problems are likely to be most severe in genomes with a low mutation rate and low recombination rate, and less severe in genomes with very high mutation and recombination rates, such as HIV.

To assess these problems, we have used HIV mutation data to test the utility and robustness of conditional K a /K s analysis, by comparison with known functional interactions involved in drug resistance, and comparison of the K a /K s results for multiple independent datasets. Several lines of evidence indicate that, at least for HIV protease, conditional K a /K s analysis provides robust detection of real functional interactions that are not artifacts: 1) K a /K s identification of sites with strong positive selection correctly identified known drug resistance mutations in HIV protease (p-value = 10-3.3; [10]); 2) K a /K s identification of mutations with strong positive selection matched independent experimental phenotypic assays [10, 36]; 3) conditional K a /K s identification of pairs of sites with strongly positive conditional selection ratios ((K a /K s )Y|X >>1) strongly matched independent experimental studies identifying pairs of mutations that are correlated in treated vs. untreated patient datasets [9]; 4) detailed functional predictions from conditional K a /K s (such as distinguishing primary vs. accessory drug resistance mutations) matched well-known examples in HIV protease where such detailed functional information is available; 5) conditional K a /K s values computed from independent datasets (Specialty vs. Stanford-Treated) yielded highly similar results, indicating that these values are not artifacts, but broadly reproducible characteristics of HIV evolution in the wild population; 6) the strong positive selection patterns identified by conditional K a /K s in treated patients (Specialty; Stanford-Treated) vanished in the untreated patient dataset (Stanford-Untreated), indicating that these K a /K s patterns genuinely reflect selection pressure associated with drug treatment. Considered as a whole, all of these assessments suggest that conditional K a /K s robustly identifies genuine functional selection pressure in HIV protease, not artifacts, and thus may also be useful in other genes with similar characteristics, such as other HIV genes, or other rapidly-evolving viruses. However, it is likely that a more sophisticated approach, incorporating phylogenetic structure, will be required for other types of problems.

The high level of reproducibility of K a /K s results across four independent datasets in this study (Fig. 4) merits further consideration. These datasets were collected entirely independently, and come from diverse sources [9, 10, 1820]. The quantitative agreement between K a /K s values obtained from these datasets is surprisingly consistent considering that the input data were collected by different investigators, from different patient groups, over different time periods, under differing drug treatment regimens (and in the case of the Specialty dataset, no treatment information available whatsoever), and even from different continents and HIV-1 subtypes. Even the weakest level of agreement between datasets (Untreated vs. Africa, by far the smallest dataset) was significant (p-value = 10-3.75). Moreover, comparison of the K a /K s results versus known drug-resistance mutations demonstrates that this approach predicts true drug-resistance mutation codons with a high level of accuracy (13 drug-resistance mutation codons in K a /K s analysis, and 19 in the conditional K a /K s analysis). These results also show that comparison of K a /K s values between treated and untreated datasets is effective at distinguishing primary and accessory drug-resistance mutations (only observed in the treated dataset) from viral fitness mutations (observed in the untreated dataset). This suggests that this approach could be applied successfully to more specific dataset comparisons (such as two different drug treatments).

These four datasets also suggest that viral fitness mutations in HIV protease are surprisingly consistent throughout HIV-1 populations. In general, mutations that were positively selected in untreated samples (i.e. the Untreated and African datasets) were also observed to be positively selected in treated samples (i.e. in both the Specialty and Treated datasets). To a first approximation, these mutations seem to represent a uniform background of positive selection pressures that are consistently observed in any set of samples we have analyzed. Furthermore, these "viral fitness" mutation codons matched strongly between the Africa and Untreated datasets, indicating that these fitness pressures are even consistent across different HIV-1 subtypes (B in Untreated vs. C in Africa) [37, 38].

These comparisons suggest that it may be useful to classify mutations according to their selection pressure values in treated vs. untreated patient populations. For example, whereas mutations that are positively selected even in untreated samples may be classed as improving viral fitness [37], mutations that are positively selected in treated samples but not untreated samples are likely to be drug-resistance mutations [1315]. It is also interesting to classify mutations according to whether their positive selection effects are conditional or unconditional. A mutation that is positively selected by drug treatment, with no conditional dependence on any other mutation, evidently makes a primary (i.e. direct) contribution to drug-resistance [1315]. By contrast, a mutation that is positively selected by drug treatment only in the presence of a primary mutation, evidently does not make a direct contribution to drug-resistance, but rather a secondary (or "accessory") contribution mediated by the primary mutation [16, 17]. Finally, a mutation that compensates for a deleterious fitness effect caused by a primary drug-resistance mutation can also be distinguished by K a /K s analysis [31]. Such a mutation should not exhibit unconditional positive selection in either untreated or treated samples, but should show positive conditional selection (only in the presence of the primary mutation) in both types of samples.

How much "mixing" of sample types can this approach tolerate? We have tested the effects of deliberately mixing treated and untreated datasets upon positive selection mapping by K a /K s . These data show that K a /K s can robustly detect drug resistance mutations even when the level of mixing equals or even exceeds 100%. Such mixing reduced K a /K s estimates somewhat, but qualitatively the results were robust. Thus, not only can K a /K s detect drug resistance mutations in a single dataset (with no requirement for a reference dataset of untreated samples), it can do so even when the primary dataset is contaminated with a large fraction of untreated samples, or samples with different treatments. This can be very useful in situations where treatment information is incomplete or unavailable.

Since different amino acid mutations can have different functional effects [36], an obvious extension of this analysis to consider the conditional selection pressure effects of individual amino acid mutations, rather than treating all amino acid mutations at one site as equivalent, as was done in this analysis. Such an extension would yield detailed information about the amino acid interactions between two codon sites [10].

Conditional Ka/Ks represents a slight extension of existing work on co-evolution of amino acid residues in protein sequences. One well-established approach to this problem is the use of correlation analysis to look for correlations between amino acid residues in a set of related sequences [17, 3944]. For example, by analyzing mutation correlation in an alignment of 266 non-redundant SH3 domain sequences, Larson et al. were able to predict residue-residue contacts in the structure of the SH3 domain [40]. Such correlated mutation analysis tests for non-additive effects of a pair of mutations on reproductive fitness, such as epistasis [45] or compensation [43]. One important question not addressed by correlation analysis is whether an interaction between a pair of mutations is symmetric or asymmetric; that is, whether the influence of one mutation on the fitness effect of the second mutation is the same as the influence of the second mutation upon the fitness effect of the first. Since the definition of correlation is inherently symmetric (i.e. the correlation coefficient ρ(x,y) = ρ(y,x) is invariant to exchange of x and y), it does not answer this question. Conditional Ka/Ks combines the benefits of correlation analysis (detection of interactions between sites) with the benefits of standard Ka/Ks (a measurement of functional selection pressure). Moreover, this method can distinguish symmetric vs. asymmetric functional interactions.

We emphasize that in the simple form presented in this paper, conditional Ka/Ks is probably not applicable in datasets with a low mutation rate or strong linkage disequilibrium. If the number of mutation observations at a site is small, results will not be statistically significant. Worse, if individual observations of a mutation are not independent mutation events, but instead arose by common descent from a single ancestral mutation event, the results will be skewed by such "double-counting" bias within the data. In sequence datasets with clear phylogenetic structure, it will be essential to compute conditional Ka/Ks for specific branches of the phylogenetic tree, as has been done for standard Ka/Ks calculations [33, 46]. This improved approach control for departures from conditional independence in the data sample, by explicitly analyzing the conditional dependence structure (i.e. the phylogenetic tree). Thus conditional Ka/Ks need not be limited to cases that fit a "star topology", but can be generalized to work with phylogenetic trees.


Conditional K a /K s analysis can detect mutation interactions between different sites. The analysis results also reveal important effects on fitness that are asymmetric and are very useful to predict primary vs. accessory drug resistant mutations. Comparison of K a /K s values between treated and untreated datasets is effective at distinguishing drug-resistance associated mutations from viral fitness mutations. Analysis results from four completely independent datasets are highly consistent both qualitatively and quantitatively. These results can be very helpful to understand the functional roles of different mutations in the development of drug resistance.

Reviewers' comments

Reviewer's report 1

Wen-Hsiung Li (nominated by Eugene V. Koonin), University of Chicago, IL, USA

Second review following revisions:

This is a very interesting study. The author developed a conditional Ka/Ks test and applied it to large amounts of HIV sequence data to detect drug resistance mutation and fitness mutations. Since the MS reads well and since I am traveling, I have not read it carefully, but I suppose there are no serious problems.

My main suggestion is that the authors should explain in more detail the methodological parts, so that readers can understand them better.

Minor comments:

Give an explanation of Ka and Ks in the Introduction.

Explain the definition of q on page 7.

In many places "amino acid mutations" should be "amino acid substitutions or replacements".

Author response

We have briefly introduced Ka/Ks in the Background and there is detailed description of how to calculate Ka/Ks in the Method. We have defined the mathematical formula to calculate q on page 7.

Reviewer's report 2

Robert Shafer (nominated by Eugene V. Koonin), Stanford University, CA, USA

Second review following revision

The manuscript has been greatly improved by addressing the larger specific issues raised in the initial review and also by improving the presentation of the results and clarity of the writing. Below I've summarized some of the specific improvements and do raise two additional points – the second of these points should definitely be addressed because it will be raised by others knowledgeable about the specific patterns of mutations observed in drug-resistant isolates.

1. Phylogenetic relationships among the sequences. The manuscript has been improved by discussing the relevance of this issue and by adopting a heuristic approach – excluding samples with >98% sequence identity. The authors should add to the methods that this step was taken.

2. Inherent variability of HIV-1 sequences. The authors have explained in more detail how conditional Ka/Ks was determined. Although their approach may remain controversial, it is certainly novel and worthy of publication and broader discussion.

Author response

As suggested by the reviewer, we have added a sentence in the Method Section to explain that we have excluded samples with >98% sequence identity.

3. Availability of the data. Although this may be partly beyond their control, the authors' attempts to make the data available – ideally the complete nucleotide sequence – is laudatory.

Author response

All the sequence data and analysis results will be available to open access upon the publication of this manuscript at:

4. Presentation of the results. This has been improved and the Additional File is also useful.

Several additional comments:

The authors state in the methods that "we only counted single nucleotide substations". Would this influence their ability to detect positions at which the mutations are 2-bp mutations. If so this should be stated. This would not influence the protease because, as far as I am aware, the 2-bp changes I47A, I54A, V82S, I84A/C are each uncommon mutations at positions at which 1-bp changes also cause resistance (I47V, I54V, V82A, I84V). But it would influence a similar analysis on RT as T215Y and T215F are each 2-bp changes and are the only drug-resistance mutations at these positions.

I believe the authors need to add further explanation as to the implications of conditional Ka/Ks because of the following: In table 1, several of the Codon X's are actually known to follow the Codon Y. Therefore the statistical dependence of Y on X, does not necessarily mean that Y follows X. The most obvious example of this is the relationship between position 30 and 88. There is only one mutation at position 30 (D30N). There are several at position 88 (N88D >> N88S >>> N88T/G). N88S occurs completely independent of D30N but is much less common than N88D. However, N88D always follows and never precedes D30N. Therefore there are two paths to resistance involving these mutations: D30N followed by D30N+N88D and N88S alone. I believe that this is also the case for codons 73 and 90 (73 nearly always follows 90) and 24 and 82 (24 nearly always follows 82).

Author response

Only a very small fraction of the mutations involve multiple nucleotide substitutions in a codon (about 10%). They are excluded from this analysis because their evolution patterns are different from single nucleotide polymorphism (ex. It takes two steps for ctc mutating to gcc, which could be either of these two processes: 1. ctc→gtc→gcc; 2. tct→ccc→gcc)

Reviewer's report 3

Shamil Sunyaev, Harvard Medical School, MA, USA

Second review following revisions:

This manuscript presents an analysis of HIV protease sequencing datasets. Comparing synonymous and non-synonymous sequence changes, the authors detected sites under putative positive selection possibly in response to drug treatment.

The manuscript is based on the earlier work by the same authors (J Virol. 2004 Apr;78(7):3722–32.) and introduces a new concept of conditional positive selection. The authors argue that selection in some sites is conditional on amino acid residue at a different site.

This is an interesting work and the manuscript has improved upon revision. The revised manuscript includes discussion of the existing literature on co-evolution of amino acid residues. However, I would still like to make a few comments, some of them being critical.

1) A few comments on the method

a) The authors make a series of simplifications including assumptions of star-like phylogeny and absence of multiple substitutions in individual codons. Therefore, the method assumes that all sequences are descendants of a single sequence and the estimation of Ka and Ks is done by simple counting. I agree with the authors that currently available methods are not applicable to the dataset of ~50,000 sequences. Thus, the authors have no other choice rather than to rely on simpler computationally efficient methods and making these assumptions is a necessity. The authors also point that the number of codons with multiple substitutions is not expected to be high. However, if the above assumptions are not valid (and it is not clear why the phylogeny is expected to be star-like), the results of the analysis have to be received with caution. Inference of positive selection can be erroneous. Analysis of statistical significance which relies on the binomial distribution and application of the Fisher exact test are not justified if these assumptions are not satisfied.

Author response

We have identified all mutations (both single nucleotide polymorphisms and multiple nucleotide polymorphisms) in this dataset. The results show that only a very small fraction of the mutations involve multiple substitutions in a codon (about 10%). They are excluded from this analysis because their evolution patterns are different from single nucleotide polymorphism.

HIV has extremely high mutation rate and high recombination rate. And it is under strong selection pressure, especially when the patient is under drug treatment. These same factors tend to make HIV evolution within a specific subtype follow the star topology model.

b) It is my understanding that frequencies of transition and transversion rates were estimated using both synonymous and non-synonymous changes. It may be better to use synonymous changes only.

Author response

As suggested by the reviewer, we have estimated the transition and transversion rates only using synonymous changes (F t /F v = 23.75). This does not alter the results of selection pressure and conditional selection pressure analysis much. For the purposes of this paper (selection pressure analysis) we prefer to use the more conservative assumption of measuring F t /F v from all nucleotides (F t /F v = 8.90), since this matches what has been published previously (Wilson, J. W., J Clin Microbiol, 2000).

c) Testing all pairs of codons creates an obvious multiple test problem. The manuscript would probably benefit from the discussion on how the rate of false-positive predictions is controlled given this multiple test problem.

2) It is of interest whether amino acid residue pairs which display conditional selection located close to each other in protein structure. The structural analysis can serve as a confirmation of the conclusions made by the authors. However, I agree with the authors's response that this should be a subject of a separate study.

3) Most importantly, I do not find the explanation of the observed "conditional selection" satisfactory. The authors argue that mutations with significant "conditional Ka/Ks values" are not directly involved in the drug resistance and play accessory role. The first mutation which is directly involved in the drug resistance is destabilizing (in other words, is expected to be deleterious in the absence of the drug treatment) and the second mutation is needed to compensate the destabilizing effect of the first mutation. Although, this is certainly a possibility, it is feasible to imagine many other scenarios. For example, is it possible that the first mutation enables the effect of the second mutation on the drug response, i.e. the first mutation is not destabilizing but the second mutation would not have an effect on drug resistance without the first mutation? Alternatively, is it possible that the first mutation is not destabilizing, whereas the second mutation would be destabilizing (and deleterious) in the absence of the first mutation (Dobzhansky-Muller incompatibility)? In this case also the second mutation may be directly involved in drug resistance. Importantly, the latter scenarios do not involve initial increase in frequency of a destabilizing variant.


  1. 1.

    Mahalingam B, Boross P, Wang YF, Louis JM, Fischer CC, Tozser J, Harrison RW, Weber IT: Combining mutations in HIV-1 protease to understand mechanisms of resistance. Proteins 2002,48(1):107-116. 10.1002/prot.10140

  2. 2.

    Ohtaka H, Schon A, Freire E: Multidrug resistance to HIV-1 protease inhibition requires cooperative coupling between distal mutations. Biochemistry 2003,42(46):13659-13666. 10.1021/bi0350405

  3. 3.

    Rousseau MN, Vergne L, Montes B, Peeters M, Reynes J, Delaporte E, Segondy M: Patterns of resistance mutations to antiretroviral drugs in extensively treated HIV-1-infected patients with failure of highly active antiretroviral therapy. J Acquir Immune Defic Syndr 2001,26(1):36-43.

  4. 4.

    Johnson VA, Brun-Vezinet F, Clotet B, Conway B, D'Aquila RT, Demeter LM, Kuritzkes DR, Pillay D, Schapiro JM, Telenti A, Richman DD: Update of the drug resistance mutations in HIV-1: 2004. Top HIV Med 2004,12(4):119-124.

  5. 5.

    Muzammil S, Ross P, Freire E: A major role for a set of non-active site mutations in the development of HIV-1 protease drug resistance. Biochemistry 2003,42(3):631-638. 10.1021/bi027019u

  6. 6.

    Dybul M, Fauci AS, Bartlett JG, Kaplan JE, Pau AK: Guidelines for using antiretroviral agents among HIV-infected adults and adolescents. Recommendations of the Panel on Clinical Practices for Treatment of HIV. MMWR Recomm Rep 2002,51(RR-7):1-55.

  7. 7.

    Richman DD, Havlir D, Corbeil J, Looney D, Ignacio C, Spector SA, Sullivan J, Cheeseman S, Barringer K, Pauletti D: Nevirapine resistance mutations of human immunodeficiency virus type 1 selected during therapy. J Virol 1994,68(3):1660-1666.

  8. 8.

    Wei X, Ghosh SK, Taylor ME, Johnson VA, Emini EA, Deutsch P, Lifson JD, Bonhoeffer S, Nowak MA, Hahn BH: Viral dynamics in human immunodeficiency virus type 1 infection. Nature 1995,373(6510):117-122. 10.1038/373117a0

  9. 9.

    Wu TD, Schiffer CA, Gonzales MJ, Taylor J, Kantor R, Chou S, Israelski D, Zolopa AR, Fessel WJ, Shafer RW: Mutation patterns and structural correlates in human immunodeficiency virus type 1 protease following different protease inhibitor treatments. J Virol 2003,77(8):4836-4847. 10.1128/JVI.77.8.4836-4847.2003

  10. 10.

    Chen L, Perlina A, Lee CJ: Positive selection detection in 40,000 human immunodeficiency virus (HIV) type 1 sequences automatically identifies drug resistance and positive fitness mutations in HIV protease and reverse transcriptase. J Virol 2004,78(7):3722-3732. 10.1128/JVI.78.7.3722-3732.2004

  11. 11.

    Li WH: Unbiased estimation of the rates of synonymous and nonsynonymous substitution. J Mol Evol 1993, 36: 96-99. 10.1007/BF02407308

  12. 12.

    Nei M, Gojobori T: Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol Biol Evol 1986,3(5):418-426.

  13. 13.

    Mahalingam B, Louis JM, Hung J, Harrison RW, Weber IT: Structural implications of drug-resistant mutants of HIV-1 protease: high-resolution crystal structures of the mutant protease/substrate analogue complexes. Proteins 2001,43(4):455-464. 10.1002/prot.1057

  14. 14.

    Perryman AL, Lin JH, McCammon JA: HIV-1 protease molecular dynamics of a wild-type and of the V82F/I84V mutant: possible contributions to drug resistance and a potential new target site for drugs. Protein Sci 2004,13(4):1108-1123. 10.1110/ps.03468904

  15. 15.

    Wittayanarakul K, Aruksakunwong O, Saen-oon S, Chantratita W, Parasuk V, Sompornpisut P, Hannongbua S: Insights into saquinavir resistance in the G48V HIV-1 protease: quantum calculations and molecular dynamic simulations. Biophys J 2005,88(2):867-879. 10.1529/biophysj.104.046110

  16. 16.

    Devereux HL, Emery VC, Johnson MA, Loveday C: Replicative fitness in vivo of HIV-1 variants with multiple drug resistance-associated mutations. J Med Virol 2001,65(2):218-224. 10.1002/jmv.2023

  17. 17.

    Gobel U, Sander C, Schneider R, Valencia A: Correlated mutations and residue contacts in proteins. Proteins 1994,18(4):309-317. 10.1002/prot.340180402

  18. 18.

    Gordon M, De Oliveira T, Bishop K, Coovadia HM, Madurai L, Engelbrecht S, Janse van Rensburg E, Mosam A, Smith A, Cassol S: Molecular characteristics of human immunodeficiency virus type 1 subtype C viruses from KwaZulu-Natal, South Africa: implications for vaccine and antiretroviral control strategies. J Virol 2003,77(4):2587-2599. 10.1128/JVI.77.4.2587-2599.2003

  19. 19.

    Novitsky V, Smith UR, Gilbert P, McLane MF, Chigwedere P, Williamson C, Ndung'u T, Klein I, Chang SY, Peter T, Thior I, Foley BT, Gaolekwe S, Rybak N, Gaseitsiwe S, Vannberg F, Marlink R, Lee TH, Essex M: Human immunodeficiency virus type 1 subtype C molecular phylogeny: consensus sequence for an AIDS vaccine design? J Virol 2002,76(11):5435-5451. 10.1128/JVI.76.11.5435-5451.2002

  20. 20.

    Ndung'u T, Renjifo B, Novitsky VA, McLane MF, Gaolekwe S, Essex M: Molecular cloning and biological characterization of full-length HIV-1 subtype C from Botswana. Virology 2000,278(2):390-399. 10.1006/viro.2000.0583

  21. 21.

    Kuiken CL, Foley B, Hahn B, Marx PA, McCutchan F, Mellors JW, Mullins JI, Wolinsky S, and Korber B (Eds): Human retroviruses and AIDS: a compilation and analysis of nucleic and amino acid sequences In Theoretical Biology and Biophysics Group. Los Alamos National Laboratory, Los Alamos, NM.; 1999.

  22. 22.

    Gonzales MJ, Machekano RN, Shafer RW: Human immunodeficiency virus type 1 reverse-transcriptase and protease subtypes: classification, amino acid mutation patterns, and prevalence in a northern California clinic-based population. J Infect Dis 2001,184(8):998-1006. 10.1086/323601

  23. 23.

    Suzuki Y, Gojobori T: A method for detecting positive selection at single amino acid sites. Mol Biol Evol 1999,16(10):1315-1328.

  24. 24.

    Yang Z, Nielsen R, Goldman N, Pedersen AM: Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 2000,155(1):431-449.

  25. 25.

    Para MF, Glidden DV, Coombs RW, Collier AC, Condra JH, Craig C, Bassett R, Leavitt R, Snyder S, McAuliffe V, Boucher C: Baseline human immunodeficiency virus type 1 phenotype, genotype, and RNA response after switching from long-term hard-capsule saquinavir to indinavir or soft-gel-capsule saquinavir in AIDS clinical trials group protocol 333. J Infect Dis 2000,182(3):733-743. 10.1086/315769

  26. 26.

    Hertogs K, Bloor S, Kemp SD, Van den Eynde C, Alcorn TM, Pauwels R, Van Houtte M, Staszewski S, Miller V, Larder BA: Phenotypic and genotypic analysis of clinical HIV-1 isolates reveals extensive protease inhibitor cross-resistance: a survey of over 6000 samples. Aids 2000,14(9):1203-1210. 10.1097/00002030-200006160-00018

  27. 27.

    Quiñones-Mateu MiguelE, Arts EricJ: HIV-1 Fitness: Implications for Drug Resistance, Disease Progression, and Global Epidemic Evolution. In HIV Sequence Compendium. Edited by: Kuiken C, McCutchan F, Foley B, Mellors JW, Hahn B, Mullins J, Marx P, Wolinsky S. Los Alamos, NM , Theoretical Biology and Biophysics Group, Los Alamos National Laboratory; 2001.

  28. 28.

    Shafer RW, Dupnik K, Winters MA, Eshleman SH: A Guide to HIV-1 Reverse Transcriptase and Protease Sequencing for Drug Resistance Studies. In HIV Sequence Compendium. Edited by: Kuiken C, McCutchan F, Foley B, Mellors JW, Hahn B, Mullins J, Marx P, Wolinsky S. Los Alamos, NM , Theoretical Biology and Biophysics Group, Los Alamos National Laboratory; 2001.

  29. 29.

    Croteau G, Doyon L, Thibeault D, McKercher G, Pilote L, Lamarre D: Impaired fitness of human immunodeficiency virus type 1 variants with high-level resistance to protease inhibitors. J Virol 1997,71(2):1089-1096.

  30. 30.

    Mammano F, Trouplin V, Zennou V, Clavel F: Retracing the evolutionary pathways of human immunodeficiency virus type 1 resistance to protease inhibitors: virus fitness in the absence and in the presence of drug. J Virol 2000,74(18):8524-8531. 10.1128/JVI.74.18.8524-8531.2000

  31. 31.

    Martinez-Picado J, Savara AV, Sutton L, D'Aquila RT: Replicative fitness of protease inhibitor-resistant mutants of human immunodeficiency virus type 1. J Virol 1999,73(5):3744-3752.

  32. 32.

    Kozal MJ, Shah N, Shen N, Yang R, Fucini R, Merigan TC, Richman DD, Morris D, Hubbell E, Chee M, Gingeras TR: Extensive polymorphisms observed in HIV-1 clade B protease gene using high-density oligonucleotide arrays. Nat Med 1996,2(7):753-759. 10.1038/nm0796-753

  33. 33.

    Wakefield MJ, Maxwell P, Huttley GA: Vestige: maximum likelihood phylogenetic footprinting. BMC Bioinformatics 2005,6(1):130. 10.1186/1471-2105-6-130

  34. 34.

    Atchley WR, Wollenberg KR, Fitch WM, Terhalle W, Dress AW: Correlations among amino acid sites in bHLH protein domains: an information theoretic analysis. Mol Biol Evol 2000,17(1):164-178.

  35. 35.

    Nordborg M, Tavare S: Linkage disequilibrium: what history has to tell us. Trends Genet 2002,18(2):83-90. 10.1016/S0168-9525(02)02557-X

  36. 36.

    Loeb DD, Swanstrom R, Everitt L, Manchester M, Stamper SE, Hutchinson CAIII: Complete mutagenesis of the HIV-1 protease. Nature 1989, 340: 397-400. 10.1038/340397a0

  37. 37.

    Holguin A, Paxinos E, Hertogs K, Womac C, Soriano V: Impact of frequent natural polymorphisms at the protease gene on the in vitro susceptibility to protease inhibitors in HIV-1 non-B subtypes. J Clin Virol 2004,31(3):215-220. 10.1016/j.jcv.2004.03.015

  38. 38.

    Wainberg MA: HIV-1 subtype distribution and the problem of drug resistance. Aids 2004, 18 Suppl 3: S63-8. 10.1097/00002030-200406003-00012

  39. 39.

    Chelvanayagam G, Jakobsen IB, Gao X, Easteal S: Structural comparison of major histocompatibility complex class I molecules and homology modelling of five distinct human leukocyte antigen-A alleles. Protein Eng 1996, 9: 1151-1164.

  40. 40.

    Larson SM, Di Nardo AA, Davidson AR: Analysis of covariation in an SH3 domain sequence alignment: applications in tertiary contact prediction and the design of compensating hydrophobic core substitutions. J Mol Biol 2000,303(3):433-446. 10.1006/jmbi.2000.4146

  41. 41.

    Noivirt O, Eisenstein M, Horovitz A: Detection and reduction of evolutionary noise in correlated mutation analysis. Protein Eng Des Sel 2005,18(5):247-253. 10.1093/protein/gzi029

  42. 42.

    Oliveira L, Paiva AC, Vriend G: Correlated mutation analyses on very large sequence families. Chembiochem 2002,3(10):1010-1017. 10.1002/1439-7633(20021004)3:10<1010::AID-CBIC1010>3.0.CO;2-T

  43. 43.

    Pollock DD, Taylor WR, Goldman N: Coevolving protein residues: maximum likelihood identification and relationship to structure. J Mol Biol 1999,287(1):187-198. 10.1006/jmbi.1998.2601

  44. 44.

    Pritchard L, Bladon P, J MOM, M JD: Evaluation of a novel method for the identification of coevolving protein residues. Protein Eng 2001,14(8):549-555. 10.1093/protein/14.8.549

  45. 45.

    Bonhoeffer S, Chappey C, Parkin NT, Whitcomb JM, Petropoulos CJ: Evidence for positive epistasis in HIV-1. Science 2004,306(5701):1547-1550. 10.1126/science.1101786

  46. 46.

    Liberles DA, Schreiber DR, Govindarajan S, Chamberlin SG, Benner SA: The Adaptive Evolution Database (TAED). Genome Biol 2001,2(8):RESEARCH0028. 10.1186/gb-2001-2-8-research0028

Download references


We wish to thank Drs. B. Boyadzhyan, I. Chen, C. French, C. Ramirez Kitchen, P. Krogstad, T. Schutzbank, N. Wylie, and T. Yashina for valuable discussion and comments on this work, and A. Alekseyenko for discussion of simulation studies. This work was funded by a Dreyfus Foundation Teacher-Scholar Award to C.J.L., and by the National Institutes of Health through the NIH Roadmap for Medical Research, Grant U54 RR021813 entitled Center for Computational Biology (CCB). Information on the National Centers for Biomedical Computing can be obtained from

Author information

Correspondence to Christopher Lee.

Additional information

Authors' contributions

L. Chen carried out the sequence alignment, mutation detection, selection pressure and conditional selection pressure analysis. C. Lee proposed the conditional Ka/Ks metric, participated in the design of the study, and drafted the manuscript.

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Authors’ original file for figure 5

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and Permissions

About this article


  • Codon
  • Independent Dataset
  • Amino Acid Mutation
  • Author Response
  • Conditional Selection