Assuming that, at a particular early stage of evolution, the primordial genetic code consisted of 16 supercodons, we postulate the following 'parsimony principle':
If the primordial code encoded an amino acid, then this amino acid was encoded by the same supercodon (four-codon series) that encodes the same amino acid in the standard genetic code (or, at least, a subset of the series encodes the same amino acid).
The expansion of the code from codons with two meaningful letters to codons with three meaningful letters is required to involve the minimum possible number of amino acid reassignments; accordingly, expansion of the code only allows recruitment of a subset of codons in a supercodon for a new amino acid but not reassignment of codons within the primordial set of amino acids. This assumption is natural because reassignment of amino acids between supercodon series is substantially more disruptive than capturing new amino acids within pre-existing codon series [13]. With one exception, there are no contradictions between the list of putative ancestral amino acids (1) and the parsimony principle: most of the 'early' amino acids are encoded by four-codon series, and only two, Asp and Glu, do not satisfy the two-letter code scheme and the parsimony principle in that they are encoded by the same supercodon. Following the suggestion of Travers [30], we speculate that decoding of the supercodon GAN initially was stochastic, that is, these very similar amino acids were incorporated more or less randomly in response to the codons of this series, and differentiation of Asp and Glu was established only after the expansion of the genetic code to three-letter codons.
Using the parsimony principle, the primordial two-letter code can be partially reconstructed as shown in Fig. 1. Obviously, the parsimony principle does not allow one to infer the assignment for those supercodons that, in the standard code, do not encode any of the primordial amino acids (question marks in Fig. 1). To fill these gaps, additional assumptions on the amino acid assignments are required.
It is instructive to compare the putative core of the primordial genetic code in Fig. 1 with the order of stabilities of the interactions between the first two bases of codons and the cognate anticodons [30] (Fig. 2). There is a striking congruence between the two lists of amino acids. Indeed, the supercodons for 10 early amino acids include 9 of the top 10 most strongly interacting dinucleotides as determined by the stacking and melting thermostabilities. The sole exception is the supercodon CGN that encodes Arg, not an early amino acid, but is more stable than CUN and AUN which encode the early amino acids Leu and Ile, respectively (Fig. 2).
The standard genetic code is manifestly non-random. In particular, the assignments of amino acids to codons are such that the detrimental effect of mistranslation and/or mutation is minimized. That is, in the standard genetic code, codons that differ by one nucleotide code for physicochemically similar amino acids, thus reducing the cost of possible mistranslations and mutations. Quantitative evidence in support of this error-minimization property comes from the comparison of the standard code with random alternatives [11, 34–36]. It is thus necessary, when considering any scenario for the origin and evolution of the code, to account for this property. There are two possible explanations for error minimization in the code. The first possibility is that the high degree of error minimization is a byproduct of other processes that shaped the structure of the genetic code (e.g., [13, 37, 38]). The alternative is the error-minimization (adaptive) theory of the code's evolution which posits that the code evolved under the selective pressure to reduce the consequences of mistranslations and/or mutations [39]. Here we use the same quantitative approach ([11] and see Methods for details) to estimate the error-minimization level of the putative primordial 'two-letter' codes that have at their core the amino acid assignments shown in Fig. 1.
For the time being, let us disregard the unassigned entries in the code table (question marks in Fig. 1). For any permutation of the amino acid assignments in the code table, a code cost can be calculated. This cost depends on the probability of a given mistranslation error and on the relative cost associated with the replacement of the corresponding wild-type amino acid with a new one (see Methods for the exact details of the calculation of the code cost). Disregarding the unassigned supercodons but otherwise allowing all permutations of amino acid assignments within the rest of the supercodons (9, 10 or 11, depending on whether amino acids are assigned to the UUN and AGN supercodons or not), we find that the code structure in Fig. 1 is close to optimal in terms of error minimization. More precisely, the code structure in Fig. 1 is extremely robust to translational errors irrespective of the assignments of the UUN or AGN supercodons. In two of the four possible cases (Fig. 3a and 3d), there is no permutation that would reduce the cost of the code, that is, the minimization percentage (MP; see Methods for details) of the code is 1; in the other two cases, the optimal codes differ from the code in Fig. 1 only by permutations in the second column, and the MP of these codes is greater than 0.98 (Fig. 3b and 3c).
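The permutation test described above can be sketched in a few lines. This is a minimal illustration and not the calculation from the Methods: the cost function (squared property differences over supercodons that differ in one letter, with uniform error probabilities), the property values (placeholders roughly on the polar requirement scale), and the MP formula (mean random cost minus code cost, divided by mean random cost minus minimum cost) are all assumptions made for this sketch. A seven-supercodon subset is used only to keep the exhaustive enumeration fast.

```python
from itertools import permutations
from statistics import mean

# Placeholder physicochemical values (assumption; see the Methods for the
# measure actually used in the paper).
PROPERTY = {"Val": 5.6, "Leu": 4.9, "Pro": 6.6, "Thr": 6.6,
            "Ala": 7.0, "Asp": 13.0, "Gly": 7.9}

def neighbors(a, b):
    """True if two doublets (supercodons) differ in exactly one letter."""
    return sum(x != y for x, y in zip(a, b)) == 1

def code_cost(code):
    """Sum of squared property differences over neighbouring assigned doublets.

    Unassigned supercodons are simply absent from `code` and are disregarded.
    """
    cells = sorted(code)
    return sum((PROPERTY[code[a]] - PROPERTY[code[b]]) ** 2
               for i, a in enumerate(cells) for b in cells[i + 1:]
               if neighbors(a, b))

def minimization_percentage(code):
    """MP = (mean cost - code cost) / (mean cost - minimum cost), over all
    permutations of the amino acids among the assigned cells (assumed form)."""
    cells = sorted(code)
    amino_acids = [code[c] for c in cells]
    costs = [code_cost(dict(zip(cells, p))) for p in permutations(amino_acids)]
    c_mean, c_min = mean(costs), min(costs)
    if c_mean == c_min:          # every permutation costs the same
        return 1.0
    return (c_mean - code_cost(code)) / (c_mean - c_min)

# Toy subset of the core assignments (GAN is taken as Asp here, an assumption).
toy = {"GU": "Val", "GC": "Ala", "GA": "Asp", "GG": "Gly",
       "CU": "Leu", "CC": "Pro", "AC": "Thr"}
print(round(minimization_percentage(toy), 3))
```

With this definition an MP of 1 means that no permutation of the assignments lowers the cost, which is the situation reported for the codes in Fig. 3a and 3d.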
One possible interpretation of the high robustness of the doublet codes shown in Fig. 3 could be that, with this particular choice of amino acids and supercodons, and the employed measure of the code cost, most of the random codes yield low cost. However, this is not the case, as can be seen from the distribution of random code costs shown in Fig. 4 for the versions of the code from Figs. 3a and 3d. Interestingly, the cost distribution for the code from Fig. 3a is bimodal (a similar distribution was obtained for the code in Fig. 3b; not shown), whereas the distribution for the code from Fig. 3d is a more typical, roughly bell-shaped one. The difference between the cost of the standard code (Fig. 1) and the means of the distributions, measured in standard deviations, is 2.2, 2.65, 2.91, and 2.5 for the cases (a), (b), (c), and (d) in Fig. 3, respectively. Even in the cases (b) and (c), where the assignment of amino acids to supercodons could be improved, the code structure in Fig. 1 is extremely close to the optimum (that is, the global cost minimum).
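The distance "measured in standard deviations" can be made concrete as follows. This is a hedged sketch: the function names are hypothetical, the use of the population standard deviation is an assumption, and `cost_fn` stands for whatever code cost is adopted (see Methods).

```python
import random
from statistics import mean, pstdev

def sd_distance(observed_cost, random_costs):
    """Number of standard deviations by which the observed cost lies
    below the mean of the random-code cost distribution."""
    return (mean(random_costs) - observed_cost) / pstdev(random_costs)

def sample_random_costs(assignments, cost_fn, n_samples=1000, seed=0):
    """Costs of codes obtained by shuffling the same amino acids
    among the same supercodons (cells)."""
    rng = random.Random(seed)
    cells = list(assignments)
    amino_acids = list(assignments.values())
    costs = []
    for _ in range(n_samples):
        rng.shuffle(amino_acids)
        costs.append(cost_fn(dict(zip(cells, amino_acids))))
    return costs

# Worked toy numbers: the sample has mean 5 and standard deviation 2,
# so an observed cost of 1 lies 2 standard deviations below the mean.
print(sd_distance(1, [2, 4, 4, 4, 5, 5, 7, 9]))  # 2.0
```

For a bimodal distribution such as the one reported for Fig. 3a, this single number is only a rough summary, so it is best read together with the histogram itself.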
Thus, we showed that the part of the putative two-letter primordial genetic code that can be unambiguously inferred assuming the list of early amino acids (1) and the parsimony principle is, in effect, optimal with respect to the error-minimization property. It seems virtually impossible to explain away this 'perfect' structure as a by-product of some evolutionary process for which error minimization is of secondary importance or neutral. Neither is it possible to explain these codon assignments by random effects because, for instance, for the code in Fig. 3a, there are 181440 (9!/2) alternatives, all of which are worse than the one shown in the figure.
There is, of course, a major caveat in these conclusions. The code cost function is not linear in the sense that adding another amino acid generally destroys the optimal assignments. Given that we disregarded some of the supercodons when performing the numerical experiments described above, the observed extreme error minimization of the putative primordial 2-letter code might be illusory. Therefore, additional assumptions were necessary to fill those supercodons of the 2-letter codes that do not have amino acid assignments after applying the parsimony principle to the standard code given the list of early amino acids (1). A possible solution, which we consider first, is to fill unassigned cells with the amino acids from the same column, in accordance with the 'four-column' theory of the origin of the genetic code [13, 40]. For instance, consider the code in Fig. 5. We take the amino acid assignments from Fig. 1 whenever possible, disregard Ser for supercodon AGN, so that the whole column codes for the same amino acid, and either assign Leu to UUN, because the closest amino acid in this code is Leu, or assume the existence of two supercodons for Leu (incidentally, the most abundant amino acid in extant proteins) already at the 2-letter stage of the code's evolution. Allowing random permutations of amino acid assignments within the colored cells in Fig. 5 and filling other cells using the 'column-wise' approach, the error minimization properties of the code in Fig. 5 can be assessed. It turns out that the code in Fig. 5 is also highly robust, although not quite at the level of the abridged codes in Fig. 3 (Fig. 6). Specifically, if supercodon UUN is filled using the assignment of CUN (Val), the MP of the code from Fig. 5 is 0.94 (Fig. 6a); if two supercodons for Leu are assumed, then the MP is 0.987, and the optimal code is very close to that in Fig. 5 (Fig. 6b).
In both cases, lowest cost was obtained for the assignments where the third and fourth columns code for Asp and Gly, respectively. The distributions of the random code costs are shown in Fig. 7.
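The 'column-wise' filling rule can be illustrated with a short sketch. The function name and the handling of ambiguous columns are assumptions made here for illustration: a vacant supercodon is filled only when its column (all doublets sharing the second letter) carries a single amino acid, and is otherwise left open for an explicit choice, as is the case for UUN. GAN is taken as Asp, in line with the lowest-cost assignments mentioned above.

```python
BASES = "UCAG"

def fill_by_column(core):
    """Fill unassigned doublets with the amino acid of their column.

    A column is the set of doublets sharing the second letter. A vacant
    cell is filled only if its column holds exactly one distinct amino
    acid; otherwise it is set to None and needs an explicit decision.
    """
    filled = {}
    for first in BASES:
        for second in BASES:
            cell = first + second
            if cell in core:
                filled[cell] = core[cell]
                continue
            column = {aa for c, aa in core.items() if c[1] == second}
            filled[cell] = next(iter(column)) if len(column) == 1 else None
    return filled

# Core assignments for the early amino acids (Ser at AGN is disregarded).
core = {"GU": "Val", "CU": "Leu", "AU": "Ile",
        "UC": "Ser", "CC": "Pro", "AC": "Thr", "GC": "Ala",
        "GA": "Asp", "GG": "Gly"}
code = fill_by_column(core)
print(code["UA"], code["CG"], code["UU"])  # Asp Gly None
```

Leaving UUN as None makes the remaining choice explicit rather than hiding it inside the filling rule.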
Thus, at least the part of the 2-letter code that can be inferred from the standard code using the set of (putative) primordial amino acids, the parsimony principle, and a straightforward additional assumption for assigning the remaining supercodons is structured in such a way that an a priori chosen standard cost function (see Methods) renders the code near-optimal. Indeed, the most conservative estimates yield MP > 0.98 for the cases when the question marks in Fig. 1 are disregarded, and MP > 0.94 when the 'four-column' theory is used to assign amino acids to the unassigned supercodons (Fig. 6), in sharp contrast to the 78% MP for the standard code [12] (this estimate was obtained using the same cost function as described in the Methods section but for the complete, standard genetic code, and is somewhat higher than the previously reported estimates [41]).
A different approach to assigning the vacant supercodons in the 2-letter code in Fig. 1 involves using the parsimony principle not only for the putative early amino acids but for all supercodons. Under this strategy, the 2-letter codes cease to be special with respect to error minimization. Consider, for instance, the code shown in Fig. 8a, which is obtained from the standard code using the parsimony principle. This version of the 2-letter code was proposed as a possible ancestral code [42] and was analyzed with respect to error minimization [43]. This code has an MP of 0.51, and the result does not change qualitatively when ambiguous amino acid assignments are changed (for instance, when Gln is substituted for His). Here our conclusion is in agreement with the conclusions of Butler et al. [43], which were obtained using a different cost function.
With regard to the low error minimization in 2-letter codes obtained using the parsimony principle, we were interested in determining which amino acid assignments contributed the most to this non-optimality. In the standard genetic code, the most non-optimally assigned amino acid is Arg [11]; the underlying reason is not only the placement of Arg in the code table as such but also the fact that Arg has 6 codons and so makes a disproportionate contribution to the cost of the code. In 2-letter codes, an amino acid can be encoded by two supercodons at the most, so it would not be surprising if an amino acid(s) other than Arg occupied the 'worst' position from the point of view of the error minimization.
To address this question for 2-letter codes but taking into account all 20 standard amino acids, we devised the following experiment: for a given natural number N ≤ 16, choose randomly N cells in the 16-cell code table. Then assign amino acids to the chosen cells according to the parsimony principle (if for some cells two amino acids are encoded in the respective 4-codon series in the standard code, one is randomly chosen). Allowing permutations of amino acid assignments between these fixed N cells, we can estimate the MP for a given code. Other cells, not chosen in the experiment, can be disregarded, as it was done for the code in Fig. 1, or filled by using, e.g., the four-column rule specified above, as in Fig. 5. Repeating this procedure and collecting random codes with high MP, we can rank the amino acids by the frequency with which they are found in highly optimized codes and similarly rank the cells (supercodons) in the code table.
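The sampling and ranking procedure can be sketched as below. The table of four-codon series follows the standard genetic code (stop codons omitted); everything else is an illustrative assumption: the function names are hypothetical, `mp_fn` is a placeholder for whatever MP estimator is used (see Methods), and the threshold that defines a "high MP" code is arbitrary.

```python
import random
from collections import Counter

# Amino acids encoded within each four-codon series of the standard code.
SERIES = {
    "UU": ["Phe", "Leu"], "UC": ["Ser"], "UA": ["Tyr"], "UG": ["Cys", "Trp"],
    "CU": ["Leu"], "CC": ["Pro"], "CA": ["His", "Gln"], "CG": ["Arg"],
    "AU": ["Ile", "Met"], "AC": ["Thr"], "AA": ["Asn", "Lys"], "AG": ["Ser", "Arg"],
    "GU": ["Val"], "GC": ["Ala"], "GA": ["Asp", "Glu"], "GG": ["Gly"],
}

def random_parsimony_code(n, rng):
    """Pick n cells at random and assign each an amino acid by the
    parsimony principle, choosing randomly where two are possible."""
    cells = rng.sample(sorted(SERIES), n)
    return {cell: rng.choice(SERIES[cell]) for cell in cells}

def tally_high_mp(n, trials, mp_fn, threshold=0.95, seed=0):
    """Count how often each amino acid and each cell occurs in random
    codes whose MP (as computed by mp_fn) reaches the threshold."""
    rng = random.Random(seed)
    aa_counts, cell_counts = Counter(), Counter()
    for _ in range(trials):
        code = random_parsimony_code(n, rng)
        if mp_fn(code) >= threshold:
            aa_counts.update(code.values())
            cell_counts.update(code.keys())
    return aa_counts, cell_counts
```

Amino acids and cells that are rarely found among the high-MP codes are the candidates for the least optimal assignments; `Counter.most_common()` gives the ranking directly.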
Independent of the number of chosen cells N and the strategy that is used to fill (or not to fill) the remaining cells, the results qualitatively appear as shown in Fig. 9. The general conclusion is that the major source of non-optimality of 2-letter codes obtained with the parsimony principle (as in Fig. 8a) is the amino acid assignments in the supercodons UAN and UGN, which correspond to Tyr, Cys, and Trp (and two of the three stop codons) in the standard code. We were unable to discriminate the effects of other amino acids, except that these effects were relatively small and sensitive to the choice of N (Fig. 9 and data not shown), but the non-optimality of the assignments of Tyr, Cys, and Trp was striking and unambiguous (Fig. 9).
Taking into account that Tyr, Cys, and Trp are among the 'latest' amino acids according to Trifonov's consensus of amino acid appearance [25], and that they are coded by supercodons with the lowest stability of codon-anticodon interactions (Fig. 2), it appears most likely that the primordial 2-letter genetic code did not accommodate these amino acids, which were added to the amino acid repertoire only after the transition to the standard 3-letter code. Given these observations, we assessed the error minimization level of 2-letter codes without assigning the supercodons UAN and UGN (Fig. 10). Such a 2-letter code is significantly more robust than the fully specified code in Fig. 8a: the MP of this code is 0.88, a value that is significantly greater than the MP of the standard code (0.78), with the probability of finding a better code being approximately 1/50000.
In the original experiment on spontaneous formation of organic compounds, Miller [14] observed detectable amounts of only three amino acids: Ala, Asp and Gly. In most of the subsequent abiogenic synthesis experiments, these amino acids were most abundant. Thus, it seems to be a plausible assumption that these amino acids were the first to be encoded unambiguously in the primordial code, and their positions were fixed by chance ('frozen accident' sensu Crick). We measured the level of error minimization for the 2-letter code, with permutations of amino acid assignments allowed only for the entries other than GCN, GAN, GGN, UAN, and UGN (Fig. 11a).
The codes in this group are not exceptionally robust to translational mistakes (MP is 0.91-0.93 depending on the choice of amino acids for the UUN, CAN, AAN, AGN supercodons). Inspection of the optimal codes readily reveals the main source of this non-optimality: in all optimal solutions Arg changes its position from the fourth to the third column of the table (Fig. 11b). Arginine has a prominent place in the study of the genetic code evolution. From the point of view of the adaptive theory, Arg is the amino acid that brings most non-optimality into the standard code [11, 44, 45]. At the same time, Arg is the amino acid for which the strongest support for a stereochemical affinity with the respective codon is available [46–49].
Having found that the position of Arg is so critical for the code robustness, we conducted the following experiment. We start with the code table in Fig. 10a, with the contribution of the UAN and UGN supercodons disregarded. From all other cells, two amino acids are chosen randomly and their assignments are fixed. Thus, a code table is obtained in which 4 cells are fixed (the two chosen amino acids and the supercodons UAN and UGN), whereas the assignments for the remaining 12 cells are freely permuted, and the MP is calculated for all such permutations. We found that Arg is unique in this setting: for most of the amino acids, pairing with Arg yields the highest MP of all possible pairings. The resulting MP values are all within the range of 0.89 to 0.94, with one notable exception: if the pair Asp-Arg is fixed, then the MP of the code in Fig. 12a is 0.98 (the optimal code is shown in Fig. 12b).