A framework for application of metabolic modeling in yeast to predict the effects of nsSNV in human orthologs

Background We have previously suggested a method for proteome wide analysis of variation at functional residues wherein we identified the set of all human genes with nonsynonymous single nucleotide variation (nsSNV) in the active site residue of the corresponding proteins. 34 of these proteins were shown to have a 1:1:1 enzyme:pathway:reaction relationship, making these proteins ideal candidates for laboratory validation through creation and observation of specific yeast active site knock-outs and downstream targeted metabolomics experiments. Here we present the next step in the workflow toward using yeast metabolic modeling to predict human metabolic behavior resulting from nsSNV. Results For the previously identified candidate proteins, we used the reciprocal best BLAST hits method followed by manual alignment and pathway comparison to identify 6 human proteins with yeast orthologs which were suitable for flux balance analysis (FBA). 5 of these proteins are known to be associated with diseases, including ribose 5-phosphate isomerase deficiency, myopathy with lactic acidosis and sideroblastic anaemia, anemia due to disorders of glutathione metabolism, and two porphyrias, and we suspect the sixth enzyme to have disease associations which are not yet classified or understood based on the work described herein. Conclusions Preliminary findings using the Yeast 7.0 FBA model show lack of growth for only one enzyme, but augmentation of the Yeast 7.0 biomass function to better simulate knockout of certain genes suggested physiological relevance of variations in three additional proteins. Thus, we suggest the following four proteins for laboratory validation: delta-aminolevulinic acid dehydratase, ferrochelatase, ribose-5 phosphate isomerase and mitochondrial tyrosyl-tRNA synthetase. This study indicates that the predictive ability of this method will improve as more advanced, comprehensive models are developed. Moreover, these findings will be useful in the development of simple downstream biochemical or mass-spectrometric assays to corroborate these predictions and detect presence of certain known nsSNVs with deleterious outcomes. Results may also be useful in predicting as yet unknown outcomes of active site nsSNVs for enzymes that are not yet well classified or annotated. Reviewers This article was reviewed by Daniel Haft and Igor B. Rogozin.


Background
An enyzme's active site specificity is an important determinant of functional catalysis. For this reason, if a nonsynonymous single nucleotide variation (nsSNV) occurs at the active site, thereby changing an active site amino acid residue, it is highly likely that enzyme activity will be lost or altered [1][2][3][4][5][6][7]. It follows that pathway activity should also be affected, with a potential end result of disease [8][9][10] or lethality.
In our previous paper [11] we used the SNVDis Tool [12], part of the High-performance Integrated Virtual Environment (HIVE) tools suite (accessed at hive.biochemistry.gwu.edu) [13] to identify the entire set of 559 human proteins with nonsynonymous variation at an active site residue resulting from a single nucleotide variation. Pathway, substrate and product annotation was manually retrieved from Kyoto Encyclopedia of Genes and Genomes (KEGG) [14] for all possible proteins in the dataset. To the best of our knowledge, a subset of 34 of the original proteins had a 1 enzyme:1 pathway:1 substrate/product set ratio, meaning these enzymes catalyze a single reaction for a single set of substrates and products and are currently annotated to participate in a single pathway. While there may be possible interaction between specified metabolites and alternative enzymes, the 1:1:1 relationship ensures no alternative metabolic interaction with the specified enzymes, thus making them suitable candidates for in vivo laboratory validation.
For this paper, we used the reciprocal best BLAST hits method [15] to query the set of 559 human active site nsSNV proteins against the yeast (Saccharomyces cerevisiae) proteome and, conversely, the top yeast protein hits back against the human proteome. We then checked for active site residue conservation and pathway/ substrate/enzyme conservation between the 113 identified human-yeast orthologous pairs. We found 6 proteins which satisfied all criteria, 5 of which had human disease associations including anemias, porphyrias and others. The high incidence of dysfunction annotated among these active-site-nsSNV-containing proteins reinforces the notion that mutations in the active site disrupt normal enzyme and/or pathway activity. Although it may seem intuitive that the active site modifications are to blame for these disease associations, it is surprising to note that the disease-related annotations are not currently attributed to the active site variations for the 5 proteins in question, but are assigned to other sequence variations or causes.
Merely studying genomic variation may not be enough to determine the effects of active site variation, however, as one needs to examine the interactions and physiological outcome within the context of the cell, organ, or entire system [16]. Metabolic modeling is our attempt to test the impact of active site variation in a more meaningful way as such methods get more sophisticated. To this end, we used the Yeast 7.0 flux balance analysis (FBA) model [17] to predict the effects of deleting these 6 proteins, constructed with the relevant nsSNV active site mutation, on the growth of yeast. Since several of these proteins are involved in production of important metabolites that are not contained within the Yeast 7.0 biomass function, we repeated the analysis using a revised biomass equation that incorporates the relevant metabolites. FBA is an established mathematical and computational approach for studying metabolic networks [18,19]. FBA modeling has a number of biotech applications as it allows programmatic prediction of a metabolic phenotype using in silico computations of reaction stoichiometry without using reaction kinetics [20][21][22]. FBA results have been shown to correlate well with experimental observations [23].
In the future, predictions given by this in silico cell model can be subjected to laboratory validation using yeast cultures with site-induced mutagenesis to alter the proper active site residue and observe the outcomes with respect to growth rate, substrate and product quantities, byproduct generation and general viability. We reason that the conservation of these protein sequences from yeast to humans, in conjunction with conserved active site residues and conserved pathway interplay, provides strong evidence that similar outcomes are likely to result from mutations in the human orthologs [24].
Our overarching goal is to better understand phenotypic effects of nsSNVs on the active site of enzymes. Although we use a small set of enzymes here to perform a preliminary proof of principle, this experiment lays the foundation for a method of cellular modeling that will move toward an "omics" approach with potential predictive ability. Similar large-scale studies already demonstrate the utility of this type of approach to predict metabolite concentrations related to environmental conditions [25], enzyme phosphorylation [26] and even whole gene deletion [27], but to the best of our knowledge there is no such effort to apply a metabolomics approach to the analysis of activesite-nsSNV-containing proteins to yield predictive and/or diagnostic information.

Results and discussion
The complete human and yeast proteomes, and all protein subsets, available from UniProtKB/Swiss-Prot, provide curated annotation information regarding both nsSNV and active sites [28]. The dbSNP database [29] provides additional information on variations and the Conserved Domain Database (CDD) [30] provides additional information on active sites. The SNVDis tool uses this information to return an output table with all human proteins containing nonsynonymous substitutions at the active site.
Note here that we do not discriminate between heterozygous and homozygous nsSNVs because pertinent information is unavailable for the majority of human variation data. Previous findings imply that most nsSNVs in active sites are rare or heterozygous because they are expected to be selected against and, therefore, unlikely to be observed in a homozygous pair [31]. Studies of consanguineous families may prove a good source for detection of homozygous active site nsSNVs in the future, but there are currently few documented cases [32] of catalytic homozygote variants. This is problematic as currently available models assume the haploid condition of a yeast cell. Results of reducing the flux to zero in such a model cannot be easily generalized to a diploid, heterozygous state unless known loss of function via haploinsufficiency or dominant negative phenotypes is established. Although we cannot say with certainty that the variations presented in this paper confer this phenotype, there is literature evidencing the heterozygous loss of function associated with active site mutations in a number of mammalian genes including mouse DNA polymerase δ [33], human DNA polymerase γ [34] and mammalian 11β-hydroxysteroid dehydrogenase type I [35]. Furthermore, the aforementioned murine study was extended from studies of homologous mutations in haploid yeast. Thus, while we cannot fully extend haploid modeling to proteins in diploid organisms, we suggest that yeast enzymes whose knockouts alter flux balance necessitate further attention be given to their orthologous human counterparts, for both homozygous and heterozygous variations. Specifically, we propose that haploid models could still provide quantitative value toward the development of predictive assays for highly conserved pathways in cases of true loss of function due to heterozygous mutation at human active sites.
A proteome-wide search in SNVDis returns 559 unique human proteins (Additional file 1: Table A1) with nsSNV at the active site. Subsequent pathway analysis using annotations from KEGG shows a subset of 34 proteins (updated from previous publication to include two additional proteins from the same set of 559 which were later found to meet the criteria) with nsSNV at the active site and a 1 enzyme:1 pathway:1 reaction relationship. Full methods and discussion for these and other preliminary results can be seen in the prior publication.

Reciprocal best BLAST hits method to identify human-yeast ortholog pairs
To identify proteins from yeast (S. cerevisiae) orthologous to the 559 active-site-nsSNV-containing human proteins, the reciprocal best BLAST hits method was used. The sequences for the 559 active-site-nsSNV-containing human proteins were extracted from UniProtKB/Swiss-Prot by UniProt Accession and used as the query in a BLAST [36] against the yeast proteome. Of the 559, 173 proteins have no hits against the yeast proteome with E-value less than 0.0001 and 146 proteins have significant hits but are deemed to be paralogous. Thus, 240 of the 559 human proteins are considered to have a one-way best hit. These 240 best-hit yeast proteins were then used as the query in a second BLAST against the entire human proteome. Of these 240, 127 have a unidirectional best-match such that the highest scoring yeast match from the human BLAST against the yeast proteome has an alternative highestscoring human match when queried against the entire human proteome. This leaves 113 proteins with active site nsSNV to have a reciprocal best-match ortholog in yeast ( Table 1, full reciprocal best BLAST hits data can be shown in Additional file 1: Table A2).

Identification of candidate enzymes for in vivo validation
To be considered ideal for laboratory validation, we want to first limit our consideration to the proteins in Table 2 which have a simple 1 enzyme:1 pathway:1 reaction relationship relationship. This 1:1:1 ratio drastically limits the possibility for an enzyme to affect or be affected by multiple metabolites during laboratory experiments. Of the 34 potential candidates, 11 had no best hit and 10 had a one-way best hit, leaving 11 proteins as potential candidates with reciprocal best-match orthologs.
We must also consider the case when the product of an enzymatic reaction can be synthesized by alternative means. For this paper, we decided to consider these cases due to literature supporting biochemical importance of dysfunction despite multiple pathways to product.
Additionally, we want to ensure the variable site of interest, the active site, is conserved across ortholog pairs. If functional residues are not conserved, we cannot say that experimental observations resulting from a variation in a yeast enzyme is at all indicative of outcomes of variation in the corresponding human enzyme. This was checked by manual examination of alignments ( Figure 1). The minimum requirement for active site conservation is that the residue(s) annotated as the active site residue(s) are identical across species, although many of the humanyeast pairs have blocks of up 10 amino acids conserved across species. 10 of the 11 ortholog pairs were conserved at the active site.
We want to further limit wet-lab candidate enzymes to those with no other orthologs or paralogs. Even though the proteins under consideration at this point are reciprocal best-matches, proteins with closely related homologs limit the confidence with which we can report results of laboratory validation experiments. For example, if a human protein has a best match with a yeast protein which has paralogs, we cannot be sure upon knockout of the enzyme that the paralogous proteins will not interfere in the metabolism we are trying to assay. Exclusion of proteins with orthologs and paralogs was performed by manually examining alignments. This exclusion now reduces our list of potential candidates to 6. Table 3 lists the active site residue and surrounding conserved residues for these 6 proteins.
Finally, we want to make sure the ortholog pairs are involved in similar pathways. If a human protein has diverged to act on different metabolites we again cannot use its yeast ortholog as a model to understand effects of variation in human proteins. Pathway information for the human proteins was previously retrieved and summarized in Table 2. Pathway information for the yeast proteins was This table lists the pool of 113 reciprocal best match human-yeast ortholog pairs, a subset from the original 559 proteins with active site nsSNV.  This table provides the UniProtKB accession ID, normal and variable residues, position of the variation and substrate/product pairs with PubChem IDs for 34 proteins with annotated nsSNVs at the active site. These proteins were chosen for this study based on a one-protein, one-variation, one-substrate/product pair relationship to simplify preliminary modeling by ensuring no complex interactions for the given protein pathway. NOTE: This table has been updated since the prior publication to include 2 additional proteins which were later identified to meet the criteria.
similarly retrieved from KEGG and recorded. Pathway conservation is indicated by similar pathway map with identical substrates and products. All 6 ortholog pairs were found to participate in the same pathways, acting on the same substrates and yielding the same products. Pathway information for these 6 proteins is included in Table 4. Further support for functional similarity can be seen in Figure 2, showing evidenced interactions (collected and curated following guidance by Lim et. al. [37] with other proteins and the distribution of functional human-yeast orthologs predicted by Isobase [38] among these subsets. A summary schema of the overall method used in identification of these 6 candidate proteins is presented in Figure 3.

Yeast 7.0 COBRA modeling
Five of the six identified candidate enzymes were input into the Yeast 7.0 model available at http://pathway.yeastgenome.org/ [17]. Yeast methylthioribulose-1-phosphate dehydratase (MDE1), P47095, had no corresponding gene in the model and was therefore excluded from this analysis. When knockouts are generated in these genes, they all show normal growth rates except ribose-5-phosphate isomerase (RPIA),Q12189, which showed no growth. However, Yeast 7.0 is not equipped to predict growth for all of these reactions because some important compounds and cofactors are not currently accounted for in the model. Findings are summarized in Table 5. (Full output available in Additional file 1: Table A3).

Modified yeast 7.0 modeling with updated biomass function
Under the consideration that a modeling tool with a more comprehensive biomass function could provide more information, we repeated modeling of the 5 relevant enzymes using a biomass function with added requirements for ferroheme b and charged tRNA(Tyr) in the mitochondrion. This model was able to implicate three additional proteins with function in essential reactions: delta-aminolevulinic acid dehydratase (ALADH), P05373; mitochondrial ferrochelatase, P16622; and mitochondrial tyrosine-tRNA ligase, P48527. This additional examination, in conjunction with sequence and pathway conservation, supports our argument for inclusion of the pertinent pathways in yeast cell models. A summary of the findings can be seen in Table 5. (see Additional file 1: Table A3 for full output). The above Yeast 7.0 results, combined with literature support, sequence and pathway annotations, maintain our hypothesis that we should be able to use yeast orthologs with active site conservation to model the effect of a variation at the active site of its human ortholog in order to obtain a predictive overview of altered enzyme activity resulting from the variation. Furthermore, we see the potential for the utility of this method to improve as advanced modeling efforts continue to progress.

Biomedical relevance
We hypothesize that nonsynonymous variation in an enzyme's active site should have deleterious effects on  Figure 1 Manual verification of residue conservation at the active site in both human and yeast orthologs. This is the reciprocal best BLAST hits alignment using yeast protein P05373, Delta-aminolevulinic acid dehydratase for S. cerevisiae, as the query searching against the entire human proteome. The only human hit is P13716, also delta-aminolevulinic acid for H. sapiens. From genbank annotations, we know the active site should occur at position 221 in the human sequence, corresponding to position 232 in the yeast sequence. Here we see conservation not only among the active site residues, but also in the surrounding region, which can also be important in facilitating active site binding.
the enzyme's activity and function. Although only one of the six human enzyme SNPs has a direct link to the Online Mendelian Inheritance in Man® database (OMIM®) [41] in dbSNP (rs28936396 in P48637, human glutathione synthetase linked to OMIM 266130 -Glutathione synthetase deficiency), all six have literature related to dysfunction and disease caused by variants found at other positions within the protein sequence. If our assumptions regarding active site variation are correct, then nsSNVs at the active site of enzymes involved in metabolic syndromes will likely alter protein function and contribute to similar syndromes. Furthermore, yeast orthologs for these enzymes with strong pathway and active site conservation may be used to verify the metabolic effects of such variation with respect to metabolite concentrations. Although the current Yeast 7.0 model supports this hypothesis for RPIA alone, we argue for inclusion of the pathways and metabolites associated with these proteins in future yeast models based on consideration of the augmented model findings, literature and conservation among orthologs. Thus, we present five potential disease-related proteins with active site nsSNVs (along with yeast orthologs) in need of laboratory experimentation to validate the veracity of disease related to the SNV.

Human P49247/Yeast Q12189
Ribose-5 phosphate isomerase (RPI) participates in the pentose phosphate pathway (PPP), an alternative pathway for glucose oxidation accounting for up to 20% of glucose oxidation in normal tissue. [42] Synthesis of ribose 5-phosphate from ribulose 5-phosphate via RPI is required for nucleotide and nucleic acid synthesis, and participates in the downstream production of glycolytic intermediates [43]. With reduced RPI activity, ribose 5-phosphate becomes unavailable and the necessary glycolytic metabolites are not produced. Enzyme deficiency results in clinical symptoms of leukoencephalopathy and peripheral polyneuropathy. The RPIA gene has also been identified to be hypermethylated at CpG sites in breast cancer [44]. As of 2004, RPI was only the second known inborn error in the reversible portion of the PPP [45]; the PPP connects pentose phosphates to glycolytic intermediates. Furthermore, there are currently only 2 mutations listed in the Human Gene Mutation Database [46] (accessed 8/22/13) for the RPIA gene, neither of which is the aspartic acid to tyrosine variation at active site position 160 reported here (rs11549730). RPI has been observed to be variably conserved among species ranging from 52% similarity between human and E. coli to strong conservation noted between human and other mammals [47] and RKI1 disruption has been shown to be deleterious to yeast growth [48].

Human P13716/Yeast P05373
Delta-aminolevulinic acid dehydratase (ALADH), encoded by the ALAD gene in humans, catalyzes the first common step in heme and other tetrapyrrole biosynthesis [49] and has also been observed to interact with the proteasome This table summarizes the pathway involvement of the six identified candidate proteins. For all 6 proteins, the primary pathway involvement is the same, down to the same products and substrates and the same place within a pathway. For three of the yeast proteins, however, the pathway annotation in KEGG listed an additional pathway not annotated for humans: "Biosynthesis of secondary metabolites." This is likely due to differential annotation between yeast and humans within KEGG. Because the substrates and products annotated for these pathways were identical to those of the primary pathways, these three proteins were not excluded and went onto analysis in the yeast FBA modeling. [50]. The wild-type enzyme exists as a high-activity homooctamer, but alternative lower-activity hexamer assemblies have been found [51]. In fact, naturally occurring ALAD porphyria-associated human variants have been shown to have an increased susceptibility to hexamer-stabilizing inhibitors. Lead-poisoning is also hallmarked by reduced ALAD activity. 5-aminolevulinate accumulates as a result of the lowered activity and is responsible for the toxic effects observed in both disease states. It has been suggested that small molecule stabilization of alternate oligomers (morpheeins) can be used as a therapy to target delta-aminolevulinic acid dehydratase [52], in addition to human immunodeficiency virus integrase, tumor necrosis factor α, mammalian ribonucleotide reductase and other disease-related proteins for which there is evidence of alternate, functionally differential oligomers [53].

Figure 2
Similar protein interactions between human and yeast counterparts. Known and predicted protein interactions were retrieved from IntAct [39] and BioGrid [40] using both human and yeast gene names as query terms. Interactions were organized and edited following suggestions by Lim et al. [37]. It is interesting to note that although a sizeable portion of predicted interactors are also predicted to have functional orthology between species, very few of the inter-protein interactions among predicted orthologs were reported for both species.  Mutants in the corresponding yeast HEM2 gene have been found to disrupt heme biosynthesis, requiring provision of heme for growth in culture [54]. Yeast and human ALADH differ with respect to Zn cofactor binding, thus care is required to ensure homology models account for this difference in stoichiometry [55]. However, comparison of primary sequences across a diverse array of species shows a strong degree of similarity in the region of the active site and a zinc-binding motif [56].
There are many studies discussing ALAD-porphyria associated mutations [57,58] but responsibility for the disease state has not yet been assigned to the deletion at nucleotide position 843. Interestingly, this deletion in the active site codon also results in a frameshift, further complicating comprehension of potential disease mechanism(s).

Human P22830/Yeast P16622
Encoded by the FECH gene, ferrochelatase is the final enzyme in heme biosynthesis, catalyzing the formation of protoheme IX from protoporphyrin IX and ferrous iron [59]. Ferrocheletase is important to human health due to the critical and diverse functions of its product cofactor, heme. Defects in this enzyme have been related to erythropoietic protoporphyria (EPP). EPP is a relatively benign disease with predominant manifestation of photosensitivity, although it has been reported with liver complications in 2-5% of patients [60]. Biological findings include elevated concentrations of protoporphyrin in erythrocytes, plasma and feces, and a large amount of protoporphyrin in the skin [61].
Comparisons between ferrochelatases of multiple species reveal the proteins belonging to human, mouse, chicken, frog and Drosophila melanogaster to be metalloenzymes with a [2Fe-2S] cofactor, thought to play a structural stabilizing role in humans [62]. The corresponding bacteria, yeast and plant proteins contain no iron-sulfur center, but the ferrochelatase from S. cerevisiae is similar in other respects to eukaryotic ferrochelatases. Yeast mutants with total or partial FECH activity deficiency and resultant protoporphyrin accumulation have been isolated [63].

Human Q9Y2Z4/Yeast P48527
Aminoacyl-tRNA synthetases (ARSs) catalyze the linkage of specific amino acids to their cognate tRNAs [64]. Mitochondrial tyrosyl-tRNA synthetase (TyrRS) is responsible for charging tyrosine to its tRNA. Although belonging to class I synthetases, tRNA recognition occurs in a way more common to class II ARSs [65]. Defects in TyrRS are associated with myopathy with lactic acidosis and sideroblastic anemia 2 (MLASA2). It has been proposed that diminished aminoacylation activity of the enzyme leads to decreased mitochondrial protein synthesis and mitochondrial respiratory chain dysfunction [66]. TyrRS also has known involvement with cell-signaling [67] and angiogenesis [68]. Although there are multiple known disease-related TyrRS mutants, it is unclear for some mutants whether pathogenicity results from dysfunction in aminoacylation or cytokine activation.
Alignments with TyrRSs from several species reveals 34% identity with yeast and similarly low identity across the board, with greatest conservation confined to the N-terminus [69]. Gly-244 and Asp-246 are conserved among class I synthetase catalytically important residues, but a total of 8 residues are known to participate in hydrogen bonding [70].
APIP is highly expressed in skeletal muscle and known for an anti-apoptotic role [73]. One common APIP SNP was observed to reduce enzyme activity, resulting in an increase in cell death in response to Salmonella [74].
Currently excluded from the Yeast 7.0 model, the MDH1 enzyme is of critical importance to the pathway. Loss of function leaves the cell with no isozymes for methionine salvage, as demonstrated in the YeastCyc pathway/genome database (PGDB) within BioCyc [75]. The enzyme and the entire methionine salvage pathway seem to be functionally conserved in yeast [76] and should therefore serve as an adequate model in yeast experimentation to validate predicted loss of function due to the contained nsSNV.

Conclusions
The method proposed by this series of studies can be generalized to identify all proteins affected by variation at functionally important residues, to determine the availability of orthologous yeast proteins to model the effects of such variation and, in the future, to predict and quantify cellular and potentially organismal responses to the variation. The specific analyses presented here for demonstration of the method further demonstrate that nonsynonymous single nucleotide variation at the active site should disrupt enzyme function and potentially induce disease in yeast. However, to fully and confidently extrapolate this finding to humans, disruption of enzyme function must be checked against the effects of zygosity. This is currently a limitation imposed by existing data due to the lack of relevant frequency information for the majority of the variations studied. We have provided 6 sets of human/yeast enzyme ortholog pairs for which the human protein has been experimentally shown with nsSNV at the active site and the yeast enzyme has been verified to share important conserved properties. Each of these enzymes is known to have deleterious effects on human physiology when dysfunctional, but the loss of function has been directly related in OMIM to only one of the active site variants, glutathione synthetase. Thus, we present five viable candidates for variant-related disease discovery. Preliminary FBA modeling shows predictive support for one of these enzymes, ribose-5-phosphate isomerase. Experimental validation of these and similar findings will continue to increase the knowledge universe, in turn facilitating the development of more powerful, robust models. Advances in the technology employed by this method may enable the long-term creation of metabolite profiles associated with active site variation for corresponding human and yeast orthologs. Simple biochemical assays comparing human and yeast metabolite concentrations to the profiles could facilitate detection of active site non-synonymous variation. This may further lay the foundation for future diagnostic pipelines to predict unknown outcomes of active site nsSNVs in these and other enzymes.

Datasets and annotated information
Proteomes and all protein subsets were retrieved from UniProtKB/Swiss-Prot (UniProt release 2012_06 [28]. nsSNV data were extracted from UniProtKB/Swiss-Prot, dbSNP (build 135) [29], COSMIC (version 59) [77] and NCI 60 [78] (accessed June 28, 2012). The SNVDis tool (https://hive.biochemistry.gwu.edu/snpdis.cgi?cmd=dmSnpdis) [12] was used to retrieve the set of all proteins with nsSNVs at the active site (as annotated by UniProtKB/Swiss-Prot or CDD). Pathway information was retrieved from Kyoto Encyclopedia of Genes and Genomes (KEGG) [79] using the Protein Analysis Through Evolutionary Relationships (PANTHER) Classification System [80]. Substrate, product and disease information was gathered manually from KEGG and UniProtKB/Swiss-Prot. Official names and PubChem [81] Compound IDs (CIDs) for compounds with specific chemical structures, or Substance IDs (SIDs) for those without, were recorded for all substrates and products.

Reciprocal best BLAST hits method
Stand-alone BLAST (version 2.2.27+) [82]was downloaded from the NCBI website. The yeast (S. cerevisiae) proteome was downloaded from Uniprot by using the term "Baker's yeast [559292]" for the field of Organism[OS], combined with the term "Complete proteome"' in the Keyword field. The set of 559 previously selected human proteins was retrieved from Uniprot/Swis-protKB in batch mode using the corresponding accession numbers.
The yeast proteome file was indexed as the database, and the human protein dataset was used as the query for the nucleotide-to-nucleotide Blastn algorithm with all default parameters and E-value higher than 10 −5 . The customized script was developed to extract best hits from the Blastn result. All the yeast proteins that met the criteria were batch-retrieved from UniprotKB/Swiss-prot by accession numbers and prepared for the opposite direction BLAST. Reciprocal best BLAST hits method was then accomplished by using the selected yeast proteins as the query set to align against the indexed human proteome database downloaded from Uniprot following the similar method described above. A human-yeast ortholog table was generated via a customized script parsing the two-way BLAST results. Thus, only those proteins which are mutual best hits with p-value higher than 10 −5 were marked as the ortholog.

Manual identification of orthologous pairs
Manual examination of BLAST alignment results was performed to verify residue conservation at and around the active site, as well as exclusion of proteins with additional homologs. Residue conservation was observed for all 6 candidate enzymes and reported as the longest continuous sequence conservation block containing the active site residue. Confirmation of active site annotation across species was achieved by consulting relevant databases. Manual confirmation of single-best-match orthologs was done by visual verification of BLAST results of corresponding human and yeast proteins performed separately with human and yeast as query species.

FBA procedures
COBRA simulations of the Yeast 7.0 genome-scale model were performed using Yeast 7.0.0 in the COBRA Toolbox 5.0.0 within MATLAB R2010a. In order to simulate the effects of knocking out genes YGL040C, YOR176W, and YPL097W, we added small (0.0005 mmol/gCDW/hr) requirements for ferroheme b and charged tRNA(Tyr) in the mitochondrion to the Yeast 7.0 biomass function, creating an augmented biomass function. Experimental evidence for these deletions drawn from the Saccharomyces Genome Database [83] suggests that YGL040C and YOR176W null mutants are inviable in large scale surveys, and that YPL097W mutants are deficient in respiration. Reference growth rates for wild-type cells lacking gene knockouts were obtained based on aerobic growth in glucose minimal media, using both the augmented and the original Yeast 7.0 biomass function. Additional file 1: Table A3 contains fluxes obtained for simulations of wildtype cells using the augmented biomass function and are labeled as 'augmented_flux' , while fluxes obtained for simulations of wild-type cells using the Yeast 7.0 biomass function are labeled as '7.00_flux'. Gene knockouts were simulated by limiting flux through the reaction carried out by the gene product in question to zero, equaling a total loss of function for that reaction. All COBRA simulations employed taxicab norm minimization of fluxes as described in the documentation for the optimizeCbModel function.

Reviewers' comments
We appreciate the reviewer's comments from Dr. Daniel Haft and from Dr. Igor B. Rogozin. We have revised the manuscript according to your comments and suggestions.

Reviewer 1: Dr. Daniel Haft
Previous work by this group identified 559 human enzymes in which missense mutations replace an enzyme's active site residue in rare alleles. Of these, 34 are assigned a single enzymatic activity rather than a panel of alternate substrates, with the reaction belonging to a single pathway in human as documented by KEGG. For this paper, the collection is winnowed further, to count only those for which bi-directional best BLAST matches between human and yeast identify putative ortholog pairs, uncomplicated by paralogs that could overlap in function. This stringent filter gets the list down to six enzymes, the topic of this paper. A tool for flux balance analysis (FBA) is available for yeast, but not for human. Thus, for these six enzymes, computation based on a yeast metabolic model leads to predictions that loss of the same function in human causes disease.
In one sense, there is no surprise here. Mutations away from the active site already were known to cause human disease for five of the six enzymes. Loss of the active site residue should be no less catastrophic to function than truncation or frameshift errors. It is possible that a heterozygous active site mutant could produce a currently unrecognized form of a disease now known only for homozygotes of an incompletely inactivated enzyme, but that is not the point of the paper. The yeast model is haploid. Presumably, the implications are for homozygotes of the mutant form in human.
Authors' response: Perhaps it wasn't clear from our original discussion, but we agree that the haploid yeast model is only suited to truly "model" homozygous cases. However, we can envision potential scenarios in which a haploid model could still provide predictive quantitative value in assays in highly conserved pathways for loss of function due to heterozygous mutation at active sites in humans. We did revisit the text relevant to this issue and modified it to improve clarity. Furthermore, we are not claiming that homozygous alleles are altogether too rare, in fact the journal article we cite here (Ng PC1, Levy S, Huang J, Stockwell TB, Walenz BP, Li K, Axelrod N, Busam DA, Strausberg RL, Venter JC. Genetic variation in an individual human exome. PLoS Genet. 2008 Aug 15;4(8):e1000160) reports that just under 50% of all nonsynonymous SNPs in HuRef are homozygous. However, they go on report that only about 15% are likely to affect protein function in a deleterious way and in this case, heterozygous nsSNPS are two times more likely than homozygotes to be predicted to affect protein activity. Similarly, rare nsSNPS are twice as likely as common nsSNPs to affect protein function. Thus, at the active site and other functionally important residues, protein-affecting SNPs should be selected against and should be less likely to be observed in homozygote form.
At your suggestion, we conducted a new PubMed search and were able to find an article identifying a homozygous mutation in the catalytic domain of the MMP2 gene in two sisters with Winchester syndrome (Rouzier C1, Vanatka R, Bannwarth S, Philip N, Coussement A, Paquis-Flucklinger V, Lambert JC. A novel homozygous MMP2 mutation in a family with Winchester syndrome. Clin Genet. 2006 Mar;69(3):271-6.) The long-term combination of (1) improved classification and annotation of variation data and (2) simple increase in available studies should allow us to find these cases and truly study the quantitative differences of heterozygous and homozygous states, ultimately helping to develop and refine possible detection assays.
The surprising finding, actually, seems to be that even under the stringent filtering usedconserved orthologs from yeast to human of enzymes in a conserved pathwayfive of the six genes nominated for study are associated with described human genetic diseases, not simply non-viability. The alleles responsible for the genetic disorders seen in patients may encode enzymes that retain some residual function, probably more than an active site mutant retains, allowing for viability.
Rather than presenting a new method to find previously unrecognized human genetic disease, this paper forms a field test of the bioinformatics infrastructure one would need for such a pipeline in the future. If FBA is not yet available for human cells, the yeast model can stand in for a few validated human/yeast orthologs. When it turns out that zeroing activity of several genes causes no change, it suggests the FBA model needs repair. The authors describe making such repairs, on their way to completing their proof-of-principle for notion that a yeast metabolic model, and an infrastructure for data integration, could point the way to the prediction of diseases that could be caused by certain missense mutations.