Referee 1:
Pierre Pontarotti
Directeur de Recherche CNRS
Marseilles, France
Reviewer's comments
I read your article carefully and with great interest. Unfortunately, I do not see any new information in it. Indeed, gene duplication related to functional evolution has been extensively described in the literature, as has its link with physiology.
Perhaps I am missing something: if this is the case, I suggest that you explain the originality of your work more clearly to the reader, and that you provide a comparative description with already published articles.
Despite this comment, the analysis is straightforward and carefully carried out.
Authors' response
We appreciate your prompt reading of our paper. We can see that we have not done a good job of explaining how our study differs from others. Many studies of gene duplication gather total data on sizes of paralogous families in organisms, analyze numbers and rates of mutation etc., as a mathematical model, but do not bring into the picture the difference in functions developed by some of the duplications. We have purposely undertaken to examine closely just a few paralogous families where in most cases the enzymes made by the genes in the families are known. This allows us to see what functions are in common in the chosen microorganisms and what functions have arisen presumably by mutation that are specific to one organism or to closely related organisms, but not to others. In other words, since we know what these gene products do, what pathways they participate in, we can learn something about how organisms became differentiated and unique from one another in biochemical terms.
We will be making this point much more clearly in the manuscript now, thanks to your comments. If you know of other studies along these lines that we should be aware of, it would be a kindness to direct us to them.
Referee 2:
Iyer Aravind
NCBI, NIH
Bethesda, MD
Reviewer's comments
"These proteins share many sequence similarities except that the repressor has a DNA-binding sequence at the N-terminal end, but the transport protein does not."
- This sentence should be modified to simply reflect the fact that the proteins share a PBP domain and that the transcription regulator has acquired a DNA-binding domain.
"Pair-wise related sequences from the entire genome were assembled, using the criteria of similarity as having Pam values below 200 and alignments of at least 83 residues. The groups ranged in size from 92 members in the largest group down to the smallest size, simple pairs."
- This is an underestimate of the actual paralogy situation in the genome. A disclaimer to this effect would be appropriate, indicating that the above method provides an approximate estimate of the cluster sizes of paralogs in the proteome. It might also be proper to differentiate between the paralogy of domains and whole proteins like the RbsR/RbsB example discussed above.
"...(CaiD) in both E. coli and Typhimurium."
- Better to spell out the whole name Salmonella typhimurium and thereafter use S. typhimurium.
"P. aeruginosa has a large number of such single organism occurring enzymes"
- The sentence is highly agglutinative and could be modified to express the point better. Secondly, a more quantitative estimate of the "large number" would be useful. A comparison relative to another organism could also be of value.
"...we suggest that members of the families arose in the course of evolution at least in large part, by duplication followed by divergence."
- This statement is entirely true, but it seems to be a bit of a platitude in this context because the introduction itself starts off by stating the role of duplication in diversification of protein families. Certainly the protein families have emerged through this process. But what does the "large part" mean? Does it imply that a part of the family did not arise by this process? Or are the authors trying to say that within a genome the process was in large part one of duplication/divergence, but that a smaller fraction could be lateral transfer?
This leads to a more general issue regarding the current article. The conclusions would possibly benefit from a more explicit delineation of the relative contributions of lateral gene transfer and lineage-specific expansions of genes (i.e. duplications) in the evolution of families considered here. In terms of physiological adaptation there is ample evidence from hyperthermophiles and photosynthetic organisms that gene transfer between phylogenetically distant lineages is a major contributor to the paralog complement of these organisms and their proteomes in general. This raises the possibility that in the adaptive transition to new niches the acquisition of genes by lateral transfer is a big player.
- Regarding the final discussion on epigenetics: It is known that proteins mediating epigenetic controls are very variably distributed across the bacterial phylogenetic tree. So is it correct to generalize a major role for epigenetics? Probably not -- it might provide some fine-tuning mechanisms but is unlikely to make a fundamental physiological difference once the more fundamental determinants are directly inferred from the proteome.
Authors' response
Thank you for helping us improve our manuscript with your many insightful comments and helpful suggestions. We have adopted or addressed these as follows.
The sequence relationship of RbsR/RbsB has been explained in terms of similarities and differences in domain content.
We have explained that the sequence-similar groups we generate are not based on similarity of smaller domains or motifs, but rather require larger fractions of the proteins to be aligned, in an attempt to simulate gene duplication. As a result, our estimates of paralogy may be considered conservative.
Salmonella enterica subsp. enterica serovar Typhimurium LT2 is now referred to as S. enterica rather than S. typhimurium so as to conform to current correct nomenclature.
We have clarified our statement about the large number of single organism occurring enzymes in P. aeruginosa and have included specific numbers and comparisons between the organisms analyzed.
On the influence of duplication and divergence versus lateral transfer as well as gene loss on the current protein family compositions, we have opted not to quantify these sources. We feel that our dataset is too small, both in the number of enzymes and in the number of organisms compared, to make such calculations. When selecting our dataset we sought to use experimentally characterized model organisms and families where the members had known metabolic functions. We have modified the discussion section to state further how gene loss and lateral gene transfer influence today's family compositions. However, given the difficulty of distinguishing horizontally transferred genes from gene duplication and divergence (Lawrence and Hendrickson reference), we opted not to make such estimates for our dataset.
The section on epigenetics has been slightly modified. While the role of epigenetics may not be the major force affecting evolution of protein families and phenotypes of organisms, we do believe it represents an area of potential new insights into how functional diversity arises and is maintained in organisms.
Referee 3:
Arcady Mushegian
Stowers Institute
Kansas City, MO
Reviewer's comments
The manuscript deals with the fates of duplicated genes in bacterial genomes, focusing on the selected families of the enzymes with related, diverged functions and their sequence homologs. In the last 15 years, there has been a considerable amount of work on the subject, relating to each other such factors as rate of duplication, rate of duplicate retention, rate of sequence divergence between duplicates, subfunctionalization, speciation, etc. Many of the relevant papers from this corpus of work are cited in this manuscript. The manuscript would benefit from engaging with these cited papers in a constructive way, i.e., by trying to apply some of the quantitative estimates obtained by other workers to the cases that are studied here.
More specifically, I would like to see much more definitive statements about the timing of gene duplication within the selected three families vs. splits of the lineages that the authors study. Polytomies or lack of support for deep nodes in the tree may be a real problem in a subset of cases, but the analysis should be attempted anyway, and specific cases in which the results lack support should be noted.
Abstract
"Sequence related families of genes and proteins" is perhaps a tautology - "families" already means "sequence-related", does it not?
"In Escherichia coli they constitute over half of the genome." - the total length of these genes is indeed likely to be over half of the genome length; but for this statement to be accurate, the length of the non-coding regions needs to be added to the denominator - has this been done? In fact, I suspect that the authors meant "over half of all proteins encoded by the genome"
"Equivalent families from different genera of bacteria are compared." - what does "equivalent" mean - homologous, of same size, or something else?
"They show both similarities and differences to each other." - consider deleting?
"At least some members of gene families will have been acquired by lateral exchange and other former family members will have been lost over time." - is it "will have been", i.e., expected of the data, or "have been", i.e., shown in this work?
"These families seem likely to have arisen during evolution by duplication and divergence where those that were retained are the variants that have led to distinct bacterial physiologies and taxa." - hard to argue with this, and yet: what would the alternative explanation be - purely stochastic expansion and shrinkage of the families?
Background
Par. 1 "Darwin formulated the Origin of Species" - either formulated the theory of the origin of species, or wrote The Origin of Species, perhaps?
Par. 3, last line: "Stepwise" means "relatively large" in context, but perhaps it should be made more explicit (otherwise, may be interpreted as "step by step", i.e., gradual).
Par. 4: the example of recruitment that the authors discuss is apparently recruitment by addition of a novel domain. This is one mechanism of acquiring new function, but I am not sure that this is what R. Jensen meant; as far as I know, his thoughts were more along the lines of sequence drift and polyfunctionality.
Par. 5: "Some attempts to quantify the importance of horizontal, or lateral, transmission in the bacterial genome conclude that foreign gene uptake rather than gene duplication has been a large player in assembling a genome [29]." - I do not think that the study by Lerat et al. is an either/or proposition. They show that a large absolute number of detected gene transfers can coexist with the low frequency of such transfers in most gene families, which is in my opinion a profound result. They do not argue that gene duplication is less important than horizontal transfer, nor I think have their results been disproved. I agree with the authors' approach expressed in the rest of this paragraph, so I think an attempt to argue against the role of HGT is a red herring.
Last paragraph in the Introduction: "In the context of evolution, one might ask whether the genes for this expansive superfamily in one organism (not from many organisms) bear similarity to one another in their sequences." The authors already asserted that SDR is a superfamily - or is it a family, as both terms are used seemingly interchangeably in this paragraph? On what basis has this been established? Most likely, it was sequence similarity (I have no evidence that structures were matched directly, and indeed similarity comparison is what the first paragraph of the Results also suggests), in which case why does this need to be investigated again, or what are perhaps the more specific questions that need to be addressed?
Results and Discussion
par. 4 - consider deleting?
par. 5 "The groups ranged in size from 92 members in the largest group" - please mention that this is from one study with a conservative similarity threshold; the current count for Walker-box ATPases/GTPases seems to be more than 120 members...
par. 7 "sequence and mechanistically related" - replace with "related by sequence and showing similar molecular mechanism"?
par. 8. Is it important to the authors to make sure that they know all members of each family in E. coli? If the answer is yes, is the AllAllDb comparison sufficient, or would it perhaps be better to build an HMM or a PSIBLAST profile of the already known members and scan the proteome again? If the answer is no, why not?
par. 9: "Some of the SDR enzymes and some of the crotonases are almost universally present in organisms in all three domains of life. Thus one pictures the generation of these enzymes as happening early in evolutionary time, distributed vertically to most organisms." - one may wish to build a phylogenetic tree of the family and compare it with the tree of species to see whether there is any direct evidence for or against horizontal transfer - why not?
Ibid. "Some family members will be virtually universal, but others will differ from one organism or taxa to another, contributing to differences in phenotypes in separate lineages." - is this a statement of fact or a prediction?
par. 10: "members of three enzyme families are the same in other bacteria" - what does 'the same' mean here?
par. 12: "One supposes such commonly held important functions could have arisen by duplication and divergence early in evolutionary time." - why does one have to suppose it - can this again be evaluated by comparing species tree and gene tree?
the next paragraphs: interesting differences are discussed, but no specific evolutionary scenarios are proposed regarding the timing of the events. Can one distinguish between 1. the presence of an enzyme in the common ancestor of the lineages under study (i.e., more or less in the common bacterial ancestor) with secondary loss in some of the lineages and 2. emergence of a specific family member by duplication in some but not all of the lineages? When a horizontal transfer event is suspected (e.g. "As is the case for any of the enzymes present in one organism, not the others, the gene could have been acquired by lateral transmission [26]. However when the enzyme is one of a family of similar enzymes, it is at least as possible that it arose by gene duplication and divergence."), why not attempt to sort out what was actually going on?
Authors' response
Thank you for having taken the time to look carefully at the manuscript. In response to your comments, we have done a major rewrite, during which we incorporated all suggestions about language and expression. We have expanded explanations and have tried to make much clearer the basic thrust of the paper.
In the first part of your review you suggest we do quantitative analysis to sort out when duplication occurred, when divergence occurred, plus when gain of genes by lateral transfer and loss of genes occurred. Our data set is much too small to undertake this type of analysis. We have expanded discussion to include this explanation in the revised manuscript.
You ask what alternatives there are to the process of duplication and divergence. We agree that alternatives are stochastic changes, or perhaps horizontal transfer. But mainly we are saying that one mechanism, perhaps the most important force, in creating the different kinds of bacteria today was duplication and divergence.
We have considered the issue of how we could try to quantify the importance of lateral gene transfer in the four enzyme families we deal with, but we see no obvious outliers in our family groups. Members of these families do not deviate from the properties of other members; thus, if they came from another host source, time has brought about "amelioration", and they are no longer clearly identifiable as horizontally acquired. We agree that the issue is a "red herring" and have minimized discussion of it in our rewrite.
We have clarified that the definition of the SDR family was originally based on similarity of structure of the regions of substrate binding, cofactor binding and reaction site. Sequence similarity was established soon afterward. The referenced papers give this history.
To our knowledge we are alone in having gathered all members of this family and the others in this paper from a single organism, as detected by the methods we describe, the Darwin AllAll algorithm and PSI-BLAST. These have been known already as paralogous groups. We are emphasizing their likely formation by duplication and divergence.
It is not surprising to find that there are more Walker ATPase/GTPase motifs than there are ATP-binding subunits of transporters because this motif appears in some other proteins such as helicases.
The reviewer suggests we might build phylogenetic trees of these families. This has been done in a prior report from our laboratory, which we referenced. In our extensive revision we give our reasons for not expecting gene trees for enzymes to be the same as RNA trees representing species.
The referee's last comment concerns the goal of determining the history of each family of enzymes that led to the distribution and characterization seen today. We have explained in the revision that we have too small a data set to do retrospective analysis, building trees of how the enzymes were generated in each bacterium. Trees of these enzyme families as of today have been presented in a previous publication. We are not able to determine with our data set when specific losses occurred, or whether any of the genes were acquired by LGT. In our revision we have tried to explain much more clearly that this is a qualitative, not quantitative, study. What we observe is perhaps no more than common sense, but we show how differences in the members of an enzyme family (divergence) are the kinds of differences that make each bacterial genus unique. Divergence of duplicate enzymes generated differences we now use to characterize bacterial genera.