Reviewer 1: Michael Gromiha
Reviewer comments:
In this work, the authors systematically analyzed structural disorder in a set of minimal organisms and showed that few characteristic functions are linked with conformational disorder. Further, they suggested that these functions correspond to the most essential ones fulfilled by structural disorder in cellular organisms. The analysis has been carried out extensively with specific examples to RNA polymerase, chaperone protein, single stranded DNA binding proteins, DnaK and so on. The work is interesting and the data provide new insights.
The following comments may be addressed for improvements.
1. The results obtained with negative dataset may be discussed.
Authors’ response: We thank the reviewer for appreciating our work and for his suggestions for improvements. We are not entirely sure what the reviewer means by negative dataset. Since the investigated proteomes are extremely minimalized, which affects both the number and the length of proteins, in some of them even the otherwise essential proteins listed in Table 1 are missing. Also, in many of them the disordered regions either disappeared or shortened to an extent that they do not fulfil our criteria of LDRs (Fig. 1 indicates that even though in some cases the disordered regions are shorter, they are mostly preserved). We have now added orthologs from reference bacteria of different phylogenetic groups to address whether the identified regions are generally disordered or disordered only due to the minimalistic nature of the investigated species. We find that many are also disordered in the reference proteomes. We have also validated our results by using another method for disorder prediction.
2. The threshold value of 580 proteins for selecting the proteins with extreme genome reduction may be justified.
Authors’ response: We have chosen the threshold of 580 for proteome size because above this threshold the obtained proteome set would have been too much biased towards Mycoplasma species. With this threshold we ensured that only two Mycoplasmas are selected, but obligate endosymbionts of different phylogenetic groups are still well represented. Also, somewhat above this threshold there are non-Mycoplasma proteomes in which the proteins were not annotated, only numbered and hence they could not have been used for this analysis.
Reviewer 2: Zoltan Gaspari
Reviewer comments:
This paper describes some important findings about the role of intrinsic protein disorder in minimal genomes. It is original and can be of interest for researchers working in the respective field. Although I think that the work contains novel findings, the volume of the data processed and the novel information provided is a bit limited. I think that the amount of sequences analyzed makes a more detailed study possible and I make some recommendations for this below that the authors might consider to improve the manuscript.
- The authors used one prediction algorithm (IUPred), one threshold (0.5) for classifying residues and one (20 aa) for identifying long regions. Where such necessarily subjective choices should be made, it can be important to prove the robustness of the main conclusions by repeating the analysis with some parameters varied. Can the authors provide such considerations?
Authors’ response: We thank the reviewer for appreciating our work and for his suggestions for improvements. IUPred is a very widely used and trusted method that shows good correspondence with consensus predictions obtained from many different methods. In many previous analyses we and others have repeated the predictions with different methods and the identified tendencies were always preserved. This is why we thought one trusted method should be enough to point out such tendencies. Also, some of the methods that can be run locally on complete proteomes, for example VSL2B, are known to predict very similar patterns but with elevated absolute values, meaning that they usually overestimate the number and extent of disordered regions compared to consensus disorder patterns. In this analysis overprediction is definitely not desired because we wanted to find the set of protein regions that consistently preserve their disordered nature. In our view it is better to obtain a relatively restricted but stable set of regions than to get a larger set that is diluted with false positive cases. We have now repeated the analysis using another prediction method, ESpritz X-ray, and we find very similar tendencies, as described in the last paragraph of the findings section.
The published threshold value for IUPred that best discriminates folded and disordered regions is 0.5; we did not consider changing this value. In Fig. 1, however, we have also highlighted residues with a prediction score >0.4 but <0.5 to show that disorder scores often do not drop very steeply and thus the residues surrounding or intervening between predicted disordered regions usually receive scores implying high flexibility.
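To make the parameters discussed here concrete, the following minimal Python sketch extracts long disordered regions from a per-residue score profile. It assumes scores between 0 and 1, as produced by a predictor such as IUPred, treats residues scoring at or above 0.5 as disordered (whether the boundary is inclusive is an assumption), and requires at least 20 consecutive disordered residues. The function names find_ldrs and borderline_positions and the toy profile are hypothetical illustrations and are not taken from the analysis pipeline used in the manuscript.

```python
from typing import List, Tuple


def find_ldrs(scores: List[float], threshold: float = 0.5,
              min_len: int = 20) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, 0-based and end-exclusive, for every run of
    at least min_len consecutive residues scoring >= threshold."""
    regions = []
    start = None  # index where the current disordered run began, if any
    for i, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                regions.append((start, i))
            start = None
    # Close a run that extends to the end of the sequence
    if start is not None and len(scores) - start >= min_len:
        regions.append((start, len(scores)))
    return regions


def borderline_positions(scores: List[float], low: float = 0.4,
                         high: float = 0.5) -> List[int]:
    """Indices of residues scoring above low but below high."""
    return [i for i, score in enumerate(scores) if low < score < high]


# Toy profile (hypothetical): 10 ordered, 3 borderline, 25 disordered, 5 ordered
profile = [0.1] * 10 + [0.45] * 3 + [0.8] * 25 + [0.2] * 5
print(find_ldrs(profile))              # [(13, 38)]
print(find_ldrs(profile, min_len=30))  # []
print(borderline_positions(profile))   # [10, 11, 12]
```

Raising min_len from 20 to 30 removes the 25-residue region in this toy profile, which mirrors why a stricter length cut-off is expected to yield fewer regions.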
We have now also looked for longer regions of at least 30 consecutive residues. Only 7 ribosomal and 4 non-ribosomal proteins retained LDRs of at least 30 consecutive residues in at least three minimal organisms from at least two of the represented bacterial clades. We added two tables (Additional file 1: Tables S3 and S4) to show the results with this parameter, and we also describe them in the manuscript.
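The conservation criterion stated above (LDRs in at least three minimal organisms from at least two clades) can be sketched in Python as follows. This is only an illustration under stated assumptions: the input layout, the function name conserved_proteins, and all protein, organism and clade labels are hypothetical placeholders, not data from the study.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def conserved_proteins(hits: Iterable[Tuple[str, str, str]],
                       min_organisms: int = 3,
                       min_clades: int = 2) -> List[str]:
    """hits holds one (protein, organism, clade) tuple for each organism in
    which the protein carries at least one LDR. Returns the sorted names of
    proteins that meet both the organism and the clade minimum."""
    organisms = defaultdict(set)
    clades = defaultdict(set)
    for protein, organism, clade in hits:
        organisms[protein].add(organism)
        clades[protein].add(clade)
    return sorted(
        protein for protein in organisms
        if len(organisms[protein]) >= min_organisms
        and len(clades[protein]) >= min_clades
    )


# Hypothetical placeholder data
hits = [
    ("proteinA", "org1", "cladeX"),
    ("proteinA", "org2", "cladeX"),
    ("proteinA", "org3", "cladeY"),
    ("proteinB", "org1", "cladeX"),  # three organisms, but a single clade
    ("proteinB", "org2", "cladeX"),
    ("proteinB", "org4", "cladeX"),
    ("proteinC", "org1", "cladeX"),  # two clades, but only two organisms
    ("proteinC", "org3", "cladeY"),
]
print(conserved_proteins(hits))  # ['proteinA']
```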
We did not want to look for regions <20 residues because those cannot be considered LDRs (at least we do not know of any analysis in the ID literature where regions shorter than 20 residues were considered LDRs). A length of 20 residues thus seemed the ideal choice. The C-terminal region of GroEL, for example, is between 15 and 25 residues in most organisms regardless of proteome size, most probably because its length is restricted by the size of the interior cavity. Multiple experiments with deletion mutants for this region demonstrate that true ensemble-like LDRs of around 20 residues can fulfil crucial functions (see literature references [36–39]).
- It could be of interest to analyze some of the protein families with LDR regions in a bit more detail, including sequences from organisms with non-minimal genomes. There might be interesting patterns in the presence/absence of LDRs that are only apparent on a larger data set.
Authors’ response: We agree. We extended the analysis, Table 1, all additional tables and the alignments of Fig. 1 with orthologs from non-minimal reference bacteria as explained below.
- In general, the study could benefit from using some “reference organisms” with non-minimal genome from all investigated groups. It can be of interest whether any feature might be associated with being in a minimal genome.
Authors’ response: We agree. We extended the analysis, Table 1, all additional tables and the alignments of Fig. 1 with some reference organisms, namely Escherichia coli (strain K12) representing Proteobacteria, Bacteroides vulgatus (strain ATCC 8482) representing Bacteroidetes, and Bacillus subtilis (strain 168) representing the Firmicutes clade. From Tenericutes we did not include a reference proteome because those are best represented by Mycoplasmas, which are included in the dataset anyway. However, we did not accept further protein hits that only show disorder in the reference proteomes because we were explicitly interested in proteins/protein regions that preserve their disordered nature after severe genome minimization. We only use the reference proteomes to demonstrate that the identified LDRs are also mostly present in those and are thus not a consequence of genome minimization, or of the associated fast evolutionary changes and instability of the proteins.
- The authors might want to comment on whether all the identified LDRs are ‘genuine’ disordered regions and not coiled coils or other segments commonly predicted to be disordered.
Authors’ response: We do not know about the sparsely detected regions within polymerase subunits b and b’ and FtsH, because those are not consistently disordered and not mentioned in the literature as such. However, the other regions that we identified, the C-terminal tails of DnaK, GroEL, a linker within RpoD and the N-terminus of GrpE and InfB, as well as the tail region of SSB were checked and they are all ‘genuine’ disordered regions, which correspond to missing regions in the respective PDB structures, as now explained in the manuscript.
- It is not described how the authors identified orthologs as gene/protein names might not be conclusive. Kindly comment on this as there are many missing homologs/orthologs indicated in Table 1.
Authors’ response: We identified orthologs based on gene and protein names, because we found that for these minimal organisms and the associated crucial proteins those were conclusive. When absolutely crucial proteins were missing we also tried to identify them with BLAST, but we could never find them. For instance, we were surprised to see that GroEL was missing in the two Mycoplasmas, which are not even among the smallest proteomes. However, we could not find any sequence resembling GroEL and finally found publications stating that in many Mycoplasmas the GroEL-GroES system is completely missing (Wong P and Houry WA, 2004). So the missing homologs/orthologs in Table 1 are really missing as a consequence of genome minimization and not due to misannotation.
- The authors might want to comment on the proteins with LDRs found in only one of the investigated proteomes. It can be of interest whether these proteins are in any way associated with being located in a minimal genome.
Authors’ response: This is an interesting suggestion but in our view this would be out of the scope of this discovery note. We did not identify anything that seemed specific for minimal genomes; however, they have surprisingly many putative genes/proteins considering their minimalistic nature. Some of those are disordered. Since several works suggest that their proteins need the extensive assistance of chaperones, we were guessing that among those putative proteins there might be disordered chaperones, but we cannot prove this assumption. Also, we are highly restricted in both text length and number of display items, so we would like to stick to the original idea, that is, looking for regions that are consistently disordered in different phylogenetic groups after extensive genome reduction.
- Page 5, line 39: “translation initioation factor” is misspelled.
Authors’ response: We thank the reviewer for this remark; we corrected the mistake.
- The two additional xls files could be combined to a single one with the data in two tabs.
Authors’ response: We thank the reviewer for this remark. Now all the Additional Tables with IUPred data are arranged as tabs of a single Excel file and those with ESpritz data are collected in another.
Reviewer 3: Sandor Pongor
Reviewer comments:
The manuscript by Pancsa and Tompa highlights the fact that in organisms with a minimal genome, essential protein functions are linked with structural disorder. The ms is well written and understandable. The figures are clear.
I feel that the message could be made more succinct by emphasizing a few aspects that may not be immediately clear to wider audiences. For instance, i) Why was the set of 13 proteomes selected? Is the finding (i.e. the set of proteins found) sensitive to the selection? The NCBI list of complete (annotated) genomes includes over 80 endosymbionts. In more detail, the Mazumder dataset is optimized for sequence similarity as well as bibliographic criteria in the context of all proteomes, while taxonomic coverage within the group of the selected (minimal genome) organisms may be more relevant to this work, at least according to this reviewer. The original paper of Mazumder et al states that CMT55 is superior to other dbase distributions in this respect; can the authors explain why they chose the CMT15 dataset?
Authors’ response: We thank the reviewer for appreciating our work and for his suggestions for improvements. We have chosen the CMT15 dataset because it contains well-annotated species whose phylogenetic group is assigned. Although this dataset is smaller than CMT55, the quality of annotations is much better. For Bacteria, the CMT55 dataset is identical to the list of UniProt reference proteomes that currently contains 4159 proteomes (UniProt 08_2016). Back then in 2011, when the Mazumder paper was published, they had altogether only 637 proteomes in the CMT55 dataset, while now it is over 5000. The expansion of the sequence space is so fast that annotation procedures clearly cannot catch up any more. So, although there are many more proteomes in CMT55, many of those are not well annotated, many come from environmental samples, there are several species represented from the same genus, so it is quite redundant, and there are many unclassified species whose phylogenetic group is not known (for example there are multiple reference proteomes with GW numbers termed as Parcubacteria group bacterium or Microgenomates group bacterium). Also, we have found several mistakes that would potentially affect our dataset; for instance, under the UniProt code UP000064377 there is a Salmonella enterica subsp. enterica serovar Enteritidis str. LA5 species assigned as a reference proteome with only 102 proteins. It was clear that this annotation cannot be correct, since Salmonella enterica species usually have huge genomes and proteomes with >5000 proteins. We have checked this entry and it turned out that it contains only the proteins of a respective plasmid. Although there are mistakes that are easy to identify and filter out, there are also less obvious annotation mistakes that would not necessarily pop up in the automated data analysis pipeline used here, so we decided to use the smaller but better annotated, more trustworthy CMT15 dataset.
For the proteomes that we use, the genes and proteins are also well annotated, so we can rely on the annotated gene and protein names, while for many reference proteomes of the CMT55 dataset they are only numbered, with no information on the function of the protein whatsoever. Lastly, with a considerably larger dataset we could neither show the corresponding data table in the manuscript nor depict the complete alignments of the orthologs.
ii) Are the orthologues of the identified proteins found in non-reduced genomes also disordered?
Authors’ response: For the polymerase subunits we did not find any literature evidence on conserved disordered regions, but for DnaK, DnaJ, GroEL, GrpE, SSB, translation initiation factor 2 (infB) and Peptide chain release factor 1 (prfA) there are analyses in the literature and available protein structures supporting that the respective regions are disordered in orthologs from non-reduced genomes. Now we included three representative reference bacterial proteomes into the analysis to show the disorder status of the identified regions in those and mention the corresponding protein structures.
iii) The identification of disordered proteins relies on the prediction of long disordered stretches. Does the length threshold and the selection of the prediction program influence the findings? Would a different prediction method give different functional predictions?
Authors’ response: We have already answered similar questions for Zoltan Gaspari above. The length definitely influences the findings. With a minimum LDR length of 30 consecutive residues, we found fewer conserved disordered regions, but 7 ribosomal proteins and 4 other proteins still retained LDRs in more than one bacterial clade (see newly added Additional file 1: Tables S3 and S4).
Twenty residues seemed the ideal choice for an LDR. We did not want to look for regions <20 residues because those cannot be considered LDRs.
Now we repeated the analysis using another conservative prediction method, ESpritz X-ray, and we find very similar tendencies. Please see the new paragraph before the Discussion section.
iv) Can one assign statistical significance to the findings, for instance by simply repeating the predictions with a series of subsets of the selected proteomes?
Authors’ response: We could perhaps assign statistical significance but we do not think it is necessary. The average fraction of disordered residues in these proteomes ranges between 1 and 12 %, with a median of 3.7 %, which is extremely low compared to other organisms (Pancsa and Tompa, 2012). The number of identified LDRs was between 1 and 57 in the 13 minimal proteomes including ribosomal proteins, but only 0 to 41 excluding ribosomal proteins (with medians of 17 and 7 regions, respectively). The probability of repeatedly picking the same 20-residue protein segment at random (without assuming the evolutionary conservation of disorder in those regions) is negligibly low. We do not think we must force statistics on that, especially as we are bound by the strict length limitations of the discovery note format. It is clear that most of the regions identified here (except for those in the polymerase subunits and FtsH that appeared in different regions of the orthologous proteins) represent evolutionarily conserved disordered regions that do not merely pop up due to prediction or annotation mistakes. We now show that the respective regions are identified independently of the prediction method used and that they are also mostly disordered in non-minimal reference bacteria from diverse clades.