The mysterious orphans of Mycoplasmataceae

Background: The length of a protein sequence is largely determined by its function, i.e. each functional group is associated with an optimal size. However, comparative genomics has revealed that protein length may be affected by additional factors. In 2002 it was shown that in the bacterium Escherichia coli and the archaeon Archaeoglobus fulgidus, protein sequences with no homologs are, on average, shorter than those with homologs. Most experts now agree that the length distributions are distinctly different between protein sequences with and without homologs in bacterial and archaeal genomes. In this study, we examine this postulate through a comprehensive analysis of all annotated prokaryotic genomes, focusing on certain exceptions. Results: We compared the length distributions of "having homologs proteins" (HHPs) and "non-having homologs proteins" (orphans or ORFans) in all currently annotated, completely sequenced prokaryotic genomes. As expected, the HHPs and ORFans have strikingly different length distributions in almost all genomes. As previously established, the HHPs are, on average, longer than the ORFans, and the length distributions for the ORFans have a relatively narrow peak, in contrast to the HHPs, whose lengths spread over a wider range of values. However, about thirty genomes do not obey these rules. Practically all genomes of Mycoplasma and Ureaplasma have atypical ORFan distributions, with the mean length of ORFans larger than the mean length of HHPs. These genera constitute over 80% of the atypical genomes. Conclusions: We confirmed on a comprehensive set of genomes the previous observation that HHPs and ORFans have different gene length distributions. We also showed that Mycoplasmataceae genomes have distinctive distributions of ORFan lengths. We offer several possible biological explanations of this phenomenon.


Background
Different factors affect the properties of prokaryotic proteins. Some of them appear to be general constraints on protein evolution. For example, genomic studies revealed that the base composition of a genome (i.e. GC content) correlates with the overall amino acid composition of its proteins [2]. There are also general constraints on protein size; for instance, prokaryotic proteins are, in general, smaller than eukaryotic ones [3].
Previously, we revealed some other factors affecting the lengths of protein-encoding genes [4][5][6]. However, there are numerous protein-encoding genes without homologues in the genomes of other organisms; these are called "ORFans" or "orphans" (the term coined by Fisher and Eisenberg [7]). The ORFans are not linked by overall similarity or shared domains to the genes or gene families characterized in other organisms.
Tautz and Domazet-Lošo [8] were the first to discuss systematic identification of ORFan genes in the context of gene emergence through duplication and rearrangement processes. Their study was supported by other excellent reviews [9][10][11].
ORFan genes were initially described in yeast as a finding of the yeast genome-sequencing project [12,13], followed by identification of ORFans in all sequenced bacterial genomes. Comparative genomics has shown that ORFans are a universal feature of any genome, with the fraction of ORFan genes varying between 10% and 30% per bacterial genome [14]. Fukuchi and Nishikawa [15] found that neither organism complexity nor genome length correlates with the percentage of ORFan genes in a genome.
ORFans are defined as the genes sharing no similarity with genes or coding sequence domains in other evolutionary lineages [12,13]. They have no recognizable homologs in distantly related species. This definition is conceptually simple, but operationally complex. Identification of ORFans depends both on the detection method and the reference set of genomes, as this defines the evolutionary lineage to be investigated. Albà and Castresana [16] have questioned whether BLAST was a suitable procedure to detect all true homologues and concluded that BLAST was a proper algorithm to identify the majority of remote homologues (if they existed).
Lipman et al. [1] studied the length distributions of the Having Homologs Proteins (HHP) and Non-Conserved Proteins (ORFans in our nomenclature) sets for the bacterium Escherichia coli, the archaeon Archaeoglobus fulgidus, and three eukaryotes. Regarding the two prokaryotes, the group made the following observations:
i. HHPs are, on average, longer than ORFans.
ii. The length distribution of ORFans in a genome has a relatively narrow peak, whereas the HHPs are spread over a wider range of values.
Lipman et al. [1] proposed that there is a significant evolutionary trend favoring shorter proteins in the absence of other, more specific functional constraints. However, research in this area has so far been limited in the scope of organisms examined. Here, we have tested the above-mentioned observations by Lipman et al. [1] on a comprehensive set of all sequenced and annotated bacterial genomes. We compared the length distributions of HHPs and ORFans in all annotated genomes and confirmed, to a large extent, the conclusions of Lipman et al. [1]. Below, we describe and discuss the few remarkable exceptions to the general rules.

Results and discussion
Most species that are exceptions to Lipman's rule [1] belong to the Mycoplasmataceae family. Mycoplasmataceae lack a cell wall, feature some of the smallest genomes known and are "metabolically challenged", i.e. they are missing some essential pathways of free-living organisms [24][25][26][27][28]. Many Mycoplasmataceae species are pathogenic in humans and animals.

HHPs and ORFans lengths
We selected four genomes out of the 1484 currently sequenced and annotated bacterial genomes to illustrate typical protein length distributions for HHPs and ORFans (Fig. 1, Panels A-D). The ORFans' length distributions are relatively narrow, in contrast to the HHPs, whose lengths spread over a wider range of values. ORFans are clearly shorter than HHPs in all four species (Fig. 1, Panels A-D). Note that the distributions of protein lengths in the four selected bacteria are similar to the global distribution presented in Fig. 1 (Panels E-F).
Based on the data from two genomes, Lipman et al. [1] suggested that, in the general case, HHPs are on average longer than the ORFan proteins. To test this statement, we calculated the distributions of protein lengths for all COG-annotated genomes and built a histogram of the differences between the mean lengths of HHPs and ORFans, which turned out to be approximately bell-shaped (Fig. 2). On average, HHPs are longer than ORFans by approximately 150 amino acids.
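The per-genome statistic behind this histogram can be sketched as follows. This is a minimal illustration in Python, not the authors' R scripts: the function name, the `(length, cog_id)` input layout and the toy genomes are all hypothetical, and the only rule taken from the text is that a COG-annotated protein counts as an HHP and a protein without a COG counts as an ORFan.

```python
from statistics import mean

def mean_length_difference(proteins):
    """Return mean(HHP length) - mean(ORFan length) for one genome.

    `proteins` is a list of (length_in_aa, cog_id) pairs; cog_id is None
    when the protein is not assigned to any COG (treated as an ORFan).
    """
    hhp = [length for length, cog in proteins if cog is not None]
    orfan = [length for length, cog in proteins if cog is None]
    if not hhp or not orfan:
        return None  # the difference is undefined without both groups
    return mean(hhp) - mean(orfan)

# Hypothetical toy genomes (made-up lengths, not data from the study).
genomes = {
    "typical_genome": [(320, "COG0001"), (410, "COG0002"), (150, None), (130, None)],
    "atypical_genome": [(300, "COG0001"), (280, "COG0003"), (650, None), (410, None)],
}

differences = {name: mean_length_difference(p) for name, p in genomes.items()}

# Genomes whose ORFans are, on average, at least as long as their HHPs.
atypical = [name for name, d in differences.items() if d is not None and d <= 0]
print(differences)
print(atypical)
```

A histogram of the values in `differences` over all genomes would correspond to the kind of plot described above; genomes with a non-positive difference fall into its left tail.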
However, the bell-shaped distribution has a heavy left tail containing genomes with ORFans' mean length equal to or exceeding HHPs' mean length (Figure 2). In order to atypical ORFans, which is sufficient for statistical analysis (see Table 1). Therefore, we restricted our analysis to the Mycoplasma and Ureaplasma genera.

Variability of protein lengths
The Mycoplasmataceae genomes challenge the second conclusion of Lipman et al. [1] that the length distributions of ORFans have a relatively narrow peak, whereas those of the HHPs are spread over a wider range of values. The histogram of differences between HHPs and ORFans in these atypical genomes is shown in Fig. 2 (red bars). We calculated the Coefficient of Variation ($CV = \sigma_Y / \mu_Y$, where Y is a set of protein lengths, $\sigma_Y$ is its standard deviation and $\mu_Y$ is its mean); the average difference between the CVs of ORFans and HHPs in the "atypical" genomes was 0.31. We also computed the variances of lengths for ORFans and HHPs separately and conducted the F-test, resulting in p-values $< 10^{-64}$ for all tested pairs. Therefore, the ORFan proteins of these genomes are more variable in length than the HHPs.
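The two variability measures used here can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the original analysis: the helper names and the length lists are hypothetical, and the sample (n - 1) standard deviation is assumed because the text does not say which estimator was used.

```python
from statistics import mean, stdev, variance

def coefficient_of_variation(lengths):
    """CV = sample standard deviation divided by the mean of the lengths."""
    return stdev(lengths) / mean(lengths)

def variance_ratio(sample_a, sample_b):
    """F statistic: ratio of the two sample variances (a over b)."""
    return variance(sample_a) / variance(sample_b)

# Hypothetical protein lengths in amino acids (not data from the study).
orfan_lengths = [100, 200, 900, 1500, 300]
hhp_lengths = [250, 300, 350, 400, 450]

# A positive difference means the ORFans are relatively more variable.
cv_diff = coefficient_of_variation(orfan_lengths) - coefficient_of_variation(hhp_lengths)

# Compared against an F distribution with (len(a) - 1, len(b) - 1) degrees of freedom.
f_stat = variance_ratio(orfan_lengths, hhp_lengths)
print(round(cv_diff, 3), f_stat)
```

Turning `f_stat` into a p-value requires the upper tail of the F distribution, which the Python standard library does not provide; a statistics package would be needed for that step.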

Selection of a statistic for identification of atypical genomes
We tested the relationships between the mean HHP length and the mean ORFan

Functional annotation of ORFans
We selected 9350 ORFans from 32 Mycoplasmataceae species, found the best hits in other prokaryotic genomes, and stratified them by functional annotation in the COG database. 54% of the ORFans were mapped to the "hypothetical protein" category; 6% are "lipoproteins", a further 2% are "membrane lipoproteins", 3% are "surface protein 26-residue repeat-containing proteins", and the rest are mapped to less abundant categories. A protein is called "hypothetical" if its existence has been predicted in silico, but its function has not been experimentally validated. Although Mycoplasmataceae cells are wall-less and have no periplasmic space, they effectively anchor and expose surface antigens using proteins acylated with long-chain fatty acids [29][30][31]. Lipoproteins are abundant in mycoplasmal membranes and are considered to be a key element in diversifying the antigenic character of the mycoplasmal cell surface [29,32].
For the long proteins (≥1000 aa), which are of particular interest, we compared the functional annotations between HHPs and ORFans. These two groups differed most in the "hypothetical protein" category (p-value = 4.00195E-21), which is over-represented in ORFans, followed by "efflux ABC transporter, permease protein", also over-represented in the long ORFans of Mycoplasmataceae (p-value = 4.18953E-06 were excluded due to an inconsistency of annotation between them. Based on the data obtained, we concluded that the phenomenon of extremely long ORFans is specific to the family Mycoplasmataceae.
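The text does not name the test behind these p-values. As one plausible way to quantify over-representation of an annotation category among long ORFans, the sketch below computes a one-sided hypergeometric (Fisher-type) tail probability in Python. The choice of test, the function name and all counts are assumptions made for illustration only.

```python
from math import comb

def hypergeom_upper_tail(k, n_drawn, n_success, n_total):
    """P(X >= k) when n_drawn items are drawn without replacement from
    n_total items, of which n_success carry the category of interest."""
    upper = min(n_drawn, n_success)
    numerator = sum(
        comb(n_success, i) * comb(n_total - n_success, n_drawn - i)
        for i in range(k, upper + 1)
    )
    return numerator / comb(n_total, n_drawn)

# Hypothetical counts (not from the study): 20 long proteins in total,
# 8 of them annotated "hypothetical protein"; 6 of the 20 are ORFans,
# and 5 of those 6 ORFans carry the annotation.
p_value = hypergeom_upper_tail(k=5, n_drawn=6, n_success=8, n_total=20)
print(p_value)
```

A small `p_value` would indicate that the category is more frequent among the long ORFans than expected if annotations were distributed independently of ORFan status.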

Driving forces behind the long ORFans
Why do the Mycoplasmataceae have ORFans as long as HHPs, with a distribution of ORFan lengths very similar to that of HHPs? Mycoplasmataceae are a heterogeneous group of cell-wall-less bacteria, the smallest and simplest self-replicating prokaryotes. They have a reduced coding capacity and have lost many metabolic pathways as a result of their parasitic lifestyle [33,34]. These organisms are characterized by the lack of a cell wall, small genome size, low G+C content (23% to 40%) and atypical genetic code usage (UGA encodes tryptophan instead of acting as the canonical opal stop codon) [35]. In addition, mRNAs in Mycoplasmataceae genomes lack 5' UTRs, as established by Nakagawa et al. [36]. This phenomenon is highly unusual in bacteria. Below we propose and discuss several reasons that might explain the presence of long ORFans in Mycoplasmataceae.

Low GC content and unusual base composition in a reduced bacterial genome
We analyzed 300 genomes with the lowest GC content (ranging from 14% to of proteins with transmembrane helices and/or signal sequences and a unique serine-threonine bias prominent in proteins associated with pathogen-host interactions. GC3 is defined as the fraction of guanines and cytosines in the third codon position [40]. The importance of variability in genomic GC and genic GC3 content for stress adaptation has been established by multiple authors for a number of prokaryotic and eukaryotic organisms [41][42][43][44][45]. The mechanisms behind GC-content differences in bacterial genomes are unclear, although variability in the replication and/or repair pathways has been suggested as a possible explanation [46][47][48]. One mechanistic clue is the positive correlation between genome size and GC content (smaller genomes tend to have lower GC content). This tendency is particularly pronounced for obligate intracellular parasites. Two (not necessarily mutually exclusive) hypotheses have been put forward to explain this base composition bias in the genomes of intracellular organisms. The first is an adaptive hypothesis, based on selection for energy constraints [49]. It states that low GC content helps the intracellular parasites to compete with the host pathways for the limited metabolic resources in the cytoplasm. The second hypothesis relates to mutational pressure resulting from the limited DNA repair systems in bacterial parasites [50]. Small intracellular bacteria often lose non-essential repair genes and are therefore expected to be deficient in their ability to repair damage caused by spontaneous chemical changes. This is particularly expected for endosymbionts, in which genetic drift plays a major role in sequence evolution [50].
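The two composition measures defined above are straightforward to compute from a coding sequence. The following Python sketch is only an illustration: the function names and the example sequence are hypothetical, and the input is assumed to be an in-frame DNA coding sequence containing only A, C, G and T.

```python
def gc_content(seq):
    """Fraction of G and C among all bases of a DNA sequence."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)

def gc3_content(cds):
    """Fraction of G and C among the third codon positions of a coding
    sequence that starts in frame and has a length divisible by three."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    third_positions = cds[2::3]  # bases at indices 2, 5, 8, ...
    return sum(base in "GC" for base in third_positions) / len(third_positions)

# Hypothetical five-codon sequence: ATG AAA TTT GGC TAA
example_cds = "ATGAAATTTGGCTAA"
gc = gc_content(example_cds)
gc3 = gc3_content(example_cds)
print(gc, gc3)
```

In this toy sequence the third positions are G, A, T, C and A, so two of the five are G or C.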
Thus, Mycoplasma and Ureaplasma are GC- and GC3-poor (Figure 5, Supplemental Table 1). Why is GC-poverty so important? According to the "codon capture model", in a GC-poor environment the replication mutational bias towards AT causes the stop codon TGA to change to the stop codon TAA without affecting protein length [51,52]. The subsequent appearance of the TGA codon through a point mutation leaves it free to encode an amino acid (Trp). This brings us to our next point of discussion. Based on these findings, we conclude that the GC content of genes and genomes cannot be the sole factor responsible for the existence of long ORFans in Mycoplasmataceae.
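The effect of the TGA reassignment on the apparent length of an open reading frame can be illustrated with a small Python sketch. The function name and the example sequence are hypothetical; the only facts relied upon are that the standard code uses TAA, TAG and TGA as stop codons, while the Mycoplasma/Spiroplasma code reads TGA as tryptophan and keeps TAA and TAG as stops.

```python
STANDARD_STOPS = {"TAA", "TAG", "TGA"}
MYCOPLASMA_STOPS = {"TAA", "TAG"}  # TGA is read as Trp

def codons_before_stop(cds, stop_codons):
    """Count in-frame codons from the start of `cds` up to, but not
    including, the first stop codon (or the end of the sequence)."""
    cds = cds.upper()
    count = 0
    for i in range(0, len(cds) - 2, 3):
        if cds[i:i + 3] in stop_codons:
            break
        count += 1
    return count

# Hypothetical sequence: ATG AAA TGA GGC TTT TAA
example = "ATGAAATGAGGCTTTTAA"
length_standard = codons_before_stop(example, STANDARD_STOPS)
length_mycoplasma = codons_before_stop(example, MYCOPLASMA_STOPS)
print(length_standard, length_mycoplasma)
```

Read with the standard code, the toy frame terminates at the internal TGA after two codons; read with the Mycoplasma code, it continues to the TAA and yields five codons.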

UGA Stop→Trp recoding
Almost all bacterial and archaeal species have three stop codons: TAA, TGA and TAG. However, there are 77 exceptions to this rule among the 2723 currently completely sequenced prokaryotic genomes (note that only 1484 of them are COG-annotated and, therefore, were used in our study). Seventy-three of these seventy-seven species belong to the genera Mycoplasma, Spiroplasma, and Ureaplasma; all of them are small bacteria of the class Mollicutes. In addition, in several mitochondrial lineages, the UGA Stop→Trp recoding is also associated with both genome reduction and low GC content [54][55][56]. For example, Candidatus Hodgkinia cicadicola, mentioned above because of its "dwarf genome", also features the coding reassignment of UGA Stop→Trp [57]. Moreover, two groups of currently uncultivable bacteria, found in marine and fresh-water environments and in the intestines and oral cavities of mammals, use UGA as an additional glycine codon instead of a signal for translation termination [58]. Under the "codon capture" model, a codon falls to low frequency and is then free to be reassigned without major fitness repercussions. Applying this model to the UGA Stop→Trp recoding, mutational bias towards AT causes each UGA to mutate to the synonymous stop codon UAA without affecting protein length [51,52]. When the UGA codon subsequently reappears through a mutation, it is then free to encode an amino acid [51,52]. While some have argued that codon capture is insufficient to explain many recoding events [2,54,55], the fact that all known UGA Stop→Trp recoding has taken place in low-GC genomes [54,59] makes the argument attractive for this recoding. It was suggested [51] that the recoding is driven by the loss of translational release factor RF2, which recognizes the

Lack of a cell wall and parasitic lifestyle
Several bacterial species have wall-less cells (L-forms) as a response to extreme nutritional conditions [61]; L-forms might have played a role in evolution with respect to the emergence of Mycoplasma [62]. To compensate for the lack of a cell wall, Mycoplasma developed extremely tough membranes capable of contending with the host cell factors. Lipoproteins are abundant in mycoplasmal membranes [29,32]. They modulate the host's immune system [63], thereby playing an important role in the propagation of infection. The ability of lipoproteins to undergo frequent size or phase variation is considered to be an adaptation to different conditions, including the host's immune response [63,64]. Depending on the species, lipoproteins are encoded by single or multiple genes (multi-gene families), and some of them are members of paralogous families, such as the P35 lipoprotein of M. penetrans [68]. Some lipoproteins are species-specific, while others have homologs among different species; in particular, some are associated with or share sequence similarity with ABC transporter genes, suggesting that they may play a role in the transport of nutrients into the cell [69]. It is well established that prokaryotic ABC transporters translocate different compounds across cellular membranes in an ATP-coupled process (a crucial function for obligate parasites like Mollicutes). They also carry out a remarkable diversity of other functions, some of which are essential for pathogenicity [70].

Conclusions
We have compared the length distributions of "having homologs proteins" (HHPs) and "non-having homologs proteins" (orphans or ORFans) in all currently annotated, completely sequenced prokaryotic genomes.
In general, we confirmed the findings of Lipman et al. [ which parasitize a wide range of hosts [33,34]. These organisms are characterized by the lack of a cell wall, small genome sizes, a low GC content (23% to 40%) of the genome and the use of a different genetic code (UGA as a tryptophan codon instead of the universal opal stop codon) [29].
We propose that the atypical features of Mycoplasmataceae genomes likely developed as adaptations to their ecological niche, specifically for "quiet" coexistence with host organisms. Mycoplasma are known to colonize their hosts with no apparent clinical manifestations, using the high variability of lipoproteins to trick the host's immune system. It is these lipoproteins that are frequently encoded by the long ORFans in Mycoplasma genomes, along with "surface protein 26-residue repeat-containing proteins" and "efflux ABC transporters". The latter functions are also associated with the obligatory parasitic lifestyle of Mycoplasma, which supports our hypothesis.

COGs database
The Clusters of Orthologous Groups of proteins (COGs) database (http://www.ncbi.nlm.nih.gov/COG/) has been a popular tool for functional annotation since its inception in 1997 and is particularly widely used by the microbial genomics community. The COG database is described in detail in a series of publications [74][75][76][77]. Recently, the COG-making algorithm was improved and the COG database updated [78]; however, for the purposes of our study we preferred to use the original COG repository (ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/). This choice enabled us to compare the distributions of HHPs and ORFans in as many as 1484 prokaryotic genomes, since COG functional classification of the encoded proteins is one of the required descriptors of all newly sequenced prokaryotic genomes [79].
Statistical analysis was conducted in R using built-in functions and custom scripts.

List of abbreviations
COG - a cluster of orthologous groups of genes; each COG consists of groups of proteins found to be orthologous across at least three lineages
HHP - having homologs protein; here, a COG-annotated protein
ORF - open reading frame
ORFan - non-HHP; here, a protein-encoding gene that is not linked to any COG
GC content - relative frequency of guanine and cytosine
GC3 content - relative frequency of guanine and cytosine in the 3rd position of a codon

Competing interests
The authors have no competing interests.

Authors' contributions
TT and AB jointly performed data analysis and interpretation and wrote the first draft of the manuscript. IL and YN helped with data interpretation and manuscript preparation.