Skip to main content

Analysis of lineage-specific protein family variability in prokaryotes combined with evolutionary reconstructions



Evolutionary rate is a key characteristic of gene families that is linked to the functional importance of the respective genes as well as specific biological functions of the proteins they encode. Accurate estimation of evolutionary rates is a challenging task that requires precise phylogenetic analysis. Here we present an easy to estimate protein family level measure of sequence variability based on alignment column homogeneity in multiple alignments of protein sequences from Clade-Specific Clusters of Orthologous Genes (csCOGs).


We report genome-wide estimates of variability for 8 diverse groups of bacteria and archaea and investigate the connection between variability and various genomic and biological features. The variability estimates are based on homogeneity distributions across amino acid sequence alignments and can be obtained for multiple groups of genomes at minimal computational expense. About half of the variance in variability values can be explained by the analyzed features, with the greatest contribution coming from the extent of gene paralogy in the given csCOG. The correlation between variability and paralogy appears to originate, primarily, not from gene duplication, but from acquisition of distant paralogs and xenologs, introducing sequence variants that are more divergent than those that could have evolved in situ during the lifetime of the given group of organisms. Both high-variability and low-variability csCOGs were identified in all functional categories, but as expected, proteins encoded by integrated mobile elements as well as proteins involved in defense functions and cell motility are, on average, more variable than proteins with housekeeping functions. Additionally, using linear discriminant analysis, we found that variability and fraction of genomes carrying a given gene are the two variables that provide the best prediction of gene essentiality as compared to the results of transposon mutagenesis in Sulfolobus islandicus.


Variability, a measure of sequence diversity within an alignment relative to the overall diversity within a group of organisms, offers a convenient proxy for evolutionary rate estimates and is informative with respect to prediction of functional properties of proteins. In particular, variability is a strong predictor of gene essentiality for the respective organisms and indicative of sub- or neofunctionalization of paralogs.


The determinants of protein evolution rates have been studied for decades, with the rate estimates typically based on evolutionary distances between orthologous proteins in pairs of closely related organisms [1,2,3,4,5,6,7]. When functional (transcriptomic and proteomic) data were available, protein abundance or expression level was found to be the strongest correlate for protein conservation, suggesting that the physics of protein folding, and in particular, the probability of misfolding could be among the most important factors limiting protein variability during evolution [8, 9]. On the opposite side of the evolutionary conservation range, very fast sequence divergence was associated with evolution driven by positive selection [10, 11], often limited to specific regions or sites within proteins or acting for relatively short periods of time [12,13,14].

High protein variability and evolutionary fluidity appear to be often associated with the protein’s role in various biological conflict scenarios [15, 16], sometimes serving as a hallmark for the discovery of novel defense and offence systems in prokaryotes [17, 18]. Discovery of multiple diversity-generating mechanisms [19,20,21], which target gene regions that need to adapt particularly rapidly, underscore the importance of this phenomenon.

Quantification of sequence variability is a non-trivial task. Measures based on the distribution of amino acids in alignments (from the number of different characters to the Shannon entropy of an alignment column) do not take into account amino acid properties, effectively assigning the same weight to all mismatches. Measures based on explicit evolutionary reconstructions (tree distances and numbers of mutational events) are highly computationally expensive and require a careful choice of the evolutionary model [22,23,24,25]. Previously, we described a site homogeneity measure [26] that provides a compromise, taking into account an amino acid similarity matrix and using sequence weights to mitigate the effect of uneven distribution of sequences across the range of phylogenetic distances.

Evolutionary distances themselves or homogeneity, used as their proxy, estimate sequence conservation in absolute terms. If different alignments need to be compared to each other, it is important to keep the context as uniform as possible (i.e., using alignments representing comparable evolutionary depth) or to find a way to take the context into account explicitly. Here we suggest a measure of protein variability in the context of alignments of clade-specific orthologs and survey the distribution of the estimated variability in several selected lineages of archaea and bacteria. We explore the genomic features associated with protein variability and investigate gene families with unusual patterns of sequence variation.


Estimation of gene variability

We selected 8 taxonomically diverse lineages of archaea and bacteria at genus or family level, with 30–60 genomes in each. Namely, archaea: Haloferacales, Sulfolobales, Thermococcales, and Methanosarcinales; bacteria: Flavobacteriales, Deinococcales, Paenibacillus, and Rhodococcus. The majority of these lineages include at least one representative amenable to genetic manipulation [27,28,29,30,31], facilitating future validation of functional predictions. For each of these sets of genomes we built clade-specific csCOGs (see Methods for details). Phyletic patterns of these csCOGs, along with the genome tree, were then used to reconstruct the history of gene gains and losses for each csCOG. Multiple protein sequence alignments of all csCOGs were constructed; for alignments containing 8 or more non-identical sequences, the homogeneity values were calculated for each alignment column. This data was used to obtain csCOG- and position-specific variability estimates that relate the csCOG-specific or position-specific homogeneity to the mean across the clade (Fig. 1, Additional file 1, see Methods for details). The distributions of the variability values for all 8 lineages were closely similar (Fig. 2), suggesting that these values indeed are comparable across lineages. csCOGs (or individual alignment positions) with the relative variability \(v<0.5\) were classified as conserved, and those with relative variability \(v>2\) were classified as variable.

Fig. 1
figure 1

Pipeline for protein variability analysis. Homogeneity values are calculated for each position of multiple alignments of clade-specific COG (csCOG) sequences (top left). Homogeneity profiles along the sequences are smoothed and converted to distributions of the homogeneity values (top middle). Distances between the homogeneity value distributions are used to embed csCOGs into a metric space (top right). Homogeneity values, scaled by the average homogeneity across the clade, are transformed into variabilities (bottom middle). csCOG-specific values form clade-level distributions (bottom left). Position-specific variability values allow to categorize alignment sites into conserved, intermediate, and variable; relative frequency of these classes, plotted on a simplex diagram, identifies csCOG with unusual conservation patterns (bottom right)

Fig. 2
figure 2

Distribution of variability values across clade-specific COGs. Gaussian kernel-smoothed probability density functions for variability values in clade-specific pangenomes (plots for eight clades are shown). Threshold values for conserved (variability v < 0.5), intermediate (0.5 < v < 2), and variable (v > 2) csCOGs are indicated

Gene features defining variability

The first question we addressed was to what extent csCOG variability could be explained by a combination of features that are expected to affect or correlate with protein evolution rate. For this analysis, the following features were chosen: membership in bacterial (320 prokaryotic COGs) or archaeal core (218 arCOGs [32]); inferred number of gains in the history of the csCOG; number of paralogs in the csCOG, ancestrality of the csCOG relative to the clade (categorized as ancestral, intermediate and terminal branch acquisition), presence of transmembrane segments (categorized as present if predicted for at least 1/3 of proteins in a csCOG), presence of signal peptide (categorized as present if predicted for at least 1/3 of proteins in a csCOG), fraction of low complexity regions (average across the csCOG members), fraction of microsatellite-like repeats in the respective genes (average across the csCOG members) and functional classification into one of the 21 COG functional groups (Additional file 1: Table S1). Only up to 50% of the variance in the csCOG variability estimates could be explained from all these features combined (Fig. 3A). Next, we examined the correlations between variability and each individual feature. The number of paralogs showed the strongest correlation, explaining from 19 to 30% of the variability values variance, followed by gene gain rate and functional classification (Fig. 3A). Surprisingly, average low complexity masking fraction and microsatellite-like repeats fraction only weakly correlated with variability, comparable with the weak correlation observed for membrane proteins (Fig. 3A).

Fig. 3
figure 3

Association of protein variability with other genomic and biological features. A Fraction (in percent) of variance of protein variability explained by other properties. The “total explained” fraction is estimated using multivariable regression. The fraction explained by individual properties is estimated using ANOVA. The cells, corresponding to properties, excluded by Akaike criterion based stepwise reduction of multivariable regression model, are shaded in gray. B Average variability of subsets of genes categorized by other properties. C Average variability of subsets of genes categorized by COG functional categories. Functional categories are the following: J—Translation, ribosomal structure and biogenesis; K—Transcription; L—Replication, recombination and repair; D—Cell cycle control, cell division, chromosome partitioning; V—Defense mechanisms; T—Signal transduction mechanisms; M—Cell wall/membrane/envelope biogenesis; N—Cell motility; W—Extracellular structures; O—Posttranslational modification, protein turnover, chaperones; X—Mobilome: prophages, transposons; C—Energy production and conversion; G—Carbohydrate transport and metabolism; E—Amino acid transport and metabolism; F—Nucleotide transport and metabolism; H—Coenzyme transport and metabolism; I—Lipid transport and metabolism; P—Inorganic ion transport and metabolism; Q—Secondary metabolites biosynthesis, transport and catabolism; R—General function prediction only; S—Function unknown; Color scale from blue to red is proportional to the value

Although this trend is common for most lineages, the strength of association with variability for some of the features varied substantially. For example, presence of a transmembrane segment explained less than 1% of variability value variance in Flavobacteriales and Deinococcales, but 14% in Thermococcales, and gene gain rate explained 2% of the variance in Sulfolobales but 15% in Deinococcaceae; there were more examples of contrasting associations like this (Fig. 3A).

To gain additional information on differences in variability with respect to the above features of protein families, we analyzed distinct subsets of csCOGs, grouped by each feature separately. To this end, we computed mean variability for each subset and estimated the statistical significance of the differences of variability between the analyzed subsets for each value using ANOVA (Fig. 3B). All the differences were significant (p value < 0.01). Specifically, paralogy numbers were separated into three bins (1–1.25—low; 1.25–3—medium, > 3—high). As expected, mean variability increased with the increase of the number of paralogs in almost all lineages. In 4 lineages, the high-paralogy subset of the csCOGs showed mean variability twofold higher than the clade-specific average. The same trend was observed for three bins of gene gain rate (0–0.5—low; 0.5–2—medium and > 2—high) and ancestrality measure (ancestral vs all other) although the association with variability was weaker for most of the lineages compared with that for the number of paralogs. The association with variability was comparable and weak for secreted and membrane proteins, consistent with previous observations indicating that membrane and surface proteins generally evolve faster than soluble ones [33]. Perhaps surprisingly, archaea have slightly more variable secreted and membrane proteins than bacteria. As expected [8], core genes are significantly less variable than non-core ones and, as a group, show the lowest mean variability among all analyzed cohorts.

We next analyzed mean variability for the 21 functional categories of genes assigned by comparing the csCOGs to prokaryotic COGs and, for archaea, to arCOGs (Fig. 3C). The resulting estimates were qualitatively closely similar to those obtained by analysis of genome flux data for more closely related subsets of bacterial and archaeal genomes [34]. Specifically, the categories X (mobilome), V (defense and offense systems), M (cell wall/membrane/envelope biogenesis) and N (cell motility) tend to have higher mean variability values, whereas categories J (translation, ribosomal structure and biogenesis) and F (nucleotide transport and metabolism) have the lowest values (Fig. 3C).

Despite some strong associations described above, each feature showed high dispersion of variability values (Fig. 4). Among core and ancestral families, there are highly variable ones, and conversely, there are conserved membrane proteins, secreted proteins and proteins with large fraction of low complexity or microsattelite-like regions (Additional file 1).

Fig. 4
figure 4

Multidimensional scaling analysis of variability values and selected features. Homogeneity distribution density was calculated for each csCOG as described in Material and Methods. Classical multidimensional scaling (cmdscale function in R) was applied to visualize the relationship between csCOGs. Hellinger distance (one of the conceptually simplest distance measures which is also symmetrical and metric) was used to quantify the similarity between each two probability distributions. Results for the first two dimensions were used to construct plots. Variability of the data points are shown as follows: Conserved (0–0.5): light blue; medium (0.5–2.0): light gray; variable (> 2.0)” dark blue. The following features are overlayed onto points: presence in the set of core genes—red dots; high gain rate (> 2.5)—magenta dots; membrane (csCOGs with the average fraction of proteins with predicted transmembrane segments > 0.333)—dark green dots; secreted (csCOGs with the average fraction of proteins with signal peptide > 0.333), microsatellite like regions (the average fraction of protein sequences in the csCOG identified >= 0.15)—orange dots; high paralogy (> 2.0)—dark gray dots

Protein families enriched in variable csCOGs

Based on assignments of variable csCOGs to prokaryotic COG families, we estimated abundance of the COG families in the csCOGs (Fig. 5). About 40 to 60% of the variable csCOGs were found to be unique to the respective lineage, whereas the remaining ones were assigned to prokaryotic COGs that are represented in at least one other bacterial or archaeal lineage, including conserved COGs those that are present in 7 or even all 8 groups analyzed here (Table 1). As expected, csCOGs assigned to these families typically have many paralogs and a high gain rate (Additional file 1). Furthermore, we also observed substantial variation of the variability estimates for csCOGs that are assigned to the same prokaryotic COG. Many of such paralogs are not variable, but moderately or even strongly conserved (Table 1, Additional file 1), suggesting functional diversification and/or structural flexibility within the variable csCOGs. Indeed, many of these csCOGs consist of enzymes with diverse and broad specificities, such as RimI-like N-acetyltransferases, COG0456 [35] and class I UbiE/MenG-like methyltransferases, COG2226 [36, 37]. Some of these diverse functions are associated with small molecule modification pathways that are involved in xenobiotic detoxification, production of virulence factors or toxins, or other defense and offence mechanisms, functions that are enriched among variable families (see above).

Fig. 5
figure 5

Breakdown of high variability protein families by presence in 1–8 other analyzed lineages. Numbers on the plot indicate the actual number of csCOGs with high variability (> 2.0) that are present in the given number of genomes; the plots for each family are scaled to 100%

Table 1 COGs that are among hypervariable families among both bacteria and archaea

Three of these variable prokaryotic COGs belong to distinct families of glycosytransferases, WcaA, WcaE, (GT-A fold, GT-2 family, COG0463 and COG1216 respectively) and RfaB (GT-B fold, GT-1 family, COG0438) (Table 1). These enzymes are among the most diverse in prokaryotes and catalyze transfer of various sugar moieties from activated donors to acceptor molecules, forming glycoside bonds [36, 37]. Glycosytransferases are typically associated with other genes encoding enzymes involved in cell wall biosynthesis and surface proteins glycosylation. csCOGs that are assigned to these glycosytransferase families also typically include many paralogs and have a high gain rate although some of them appear to be ancestral (Additional file 1). To explore the potential causes of variability in these families, we selected the csCOG sulfo9.00007 from the Sulfolobales group, which is present in all 52 genomes of this group and consists of 127 proteins (2.4 paralogs per genome on average, variability of 4.9). For all genes in this csCOG, we analyzed the genomic context and performed phylogenetic analysis that also included members of the prokaryotic COG1216 (see Material and Methods for details). Phylogenetic analysis showed that proteins from sulfo9.00007 belonged to at least 6 distinct clades (A-F), but because none of these genes is present in all genomes, they formed a “para-COG”, with many proteins of different origins ending up in their respective genomes as a result of multiple events of gene displacement by distant homologs (xenologs) (Fig. 6 and Additional file 2: Fig. S1). For example, Sulfolobus JCM 16833 lost the clade E gene, but possesses A and F clade genes instead. Generally, genes of all clades except clade E show an extremely patchy distribution in the Sulfolobales genomes and are encoded in different neighborhoods suggesting that most loci encoding sulfo9.00007 genes are hot spots of gene shuffling and recombination. It appears that horizontal gene transfer, and in particular, xenologous gene displacement, is also responsible for the high variability of other csCOGs with multiple paralogs that are predicted to be ancestral based on their phyletic pattern. Functionally, this frequent gene exchange might be relevant for changing the glycosylation pattern of surface proteins to avoid virus attachment as well as other variations related to biological conflicts.

Fig. 6
figure 6

Evolutionary history of sulfo9.00007 family of WcaE-like glycosyltransferases. The neighborhood of all genes from sulfo9.00007 are mapped to 16S rRNA tree of Sulfolobales genomes analyzes in this work. For each gene neighborhood, the genbank accession and coordinates of the locus are indicated on the right. Genes are shown by block arrows, roughly to scale. csCOG number is indicated for all genes and follow gene name (if available). For the genes that are in respective arCOGs the cluster number corresponds to the respective arCOG number. Memebers of sulfo9.00007 are colored by blue shades according to phylogenetic analysis of WcaE-like glycosyltransferases (clades A-E, Additional file 3: Fig. S2). Other glycosyltransferases assigned to COG1216, but not to sulfo9.00007 are shown by blue outline. Closest most frequent gene neighbors are shown by yellow (FabG) and pink (WsaA)

Identification of variable regions in multiple alignments

All estimates described above were based on average variability values calculated for the complete multiple alignment of each csCOG. It is expected, however, that in some proteins, different regions or domains evolve at substantially different rates. To visualize the fraction of positions in multiple alignments with different variability values, we plotted the fractions of conserved, medium and variable positions for each csCOG (Additional file 3: Fig. S2). These plots reveal csCOGs with the unusual prevalence of highly conserved and highly variable positions, but with relatively scarce medium variable positions. We analyzed in detail multiple alignments of several of these csCOGs, focusing on those that are ancestral with few paralogs (Table 2). There are only a few such csCOGs in most of the studied groups of organisms, and Sulfolobales have none. These csCOGs differed among lineages, the only exception being MutL which made the list in both Halobacteria and Paenibacillus. Most of the respective csCOG are ancestral, and many have important and even essential house-keeping functions (Table 2, Additional file 1). Visual examination of the identified variable regions showed that many of them contained variable-length runs of the same amino acid or short repeats and multiple insertions-deletions (Additional file 4: Fig. S3). In order to characterize these regions in greater detail, we performed additional analyses focusing on 34 ancestral families, in which the region of variability was maintained throughout the evolution of an entire lineage. First, we checked whether the respective genes contained an increased fraction of microsatellite-like regions, which might be responsible for polymerase slippage and tandem repeat genome instability [38]. The results of this analysis demonstrate considerable heterogeneity of the average fraction of such regions in these proteins, ranging from none to one third, with the average of 8% across the 34 csCOGs (Additional file 1). Such variation implies that different processes likely contribute to the high variability of these regions. Principal Component Analysis of amino acid frequencies in variable and conserved positions of these csCOGs showed that variable regions are enriched in proline, serine, threonine, aspartate and glutamate (Additional file 5: Fig. S4), that is amino acids with a low propensity for secondary structure formation, suggesting that these regions are unstructured. Indeed, using IUpred [39], all variable regions were predicted to be structurally disordered (Additional file 4: Fig. S3). Function of any of these disordered regions is not known. As could be expected, in protein structures that were available for members of 14 csCOGs in this set, the variable regions either formed insertions or terminal regions that were either unresolved/disordered (as in RpoD), or the structure was solved for separate domains of the protein containing a variable region (for example, MutL) that are connected by a supposedly disordered variable linker, or the structure was solved for homologs that lacked the variable region (for example, Rho and FtsY).

Table 2 Protein families with high fraction of conserved and variable positions

We further sought to determine whether the variable protein regions were specific to the respective lineages or originated earlier during evolution. To this end, we ran PSI-BLAST against a collection of prokaryotic genomes and visually examined the outputs. In 19 of the 34 analyzed csCOGs, the variable regions originated prior to the appearance of the respective lineage whereas in the remaining 12 cases, these regions seemed to be lineage-specific (Table 2). In many families, the variable regions were found to be absent in orthologous proteins from other lineages. Examples include MutL (no variable region in Deinococcus/Thermus bacteria), SecG (no variable tail in Deinococcus/Thermus and Firmicutes bacteria), Rho and FtsY (no variable region in Proteobacteria) and more (Additional file 4: Fig. S3). These observations indicate that variable regions appear in different lineages of prokaryotes and persist in these for considerable evolutionary time but are dispensable in other lineages. In three cases, however, the observed variability of protein regions appears to be due to other causes (Table 2). In one case, erroneous prediction of the ORF start resulted in caused an artifactual high variability value (deino9.00350); another variable region is located in the region of intein insertion in several DnaB proteins from Haloferacales; and finally, in one case (Ribosomal protein S14 in Deinococcus), the variable region apparently resulted from xenologous gene displacement (Table 2).

Variability and protein function

High protein variability poorly correlates with csCOGs functional categories, which makes it a weak predictor of protein function although variability can be considered an additional, indirect indication, along with other lines of evidence, such as suggestive genomic context, for such functions as mobilome (X), defense (V) and cell motility (N) (Fig. 3). However, both high and low variability assignments can be helpful in functional analysis of ancestral protein families. Specifically, high variability might indicate subfunctionalization or neofunctionalization of a paralog. For example, among variable proteins in Sulfolobales, there is a tRNA splicing endonuclease SEN2 (sulfo9.01015), which is present in 51 of the 52 genomes in this lineage and belongs to COG1676. This variable protein has a slowly evolving paralog (sulfo9.00331), which is encoded in all these genomes (Additional file 1). The proteins have been studied experimentally, and it has been shown that tRNA splicing endonuclease SEN2 in Sulfolobus is a heterodimer, in which one subunit is inactivated and poorly conserved, and that both are required for the enzyme function, a characteristic case of subfunctionalization [40]. Four paralogs of CdvB/ESCRTIII family (Additional file 6: Table S2) in Sulfolobales are another example of potential subfunctionalization. All these proteins can form filaments [41] but only two (sulfo9.01480 and sulfo9.00714) are essential in Sulfolobus islandicus [42], and only one of these has been experimentally shown to be recruited by CdvA [43]. Thus, the actual function of three of the four paralogs remains unclear. Other prominent examples for both archaea and bacteria are listed in Additional file 6: Table S2, and their functional specialization could be of interest for future experimental studies.

Low variability, along with the presence of a gene in a high fraction of genomes in a large group of organisms, appears to be an important indicator of gene essentiality. Based on the observations above and previous analyses [44], xenologous gene displacement and acquisition of additional paralogs substantially contribute to the observed variability of protein families. Slowly evolving genes are expected to be least prone to displacement by genes from distant species. Indeed, using linear discriminant analysis, we found that variability and fraction of genomes carrying the gene are the two variables that provide the best prediction of essentiality based on the transposon mutagenesis in S. islandicus [42], with the peak performance of 66.8% true predictions (Additional file 7: Fig. S5). Thus, using these variables, it is possible to identify numerous uncharacterized protein families that are expected to be important and possibly essential for the respective organisms. Table 3 lists 5 such families for each lineage. Two of these families in Sulfolobales were indeed found to be essential in S. islandicus [42]. Furthermore, csCOGs corresponding to two uncharacterized families (DUF424 and DUF555 in the PFAM database) were independently identified as low variability in Thermococcales and Methanosarcinales. According to the arCOG database, DUF424 (arCOG04051) is present in the majority of archaea and DUF555 (arCOG02119) is found in most euryarchaea, which is compatible with the essentiality of these genes. In Methanosarcinales, two families DUF2112 and DUF2102, annotated as methanogenesis markers 5 and 6, respectively, form a conserved operon, which is highly specific to methanogenic archaea. Other uncharacterized families (DUFs) conforming to these two criteria were also identified in bacteria (Table 3). Additionally, several csCOGs were conserved in a narrower group of organisms and are not assigned to any family in the current CDD database (Table 3). These include deino9.00587, deino9.00288 and deino9.01656, which are also shared with Thermus, and are likely to be important in most bacteria of the Thermus/Deinococcus lineage. Notably, none of the proteins that considered to be determinants of radiation resistance specific for the Deinococcus genus, were in this list [45,46,47]. Four families that are implicated in radiation resistance and satisfy the two criteria of essentiality are RecA recombinase, Holliday junction resolvasome helicase RuvB, radiation response regulon transcription factor DdrO, Excinuclease ATPase subunit UvrA, RecO and RecF recombination proteins are all common in other bacteria (Additional file 1) [48]. Thus, Deinococcus-specific protein families contributing to radiation resistance could be dispensable under standard growth conditions, which is indeed the case for several of these families where experimental data is available [49, 50], and additionally, most of these genes are not found in all Deinococcus species (Additional file 1).

Table 3 Selected functionally uncharacterized protein families with low variability and presence in 85% or more genomes in respective lineage


In this work, we developed a quantitative measure of sequence variability in protein families and investigated the connections between variability and various genomic and biological features. Overall, the association of variability with other genomic features follows the expected trends that were previously established in other contexts [8, 9]. Approximately half of the variance in variability values can be explained by the analyzed features, of which gene paralogy is most impactful. Correlation between paralogy and variability likely comes from acquisition of distant paralogs and xenologs introducing sequence variants that are more distant than those that could have evolved in situ during the lifetime of the clade. Notably, more than 50% of the highly variable (\(V>2\)) csCOGs in each clade have homologs in at most one other clade of the 8 analyzed, and more than half (872 out of 1732) of the non-ancestral highly variable csCOGs have more than 1.25 paralogs per genome (Fig. 5, Additional file 8: Table S3). These observations suggest that HGT is a major evolutionary force that shapes the distribution of family-level variability in prokaryotic genomes.

At the level of individual alignments, the distribution of variability across the alignment columns is typically smooth and centered around a value characteristic of the given csCOG. Protein families that combine low-variability and high-variability regions within the same alignment are relatively rare, with highly variable segments often located in indel-rich regions. Such regions are typically lineage-specific and often are completely absent in orthologs from other taxa. The apparent high density of indels also makes alignment reconstruction locally uncertain even between closely related organisms, obscuring the differences between substitution- and indel-generated diversity. Microsatellite-like and low-complexity regions only weakly correlate with protein family variability, suggesting that polymerase slippage is not the major mechanism generating variability at the individual protein level.

Comprehensive analysis of evolutionary regimes requires careful phylogenetic reconstruction, is subject to constraints on evolutionary distances, alignment quality, and confidence in ortholog detection, and could be highly sensitive to evolutionary rate variability between lineages. Here we show that csCOG-level variability estimates can serve as the first approximation for the relative evolutionary rate and appear to be useful in partitioning genome-scale datasets according to sequence conservation as well as for identification of essential genes and subfunctionalized paralogs.


Genome sets and genome phylogeny

Genome assemblies were downloaded from Genbank (Additional file 9: Table S4). The 16S rRNA sequences were aligned using MUSCLE [51] and the tree was reconstructed using FastTree [52] with the GTR evolutionary model, and discrete gamma model with 20 rate categories; the tree topology was used as the proxy for the genome history.

Construction of clade-specific clusters of orthologous genes (csCOGs)

Initial clusters of protein sequences were obtained using MMSEQS2 [53] with the similarity threshold of 0.5. Multiple alignments of cluster members were generated using MUSCLE [51] and compared to each other using HHSEARCH [54]. Clusters that aligned to each other along most of the protein lengths (HHSEARCH hit covering ≥ 75% of the cluster consensus length) were merged using HHALIGN [53]. Approximate maximum-likelihood phylogenetic trees were built for each merged cluster using FastTree [52] with WAG evolutionary model, and gamma-distributed site rates. Trees were parsed into subtrees that maximize the tradeoff between the number of paralogs and the representation of genomes. Formally, within a tree including leaves coming from \(S\) different genomes, a clade that contains \({P}_{C}\) leaves from \({S}_{C}\) different genomes is defined to have the paralogy ratio of \({P}_{C}/{S}_{C}\) and genome coverage ratio of \({S}_{C}/S\). The clade with the maximum coverage-paralogy tradeoff index \({S}_{C}^{2}/({P}_{C}S)\), if distinct from the tree root, is considered a csCOG and is removed from the tree, after which the procedure is repeated with the pruned tree until convergence.

Protein sequence analysis and phylogenetic reconstruction

Multiple sequence alignment of prokaryotic COG [48] and Pfam [55] profiles in the CDD database (as of 2019) were used as queries for Position-Specific Iterated BLAST program [56]. The search against the database, consisting of proteins sequences encoded in our set of genomes, was run at e-value cutoff of 0.0001; the best hits were used to annotate the sequences. Membrane proteins were predicted using TMHMM [57], secreted proteins using SignalP [58], low complexity regions were identified using SEGmasker program [59]. Disordered loops in proteins were predicted using IUPred2A [39]. When the entire csCOG, rather than an individual protein, needed to be characterized by a particular feature (e.g., prevalence of transmembrane segments or signal peptides), the fraction of proteins with this feature was calculated. Multiple alignments for selected csCOGs were generated using MUSCLE [51]. Approximate maximum-likelihood phylogenetic unrooted tree was built for each alignment using FastTree with JTT evolutionary model, and 20 discrete rate categories [52].

Evolutionary history reconstructions

The binary phyletic pattern of csCOGs (presence-absence of the given gene across the species) within each lineage was analyzed using GLOOME [60]. Differences of posterior probabilities of ancestral presence between the parent and descendant nodes of ≥ 0.5 were interpreted as either gains or losses depending on the sign. At least one gain event was detected for an overwhelming majority of the extant genes. Genes with the posterior probability ≥ 0.5 at the tree root were classified as ancestral; many genes were gained multiple times in the history of a given csCOG (in some cases, re-acquired after a loss). The rare exceptions are those genes for which the phyletic pattern did not allow a specific gain point to be inferred. The total number of gains and losses in a csCOG history, regardless of their precise location, was estimated as the sum of positive and negative differences of ancestral posterior probabilities, respectively. One of the important characteristics of a csCOG history is whether it is inferred to be ancestral in the given clade or to have been acquired later in the history of the corresponding group of genomes.

Protein variability estimation

For each csCOG alignment with at least 8 non-identical protein sequences and at least 60 aligned columns (excluding singular insertions), homogeneities of all alignment columns were calculated ([26] and Additional file 2). Specifically, all sequences in an alignment of \(N\) sequences were assigned equal weights \({w}_{i}=1/N\). Next we introduce an amino acid score against an alignment column; for any given amino acid \(x\), \({Q}_{x}=\sum_{i=1}^{N}{w}_{i}{S}_{{a}_{i},x}\), where \({a}_{i}\) is the amino acid in the ith sequence of the column and \({S}_{{a}_{i},x}\) is the score for amino acids \({a}_{i}\) and \(x\) according to the chosen pairwise score matrix (here BLOSUM62 [61]). The amino acid \(c\), satisfying the \(c=\underset{x}{\mathrm{argmax}}{Q}_{x}\), is selected as the effective consensus amino acid for this alignment position (i.e. the one which is most similar to the assortment of amino acids in the column). To calibrate the consensus score \({Q}_{c}\), the expectation of the score is calculated, comparing the alignment column against a random assortment of amino acids, \({Q}_{R}=\sum_{b}{f}_{b}{Q}_{b}\), where \({f}_{b}\) represents relative frequencies of amino acids across the entire protein database (frequencies summing up to 1). The homogeneity of an alignment column is calculated as \(h=\mathit{max}(\frac{{Q}_{c}-{Q}_{R}}{{S}_{c,c}-{Q}_{R}},0)\). When gaps were present in the column, the homogeneity was calculated using the scores for non-gap characters. The homogeneity measure \(h\) is confined within the range \(0\le h\le 1\), where a random column has homogeneity of 0 and a column containing an invariant amino acid has homogeneity of 1.

The (arithmetic) mean homogeneity values were calculated across all alignments in the given clade \({h}_{T}\) and across each csCOG alignment \({h}_{C}\). Relative variability of a csCOG was calculated as \({v}_{C}=\frac{(1-{h}_{C}){h}_{T}}{(1-{h}_{T}){h}_{C}}\), valid under the reasonable assumptions of \({0<h}_{C}\le 1\) and \({0<h}_{T}<1\) (at least some alignment columns in any given csCOG match better than expected by pure chance and at least some alignment columns across all csCOGs are less than perfectly homogeneous). This transformation places variability in the range of \(0\le {v}_{C}\le \infty\), and a csCOG with \({h}_{C}={h}_{T}\) would have \({v}_{C}=1\) (that is, a csCOG with mean homogeneity equal to the clade-wide mean has a relative variability of 1). The same calculation can be performed for each individual alignment column with \(h>0\), obtaining the position-specific variability estimate (columns with \(h=0\) can be assigned arbitrarily high variability value).

To obtain the csCOG-specific distributions of homogeneity values, first, the homogeneity profile along the sequence was smoothed using a Gaussian kernel with bandwidth of \(b=20\) (\({h}_{i}={\sum }_{j}{h}_{j}{K}_{i,j}/{\sum }_{j}{K}_{i,j}\) where \({K}_{i,j}=\mathrm{exp}({(\left(k-i\right)/b)}^{2}\) for homogeneity in the ith position). Then, these values were used to evaluate the probability density function (p.d.f.) for 101 points in the range of \(0\le h\le 1\) [62]. The Hellinger distances between all pairs of distributions were calculated using these p.d.f. estimates (Additional file 2). These distances were embedded into a 2-dimensional plane using Classical Multidimensional Scaling (cmdscale function in R).

Search for microsatellite-like regions in protein coding sequences

Microsatellite-like regions (MSRs) were identified in protein coding nucleotide sequences using the compositional order approach by detecting irregular recurrences of short k-mers [63,64,65]. In random sequences, the probability of identifying a motif of length k that recurs more than n time in a sequence of length L is determined by the binomial distribution, \(P\left(L,p;\ge n\right)=\sum_{i=n}^{L-k+1}\left(\genfrac{}{}{0pt}{}{L}{i}\right){p}^{i}{(1-p)}^{L-i}\), where p is the probability of selecting a k-mer over an alphabet A, such that \(p=1/{A}^{k}\). Here, we define non-random recurrences of a k-mer as those that recur with \(P<{10}^{-6}\), locally (that is within a window of 1000 characters). Thus, for example, for nucleotide sequences (\(A=4\)), and using \(k=6\) (hexamers), this definition translates into the search of non-random recurrence of hexamers that occur at least 6 times within 1000 bp, at least 5 times within 500 bp, at least 4 times within 200 bp and at least 3 times within 80 bp.

To extract MSR, we identify all the non-random locations of all k-mers, allowing motifs to overlap, and define MSR as the coverage of non-random recurrences with interval distance between consecutive recurrences of the same motif (\(I\)) smaller than the motif length \(k\) (i.e., \(I\le k\)). For example, using \(k=6\) in the sequence AAAAAAAAA, the hexamer AAAAAA recurs 4 times, with \(I=1\) between consecutive recurrences, capturing runs of nucleotides. Similarly, in the sequence ATATATATATA the dinucleotide tandem repeats are captured by the hexamers ATATAT and TATATA, each recurring 3 times, with distance interval \(I=2\) (recurring twice each), and so forth up to \(I=6\). To ensure that all non-random patterns are identified, this procedure is done for all \(k\) up to 6 (i.e., \(k=1..6\)). MSR regions that are separated from each other by less than k, are merged into a single region. Using this definition, MSRs in nucleotide sequences include the conventionally defined regions of microsatellite instability (i.e., tracks of units composed of a few bp, typically 1–5 bp). The MSR and LCR measures are correlated, but distinct across the csCOGs (Additional file 10: Table S5).

Statistical analysis

Each csCOG within each lineage was classified with respect to several quantitative or qualitative features (see the list of features in Additional file 1: Table S1). Variability was analyzed as a quantitative (real number continuous) variable; the rest were represented as categorical variables. Associations between variability and each of the other features were analyzed using the ANOVA test for the distribution of variability of csCOGs within the categories and between the categories; the significance of the association was estimated using the F-statistics (ratio of between- to within-group variances); the strength of association was estimated as the relative decrease of the total variance due to grouping csCOGs into feature categories. The total explanatory power of all features was estimated as the unadjusted \({R}^{2}\) of the generalized linear model, predicting the variability of a csCOG given the categorical values of all 9 features using the lm function in R. Each model was subject to stepwise reduction using the step function in R, which attempts to remove low-impact explanatory variables based on the Akaike Information Criterion; successful reduction attempts were reported.

Availability of data and materials

Additional file 1: Raw data used for analyses in this work. Available at: Additional file 2: Code to calculate homogeneity, variability and the Hellinger distance between homogeneity distributions. Available at: csCOG data is available by request from authors.


  1. Rizzato F, Zamuner S, Pagnani A, Laio A. A common root for coevolution and substitution rate variability in protein sequence evolution. Sci Rep. 2019;9(1):18032.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  2. Echave J, Wilke CO. Biophysical models of protein evolution: understanding the patterns of evolutionary sequence divergence. Annu Rev Biophys. 2017;46:85–103.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  3. Zhang J, Yang JR. Determinants of the rate of protein sequence evolution. Nat Rev Genet. 2015;16(7):409–20.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  4. Serohijos AW, Rimas Z, Shakhnovich EI. Protein biophysics explains why highly abundant proteins evolve slowly. Cell Rep. 2012;2(2):249–56.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  5. Lobkovsky AE, Wolf YI, Koonin EV. Universal distribution of protein evolution rates as a consequence of protein folding physics. Proc Natl Acad Sci U S A. 2010;107(7):2983–8.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  6. Wolf YI, Novichkov PS, Karev GP, Koonin EV, Lipman DJ. The universal distribution of evolutionary rates of genes and distinct characteristics of eukaryotic genes of different apparent ages. Proc Natl Acad Sci U S A. 2009;106(18):7273–80.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  7. Novichkov PS, Omelchenko MV, Gelfand MS, Mironov AA, Wolf YI, Koonin EV. Genome-wide molecular clock and horizontal gene transfer in bacterial evolution. J Bacteriol. 2004;186(19):6575–85.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  8. Wolf YI, Carmel L, Koonin EV. Unifying measures of gene function and evolution. Proc Biol Sci. 2006;273(1593):1507–15.

    CAS  PubMed  PubMed Central  Google Scholar 

  9. Drummond DA, Wilke CO. Mistranslation-induced protein misfolding as a dominant constraint on coding-sequence evolution. Cell. 2008;134(2):341–52.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  10. Lannergard J, Kristensen BM, Gustafsson MC, Persson JJ, Norrby-Teglund A, Stalhammar-Carlemalm M, Lindahl G. Sequence variability is correlated with weak immunogenicity in Streptococcus pyogenes M protein. Microbiologyopen. 2015;4(5):774–89.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  11. Lemey P, Rambaut A, Pybus OG. HIV evolutionary dynamics within and among hosts. AIDS Rev. 2006;8(3):125–40.

    PubMed  Google Scholar 

  12. Marchi J, Lassig M, Mora T, Walczak AM. Multi-lineage evolution in viral populations driven by host immune systems. Pathogens. 2019;8(3):115.

    PubMed Central  Article  Google Scholar 

  13. Luksza M, Lassig M. A predictive fitness model for influenza. Nature. 2014;507(7490):57–61.

    CAS  PubMed  Article  Google Scholar 

  14. Wolf YI, Viboud C, Holmes EC, Koonin EV, Lipman DJ. Long intervals of stasis punctuated by bursts of positive selection in the seasonal evolution of influenza A virus. Biol Direct. 2006;1:34.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  15. Burroughs AM, Aravind L. Identification of uncharacterized components of prokaryotic immune systems and their diverse eukaryotic reformulations. J Bacteriol. 2020;202(24):e00365-20.

    PubMed  PubMed Central  Article  Google Scholar 

  16. Zhang D, de Souza RF, Anantharaman V, Iyer LM, Aravind L. Polymorphic toxin systems: comprehensive characterization of trafficking modes, processing, mechanisms of action, immunity and ecology using comparative genomics. Biol Direct. 2012;7:18.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  17. Makarova KS, Wolf YI, Karamycheva S, Koonin EV. A unique gene module in Thermococcales Archaea centered on a hypervariable protein containing immunoglobulin domains. Front Microbiol. 2021;12:721392.

    PubMed  PubMed Central  Article  Google Scholar 

  18. Makarova KS, Wolf YI, Koonin EV. Comprehensive comparative-genomic analysis of type 2 toxin-antitoxin systems and related mobile stress response systems in prokaryotes. Biol Direct. 2009;4:19.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  19. Roux S, Paul BG, Bagby SC, Nayfach S, Allen MA, Attwood G, Cavicchioli R, Chistoserdova L, Gruninger RJ, Hallam SJ, et al. Ecology and molecular targets of hypermutation in the global microbiome. Nat Commun. 2021;12(1):3076.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  20. Yi X, Kazlauskas R, Travisano M. Evolutionary innovation using EDGE, a system for localized elevated mutagenesis. PLoS ONE. 2020;15(4):e0232330.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  21. Medhekar B, Miller JF. Diversity-generating retroelements. Curr Opin Microbiol. 2007;10(4):388–95.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  22. Keane TM, Creevey CJ, Pentony MM, Naughton TJ, McLnerney JO. Assessment of methods for amino acid matrix selection and their use on empirical data shows that ad hoc assumptions for choice of matrix are not justified. BMC Evol Biol. 2006;6:29.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  23. Lemmon AR, Moriarty EC. The importance of proper model assumption in bayesian phylogenetics. Syst Biol. 2004;53(2):265–77.

    PubMed  Article  Google Scholar 

  24. Buckley TR. Model misspecification and probabilistic tests of topology: evidence from empirical data sets. Syst Biol. 2002;51(3):509–23.

    PubMed  Article  Google Scholar 

  25. Buckley TR, Cunningham CW. The effects of nucleotide substitution model assumptions on estimates of nonparametric bootstrap support. Mol Biol Evol. 2002;19(4):394–405.

    CAS  PubMed  Article  Google Scholar 

  26. Esterman ES, Wolf YI, Kogay R, Koonin EV, Zhaxybayeva O. Evolution of DNA packaging in gene transfer agents. Virus Evol. 2021;7(1):veab015.

    PubMed  PubMed Central  Article  Google Scholar 

  27. Heinze S, Kornberger P, Gratz C, Schwarz WH, Zverlov VV, Liebl W. Transmating: conjugative transfer of a new broad host range expression vector to various Bacillus species using a single protocol. BMC Microbiol. 2018;18(1):56.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  28. Leigh JA, Albers SV, Atomi H, Allers T. Model organisms for genetics in the domain Archaea: methanogens, halophiles, Thermococcales and Sulfolobales. FEMS Microbiol Rev. 2011;35(4):577–608.

    CAS  PubMed  Article  Google Scholar 

  29. Staroscik AM, Hunnicutt DW, Archibald KE, Nelson DR. Development of methods for the genetic manipulation of Flavobacterium columnare. BMC Microbiol. 2008;8:115.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  30. Nakashima N, Tamura T. Isolation and characterization of a rolling-circle-type plasmid from Rhodococcus erythropolis and application of the plasmid to multiple-recombinant-protein expression. Appl Environ Microbiol. 2004;70(9):5557–68.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  31. Markillie LM, Varnum SM, Hradecky P, Wong KK. Targeted mutagenesis by duplication insertion in the radioresistant bacterium Deinococcus radiodurans: radiation sensitivities of catalase (katA) and superoxide dismutase (sodA) mutants. J Bacteriol. 1999;181(2):666–9.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  32. Makarova KS, Wolf YI, Koonin EV. Archaeal clusters of orthologous genes (arCOGs): an update and application for analysis of shared features between thermococcales, methanococcales, and methanobacteriales. Life (Basel). 2015;5(1):818–40.

    CAS  Google Scholar 

  33. Sojo V, Dessimoz C, Pomiankowski A, Lane N. Membrane proteins are dramatically less conserved than water-soluble proteins across the tree of life. Mol Biol Evol. 2016;33(11):2874–84.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  34. Puigbo P, Lobkovsky AE, Kristensen DM, Wolf YI, Koonin EV. Genomes in turmoil: quantification of genome dynamics in prokaryote supergenomes. BMC Biol. 2014;12:66.

    PubMed  PubMed Central  Article  Google Scholar 

  35. Favrot L, Blanchard JS, Vergnolle O. Bacterial GCN5-related N-acetyltransferases: from resistance to regulation. Biochemistry. 2016;55(7):989–1002.

    CAS  PubMed  Article  Google Scholar 

  36. Struck AW, Thompson ML, Wong LS, Micklefield J. S-adenosyl-methionine-dependent methyltransferases: highly versatile enzymes in biocatalysis, biosynthesis and other biotechnological applications. ChemBioChem. 2012;13(18):2642–55.

    CAS  PubMed  Article  Google Scholar 

  37. Schubert HL, Blumenthal RM, Cheng X. Many paths to methyltransfer: a chronicle of convergence. Trends Biochem Sci. 2003;28(6):329–35.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  38. Bichara M, Wagner J, Lambert IB. Mechanisms of tandem repeat instability in bacteria. Mutat Res. 2006;598(1–2):144–63.

    CAS  PubMed  Article  Google Scholar 

  39. Meszaros B, Erdos G, Dosztanyi Z. IUPred2A: context-dependent prediction of protein disorder as a function of redox state and protein binding. Nucleic Acids Res. 2018;46(W1):W329–37.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  40. Tocchini-Valentini GD, Fruscoloni P, Tocchini-Valentini GP. Structure, function, and evolution of the tRNA endonucleases of Archaea: an example of subfunctionalization. Proc Natl Acad Sci U S A. 2005;102(25):8933–8.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  41. Samson RY, Obita T, Freund SM, Williams RL, Bell SD. A role for the ESCRT system in cell division in archaea. Science. 2008;322(5908):1710–3.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  42. Zhang C, Phillips APR, Wipfler RL, Olsen GJ, Whitaker RJ. The essential genome of the crenarchaeal model Sulfolobus islandicus. Nat Commun. 2018;9(1):4908.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  43. Samson RY, Obita T, Hodgson B, Shaw MK, Chong PL, Williams RL, Bell SD. Molecular and structural basis of ESCRT-III recruitment to membranes during archaeal cell division. Mol Cell. 2011;41(2):186–96.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  44. Williams D, Gogarten JP, Papke RT. Quantifying homologous replacement of loci between haloarchaeal species. Genome Biol Evol. 2012;4(12):1223–44.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  45. Lim S, Jung JH, Blanchard L, de Groot A. Conservation and diversity of radiation and oxidative stress resistance mechanisms in Deinococcus species. FEMS Microbiol Rev. 2019;43(1):19–52.

    CAS  PubMed  Article  Google Scholar 

  46. Matrosova VY, Gaidamakova EK, Makarova KS, Grichenko O, Klimenkova P, Volpe RP, Tkavc R, Ertem G, Conze IH, Brambilla E, et al. High-quality genome sequence of the radioresistant bacterium Deinococcus ficus KS 0460. Stand Genomic Sci. 2017;12:46.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  47. Makarova KS, Omelchenko MV, Gaidamakova EK, Matrosova VY, Vasilenko A, Zhai M, Lapidus A, Copeland A, Kim E, Land M, et al. Deinococcus geothermalis: the pool of extreme radiation resistance genes shrinks. PLoS ONE. 2007;2(9):e955.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  48. Galperin MY, Wolf YI, Makarova KS, Vera Alvarez R, Landsman D, Koonin EV. COG database update: focus on microbial diversity, model organisms, and widespread pathogens. Nucleic Acids Res. 2021;49(D1):D274–81.

    CAS  PubMed  Article  Google Scholar 

  49. Udupa KS, O’Cain PA, Mattimore V, Battista JR. Novel ionizing radiation-sensitive mutants of Deinococcus radiodurans. J Bacteriol. 1994;176(24):7439–46.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  50. Dulermo R, Onodera T, Coste G, Passot F, Dutertre M, Porteron M, Confalonieri F, Sommer S, Pasternak C. Identification of new genes contributing to the extreme radioresistance of Deinococcus radiodurans using a Tn5-based transposon mutant library. PLoS ONE. 2015;10(4):e0124358.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  51. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32(5):1792–7.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  52. Price MN, Dehal PS, Arkin AP. FastTree 2–approximately maximum-likelihood trees for large alignments. PLoS ONE. 2010;5(3):e9490.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  53. Steinegger M, Soding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol. 2017;35(11):1026–8.

    CAS  PubMed  Article  Google Scholar 

  54. Soding J. Protein homology detection by HMM-HMM comparison. Bioinformatics. 2005;21(7):951–60.

    PubMed  Article  Google Scholar 

  55. El-Gebali S, Mistry J, Bateman A, Eddy SR, Luciani A, Potter SC, Qureshi M, Richardson LJ, Salazar GA, Smart A, et al. The Pfam protein families database in 2019. Nucleic Acids Res. 2019;47(D1):D427–32.

    CAS  PubMed  Article  Google Scholar 

  56. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389–402.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  57. Krogh A, Larsson B, von Heijne G, Sonnhammer EL. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol. 2001;305(3):567–80.

    CAS  PubMed  Article  Google Scholar 

  58. Nielsen H, Krogh A. Prediction of signal peptides and signal anchors by a hidden Markov model. Proc Int Conf Intell Syst Mol Biol. 1998;6:122–30.

    CAS  PubMed  Google Scholar 

  59. Wootton JC, Federhen S. Analysis of compositionally biased regions in sequence databases. Methods Enzymol. 1996;266:554–71.

    CAS  PubMed  Article  Google Scholar 

  60. Cohen O, Ashkenazy H, Belinky F, Huchon D, Pupko T. GLOOME: gain loss mapping engine. Bioinformatics. 2010;26(22):2914–5.

    CAS  PubMed  Article  Google Scholar 

  61. Henikoff S, Henikoff JG. Performance evaluation of amino acid substitution matrices. Proteins. 1993;17(1):49–61.

    CAS  PubMed  Article  Google Scholar 

  62. Parzen E. On estimation of a probability density function and mode. Ann Math Stat. 1962;33(3):1065–76.

    Article  Google Scholar 

  63. Persi E, Prandi D, Wolf YI, Pozniak Y, Barnabas GD, Levanon K, Barshack I, Barbieri C, Gasperini P, Beltran H, et al. Proteomic and genomic signatures of repeat instability in cancer and adjacent normal tissues. Proc Natl Acad Sci U S A. 2019;116(34):16987–96.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  64. Persi E, Wolf YI, Koonin EV. Positive and strongly relaxed purifying selection drive the evolution of repeats in proteins. Nat Commun. 2016;7:13570.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  65. Persi E, Horn D. Systematic analysis of compositional order of proteins reveals new characteristics of biological functions and a universal correlate of macroevolution. PLoS Comput Biol. 2013;9(11):e1003346.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  66. Felicori L, Jameson KH, Roblin P, Fogg MJ, Garcia-Garcia T, Ventroux M, Cherrier MV, Bazin A, Noirot P, Wilkinson AJ, et al. Tetramerization and interdomain flexibility of the replication initiation controller YabA enables simultaneous binding to multiple partners. Nucleic Acids Res. 2016;44(1):449–63.

    CAS  PubMed  Article  Google Scholar 

  67. Simonetti A, Marzi S, Billas IM, Tsai A, Fabbretti A, Myasnikov AG, Roblin P, Vaiana AC, Hazemann I, Eiler D, et al. Involvement of protein IF2 N domain in ribosomal subunit joining revealed from architecture and function of the full-length initiation factor. Proc Natl Acad Sci U S A. 2013;110(39):15656–61.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  68. Daniel J, Maamar H, Deb C, Sirakova TD, Kolattukudy PE. Mycobacterium tuberculosis uses host triacylglycerol to accumulate lipid droplets and acquires a dormancy-like phenotype in lipid-loaded macrophages. PLoS Pathog. 2011;7(6):e1002093.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  69. Oke M, Carter LG, Johnson KA, Liu H, McMahon SA, Yan X, Kerou M, Weikart ND, Kadi N, Sheikh MA, et al. The Scottish Structural Proteomics Facility: targets, methods and outputs. J Struct Funct Genomics. 2010;11(2):167–80.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  70. Burroughs AM, Aravind L. RNA damage in biological conflicts and the diversity of responding RNA repair systems. Nucleic Acids Res. 2016;44(18):8525–55.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  71. Ran F, Gadura N, Michels CA. Hsp90 cochaperone Aha1 is a negative regulator of the Saccharomyces MAL activator and acts early in the chaperone activation pathway. J Biol Chem. 2010;285(18):13850–62.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  72. Radauer C, Lackner P, Breiteneder H. The Bet v 1 fold: an ancient, versatile scaffold for binding of large, hydrophobic ligands. BMC Evol Biol. 2008;8:286.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  73. Hall CL, Lytle BL, Jensen D, Hoff JS, Peterson FC, Volkman BF, Kristich CJ. Structure and dimerization of IreB, a negative regulator of cephalosporin resistance in Enterococcus faecalis. J Mol Biol. 2017;429(15):2324–36.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

Download references


The authors’ research is supported by the NIH Intramural Research Program at the National Library of Medicine.

Author information

Authors and Affiliations



KSM and YIW initiated the study; SK and EP analyzed the data; KSM, YIW and EVK wrote the manuscript, which was edited by all authors. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Kira S. Makarova.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: Table S1

. Features used for association analysis.

Additional file 2: Fig. S1

. The scheme of phylogenetic tree for WcaE-like glycosyltransferases of COG1216. Approximate maximum likelihood phylogenetic tree was built using FastTree (WAG evolutionary model, gamma distributed site rates) [52] based on multiple alignment of 1423 COG1216 sequences from complete genomes of archaea and bacteria. Six branches (A-D, colored red) belong to same csCOG (sulfo9.00007) and are indicated on the Fig. 6 for respective genes. Other archaeal sequences or branches are colored yellow and bacterial—black.

Additional file 3: Fig. S2

. Fractions of conserved, medium and variable positions in each csCOG by lineage. Red dots correspond to 34 families described in the Table 2.

Additional file 4: Fig. S3

. Selected multiple alignments for 34 families with high fraction of conserved and variably positions. A. The plots below alignment show the propensity for disorder or order: red line—disordered loops (IUPred2); Blue line—ordered structures (ANCHOR2). Sequences identified by protein accessions. csCOG number and protein family description is indicated for each alignment. B. Several alignments of orthologous protein subfamilies without hypervariable regions. Alignments were colored using server with default amino acid groups with 100% consensus.

Additional file 5: Fig. S4

. Amino acid frequency PCA for conserved and variable positions for families from Table 2. High- (V > 2) and low-variable (V < 0.5) sites were extracted from the alignments of 34 csCOGs (Additional file 4: Fig. S3); relative frequencies of amino acids were computed for all 68 (34 × 2) subsets. Principal Component Analysis of amino acid frequencies was performed using the prcomp function of R package. The plot shows the location of the high- (red circles) and low-variable (cyan triangles) and the contributions of individual amino acids (blue arrows) in the plane of the first two principal components.

Additional file 6: Table S2

. Selected examples of potential subfunctionalization of paralogs (proteins which belong to the same COG). Selected by the following criteria: (1) present in most genomes in the respective lineage; (2) have small number of paralogs (3) have low and high variability estimates.

Additional file 7: Fig. S5

. Linear Discriminant Analysis of gene essentiality in S. islandicus.

Additional file 8: Table S3

. Number of local COGs, broken down by ancestrality, paralogy and variability.

Additional file 9: Table S4

. Genomes accessions and summary of genomic data used in this work.

Additional file 10: Table S5

. Correlation between the fraction of microsatellite regions (MSR) and low complexity regions (LCR) across csCOGs.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Karamycheva, S., Wolf, Y.I., Persi, E. et al. Analysis of lineage-specific protein family variability in prokaryotes combined with evolutionary reconstructions. Biol Direct 17, 22 (2022).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI:


  • Variability
  • Clusters of orthologous genes
  • Evolutionary reconstructions
  • Paralogs