The RAMPs
Cas7 represents a distinct major group of RAMPs
Cas7 (COG1857) is present in most of the type I CRISPR-Cas systems. Previously, members of this family have been confidently identified in the I-A, I-B, I-C and I-E systems [5]. As part of the recent update of the classification of CRISPR-Cas systems, we performed exhaustive sequence database searches for all Cas protein families using the HHpred profile-against-profile search method [27]. These searches revealed statistically significant similarity between Cas7 family proteins and various RAMP families. For example, the search with the query sequence ST0029 of the Cas7 family (formerly known as the DevR/Csa2 family) from Sulfolobus tokodaii identifies the TIGR02581 profile for the SSO1426 (or Csm3, COG1337) family of RAMPs with a probability of 0.93 and many other RAMP families with lower scores. The reciprocal search started from the SSO1426 protein sequence hits the PFAM profile PF01905, which corresponds to the Cas7 family (probability 0.97). We used the alignments obtained during these and other searches started from other query sequences, along with secondary structure predictions, to construct multiple alignments for Cas7 and a number of the most closely related RAMP subfamilies (Figure 1). In all these proteins, HHpred identifies several conserved blocks including the characteristic glycine-rich loop, based on both secondary structure and sequence conservation (Figure 1). In addition, the N-terminal beta strand (the first strand of the RRM fold) that is an essential structural feature of the RAMPs (Additional File 1) could be identified based on the secondary structure prediction. In the Cas7 group RAMPs typical of Type I CRISPR-Cas systems, the signature glycine-rich (G-rich) loop of RAMPs is notably eroded. However, the characteristic structural organization of this region, namely the alpha-helix and the beta-strand that flank the glycine-rich loop at the N- and C-termini, respectively, in other RAMPs, seems to be present in these proteins.
Collectively, these observations indicate that Cas7 proteins present in the I-A, I-B, I-C, and I-E CRISPR-Cas system subtypes comprise a distinct family within the RAMP superfamily.
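The reciprocal search logic described above can be sketched in a few lines. This is a minimal illustration and not the pipeline used in this work: HHpred is not invoked, the hit table only restates the two probabilities quoted in the text, and the function names and the default cutoff of 0.9 are assumptions made for the example.

```python
# Hypothetical sketch: the hit table restates the two HHpred probabilities
# quoted in the text; labels have the form "family/identifier".
hits = {
    ("Cas7/ST0029", "Csm3/TIGR02581"): 0.93,  # Cas7 query -> Csm3 profile
    ("Csm3/SSO1426", "Cas7/PF01905"): 0.97,   # Csm3 query -> Cas7 profile
}


def family(label):
    """Return the family part of a 'family/identifier' label."""
    return label.split("/")[0]


def reciprocal_support(hits, fam_a, fam_b, cutoff=0.9):
    """True if a query from each family hits a profile of the other family
    with a probability at or above the (assumed) cutoff."""
    def one_way(src, dst):
        return any(
            prob >= cutoff
            for (query, target), prob in hits.items()
            if family(query) == src and family(target) == dst
        )

    return one_way(fam_a, fam_b) and one_way(fam_b, fam_a)


print(reciprocal_support(hits, "Cas7", "Csm3"))
```

Raising the cutoff above 0.93 would make the forward hit fail, which illustrates why the choice of threshold should be stated whenever reciprocal hits are used as evidence of homology.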
After these analyses had been performed, the crystal structure of Cas7 (Csa2) from the crenarchaeon Sulfolobus solfataricus was reported [16]. Examination of this structure clearly demonstrates the presence of a single RAMP domain that contains four inserts within the RRM core and a C-terminal extension. None of these additional domains of Cas7 show sequence or structural similarity to any known domains [16].
Classification and evolution of RAMPs
The demonstration that the Cas7 family belongs to the RAMP superfamily prompted us to further investigate the relationships between the RAMPs. We performed DALI searches with all available RAMP structures (Figure 2 and Additional File 2) and HHpred searches using representatives of 19 RAMP families, and collected similarity scores between the corresponding profiles (Additional File 3). For each family we predicted secondary structure or assigned secondary structure elements from the known structures of RAMPs (Additional File 1). Combining the results of these analyses, we classify the RAMP superfamily into three major groups: Cas5, Cas6 and Cas7 (Figure 3).
The Cas5 group RAMPs (Cas5/COG1688, Cmr3/COG1769, Csm4/COG1567, Csy2, Csc1) were unified on the basis of sequence similarity that in most cases was identifiable by HHpred and the presence of a C-terminal domain downstream of the G-rich loop (Figure 2). For some of these C-terminal domains, an RRM fold can be predicted (Additional File 3). For example, in the Cmr3 subfamily (subtype III-B), the predicted secondary structure elements of the C-terminal domain are compatible with the RRM arrangement (Additional File 1). Moreover, this domain ends with a second G-rich loop, whereas in the Csm4 subfamily from type III-A CRISPR-Cas systems (the closest homolog of the Cmr3 family), this loop is almost completely degraded (Additional File 1). The proteins of the Csx10 subfamily, which is also related to Cmr3/Csm4, contain two predicted RRM domains followed by a clearly identifiable G-rich loop (Additional File 1). The Csx10 subfamily can be unequivocally linked to the Cas5 group and specifically to the Cmr3 and Csm4 families through HHpred searches (the best hit for a representative of the family, Rcas_3289 from Roseiflexus castenholzii, is the pfam09700 profile for Cmr3 with a probability of 99%). The remaining Cas5 proteins including Cas5 proper, Csy2, Csc1 and Csf3 contain a single N-terminal RRM domain that terminates with the G-rich loop and is followed by a distinct C-terminal beta-meander domain. Thus, the large Cas5 group of RAMPs consists of two distinct subgroups, one of which contains two RRM domains whereas the other contains only one RRM domain (Figure 2). It remains uncertain which is the ancestral form, i.e., whether the ancestor of the Cas5 group already contained two RRM domains, and the C-terminal one was lost or severely deteriorated in one of the subgroups, or the ancestral form possessed a single RRM domain that was duplicated in one of the subgroups.
The Cas6 group includes Cas6 proteins proper (COG1853/COG5551) that have been experimentally characterized as the CRISPR transcript processing RNA endonucleases [13, 14, 26, 28] as well as highly diverged homologs from the I-E (Cas6e) and I-F (Cas6f) CRISPR-Cas subtypes. This grouping is supported by the available structures and is compatible with the reported functions for the representatives of each family. Most of the Cas6 proteins encompass two well-defined RRM domains which are connected by a "flange" in the extended conformation and have a glycine-rich loop upstream of the last strand of the second RRM fold domain. Thus, the ancestor of the Cas6 group can be confidently inferred to have possessed two RRM domains. However, the Cas6f proteins contain a typical N-terminal RRM domain and a distinct C-terminal domain that displays certain topological features reminiscent of the RRM fold (see Additional Files 1 and 2) and contains a C-terminal G-rich loop but does not show significant sequence or structural similarity to any RRM domains (Figure 2). This domain could be either a grossly distorted RRM or a distinct beta-meander that convergently acquired the G-rich loop.
The Cas7 group includes Cas7 proper (COG1857) and a variety of RAMPs mostly associated with the Type III CRISPR-Cas systems. All of these proteins contain a single RRM domain with additional elaborations as demonstrated by the recently reported Cas7 structure (Figure 2), sequence comparison and secondary structure prediction. The Type III RAMP families (Csm3/COG1337 and Csm5/COG1332 in subtype III-A; Cmr6/COG1604, Cmr4/COG1336, Cmr1/COG1367 in subtype III-B; Csc2 in subtype I-D and Csf2 from the system of unknown subtype) are more similar to each other (Additional File 3) than to Cas7 but share with Cas7 a number of conserved sequence motifs (Figure 1 and Additional File 1), the overall sequence similarity identifiable by HHpred (Additional File 3) and the absence of the additional RRM domain after the G-rich loop (or its counterpart). The Csy3 subfamily is tentatively included in this group based on the secondary structure prediction (no extension after the G-rich loop compatible with another RRM domain). Some members of the Cas7 group, such as Cmr1, contain a second predicted RRM domain. Furthermore, the RAMPs of the Cas7 group show a tendency for gene duplication at least in Type III CRISPR-Cas systems.
The only RAMP family that could not be confidently assigned to any of the three groups is Csf3: despite some weak sequence similarity to both Cas6 and Cas5 in the G-rich loop region, these proteins contain fewer predicted beta-strands than Cas6 or Cas7 and no predicted RRM domain downstream of the G-rich loop; although the latter feature resembles the organization of Cas7, there is otherwise no similarity between these proteins.
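The three-group classification described above can be summarized as a simple lookup table. The sketch below is only a restatement of the family assignments given in the text, not a classification procedure; the dictionary and function names are hypothetical, Csy3 is included in the Cas7 group only tentatively, and Csf3 is deliberately left unassigned in line with the preceding paragraph.

```python
# Family-to-group assignments restated from the text (names are illustrative).
RAMP_GROUPS = {
    "Cas5": ["Cas5", "Cmr3", "Csm4", "Csx10", "Csy2", "Csc1"],
    "Cas6": ["Cas6", "Cas6e", "Cas6f"],
    # Csy3 is a tentative member (secondary structure prediction only).
    "Cas7": ["Cas7", "Csm3", "Csm5", "Cmr6", "Cmr4", "Cmr1",
             "Csc2", "Csf2", "Csy3"],
}

# Invert the table so that each family name points to its group.
FAMILY_TO_GROUP = {
    fam: group for group, families in RAMP_GROUPS.items() for fam in families
}


def ramp_group(family_name):
    """Return the major RAMP group of a family, or 'unassigned' if unknown."""
    return FAMILY_TO_GROUP.get(family_name, "unassigned")


for name in ("Csm3", "Cmr3", "Cas6f", "Csf3"):
    print(name, "->", ramp_group(name))
```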
The diversity and weak conservation of the sequences and structures of the RAMPs hamper the elucidation of the evolutionary relationships between the three major groups. Structural comparisons seem to suggest a specific affinity between the Cas6 and Cas7 groups [16]. From a different standpoint, the most parsimonious evolutionary scenario might involve an ancestral RAMP with a single enzymatically active RRM domain, resembling Cas7, and a single duplication in the putative common ancestor of the Cas5 and Cas6 groups, with subsequent deterioration or displacement of the C-terminal RRM domains in several Cas5 and Cas6 lineages (Figure 3). Under this scenario, the similarity between Cas7 and Cas6 would reflect ancestral structural features.
The characteristic arrangement of RAMPs in CRISPR-Cas operons
Mapping the new classification of RAMPs described in the preceding section onto the operons of the type I and type III CRISPR-Cas systems reveals a common pattern of organization. Most subtypes of the Type I CRISPR-Cas systems encode one RAMP each of the Cas5, Cas6 and Cas7 groups. Operons of type III CRISPR-Cas systems are organized similarly except that they typically encode multiple Cas7 group RAMPs. Notably, cas5 and cas7 usually form a pair of adjacent genes (Additional File 4). Remarkably, the Cas5 and Cas7 orthologs in two distinct CRISPR-Cas systems belong to the stable core of the Cascade complex both in E. coli (Type I-E) [7, 8] and in S. solfataricus (Type I-A) [16]. In the unclassified (Type U) CRISPR-Cas systems that contain no cas5, a cas7 (csf2) gene is located adjacent to the csf3 gene, suggesting that in these systems Csf3 could play a role comparable to that of Cas5 and might be a truncated derivative of Cas5 (Additional File 4).
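The adjacency pattern noted above can be checked mechanically once an operon is written as an ordered list of gene names. The following sketch is illustrative only: the function name is hypothetical and the two gene orders are toy examples written for this illustration, not data taken from Additional File 4.

```python
def has_adjacent_pair(operon, gene_a, gene_b):
    """True if gene_a and gene_b are immediate neighbours, in either order."""
    return any(
        {left, right} == {gene_a, gene_b}
        for left, right in zip(operon, operon[1:])
    )


# Toy gene orders (assumed for illustration, not taken from the source data).
toy_type_i = ["cas3", "cse1", "cse2", "cas7", "cas5", "cas6e", "cas1", "cas2"]
toy_type_u = ["csf1", "csf2", "csf3"]

print(has_adjacent_pair(toy_type_i, "cas5", "cas7"))
print(has_adjacent_pair(toy_type_u, "csf2", "csf3"))
```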
Enzymatic activities and catalytic sites of the RAMPs
Endoribonuclease activity involved in CRISPR transcript processing has been demonstrated for four proteins of the Cas6 group, namely the E. coli CasE (Cse3), Cas6 from the archaea Pyrococcus furiosus and S. solfataricus, and Cas6f (Csy4) from Pseudomonas aeruginosa. All these enzymatically active Cas6 proteins contain a conserved motif centered at the catalytic histidine, and a similar motif is found in many RAMP families of both Cas5 and Cas7 groups, especially from type III CRISPR-Cas systems (Figure 3 and see Additional File 1). In most cases, including Cmr4 (COG1336), Cmr6 (COG1604), Csm3 (COG1337), Csm5 (COG1332), Csm4 (COG1567), and MA1928-like families, this motif is located immediately after the predicted first beta-strand of the RRM domain, similarly to the catalytic histidine of Cas6. Despite the weak sequence similarity between the three groups of RAMPs, the presence of the conserved histidine in many members of each group and in nearly identical positions within the RRM domain suggests that this is an ancestral feature and accordingly the original RAMP most likely was an active endoribonuclease.
In addition to the catalytic histidine, the enzymatically active Cas6 protein of P. furiosus contains lysine and tyrosine residues that are essential for the activity and are thought to comprise the catalytic triad of this enzyme together with the conserved histidine [14]. However, these amino acids are not conserved other than in close relatives of P. furiosus Cas6. Although several of the other RAMP families also possess conserved polar or aromatic residues that potentially could contribute to a catalytic triad similar to that of the Cas6 endonucleases (see Additional File 1), the exact architecture of the catalytic site of these RAMPs is currently difficult to predict.
Several RAMPs in each of the three major groups contain a motif with a conserved histidine in the C-terminal portion of the RRM domain. At this time, it remains unclear whether any of the RAMPs that lack the conserved histidine in the N-terminal part but contain other (not homologous to the known catalytic ones) conserved histidines closer to the C-terminus (Figure 3 and Additional File 1) are catalytically active.
Given that the Cas6 group RAMPs are dedicated nucleases for the processing of the CRISPR transcripts (pre-crRNA) that produce the crRNAs and appear to be sufficient for this function [13, 14, 28], most of the other RAMPs might be involved in non-enzymatic functions in the respective Cascade complexes. However, the possibility remains that some of these RAMPs are involved in crRNA-guided mRNA interference. Indeed, mRNA cleavage has been experimentally demonstrated in vitro for the Type III CRISPR-Cas system from Pyrococcus furiosus [15]. Furthermore, in some CRISPR-Cas systems, catalytically active RAMPs of the Cas5 or Cas7 groups might substitute for the Cas6 activity. For example, in the type I-C systems that lack cas6, the Cas5 family proteins contain a conserved histidine in the C-terminal region of the protein that, jointly with other aromatic and charged residues that are conserved in this subfamily of RAMPs, might contribute to the catalytic site of these proteins (see Additional File 1).
Gene content similarity between Type I and Type III CRISPR-Cas systems
The new classification of RAMPs and the common arrangement of RAMP genes in the operons for type I and type III CRISPR-Cas systems emphasize the considerable conservation of organization of the genes encoding (potential) Cascade subunits. The overall organization of the cas operon is especially similar between the I-E and III-A subtypes (Figure 4A). Both systems have been experimentally characterized, I-E in Escherichia coli [13] and III-A in Staphylococcus epidermidis [12, 18], and shown to be fully functional. The Type I-E system consists of 9 components, and the Type III-A system includes 10 components (counting the HD superfamily nuclease domains fused to different genes separately). Six genes (domains) of the I-E system, namely cas1, cas2, cas3''(HD), cas7, cas5 and cas6e, are clearly homologous to cas1, cas2, cas3''(HD), csm3, csm4 and cas6 of III-A, respectively (III-A contains an additional homolog of cas7, csm5). Although for the small alpha-helical proteins Cse2 and Csm2, sequence similarity cannot be readily detected, they share several similar motifs [5] and might be homologous as well. There are two genes for which there seems to be no counterpart in the other system. One is the Cas3 helicase-nuclease, which is unique for Type I systems, and the other is Csm6, which is loosely associated with the CRISPR-Cas systems. The Csm6 protein has been structurally characterized; it contains an HTH domain and probably is a regulatory protein, most likely not involved in the basic CRISPR-Cas mechanism [29].
The large protein Cse1 in the I-E system is a subunit of the Cascade complex [13] and so is the Cas10 protein (CRISPR Polymerase) in P. furiosus [15]. Furthermore, Cse1 has a similar size to Cas10 (without the HD domain; Figure 4A). Thus, it seems tempting to speculate that Cse1 might be a homolog of Cas10. As the Cse1 family proteins do not contain any motifs implicated in catalysis in the predicted Cas10 polymerase, Cse1 would be an inactivated enzyme should it be demonstrated that it is indeed a Cas10 homolog.
Putative homology between the large and small subunits of different type I and type III CRISPR-Cas systems
Among the large subunits of Type I CRISPR-Cas systems, sequence conservation has been demonstrated previously [5] for several subfamilies of the Cas8 family (Cas8a1/Csa6, a subfamily of subtype I-A; Cas8b/Csh1/Cst1, a subfamily of subtype I-B; and Cas8c/Csd1, a subfamily of subtype I-C). Here, using HHpred and PSI-BLAST, we linked several other subfamilies to the Cas8 family: Cmx1/Csx13/LA3191 associated with some diverged variants of the I-C subtype and Cas8a2 (Csa4/Csx9 subfamily) associated with some I-A subtype systems. For example, HHpred identifies the profile for Cas8a2 (Csa4) with probability 0.42 using the FRAAL5579 sequence from Frankia alni (subfamily Cas8a1/Cst1) as the query and profile TIGR02556 for Cas8a (Cst1 subfamily) with probability 0.83 using the M23134_00692 sequence (Cmx1/Csx13/LA3191 subfamily) from Microscilla marina as the query. The large Cascade subunit of subtype I-D shows similarity to the Zn-finger regions of the Cas8b/Cst1 of the I-B system and additionally is fused to an HD domain analogously to the type III Cas10 proteins. The large subunits of the I-E (Cse1) and I-F (Csy1) subtypes do not show any sequence similarity to one another (despite the fact that these systems are related by the Cas1 phylogeny and the cas gene sets) or to any Cas8 family proteins.
Type III systems contain several subfamilies of Cas10 (Csm1, Cmr2 and Csx11 according to [19]) that have been denoted CRISPR polymerases because of their similarity to the Palm/Cyclase domain [2, 30, 31]. The CRISPR polymerase consists of several domains, namely, the HD domain (predicted nuclease), a distinct domain so far unique to this protein family, a Zn-finger domain, and a region containing the Palm domain, the signature domain of various polymerases and cyclases which adopts a distinct RRM fold [2]. The Palm domain of CRISPR polymerases is more similar to the Palm domain of cyclases than to those of 3'-5' DNA and RNA polymerases, and contains all typical secondary structure elements including four beta-strands of the core RRM fold [31]. Many structures of Palm domain-containing polymerases from all domains of life and numerous viruses have been solved and compared [32]. Most of these polymerases show a common arrangement of core domains and the same modes of nucleic acid binding; the polymerases additionally contain a variety of editing nuclease domains and regulatory domains. The core domains (usually arranged in the same order from the N-terminus to the C-terminus) are the following: the "Fingers" domain that binds a nucleotide, the catalytic "Palm" domain that binds single-stranded nucleic acid, and the "Thumb" domain that binds double-stranded nucleic acid [32].
Despite this structural and mechanistic similarity, only the Palm domains of these numerous polymerase families are clearly homologous [32, 33]. The most conserved feature of the Palm domains is the beta-hairpin formed by strands 2 and 3 of the RRM fold [33, 34]. The thumb domain is usually enriched in alpha helices some of which interact directly with the DNA or RNA duplex [32].
To analyze and compare the sequence and structural features of the large subunits of type I and type III systems (Cas8 and Cas10 [predicted CRISPR polymerases], respectively), we constructed a multiple alignment of representative sequences and predicted the secondary structure using the JPRED program (Figure 4B) (Additional File 5). A detailed analysis of the C-terminal region of CRISPR polymerases (starting immediately after the Zn-binding treble clef domain) showed that a region consisting mostly of alpha-helices follows the fourth strand of the RRM fold of the Palm domain (Region 5 in Additional File 5). This arrangement is consistent with the general structure of Palm-domain polymerases described above and suggests that this region of the CRISPR polymerases could be equivalent to the Thumb domain of other polymerases. Furthermore, given that the core Palm domain is rather compact in these proteins, the region located after the HD nuclease domain and upstream of the Zn-binding domain (Region 2 in Additional File 5) might be an equivalent of the Fingers domain.
Most of the large subunits of different subtypes of Type I CRISPR-Cas systems contain a readily identifiable Zn-finger domain in the middle of the protein sequence [5]. If the large subunits are highly diverged, inactivated Palm-domain containing polymerases as proposed above, and the Zn-finger is equivalent to the treble-clef domain found in the CRISPR polymerase, one should expect that a domain containing several beta-strands compatible with the general structure of the Palm domain followed by an alpha-helical region would be located downstream of the Zn-finger. Indeed, in various subfamilies of Cas8, Cas10d, inactivated Cas10 (Csx11 subfamily) and Cse1, we observed the same structural pattern, namely, at least three predicted beta-strands that could belong to an RRM fold, including the core beta-hairpin, followed by an alpha-helical region (Regions 4 and 5 in Additional File 5). Because two other subfamilies (Csy1 and Cmx1) do not contain Zn-fingers, it is difficult to map the beginning of the putative Palm domain within these sequences. However, we detected sequence similarity between Cmx1 and Cas8 (see above) and given that Cmx1 proteins possess an alpha-helical C-terminal domain (Regions 4 and 5, Additional File 5), it seems likely that Cmx1 is homologous to Cas8. The Csy1 protein (the large subunit of the subtype I-F system) might be homologous to Cse1 (the large subunit of the subtype I-E system) given the overall similarity in the operon organization between the I-E and I-F systems and the clustering of these systems in the Cas1 phylogeny [20]. Like Cse1, Csy1 also has an alpha-helical C-terminal domain and an N-terminal region with mixed alpha-helices and beta-strands (Additional File 5, Csy1 subfamily). Although the pattern of the predicted secondary structure elements of Csy1 cannot be confidently aligned with either Cse1 or Cas8, we cannot rule out the possibility that it contains a derived RRM-like fold.
Most of the large subunits of type I CRISPR-Cas systems containing Zn-fingers also possess an N-terminal region with mixed beta-strands and alpha helices which is compatible with the general organization of the region following the HD domain and preceding the Zn-finger in Cas10 subfamilies (Region 2, Additional File 5). Taken together, analysis of the general secondary structure features, the presence of the Zn-finger domain in many large subunits, the similar operon organization and the experimentally demonstrated functional link to RAMPs and the Cascade complex [8, 13, 15] raise the possibility that all large subunits of CRISPR-Cas systems might be inactivated derivatives of the CRISPR polymerase (Figure 4B). However, there is currently not enough evidence to rule out non-homologous displacement of some of the large subunits or their individual domains.
Interestingly, the pattern of secondary structure elements in the putative Fingers domain in Cas10 and several large subunits, namely Csx11, Cas8a2/Csa4 and Csc3 (Region 2, Additional File 5), resembles the structures of the RRM domain found in RAMPs. Like the RRM core domain, many of the Fingers-like domains contain four predicted beta-strands. Furthermore, the Fingers-like domains start with a beta-strand-alpha-helix element and end with an alpha-helix-beta-strand element, which are the two most conserved structural patterns in RAMPs (see above and Additional File 1). Thus, it is possible that the Fingers domain of the large subunits adopts an RRM fold.
In several families of the large subunits (Cas8a1, Cas8b, Cas8c, Cmx1 and Cas10d) of the I-A, I-B, I-C and I-D system subtypes, the C-terminal region (predicted Thumb domain) is longer than that in Cas10 proteins (8 alpha helices versus 4 in Cas10; Region 5, Additional File 5). Interestingly, in these subtypes, the small Cascade subunit is missing in the CRISPR-Cas operons. Typically, the small subunit is an alpha-helical protein containing 6 alpha helices (structures have been solved for Cmr5: AF1862, 2OEB and TTHB164, 2ZOP). This is, in principle, compatible with the size of the extra alpha-helical region at the C-termini of the aforementioned large subunits (Figure 4B). The Csy1 protein, the subtype I-F specific large subunit, contains eight predicted alpha helices at the C-terminus and four helices at the extreme N-terminus. Because none of the predicted RAMP proteins from this system contain extended alpha-helical regions compatible with the size of the small subunit, we speculate that a domain homologous (or at least structurally and functionally analogous) to the small subunit might be "hidden" in Csy1.
The demonstration that at least some of the large subunits of Type I CRISPR-Cas systems are homologous to the CRISPR polymerase suggests that all these large proteins function and interact with DNA or RNA in a mode analogous to that of other Palm domain polymerases (Table 1). In particular, the Palm domain probably interacts with ssDNA whereas the analog of the Thumb interacts with dsDNA. Notably, evolutionarily conserved inactivated derivatives of Palm domain polymerases have been detected in Archaea and eukaryotes although their functions remain uncharacterized [35, 36]. The small subunits of CRISPR-Cas systems might be responsible for the recognition of the PAM motif that is required for the selection and incorporation of new spacers.
The conservation of the complete set of catalytic residues typical of Palm domain polymerases and cyclases implies that the Palm domain of Cas10 is enzymatically active but the nature of this activity remains unknown. There is no indication that a processive polymerase is involved at any stage of the CRISPR-Cas system functioning. The possibility remains that Cas10 is a nucleotidyltransferase or even a nucleotide cyclase, perhaps involved in crRNA modification. This is compatible with the activity of the tRNA(His) guanylyltransferase THG1 [37], which belongs to the same clade of Palm domain proteins as Cas10 and the GGDEF diguanylate cyclases [31] (see above). Another possibility is that Cas10 has a secondary role as a helicase in one or more stages of CRISPR-Cas functioning. A helicase activity dependent on the cleavage of the α-β bond in NTP during polymerization has been demonstrated for the bacteriophage T7 RNA polymerase [32, 38, 39], which is a derivative of the Palm domain DNA polymerases [33]. Remarkably, all Type I CRISPR-Cas systems in which the large subunits are inactivated Cas10 homologs also include the Cas3 helicase, and conversely, all Type III systems that contain Cas10 proteins predicted to be active lack Cas3 [20]. Thus, it is tempting to propose that Cas3 compensates for the loss of the original enzymatic function of Cas10 in Type I CRISPR-Cas systems whereas the inactivated derivative of Cas10 performs an accessory structural role. It is of further note that some Type U CRISPR-Cas systems that contain degraded versions of Cas10 and lack Cas3 include a DinG-like helicase (see below), in further support of the possibility that a helicase activity required for the CRISPR-Cas function can be supplied by different, in some cases unrelated, proteins.
Type II CRISPR-Cas systems and homologs of Cas9
The signature protein of the type II CRISPR-Cas systems, Cas9, does not show any detectable similarity to any proteins in Type I and Type III systems. It appears that Cas9 is sufficient both to generate crRNA and to cleave the target DNA [6, 9, 20]. The large Cas9 protein (~1000 amino acids) contains two predicted nuclease domains, namely, the N-terminal RuvC-like nuclease (RNase H fold) and the HNH (McrA-like) nuclease domain that is located in the middle of the protein [5, 40].
To analyze the remaining portions of the Cas9 protein, we constructed a multiple alignment of the two distinct subfamilies of Cas9 (Csn1 and Csx12 subfamilies), predicted the secondary structure and performed PSI-BLAST and HHpred searches with different queries from these subfamilies. Both full-length proteins and fragments outside of previously identified domains were used for these searches (Additional File 6, N-terminal region, N1 and C-terminal region N2). We failed to detect any significant similarity for the region N1, which is located between the two nuclease domains (Additional File 6) and is ~400 aa in length. The predicted secondary structure in this region is mostly alpha-helical with several beta-strands in the middle. For the region N2, which is located downstream of the HNH domain (e.g., NMCC_0397 from Clostridium cellulolyticum H10, 610 to 1021 aa; Additional File 6), HHpred identifies a weak similarity to the RuvC-like resolvase profile (cd00529; probability 0.22). Given that a region similar to RuvC has been previously detected at the N-terminus of Cas9 [5], we investigated the N2 region in greater detail. Comparative analysis of the conserved motifs and secondary structure of Holliday junction resolvases (HJRs) and endonucleases [40–42] and the regions of similarity with RuvC identified in Cas9 indicates that the N-terminal region contains three known motifs. Furthermore, the region immediately after the HNH nuclease domain corresponds to the C-terminal region of the HJR superfamily, which contains two alpha helices (or one long helix) and a fourth motif with the signature HxxD (Figure 5, motifs 1-4 in Additional File 6). Thus, within the RuvC-like domain, Cas9 contains an almost 450 aa long insert which includes the HNH nuclease domain; nevertheless, the RuvC domain is most likely an active nuclease given the conservation of all four HJR motifs and the characteristic conserved secondary structure elements (Additional File 6).
For the rest of the N2 region, we failed to detect sequence similarity to any proteins although secondary structure prediction for this region shows that it consists mostly of beta-strands with a few alpha helices, suggesting the presence of a compact globular domain (Additional File 6).
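A signature such as the HxxD motif mentioned above (a histidine and an aspartate separated by any two residues) can be located in a protein sequence with a regular expression. The sketch below is a minimal illustration: the function name is hypothetical, the toy sequence is invented and is not a Cas9 fragment, and a bare pattern match of this kind says nothing about whether a hit is structurally or functionally equivalent to the fourth HJR motif.

```python
import re

# Lookahead so that overlapping occurrences of H-x-x-D are all reported.
HXXD = re.compile(r"(?=(H..D))")


def find_hxxd(sequence):
    """Return (1-based position, matched residues) for every HxxD occurrence."""
    return [(m.start() + 1, m.group(1)) for m in HXXD.finditer(sequence.upper())]


toy_sequence = "MKAHLLDGSHAADTT"  # invented sequence, for illustration only
print(find_hxxd(toy_sequence))
```

For the toy sequence, the function reports HLLD at position 4 and HAAD at position 10.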
The exact roles of the two predicted nuclease domains of Cas9 remain unclear. However, the insertion of the HNH nuclease domain into the RNase H fold domain suggests that their activities are closely coupled and that their active sites are proximally located. The HNH nuclease domain, which is common in restriction enzymes and possesses DNA-endonuclease activity [43, 44], might be responsible for the target cleavage. Conversely, the RuvC-like RNase H fold domain might be involved in CRISPR transcript processing.
Several PSI-BLAST searches using various regions of Cas9 as queries detected similarity to a large family of prokaryotic proteins containing both RuvC-like and HNH nuclease domains (for the details on the identification of these homologs see Additional File 6). This family could be divided into at least two subfamilies by domain architecture (Figure 5). Analysis of the genomic context of the genes encoding these Cas9 homologs did not reveal any stable associations, and there are no CRISPR repeats in the vicinity of any of these genes. Hence, the function of these proteins remains obscure. An intriguing possibility is that they might represent a novel system of RNA-guided DNA interference involved in antivirus defense that in some respects could be analogous to the prokaryotic Argonaute proteins [45]. Some of these proteins form large species-specific paralogous families (e.g., 49 genes in Ktedonobacter racemifer or 17 genes in Microcoleus chthonoplastes, see Additional File 6). These expansions of closely related paralogs in the same genome suggest that at least this subset of the family could represent novel mobile elements. The cas9 gene might have been co-opted by the CRISPR-Cas system from such mobile elements with the concomitant loss of typical CRISPR-Cas components, such as RAMPs and CRISPR polymerases, resulting in the emergence of the distinctive Type II gene neighborhoods. The emergence of Cas9 involved two distinct insertions, namely a mostly alpha-helical insert near the middle of the protein sequence and a mostly beta-stranded region near the C-terminus (Figure 5). These large inserts did not show sequence similarity to any other proteins but, given the close functional similarity between Type II and Type I/III CRISPR-Cas systems, it cannot be ruled out that the inserts originate from CRISPR-Cas components.
Type U CRISPR-Cas systems
An unusual CRISPR-Cas system has been recently identified in several bacterial genomes, e.g., Acidithiobacillus ferrooxidans ATCC 23270 (operon AFE_1037-AFE_1040) (denoted type U as it did not contain signature genes of any of the three CRISPR-Cas types) [20]. This system is associated neither with the two ubiquitous core cas genes, cas1 or cas2, nor with any other signature genes of the three CRISPR-Cas types or the 10 subtypes. The A. ferrooxidans system consists of four genes denoted csf1, csf2, csf3 and csf4. The Csf2 protein is a Cas7 group RAMP closely related to the Csm3 subfamily. Csf3 is yet another diverged RAMP protein that might be functionally analogous to the Cas5 group (Figure 3). Csf1 is a Zn-finger containing protein. A PSI-BLAST search started with one of the Csf1 proteins (AFE_1038, Acidithiobacillus ferrooxidans) identified, after the first iteration, a weak (not statistically significant) similarity with the Zn-finger sequence of Cas10 proteins of the Cmr2 family, and its predicted secondary structure is compatible with the treble clef fold. The secondary structure prediction for these proteins generally shows the same pattern as in the large Cascade subunits discussed above, namely several beta-strands (some of them forming a potential hairpin) and several alpha-helices at the C-terminus (Additional File 5). Taken together, these observations suggest the possibility that Csf1 could be a highly divergent, inactivated and N-terminally truncated Cas10-like polymerase derivative lacking the N-terminal Fingers domain. The fourth gene in this system, csf4, is usually located on the complementary DNA strand in the divergent orientation and encodes a DinG family helicase [46]. According to the CRISPRdb database [47], CRISPR arrays are present in the vicinity of the above four genes in all of the respective genomes but the architecture of these arrays is unique in each case.
Thus, this system might function in conjunction with different CRISPR arrays and would not require a distinct repeat signature.
Homologs of Csf1, Csf2 and Csf3 were identified in several Actinobacteria in a somewhat different genomic context (e.g., pREL1_0084-pREL1_0087 in Rhodococcus erythropolis). There is no DinG-like helicase in the neighborhood. A gene encoding a small, largely alpha-helical protein with conserved positively charged and aromatic amino acids in several positions follows the csf1 gene. This arrangement resembles the large and small Cascade subunits of the I-E and III-A subtypes. All these loci are located on plasmids. No CRISPR repeats are detected on these plasmids or, in many cases, in other partitions of the respective genomes (see the CRISPRdb database [47]). Thus, this variant of the Type U CRISPR-Cas system might be a mobile Cascade-like module functioning in a completely different context, not associated with CRISPR repeats and other Cas proteins.
Unusual CRISPR-Cas system variants
A few CRISPR-Cas systems that could be readily classified into established subtypes or at least types based on signature genes contain, in addition, unusual protein families, domain fusions and/or operon rearrangements (Figure 6). For example, a distinct subtype I-C system variant has a number of specific features, in particular, fused cas1 and cas4 genes and two extremely divergent RAMPs (Figure 6A). One of the latter is a homolog of Cas7 group RAMPs (GSU0053), and the other one is an apparent fusion of Cas5 and Cas6 group RAMPs (GSU0054) (see Additional File 1). The ancestral version of this system could be similar to that present in Methanosarcina barkeri, with a probable homolog of Cas8 (an inferred Cas8 family protein with the characteristic alpha-helical domain at the C-terminus, which could also include a fusion to the small subunit). Several CRISPR-Cas systems (e.g., in Geobacter sulfurreducens) contain an apparently deteriorated version of the Cas8 protein (which is identified on the basis of the presence of an alpha-helical C-terminal domain and the location in the operon). In a few other genomes there are no traces of a Cas8-like subunit (e.g., in Bifidobacterium animalis). The additional gene in this operon (Csb3 family) resembles RAMPs of the Cas6 family by secondary structure prediction and several motifs (see Additional File 1); however, this protein also contains a C-terminal extension resembling the alpha-helical region present in Cas8 family proteins. The variant of the subtype I-F system in Photobacterium profundum contains three genes that are clearly orthologous to Cas1, Cas2/Cas3 fusion and Cas6f of the I-F system, respectively; however, two additional genes in this system encode proteins (PBPRB1993 and PBPRB1992) that show no detectable sequence similarity to any known protein families (Figure 6B). By length and the position in the operon, these proteins resemble Csy2 and Csy3, respectively.
The predicted secondary structures of these proteins are also compatible with the RAMP structure but not with that of the Cas8 family (no alpha-helical domain). Thus, these proteins might belong to the Cas5 and Cas7 groups, respectively. The cas8 (large subunit) gene is absent in this system, which seems to be active based on the presence of a large array of CRISPR repeats in the genome.
Some variants of the subtype III-B system encompass the signature Csx10 family, which belongs to the Cas5 group of RAMPs (Figure 6C). Another feature of this system is the presence of a protein of the all1473 family, which does not show any similarity to known Cas protein families, although its predicted secondary structure resembles that of the RAMPs. These systems also contain the ribosomal protein S1 domain (the OB fold [48]), which forms two distinct fusions. In some systems (e.g., in Bacillus tusciae), several additional fusions occurred, mostly between adjacent genes in the operon (Figure 6C). The Cas10 homolog in the latter systems lost the HD domain but retained all catalytic residues of the Palm domain.
Comparative analysis of these unusual variants of CRISPR-Cas system architectures may shed additional light on CRISPR-Cas evolution as discussed in the next section.
An evolutionary scenario for the origin of CRISPR-Cas systems
Combined, the findings described here allow us to propose a simple scenario for the origin of the CRISPR-Cas system (Figure 7). The primary observations that contribute to this reconstruction of CRISPR-Cas evolution are:
i) the demonstration that Cas7 proteins represent a distinct group of RAMPs
ii) classification of all RAMPs into three major groups, Cas5, Cas6 and Cas7
iii) the more tentative unification of Cas8 and Cas10 into the CRISPR polymerase family (large subunits of CRISPR-Cas systems)
iv) the tentative unification of small, Csm2-like subunits
Taking into account these newly discovered unifying connections between the Cas proteins, comparison of the gene composition and operon organization of the three major types and 10 subtypes of CRISPR-Cas systems allows us to reconstruct the ancestral forms with confidence.
The ancestral functional CRISPR-Cas system probably resembled Subtype III-A and consisted of six or seven genes, namely the two universal cas genes, cas1 and cas2 (the "information processing" subsystem involved in the adaptation phase), along with four or five additional genes that comprised the "executive" subsystem (Cascade complex) involved in crRNA processing and interference. The "executive" module included the large subunit (Cas10/Cas8, or the CRISPR polymerase), the small subunit (an alpha-helical protein or domain enriched in positively charged and aromatic amino acids) and two or three RAMPs (of the Cas5, Cas6 and Cas7 groups). Given that Cas5 and Cas6 are structurally similar and considering that Cas5 probably substitutes for Cas6 in subtype I-C, the ancestral system could have contained only one protein representing these two families. Most of the ancestral components are retained in many extant CRISPR-Cas subtypes, in particular, the Type III systems that show relatively little variation. In the most parsimonious scenario, relatively few evolutionary events are required to explain the emergence of Type I and Type III systems with their subtypes (Figure 7).
The key events that gave rise to Type I CRISPR-Cas systems include the acquisition of the helicase Cas3 and the RecB family nuclease Cas4; inactivation of the Palm domain of the Cas10 protein, which yielded Cas8; and fission of the HD domain and Cas10 followed by fusion of the HD domain with the Cas3-like helicase. The preservation of 6-7 ancestral components in most of the Type I and Type III CRISPR-Cas systems suggests tight structural and functional links among these proteins. However, a degree of independence between the "informational" and "executive" modules has been reported previously [5, 19, 20]. In particular, Type III "executive" modules (Type III Cascades) are often encoded separately (not in proximity to the cas1 and cas2 genes) and often occur in a genome along with Type I and/or Type II systems. Furthermore, Cas1 sequences from Type III systems are not monophyletic in the phylogenetic tree [20], suggesting that Type III "executive" modules have combined with diverse "informational" modules on multiple occasions. This is a likely evolutionary scenario for Subtype I-D, in which the Cascade complex (especially the Cas7 group RAMP Csc2) resembles the Type III counterpart rather than other Type I Cascades (see Additional File 1). Interestingly, the HD domain in this subtype is associated with the large subunit (Cas10d) rather than with Cas3, again similarly to Type III rather than to other Type I systems. However, the HD domain of Subtype I-D systems does not show the circular permutation that is characteristic of the HD domain fused with Cas10 in Type III systems. Thus, in this case, the similarity of domain architectures seems to be convergent, i.e., the HD domain in Subtype I-D systems probably was translocated from cas3 to inactivated cas10 (or fused with the latter if the ancestral form was a stand-alone HD domain).
There are currently no archaeal or bacterial genomes known to possess the "information processing" module but not the "executive" module of the CRISPR-Cas system. Although involvement of Cas1 in various repair processes has been suggested by recent experiments [49], this tight linkage indicates that the primary function of Cas1-Cas2 depends on the Cascade complex (the "executive" module). In contrast, "Cascade only" systems (Type U) that are not associated with CRISPR arrays have been identified, suggesting the intriguing possibility that some variants of Cascade might function as an independent defense system, without relying on Cas1, Cas2 and CRISPR arrays for the acquisition of spacers. Although the source of RNA guides for such a system is unclear, an interesting possibility is that this version of Cascade might recognize alien DNA molecules and process nascent alien mRNA to generate RNA guides; such a mechanism would be analogous to the siRNA branch of the eukaryotic RNA interference systems [50]. From the evolutionary perspective, such standalone Cascades could be one of the antecedents of CRISPR-Cas systems.
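The modular view discussed above can be made concrete with a minimal sketch that reports which modules a locus carries. The gene labels (for example `large_subunit` and `cas7_ramp`) are simplified, hypothetical names introduced here for illustration; they stand for the component classes named in the text and are not an annotation scheme used in this work.

```python
# Module composition follows the description in the text; the labels
# themselves are hypothetical placeholders for the component classes.
INFORMATIONAL = {"cas1", "cas2"}
EXECUTIVE = {"large_subunit", "small_subunit", "cas5_or_cas6_ramp", "cas7_ramp"}


def describe_locus(genes, has_crispr_array):
    """Summarize the modules present in a locus given simplified gene labels."""
    genes = set(genes)
    has_info = INFORMATIONAL <= genes   # both cas1 and cas2 are present
    has_exec = bool(EXECUTIVE & genes)  # at least one Cascade component
    if has_info and has_exec:
        return "informational + executive modules"
    if has_exec:
        if has_crispr_array:
            return "executive module only (array nearby)"
        return "executive module only (no array)"
    if has_info:
        # According to the text, no genome with only this module is known.
        return "informational module only"
    return "no recognizable module"


# A Subtype III-A-like locus and a plasmid-borne "Cascade only" locus.
print(describe_locus(["cas1", "cas2", "large_subunit", "small_subunit",
                      "cas5_or_cas6_ramp", "cas7_ramp"], True))
print(describe_locus(["large_subunit", "cas5_or_cas6_ramp", "cas7_ramp"], False))
```

The sketch treats any single Cascade component as evidence of an "executive" module, which is a deliberate simplification; a stricter rule could require both a RAMP and a large subunit.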
The ancestor of the CRISPR polymerase (Cas10) could have originated from an ancient Palm domain polymerase, such as reverse transcriptase. On the basis of a number of derived shared characters, the CRISPR polymerase has been classified as a member of a distinct group of Palm domain proteins that also includes Thg1-type 3'-5' nucleic acid polymerases and adenylate and diguanylate cyclases [31]. The association with the HD domain probably goes deep into the evolutionary past given that HD family hydrolases are also commonly associated with the GGDEF family diguanylate cyclases [31, 51]. The ancestral function of the CRISPR polymerase, which was probably associated with the HD hydrolase domain, could potentially involve a distinct form of signal transduction, a role in repair and/or in antivirus defense. The latter possibility seems attractive given the tight association of this protein with the CRISPR-Cas systems.
Genomic islands, in which viral defense, mobile elements and stress response genes, such as toxin-antitoxin systems, are often present together, are likely to be "melting pots" for the emergence of new functional systems through recombination, duplication and lateral transfer [45, 52]. It appears likely that the CRISPR-Cas systems evolved in such genomic environments, in part by combination of distinct mobile elements. The origin of RAMPs remains an enigma: these highly diverged RRM-domain proteins possess shared derived characters that are strongly suggestive of their monophyly (such as the presence of a glycine-rich loop and a conserved histidine implicated in catalysis in numerous RAMPs) but do not show significant similarity to any other proteins. An intriguing possibility is that there is a direct evolutionary connection between the CRISPR polymerase and the RAMPs given that the cores of all these proteins consist of RRM domains. The first RAMP proteins could have emerged by duplication of an inactivated polymerase followed by rapid evolution that involved the emergence of the endoribonuclease catalytic center. The ancestral RAMP might have resembled Cas7 proteins that contain a single RRM domain with structural embellishments along with (in some of the Cas7 proteins) a Zn-finger domain, and so resemble polymerases in their domain architecture. Furthermore, several CRISPR-Cas systems apparently remain functional despite having a highly degraded form of the large subunit (type U system) or lacking the large subunit altogether in some variants of Subtype I-C and Subtype I-F (Figure 6B), suggesting that RAMPs could substitute for the function of large subunits. The Cas6 and Cas5 group RAMPs could have subsequently evolved from the Cas7-like RAMPs. 
This scenario seems plausible considering that RAMP duplications, including tandem duplications and fusions, are often present in CRISPR-Cas loci, especially among the Type III systems in which Cas7 group RAMPs are particularly prone to duplication. Interestingly, in both Type I Cascade complexes that have been characterized in detail, those from E. coli and S. solfataricus [8, 16], the Cas7 subunit is present in multiple copies. It seems plausible that in Type III Cascades, these homo-oligomers are replaced by hetero-oligomers made of paralogous Cas7 proteins. Furthermore, recent inactivation of the CRISPR polymerase (Cas10) was detected in some Type III systems such as the MTH326-like systems (Figure 7). All these observations attest to the dynamic character of the evolution of CRISPR-Cas systems and might add to the plausibility of the route of evolution from the CRISPR polymerase to the RAMP-based Cascade complexes (Figure 7). However, this scenario remains speculative given the absence of specific similarity between the RAMPs and CRISPR polymerases, and recruitment of another RRM-domain protein as the ancestral RAMP gene cannot be ruled out.
The CRISPR polymerase and the entire ancestral, Subtype III-A-like CRISPR-Cas system most likely evolved in thermophilic Archaea. Indeed, this system, and in particular the cas10 gene, is present in a substantial majority of archaea and is confidently reconstructed as a gene present in the Last Archaeal Common Ancestor (LACA) [53]. By contrast, Type III CRISPR-Cas systems are much less common in bacteria and often contain variants of Cas10 that are predicted to be inactivated [20]. Like most antiviral defense systems, CRISPR-Cas is prone to horizontal gene transfer (HGT) and could have rapidly spread among bacteria. Notably, many thermophilic bacteria possess Type III systems, which might have started the dissemination of CRISPR-Cas among bacteria. The active Cas10 could be particularly beneficial in thermal environments, in agreement with the previous observations that identified Cas10 as a prominent genomic determinant of the thermophilic lifestyle [2, 54].
The close association between Cas1 and Cas2 is more difficult to explain in terms of function or evolution. Given that Cas1 is a DNase with a Holliday junction resolvase-like activity [21, 49], it is most likely to function as a recombinase and integrase at the spacer acquisition stage. These activities are typical of transposable elements, so the origin of Cas1 from this type of element, which is extremely common in prokaryotes, appears likely. The endoribonuclease Cas2 might have evolved from another class of equally widespread mobile elements, namely toxin-antitoxin systems. Cas2 is yet another RRM-domain protein that is homologous to VapDHi, the toxin of the two-component toxin-antitoxin system vapDHi/VapX [55], as suggested previously [5] and supported by new HHpred searches which unequivocally retrieved Cas2 as the protein family most similar to VapDHi (for example, an HHpred search started with Psta_3906, VapDHi from Pirellula staleyi, detected Cas2, PF09827, as the best hit with the probability 98.9). It remains unclear whether Cas1 and Cas2 ever formed a distinct two-gene unit or have independently joined the evolving CRISPR-Cas system.
Type II CRISPR-Cas systems are the only group for which the origin of Cascade complex components could not be confidently inferred. Nevertheless, experimental data suggest that they function in general similarly to the Cascade complexes of Type I and Type III systems [9]. Of the three types of CRISPR-Cas systems, the Type II systems have undergone the most radical transformation compared to the inferred ancestral form. In the course of this transformation, the genes encoding the subunits of the ancestral Cascade complex as well as the large (polymerase) and small subunits appear to have been replaced by a single large, multidomain protein, Cas9, which contains two unrelated nuclease domains (Figure 5) and appears to be responsible for both CRISPR transcript processing and interference.