A putative RNA-interference-based immune system in prokaryotes: computational analysis of the predicted enzymatic machinery, functional analogies with eukaryotic RNAi, and hypothetical mechanisms of action
© Makarova et al; licensee BioMed Central Ltd. 2006
Received: 08 February 2006
Accepted: 16 March 2006
Published: 16 March 2006
All archaeal and many bacterial genomes contain Clustered Regularly Interspaced Short Palindrome Repeats (CRISPR) and variable arrays of the CRISPR-associated (cas) genes that have been previously implicated in a novel form of DNA repair on the basis of comparative analysis of their protein product sequences. However, the proximity of CRISPR and cas genes strongly suggests that they have related functions, which is hard to reconcile with the repair hypothesis.
The protein sequences of the numerous cas gene products were classified into ~25 distinct protein families; several new functional and structural predictions are described. Comparative-genomic analysis of CRISPR and cas genes leads to the hypothesis that the CRISPR-Cas system (CASS) is a mechanism of defense against invading phages and plasmids that functions analogously to the eukaryotic RNA interference (RNAi) systems. Specific functional analogies are drawn between several components of CASS and proteins involved in eukaryotic RNAi, including the double-stranded RNA-specific helicase-nuclease (dicer), the endonuclease cleaving target mRNAs (slicer), and the RNA-dependent RNA polymerase. However, none of the CASS components is orthologous to its apparent eukaryotic functional counterpart. It is proposed that unique inserts of CRISPR, some of which are homologous to fragments of bacteriophage and plasmid genes, function as prokaryotic siRNAs (psiRNA) by base-pairing with the target mRNAs and promoting their degradation or translation shutdown. Specific hypothetical schemes are developed for the functioning of the predicted prokaryotic siRNA system and for the formation of new CRISPR units with unique inserts encoding psiRNA conferring immunity to the respective newly encountered phages or plasmids. The unique inserts in CRISPR show virtually no similarity even between closely related bacterial strains, which suggests that they turn over rapidly on an evolutionary scale. Corollaries of this finding are that, even among closely related prokaryotes, the most commonly encountered phages and plasmids are different and/or that the dominant phages and plasmids turn over rapidly.
We proposed previously that Cas proteins comprise a novel DNA repair system. The association of the cas genes with CRISPR and, especially, the presence, in CRISPR units, of unique inserts homologous to phage and plasmid genes make us abandon this hypothesis. It appears most likely that CASS is a prokaryotic system of defense against phages and plasmids that functions via the RNAi mechanism. The functioning of this system seems to involve integration of fragments of foreign genes into archaeal and bacterial chromosomes yielding heritable immunity to the respective agents. However, it appears that this inheritance is extremely unstable on the evolutionary scale such that the repertoires of unique psiRNAs are completely replaced even in closely related prokaryotes, presumably, in response to rapidly changing repertoires of dominant phages and plasmids.
This article was reviewed by: Eric Bapteste, Patrick Forterre, and Martijn Huynen.
Open peer review
Reviewed by Eric Bapteste, Patrick Forterre, and Martijn Huynen.
For the full reviews, please go to the Reviewers' comments section.
The discovery of the elaborate and versatile systems of RNA silencing in eukaryotes is one of the pivotal advances in biology of the last decade [1–6]. There are two major, distinct forms of regulatory small RNAs involved in eukaryotic gene silencing: small interfering (si) RNAs and micro (mi) RNAs. siRNAs are produced from double-stranded RNAs of viruses and transposable elements, which are processed by the dicer nuclease, one of the essential components of the RNA-Induced Silencing Complexes (RISCs) [7–9]. Dicer cleaves long dsRNA molecules into short, 21–22 nucleotide duplexes which are subsequently unwound by the RISC to yield mature siRNAs. The RISC-siRNA complex then binds to the target mRNA which is cleaved by the slicer nuclease, another crucial component of RISC, to release the RISC-siRNA which acts as a recyclable catalyst [9, 10]. In addition to silencing genes of exogenous agents, a distinct class of longer, 28 nt siRNAs, the so-called repeat-associated siRNAs (rasiRNAs), silence expression of chromosomal copies of transposons and transposon-like repeats [11–13].
Unlike the siRNAs, 21–25 nt-long miRNAs are encoded in eukaryotic genomes and are either perfectly (in plants) or imperfectly (in animals) complementary to sequences in the 3'-untranslated regions of specific endogenous mRNAs [6, 13]. Base-pairing of miRNAs with the target mRNAs, which is mediated by a distinct form of RISC, results either in RNA cleavage or in down-regulation of translation. Evidence is rapidly accumulating that numerous, probably thousands of, miRNAs in animals and plants are major players in developmental regulation and chromatin remodeling.
Prokaryotes have apparent functional counterparts to the miRNA system, i.e., regulation of bacterial gene expression by small antisense RNAs. The best characterized of these pathways employ the RNA-binding protein Hfq for small RNA presentation and RNAse E for target degradation [14–17]. Escherichia coli has ~60 microRNA genes, and comparable numbers of expressed, small antisense RNAs have been detected in the archaea Archaeoglobus fulgidus and Sulfolobus solfataricus, suggesting an important role of this regulatory mechanism in prokaryotic physiology. In addition, small antisense RNAs have been shown to regulate plasmid replication and killing of plasmid-free bacterial cells by silencing specific plasmid genes [20–22]. In contrast, counterparts to the eukaryotic siRNA mechanism so far have not been described in prokaryotes. Here, we apply comparative genomics and in-depth computational analysis of protein and RNA sequences and structures to predict a distinct prokaryotic siRNA-like system and the associated enzymatic apparatus.
In a previous comparative-genomic study, which was originally conceived as a test case for the methods of conserved gene neighborhood analysis that we had developed, we characterized an extensive set of genes that included several proteins related to DNA or RNA metabolism and was, mostly, specific to thermophiles. These genes comprise a complex array of overlapping neighborhoods that are partially conserved but highly diversified, in terms of both gene composition and gene order, and are represented in all archaeal and many bacterial genomes [23, 24]. At the time of its discovery, we hypothesized that these genes encoded an uncharacterized, versatile repair system, largely associated with the thermophilic lifestyle.
Independently and almost simultaneously, Jansen and coworkers found that at least several genes from this gene neighborhood were tightly associated with the so-called Clustered Regularly Interspaced Short Palindrome Repeats (CRISPR); the acronym cas (for CRISPR-associated) genes was thus coined. The CRISPR are a distinct class of repetitive elements that are present in numerous prokaryotic genomes. A CRISPR element consists of a direct repeat of ~28–40 base pairs (bp), with the copies separated by a unique sequence of ~25–40 bp. Typically, CRISPR form tandem arrays containing from 4 to >100 elements. Most of the genomes contain a single array of CRISPR in which the sequences of the repeats are (nearly) identical; some, however, possess multiple CRISPR cassettes that may have substantially different sequences [24, 25]. The repeats in CRISPR from different genomes show only limited similarity, but often retain distinct, conserved motifs shared even by distant species including archaea and bacteria [18, 26]. There seems to be a strict link between CRISPR and cas genes, suggestive of a (nearly) mutualistic relationship: the great majority of the genomes that contain CRISPR also have at least a minimal set of cas genes, and vice versa.
Recently, Mojica and coworkers reported that the unique inserts in some of the CRISPR are homologous to fragments of bacteriophage and plasmid genes. This led to the hypothesis that the CRISPR might have a function in the defense of prokaryotes against invading foreign replicons and that there could be functional analogies between this putative defense system and eukaryotic RNA interference. Similar findings have been independently reported by two other groups [28, 29].
The recent rapid growth of the number and diversity of sequenced prokaryotic genomes has led to a dramatic increase in the complexity of the identified cas gene arrays. Here we describe the results of an exhaustive sequence analysis of the Cas protein sequences which yielded a classification of these proteins, several new functional predictions, and a reconstruction of evolutionary relationships between these genes. We propose that the cas genes encode the protein machinery of a prokaryotic siRNA-like system that performs primarily, but perhaps not exclusively, defense functions and is generally similar, in some respects, to eukaryotic siRNA, and in other respects, to the vertebrate immune system. The predicted enzymatic machinery of this system seems to be functionally analogous, but not homologous, to the protein apparatus involved in the eukaryotic RNA-mediated gene silencing. Finally, we outline possible molecular mechanisms of the predicted prokaryotic siRNA system. The hypothesis on the involvement of Cas proteins in an RNAi-type mechanism supplants the previous proposal that these proteins might comprise a novel DNA repair system, which is hardly compatible with the tight association of these proteins with CRISPR and the existence of unique CRISPR inserts homologous to phage and plasmid sequences.
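The repeat-spacer geometry described above (direct repeats of ~28–40 bp separated by unique inserts of ~25–40 bp, in arrays of at least 4 elements) can be illustrated with a minimal Python sketch. This is not the procedure used in the studies cited here: the function name, the fixed repeat length, and the requirement of exact repeat identity are simplifying assumptions made for illustration only, whereas real CRISPR repeats are only (nearly) identical.

```python
def find_crispr_like_arrays(seq, repeat_len=28, min_spacer=25,
                            max_spacer=40, min_units=4):
    """Scan a DNA string for runs of an exactly repeated k-mer whose copies
    are separated by spacers of min_spacer..max_spacer bp.

    Hypothetical helper: exact matching and a fixed repeat length are
    assumptions of this sketch, not properties of real CRISPR arrays.
    """
    arrays = []
    i = 0
    n = len(seq)
    while i + repeat_len <= n:
        repeat = seq[i:i + repeat_len]
        spacers = []
        pos = i  # start of the most recently found repeat copy
        while True:
            # The next copy must start after a spacer of allowed length.
            lo = pos + repeat_len + min_spacer
            hi = pos + repeat_len + max_spacer
            nxt = seq.find(repeat, lo, hi + repeat_len)
            if nxt == -1:
                break
            spacers.append(seq[pos + repeat_len:nxt])
            pos = nxt
        if len(spacers) + 1 >= min_units:
            arrays.append({"start": i, "repeat": repeat, "spacers": spacers})
            i = pos + repeat_len  # continue after the reported array
        else:
            i += 1
    return arrays
```

Applied to a sequence containing four copies of a 28 bp repeat separated by three 30 bp inserts, the function reports one array together with its three unique inserts; tolerating mismatches between repeat copies would be needed for any realistic use.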
Results and discussion
Identification, classification and evolutionary analysis of cas genes
In the original study of the cas gene neighborhoods, which was performed with ~40 genomes, we identified ~20 protein families that were tightly or more loosely associated with the system we now call CASS. A recent update by Haft and coworkers with >200 genomes yielded a diverse "guild" of ~45 Cas protein families. Many of the Cas proteins show very low sequence conservation, which makes identification of homologous relationships between them a non-trivial task. We employed an iterative approach to the exhaustive analysis of the Cas protein sequences. The protein sequences of each Cas family were compared to the protein sequences from all available prokaryotic genomes using PSI-BLAST [30, 31], the proteins encoded in the neighborhoods of all identified candidates were used as queries for further searches, and the process was iterated until convergence. The sequences of Cas proteins were, in addition, carefully compared to each other, in an attempt to identify possible traces of common origin of some of these genes that have so far eluded detection [for a complete list of Gene Identification (GI) numbers of the detected Cas proteins, see Additional file 1].
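The iterative search described above is, in essence, a fixed-point expansion: hits and their genomic neighbors become new queries until a round adds nothing new. A minimal Python sketch of that control flow follows. The callables `find_homologs` and `find_neighbors` are hypothetical placeholders (standing in for, e.g., a PSI-BLAST run and a gene-neighborhood lookup) and do not correspond to any real API; the thresholds and manual curation that such an analysis would involve are not modeled.

```python
def expand_family(seeds, find_homologs, find_neighbors, max_rounds=50):
    """Iterate homology search plus neighborhood expansion to convergence.

    find_homologs(query) and find_neighbors(hit) are hypothetical callables
    that return iterables of protein identifiers; they are assumptions of
    this sketch, not functions of any real library.
    """
    known = set(seeds)
    frontier = set(seeds)
    for _ in range(max_rounds):
        if not frontier:
            break  # converged: the last round found nothing new
        new = set()
        for query in frontier:
            for hit in find_homologs(query):
                # Both the hit and the genes around it become candidates.
                for candidate in (hit, *find_neighbors(hit)):
                    if candidate not in known:
                        new.add(candidate)
        known |= new
        frontier = new  # only new candidates are used as queries next round
    return known
```

The `max_rounds` cap is a safety limit added for the sketch; with a finite database the loop terminates on its own because each round can only add identifiers not seen before.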
COG1518, a universal marker of CASS
In agreement with the previous observations, we found that Cas1 (COG1518 in the Clusters of Orthologous Groups of proteins classification system) is the best marker of the CRISPR-associated systems (CASS). This gene encodes a highly conserved protein that is represented in all cas neighborhoods, with the single exception of Pyrococcus abyssi. A PSI-BLAST search for COG1518 members in the completely sequenced prokaryotic genomes revealed the presence of at least one representative of this COG in 77 of the 177 analyzed genomes.
Protein components of CASS
Putative novel nuclease/integrase; Mostly α-helical protein
COG1343 (cas2), COG3512, ygbF-like; MTH324-like; y1723_N-like;
Small protein related to VapD, fused to helicase (COG1203) in y1723-like proteins
DNA helicase; Most proteins have fusion to HD nuclease
COG1468 (cas4), COG4343
RecB-like nuclease; Contains three-cysteine C-terminal cluster
COG1688, COG1769, COG1583, COG1567, COG1336, COG1367, COG1604, COG1337, COG1332, COG5551, BH0337-like, MJ0978-like, YgcH-like, y1726-like, y1727-like
Belong to "RAMP" superfamily, possibly RNA-binding protein, structurally related to a duplicated ferredoxin fold (PDB: 1WJ9)
COG1857, COG3649, YgcJ-like, y1725-like
α/β protein; probable enzymatic activity, possibly, a nuclease
COG1203 (N-terminus), COG2254
All, mostly archaea and FIRM
Large Zn-finger-containing proteins, possibly, nucleases (nuclease activity has been reported for MTH1090).
Bacteria, mostly PROTEO
Large Zn-finger containing proteins;
COG1353, MTH326-like, alr1562, slr7011
All, mostly Archaea
Putative novel polymerase; Multidomain protein with permuted HD nuclease domain, palm domain, polymerase-thumb-like domain and Zn-ribbon; MTH326-like has inactivated polymerase catalytic domain; alr1562 and slr7011 – predicted only on the basis of size, presence of HD domain, and location with RAMPs in one operon
Former COG2462; Fusion of COG1517-like domain to HTH-type transcriptional regulator; Possible regulator of the system expression in archaea
All, mostly Archaea
~150 aa protein; Has a few motifs similar to ygcK-like; mostly α-helical protein
Bacteria, mostly PROTEO
~180 aa protein; has a few motifs similar to COG1421; mostly α-helical protein
All, mostly Archaea
~110 aa; mostly α-helical protein
All, mostly Archaea
Some are fused to HTH domain (see COG1517/HTH), some proteins have the domain duplication; structure is available (1XMX); domain appears to have a Rossmann-like fold.
Bacteria, mostly PROTEO
Huge protein; contains McrA/HNH-nuclease related domain and RuvC-like nuclease domain
All, mostly Archaea
Specific for Pyrococcus and Thermococcus. The pair ST0031/SSO1401 and AF1873, most likely, belong to the same family because they have similar lengths and are located in the identical place in an operon, but owing to low conservation they are not alignable
Former COG3574; ~150 aa protein.
~420 aa protein, no prediction
Bacteria, mostly PROTEO
~450 aa protein, no prediction
Bacteria, mostly FIRM
~220 aa protein, no prediction
Bacteria, mostly CHLOR
~130 aa protein, no prediction
~650 aa, no prediction
COG1343 and its relatives – distant homologs of VapD
COG1343 (cas2) is another gene that is common in CASS. Typically, this gene is located immediately downstream of the COG1518 gene (Fig. 2a). Exhaustive PSI-BLAST search starting from COG1343 proteins identified proteins of COG3512 as homologs of COG1343 such that these COGs could be unified in a single superfamily. The members of this superfamily are small (80–120 amino acids) proteins with distinct structural motifs, in particular, an N-terminal β-strand followed by a polar amino acid, most often, aspartate or asparagine (see Additional file 2). In CASS2, which is typified by the cas operon of E. coli, there is an uncharacterized gene coding for a small protein immediately downstream of the COG1518 gene. Analysis of the multiple alignment of the homologs of this protein revealed motifs highly similar to those in COG1343, suggesting that these proteins actually are diverged COG1343 homologs. Only CASS3 does not seem to contain a gene for a small protein that potentially could be a COG1343 homolog. However, we found that, in CASS3, the next gene after COG1518, which codes for the CASS helicase (cas3 or COG1203), is unusually long and contains a small domain preceding the HD-hydrolase N-terminal domain present in many COG1203 proteins. We analyzed this domain separately and found that its size and motifs were consistent with a homologous relationship with COG1343 (see Additional file 2). Thus, it appears that, either as a stand-alone protein or as part of a multidomain protein, the COG1343 domain is present in all CASS (except for a few highly degraded and, probably, non-functional ones) and, accordingly, could be essential for the CASS functions. Furthermore, searches started with the sequences of many proteins of COG3512 showed some sequence similarity to vapD (COG3309 family), a family of uncharacterized proteins that are functionally linked to the VapBC operon.
The VapBC operon encodes a variant of the bacterial toxin-antitoxin (TA) module which includes an HTH-containing transcription regulator and a PIN-domain nuclease. The PIN domain has been shown to possess ribonuclease activity that in eukaryotes is involved in pre-rRNA processing, nonsense-mediated decay (NMD) of aberrant mRNAs, and RNAi [35–37]. It has been proposed that there is an evolutionary connection between eukaryotic NMD and bacterial TA systems, and that the functioning of the TA module might involve mRNA degradation as well. Together, these observations seem to establish links between CASS and the TA system and, through the latter, between CASS and eukaryotic NMD and RNAi, but do not directly shed light on the function of the COG1343 domain. However, the COG1343 proteins and VapD show some generic similarity to the PIN domain in terms of size and the signature motifs (COG1343 proteins contain several partially conserved aspartates; see Additional file 2), which makes it tempting to speculate that these proteins represent yet another family of nucleases.
COG1857 – another putative enzyme found in most CASS versions
The Repeat-Associated Mysterious Proteins (RAMPs)
The RAMPs are the most diverse class of CASS genes. In addition to the previously identified 5 distinct families of RAMPs, we detected several additional ones, namely, BH0337-like, y1726-like, YgcH-like, y1727-like, and MJ0978-like families, as well as numerous diverged members of the previously described families (Fig. 3A). Despite the dramatic sequence divergence, all these proteins contain the RAMP signature, the G-rich loop at the C-terminus (Fig. 3A). One family of RAMPs, COG1853/COG5551, is often encoded outside the CASS operons or on the periphery of these. Moreover, analysis of the gene context of this gene in Aquifex aeolicus led us to the identification of yet another member of the RAMP superfamily, COG1851. This protein family does not appear to be linked to CASS at all.
With the identification of these new families of RAMPs, it now becomes apparent that all CASS versions, with the apparent exception of the minimal CASS4, include at least one RAMP. The crystal structure of one of the RAMPs from the newly detected YgcH-like family has been solved as part of one of the structural-genomics projects (PDB: 1wj9). The structure of this protein from Thermus thermophilus reveals that the RAMP module is a duplication of a ferredoxin-like fold domain. Each domain has a two-layer α+β architecture and is composed of four β-strands and two α-helices topologically arranged as a repeat of two βαβ units (Fig 3B, A1-A86, β1 through β4; and A87-A211, β5 through β8 for the first and the second domains, respectively). The N-terminal ferredoxin-like domain contains two additional α-helices (α1' and α1", Fig 3A) inserted before and after the first α-helix. The C-terminal domain has two disordered regions and houses the conserved Gly-rich loop situated between the last α-helix and β-strand (Fig. 3A). Various structure similarity search programs detect ferredoxin fold proteins as the first hits to RAMP domains. In particular, for the N-terminal RAMP domain, the first DALI match is the anticodon-binding domain of Phe-tRNA synthetase (PDB Entry 1eiy, chain B) with a Z-score of 4.8 and an RMSD (root-mean-square deviation) of 2.9 Å over 67 aligned residues. The VAST program finds ribosomal protein S6 (PDB Entry 1fjg chain F) as the top hit for the C-terminal RAMP domain with a P-value of 0.039 and an RMSD of 2.6 Å over 64 residues.
Thus, at least six gene (super)families seem to comprise the stable core of the CASS: COG1518 (cas1), COG1343 (cas2), COG1203 – a helicase often fused to a HD-family hydrolase (cas3) plus free-standing versions of the HD-hydrolase (COG2254), COG1468 (cas4) – a RecB family nuclease usually containing a C-terminal Zn cluster, COG1857, and RAMPs. This exact set of genes is seen only in a few genomes; most versions of CASS have substantial variations around this core – loss of some core genes in the minimal versions and addition of other genes and whole gene cassettes in others (Fig. 2a,b).
The second major module of CASS – the pol-cassette
The most notable non-core CASS module which, in a sense, constitutes a second, even if non-ubiquitous, central part of CASS may be called the "pol-cassette" after the predicted palm-domain RNA or DNA polymerase of COG1353. The pol-cassette also includes several distinct RAMPs and a few uncharacterized genes. The pol-cassette is strictly linked only to CASS5 and CASS7 and also, in some instances, is found in CASS1,2,4,6 (Fig. 2c), although, in some genomes, the pol-cassette is not adjacent to the CASS-core gene array. The phylogenetic tree for the predicted polymerase (COG1353), which consisted of two major branches corresponding to two distinct operon organizations (Fig. 2c), showed essentially no topological congruence with the COG1518 tree (for those CASS that have both components; compare the two trees in Fig. 2a,c). Thus, it appears that the pol-cassette comprises a distinct evolutionary unit that is often transferred horizontally independently of the CASS-core. Notably, the pol-cassette is strongly, although not strictly, linked to thermophily – the great majority of the species containing this module, typically associated with CASS5 and CASS7, are thermophiles. Additionally, several species possess a third module containing a diverged form of COG1353 with an apparently intact HD hydrolase domain but an inactivated polymerase (PALM) domain (Fig 2c).
Ancillary CASS components
The functions of several other CASS gene families remain obscure. For instance, CASS1, 5, 7 contain genes upstream of COG1857 that encode large (500–600 amino acids), homologous proteins; the best conserved family in this set of proteins is represented by BH0338 and its orthologs. Some members of the BH0338 family contain a Zn-ribbon in the middle of the sequence but otherwise have no recognizable domains or motifs. The conserved motifs of these proteins include two conserved aspartates and a distal conserved glycine, a combination that resembles the motifs seen in the PALM polymerase domain (not shown). Although we were unable to obtain additional evidence of the potential connection of this family with polymerases, it is tempting to speculate that these proteins might contain an extremely diverged version of the PALM domain. Similar, albeit less pronounced, motifs are detectable in the MTH1090 family proteins which are present in CASS5 and CASS7. This similarity and the fact that the respective genes occupy the same position in the corresponding operons suggest that the BH0338-family and the MTH1090-family proteins are highly diverged homologs. CASS2 also includes a family of large proteins (YgcL-family), some of which contain Zn-clusters; however, the conserved motifs of these proteins and their position in the respective operons are different from those of the MTH1090 and BH0338 families. CASS4 contains another huge protein (COG3513, ~1150–1400 aa) with two recognizable domains, a McrA/HNH nuclease and a RuvC-like nuclease (RNAseH fold). These observations emphasize the striking diversity of still poorly characterized CASS components, particularly, the plethora of predicted nucleases of various classes and potential novel ones.
Hypothesis: CASS is a prokaryotic defense system that functions on the RNAi principle
Based on the properties of CRISPR and Cas proteins, we speculate that this system is a functional analog of the eukaryotic siRNA systems and propose possible mechanisms of the putative prokaryotic small RNA interference. The crucial observation reported independently by Mojica et al., Pourcel et al., and Bolotin et al. is that a certain fraction (~10% according to ) of the unique inserts in CRISPR units are homologous to fragments of viral (bacteriophage) or plasmid genomes. Only a minuscule fraction of the existing phage and plasmid sequences is currently available, whereas the total diversity of prokaryotic mobile elements is enormous and apparently exceeds the diversity of prokaryotes at least by an order of magnitude [41, 42]. Thus, it is not far-fetched to propose that most, if not all, CRISPR inserts are derived from mobile replicons.
Should that be the case, it seems more or less obvious that CASS is a prokaryotic defense system against foreign replicons that functions on the antisense RNA principle. More specifically, it seems likely that the inserts are transcribed and silence the cognate phage or plasmid genes via the formation of a duplex between the prokaryotic small interfering RNA (psiRNA) and the target mRNA followed by cleavage of the duplex or translation repression. Indeed, Mojica et al. mention the analogy between the CRISPR and eukaryotic RNA interference systems but propose no specific mechanisms for the action of the putative defense systems and, crucially, do not explore the connection between the putative psiRNA and the predicted activities of Cas proteins. Important supporting evidence has been independently obtained through the analysis of the small non-messenger RNA expression in the euryarchaeon Archaeoglobus fulgidus which showed that CRISPR are transcribed (from a leader-promoter sequence), apparently, in the form of a multiunit precursor that is subsequently cleaved into CRISPR monomers and oligomers; very similar observations have been subsequently reported for the crenarchaeon Sulfolobus solfataricus. Furthermore, as noticed by Pourcel and coworkers, one of the unique CRISPR inserts in the MIGAS strain of the bacterium Streptococcus pyogenes is homologous to a prophage present in other strains of the same bacterium that, conversely, do not carry the CRISPR. This is compatible with the possibility that the insert makes the bacterium immune to the given phage.
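The proposed duplex formation between a psiRNA and its target mRNA amounts to finding a site in the mRNA that is the reverse complement of the psiRNA. A minimal Python sketch of that search is given below, under the assumptions that only Watson-Crick pairs count (no G-U wobble), that both molecules are given 5' to 3' in the A/C/G/U alphabet, and that a fixed number of mismatches may be tolerated; the function names and the mismatch parameter are hypothetical and are not taken from the analysis described here.

```python
# Watson-Crick complement table for RNA (G-U wobble pairs are ignored,
# which is a simplifying assumption of this sketch).
_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement_rna(rna):
    """Return the reverse complement of an RNA string over A, C, G, U."""
    return rna.translate(_COMPLEMENT)[::-1]


def find_target_sites(psirna, mrna, max_mismatches=0):
    """List (start, mismatches) for mRNA windows that could pair with psirna.

    Hypothetical helper: a window qualifies when it differs from the
    reverse complement of the psiRNA in at most max_mismatches positions.
    """
    site = reverse_complement_rna(psirna)
    k = len(site)
    hits = []
    for start in range(len(mrna) - k + 1):
        window = mrna[start:start + k]
        mismatches = sum(a != b for a, b in zip(site, window))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits
```

Whether the transcribed CRISPR strand is in fact antisense to the phage or plasmid mRNA is part of the hypothesis rather than an established fact, so in practice both orientations of an insert would have to be examined.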
Functional and structural parallels between CASS and eukaryotic RNAi machinery
Helicase/RNAseIII. Processing of long dsRNA into siRNA and pre-miRNA into miRNA, involves unwinding
Helicase (COG1203) + HD nuclease (COG2254) - fused or adjacent genes,
SFII helicase + HD nuclease
Ferredoxin-fold-PAZ-PIWI – endonuclease, target degradation
RecB-family nuclease (COG1468, 4343); COG1857 – a novel nuclease?
dsRNA-binding domain, interacts with Dicer
Ferredoxin-fold duplication. Size-specific psiRNA-binding, pre-psiRNA-binding, other RNA-binding functions?
Ferredoxin-fold duplication. Size-specific psiRNA-binding, pre-psiRNA-binding, other RNA-binding functions?
Tudor, SN – RNA-binding
Ferredoxin-fold duplication. Size-specific psiRNA-binding, pre-psiRNA-binding, other RNA-binding functions?
RGG – RNA-binding
RNA-dependent RNA polymerase
RdRp domain related to DdRp; 2nd-strand synthesis for siRNA production
Predicted RdRp/RT (COG1353)
Palm polymerase domain. 2nd strand synthesis for psiRNA production, reverse transcription for CRISPR formation
Figure 5b shows a version of this pathway that involves the activity of the CASS polymerase, by analogy with the eukaryotic RdRp, which participates in some RNAi pathways in most eukaryotes, but apparently has been lost in arthropods and chordates [45–47]. The initial steps in this scheme are the same as in the basic one (Fig. 5a) – transcription of the CRISPR, processing of the psiRNA precursor, and annealing of psiRNA to the target mRNA mediated by the pRISC – but, at the next step, psiRNA is postulated to serve as the primer for elongation by the CASS polymerase, yielding an extended double-stranded form of the target (Fig. 5b). This form would be cleaved by p-dicer analogously to the cleavage of viral and transposon dsRNAs by the eukaryotic dicers. The p-dicer might function as a complex with the respective RAMP to form a distinct version of pRISC. This could be the endpoint of the pathway, or else, the dsRNA degradation products could be utilized as new psiRNAs, resulting in amplification of the silencing effect (Fig. 5b). The CASS polymerase is most common in thermophiles, and it is tempting to speculate that the prevalence of this form of the psiRNA pathway has to do with the instability of the psiRNA-target duplex under the high ambient temperatures of these organisms.
The most complex and uncertain aspect of the putative prokaryotic RNAi system discussed here is the formation of new CRISPR units containing unique psiRNA genes specific for new targets encountered by the organism (Fig. 5c). The path to the creation of new psiRNAs would begin just like the response pathway, i.e., with transcription of the CRISPR locus and the first processing step yielding the 70–100 nt psiRNA precursors (compare Fig. 5c with Fig. 5a). At the next step, however, there must be a mechanism to replace the unique insert within the pre-psiRNA with a new fragment of foreign (e.g., phage) RNA. The nature of this mechanism remains unclear. In principle, two possibilities can be envisaged: i) reverse transcription with copy choice whereby a reverse transcriptase, most likely, the predicted CASS polymerase (COG1353), switches from using the pre-psiRNA as a template to using a phage mRNA, and then back, and ii) direct, non-homologous RNA recombination between a pre-psiRNA and a foreign mRNA, followed by reverse transcription of the resulting recombinant RNA (Fig. 5c). Both mechanisms are non-trivial in their molecular choreography and are unlikely to occur with high efficiency. Nevertheless, there are precedents for both in the molecular biology of retroviruses and other RNA viruses. In particular, reverse transcriptase switches templates in each cycle of retrovirus first-strand cDNA synthesis although, in this case, copy-choice is facilitated by the spatial juxtaposition of the two templates within the virus particle; a similar mechanism is responsible for recombination in retroviruses. In addition, and probably, more relevantly to the psiRNA case, reverse transcription with copy-choice is thought to be involved in the incorporation of copies of cellular genes, such as oncogenes, into retroviral genomes [49, 50].
The alternative, namely, direct recombination between RNA molecules, might seem far-fetched, but such a process has been demonstrated, by several groups independently, to occur in RNA viruses, apparently, via a protein-independent mechanism [51, 52]. During the formation of new psiRNA species, these low-frequency processes might be facilitated by the high abundance of the phage mRNAs involved. Indeed, it has been shown that the unique inserts in CRISPR most often correspond to fragments of essential, highly conserved phage genes that are typically expressed at a high level in infected bacteria. Once the dsDNA molecule consisting of a CRISPR unit with the new, unique insert is produced, by one of the mechanisms outlined here or, perhaps, via a different pathway, it must insert into the CRISPR array via homologous recombination (Fig. 5c). We suspect that this process is mediated by the COG1518 protein, the universal marker of CASS containing conserved motifs resembling those of different nucleases. It seems likely that this protein functions as the CRISPR integrase/recombinase, perhaps, in cooperation with the COG1343 protein, another universal component of CASS.
Additional lines of evidence relevant for the predicted RNAi function of CASS
Genes loosely associated with CASS
Example of genes associated with CASS
Reverse transcriptase (RT)
VVA1544, PG1982, alr1468
Fused to COG1518 on three occasions; a remnant of RT (Mbar_A1351 and MM3360) in the M. barkeri and M. mazei genomes is located close to cas
alr1560, ST0017, Ava_4168
HTH domain, component of toxin-antitoxin system, probably targeting mRNA
Large family of proteins, predicted to be a phosphatase or a nuclease on the basis of sequence motifs, which is shared by all three domains of life. In multidomain proteins in plants, it is associated with a C2H2 Zn-finger domain
An enzymatic domain that is located in an operon with restriction-modification systems or in association with a diverged helicase
Homologs of phage anti-repressor Ant which is known to be inhibited by an antisense RNA
Homolog of the eukaryotic argonaute proteins, which are key players in RNA-guided posttranscriptional regulation by siRNA and miRNA
Probably has an RNAseH-like fold, often fused to CopG-family of transcriptional regulators; forms a conserved operon with COG1724/hicA, which has the dsRBD-like fold; possible novel toxin-antitoxin module targeting mRNA
RNA binding domain
Fused to COG1343 in Lactobacillus bulgaricus and L. casei
Regulatory ATPase of the AAA family fused to a RecB-family nuclease; predicted regulator of RNA metabolism
DNA-binding domain, belongs to the same fold as MazE, which is involved in a toxin-antitoxin system
Ribosomal protein S1-like RNA-binding domain, fused to RAMP domain
Cold shock protein-like RNA-binding domain, fused to RAMP domain
A connection between CRISPR and RAMPs
Rank correlation coefficients between CRISPR spacers and selected Cas proteins
Values selected for correlation
Rank correlation coefficient
Number of spacers vs S.D. of spacer lengths
Number of spacers vs number of COG1518 proteins
Number of spacers vs number of RAMPs
Number of spacers vs number of COG1517 proteins
Number of spacers vs total number of Cas proteins minus RAMPs
Number of spacers vs total number of Cas proteins minus RAMPs and COG1517 proteins
Number of RAMPs vs S.D. of spacer lengths
Putative psiRNAs: relationships with phage and plasmid genes and secondary structure
A selection of CRISPR inserts homologous to phage, plasmid and prokaryotic genes
Sulfolobus acidocaldarius DSM 639
Xanthomonas oryzae KACC10331
Anabaena variabilis ATCC 29413
Streptococcus thermophilus CNRZ1066
Streptococcus thermophilus LMG 18311
Streptococcus agalactiae 2603
Importantly, the CRISPR insert sequences from even closely related bacterial strains are unrelated to each other, with only a few exceptions among enterobacterial strains (data not shown). This suggests that the inserts are replaced rapidly on the evolutionary scale, perhaps via the mechanism outlined in Fig. 5c. A broader implication is that the dominant phages and plasmids encountered by even closely related bacteria are different, leading to the rapid generation of distinct repertoires of psiRNAs.
Obviously, the entire concept of the prokaryotic CASS defense system functioning on the RNAi principle currently remains a hypothesis. However, we believe that three lines of evidence make such a mechanism, in its broad outline, almost a logical inevitability: i) the indisputable origin of at least some of the unique CRISPR inserts from phage and plasmid genes, ii) the demonstration of transcription and processing of the CRISPR loci in A. fulgidus and S. solfataricus, and iii) the abundance of CASS components that are clearly implicated in nucleic acid degradation, processing, and possibly, recombination. The only substantial variation on the theme of RNAi could be an antisense mechanism acting on DNA. Such a mechanism is, obviously, much less common than post-transcriptional gene silencing by small RNAs, but is not without precedent. Indeed, there are strong indications that elimination of intergenic DNA sequences in the macronucleus of the ciliate Tetrahymena occurs via a siRNA-mediated mechanism, with the participation of a specific dicer-slicer pair [54, 55]. In principle, this mechanism is compatible with the finding of both sense and antisense sequences among CRISPR inserts homologous to phage and plasmid genes (Table 5). Nevertheless, given the much wider prevalence of silencing pathways and, in particular, the demonstrated instances of antisense RNA regulation in bacteria, it seems most likely that CASS acts by silencing genes from invading replicons. This being said, it should be emphasized that the mechanisms depicted in Figure 5 are only rough outlines of some of the ways in which this system could function. There is no doubt that experimental studies will reveal mechanisms different from these, at least in detail. However, regardless of the specific mechanisms and even whether the predicted psiRNA system acts on RNA or on DNA, it appears certain that its main function is RNAi-mediated defense against alien replicons invading archaea and bacteria.
The predicted psiRNA system resembles the eukaryotic counterparts not only in its functional principle but also in the general characteristics of the implicated proteins. What is most striking is the comparable complexity, diversity, and plasticity of the protein machineries involved. Both systems consist of one or more helicases, a broad spectrum of nucleases, a specific polymerase, and a variety of RNA-binding proteins. Both in CASS and in RISCs, only two or three protein subunits appear to be truly indispensable; the rest come and go, resulting in a variety of RISCs with their distinct functions, many of them still poorly understood [7–9], and, presumably, in a comparable diversity of CASS. Remarkably, however, not a single protein belonging to the bona fide CASS has an ortholog in eukaryotes, involved in RNAi or otherwise. The single direct link could be the argonaut protein which is the central active moiety of eukaryotic RISCs (slicer) and might have some functional connection to CASS in archaea as tentatively suggested by the M. kandleri CASS operon structure; admittedly, however, the indications of potential involvement of argonaut in the CASS functioning are currently quite weak.
The eukaryotic RNAi systems come in two basic varieties: i) siRNAs that are produced from dsRNAs of viruses and transposons and protect the host from the respective agents via perfect base-pairing with the respective target mRNAs, and ii) miRNAs that regulate translation of endogenous genes via either perfect (plants) or imperfect (animals) base-pairing. CASS appears to be the functional counterpart of the siRNA mechanism inasmuch as it seems to be involved in defense against infecting agents, and the psiRNAs seem to be derived from the invading genome and are predicted to function via perfect base-pairing with the target. From a different viewpoint, however, this system is similar to miRNA in that the active small RNA moieties are encoded in the prokaryotic genomes rather than produced from the foreign dsRNA. The closest eukaryotic analogs of the CASS system might be the rasiRNAs that, like the putative psiRNAs, are encoded in the genome, are generated by processing of double-stranded molecules formed by symmetric transcripts of transposons (or repeats thought to be of transposon origin) and silence the latter, which contributes to heterochromatin formation [11–13, 56]. In contrast, the bacterial small antisense RNA regulatory pathways, which employ Hfq and RNase E and target regular, chromosomal bacterial genes rather than those of infectious agents or transposons [14–16], seem to be the functional counterpart of eukaryotic miRNA systems. Thus, prokaryotes seem to have at least two distinct RNAi systems, none of which is operated by homologs of eukaryotic RNAi protein machinery components. Furthermore, unlike the case of eukaryotes where the siRNA, rasiRNA, and miRNA systems are operated by substantially overlapping sets of proteins, the prokaryotic systems seem to be completely independent from one another.
Clearly, our interpretation of the probable functions of CASS is based, in large part, on the analogy with the eukaryotic siRNA system. There might be danger in heavily relying on analogy as a prediction method because, if the basic premise is false, the entire scheme will fall apart. However, in the case of CASS and eukaryotic siRNA, the analogy stems from two a priori independent lines of evidence, namely, the discovery of CRISPR inserts homologous to phage and plasmid genes and the functional similarity between Cas proteins and components of eukaryotic RNAi systems, e.g., dicer. Should there be no bona fide functional analogy between CASS and RNAi, there would be no basis for this congruence. By contrast, the homology of CRISPR inserts to phage and plasmid genes and, more generally, the association of cas genes with CRISPR have no explanation in the context of our previous hypothesis that CASS is a repair system, which forces us to abandon this interpretation of the CASS function.
All the analogies notwithstanding, the predicted psiRNA system shows at least one fundamental difference from the eukaryotic counterpart: the coding segments for the putative psiRNAs are derived from genes of invading agents and incorporated into the host genome to confer heritable immunity to the respective agent. As an acquired immunity mechanism, CASS more closely resembles the vertebrate immune system than the eukaryotic RNAi pathways but, again, with the crucial difference that the animal immunity is not inheritable. Furthermore, the wide spread of CASS, which spans a great variety of prokaryotic lineages, contrasts with the narrow presence of classical immunity, which appears to be a mechanism specific to jawed vertebrates; the recent discovery of a dramatically different immunity system in jawless vertebrates emphasizes the status of the immune system as a lineage-specific evolutionary novelty. More generally, it appears that CASS is one of the most ancient if not, indeed, the primordial biological defense system that probably emerged at an early stage of prokaryotic evolution, considering that diverse viruses, in all likelihood, have accompanied cellular life from its very beginning. Given the ubiquity of CASS in archaea and its less prominent presence in bacteria, one scenario is that CASS emerged in an ancient ancestor of archaea and spread to bacteria horizontally.
Interestingly, as a mechanism of inheritance of acquired traits, CASS seems to come closest to a true Lamarckian mode of evolution among all known systems of heredity. Remarkably, however, this putative system of Lamarckian inheritance appears to be extremely volatile on the evolutionary scale, as indicated by the lack of conservation of the psiRNA sequences even between closely related strains. A general implication of this aspect of CASS evolution is that the diversity of mobile replicons (phages, plasmids, transposons, etc.) in nature might be even greater than currently estimated [58, 59], such that even closely related bacteria occupying similar niches are predominantly invaded by different agents. Additionally or alternatively, it is conceivable that, in many niches, the dominant phages and plasmids rapidly turn over with time, making existing CRISPR cassettes obsolete as defense means and triggering their exchange.
Finally, a practical note. It seems that, once the psiRNA mechanism described here is investigated experimentally, it could be exploited to silence any gene in organisms that encode CASS. The simple design of such experimental gene silencing in prokaryotes would involve transfection with a plasmid containing the desired psiRNA inserted between CRISPR repeats to facilitate homologous recombination.
Genome sequences, databases and sequence analysis
The complete bacterial and archaeal genome sequences were retrieved from the National Center for Biotechnology Information (NCBI, NIH, Bethesda) FTP site. The non-redundant database of protein sequences at the National Center for Biotechnology Information (NIH, Bethesda) was iteratively searched using the PSI-BLAST program. The cut-off of E < 0.01 was normally employed for inclusion of sequences in the position-specific weight matrices. Each retrieved sequence was used as the query for additional searches until no new sequences could be detected. For detecting subtle sequence conservation, the PSI-BLAST search results were visually examined, and sequences with greater E-values but containing signature motifs of a given protein family were included in the profiles on a case-by-case basis [30, 60]. Multiple alignments of protein sequences were constructed using the MUSCLE program and corrected on the basis of PSI-BLAST results. Protein secondary structure was predicted using the JPRED program. Protein structure comparisons were performed using the DALI and VAST programs, and ribbon diagrams of protein structures were generated using the program BOBSCRIPT.
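The exhaustive search strategy described above (re-querying with every newly retrieved sequence until nothing new turns up) amounts to a closure computation over hit lists. A minimal sketch in Python follows; it is an illustration only, not the procedure actually scripted for this study. The function name `exhaustive_search`, the `search` callable (a stand-in for one database search returning hits that pass the E-value cut-off), and the toy identifiers are all assumptions introduced here.

```python
from collections import deque


def exhaustive_search(seed_ids, search):
    """Collect every sequence reachable by repeatedly re-querying with new hits.

    `search` is a hypothetical callable that takes one sequence identifier and
    returns the identifiers of its significant hits (e.g. those passing the
    E-value cut-off); it stands in for a real database search.
    """
    found = set(seed_ids)
    queue = deque(seed_ids)
    while queue:
        query = queue.popleft()
        for hit in search(query):
            if hit not in found:
                # A new family member: remember it and use it as a query later.
                found.add(hit)
                queue.append(hit)
    return found


# Toy hit lists used only to show the control flow; identifiers are made up.
toy_hits = {
    "seqA": ["seqB"],
    "seqB": ["seqA", "seqC"],
    "seqC": [],
    "seqD": ["seqA"],
}
family = exhaustive_search(["seqA"], lambda q: toy_hits.get(q, []))
```

In this toy case the search started from `seqA` reaches `seqB` and `seqC` but never `seqD`, because hits are followed only in the direction query to subject.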
Distance trees were constructed from multiple protein sequence alignments after excluding poorly aligned positions, by using the least-square method as implemented in the FITCH program of the PHYLIP package [65, 66]. Maximum likelihood trees were constructed using the ProtML program of the MOLPHY package, with the JTT-F model of amino acid substitutions, by optimizing the least-square trees with local rearrangements [67, 68].
Identification and analysis of CRISPR repeats
The search for repeats was performed as follows: first, all exactly matching 20-mer anchor substrings were identified in the nucleotide sequence of a bacterial or archaeal genome. Alignments around these anchors were expanded in both directions to include all adjacent positions with information content above or equal to 1.5 bits. All identified high-similarity fragments were used as queries in a nucleotide BLAST search (word size 7; mismatch penalty -1; both gap opening and extension costs -1; E-value threshold 0.001) to detect more diverged versions of the repeats as well as the instances of the repeat in the opposite strand. To determine the repeat boundaries more precisely, sequences of all loci with short (20 nt) flanks added were collected and aligned using the MUSCLE program. Alignments were trimmed from the 5' and 3' termini up to columns with information content exceeding 1.3 bits. Repeat families that shared the spatial arrangement typical of CRISPR (median repeat length of 20–50 nt; median spacer length of 15–60 nt) were identified as CRISPR candidates and further examined for chromosomal proximity to cas genes. The custom PERL scripts used for this analysis are available upon request.
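Three of the steps above (finding exact 20-mer anchors, trimming an alignment by column information content, and applying the median-length criteria) can be sketched in Python. This is a minimal illustration under stated assumptions, not a reproduction of the PERL scripts: the function names are hypothetical, information content is assumed to be 2 minus the Shannon entropy of a column (uniform background over A, C, G, T, no small-sample correction, gaps ignored), and the BLAST and MUSCLE steps are omitted.

```python
import math
from collections import Counter, defaultdict
from statistics import median


def find_anchor_kmers(genome, k=20):
    """Return {kmer: [start positions]} for every k-mer occurring at least twice."""
    positions = defaultdict(list)
    for i in range(len(genome) - k + 1):
        positions[genome[i:i + k]].append(i)
    return {kmer: pos for kmer, pos in positions.items() if len(pos) > 1}


def column_information(column):
    """Information content (bits) of one alignment column: 2 - Shannon entropy.

    Assumes a uniform background over A, C, G, T and ignores gap characters.
    """
    bases = [b for b in column if b in "ACGT"]
    if not bases:
        return 0.0
    total = len(bases)
    entropy = -sum((n / total) * math.log2(n / total)
                   for n in Counter(bases).values())
    return 2.0 - entropy


def trim_alignment(aligned, threshold=1.3):
    """Trim both termini up to the outermost columns exceeding the threshold."""
    length = len(aligned[0])
    info = [column_information([s[j] for s in aligned]) for j in range(length)]
    keep = [j for j, bits in enumerate(info) if bits > threshold]
    if not keep:
        return ["" for _ in aligned]
    start, end = keep[0], keep[-1] + 1
    return [s[start:end] for s in aligned]


def looks_like_crispr(repeat_lengths, spacer_lengths):
    """Median-length criteria quoted in the text (20-50 nt repeats, 15-60 nt spacers)."""
    return (20 <= median(repeat_lengths) <= 50
            and 15 <= median(spacer_lengths) <= 60)
```

The dictionary-based anchor search uses memory proportional to genome length, which is acceptable for prokaryotic genomes; a suffix array would be the usual alternative for larger inputs.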
Sequences of eukaryotic miRNA precursors, CRISPRs, and randomly shuffled CRISPR sequences were computationally folded, and the free energy of the most stable secondary structure was calculated using a dynamic programming algorithm that employs nearest neighbor parameters to evaluate free energy. Energy minimization was performed by a dynamic programming method that finds the secondary structures with the minimum free energy by summing up the contributions from stacking, loop length, and other structural features, using improved thermodynamic parameters.
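The logic of comparing a sequence with its shuffled versions can be illustrated with a much simpler folding criterion. The Python sketch below deliberately swaps in Nussinov-style base-pair maximization for the nearest-neighbor free-energy minimization used in this study, so its scores are pair counts, not free energies, and are not comparable to the values reported here. All function names, the minimum loop length of 3, and the default number of shuffles are assumptions made for illustration.

```python
import random

# Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(seq, min_loop=3):
    """Nussinov-style dynamic programming: maximum number of nested base pairs.

    A crude stand-in for minimum free energy folding; it counts pairs instead
    of summing nearest-neighbor stacking and loop energies.
    """
    seq = seq.upper().replace("T", "U")
    n = len(seq)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case: position j is unpaired
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in PAIRS:  # case: k pairs with j
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


def shuffled_controls(seq, n_controls=100, seed=0):
    """Mononucleotide shuffles of seq (same composition, random order)."""
    rng = random.Random(seed)
    controls = []
    for _ in range(n_controls):
        letters = list(seq)
        rng.shuffle(letters)
        controls.append("".join(letters))
    return controls


def compare_to_shuffled(seq, n_controls=100, seed=0):
    """Return (score of seq, mean score of its shuffled controls)."""
    observed = max_base_pairs(seq)
    scores = [max_base_pairs(s) for s in shuffled_controls(seq, n_controls, seed)]
    return observed, sum(scores) / len(scores)
```

A sequence whose score clearly exceeds the mean of its shuffled controls is more structured than expected from its base composition alone, which is the comparison the shuffled CRISPR sequences serve in the analysis above.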
Similarity of inter-CRISPR spacers to other sequences
Nucleotide sequences of inter-CRISPR spacers were used as a query in MEGABLAST searches (word size 11; e-value threshold 0.01) against GenBank; hits to virus or plasmid sequences and to distantly related prokaryotes were counted separately for each source organism.
Reviewer's report 1
Eric Bapteste, Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, Nova Scotia, Canada
The scientific quality of this paper and its methodology is certain. There has been a lot of good and interesting work done here. As indicated by its title, it thus provides a wealth of information and one hypothesis about RNA-interference in prokaryotes. Because of this broad scope, the manuscript is quite large. In fact, several of its parts could be read on their own, depending on the reader's specific interests; this is notably the case for the part dealing with the hypothesis of an RNAi prokaryotic immune system. I would thus suggest that a shorter version of the paper, centered around this very interesting hypothesis, could be proposed online (with the first part turning into Supp. Mat.), because I feel that this part is going to receive more attention anyway, and it would be unfortunate if some readers did not consider this aspect because they are scared by the overall size of the paper. But this is simply a suggestion, and the authors are more than welcome to disregard my opinion.
Author response: While we fully understand the sentiment and agree that the hypothesis on prokaryotic RNAi is of greater general interest than the detailed presentation of the protein sequence-structure analysis, we strongly feel that the latter provides a badly needed foundation for the hypothesis as well as important information in its own right. Furthermore, we tend to believe that the general spirit of online publication is to present complete results of a study (of course, there are exceptions). The reader can easily navigate between sections, so the length of a paper does not represent a particularly severe problem. Furthermore, we made certain modifications to the protein analysis part in response to similar but more specific comments of Martijn Huynen, in particular, introduced additional subheadings which, hopefully, makes this part of the paper more reader-friendly.
The seductive hypothesis of an RNAi-based immune system is presented as an analogy with the eukaryotic RNAi system. The use of analogy is potentially challenging: on the one hand, it allows a powerful and elegant presentation of many complex genomic results, but on the other hand, it is questionable, since the analogy may impose an a priori model to interpret biological features, and if this model is incorrect, if the analogy does not hold, there is a risk that the genomics data receive a fairly biased interpretation. In this respect, it would be interesting if the authors discussed whether a homologous immune system could have been possible in eukaryotes and in prokaryotes, and why it is not found. Indeed, such a homologous system would be a more natural reference to interpret the data than an analogy.
Author response: We understand the epistemological concerns regarding the role of analogy in this study. However, as indicated by the reviewer himself, the analogy is strong. Moreover, this analogy is manifest at two different levels: i) the presence of inserts homologous to phage and plasmid genes in CRISPR units and ii) presence of predicted activities compatible with a siRNA-like system among cas gene products, in particular, the dicer analog. Had the analogy been false, there would be no reason whatever for this congruence. We briefly comment to that effect in the revised manuscript. The idea of homologous immune systems in prokaryotes and eukaryotes seems a little far-fetched. Nevertheless, this comment prompted us to incorporate a brief comparison of the evolutionary histories of RNAi and classical immune systems.
This being said, I do not feel that the use of the analogy was a problem here, as it is convincingly presented and argued by the authors. We could eventually question more whether all the so-called CASS genes are really involved in the prokaryotic immune system and deserve their label: some may just be present in the genomic proximity of CRISPR, yet have nothing to do with the RNA interference. They might just be mobile «travelling» genes. A further study of the genomic distribution of the homologs of the CASS genes in bacterial genomes may help to clarify which genes are strongly and exclusively CRISPR-related and which ones can also be found in alternative locations in different genomes. Then, perhaps the «striking diversity of still poorly characterized CASS components» described on page 16 would appear less striking if some CASS categories are simply not relevantly defined, and include unrelated proteins, since maybe the use of the analogy would have led to too relaxed definitions of CASS. In another situation, by contrast, the use of the analogy could perhaps be too strict. On page 18, the authors wonder how to identify «the slicer counterpart (p-slicer)» in prokaryotes. They explain that this identification is «less straightforward because of the diversity of predicted nucleases within the CASS». But, after all, why should there be only one p-slicer, as in eukaryotes? It is possible that prokaryotes have multiple «p-slicers».
Author response: Yes, we agree, the possibility of multiple slicers exists, and we modified the text to acknowledge this. With regard to the rest of this comment, however, we feel that the current diversity of prokaryotic genomes is already sufficient to make conclusions on the strength of the association of individual genes with CASS, and the genes are classified here accordingly, as true CASS components and loosely associated "satellites". As far as the latter are concerned, inferences on involvement with CASS functions are made only for those genes whose activities appear clearly relevant, like the RT or Argonaut.
Finally, the strong suspicion about the analogous and intricate functions of COG1518 and COG1343 (cf. page 21) could be similarly toned down. Maybe these genes do play the essential analogous role of CRISPR integrase/recombinase consistently with the analogy, but maybe they fulfill several different tasks. Perhaps the authors would like to comment more on some of these minor points.
Author response: Actually, this prediction is not based on analogy with eukaryotic RNAi systems but rather on the mutualistic association of these genes to CRISPR and the features of the proteins themselves. We dropped the "strong" suspicion but, generally, we strongly believe that this is the best possible prediction. Reference to multitasking might not be particularly productive unless there are good ideas regarding what these multiple functions might be (this is different from the above possibility that there are multiple slicers which is, indeed, compatible with the data).
Also, to go back to the CASS gene evolution, the authors mention, page 8, «the extraordinary evolutionary mobility of CASS». It is unclear to me how this statement has been tested, and how the authors have established that CASS genes are more mobile than any average gene randomly picked in the same collection of prokaryotic genomes. For this reason, I am not sure if, as claimed by the authors, on page 14 «the pol-cassette comprises a distinct evolutionary unit that is often transferred horizontally independently of the CASS-core». Does the CASS-core really have an established vertical mode of inheritance or, as the authors stated before, a «non-uniform» distribution (cf. page 8)? This might be more strongly argued.
Author response: Several distinct issues are addressed here. With regard to the 'extraordinary mobility' of CASS, this is demonstrated by the trees in Fig. 2 (more trees have been published previously in our own 2002 paper and by Haft et al.), but even more convincingly, by the persistent pattern of presence-absence of CASS in closely related species and even strains of bacteria. We consolidated the argument such that this becomes clear the first time "extraordinary" mobility comes up. It is true that we did not compare the mobility of the CASS components with that of garden-variety prokaryotic genes in a rigorous, quantitative manner. While this is doable, in principle, all methods we are aware of are open to debate, and we feel that the exercise is beyond the scope of this paper. Given the above argument, we believe that, qualitatively, it is clear that CASS is unusual in this respect. With regard to the pol-cassette, we believe that the discrepancy between the topologies of the two trees in Fig. 2 is quite sufficient for the statement on independent HGT. As for the "vertical mode of inheritance" of CASS, there seems to be a semantic issue here. We do not really claim vertical inheritance for CASS but neither is such a pattern necessary to detect horizontal mobility. What is required is a predominant pattern of vertical inheritance among other genes that allows us to use a species tree to detect HGT. Of course, we realize that there are substantial arguments for abandoning "tree thinking" altogether but, on balance, we still believe that a species tree conceptualized as a central trend in the evolution of gene ensembles is, at least, a useful tool for analysis of genome evolution.
These few questions show a strength of the present work, which interestingly opens perspectives and suggests that some additional analyses should now be conducted, because the topic deserves consideration. Maybe the authors would feel like addressing some of the points below in a revised version of the current paper, or in future analyses.
Author response: These are, indeed, very interesting questions, we appreciate them. Some are for future studies but we can provide certain answers now.
Further study could include the following:
- Do other genomic regions harboring concentrations of nucleases comparable to the ones around CRISPR exist elsewhere in the genomes?
Author response: Hardly. As indicated in the paper, the CRISPR neighborhood is the second most prominent (i.e., the one that ranks second in the number of genes) neighborhood in prokaryotic genomes after the ribosomal superoperons, so it is quite outstanding. However, there are other, considerably smaller constellations of nucleases, such as the classical recBCD operon encoding repair proteins, some restriction-modification systems, and, perhaps, others that are still poorly understood and deserve investigation.
- If yes, is there more than one prokaryotic immune system definable on this analogous ground? Notably, did bacteria without CRISPR evolve a totally different immune system?
Author response: There is no evidence of that. Furthermore, as repeatedly emphasized in this paper, CASS shows extreme evolutionary volatility, apparently being lost quite easily, in a short time, on evolutionary scale. It is hardly imaginable that these bacteria evolved a distinct immune system in the short time elapsed since the loss of CASS. Of course, purely hypothetically, one could perceive the possibility that another immune system is disseminated horizontally, like CASS, and prokaryotes having both, could differentially lose one of them. However, we are unaware of any support for such a scenario. Another prominent prokaryotic defense mechanism is restriction-modification; it would be interesting to examine the relationship between RM systems and CASS, that could be a subject for a future study.
- How did the psiRNA pathway arise in thermophiles (cf. page 20)? Does it result from a transfer? Was it ancestral?
Author response: Very interesting, fundamental questions, indeed. In response, we expanded the discussion of these and other aspects of evolution of CASS. The specific preponderance of CASS in thermophiles, noticed already in the 2002 paper, when we thought that this was a thermophile-specific repair system, remains a mystery. Whatever the nature of this association, it seems likely that CASS is ancestral in thermophiles (at least in hyperthermophiles).
- Could we imagine that multiple promoters exist, both sense and antisense, which would activate the transcription of CRISPR, generating even more RNAi (cf. p. 24)?
Author response: In principle, existence of multiple promoters cannot be ruled out. However, the leader sequence seems to be the only natural candidate for the promoter function. The rest of the CRISPR cassette is homogeneous (repetitive), so it is unclear where an alternative promoter would be located. Further, in the two archaeal systems that have been studied experimentally (Archaeoglobus and Sulfolobus), all transcription of CRISPR loci appears to be unidirectional.
- Finally, it might be challenging, though interesting, to test in vitro on bacterial cultures whether, as proposed by the authors, the presence of CRISPR and CASS really has an impact on the fitness of prokaryotes in the presence of viruses.
Author response: We certainly hope that the computational analyses and predictions described in this paper stimulate a lot of experimentation aimed at elucidation of the biological functions of CASS and roles of its individual components.
We greatly appreciate these insightful and stimulating comments.
On page 7: «functionally analogous» is redundant.
Author response: We see the point but do not really agree. The word "functionally" seems to add clarity.
On page 8: the sentence «the distribution of COG1518 and, by implication, CASS among prokaryotic lineages...» is too «bold» for me: even if the conclusion is correct, I am not sure one can generalize as suggested here from the case of one protein only.
Author response: Indeed, we can. Rephrased to clarify and emphasize this.
On page 15: «several other CASS gene families remain mysterious» is a mysterious sentence. I am not sure what this does really mean.
Author response: That there is no clue as to the possible functions of these proteins; modified to clarify.
On page 21: I miss the idea of the sentence starting with «In addition, and probably, more relevantly etc.» and ending with «retroviral genomes». Could you rephrase it to make it a little more explicit?
Author response: Rephrased – hopefully, to clarify.
On page 23: what is the criterion retained for homology between the plasmid genes, fragments of phages and the CRISPR sequences?
Author response: The following quote from the Methods addresses this issue:
"Nucleotide sequences of inter-CRISPR spacers were used as a query in MEGABLAST searches (word size 11; e-value threshold 0.01) against GenBank; hits to virus or plasmid sequences and to distantly related prokaryotes were counted separately for each source organism."
On page 48: To me, the multiple positive correlations evoke multiple causalities and the possibility of some hidden correlations. Would you say that all the relevant combinations have been considered here?
Author response: No, we won't claim that. More complex multiple regression analysis would be required to separate correlations that reflect true causality; for the purposes of this paper, we felt it was sufficient to note the strongest correlations.
Reviewer's report 2
Patrick Forterre, Biologie Moléculaire du Gène chez les Extrêmophiles (BMGE) Institut de Génétique et Microbiologie (IGM), Université Paris-Sud, Centre d'Orsay, 91405 Orsay Cedex, France, and Biologie Moléculaire du Gène chez les Extrêmophiles (BMGE), Département de Microbiologie Fondamentale et Médicale, Institut Pasteur, Paris, France
In this very important paper, Makarova and coworkers propose a detailed mechanism for a putative procaryotic antiviral immunity system mediated by CRISPR sequences and their associated Cas proteins (the CAS system, CASS sensu the authors). Their model is based on the hypothesis that these elements represent a prokaryotic-specific antiviral mechanism analogous to the eukaryotic RNAi system. In procaryotes (Bacteria and Archaea) there are no homologs of the proteins involved in the eucaryotic RNAi system. Until recently, it was therefore widely believed that restriction-modification mechanisms were the only defense available to procaryotes to fight viral infections. However, it was proposed last year by several groups that procaryotic CASS could also play a significant role in fighting viral aggression in archaea and bacteria (Mojica et al., 2005; Pourcel et al., 2005; Bolotin et al., 2005). CRISPR sequences, which are transcribed but non-coding, are formed by the tandem repetition of units containing both a conserved element (similar all along a given CRISPR) and a variable element, the spacer, different from one unit to the other. The spacer sequences have strikingly no homologous sequence in databases, except for viral or plasmid sequences. Both Mojica et al. (2005) and Bolotin et al. (2005) have suggested that transcription of CRISPR sequences produces anti-sense RNA that can inhibit transcription of incoming viral (plasmid) sequences, and Mojica et al. (2005) mentioned the analogy of such a system with eukaryotic RNAi. However, these authors did not elaborate on the specific mechanism involved and how the cas proteins could be involved in the processing of viral RNA.
In this work, Makarova and co-workers first performed an updated analysis of cas proteins, using genomic context analysis and sensitive (iterative) methods to detect low levels of similarity and to classify cas proteins into families and superfamilies. They were able to identify several new putative cas proteins and to define 25 superfamilies of cas proteins and 7 different types of CASS organization (named CASS1 to 7). They also analyzed all available CRISPR repeat sequences and their putative secondary structures. More importantly, they try to predict the biological function of the cas proteins and their mechanism of action in the framework of the RNAi hypothesis. Previously, it had been suggested that cas proteins were involved in the formation and spreading of the CRISPR. For instance, Bolotin et al. predicted that cas proteins act at the DNA level by promoting cleavage, recombination and ligation. Makarova et al. are the first to suggest that several cas proteins should instead act at the RNA level, by promoting RNA degradation and RNA-RNA hybridization. They specifically suggest the existence of prokaryotic analogs of eukaryotic dicer (helicase-nuclease) and slicer (nuclease). They also propose that a previously suspected DNA polymerase could be an RNA-dependent RNA polymerase used to stabilize RNA/RNA hybrids by extending iRNA hybridized to its viral mRNA target. They further suggest the involvement of a reverse transcriptase in the formation of the linker sequences from viral (plasmid) mRNA. In my opinion, all these proposals are reasonable and very convincing. Another prediction is that RAMP proteins recognize linker sequences of different sizes. This is supported by a correlation between the number of linker sequences and the number of RAMP-encoding genes (Fig. 6).
In that case, it is not clear to me why this could not be due to the binding of RAMPs to the repeated units, since these units exhibit conserved sequences and their number (identical to the number of linkers) should also be correlated with the number of RAMPs.
Author response: That RAMPs discriminate, one way or another, between CRISPR inserts, is strongly suggested by the extreme sequence divergence of RAMPs which is hardly compatible with recognition of identical repeats. To be explicit about it, we added a clarification at the end of this section.
The search for specific secondary structures associated with the repeated units did not give convincing results, which suggests to me that the dyad symmetry observed in many repeat units could be due to the binding of proteins with a repeated structure (possibly the duplicated ferredoxin-like fold present in RAMPs) and not to the formation of secondary structures in the transcribed repeats.
Author response: It is hard to see how one excludes the other: it stands to reason that CRISPR do form distinct secondary structures which bind to symmetrical proteins.
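The dyad symmetry at issue in this exchange can be quantified at the sequence level with a short Python sketch. It scores how often a base in the left half of a repeat is complementary to its mirror-image partner in the right half. The function name `dyad_symmetry_fraction` and the example sequences are assumptions made for illustration; the score says nothing about whether the symmetry reflects RNA secondary structure, protein binding, or both.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def dyad_symmetry_fraction(repeat: str) -> float:
    """Fraction of mirror-image position pairs that are Watson-Crick complementary.

    A perfect reverse-complement palindrome scores 1.0. For odd lengths the
    central base has no partner and is ignored.
    """
    seq = repeat.upper()
    half = len(seq) // 2
    if half == 0:
        return 0.0
    matches = sum(
        1 for i in range(half) if COMPLEMENT.get(seq[i]) == seq[-1 - i]
    )
    return matches / half


# Toy sequences (invented for illustration only).
perfect = "GAATTC"                 # perfect palindrome
stem_loop = "GGATCGTTTTCGATCC"     # 6-bp stem around a 4-nt unpaired centre
no_symmetry = "AAAAAA"

for name, seq in [("perfect", perfect), ("stem_loop", stem_loop), ("none", no_symmetry)]:
    print(name, dyad_symmetry_fraction(seq))
```

A high score is equally compatible with a hairpin in the transcript and with a binding site for a symmetrical protein, so the sketch only measures the feature that both interpretations set out to explain.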
The proposed model (including its possible variations) thus implies many predictions that could be tested experimentally. Surprisingly, to my knowledge, only one cas protein has been studied at the bench up to now (ref 70 in the manuscript). This protein turns out to have DNase activity in vitro, but I suspect that the authors did not test for a possible RNase activity. This is surprising because the importance of these proteins was already highlighted in 2002 by two in silico papers, one of which suggested their participation in a "mysterious DNA repair system" while the other described their association with CRISPR sequences. The present paper, with much more specific predictions, should hopefully strongly stimulate biochemists and molecular biologists to jump onto this really exciting story. As noted by the authors in their conclusion, if their hypothesis turns out to be correct, this prokaryotic RNAi system could be exploited to silence any gene in organisms that encode CASS. Furthermore, the experimental study of this system should help us gain critical new insights into the dynamic relationships between viruses and archaeal/bacterial populations in nature.
Finally, I would like to know if the authors have some idea about the origin of this CAS system. Why is it present in all archaeal genomes sequenced so far? Is it possible that this system originated in Archaea and was later on introduced in bacteria by LGT?
Author response: Given the horizontal mobility of CASS, we can only speculate on the point of its origin. We expand such speculation in the revised conclusion including the possibility of archaeal origin.
– In some cases, the authors should be more cautious in their statements. For instance, when they talk about the pol-cassette, it might lead some readers to believe that the polymerase activity of the COG1353 protein has been experimentally validated, which is not the case.
Author response: We added a few more "predicted". However, we did not want to abandon the term 'pol-cassette' as it is descriptive and succinct.
Reviewer's report 3
Martijn Huynen, Nijmegen Center for Molecular Life Sciences University Medical Center St. Radboud p/a Center for Molecular and Biomolecular Informatics, Nijmegen, Netherlands
This paper provides a highly interesting and well documented hypothesis about a cluster of genes that E. Koonin and co-workers discovered some time ago. By combining biological knowledge with bioinformatics methods and creative thinking, the authors propose that Archaea and, to a somewhat lesser extent, Bacteria possess an RNA-interference-based immune system involving CRISPR and cas genes that is analogous to the eukaryotic RNA interference systems. Although aspects of this hypothesis have been published before, specifically with respect to CRISPR, this paper is, as far as I can tell, the first to make the analogy between the cas genes and the RNA interference system. The idea that prokaryotic genomes would internalize pieces of foreign DNA in order to be able to defend themselves against it, thus having an immune system with a memory, would be an interesting example of Lamarckian evolution.
I do have some questions and editorial comments that I think should be addressed.
1) Do the authors have any idea why this system has the phylogenetic distribution that it does, being present in a genome as small as that of the nanoarchaeon, but not in, e.g., the majority of Firmicutes?
Author response: No mechanistic idea, unfortunate as this might be. We added some additional discussion of the ultimate origin of CASS (see the response to Patrick Forterre).
2) Concerning the feasibility of the system proposed by the authors: Is anything known about how many fiendly DNAs a prokaryote encounters in daily life, and how does that compare to the number of different elements in a CRISPR?
Author response: Not enough for this particular comparison. However, it is well known that phages are extremely abundant, much more so than bacteria or archaea, and in the revised manuscript, we refer to this more specifically, with the corresponding references.
3) Regarding the Lamarckian scheme: That the unique elements of the CRISPR correspond to highly conserved, essential elements of phage genes suggests that selection on genetic variation also plays a role here. So the scheme would be partly Lamarckian.
Author response: Probably so. The way we state it in the text, "CASS seems to come closest to a true Lamarckian mode of evolution among all known systems of heredity", is compatible with this view.
4) I am not so convinced by the argument on page 3 that the results imply that even among closely related prokaryotes the most commonly encountered phages are different. First of all, it is more a corollary of the hypothesis, but second, it could also reflect the high turnover of phages over time, rather than niche.
Author response: This is a very good idea; we now mention this possibility both in the Abstract and in the Discussion.
5) I am puzzled by the involvement of more or less randomly selected pieces of foreign DNA/RNA in exactly the same location in the secondary structure of the psiRNA (the top of the hairpin). Does this pattern occur more often?
Author response: The situation in which the insert forms a stem of varying stability with parts of the repeats is common but not universal. The positions of the inserts are not exactly the same, although they are indeed very similar, and the stems in which the inserts are involved are imperfect. Of course, the exciting possibility exists that the CRISPR inserts are specifically selected for their ability to base-pair with the repeats; however, we do not have enough data to make that claim.
We thank M. Huynen and J. Van der Oost for helpful discussions, and all three reviewers for their insightful, detailed, and meticulous reviews that, we believe, allowed us to substantially improve the article. This work was supported in part by the Intramural Research Program of the National Library of Medicine at the National Institutes of Health/DHHS.
- Fire A: RNA-triggered gene silencing. Trends Genet 1999, 15(9):358-363. doi:10.1016/S0168-9525(99)01818-1
- Hannon GJ: RNA interference. Nature 2002, 418(6894):244-251. doi:10.1038/418244a
- Cogoni C, Macino G: Post-transcriptional gene silencing across kingdoms. Curr Opin Genet Dev 2000, 10(6):638-643. doi:10.1016/S0959-437X(00)00134-9
- Bernstein E, Denli AM, Hannon GJ: The rest is silence. RNA 2001, 7(11):1509-1521.
- Denli AM, Hannon GJ: RNAi: an ever-growing puzzle. Trends Biochem Sci 2003, 28(4):196-201. doi:10.1016/S0968-0004(03)00058-6
- Zamore PD, Haley B: Ribo-gnome: the big world of small RNAs. Science 2005, 309(5740):1519-1524. doi:10.1126/science.1111444
- Filipowicz W: RNAi: the nuts and bolts of the RISC machine. Cell 2005, 122(1):17-20. doi:10.1016/j.cell.2005.06.023
- Tang G: siRNA and miRNA: an insight into RISCs. Trends Biochem Sci 2005, 30(2):106-114. doi:10.1016/j.tibs.2004.12.007
- Sontheimer EJ: Assembly and function of RNA silencing complexes. Nat Rev Mol Cell Biol 2005, 6(2):127-138. doi:10.1038/nrm1568
- Miyoshi K, Tsukumo H, Nagami T, Siomi H, Siomi MC: Slicer function of Drosophila Argonautes and its involvement in RISC formation. Genes Dev 2005.
- Ambros V, Lee RC, Lavanway A, Williams PT, Jewell D: MicroRNAs and other tiny endogenous RNAs in C. elegans. Curr Biol 2003, 13(10):807-818. doi:10.1016/S0960-9822(03)00287-2
- Aravin AA, Lagos-Quintana M, Yalcin A, Zavolan M, Marks D, Snyder B, Gaasterland T, Meyer J, Tuschl T: The small RNA profile during Drosophila melanogaster development. Dev Cell 2003, 5(2):337-350. doi:10.1016/S1534-5807(03)00228-4
- Sontheimer EJ, Carthew RW: Silence from within: endogenous siRNAs and miRNAs. Cell 2005, 122(1):9-12. doi:10.1016/j.cell.2005.06.030
- Gottesman S: The small RNA regulators of Escherichia coli: roles and mechanisms. Annu Rev Microbiol 2004, 58:303-328. doi:10.1146/annurev.micro.58.030603.123841
- Gottesman S: Micros for microbes: non-coding regulatory RNAs in bacteria. Trends Genet 2005, 21(7):399-404. doi:10.1016/j.tig.2005.05.008
- Majdalani N, Vanderpool CK, Gottesman S: Bacterial small RNA regulators. Crit Rev Biochem Mol Biol 2005, 40(2):93-113. doi:10.1080/10409230590918702
- Storz G, Opdyke JA, Zhang A: Controlling mRNA stability and translation with small, noncoding RNAs. Curr Opin Microbiol 2004, 7(2):140-144. doi:10.1016/j.mib.2004.02.015
- Tang TH, Bachellerie JP, Rozhdestvensky T, Bortolin ML, Huber H, Drungowski M, Elge T, Brosius J, Huttenhofer A: Identification of 86 candidates for small non-messenger RNAs from the archaeon Archaeoglobus fulgidus. Proc Natl Acad Sci U S A 2002, 99(11):7536-7541. doi:10.1073/pnas.112047299
- Tang TH, Polacek N, Zywicki M, Huber H, Brugger K, Garrett R, Bachellerie JP, Huttenhofer A: Identification of novel non-coding RNAs as potential antisense regulators in the archaeon Sulfolobus solfataricus. Mol Microbiol 2005, 55(2):469-481. doi:10.1111/j.1365-2958.2004.04428.x
- Soderbom F, Wagner EG: Degradation pathway of CopA, the antisense RNA that controls replication of plasmid R1. Microbiology 1998, 144(Pt 7):1907-1917.
- Gerdes K, Gultyaev AP, Franch T, Pedersen K, Mikkelsen ND: Antisense RNA-regulated programmed cell death. Annu Rev Genet 1997, 31:1-31. doi:10.1146/annurev.genet.31.1.1
- Greenfield TJ, Franch T, Gerdes K, Weaver KE: Antisense RNA regulation of the par post-segregational killing system: structural analysis and mechanism of binding of the antisense RNA, RNAII and its target, RNAI. Mol Microbiol 2001, 42(2):527-537. doi:10.1046/j.1365-2958.2001.02663.x
- Makarova KS, Aravind L, Grishin NV, Rogozin IB, Koonin EV: A DNA repair system specific for thermophilic archaea and bacteria predicted by genomic context analysis. Nucleic Acids Res 2002, 30:482-496. doi:10.1093/nar/30.2.482
- Haft DH, Selengut J, Mongodin EF, Nelson KE: A guild of forty-five CRISPR-associated (Cas) protein families and multiple CRISPR/Cas subtypes exist in prokaryotic genomes. PLoS Comput Biol 2005, in press.
- Jansen R, Embden JD, Gaastra W, Schouls LM: Identification of genes that are associated with DNA repeats in prokaryotes. Mol Microbiol 2002, 43(6):1565-1575. doi:10.1046/j.1365-2958.2002.02839.x
- Mojica FJ, Diez-Villasenor C, Soria E, Juez G: Biological significance of a family of regularly spaced repeats in the genomes of Archaea, Bacteria and mitochondria. Mol Microbiol 2000, 36(1):244-246. doi:10.1046/j.1365-2958.2000.01838.x
- Mojica FJ, Diez-Villasenor C, Garcia-Martinez J, Soria E: Intervening sequences of regularly spaced prokaryotic repeats derive from foreign genetic elements. J Mol Evol 2005, 60(2):174-182. doi:10.1007/s00239-004-0046-3
- Bolotin A, Quinquis B, Sorokin A, Ehrlich SD: Clustered regularly interspaced short palindrome repeats (CRISPRs) have spacers of extrachromosomal origin. Microbiology 2005, 151(Pt 8):2551-2561. doi:10.1099/mic.0.28048-0
- Pourcel C, Salvignol G, Vergnaud G: CRISPR elements in Yersinia pestis acquire new repeats by preferential uptake of bacteriophage DNA, and provide additional tools for evolutionary studies. Microbiology 2005, 151(Pt 3):653-663. doi:10.1099/mic.0.27437-0
- Altschul SF, Koonin EV: PSI-BLAST - a tool for making discoveries in sequence databases. Trends Biochem Sci 1998, 23:444-447. doi:10.1016/S0968-0004(98)01298-5
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389-3402. doi:10.1093/nar/25.17.3389
- Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA: The COG database: an updated version includes eukaryotes. BMC Bioinformatics 2003, 4:41. doi:10.1186/1471-2105-4-41
- Omelchenko MV, Wolf YI, Gaidamakova EK, Matrosova VY, Vasilenko A, Zhai M, Daly MJ, Koonin EV, Makarova KS: Comparative genomics of Thermus thermophilus and Deinococcus radiodurans: divergent routes of adaptation to thermophily and radiation resistance. BMC Evol Biol 2005, 5:57. doi:10.1186/1471-2148-5-57
- Katz ME, Wright CL, Gartside TS, Cheetham BF, Doidge CV, Moses EK, Rood JI: Genetic organization of the duplicated vap region of the Dichelobacter nodosus genome. J Bacteriol 1994, 176(9):2663-2669.
- Clissold PM, Ponting CP: PIN domains in nonsense-mediated mRNA decay and RNAi. Curr Biol 2000, 10(24):R888-90. doi:10.1016/S0960-9822(00)00858-7
- Fatica A, Tollervey D, Dlakic M: PIN domain of Nob1p is required for D-site cleavage in 20S pre-rRNA. RNA 2004, 10(11):1698-1701. doi:10.1261/rna.7123504
- Arcus VL, Rainey PB, Turner SJ: The PIN-domain toxin-antitoxin array in mycobacteria. Trends Microbiol 2005, 13(8):360-365. doi:10.1016/j.tim.2005.06.008
- Anantharaman V, Aravind L: New connections in the prokaryotic toxin-antitoxin network: relationship with the eukaryotic nonsense-mediated RNA decay system. Genome Biol 2003, 4(12):R81. doi:10.1186/gb-2003-4-12-r81
- Dietmann S, Holm L: Identification of homology in protein structure classification. Nat Struct Biol 2001, 8(11):953-957. doi:10.1038/nsb1101-953
- Madej T, Gibrat JF, Bryant SH: Threading a database of protein cores. Proteins 1995, 23(3):356-369. doi:10.1002/prot.340230309
- Edwards RA, Rohwer F: Viral metagenomics. Nat Rev Microbiol 2005, 3(6):504-510. doi:10.1038/nrmicro1163
- Breitbart M, Rohwer F: Here a virus, there a virus, everywhere the same virus? Trends Microbiol 2005, 13(6):278-284. doi:10.1016/j.tim.2005.04.003
- Hammond SM: Dicing and slicing: the core machinery of the RNA interference pathway. FEBS Lett 2005, 579(26):5822-5829. doi:10.1016/j.febslet.2005.08.079
- Scadden AD: The RISC subunit Tudor-SN binds to hyper-edited double-stranded RNA and promotes its cleavage. Nat Struct Mol Biol 2005, 12(6):489-496. doi:10.1038/nsmb936
- Smardon A, Spoerke JM, Stacey SC, Klein ME, Mackin N, Maine EM: EGO-1 is related to RNA-directed RNA polymerase and functions in germ-line development and RNA interference in C. elegans. Curr Biol 2000, 10(4):169-178. doi:10.1016/S0960-9822(00)00323-7
- Lipardi C, Wei Q, Paterson BM: RNAi as random degradative PCR: siRNA primers convert mRNA into dsRNAs that are degraded to generate new siRNAs. Cell 2001, 107(3):297-307. doi:10.1016/S0092-8674(01)00537-2
- Sijen T, Fleenor J, Simmer F, Thijssen KL, Parrish S, Timmons L, Plasterk RH, Fire A: On the role of RNA amplification in dsRNA-triggered gene silencing. Cell 2001, 107(4):465-476. doi:10.1016/S0092-8674(01)00576-1
- Negroni M, Buc H: Retroviral recombination: what drives the switch? Nat Rev Mol Cell Biol 2001, 2(2):151-155. doi:10.1038/35052098
- Huang CC, Hay N, Bishop JM: The role of RNA molecules in transduction of the proto-oncogene c-fps. Cell 1986, 44(6):935-940. doi:10.1016/0092-8674(86)90016-4
- Swain A, Coffin JM: Mechanism of transduction by retroviruses. Science 1992, 255(5046):841-845.
- Raju R, Subramaniam SV, Hajjou M: Genesis of Sindbis virus by in vivo recombination of nonreplicative RNA precursors. J Virol 1995, 69(12):7391-7401.
- Gmyl AP, Korshenko SA, Belousov EV, Khitrina EV, Agol VI: Nonreplicative homologous RNA recombination: promiscuous joining of RNA pieces? RNA 2003, 9(10):1221-1231. doi:10.1261/rna.5111803
- Vazquez F, Vaucheret H, Rajagopalan R, Lepers C, Gasciolli V, Mallory AC, Hilbert JL, Bartel DP, Crete P: Endogenous trans-acting siRNAs regulate the accumulation of Arabidopsis mRNAs. Mol Cell 2004, 16(1):69-79. doi:10.1016/j.molcel.2004.09.028
- Mochizuki K, Gorovsky MA: Conjugation-specific small RNAs in Tetrahymena have predicted properties of scan (scn) RNAs involved in genome rearrangement. Genes Dev 2004, 18(17):2068-2073. doi:10.1101/gad.1219904
- Mochizuki K, Gorovsky MA: A Dicer-like protein in Tetrahymena has distinct functions in genome rearrangement, chromosome segregation, and meiotic prophase. Genes Dev 2005, 19(1):77-89. doi:10.1101/gad.1265105
- Xie Z, Johansen LK, Gustafson AM, Kasschau KD, Lellis AD, Zilberman D, Jacobsen SE, Carrington JC: Genetic and functional diversification of small RNA pathways in plants. PLoS Biol 2004, 2(5):E104. doi:10.1371/journal.pbio.0020104
- Alder MN, Rogozin IB, Iyer LM, Glazko GV, Cooper MD, Pancer Z: Diversity and function of adaptive immune receptors in a jawless vertebrate. Science 2005, 310(5756):1970-1973. doi:10.1126/science.1119420
- Hendrix RW: Bacteriophage genomics. Curr Opin Microbiol 2003, 6(5):506-511. doi:10.1016/j.mib.2003.09.004
- Hendrix RW, Smith MC, Burns RN, Ford ME, Hatfull GF: Evolutionary relationships among diverse bacteriophages and prophages: all the world's a phage. Proc Natl Acad Sci U S A 1999, 96(5):2192-2197. doi:10.1073/pnas.96.5.2192
- Aravind L, Koonin EV: Gleaning non-trivial structural, functional and evolutionary information about proteins by iterative database searches. J Mol Biol 1999, 287(5):1023-1040. doi:10.1006/jmbi.1999.2653
- Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 2004, 32(5):1792-1797. doi:10.1093/nar/gkh340
- Cuff JA, Clamp ME, Siddiqui AS, Finlay M, Barton GJ: JPred: a consensus secondary structure prediction server. Bioinformatics 1998, 14(10):892-893. doi:10.1093/bioinformatics/14.10.892
- Holm L, Sander C: Dali/FSSP classification of three-dimensional protein folds. Nucleic Acids Res 1997, 25(1):231-234. doi:10.1093/nar/25.1.231
- Esnouf RM: Further additions to MolScript version 1.4, including reading and contouring of electron-density maps. Acta Crystallogr D Biol Crystallogr 1999, 55(Pt 4):938-940. doi:10.1107/S0907444998017363
- Felsenstein J: Inferring phylogenies from protein sequences by parsimony, distance, and likelihood methods. Methods Enzymol 1996, 266:418-427.
- Fitch WM, Margoliash E: Construction of phylogenetic trees. Science 1967, 155(760):279-284.
- Adachi J, Hasegawa M: MOLPHY: Programs for Molecular Phylogenetics. Tokyo, Institute of Statistical Mathematics; 1992.
- Hasegawa M, Kishino H, Saitou N: On the maximum likelihood method in molecular phylogenetics. J Mol Evol 1991, 32(5):443-445. doi:10.1007/BF02101285
- Schneider TD, Stormo GD, Gold L, Ehrenfeucht A: Information content of binding sites on nucleotide sequences. J Mol Biol 1986, 188(3):415-431. doi:10.1016/0022-2836(86)90165-8
- McGinnis S, Madden TL: BLAST: at the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Res 2004, 32(Web Server issue):W20-5.
- Zuker M: Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res 2003, 31(13):3406-3415. doi:10.1093/nar/gkg595
- Mathews DH, Sabina J, Zuker M, Turner DH: Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J Mol Biol 1999, 288(5):911-940. doi:10.1006/jmbi.1999.2700
- Bapteste E, Susko E, Leigh J, MacLeod D, Charlebois RL, Doolittle WF: Do orthologous gene phylogenies really support tree-thinking? BMC Evol Biol 2005, 5(1):33. doi:10.1186/1471-2148-5-33
- Wolf YI, Rogozin IB, Grishin NV, Koonin EV: Genome trees and the tree of life. Trends Genet 2002, 18(9):472-479. doi:10.1016/S0168-9525(02)02744-0
- Guy CP, Majernik AI, Chong JP, Bolt EL: A novel nuclease-ATPase (Nar71) from archaea is part of a proposed thermophilic DNA repair system. Nucleic Acids Res 2004, 32(21):6176-6186. doi:10.1093/nar/gkh960
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.