Deep transcriptome-sequencing and proteome analysis of the hydrothermal vent annelid Alvinella pompejana identifies the CvP-bias as a robust measure of eukaryotic thermostability

Background Alvinella pompejana is an annelid worm that inhabits deep-sea hydrothermal vent sites in the Pacific Ocean. Living at a depth of approximately 2500 meters, these worms experience extreme environmental conditions, including high temperature and pressure as well as high levels of sulfide and heavy metals. A. pompejana is one of the most thermotolerant metazoans, making this animal a subject of great interest for studies of eukaryotic thermoadaptation. Results In order to complement existing EST resources we performed deep sequencing of the A. pompejana transcriptome. We identified several thousand novel protein-coding transcripts, nearly doubling the sequence data for this annelid. We then performed an extensive survey of previously established prokaryotic thermoadaptation measures to search for global signals of thermoadaptation in A. pompejana in comparison with mesophilic eukaryotes. In an orthologous set of 457 proteins, we found that the best indicator of thermoadaptation was the difference in frequency of charged versus polar residues (CvP-bias), which was highest in A. pompejana. CvP-bias robustly distinguished prokaryotic thermophiles from prokaryotic mesophiles, as well as the thermophilic fungus Chaetomium thermophilum from mesophilic eukaryotes. Experimental values for thermophilic proteins supported higher CvP-bias as a measure of thermal stability when compared to their mesophilic orthologs. Proteome-wide mean CvP-bias also correlated with the body temperatures of homeothermic birds and mammals. Conclusions Our work extends the transcriptome resources for A. pompejana and identifies the CvP-bias as a robust and widely applicable measure of eukaryotic thermoadaptation. Reviewer This article was reviewed by Sándor Pongor, L. Aravind and Anthony M. Poole.

Background
Alvinella pompejana is one of the most heat tolerant of all animals known to date [1]. This annelid worm inhabits deep-sea hydrothermal vent chimney walls in self-made glycoprotein tubes [2], where it is exposed to extreme environmental conditions (high pressure, high temperature, low pH, anoxia, heavy metals). In situ measurements inside occupied tubes, near the animals' tails, revealed temperatures of approximately 68°C, compared to temperatures of approximately 20°C in the surrounding water [3]. Given the steep temperature gradient inside the tubes and the difficulty of carrying out such in situ measurements, the maximum body temperature A. pompejana can tolerate is unclear [1]. Direct temperature preference and tolerance experiments using a high-pressure aquarium with a thermal gradient have not yet been carried out on adult A. pompejana. However, such experiments have shown that a North-Pacific relative of A. pompejana, Paralvinella sulfincola, prefers temperatures between 40°C and 50°C and tolerates temperatures up to 55°C [4]. The habitat of A. pompejana is similar to that of P. sulfincola, and it is likely that adult A. pompejana have a similar thermal preference. Given its high temperature tolerance, there has been considerable interest in studying the mechanisms of thermoadaptation in A. pompejana, and in establishing sequence-resources for this organism [5].
A thermoadapted metazoan proteome could greatly benefit structural biology research. The advantage of using thermostable proteins for structural studies has been well documented. In general, proteins from thermophiles are more stable and less flexible than their mesophilic counterparts. Consequently, these proteins are more amenable for expression, purification and crystallization experiments, and have often been used to solve the structure of large macromolecular complexes. Well-known examples include the ribosome, whose complete atomic structure was first determined from the thermophilic eubacterium Thermus thermophilus [6], and the exosome, first purified and crystallized from the archaebacterium Sulfolobus solfataricus [7]. Among the eukaryotes, the fungus Chaetomium thermophilum has recently been shown to have a thermoadapted proteome, and this facilitated the structural study of nuclear pore components [8]. Among the metazoans, A. pompejana potentially represents a promising source of thermostable proteins and complexes.
Although A. pompejana was first described more than 30 years ago, biochemical data have only been published for a small subset of its proteins. Melting temperature (Tm) values, the most direct measure of thermal stability, have only been published for cuticular and interstitial collagen [9,10]. These proteins have melting temperatures of 45-46°C, 17°C higher than those of collagens from shallow seawater annelids [9,10]. The activity of A. pompejana and human recombinant DNA polymerase η (Pol η) following high temperature incubation has also been tested. A. pompejana Pol η maintained high activity following incubation at temperatures up to 49°C, whereas human Pol η maintained high activity only until 43°C. At 52°C the activity of Pol η from both species dropped below 20% [11]. A. pompejana superoxide dismutase (SOD) was also shown to have enhanced chemical stability relative to its human counterpart by guanidine denaturation, a measure thought to correlate with thermal stability. In this study the structure of A. pompejana SOD was also solved at high resolution and for the first time in complex with H2O2 [12]. The A. pompejana splicing factor U2AF65 in complex with RNA has also been shown to have a slightly increased thermal stability (using RNA binding as a readout) relative to the human protein (6°C higher) [13].
The aforementioned studies showed increased thermostability of some A. pompejana proteins. In contrast, other studies either did not reveal higher thermal stability for A. pompejana proteins [14], or measured parameters (e.g. optimal temperatures for enzyme activity) that only allowed an indirect assessment of thermal stability. For example, the extracellular giant hemoglobin of A. pompejana showed higher oxygen affinity than that of other annelids, and exhibited other functional properties related to the vent environment [15], but its macromolecular assembly was unstable at 50°C, and it was not more thermostable than earthworm hemoglobin [14].
Given the small number of biochemical studies and the uncertainties about the thermotolerance of the animals, the general degree of thermoadaptation of the A. pompejana proteome is still unclear. Sequence analysis of a large number of proteins could reveal general features of thermoadaptation. There have been many attempts to correlate protein thermal stability with sequence or structure derived features [16]. When comparing sequence composition of thermophiles and mesophiles, the most apparent difference is an enrichment of charged residues in combination with a decrease in the number of polar residues in the thermophilic proteins. Both the (E + K)/(Q + H) ratio and the CvP-bias (difference in the frequency between charged and polar residues) can discriminate hyperthermophiles from mesophiles [17] as well as barophiles [18]. Another study identified a universal set of residues, I,V,Y,W,R,E,L (IVYWREL measure), enriched in thermophiles [19]. More complex discrimination functions have also been proposed, such as the function employed by THERMORANK [20]. This method uses a linear combination of 10 sequence-based features to rank a set of input sequences by relative thermostability. Another method, the Tm-Index tool, uses dipeptide composition to predict a melting point index for a single protein sequence [21]. These measures have not yet been systematically tested on the Alvinella proteome.
Recently, a large A. pompejana cDNA resource, prepared from three different tissues and whole animals, was published [5]. Analysis of this dataset in comparison with various metazoan homologs showed that A. pompejana protein sequences have the highest proportion of charged amino acids. This bias was interpreted as a sign of protein thermostability [5]. Another EST resource, generated by the Joint Genome Institute (JGI, http://www.jgi.doe.gov/), is also publicly available. These A. pompejana EST resources could form the basis for reconstituting and determining structures of metazoan proteins and multiprotein complexes. However, in order for structural biologists to identify all components of large multiprotein complexes, where the lack of a single component will impede reconstitution, more extensive sequence coverage is essential.
To increase the available dataset, we first carried out deep sequencing of the A. pompejana transcriptome. Using this resource and the published sequences, we established a large orthologous dataset with a taxon sampling that included other annelids, as well as the thermostable fungus, Chaetomium thermophilum [8]. To investigate the extent of thermoadaptation of A. pompejana we then performed a systematic survey, on its proteome, of the available sequence-based thermostability measures previously established for prokaryotes. Tests of the THERMORANK, IVYWREL, Tm-Index, (E + K)/(Q + H) and loop-length measures gave conflicting results. Our analyses identified the CvP-bias as the best measure to discriminate A. pompejana from mesophilic eukaryotes. The CvP-bias also discriminated the thermophilic fungus C. thermophilum from mesophilic eukaryotes. The correlation of the CvP-bias with thermostability was also supported by experimentally determined thermostability data.

Deep sequencing of the A. pompejana transcriptome
To gain insights into the mechanisms of thermoadaptation of A. pompejana, we first performed deep sequencing of its transcriptome. We used a combination of the Sanger, Roche/454 and Illumina technologies. With the 454 technology we obtained 2,717,445 reads of an average length of 220 bp. Using Illumina paired-end sequencing (~300 bp fragments) we obtained 87 million reads of an average length of 76 bp. These resources, and an additional 10,063 novel Sanger ESTs we generated, were assembled into a reference transcriptome dataset. Large-scale EST sequencing projects have also been carried out by the Joint Genome Institute (JGI) and Genoscope [5], yielding a total of 218,458 publicly available ESTs (as of Feb 2011). We also assembled these sequences with our resource, creating a combined MPI + JGI + Genoscope dataset.
To estimate the number of novel sequences in our resource (MPI dataset, Additional file 1) we compared it with the already available ESTs (JGI + Genoscope dataset, Additional file 2), as well as the combined dataset (MPI + JGI + Genoscope dataset, Additional file 3). First, we compared the length distribution of contigs and singletons in the three assemblies (Figure 1A). The final combined assembly nearly doubled the available transcriptome resources for A. pompejana, with 60,475 contigs longer than 500 bp, compared to 34,860 in the JGI + Genoscope dataset. Due to the small size of the 454 and Illumina reads, the MPI and the MPI + JGI + Genoscope datasets were dominated by short contigs (Figure 1A). However, they also contained a larger number of long contigs (>1,500 bp) than the JGI + Genoscope dataset (7,269 in MPI and 9,639 in MPI + JGI + Genoscope compared to 3,634 in JGI + Genoscope).

Annotation of predicted A. pompejana proteins
Next we searched the nucleotide datasets for potential open reading frames and analyzed the predicted protein sequences (Additional files 4, 5 and 6). We identified many novel full-length and partial sequences (Figure 1B and Table 1), and also extended the length of already known partial proteins. We then compared the A. pompejana predicted proteins in the MPI + JGI + Genoscope dataset to those already available in the JGI + Genoscope dataset. We found 13,301 sequences (>100 aa) in the MPI + JGI + Genoscope dataset with no identical BLAST hit (fraction identical <90%) to the JGI + Genoscope dataset. We then performed BLASTP searches with these sequences in the predicted proteome of the annelid Capitella teleta, the closest relative of A. pompejana with complete genome information (phylogenetic tree in Additional file 7). We found that 3,897 C. teleta sequences had one or more significant hits (e-value 1e-5) among the novel A. pompejana proteins. These represent newly identified, conserved A. pompejana proteins.
In addition, our data also extended the length of many truncated protein sequences in the JGI + Genoscope dataset. In a BLASTP comparison of the predicted protein datasets, 2,776 query sequences from the MPI + JGI + Genoscope dataset were at least 40 amino acids longer than their corresponding fragments in the JGI + Genoscope dataset.
Next, we annotated the combined predicted A. pompejana proteome using BLASTP (Additional file 8). We defined the sequences based on their first hit in the SwissProt database, and further annotated them with the first hit in the C. teleta, human, Danio rerio and Drosophila melanogaster proteomes. We also identified 775 sequences with no BLASTP hit in 26 eukaryotic genomes (including 18 animals) but significant hits in prokaryotes (in the UniRef90 database). These were annotated as potential contaminants or genes that originated by recent lateral gene transfer. A future A. pompejana genome sequencing project could distinguish between these two possibilities.
In order to estimate the completeness of the A. pompejana datasets, we performed extensive BLASTP searches in the proteomes of 15 animal and 8 additional eukaryotic species. We used C. teleta protein sequences as queries and performed BLASTP searches in the 23 eukaryotic datasets, including the A. pompejana datasets. We then counted the number of C. teleta proteins that had a significant hit (e-value 1e-5) either in all animals but not outside animals, or in all eukaryotes. These two subsets of proteins (general animal and general eukaryotic) are also expected to be present in A. pompejana. We therefore counted how many proteins of these two subsets had a BLAST hit in the A. pompejana datasets (Table 2). We found that the combined A. pompejana resource is about 74-99% complete, depending on the subset of proteins examined. Animal-specific proteins were highly covered, between 74-94%, depending on whether we considered all sequences or only full-length sequences. Proteins present in all eukaryotes had even higher percent coverage (87-99%). This is probably due to the higher expression levels of genes with general eukaryotic cellular functions. When we considered only full-length A. pompejana proteins, the MPI + JGI + Genoscope dataset was 74% complete for animal-specific proteins, compared to the 55% completion of the JGI + Genoscope dataset. These searches show that the A. pompejana resource is about 74-99% complete at the level of eukaryotic paralogous families and protein domains. Overall, our sequencing efforts greatly extended the known A. pompejana transcriptome and proteome.

Generation of an orthologous set of protein sequences
As a prerequisite for a thorough assessment of the thermoadaptation of the A. pompejana proteome we generated a large set of orthologous protein sequences from A. pompejana and nine other eukaryotic species. Importantly, our orthologous set also included three other lophotrochozoan species, the annelids C. teleta and Helobdella robusta and the mollusk Lottia gigantea. To be able to identify general features of thermoadaptation in eukaryotes, we also included the thermotolerant fungus, Chaetomium thermophilum, and a mesophilic yeast, Saccharomyces cerevisiae in the orthologous set.
[Table caption: Number of full-length (with start and stop codon), partial (with stop codon), and total number of predicted protein sequences in the three datasets clustered at 100%, 98% and 90% identity.]
The orthologous set was generated using an all-against-all BLASTP search and the reciprocal best-hit approach, and contained 457 members (Additional file 9). We used this dataset to search for signals of thermoadaptation using a broad range of sequence-based methods that have been proposed in the thermostability literature.
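The reciprocal best-hit criterion can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the input layout (`best_hits`, a dictionary keyed by ordered species pairs) and the function name `orthologous_sets` are assumptions made for illustration, and parsing of the BLASTP output is left out.

```python
from itertools import combinations


def orthologous_sets(best_hits, species):
    """Return one {species: protein} dict per fully consistent best-hit set.

    best_hits[(query_species, target_species)] maps each query protein to
    its best hit in the target species (hypothetical input layout).
    """
    reference, others = species[0], species[1:]
    result = []
    for protein in best_hits.get((reference, others[0]), {}):
        # Seed a candidate set from the best hits of one reference protein.
        members = {reference: protein}
        for other in others:
            hit = best_hits.get((reference, other), {}).get(protein)
            if hit is None:
                break
            members[other] = hit
        else:
            # Keep the set only if every species pair is a reciprocal best hit.
            consistent = all(
                best_hits.get((a, b), {}).get(members[a]) == members[b]
                and best_hits.get((b, a), {}).get(members[b]) == members[a]
                for a, b in combinations(species, 2)
            )
            if consistent:
                result.append(members)
    return result
```

With ten species, `combinations(species, 2)` enumerates 45 species pairs, each of which is checked in both directions.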

Sequence based thermostability ranking
Protein sequence composition of thermophilic prokaryotes differs significantly from mesophilic protein sequences. The general trend of an increased number of charged residues and a decreased number of polar residues was reported several times [16,17,22]. It was previously reported that A. pompejana proteins were enriched in charged amino acids [5]. This was interpreted as a sign for enhanced thermostability. Our set of orthologous sequences only partly supports this observation. Compared to the protein sequences from Drosophila melanogaster, Danio rerio and Homo sapiens, A. pompejana is strongly enriched in Lys, and slightly enriched in Asp, but not in Glu or Arg. However, the enrichment in Lys and Asp is shared with H. robusta and L. gigantea, two other lophotrochozoan species that do not live at high temperatures. These two species also have fewer Ala than A. pompejana, a sequence feature that has previously been proposed to be associated with thermostability [20].
The fungus C. thermophilum is strongly enriched in Ala, Gly, Pro, and Arg. This reflects the high GC-content of this species (see below). However, the enrichment in Arg is compensated by a reduction in Lys content, resulting in no global enrichment in charged residues in C. thermophilum. We conclude that the number of charged residues alone does not distinguish A. pompejana and C. thermophilum from mesophilic species (Figure 2).
We next performed sequence-based thermostability ranking calculations using a variety of other computational methods. All measures were used to rank the whole orthologous set of 457 proteins. We found that THERMORANK [20], "Tm Predictor" [21], the IVYWREL measure [19] and the (E + K)/(Q + H) ratio [23], all failed to rank A. pompejana or C. thermophilum as the most thermostable species (Figure 3A-D). In the THERMORANK analysis, for example, A. pompejana ranks behind H. robusta and L. gigantea, also highlighting the importance of broad taxon-sampling when performing these comparisons. Additionally, the length of the protein sequences in the trimmed multiple alignments (an indication of the length of surface loops) [24] did not identify A. pompejana proteins as thermostable, and ranked the two fungi as the species with the most compact proteins (Figure 3E). The average hydrophobicity versus charged residues, a measure that showed A. pompejana as an outlier [5], clustered A. pompejana with mesophilic L. gigantea (Figure 3F). A reduction in intrinsic protein disorder may also correlate with thermoadaptation. However, when we performed protein disorder predictions [25] on the orthologous set, A. pompejana ranked as average. C. thermophilum ranked highest (Figure 3I), probably due to the high content of Pro, Ala, Gly and Arg, residues that strongly contribute to structural disorder [25].
We found only two thermostability measures that weakly discriminated A. pompejana and C. thermophilum from the other species. Low serine content [26] ranks A. pompejana and C. thermophilum first and second in cumulative rankings, although C. thermophilum receives very similar scores to human (Figure 3G). The CvP-bias [17] was the only measure that ranked A. pompejana and C. thermophilum as the two most thermostable species (Figure 3H).
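Two of the simpler composition-based measures named above, the IVYWREL fraction and the (E + K)/(Q + H) ratio, can be sketched in a few lines of Python. The function names are hypothetical, and returning `None` for a sequence without Gln or His is our own convention rather than part of the published measures.

```python
IVYWREL = set("IVYWREL")


def ivywrel_fraction(sequence):
    """Fraction of residues that belong to the IVYWREL set."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(aa in IVYWREL for aa in seq) / len(seq)


def ek_qh_ratio(sequence):
    """(E + K)/(Q + H) ratio; None when the sequence has no Q or H."""
    seq = sequence.upper()
    numerator = seq.count("E") + seq.count("K")
    denominator = seq.count("Q") + seq.count("H")
    return numerator / denominator if denominator else None
```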
We validated all the measures used on prokaryotic thermophile-mesophile test datasets. All measures robustly discriminated a self-compiled Thermus thermophilus (a thermophilic bacterium) vs. Deinococcus radiodurans (an extremophilic but not thermophilic bacterium) dataset. Other published datasets were also tested, and most measures distinguished thermophiles from mesophiles [20,27,28] (Figure 4).
We also calculated the GC-content for the nucleotide sequences corresponding to the protein sequences in our trimmed orthologous set (Table 3). We found that C. thermophilum had the highest GC-content, A. pompejana had a GC-content below the average of the dataset, and L. gigantea, H. robusta and S. cerevisiae had the lowest GC-content values. We plotted the ratio of GC-rich codons (GARP residues) against the ratio of AT-rich codons (FYMINK residues; Figure 5A) [29] across the orthologous set and found a very strong correlation, indicating that GC-content strongly influences amino acid composition. We also tested how individual amino acids are influenced by GC-content and found that the frequency of GARP residues as well as of Trp, Ile and Lys correlates with GC-content (Additional file 10). The thermostability measures we applied, however, do not correlate with GC-content (Figure 5B-F), indicating that the observed ranking is not influenced by GC-content.
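As an illustration of the GARP versus FYMINK comparison, the following Python sketch computes the fraction of residues in each class for one protein sequence. The function name is hypothetical, and how the per-sequence values were aggregated per species for Figure 5A is not specified here.

```python
GARP = set("GARP")      # residues encoded by GC-rich codons
FYMINK = set("FYMINK")  # residues encoded by AT-rich codons


def garp_fymink_fractions(sequence):
    """Return (GARP fraction, FYMINK fraction) for one protein sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    garp = sum(aa in GARP for aa in seq)
    fymink = sum(aa in FYMINK for aa in seq)
    return garp / len(seq), fymink / len(seq)
```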

Close correlation of the CvP value with experimentally determined thermal stabilities
Out of all the sequence-based thermostability measures used, only the CvP-bias ranked both A. pompejana and C. thermophilum highest. To test how well this measure correlates with thermal stability, we calculated the CvP-bias for proteins with experimentally determined protein stabilities (Table 4). Despite limited sample size, the CvP-bias values correlated well with experimentally determined stabilities for A. pompejana Pol η [11], collagen [9,10] and U2AF65 [13], and for C. thermophilum Nup170, Nup192 [8], and xylanase [30], when compared to mesophilic proteins.
To further test the reliability of the CvP-bias measure, we searched for A. pompejana proteins with the lowest CvP-bias ranking among metazoans. We identified 26 proteins where A. pompejana ranked lowest, including the exosome component Rrp4. We expressed and purified recombinant Rrp4 proteins from A. pompejana, human, and yeast and performed thermal denaturation experiments. As predicted, we found that human Rrp4 was more thermoresistant than the A. pompejana protein (Figure 6A and Table 4). Overall, the difference in the thermal stability values between the thermophilic and mesophilic orthologous pairs and the difference in the CvP-bias showed a consistent trend. The proteins that showed higher thermal stability than their ortholog consistently showed a higher CvP-bias (Figure 6B and Table 4). Three other measures (Tm-Index, IVYWREL, and (E + K)/(Q + H)) did not show such a trend. These results suggest that the CvP-bias, in comparison to mesophilic orthologs, is a good predictor of the thermal stability of eukaryotic proteins.

Close correlation of the CvP value with the body temperatures of homeothermic vertebrates
[Caption: Validation of CvP-bias with experimentally determined thermal stabilities for orthologous protein pairs from refs. [8,30].]
The thermal stability experiments performed to date indicate that A. pompejana proteins have approximately 4-8°C higher thermal denaturation values than their mesophilic counterparts. Nevertheless the CvP-bias was able to discriminate A. pompejana from mesophilic animals, indicating that it is a sensitive measure of thermoadaptation. Different homeothermic vertebrates can also show up to 5°C difference in body temperature, with dolphins and whales having a body temperature of 36°C and birds 41°C. To test whether the CvP-bias also correlated with body temperatures among the homeothermic vertebrates we analysed the mean CvP-bias of the proteomes of 10 mammalian and bird species. We found a significant correlation between body temperature of a species and the mean CvP-bias of its proteins (Pearson's r = 0.71, p = 0.02). In contrast, the Tm-Index, IVYWREL, and intrinsic disorder did not show significant correlation (Table 5).
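For readers who wish to repeat this kind of test on their own data, a minimal Python sketch of Pearson's correlation coefficient and the associated t statistic follows. The function names are hypothetical, and the per-species body temperatures and mean CvP-bias values are not reproduced here; they would have to be supplied by the user.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long value lists."""
    if len(xs) != len(ys) or len(xs) < 3:
        raise ValueError("need two lists of equal length with >= 3 values")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))
```

For r = 0.71 and n = 10 species, `t_statistic` gives about 2.85 on 8 degrees of freedom, which is in line with the reported p = 0.02 for a two-sided test.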

Discussion
Taking advantage of an extended sequence dataset, we searched for global signals of thermoadaptation in the proteome of the hydrothermal vent annelid A. pompejana. We used a broad phylogenetic framework, comparing orthologous eukaryotic proteins across vast evolutionary distances. For a detailed understanding of thermoadaptation, one would ideally focus on several closely related thermophilic and mesophilic species. However, there are no similar large-scale sequence resources available from other alvinellids, and such analyses are also confounded by the uncertain evolutionary history of thermoadaptation within the group. A recent analysis compared A. pompejana to a closely related mesophilic species, Paralvinella grasslei [29]. This study revealed a higher proportion of Ala residues in A. pompejana than in P. grasslei. Given the possibility that P. grasslei may only recently have become a mesophile, evolving from a thermophilic common ancestor with A. pompejana, the direction of change and the role of Ala in thermoadaptation in alvinellids are unclear. Only an increased taxon sampling and further analyses within and outside the alvinellids will clarify this. In our orthologous set the proportion of Ala residues in A. pompejana is not higher than in most metazoans. The thermophilic C. thermophilum is highly enriched in Ala, but this is probably due to the high GC-content of this species.
Our approach compared A. pompejana to a broad selection of taxa, including other annelids, as well as very distantly related eukaryotes (fungi). This global comparison indicated that the CvP-bias may be a robust measure of thermostability across eukaryotes. CvP-bias ranked both A. pompejana and the thermophilic fungus C. thermophilum with the highest score among 10 eukaryotic species. Importantly, these ranking results cannot be explained by differences in GC-content, given that GC-content and CvP-bias are not correlated.
We also tested the CvP-bias against the limited biochemical data available, and found a good correlation between thermal stability and CvP-bias, when comparing orthologous protein pairs. For a thorough validation or for the development of other, more sensitive measures, more biochemical and genomic data will be needed.
One surprising finding was that the CvP-bias could also predict whether an A. pompejana protein would be less thermostable than its human ortholog, as shown for Rrp4. This suggests that not all A. pompejana proteins are thermoadapted. The thermoadaptation of a subset of proteins with certain functions (e.g. collagen, a major component of the cuticle), together with the up-regulation of heat-shock proteins [5], may be sufficient to enable A. pompejana to cope with higher temperatures.

Conclusions
Our deep-sequencing efforts greatly enhance the existing transcriptome data for A. pompejana, nearly doubling the number of full-length cDNAs. This extended resource will be valuable for further comparative genomic studies of metazoans and extremophiles. The correlation of the CvP-bias with thermal stability may be used to identify the most thermoadapted proteins from A. pompejana and other thermophilic eukaryotes, potentially facilitating protein structure determination studies.

Alvinella pompejana samples
Samples were collected from the hydrothermal vent chimney sites Julie and Parigo at 13°N/EPR (East Pacific Rise) during the cruises HOT 1996 and PHARE 2002 in the Pacific Ocean using the telemanipulated arms of the submersible Nautile and the ROV Victor. The samples were brought back to the surface in an insulated basket, snap-frozen in liquid nitrogen and stored at −80°C until RNA extraction.

A. pompejana transcriptome sequencing, assembly and analysis
For the sequencing of the A. pompejana transcriptome we used a combination of techniques. We generated a custom, normalized, full-length cDNA library (with m7Gppp affinity purification to limit bacterial RNA contamination; Invitrogen) from two adult worms, cloned into the pENTR222.1 vector. After plating, we sequenced 10,063 randomly picked clones using the Sanger technology (ABI 3730) and the M13-FP primer. The programs Phred and Cross-match were used for base calling and vector trimming. We also performed 454 sequencing (GS FLX, Roche/454) on the PCR-amplified cDNA library, following concatenation and fragmentation. After adaptor trimming (pDONR), quality (0.05), and length filtering (50 bp cutoff) with the software package CLC Genomics Workbench 4.5.1, we obtained 2,717,445 reads of an average length of 220 bp.
For Illumina sequencing we used total RNA isolated from a third animal, a frozen A. pompejana male, using the RNeasy Kit (Qiagen), following the breaking up of the tissue (pieces of 20-30 mg). We performed paired-end sequencing following reverse transcription of total RNA using the Smart cDNA Construction Kit (Clontech), m7 cap-primed second strand synthesis, co-ligation and nebulization of cDNA, gel fractionation (approximately 350 bp) and adapter ligation. Samples were run on a HiSeq 2000 sequencer, obtaining 91,670,518 paired reads. After adaptor trimming (pDONR222, Illumina PCR Primer, Illumina Paired End PCR Primer 2.0), quality (0.05) and length (30 bp cutoff) filtering, using the CLC Genomics Workbench 4.5.1, we obtained 87,799,426 high-quality reads of an average length of 76 bp.
All quality trimmed and vector screened 454 and Illumina reads were assembled using Velvet version 1.1 and Oases [31]. The k-mer length used was 25. The resulting contigs and singletons were joined with all published A. pompejana ESTs from NCBI dbEST, and with our EST sequences, and passed to the CAP3 assembler with default parameters [32]. To assess the amount of newly obtained transcripts from our own sequencing effort, our sequencing data and the JGI + Genoscope ESTs (218,458) were also assembled separately with CAP3. Contigs and singletons were joined for each assembly to compose the final set of transcript sequences.
Transcript sequences were translated with ESTScan [33] and trimmed to the longest stop codon-free protein fragment. If a 5' stop-codon was present, the sequence was scanned to the next methionine and that was considered as the protein start. Fragments shorter than 60 amino acids were discarded. The sequence assemblies can be queried by BLAST at http://jekely-lab.tuebingen.mpg.de/blast/.
Amino acid composition of protein sequences and measures derived from amino acid composition were calculated with custom scripts in Python. The CvP-bias was calculated as (C − P)/length × 100, where C and P represent the number of charged (EDKR) and polar (STNQ) residues, respectively, and length is the number of residues in the sequence.
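The trimming rule described above can be sketched in Python as follows. This is an illustrative sketch, not the script used in this study. It assumes that the translation is available as a string with stop codons written as `*`, and that fragments preceded by a stop codon are trimmed to their first methionine before the longest fragment is chosen; the function name and the order of these two steps are assumptions.

```python
MIN_LENGTH = 60  # fragments shorter than this are discarded


def longest_fragment(translation, min_length=MIN_LENGTH):
    """Return the longest stop-free fragment, or None if it is too short."""
    segments = translation.upper().split("*")
    candidates = []
    for index, segment in enumerate(segments):
        if index > 0:
            # A stop codon lies upstream: start at the next methionine.
            start = segment.find("M")
            segment = segment[start:] if start != -1 else ""
        candidates.append(segment)
    best = max(candidates, key=len)
    return best if len(best) >= min_length else None
```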
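A minimal Python sketch of the CvP-bias calculation, following the definition given above, is shown here. The function names are hypothetical and the original custom scripts may differ in detail, for example in how ambiguous residues are handled.

```python
CHARGED = set("DEKR")  # charged residues (C)
POLAR = set("NQST")    # polar residues (P)


def cvp_bias(sequence):
    """CvP-bias of one protein sequence: (C - P) / length * 100."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    charged = sum(aa in CHARGED for aa in seq)
    polar = sum(aa in POLAR for aa in seq)
    return (charged - polar) / len(seq) * 100


def mean_cvp_bias(sequences):
    """Mean CvP-bias over a collection of sequences, e.g. a proteome."""
    values = [cvp_bias(seq) for seq in sequences]
    return sum(values) / len(values)
```

For example, a sequence with four charged residues, no polar residues and a length of eight has a CvP-bias of 50.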

Orthologous sets
We downloaded the Swissprot and TrEMBL sequences for Homo sapiens, Drosophila melanogaster, Caenorhabditis elegans, Danio rerio, and Saccharomyces cerevisiae from UniprotKB. Protein sequences for Capitella teleta, Helobdella robusta and Lottia gigantea were downloaded from JGI (http://www.jgi.doe.gov/, Filtered Models). The predicted Chaetomium thermophilum proteins were downloaded from http://ct.bork.embl.de. Each proteome was clustered by 98% sequence identity and arbitrary length difference, using CD-HIT [34]. A total of 45 pairwise BLASTP searches were performed, and best hits that were consistent in all searches were considered as an orthologous set, yielding 457 sets. For each orthologous set, a multiple sequence alignment was created using MUSCLE [35]. Because many of the protein sequences were not full length, the alignments were trimmed from both ends to the first column without a gap symbol, so that each protein is represented with the same fragment. The sequences from the trimmed alignments were used for all further analyses (Additional file 9). The tree in Additional file 7, based on the recently published phylogeny of annelids [36] and the consensus animal phylogeny [37], represents the relationships of the species in the orthologous set.
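The end-trimming of the alignments can be sketched in Python as follows. The sketch assumes that the aligned sequences of one orthologous set are given as equal-length strings with `-` as the gap symbol; the function name is hypothetical, and internal gaps are left untouched.

```python
def trim_alignment(aligned, gap="-"):
    """Trim an alignment from both ends to the first gap-free column."""
    if not aligned:
        raise ValueError("empty alignment")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    # Columns in which no sequence carries a gap symbol.
    gap_free = [
        col for col in range(length)
        if all(seq[col] != gap for seq in aligned)
    ]
    if not gap_free:
        return ["" for _ in aligned]
    start, end = gap_free[0], gap_free[-1] + 1
    return [seq[start:end] for seq in aligned]
```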

Thermorank analysis
For local usage, the THERMORANK tool was reimplemented according to ref. [20], using the Python programming language. Tripeptide residue accessible surface area values were used as described in ref. [38]. The 8 protein sequences in each orthologous set were ranked and the cumulative rank sums over the 457 sets calculated to assess the overall trend of ranking. Ranking calculations for all measures were performed using the R software environment (http://www.r-project.org/).
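The cumulative rank-sum calculation can be illustrated with a short Python sketch. The ranking in this study was performed in R; the version below is only an equivalent illustration under our own assumptions, namely that rank 1 is assigned to the highest score and that tied scores share their mean rank. The function names are hypothetical.

```python
from collections import defaultdict


def rank_scores(scores):
    """Rank the species of one orthologous set (rank 1 = highest score)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend the block of species that share the same score.
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[ordered[k]] = mean_rank
        i = j + 1
    return ranks


def cumulative_rank_sums(per_set_scores):
    """Sum the per-set ranks of every species over all orthologous sets."""
    totals = defaultdict(float)
    for scores in per_set_scores:
        for name, rank in rank_scores(scores).items():
            totals[name] += rank
    return dict(totals)
```

Under these conventions, the species with the lowest cumulative rank sum is the one most often ranked as most thermostable by the measure in question.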

Cloning, expression and purification of A. pompejana proteins
A. pompejana Rrp4 was cloned and expressed as a His-tagged fusion together with Rrp4 orthologues from Schizosaccharomyces pombe, Homo sapiens and Saccharomyces cerevisiae. DNA was transformed into BL21Gold pLysS cells (Stratagene); cultures were grown overnight at 18°C and induced with 0.5 mM IPTG. Cells were resuspended and lysed by sonication in a buffer containing Tris pH 7.5, 500 mM NaCl, 20 mM imidazole, 5 mM beta-mercaptoethanol and 10% glycerol. All proteins were purified by affinity chromatography on Talon resin (Clontech; elution with 250 mM imidazole), followed by size exclusion chromatography on a GF200 column.

Thermal shift assay
Solutions containing 5 μl of 2 mg/ml protein with 35x of Sypro Orange (Invitrogen) and 45 μl of buffer screen were added to the wells of a 96-well PCR plate (Eppendorf). The plate was sealed and heated in a real-time PCR system (Eppendorf) from 20°C to 80°C in increments of 0.2°C. Fluorescence changes were monitored simultaneously. The wavelengths for excitation and emission were 470 and 550 nm, respectively. To obtain the temperature midpoint for the protein unfolding transition (Tm), a Boltzmann model was used to fit the fluorescence data [39].
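
A Boltzmann fit of a melting curve can be sketched as below. This is not the procedure of ref. [39]: it assumes the common four-parameter form F(T) = F_min + (F_max - F_min) / (1 + exp((Tm - T)/a)), and it replaces a nonlinear optimiser with a simple grid search over Tm and the slope a, solving F_min and F_max by linear least squares. All names and the slope grid are hypothetical choices.

```python
import math


def _sigmoid(t, tm, slope):
    # Clamp the exponent so math.exp cannot overflow on wide temperature ranges.
    x = max(-700.0, min(700.0, (tm - t) / slope))
    return 1.0 / (1.0 + math.exp(x))


def boltzmann(t, f_min, f_max, tm, slope):
    """Assumed Boltzmann sigmoid for fluorescence as a function of temperature."""
    return f_min + (f_max - f_min) * _sigmoid(t, tm, slope)


def fit_tm(temps, fluor, slopes=(0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0), step=0.2):
    """Grid search over Tm and slope; baselines by closed-form linear regression."""
    n = len(temps)
    if n != len(fluor) or n < 4:
        raise ValueError("need at least four paired temperature/fluorescence points")
    mean_f = sum(fluor) / n
    t_lo, t_hi = min(temps), max(temps)
    n_steps = int(round((t_hi - t_lo) / step))
    best = None
    for i in range(n_steps + 1):
        tm = t_lo + i * step
        for slope in slopes:
            s = [_sigmoid(t, tm, slope) for t in temps]
            mean_s = sum(s) / n
            var_s = sum((v - mean_s) ** 2 for v in s)
            if var_s == 0.0:
                continue
            cov = sum((v - mean_s) * (f - mean_f) for v, f in zip(s, fluor))
            b = cov / var_s           # equals f_max - f_min
            a = mean_f - b * mean_s   # equals f_min
            sse = sum((a + b * v - f) ** 2 for v, f in zip(s, fluor))
            if best is None or sse < best["sse"]:
                best = {"tm": tm, "slope": slope, "f_min": a, "f_max": a + b, "sse": sse}
    return best


if __name__ == "__main__":
    # Synthetic curve over the assay range (20 to 80 degrees C in 0.2 degree steps).
    temps = [20 + 0.2 * i for i in range(301)]
    fluor = [boltzmann(t, 100.0, 1000.0, 55.0, 2.0) for t in temps]
    print(fit_tm(temps, fluor))
```

With real data the dye signal usually decays after the unfolding transition, so the fit would normally be restricted to the region around the transition. In practice a dedicated nonlinear least-squares routine is preferable to this grid search.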

Reviewers' comments
Reviewer's reports
Reviewer 1: Sándor Pongor, International Centre for Genetic Engineering and Biotechnology, Trieste, Italy
The authors present transcriptome sequencing studies on the Alvinella pompejana worm that lives in high temperature environments. A. pompejana is an attractive eukaryotic model organism both for studying thermoadaptation and also because thermophilic proteins hold promise for structural studies. The paper presents a significant advance in the sequencing of this organism; a large number of potential ORFs were discovered, which warrants publication in itself. The manuscript also presents a variety of data coming from and evaluated by different techniques. The authors also conducted biophysical tests that showed that not all A. pompejana proteins are thermotolerant, indicating that the thermotolerance of this species may not be as outstanding as previously thought.
I recommend the following points to the attention of the authors: At first sight, the title and abstract do not make it clear whether the CvP index is defined here or is already known. I suggest clarifying this, for instance by adding a restrictive adjective, something like "…identifies the CvP-bias as a (reliable, robust) measure of eukaryotic thermostability". CvP-bias was described in several articles including PMID: 16494505. Since CvP-bias is part of the main message, these papers could be cited in the introduction.
We have now changed the title to "…identifies the CvP-bias as a robust measure…". We also changed one sentence in the abstract to clearly state that we looked at already known measures: "We then performed an extensive survey of previously established prokaryotic thermoadaptation measures". We also included a reference to PMID: 16494505 (ref. 18).
Even though the technical parts are well separated from the main text of the paper, at times the manuscript is still difficult to read, simply because many techniques are used in the project. This could be helped by emphasizing the main messages and clarifying the details. For instance, ROC curves presented in Figure 4 could be complemented by a tabular comparison of AUC values. The panels of Figure 4 are too tiny and crowded in my opinion. Also, more details in the Figure legends would improve readability.
We have included the AUC values in Figure 4 and give more details in the figure legends.
Reviewer 2: L. Aravind, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, USA
Since its discovery, A. pompejana has been of great interest in regard to the question of how a metazoan might tolerate environmental extremes such as those it faces. Holder et al. use deep transcriptome sequencing combined with bioinformatics analysis to attempt to explain the unusual thermotolerance of Alvinella pompejana. A key point made by this study is that the earlier analysis of Alvinella sequences might not have necessarily identified the actual basis for thermostability of proteins in this organism. In particular, the reported features are shared with mesophilic lophotrochozoans, suggesting that they might not be genuine discriminants of thermophily. In this regard, the extended proteome generated from sequence data obtained by the authors is a useful resource. Further, the study shows that the metrics for thermophily that were found to be successful in discriminating prokaryotic thermophily cannot be uncritically applied to eukaryotes.
The authors might want to consider a few points: While they used length of the trimmed sequence in the alignment as a possible discriminant, it might be better to directly measure intrinsic disorder or sequence entropy and use them as potential discriminants. This is of interest because in general eukaryotes have a much greater amount of low complexity sequence in their proteins than prokaryotes (especially low complexity sequence enriched in charged or polar residues or both). While other factors affect the amount of low complexity in eukaryotes, does thermophily have a negative effect on it?
We have performed calculations of intrinsic protein disorder and included a new panel in Figure 3 showing the results. Using this measure (IUPred), Chaetomium ranks the highest. This is likely due to the high GC content of this species. Alvinella ranks average, indicating that intrinsic disorder is not a general discriminator of thermophilic and mesophilic eukaryotes.
Most homeothermic metazoans tend to maintain higher body temperatures than the rest. In their analysis of CvP, Homo, a homeothermic metazoan, is ranked next after the thermophilic species. However, a species with a much lower preferred temperature is also close. Is there any significance to this? Would it be possible to use some bird species in the comparison, as they have much higher body temperatures (Gallus/Zebra Finch ~41°C)? Would they show a higher rank in the CvP measure than other species in the current figure?
We thank the reviewer for this insightful comment. Indeed, among the homeothermic metazoans temperature differences can be quite large (36 to 41°C), and this could be reflected in their proteomes. Although a full analysis in homeothermic metazoans is beyond the scope of this paper, we determined the average CvP values of the proteomes of several homeothermic vertebrates and found a strong correlation between body temperature and the CvP values, as shown in Table 5. This question can be addressed in more detail in the future when more bird genomes and genomes from vertebrates with low body temperature (e.g. dolphins, whales) become available.
While the data is currently very limited, are there any possible explanations for the CvP being more successful as a discriminant than other measures in eukaryotes?
One possible explanation is that some of the other measures are over-fitted to a training dataset (e.g. the Tm-Index uses dipeptide composition). The CvP-bias is a simple measure with few parameters combining information from two classes of amino acids, the changes of which have previously been linked to thermoadaptation. CvP-bias also performs best on the prokaryotic datasets. A larger number of Alvinella and Chaetomium protein structures and their comparison to mesophilic orthologues will help to clarify the role of charged and polar residues in thermoadaptation.
the project and wrote the paper, FB designed and supervised the project and wrote the paper. All authors read and approved the final manuscript.