RNA-seq library construction
RNA-seq library construction methods vary among sequencing instruments. ABI SOLiD, Illumina HiSeq, and Roche 454 are three major RNA-seq sequencers. They have their own library preparation methods, and other companies also developed library construction methods for the sequencers. Furthermore, customized methods have been designed based on researchers’ interests, for example, RNA-editing or poly(A) tail studies [9, 10]. In previous studies, random primers or oligo(dT) primers were used for reverse transcription of whole transcriptome library preparation [5, 10, 11].
In this study, we compared different whole transcriptome libraries to each other as well as to a library specific to the human GABAB1 gene. There was substantial difference between the two library preparations. For the gene specific library, complementary DNA (cDNA) was fragmented after gene specific reverse transcription and amplification, and adaptors were ligated to the fragmented cDNA (Figure 1). For the gene specific reverse transcription and amplification, the gene specific primers were designed for the 3′ untranslated region (UTR) since most human GABAB1 transcripts share a common 3’ UTR [12]. To prepare the whole transcriptome library, RNA was fragmented first, and adaptors were ligated to the end of the fragmented RNA. The RNA was reverse transcribed based on the random sequence overhangs of the adaptors. Though this approach generated cDNAs based on random sequences, it can be considered targeted priming because the reverse transcription started from the end of the fragmented RNA. The ends of the cDNA always reflected the 5′ and 3′ end sequences of the fragmented RNA. For our study, the gene specific and whole transcriptome libraries were prepared for the SOLiD system and sequenced.
Sequence mapping and visualization
Raw sequence reads of the gene specific and whole transcriptome libraries were filtered by sequence quality values. Non-coding RNAs and adaptor sequences were also removed. The non-coding RNAs were primarily ribosomal RNAs (rRNAs) and transfer RNAs (tRNAs). After filtering, the sequence data of the gene specific and whole transcriptome libraries were mapped against human reference genome, hg18, and visualized at the GABAB1 gene (Figure 2).
For each step, the remaining read numbers were calculated (Additional file 1: Table S1). Overall mappability of these reads was 29.2%. Redundancy (redundant mapped reads / mapped reads) was also calculated for each library. The number of mapped reads and the redundancy have a strong positive correlation within the same libraries. Because more mapped reads can have more chances to be identical to other reads, the libraries that generated more mapped reads showed higher redundancies. Different library construction methods also resulted in different redundancies even though they had the same amplification cycles and generated similar mapped reads.
Figure 2a shows the mapped reads that resulted from the gene specific library for the GABAB1 gene. Figure 2b illustrates the mapped reads resulting from the whole transcriptome library for the same chromosome location. These reads were 50 mer single-end reads. We found the mapping patterns between these two libraries to be dramatically different. Even though the chromosome region was a known exon and should be covered with continuous reads, the mapped reads of the whole transcriptome library showed gaps and pile-ups. (We use the term pile-up to refer to a substantial number of reads having exactly the same sequence, as seen in Figure 2b). This uneven coverage of exons with mapped reads demonstrates that the whole transcriptome library has substantial bias. The pile-ups and gaps of the whole transcriptome library were ubiquitously found throughout the genome, including for endogenous housekeeping genes, such as ACTB, PPIA, GAPDH, and PGK1 (Additional file 2: Figure S1). Therefore, we concluded that the pile-up and gap pattern was not a mapping artifact of a specific gene.
Comparison of the two sequencing data sets
The pile-ups shown in Figure 2b might be the result of sequence bias at the 5' end of the read. Fragment end sequence bias can generate read duplication under SOLiD sequencing and cause the pile-ups. To examine this possibility, we collected all mapped reads and calculated sequence logos (using WebLogo) and entropy at each nucleotide position (Figure 3). The height of letters in the sequence logo is calculated based on entropy at that site. Thus, entropy profiles and sequence logo profiles should be similar, except that entropy profiles were calculated on all reads and sequence logos were calculated on a random subset (see Methods). We found that the 5′ ends of reads from the whole transcriptome library have a specific sequence pattern. The sequence logo results showed a predominant pattern of AA at the 5′ end of the whole transcriptome library reads (Figure 3a). In contrast, reads from the gene specific library did not display this sequence pattern. For controls of each library, we generated random exome sequences in silico based on mapped read locations. These controls showed almost no bias whatsoever (Figure 3b). The strong bias in the whole transcriptome library was reflected in reduced entropy at the 5′ end of reads (Figure 3c). In contrast, entropy for the gene specific library showed only a minor fluctuation without the 5′ end drop.
Comparison of the two library preparation methods
The gene specific library and the whole transcriptome library were constructed very differently (Figure 4). To identify the source of the bias, we carefully reviewed all steps at which library construction differed. Most importantly, the bias could have arisen from fragmentation or its following steps.
For the gene specific library, double stranded complementary DNA (dscDNA) fragmentation was performed with sonication, and adaptors were ligated to the fragmented dscDNA using DNA ligase (Figure 4a). Sonication and DNA ligase possibly introduced the extremely minor noise (Figure 3a and c). However, these two steps are not known to cause sequence bias, and reads in our experiment were not sufficiently biased to introduce read pile-ups and mapping gaps (Figure 2a).
By contrast, the whole transcriptome library construction method used RNaseIII fragmentation at the very beginning of the protocol (Figure 4b). Perhaps RNaseIII fragmentation causes a bias that has not been previously evaluated. However, the three steps following RNaseIII fragmentation (adaptor ligation, reverse transcription, and amplification) were already known to have sequence biases [4–6]. Therefore, we first inspected these three steps for potential problems.
We used RNA ligase for adaptor ligation. RNA ligase prefers to bind to certain phosphate donors, but preference patterns have been studied only for short oligomers [4]. In our case, phosphate donors were the fragmented single stranded RNAs (ssRNAs) of 200 base pairs (bp) in length. Figure 3a did not show the same sequence pattern as the short oligomer case [4]. Therefore, its sequence bias either appeared in a different way at the fragmented ssRNAs or was too small to be detected from whole transcriptome library data. We concluded that the sequence bias we saw was likely not caused by RNA ligase.
The bias for reverse transcription has been previously studied for the Illumina RNA-seq system [5]. The randomly primed reverse transcription of Illumina RNA-seq causes substantial sequence biases at the 5' and 3' ends. However, our reverse transcription method was different from the Illumina method, and their respective biases are likely different. While the random primers of Illumina RNA-seq can hybridize on any locations of mRNAs, the primers in our method are the overhangs of adaptors and can bind only to ends of fragmented ssRNAs. The overhangs are 4 bp random sequences, and they were used for reverse transcription. Because of this targeted priming, the sequence bias of the whole transcriptome library was entirely different from the Illumina random primer case (Figure 3a) [5]. Therefore, we concluded that it was unlikely that reverse transcription was a major source of bias in our library preparation.
Finally, amplification is known to have GC bias [6]. This bias could affect some transcript regions that might appear more or less highly expressed, but this could not generate 5' end sequence bias.
Therefore, we excluded the three steps adaptor ligation, reverse transcription, and amplification as major sources of 5′ end sequence bias of the whole transcriptome library. We next focused our efforts on RNaseIII. RNaseIII can digest ssRNAs of preribosomal RNA and bacteriophage T4 [13, 14]. However, we removed ribosomal RNAs and used only human RNA samples. Even though we used only about 200 bp fragmented RNAs, their ssRNA digestion product length is different. During eukaryotic double stranded RNA (dsRNA) metabolism RNaseIII digests RNAs into 200 bp, and it is generally considered to be a random cutter. It recognizes double stranded RNA structures and cleaves them [14]. The RNA cleavage usually produces both short and long fragments. In yeast, the short fragments are about 28–32 bp in length and have unique sequence (AGNN) in the middle as containing hairpin structures [15]. In our system, the long fragments were around 200 bp and used for RNA-seq. The short fragments were too short for RNA-seq.
Because RNaseIII specifically recognizes RNA secondary structure and leaves a unique sequence in the short fragments, it is not likely to be a perfectly random cutter. Therefore, the long fragments could have an RNaseIII specific sequence pattern. On the basis of this reason, we replaced RNaseIII fragmentation by an alternative method using heat.
Alternative library constructions
We fragmented RNA at 95°C when it was denatured. In denatured RNA the hydrogen bonds are broken and only covalent bonds remain. Denatured RNA has no secondary structure. Therefore, each base can be attacked equally by heat, and we do not expect that heat fragmentation can introduce 5′ end bias.
However, heat fragmented RNA could not be used directly for adaptor ligation. The 5' and 3' ends of the heat fragmented RNA needed to be modified. We used T4PNK for the modification (Figure 5b). Its kinase activity adds phosphate groups at 5′ fragment ends. Its phosphatase activity removes phosphates from at 3′ fragment ends and leaves hydroxyl groups.
T4PNK is known to have sequence bias in its kinase activity, according to the OptiKinase product information [16]. Therefore, we also tested an alternative preparation method where we applied OptiKinase before T4PNK treatment (Figure 5c). OptiKinase has been shown not to have the sequence bias of T4PNK, but it does not have phosphatase activity at the 3′ end. Here, we assumed that T4PNK has negligible bias in its phosphatase activity and used it following OptiKinase treatment for its phosphatase activity. (Our results confirmed that this assumption is valid, as shown below.) Note that for kinase activity, T4PNK treatment required the optimization for enzyme and substrate titration and treatment time [16]. Therefore, OptiKinase was clearly a better choice to avoid technical variation in RNA-seq studies.
In Figure 5, we compared the alternative whole transcriptome library construction methods with the standard whole transcriptome library construction. The standard method fragmented RNA using RNaseIII and ligated adaptors to the RNA (Figure 5a). For heat fragmentation method, we switched RNaseIII fragment to heat fragmentation and did adaptor ligation after T4PNK treatment (Figure 5b). Modified heat fragmentation method reduced known T4PNK bias of the heat fragmentation method as treating OptiKinase between fragmentation and T4PNK steps (Figure 5c).
To assess the sequence bias of these two newly proposed fragmentation methods, we prepared two new whole transcriptome libraries, together with a second iteration of the whole transcriptome library prepared according to the standard protocol based on RNaseIII (Figure 5a). To reduce batch-to-batch sequencing variation, we sequenced all three libraries simultaneously in the same SOLiD system. Their mapping and further data analyses were processed identically.
Mapping results of novel libraries
We processed paired-end reads instead of previous single-end reads and visualized both 50 mer 5′ and 35 mer 3′ read mapping results (Figure 6). In the following, “Ctl” refers to the library prepared with RNaseIII, “Heat” refers to the library generated with heat fragmentation and T4PNK, and “Heat + OptiK” refers to the library that was generated with an additional OptiKinase step right after the heat fragmentation.
As described before for Figure 2 and 3, fragment end sequence bias generated the pile-up mapping pattern. Figure 6 compares the mapping patterns of Ctl, Heat, and Heat + OptiK for an endogenous control, PGK1 gene (Figure 6a). Ctl showed clear pile-ups for 5' read mapping results as expected. By contrast, Heat had much fewer pile-ups, and Heat + OptiK had virtually no pile-ups. Also, Heat + OptiK had the most even and smoothly connected distribution of the mapped reads.
Figure 6b shows mapping results for 3' reads. Ctl showed pile-up patterns just like in the 5' case. The pile-up patterns were weaker using Heat and largely disappeared using Heat + OptiK.
To study genome-wide pile-up patterns using the Ctl, Heat, and Heat + OptiK libraries, we calculated pile-up-to-read ratios from the total mapped reads (Figure 7a). Genome-wide pile-up-to-read ratios for Heat are slightly smaller than for Ctl, but Heat + OptiK had a much lower ratio than the other libraries. The three libraries had similar total mapped reads numbers, about 5.3 million (Table S1). Therefore, the reason for the difference in genome-wide pile-up-to-read ratio is the library construction method.
For individual genes, the three libraries had different patterns of pile-up-to-read ratios (Figure 7b and c). For the GABAB1 gene, Heat and Heat + OptiK had much lower ratios than Ctl, and Heat + OptiK showed slightly smaller ratios than Heat (Figure 7b). For ACTB gene, the ratio of Heat was near the average of the Ctl and Heat + OptiK ratios (Figure 7c). Among these two genes, the ACTB gene is more highly expressed than the GABAB1 gene, and the pile-up-to-read ratios of the ATCB gene were higher than the ratios of the GABAB1 gene. Genes that have more mapped reads have more chances to have identical reads and higher pile-up-to-read ratios.
Identifying the detailed sequence bias patterns of the three libraries Ctl, Heat, and Heat + OptiK
Next, we carried out sequence logo and entropy analyses for the three libraries (Figure 8 and 9). Figure 8 represents results for 5' reads, and Figure 9 represents results for 3' reads. Figure 8b and 9b showed the sequence logo results for the computational control for Figure 8a and 9a. Computational controls were generated as for Figure 3, but for Figure 9 35 bp reads were prepared as controls for the 35 mer 3′ end reads.
As described before for Figure 3a, Ctl had a strong bias of AA at the 5′ end of reads and showed minor bias at the other positions (Figure 8a). Heat had less bias right at the 5′ end but had some bias throughout the entire 50 bp window. T4PNK is biased though its bias is somewhat less than the RNaseIII bias at the 5′ end. Finally, Heat + OptiK had almost no bias throughout the entire 50 bp window. Even though its negligible bias shares the same sequences with Heat, OptiKinase decreases the T4PNK sequence bias. For 5′ reads, Heat + OptiK was the least biased method overall.
For 3′ reads, biases of Ctl were generally more severe than for the 5′ reads (Figure 8 and 9), but the overall bias pattern followed that of the 5′ reads. Ctl showed the strongest bias at the beginning of reads with a preferred sequence of GG. Heat was biased towards sequences starting with A or T and had small bias ubiquitously throughout the sequence. The bias pattern was very different from the one found in Ctl. Thus, we believe that this bias was caused by the phosphatase activity of T4PNK. Most enzymes acting on nucleic acids are generally affected by nucleotide sequences near their reaction sites [4]. Surprisingly, Heat + OptiK showed a much smaller sequence bias for 3′ reads than Heat. Thus, it seems that pretreatment with OptiKinase weakens the bias of T4PNK phosphatase activity because the T4PNK mostly has phosphatase activity rather than kinase activity.
As expected, the entropy results generally mirrored the sequence logo results (Figure 8c and 9c). The entropy plots showed clearly that the entropy values for Heat + OptiK fell consistently below the entropy values for Ctl and Heat for both 5′ and 3′ end reads (Figure 8d and 9d). Therefore, Heat + OptiK is the least biased method for both 5′ and 3′ end reads.
As a result of our study, we have obtained the sequence biases of RNaseIII and T4PNK. T4PNK kinase activity has a minor bias at the 5′ end of fragmented ssRNA (Figure 8). T4PNK phosphatase activity is biased mostly towards sequences with an A or T nucleotide at the fragment end of 3′ reads (Figure 9). Its sequence biases make less severe pile-ups than the broad RNaseIII sequence biases (Figure 6b).
Even when the pile-up reads were filtered out from mapped reads of the initial whole transcriptome library and the Ctl library, the filtered reads still showed the same sequence bias patterns of RNaseIII with a smaller number of mapped reads (Additional file 3: Figure S2). Therefore, RNaseIII has a specific target sequence for its reaction.
RNaseIII has biases at both the 5′ and 3′ ends (summarized in Figure 10a). Its sequence pattern was mostly 5′ - AA∙∙∙ ∙∙∙CC - 3′ (We reverse complemented the 3′ sequence here). Based on the known RNaseIII reaction mechanism, the two 5′ and 3′ biased sequences were separated by the short fragmented RNAs (Figure 10b). Thus, we have shown that RNaseIII introduces sequence biases into the long fragments that it produces.