Reviewers report 1 - Frank Eisenhaber
"The authors describe a method to determine underrepresented pairwise combinations of hexanucleotides with spacers of differing length between them (with up to two mismatches for each motif half) and find that up to 3% of these pairs have drastic reductions of occurrences at specific spacer lengths. These motifs are called spanions. Spanions are reported to be clustered and to occur at genomic locations that are correlated with (i) some isochors and notably CpG islands, (ii) RNAPII binding sites, and (iii) locations of tiRNAs and several other small RNAs.
Whereas the work brings up important observations and successfully connects them with previous knowledge, the manuscript would benefit from considering the following issues:
1) Is there any motivation why the authors analyze motifs of the type hexanucleotide-spacer-hexanucleotide and restrict the mismatches to two on each side? Why not one mismatch or penta-/heptanucleotides, why is the spacer introduced?"
Authors' response: The presented statistical model is only the most potent one amongst the tested variants. A short paragraph has been introduced in the 'Methods' section of the revised version covering the general considerations of the model construction and our experience with the less successful candidates. The use of the spacer is also described briefly (see Methods: 'The statistical model' and 'Selection of motifs').
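As a rough illustration of the motif type under discussion (and not the authors' statistical model), the sketch below counts exact occurrences of a head hexanucleotide followed by a tail hexanucleotide at each spacer length. The function name `count_by_spacer`, the toy sequence, and the exact-match simplification (no masked or mismatched positions) are assumptions made for this example only.

```python
def count_by_spacer(seq, head, tail, max_spacer):
    """Count head-spacer-tail occurrences for each spacer length 0..max_spacer.

    Exact matching only: masked/mismatched positions are NOT modelled here.
    """
    counts = {}
    for spacer in range(max_spacer + 1):
        span = len(head) + spacer + len(tail)
        n = 0
        for i in range(len(seq) - span + 1):
            tail_start = i + len(head) + spacer
            if seq.startswith(head, i) and seq.startswith(tail, tail_start):
                n += 1
        counts[spacer] = n
    return counts


# Hypothetical toy sequence: one pair with a 2-base spacer, one with no spacer.
toy = "ACGTAC" + "GG" + "TTTAAA" + "CC" + "ACGTAC" + "TTTAAA"
print(count_by_spacer(toy, "ACGTAC", "TTTAAA", 3))
```

A pair whose count drops sharply at one particular spacer length, relative to the other lengths, is the kind of signal the reviewers' summary describes.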
"2) The work would greatly benefit from presenting detailed data for a single representative spanion and a single representative non-spanion so that the reader gets a feeling what kind of data does this analysis produce (to be inserted at page 3 bottom/page 4 top)."
Authors' response: The full lists of Human and Mouse spanions are submitted as additional files 2 and 3. Table S4 (additional file 1) provides an example of spanion motifs. These changes were also requested by the other two referees.
"3) The section "spanions and isochors" finally does not clarify what is the relationship between them. How many spanions are "near" CpG islands or isochors (what is the distance relationship)? Generally, the authors are scarce with exact numbers; instead, the trends in the data are described with words here and throughout the text. Maybe, it would be good to summarize in a table all absolute numbers of spanions, RNAP sites, tiRNAs, etc. and how many of those overlap sequentially."
Authors' response: The text of the section has been altered in several places and answers the question in its present form. A new table has been added listing the correlation data of RNAPII/tiRNA segments and the spanion clusters. The data on the relation of CpG islands vs. spanion clusters are available in additional file 1. In our opinion it would be slightly confusing to present the correlation of spanion clusters with experimental results and with a prediction method (CpG islands) in the same table.
"4) The author should provide a algorithmic definition of what is a spanion cluster with all parameters in the main text (top of page 5)."
Authors' response: The detailed description of the scoring algorithm and the filtering procedure is included in the relevant 'Methods' section and referenced in the main text. We introduced a Supplementary Table and a Supplementary Figure presenting the concept in a visual way. (These changes were also requested by Rotem Sorek.)
"5) The language of the MS would benefit from polishing. Some sentences are incomplete (e.g., 2nd sentence of last paragraph of page 7)."
Authors' response: The MS has been modified at several points according to the comments of the referees (including that particular sentence) and hopefully the most confusing parts are corrected in the current version.
"6) The abstract would benefit from including all conclusions and some of the most important numerical results in this MS; at present, it is too verbose without the interesting pieces of information provided in this work."
Authors' response: The 'Abstract' now includes the most important conclusions of the work in an explicit form.
Reviewers report 2 - Sandor Pongor
"The work of Cserzo and associates presents a statistical analysis of rare sequence motifs in the human genome. They find that the rare sequence motives, termed spanions cluster in the vicinity of RNA pol II and tiRNA binding sites. These interesting observations shed light to a relatively less known property of genomic sequences which certainly deserves systematic analysis. The central concept of this work is the role of rare motifs. In my opinion the statistical analysis of rare motifs is a particularly important topic. According to the working hypothesis of this work, rare motifs carry specific functions. The naive reader expects that rare motifs should coincide with promoters and other protein binding sites. In contrast, the present work shows that rare motifs cluster around various other binding sites, notably tiRNA and RNA pol II binding sites, which points to the currently perhaps underestimated - role of the latter in regulatory events.
1) The present analysis is based on direct enumeration of bipartite motifs consisting of two hexanucleotides connected with a spacer of varying length. It is not clear to me whether the bipartite motifs are chosen because of their similarity to transcription factor binding sites or because of their "enumerability". Namely, direct enumeration of DNA motifs is a computationally hard problem which can apparently be solved on this particular subset. While addressing this question in the manuscript the authors may also comment on the percentage of the motif space they analyze so that the reader can get an impression about the generality of the conclusions."
Authors' response: Indeed, our original intention was to identify cooperative transcription factor binding sites at fixed distance in the sequence. Our model could not find those but picked spanions instead. Frank Eisenhaber also queried the details of the model; please see our reply to his first point above.
"2) The authors may consider showing a few representative examples of the avoided motifs, motif clusters and give more detailes on how the motif clusters are defined. Also some of the conclusions (e.g.) could be supported by more statistical details given as supplementary information. For instance, a list of spanion clusters for the human and mouse genomes could be given as an appendix."
Authors' response: The spanion libraries as well as the spanion clusters in the transcript proximal region of Human and Mouse genomes are submitted as Supplementary files as requested. The details of the scoring procedure are also illustrated with a new Supplementary Table and Supplementary Figure.
"3) Finally, some of the conclusions I found particularly interesting are not mentioned in the present version of the abstract. The abstract would definitely benefit from a thorough brush-up, according to the guidelines of the Journal."
Authors' response: The abstract is modified as requested.
Reviewers report 3 - Rotem Sorek
"In this paper Cserzo et al. devised an algorithm to scan the human and mouse genome for underrepresented sequences, which they called spanions. They further found that these spanions are overrepresented in proximity to TSS of genes. The authors infer their results as if spanions are functionally connected to tiRNAs and polII binding sites; however, this is far from convincing, as these features are co-localized to TSSs, where spanions also co-localize. As described below, this is a major flaw of this manuscript as currently written, and more analyses are needed to establish the spanion-tiRNA overlap theory."
Authors' response: It appears that the referee missed a few crucial points of the paper, perhaps due to the insufficient depth of explanation. We improved the text in response to his comments in the hope that he reconsiders his first judgment (see the responses below).
"Major issues:
Once the authors showed that spanions are enriched in TSS (and even more bothering: according to the Methods, the filters were set so that spanions will be enriched in TSS) it is an expected and a trivial result that spanions will be enriched within tiRNAs and polII binding sites as these two are enriched in TSSs. The authors themselves mention that "the overlap between RNAPII binding sites and tiRNAs is also reported (by Taft et al.)". The authors also stated that tiRNAs were mapped only to unique genomic sequences, which bias their appearances in spanions (those tiRNAs that are not located in unique genomic sequences were not reported by Taft et al.)."
Authors' response: We agree with the concerns of the referee regarding the filtering procedure. However, the filter settings were obtained on the basis of the statistical significance of the individual predictions. The procedure separates weak signals and strong ones detecting - but not generating - enrichment of spanion clusters as highly significant hits around the TSSs. Generally speaking, post processing the output of any statistical model - i.e. ranking the predictions according to the signal strength and setting minimal confidence level requirement - is a widely accepted practice. We modified the 'Scoring procedure' section in the 'Methods' to avoid the confusion and emphasize the importance of the filtering.
The referee is right that the correlation of RNAPII binding sites, tiRNAs and spanion clusters is inevitable to a certain extent, as all three genomic features are concentrated around the TSSs. The only question is whether the observed correlation exceeds that random extent or not. Therefore the correlations of the experiments and the prediction were contrasted with the correlations of the experiments vs. a random reference. The experimental data sets in Fig. 3 and 4 yield visibly different distributions with the predicted spanion clusters relative to the corresponding reference set. This visual evidence is expressed in numbers by the results of the chi-square goodness-of-fit test (see 'Conclusions'). Overall, the observed correlations of RNAPII sites and tiRNAs with the spanion clusters are well beyond the level one can expect by pure coincidence.
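To make the comparison against a reference set concrete, the following minimal sketch computes a Pearson chi-square goodness-of-fit statistic for binned distance counts. It is an illustration only: the function name `chi_square_gof`, the binning, and the toy counts are assumptions, and nothing here reproduces the actual data or bin choices of the manuscript.

```python
def chi_square_gof(observed, reference):
    """Pearson chi-square statistic of observed bin counts vs. a reference histogram.

    The reference counts are rescaled to the observed total to give the
    expected counts. Bins with zero expected count are skipped.
    """
    if len(observed) != len(reference):
        raise ValueError("observed and reference must have the same number of bins")
    total_obs = sum(observed)
    total_ref = sum(reference)
    stat = 0.0
    for o, r in zip(observed, reference):
        expected = r * total_obs / total_ref
        if expected > 0:
            stat += (o - expected) ** 2 / expected
    return stat


# Hypothetical counts: a peaked observed distribution vs. a flat reference.
observed = [5, 40, 5]
reference = [20, 20, 20]
print(chi_square_gof(observed, reference))
```

The statistic is then compared with a chi-square distribution with (number of bins - 1) degrees of freedom; a large value indicates that the observed distribution departs from what the reference set predicts.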
"Based on this, the title should be changed to exclude mentioning tiRNAs, as the authors have absolutely no evidence that these spanions are connected to the phenomenon of tiRNAs."
Authors' response: The evidence is presented in Figure 4 as the difference between the real set and the reference set.
"I also don't understand the reference set selection. If a spanion cluster occurrence is a rare event in the genome, isn't it expectable not to see the same event re-occurring in a random gene and in a specific distance from polII binding site/tiRNA?"
Authors' response: The generation of the reference set is one of the crucial points of the paper. It is absolutely essential to understand this step for the correct interpretation of the results demonstrating the close link between the spanion clusters and the experimental observations. Accordingly, a detailed explanation via an example has been introduced in the section 'The link between spanion clusters and RNA polymerase II binding Sites'.
"The statistical model:
Most known motifs have a single, possibly degenerate, consecutive pattern (often represented by a position weight matrix) and the motifs you are searching have a unique pattern of fixed head-spacer-fixed tail. Please explain the logic behind the model; why did you decide to work with such a complex pattern? Why did you select to work with 6 consecutive bases of DNA in the head and tail? Why was masking of 2 bases chosen? What is the logic behind the decision to work with fixed flanking sequences and a variable spacer size? Please denote if these preferences are based on any empirical computational results or a biological principle."
Authors' response: Please see our answer to Frank Eisenhaber's first question and Sandor Pongor's first remark.
"It is also important to note in the paper, and not only in the Sup. Information part, that the vast majority of spanions (~95%) had a spacer size = 0 (Sup. Table1-2), which means that they are actually a consecutive 12-bp degenerate motif (a 'normal' motif), and not a bipartite motif recognized by its distinct head and tail separated by unimportant sequence."
Authors' response: Done (see Methods: 'Selection of motifs', last paragraph).
"Results and discussion - the relation of spanions and isochors.
This part should be shortened and moved to the end of the discussion. It was hard to understand the connection of this part to the spanions phenomenon until reading, in the next page (page 5), that spanions are GC rich and enriched in TSS, similarly to CpG islands."
Authors' response: According to our experience the conceptual difference of CpG islands and spanion clusters is a major issue for the potential readers. We changed the order of these two paragraphs as suggested but we prefer the lengthy and detailed version for better understanding.
"Results and discussion - Scanning Human genome for spanion clusters
The concept of spanion clusters repeats throughout the paper and is not clear.
Please give an example for a spanion cluster and its creation from different spanions, preferably as a Sup. Figure."
Authors' response: The explanation is in the second paragraph of the new version of 'Scoring procedure'. Briefly, spanions are the motifs of the statistical model while spanion clusters are sequence fragments with high spanion content. It is also presented via an example (Tab. S4; additional file 1).
"Figure 1and 2: The impression one gets from these figures is that spanions are enriched in TSS. However, the authors are not showing whether the rarity of these sequences might contribute to the positional bias. Therefore, I suggest that the authors would add as another control to the analysis, data of non-spanions from your initial analysis, that is, sequences that are represented as expected or overrepresented in the genome. It would be more interesting if the enrichment in TSS is specific to spanions and not to non-spanions. Also, in figure 2, it would be useful to add the normalized frequency of randon intergenic regions that are far from genes."
Authors' response: Figure 1 shows the spanion cluster part of the transcript proximal region. As this subset is overrepresented around the TSSs, its complementary set is necessarily underrepresented. We cannot see the benefit of presenting this trivial fact on the plot.
"Figure 4
Regarding figure 4: since spanions and tiRNAs are approximately the same size (whereas spanions are much smaller than the polII binding sites) it would be much more comprehensible if the distance between these elements would be presented as the distance between the 5' edges of both element types. Actually, if tiRNAs are ~18 bp long, and the distance peak is around -20 bp between spanion start and tiRNA end, then both elements should start in the same position."
Authors' response: The 5' ends of the segments are indeed close to each other in a number of cases. We made this explicit in the legend of the figure. However, we prefer to use the same metric as in the case of RNAPII, where the spanion clusters tend to accumulate 80 bases upstream of the 3' end of the segments and the chosen metric is suited to that. In our opinion the identical metric together with the notes in the legends makes the two figures more comparable.
"Relation of spanion clusters and tiRNAs
If tiRNAs are presumably functional RNAs and the authors want to show that spanions are related to functional RNAs, it would be convincing if the tiRNAs are enriched within conserved (in human-mouse) spanions."
Authors' response: The main aim of this step is to establish the relation between the statistical model and experimental evidence. The referee is right that it would be a stronger argument if we could access and analyze tiRNA data for mouse. We are eager to do so as soon as the data become available. Until then we have to rely on Human data.
"It would be also nice to see if tiRNAs are enriched within spanions of higher spike index."
Authors' response: Only the fragments which are concentrating high spike index value spanions can pass the filtering procedure. Please see our response concerning the filtering procedure above.
"At the moment I am not convinced that tiRNAs are overlapping with the very large list of transcript proximal spanions (> 200K according to Sup. Table 1) just because the two groups co-localize in the TSS of genes."
Authors' response: According to our results the observed co-localization of the two sets exceeds the level of correlation that would be expected by pure chance. Please see our response concerning the generation of the reference set a few comments above and the related improved version of the text.
"Page 8 - the paragraph starting with "For the fourth". The authors hypothesize that spanions are enriched in some other small RNA groups, in addition to tiRNAs. My first guess was that spanions would overlap with many miRNAs as these are short (~22 bp) similarly to spanion clusters, noncoding, and many of them are found in single copies in the genomes and transcribed mainly by polII (spanions are near polII binding sites). However, the authors states in the sup. information that according to their finidings "even if the detected spanion clusters are related to microRNAs this link is rather weak.". I believe this negative result should appear in the paper and not in the sup. information as many readers would think of it."
Authors' response: The moderate correlation of miRNA genes and spanion clusters is mentioned as suggested in the 'Relation of spanion clusters and tiRNAs' section.
"Minor issues:
Please supply the full list of spanions/spanion clusters, preferably with their spike index, as a supplementary material."
Authors' response: The Human and the Mouse spanion libraries for the scoring procedure are included as Supplementary Files.
"Page 2, last line: change 'what' to 'that' (a typo)."
Authors' response: We rephrased the sentence.
"Page 3, line 7 from the end of page: You are explaining the calculation of 2560 possibilities in the methods part but here this number is confusing since 4^6 = 4096. Please explain the calculation here (4^4*10 = 2560) or refer to Methods."
Authors' response: The reference to the Methods section is included.
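For readers following the referee's arithmetic, the two counts can be checked directly. The sketch below only evaluates the expressions quoted by the referee; the origin of the factor 10 is explained in the Methods section and is not reconstructed here, and the variable names are chosen for this example.

```python
# Number of distinct hexanucleotides over the alphabet {A, C, G, T}.
all_hexamers = 4 ** 6

# The referee's expression for the number of possibilities quoted in the text.
quoted_possibilities = 4 ** 4 * 10

print(all_hexamers, quoted_possibilities)
```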
"Figure 2- the usage of two colors here is little confusing. If the different colors are used only to emphasize each separate region, then please mention it in the figure legend. It is also worth mentioning in the text that the lack of continuity in the frequency of spanions in the border between an intron and coding/noncoding exon and the fact that intron edges are very poor in spanions may be caused by the fact that intron edges are characterized by clear overrepresented splicing signals (5' and 3' splice sites, polypyrimidine tract) which cannot be spanions by definitions. This bias contradicts the example given by the in the fourth explanation for the large number of spanions (upper part of page 8)."
Authors' response: The legend to Figure 2 has been modified according to the suggestion. The referee is right that exon/intron edges are poor in spanion clusters, but they appear at the exonic side close to the junction. The text is corrected accordingly at the referred place.
"Page 4, paragraph starting with the second question: the context of the question itself is not clear (what is the complicated statistical model and what is the low resolution approach here?) and so does the example given as an answer. Maybe you should start the paragraph by defining that isochors determination is a low resolution approach is, as mentioned in the next paragraph, and you are suggesting here the 'spanion statistics' concept which is a more complicated model."
Authors' response: The paragraph has been modified according to the request.
"Page 5, line 13 - "spanions are G/C rich" - it is worth mentioning that this is an expected result as the genome is GC poor (41% GC according to the Lander et al. Nature 2001 Initial sequencing and analysis of the human genome) so we expect underrepresented motifs to be GC rich."
Authors' response: The remark and the reference are included as suggested.
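The referee's expectation can be illustrated with a simple independent-base model, in which each base is drawn independently with a genome-wide GC fraction of 0.41. This is a deliberate simplification (it ignores, for example, dinucleotide effects such as CpG depletion), and the function name `expected_kmer_probability` and the example hexamers are assumptions made for this sketch.

```python
def expected_kmer_probability(kmer, gc_fraction=0.41):
    """Probability of a k-mer under an independent-base model.

    G and C each get probability gc_fraction / 2; A and T each get
    (1 - gc_fraction) / 2. This is a toy null model, not the authors' model.
    """
    p_gc = gc_fraction / 2
    p_at = (1 - gc_fraction) / 2
    prob = 1.0
    for base in kmer.upper():
        if base in "GC":
            prob *= p_gc
        elif base in "AT":
            prob *= p_at
        else:
            raise ValueError(f"unexpected base: {base!r}")
    return prob


gc_rich = expected_kmer_probability("GCGCGC")
at_rich = expected_kmer_probability("ATATAT")
print(gc_rich, at_rich, gc_rich < at_rich)
```

Under this null model a GC-only hexamer is expected several times less often than an AT-only one, so GC-rich motifs are natural candidates for underrepresentation in a GC-poor genome.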
"Page 6, without the notion that RNAPII segments span several hundred bases (that appear afterwards) in comparison to the ~25 bp of a spanion, the following sentence is unclear: "The distribution shows strong relative positional preference of the two datasets with maximum at around -80, i.e. spanion clusters are typically mapped to the 3' ends of the RNAPII segments." - since a short RNAPII segment might be downstream to a spanion if their distance is negative."
Authors' response: The correction mentioning the size difference of the two sets is introduced in the legend to Figure 3.
"Page 8: "Numerous examples have been identified where a spanion cluster located in an alternative exon of a gene appears in the 5' UTR region of a different gene in reverse complement orientation." -please add at least one reference."
Authors' response: This is a preliminary result of our more complex analysis. We made this explicit at the referred place in the text.
"Page 10, line 6 for the end - change 'point' to 'points'."
Authors' response: Done.
"Sup. Figure 4- The authors suggest in the sup. information (page 3, line 7 from the bottom) that the spanions are underrepresented genome wide, but these otherwise rare fragments accumulate at the proximity of start sites of genes. It will be more convincing that the peak here is caused by spanions in TSS, if the global sampling set will be divided into those that are in TSS and those not in TSS."
Authors' response: Fig. S4 (additional file 1) presents the distribution of the spike indexes calculated from two human database sections and also suggests, in an indirect way, the preference of spanion motifs for the TSS segments. This could indeed be supported by the suggested division of the global sampling set. However, Fig. 1 and 2 answer this question in the most direct way, presenting the specific genetic locations where the spanion motifs are concentrated. Therefore, in our opinion, the suggested change would bring very little improvement relative to the original version while requiring repeated calculations on the two parts of the database.