Exploratory Period Estimation
The discrete Fourier transform (DFT) is defined as
where N is the length of the DNA sequence fragment, and k ∈ ℕ, is a frequency index corresponding to a period p = N/k. The DFT is thus only evaluated at selected integer periods, whose values depend on N, while typically integer periods are of most interest. For this reason, we also consider a second technique, the integer period discrete Fourier transform (IPDFT) [21], which is a variant of the DFT evaluated only at integer periods:
where p
max
is the longest period of interest. It should be noted that the Fourier components of the IPDFT are not all mutually orthogonal. For Fourier-based methods, explicit bounds on the period resolution can be determined. For example, in sequence fragment lengths of 150 bp (rounding up the length of nucleosome associated DNA), the period beyond which adjacent integer-spaced periods cannot be resolved using Fourier-based methods (with a rectangular window) is approximately
. Thus period-10 behaviour can be adequately estimated by these methods, to the resolution of the nearest integer period. The autocorrelation is widely used in period estimation for sequence data, and is defined as:
Finally, we also use the recently developed Hybrid autocorrelation-IPDFT technique, which has shown promise for overcoming some of the limitations of the DFT and autocorrelation for period estimation in sequence data [11].
In each case, the period estimate
, where S[p] ∈ {X[p], X
IP
[p], r
xx
[p], H
x
[p]}, is taken to be the dominant periodicity of the sequence, and which we refer to as the dominant period throughout this work.
Significance Measures for Confirmatory Period Estimation
Fourier-Based Period Estimation Variance Bounds
Fourier methods are often used for maximum-likelihood estimation of a dominant frequency (period) in a numerical signal, both within [18, 19] and outside [38] the genomics literature. For this estimate, it is possible to derive a Cramér-Rao bound (CRB) on the variance of this frequency estimate [39]. Following a similar procedure, we derive a bound for the (sinusoidal) period estimate variance herein as:
where
is the 'signal-to-noise ratio', which can be interpreted as the ratio of the energy of the perfectly periodic component of x[n] to that of the remaining component of x[n]. It will be noted that this bound is a very strong function of the period p, reflecting the period resolution effect discussed in the foregoing sub-section. This can be observed to contrast with Tretter's [39] CRB for sinusoid frequency, which is a similar function of inverted SNR but is not a function of frequency, and hence not of period. Clearly calculation of the CRB requires an estimate
of the signal-to-noise ratio, since the true 'signal' and 'noise' components are unknown. Here we use a definition based on the discrete Fourier transforms of the actual and perfectly periodic sequences (see Additional file 4) which (by Parseval's theorem) approximately preserves the conventional interpretation of the signal-to-noise ratio. The CRB is applicable to numerical signals whose periodicity is sinusoidal, or can be represented using sinusoids, and would not typically be used as a statistical test. Here, we use the CRB as a measure of the significance of a candidate period, and compare it qualitatively with the other significance measures.
g-Statistic
The g-statistic, originally proposed for the periodogram by Fisher [40], and used recently by Wichert et al. [41], Ahdesmaki et al. [31] and in a comparison by Ptitsyn et al. [32] for detecting periodicity in microarray time series data, can be generalised to the period estimation methods considered herein, in a modified form as
under the assumption that the 'noise' (i.e. all signal components other than the dominant periodicity) is Gaussian. Clearly this will be a weak assumption if multiple periods are present in the signal, however in general this is not known a priori. Using the p-value based on the exact distribution of g [41], with notation adjusted for use herein,
where g
obs
is the observed value of g and
, it is possible to test whether a sequence is purely random or whether it has periodic behaviour. The g-statistic is applicable to numerical signals.
Blockwise Bootstrap
In sequence analysis, both of the permutation approaches discussed in the background section have the limitation that they disrupt the non-random background distribution of polynucleotides. In essence, neighbouring nucleotides cannot be considered to be independently distributed [34, 35].
Hence, we adopt a method we refer to as the blockwise bootstrap (BWB). As in Ying et al. [33], here we resample sequences of interest by building a resampled sequence s
p
[n] from p-length fragments of s[n] selected at random, i.e.
(s
p
[lp], s
p
[lp + 1],..., s
p
[(l + 1)p - 1]) = (s[n], s[n + 1],..., s[n + p - 1]), l = 0, 1,..., ⌊N/p⌋, where n ∈ {0, 1,..., N - p} is selected at random for each l. An appealing feature of this approach is that it preserves the base rate of occurrence of nucleotides (and polynucleotides up to length p) during the test. R synthetic sequences are produced by randomly resampling, with replacement, from the original sequence fragment, and the number of times N
G
a peak greater than or equal to
is measured at period
is recorded. Finally the p-value P
BWB
of the test sequence is determined as N
G
/R. A low p-value, for example less than 0.01, corresponds to fewer than 1 in 100 resampled sequences exhibiting a peak greater than or equal to
at period
. Of course this test can be applied to other periods than
, to ascertain the significance of a secondary peak, for example. This type of test has been applied in a very wide range of applications, e.g. [42].
Although the BWB is a fundamental test for p-periodicity, it has the shortcoming of being very computationally expensive to apply. Like the g-statistic, the significance level of the BWB is dependent on the period estimation technique, rather than merely the data and the period of interest, in contrast to the other two measures considered.
Pearson's Chi-Squared Test
Treating perfect periodicity as a model, whose fit to (or deviation from) the sequence data of interest is to be measured, a chi-squared test can be developed. In this case the deviation corresponds to a period estimation significance measure, while the test itself is a threshold for the measure.
To calculate the deviation, it is necessary to first define the 'model'. For a sequence fragment that is perfectly p-periodic with respect to symbol s
m
, m ∈ {1, 2,..., M }, the probability mass function (pmf) is
. Note that this pmf says nothing about other symbols; they could be randomly occurring or also perfectly periodic, but most often we are interested in the periodicity of a particular symbol or symbols.
Having determined the periodicity 'model', a count of the observed instances of periodicity is required. For each position in a sequence fragment s[n], note the value s
m
of the current symbol, look ahead by p, record the presence or absence of each symbol of interest, then aggregate these across all instances of s
m
and divide by the total number of occurrences of s
m
, to produce an empirical p-spaced pmf for each symbol.
That is, for a sequence fragment s comprised of symbols s1, s2,..., s
M
, form the set C
m
= {s[n] | s[n] = s
m
, n = 0, 1,..., N - 1} and the M sets
, one per symbol s
m
. The empirical pmf for each symbol,
, is then
where |C| denotes the number of elements in C.
The deviation measure can thus be constructed as
Note that the deviation measure can be quite flexibly constructed, in the sense that 'symbols' can be replaced by sets of symbols of interest, for example rather than treating AA, AT, TA separately, periodicity in any of {AA, AT, TA} can be treated as a single 'symbol'. Just as all symbols that are of interest can each be lumped together in the pmf, all symbols that are not of interest can also be lumped together, rather than listing a probability for every possible symbol (e.g.
if s[n] = s
m
and
if s[n] ≠ s
m
), which is convenient when dealing with periodic long polynucleotides. This approach has some similarities with the mutual information method for revealing latent sequence periodicity proposed by Chaley et al. [30].
Finally, from the deviation statistic a p-value P
CS
may be obtained by comparing its value with a chi-square distribution with one degree of freedom.
Simulation of eroded and approximately periodic data
In order to compare the period estimation performance of the methods discussed above, an experiment was conducted in which 100 synthetic period-10 sequences of length N = 150 were generated and then eroded in increments of 1% to a maximum of 50% erosion. The erosion comprised 'flipping' the binary indicator sequence representation of equation (1) of a percentage of the N dinucleotides, as adopted by Arora et al. (2008) for nucleotides. This corresponds to the substitution of alternative dinucleotides for AA, TT, TA, and vice versa. The percentage of instances for which each period estimation method correctly predicted the period-10 behaviour was then recorded. This experiment was then repeated for approximate periodicity, for which we defined Gaussian distributed periods with expected value of 10 and standard deviation σ ∈ [0.1, 3]. This corresponds to the insertion and/or deletion of non-{AA, TT, TA} dinucleotides. To better understand their properties for known periodic sequence data, all significance measures considered herein were compared on the same synthetic period-10 sequences eroded and modified to produce approximate periodicity to assess the degradation of the period estimates for non-ideally periodic sequences. In all cases the results shown are averages across 100 random degradations of a pure period-10 sequence, and the IPDFT was used for period estimation throughout, to allow comparison with the CRB (which requires a Fourier-based method). In these experiments, the BWB used R = 1000 random sequences.
To assess the significance measures in terms of their accuracy as confirmatory methods, a period detection simulation was constructed. For 1000 instances of a perfectly periodic synthetic sequence of length N = 150 with 10 bp periodicity degraded between 1% and 50% (20 instances per 1% increment), i.e. the true positives, and 1000 randomly permuted instances of the same sequence, i.e. the true negatives, the sensitivity and specificity and hence receiver operator characteristic (ROC) curves were calculated. In this experiment it is possible that a small number of highly eroded periodic sequences may strictly be true negatives and a small number of random sequences may strictly be true positives, however this is common across all methods compared. This procedure was repeated for 1000 instances of the same sequence but with the period varied randomly according to a Gaussian distribution with expected value of 10 and standard deviation σ ∈ [0.2, 4]. (50 instances per 0.2 increment in σ).
Software implementation
All methods were implemented originally in MATLAB Version 2008a. The methods for period estimation and significance testing were then independently implemented in Python and Pyrex, a language that generates C-code which is compiled into dynamically loaded Python extensions. The compiled versions substantially improve compute performance. The source code has been contributed to the open sourced genome biology toolkit PyCogent [43] and is available from the subversion repository. All genomic analyses were conducted using the Python implementation. All scripts used are available on request from the authors.
Biological data
Yeast genome sequence coordinates for nucleosome associated DNA were obtained from Lee et al [37], whose procedure we briefly summarise. This data set was generated by analysis of a micrococcal nuclease (MNase) digestion of whole yeast genomic chromatin that had been subjected to cross-linking of histones to DNA. The resulting purified DNA fragments were then hybridized to an Affymetrix probe array with a 4 bp resolution. A Hidden Markov Model was used for detecting regions corresponding to 'well-positioned', defined as spanning 31-38 probes, or 'fuzzy', defined as spanning 39 probes, nucleosomes. Linker regions were defined as those spanning between identified nucleosome positions. Coordinates for the well-positioned, fuzzy and linker regions were downloaded from http://chemogenomics.stanford.edu/supplements/03nuc/datasets.html (dataset S5). Since these regions differed in length, and statistical power of the period estimation methods are sensitive to length, we modified these sequence coordinates such that sequence fragments from each class were all exactly 150 bp long. Specifically, the sequence coordinates from Lee et al were adjusted by equivalent symmetric expansion (in the 5' and 3' directions) until the coordinates were exactly 150 bp long. Only sequence coordinates that were independent (did not overlap with any other coordinates) were used. The genomic sequences corresponding to these coordinates were downloaded from http://www.ebi.ac.uk/~huber/yetia/yetiadata/SGD-0508/ The total number of sequences in each class were: 31557 well-positioned nucleosomes; 41770 fuzzy nucleosomes; and 10465 linker regions. Mouse genomic sequences were obtained from Ensembl release 58.