 Research
 Open Access
 Published:
Finitesize effects in transcript sequencing count distribution: its powerlaw correction necessarily precedes downstream normalization and comparative analysis
Biology Direct volume 13, Article number: 2 (2018)
Abstract
Background
Though earlier works on modelling transcript abundance from vertebrates to lower eukaroytes have specifically singled out the Zip’s law, the observed distributions often deviate from a single powerlaw slope. In hindsight, while powerlaws of critical phenomena are derived asymptotically under the conditions of infinite observations, real world observations are finite where the finitesize effects will set in to force a powerlaw distribution into an exponential decay and consequently, manifests as a curvature (i.e., varying exponent values) in a loglog plot. If transcript abundance is truly powerlaw distributed, the varying exponent signifies changing mathematical moments (e.g., mean, variance) and creates heteroskedasticity which compromises statistical rigor in analysis. The impact of this deviation from the asymptotic powerlaw on sequencing count data has never truly been examined and quantified.
Results
The anecdotal description of transcript abundance being almost Zipf’s lawlike distributed can be conceptualized as the imperfect mathematical rendition of the Pareto powerlaw distribution when subjected to the finitesize effects in the real world; This is regardless of the advancement in sequencing technology since sampling is finite in practice. Our conceptualization agrees well with our empirical analysis of two modern day NGS (Nextgeneration sequencing) datasets: an inhouse generated dilution miRNA study of two gastric cancer cell lines (NUGC3 and AGS) and a publicly available spikein miRNA data; Firstly, the finitesize effects causes the deviations of sequencing count data from Zipf’s law and issues of reproducibility in sequencing experiments. Secondly, it manifests as heteroskedasticity among experimental replicates to bring about statistical woes. Surprisingly, a straightforward powerlaw correction that restores the distribution distortion to a single exponent value can dramatically reduce data heteroskedasticity to invoke an instant increase in signaltonoise ratio by 50% and the statistical/detection sensitivity by as high as 30% regardless of the downstream mapping and normalization methods. Most importantly, the powerlaw correction improves concordance in significant calls among different normalization methods of a data series averagely by 22%. When presented with a higher sequence depth (4 times difference), the improvement in concordance is asymmetrical (32% for the higher sequencing depth instance versus 13% for the lower instance) and demonstrates that the simple powerlaw correction can increase significant detection with higher sequencing depths. Finally, the correction dramatically enhances the statistical conclusions and eludes the metastasis potential of the NUGC3 cell line against AGS of our dilution analysis.
Conclusions
The finitesize effects due to undersampling generally plagues transcript count data with reproducibility issues but can be minimized through a simple powerlaw correction of the count distribution. This distribution correction has direct implication on the biological interpretation of the study and the rigor of the scientific findings.
Reviewers
This article was reviewed by Oliviero Carugo, Thomas Dandekar and Sandor Pongor.
Author summary
In the grand scheme of things, the fundamental issue of reproducibility has a longterm implication on scientific rigor in this fastpaced OMICSfrenzy era. Since technology is not always WYSIWYG (What you see is what you get), it is important to validate our observations against some theoretical basis. For transcriptomic analysis, the lack of reproducibility is often hinted by the high discordance among normalization methods in a typical comparative analysis workflow given the same data set. Since important conclusions are often made based on these NGSderived exploratory results, improving the reproducibility of the sequencing outputs becomes instrumental and ever more so since most bioinformatics analysis seldom bridge the gap between the exploratory finds and the true molecular actuators via the formal arguments of underlying molecular mechanisms. The latter is especially critical for clinical diagnostics applications.
Background
Despite some cautionary notes on the generalization of powerlaw on natural phenomena [1], cell transcript abundance has often been theorized as originating from the family of powerlaw distributions [2]. Typically visualized in terms of histogram or rankfrequency plot, transcript abundance distribution seems to follow the extreme value theory where only a couple of genes are highlyexpressed while the rest are relatively lowlyexpressed. Earlier works on modelling SAGEderived (serial analysis of gene expression) transcript abundance from vertebrates to lower eukaroytes have specifically singled out the powerlaw distribution, namely Zip’s law [3,4,5,6,7] where the slope of a powerlaw equation is about − 1 on a loglog scale. Originating from information theory, this slope describes the ideal compromise between the sender and receiver as the “Principle of Least Effort”; steep line represents a smaller and repetitive vocabulary while a shallower slope represents a larger and more diverse vocabulary. As such, Zipf statistic evaluates the balance between redundancy and diversity. Remarkably, Zipf’s law seemingly holds for most normal tissues of homogenous cell type and also approximately for the heterogenous cell type (i.e., the slope tends to be slightly lower than 1.0) [4]. However, there exists a caveat to the powerlaw association: the observed powerlaw distribution of transcript abundance is usually imperfect in that it deviates from a single parameterized powerlaw slope.
By far, it has been unclear if this deviation is either reflective of the underlying true distribution or indicative of some inherent biases in terms of library size/sequencing depth [8], transcript lengths [9] and GC contents [10] in the physical or technological process that generates the observations. In our best understanding, the implications of the powerlaw deviation in transcript abundance has never been truly examined in current literature. Presumably, most researchers deem this deviation to have minimal effects on the downstream preprocessing steps like read mapping, normalization and statistical analysis. However, it is clear that there is no general consensus on the preprocessing of RNAbased sequencing data but rather best practices [11], with the normalization step contributing to the largest variation in the workflow performance [12,13,14].
In retrospect, all powerlaws of critical phenomena are derived asymptotically under the conditions of infinite observations. In the real world, observations are finite and, therefore, the deviations from asymptotic powerlaw is to be expected in finite systems. The latter, which is known as finitesize effects, will force an observed powerlaw distribution into an exponential decay and presents itself as a curvature in the loglog plot [15]. Pertaining to the nature system that governs the cell transcript abundance, the critical question is to clarify if the observed powerlaw deviation is truly the result of the finitesize effects and not because the underlying distribution cannot be simply described by powerlaw [16, 17].
The implication here is that if transcript abundance is truly powerlaw distributed, its deviation or curvature on the loglog plot translates to varying exponent values which, in turn, signifies the changing mathematical moments (i.e., mean, variance, skewness, kurtosis) of the distribution. Overall, this will manifest as heteroskedasticity (i.e., unequal variance within the data) among the experimental replicates. Heteroskedasticity brings about two issues: Firstly, it introduces both bias and unequal variance to the data and poses additional difficulty to normalization methods where a good method aims to reduce variance without increasing bias [18]. Secondly, heteroskedasticity will bias test statistics since Type I and Type II error increases with underestimated and overestimated standard errors respectively as a consequence of unequal variance [19, 20].
In this work, we derived a generalized concept whereby the anecdotal description that transcript abundance sequencing count data is almost Zipf’s lawlike distributed can now be objectively quantified by the Pareto powerlaw distribution via its mathematical moments and how the distribution can be rendered mathematically imperfect when subjected to the finitesize effects in the real world; a manifestation of the aliasing noise when undersampling occurs. Our formalism concurs well with our empirical analysis of two modern day NGS (Nextgeneration sequencing) datasets: an inhouse generated dilution miRNA study of two gastric cancer cell lines (NUGC3 and AGS) and a publicly available spikein miRNA data; Firstly, the finitesize effects causes deviations of sequencing count data from Zipf’s law and the issues of reproducibility issues in sequencing experiments that seems inescapable despite the advancement in sequencing technology since sampling is finite in the real world. Secondly, finitesize effects manifests as heteroskedasticity among experimental replicates to create statistical woes.
Collectively, this justifies for a simple restoration of the sequencing count data towards a powerlaw distribution with a single exponent value, herein as powerlaw correction, to reduce sample variance of lower transcript counts towards homoskedasticity for improved statistical outcomes. When this method was evaluated in a typical NGS comparative analysis workflow that entails (i) read mapping/count quantification (ii) prefiltering of the zero counts across conditions (iii) normalization and (iv) the statistical testing, the signaltonoise ratio (SNR) of the workflow improved by 50% after powerlaw correction. In turn, this higher SNR translates to an increase in statistical and detection sensitivity by approximately 30% in the dilution analysis regardless of the mapping and normalization methods used in the evaluation. Most importantly, the powerlaw correction addresses a longstanding issue of discordance in the comparative analysis workflow, particularly attributed to the variations among different normalization methods [12,13,14]. Using the dilution study, the increase in concordance rate was averagely 22% from the baseline rate of (48.24 ± 7.07)% to (70.32 ± 6.72)% upon powerlaw correction. When a higher sequencing depth is presented, powerlaw correction can extract the additional information content to increase significant detection. Specifically, in the dilution analysis, the higher sequencing depth instance (by four times higher) has an increase concordance rate of 32% (44.6% ± 4.91% versus 76.25% ± 1.78%) while it was 13% (51.88% ± 7.26% versus 64.39% ± 3.65%) for the lower depth instance. Finally, powerlaw correction statistically enhances the biological context of the experiment and elucidates the multiple metastatic signatures of the NUGC3 cell line in the dilution study of two gastric cell lines.
Results and discussion
Finitesize effects introduces curvature in sequencing count data distributions, impacts the reproducibility of the experiment and brings about heteroskedasticity among experimental replicates
Two miRNA sequencing datasets composed of technical replicates were being examined; The choice of miRNA is deliberate to avoid both transcript length bias [9] and abundance quantification [21] as confounding factors. The first miRNA set is the background count data of a spikein experiment from a published study (GEO dataset: GSE67074) that contains 12 replicates [11]; The original authors’ BWAmapped counts were used. The second set is an inhouse generated dilution series of two gastric cancer cell lines  AGS and NUGC3 (See methods for details: The dilution dataset [22]). In this section, only the Bowtie1mapped NUGC3 set of 8 technical replicates that spans across 4 concentration points of 12pM, 6pM, 3pM and 1.5pM was used. The varying concentration design aims to simulate the different sequencing depth (i.e., the total mapped reads) that mimics a system of various sizes to study its finitesize effects (See Additional file 1: Figure S1).
Given that these datasets are made up of replicates, a simple intrasample scaling where the counts of each replicate is divided by the maximum count of the same transcript within the replicate, will suffice. Furthermore, instead of visualizing Zipf’s law distribution with rankfrequency graphs, the Pareto distribution plots were used (See methods for details: Transformation between rankfrequency and Pareto distribution). This has the added advantage of characterizing the sequencing count data with the mathematical moments (i.e., mean, standard deviations) of the Pareto’s density function that is lacking in a typical Zip’s law plot.
Figure 1a and b depict the cumulative histograms, specifically the Pareto distribution plots of the scaled counts from the spikein background and NUGC3 dilution dataset (See methods for details: Property of Type I Pareto distribution). The plots are segmented into its appropriate highestcount to lowestcount linear ranges based on an order of magnitude per segment (see vertical dotted lines across horizontal axis). In both cases, the highestcount segments approach the Zipf’s law (see dashed black line) which has a characteristic slope of − 1. Beyond that, the slope values generally decreased and finished with an inflection for the lowestcount segments. While there is a general convergence of slope values from the highestcount to the midcount segments, a specific divergence for the low and lowestcount segments can be readily seen. In the case of the dilution set, its divergence is more exaggerated (as marked by the splitend pattern) as a consequence of a more deliberate sequencing depth differences among the replicates. The latter marks the effects of the finitesize effects which plays a major role in the reproducibility of the observed distributions.
Meanwhile, the trend towards changing slopes along the count segments indicates a general deviation from a single powerlaw exponent. Based on the mathematical moments of the Pareto distribution (Eqs. 3 and 4), exponent values of below “1” indicates asymptotically infinite moments. The consequence of these infinite moments is that their empirical estimates can converge very slowly due to the frequent occurrences of extreme values [23]. When coupled with the changing exponents along the count segments, heteroskedasticity (i.e., unequal error variance) among the replicates can be expected from the imperfect powerlaw distributions.
To further emphasize, the scatterplots of the scaled counts for the 11 replicates of the spikein background set against the replicate with the highest total reads were examined in Fig. 1c. Concurrently, Fig. 1d depicts the scaled count of the 7 NUGC3 replicates of the dilution set against the NUGC3 12pM sample with the highest total reads. Similar segmented ranges are also superimposed on these figures. Complementing Fig. 1c and d, the regression slope of the powerlaw fit, the total number of points, the observed and expected standard deviation of each segmented range were computed and complied in Table 1. Of particular importance is the expected standard deviation which projects the expected heteroskedasticity of the replicate noise across the count segments. It is extrapolated from the observed standard deviation of a reference count segment after accounting for the slope differences between the reference segment and the other segments (See Table 1 legend for further explanation).
Essentially, the observed heteroskedasticity seen in the Fig. 1c and d exhibits the hallmark of the Pareto’s mathematical moments where a change in variance is perpetuated by a change in the powerlaw exponent. Furthermore, the observed heteroskedasticity can be divided into variances of reproducible (i.e., the degree of agreement between experimental results conducted by different individuals/locations/instruments) and irreproducible origin. Specifically, when heteroskedasticity is about equal between the observed (i.e., general spread of the datapoints) and the expected (i.e., margins marked by the dotted lines at 99% confidence interval) standard deviations, it is simply reflective of the reproducible replicate noise as for the cases of the highest to midcount segments. However, when heteroskedasticity spreads beyond the expected margins, it indicates additional irreproducible noise as for the cases of the diverged low and lowestcount segments. In the worst cases, the observed standard deviation exceeds that of the expected by about 2 times for the spikein background set and 3 times for the NUGC3 dilution set (See Table 1: values in red).
The irreproducible noise that plagued the diverging low and lowestcount segments, serves to highlight the instability of the replicated count values when the corresponding powerlaw mathematical moments stem not only from low exponent values but of noncomparable magnitude as well. The latter basically demonstrates the impact of the finitesize effects on the same physical system when sampled at different rates. Since irreproducibility can occur even for a set of replicates that has similar sequencing depths like the case of the spikein set, it is expected to be worse for any datasets that have more diverse depths as attested by the dilution set.
Unfortunately, none of the commonly used normalization methods namely DESeq [24, 25], Relative Log Expression (RLE) [24, 26], Trimmed Mean of Mvalues (TMM) [26, 27], UpperQuartile (UQ) [12, 26], Count Per Million (CPM) [26] and Quantile [18, 28]) can correct for the powerlaw deviations in both datasets; Both powerlaw deviation and heteroskedasticity remain (See Additional files 2: Figure S2 and Additional files 3: Figure S3).
Aliasing noise explains the finitesize effects that distorts the theoretical powerlaw distribution of sequencing count data
In fact, the sequencing procedure can be recast into a sampling problem: The total transcript population in a cell can be viewed as a library of unique transcript species with different frequency of occurrences. Simply put, this library can be thought as the composites of a continuous analogue signal. And when this analogue signal is subjected to sequencing, it undergoes a sampling procedure where the abundance of the individual transcript species in terms of its counts, is being quantified. Collectively, the digitized counts becomes the sampled signal of the original analogue signal.
Mathematically, a powerlaw type sampled signal Y(f) with an amplitude of S_{ o } and an exponent of α, can be described as the sum of its original signal S_{ o }f^{−α} and its alias term S_{ o }(f_{ s } − f)^{−α} given any frequency f (see Eq. 13) and can be visualized as a frequencydomain plot. With any sampling procedure, undersampling will occur when the Nyquist sampling criterion of f_{max} < 2f_{ s } is not satisfied where f_{max} is the largest frequency component of the original signal and f_{ s } is the sampling frequency. As a consequence, this will result in a nonzero alias term S_{ o }(f_{ s } − f)^{−α}. More specifically, the condition of aliasing where a distortion of the sampled signal Y(f) from its original signal will occur [29] (See methods for details: Derivation of the alias term in the powerlaw 1/f^{α}equation;
Eqs. 513).
In relation to the sampled signal Y(f), the rank variable y and maximum count value C_{1} of the rankfrequency equation (see Eq. 14) are analogous to the frequency f and the amplitude S_{ o } of Y(f) respectively. In turn, the rankfrequency and Pareto’s tail distribution are inversely related to each other (See methods for details: Transformation between rankfrequency and Pareto distribution; Eqs. 1417). Essentially, the Pareto plots can be straightforwardly transformed into a frequencydomain plot.
To determine if undersampling has occurred, the sampling frequency f_{ s } needs to be first determined between the sampled signal and its original signal to check if the Nyquist sampling criterion is fulfilled. The best estimate or surrogate of the original signal S_{ o }f^{−α} can be estimated from the replicate with the largest total reads within the data series. For the dilution set, this was one of the 12p NUGC3 sample which consists of a total of 632 unique count values. In the case of the spikein background set, the replicate with the largest total reads has 863 unique count values. Corresponding to their rankfrequency (frequencyamplitude) plots, this translate to a maximum rank (frequency) of 632 and 863 accordingly.
Using the respective surrogates as baseline, the observed alias noise between a sampled signal and its original signal can be then determined by taking their logarithmic differences as described by the mathematical expression logΔY(f) = log[S_{ o }f^{−α} + S_{ o }(f_{ s } − f)^{−α}] − log[S_{ o }f^{−α}] (see Eq. 19). Since Zip’s law (see eq. 14 where b = 1) holds for the high and highestcount segments of both datasets, the exponent term is implicitly set to α = 1. Alias noise ΔY(f) reaches its maximum when f = f_{max} such that ΔY(f) = ΔY(f_{max}), for which the sampling frequency f_{ s } can be solved by evaluating logΔY(f_{max}) (See methods for details: Solving for sampling frequency f_{ s }to determine undersampling; Eqs. 18–21).
Furthering the analysis of the scaled datasets in Fig. 1, Fig. 2 shows the rankfrequency plots for the NUGC3 dilution and the spikein replicates (marked in red). In particular, Fig. 2a–e show the plots for the 1.5p pair, 3p pair, 6p pair, single 12p replicate and the 11 UHR replicates against the best estimate of the original signals (marked in black). In addition, the observed alias noise (marked in blue), together with the corresponding theoretical alias noise S_{ o }(f_{ s } − f)^{−α} (marked in magenta), are shown in the subfigures. For each case, the sampling frequency f_{ s } and the mean square error (MSE is defined as the residual error between the observed and theoretical alias noise) are given as well. The overall low MSE values of between 5.67e4 to 3.58e3 indicates a good fit between the theoretical alias noise model and the observed alias datapoints.
Within the NUGC3 dilution set, the 1.5pM, 3pM, 6pM replicates have failed to satisfy the Nyquist sampling criterion of f_{max} < 2f_{ s } at sampling frequencies of 589, 592 and 1045 (See Fig. 2a–c) respectively. Since the minimum sampling frequency needed by the NUGC3 dilution set is 1264 (2 × 632), undersampling has occurred for these cases. Undersampling can also be concluded for the spikein background dataset at a sampling frequency of 1464 (See Fig. 2e) where the required minimum sampling frequency is 1726 (2 × 863). In contrast, only the single 12pM case had satisfied the Nyquist criterion at f_{max} < 3.4f_{ s } (See Fig. 2d). Theoretically, the sampling frequency for a zero alias noise tends to infinity (solve eq. 17 for ΔY(f_{max}) = 1 at f = f_{max}).
In hindsight, the finitesize effects has always plagued sequencingbased studies since the early days [7] where the alias noise manifests as the misfitted tail in Zipf’s law distributions. The magnitude of the finitesize effects is dependent on the severity of undersampling and it can now be quantified formally through a simple recasting of the Pareto plot to the frequencydomain plot.
The necessity of powerlaw correction on sequencing count data to restore distribution distortion
The restoration of the powerlaw plots towards a common powerlaw slope were applied to the NUGC3 dilution and spikein background data series. (See methods for details: Computation procedures for powerlaw correction of a count data set). Akin to Figs. 1 and 3 shows the Pareto plots and scatterplots of both the powercorrected spikein background and the NUGC3 dilution datasets with the same intrasample scaling applied. Table 2 complements Fig. 3 with the details on the regression slope of the powerlaw fit, the total number of points, the observed and expected standard deviation of each segmented range.
Generally speaking, the Pareto plots in both Fig. 3a and b show a powerlaw distribution with a more uniform slope throughout all count segments, which averages to about − 0.94 (see Table 2 column 3) for the spikein background data set and − 0.97 (see Table 2 column 7) for the NUGC3 dilution data set. The restoration to a single exponent of the Pareto plot through the powerlaw correction gives us an estimate of how the true underlying distribution (see dashed line that depicts the Zipf’s law distribution) would have looked if there had been no aliasing issues.
With larger slope values than before, it implies that the standard deviation for all count segments, should theoretically converge towards a smaller value. Indeed, Fig. 3c and d of the respective data sets show that the corrected count values exhibit less heteroskedasticity across all count segments and variation among the replicates. This reduced heteroskedasticity is to be expected if transcript abundance is powerlaw distributed and adheres to its mathematical moments (see Eqs. 3 and 4); In hindsight, it does indeed. Furthermore, based on Table 2 (markings in red), the difference between the observed and expected standard deviation is merely 1.1 times for the spikein background dataset and 1.6 times for the NUGC3 dilution dataset in the worst case. The stark improvement from before the powerlaw correction (i.e., worst case of 2 times and 3 times respectively) signifies that the irreproducible noise in the data series has been dramatically reduced in the form of alias noise. Overall, it translates to a smaller dynamic range for the corrected values where the uncorrected count values from the low and lowestcount segment have now been shifted to the midcount segment.
When the corrected spikein background and NUGC3 dilution data sets were subjected to a reanalysis of aliasing, the corrected datasets shows a general absence of undersampling. The rankfrequency plots for the corrected dilution replicates are depicted by Fig. 4a for the 1.5p pair, Fig. 4b for the 3p pair, Fig. 4c for the 6p pair and Fig. 4d for the single 12p, while Fig. 4e shows the corrected spikein background replicates for the set of 12 UHR replicates (marked in red). The best estimate of the original signal is marked by black in each figure. The corresponding observed alias noise (marked in blue), as well as the theoretical alias noise S_{ o }(f_{ s } − f)^{−α} (marked in magenta), shows very slight aliasing in all cases given their new sampling frequencies of 1720, 1311, 1783, 3315 and 1920 respectively. The overall low MSE values of between 6.00e4 to 1.87e3 indicates a good fit between the theoretical model and the observed alias.
Powerlaw correction should precede normalization; it increases signaltonoise ratio and sensitivity of statistical testing/detection in comparative analysis
To rigorously evaluate the impact on powerlaw correction in a typical NGS comparative analysis workflow, Fig. 5 shows the evaluation setup that permutes across 4 mapping algorithms (Bowtie1, Bowtie2(global) [30], Novoalign ( www.novocraft.com ) and BWA [31, 32]) and 6 normalization methods (DESeq [24, 25], Relative Log Expression (RLE) [24, 26], Trimmed Mean of Mvalues (TMM) [26, 27], UpperQuartile (UQ) [12, 26], Count Per Million (CPM) [26] and Quantile normalization [18, 28]). Furthermore, the comparisons were split into the positive (signal between NUGC3 and AGS samples) and the negative (noise within the NUGC3 replicates) tests. For the statistical analysis, the generalized linear model [33] from the EdgeR package [26] was used for the multiple contrasts where each comparison produced a set of foldchange values, average counts (in terms of countspermillion) and pvalues (See methods for details: Generalized NGS comparative analysis).
Figure 6a shows the MAplots (i,e., average count versus foldchange) of the Bowtie1mapped dilution dataset before (leftcolumn) and after (rightcolumn) the powerlaw correction for the 6 normalization algorithms (arranged in rowwise). This Bowtie1mapped set comprises of 637 paired AGSNUGC3 pairedtranscripts. Likewise, Fig. 6b–d depict the MAplots of the Bowtie2 (global), Novoalign and BWAmapped dilution analysis where the total amount of mapped transcripts are 657, 673 and 670 respectively. Their respective PPS settings was referenced from the Bowtie1mapped set’s optimum setting to standardize the parameter settings of the powerlaw correction step across the mapping algorithms (See methods for details: Computation procedures for powerlaw correction of a count data set).
For each MAplot, the positive signal is depicted in red while the noise is shown in blue. The noise model, as a simple linear regression of y = mx, attempts is depicted dotted line. For both signal and noise datapoints, their corresponding residual with respect to the fitted noise model gives the foldchange variation along the average count axis (or xaxis) and can be recapitulated into a summary statistics. Essentially, the summary statistics gives the amount of bias (the mean) and variance (the standard deviation) of the normalization method where an effective one should reduce variance without increasing bias [18]. Furthermore, signaltonoise ratio (SNR) of each mapping/normalization pair, defined as \( E\left({x}_{signal}^2\right)/{\sigma}_{noise}^2 \) where \( E\left({x}_{signal}^2\right) \) is the expectation of the second moment of the signal and \( {\sigma}_{noise}^2 \) is the variance of the noise, was also computed. For each mapping algorithm, the median measures of the signal residual, noise residual and SNR across all normalization methods are also taken and summarized in Table 3 (see Additional file 4: Table S1 for full details).
Throughout all the MAplots, heteroskedasticity in the noise comparisons (depicted in blue) can be readily seen without the powerlaw correction. Heteroskedasticity brings about two issues: Firstly, it introduces both bias and large variance to the comparisons as attested by the mean and standard deviation ranges of − 0.192 to − 0.153 and 2.189 to 2.229 for the positive comparisons (or signal) (Table 3 column 3). In contrast, this was between 0.001 to 0.006 and between 1.017 to 1.022 for powerlaw corrected analysis (Table 3 column 6). Overall, the correction improved the SNR by about 50% (i.e., 17–11/11) given the SNR of the corrected and uncorrected analysis at about 17 times and 11 times respectively (Table 3 columns 4 and 7).
Secondly, heteroskedasticity, which manifests as unequal variance, can bias the teststatistics where Type I and Type II error will increase with underestimated and overestimated standard errors respectively [19]. To further emphasize, Fig. 7a–d show the same Bowtie1, Bowtie2(global), Novoalign and BWAmapped dilution analysis in terms of their volcano plots (i.e., log foldchanges versus pvalues). Likewise, the left and right columns show the before and after powerlaw correction for the 6 normalization algorithms (arranged in rowwise).
In each volcano plot, the noise comparisons can essentially be treated as the null hypothesis. As such, the log foldchange and pvalue cutoffs (marked by double horizontal dotted lines and single vertical dotted line) for the purpose of deriving the significant number of transcripts in the positive comparisons, were determined from the largest absolute foldchange value and smallest pvalue of these 6 noise comparisons (in blue). The latter aims to exclude any falsepositives. Furthermore, the rate of change in pvalue against foldchange can also be derived from the two cutoff values and is indicated in each volcano plot. Finally, for each of the 4 positive comparisons, the exact breakdown of the number of significant transcripts for all combinations of mapping and normalization methods before and after powerlaw correction were computed (see Additional file 5: Table S2 for full details).
Based on the volcano plots, the slower rate of change in pvalues of the uncorrected cases when compared to the powerlaw corrected cases, implies that a higher foldchange threshold is required to achieve a comparable pvalue (or Type I error rate) during statistical testing. Consequently, the higher foldchange threshold also implies a larger type II error (i.e., failing to detect an effect that is present) for the uncorrected cases and hence, a compromised sensitivity on the statistical testing. Indeed, based on Table 4, the general number of significant transcripts are higher for the powerlaw corrected analysis than the uncorrected ones. The trend is consistent regardless of the mapping algorithms used when averaged over the 6 normalization methods for each positive comparison. Meanwhile, it should also be noted that the variation contributed by different normalization algorithms is larger than that of different mapping methods. Overall, the average increase in sensitivity (in terms of percentage) across the 4 comparisons after powerlaw correction, is between 26% to 28% (36~ 42 transcripts versus 50~ 57 transcripts) for the Bowtie1mapped analysis, between 27% to 30% (41~ 44 transcripts versus 58~ 61 transcripts) for the Bowtie2(global)mapped analysis, between 26% to 34% (36~ 43 transcripts versus 54~ 58 transcripts) for the Novoalignmapped analysis and between 26% to 32% (36~ 41 transcripts versus 53~ 58 transcripts) for the BWAmapped analysis.
Independent validation of powerlaw application on the full spikein data series
As an independent validation, the full spikein dataset which includes the 12 nonhuman spikein transcripts was also analyzed. Given 12 samples in total without technical replicates across conditions, the total number of possible pairwise comparisons is 66 cases (\( {C}_2^{12} \)) where the positive set is made up of the 12 spikein transcripts (or signal) while the negative set (or noise) is composed of 460 UHR transcripts after filtering for nonzero count values among the conditions. In addition, given that the original authors’ BWAmapped counts were used, the permutation step across the 4 mapping algorithms was excluded. Also, due to the cyclic latinsquare design of the spikein transcripts across the 12 samples, the uniqueness of each sample meant that there are no replicates and hence, statistical evaluation is not possible. Instead, the cutoff criteria for significant call is simply based on the foldchange. As an additional note, the optimum PPS setting for the powerlaw corrected data was evaluated to be 10 according to the optimization plot (See Additional file 6: Figure S5B). Note that due to the lack of replicates for the spikein transcripts, only the background set was used for the parameter estimation.
Figure 8 shows the receivers operator characteristics (ROC) curves for the 6 normalization methods: DESeq, Relative Log Expression (RLE), Trimmed Mean of Mvalues (TMM), UpperQuartile (UQ), Count Per Million (CPM) and Quantile normalization. For each ROC plot, the sensitivity and specificity values were derived through the permutation of the log foldchange range of the noise comparisons. The plot without correction is shown in red while the powerlaw corrected one is depicted in blue. From the ROC plots, there is an obvious improvement in the performance across all tested normalization methods after the powerlaw correction. Among the methods, the performance is almost comparable to one another with the exception of the quantile normalization method. Furthermore, to compare against the BWA performance of the dilution analysis, the sensitivity of the spikein analysis for each normalization method was evaluated at the falsepositive rate of 0 (See the sensitivity values before and after powerlaw correction in the ROC plots). As compared to the improvement in statistical sensitivity of 26% to 32% in the dilution analysis, the improvement in detection sensitivity for the spikein analysis is lower (i.e., between 15% to 17%) across all the methods since its undersampling condition was less severe than that of the dilution data set.
Powerlaw correction improves the concordance in significant transcript call among normalization algorithms, especially with increased sequencing depth
Another important implication of the powerlaw correction is that the improved concordance in significant transcript call among the different normalization methods [12,13,14] will decrease the workflow’s dependency on the variations in specific algorithms. Returning to the dilution data set analysis, Table 5 gives the average concordance in significant calls by various mapping/normalization methods (see Additional file 5: Table S2 for the detail breakdown). It summarizes the level of agreement between the 6 normalization algorithms per mapping method for the positive comparisons in NGS workflow as shown in Fig. 5. Briefly, the “intersect” row gives the total number of common significant transcripts with the same foldchange directionality among the 6 algorithms, the “union” row gives the total number of significant transcripts reported by any of the 6 algorithms while the concordance ratio (in %) is taken between the “intersect” total and the “union” total. The concordance ratio serves as an unbiased measure given its doubleedged sword nature; While an increase in significant call by all algorithms is necessary to increase the “intersect” count, it also increases the likelihood that only some of the algorithms are making the call, thus lowering the concordance ratio.
With the powerlaw correction, the increase in the “intersect” total has almost doubled for all mapping/normalization combinations across all comparisons (see “intersect” rows). Meanwhile, the corresponding increase in the “union” total is less than onequarter at its worst (see “union” rows). This gives an increase of about 22% in concordance rate after the powerlaw correction i.e., (70.32 ± 6.72)% versus (48.24 ± 7.07)% (See “summary statistics” first row in Table 5). When the comparisons are further stratified by their sequencing depths (i.e., AGS12p and AGS3p comparisons), an increase in sequencing depth does not necessarily improve the concordance rates. In fact, the higher sequencing depth AGS12p instance has a lower concordance rate of (44.6 ± 4.91)% than that of the lower sequencing instance at (51.88 ± 7.26)% (See “summary statistics” second row in Table 5). In retrospect, although the number of significant transcript calls or the “intersect” total has generally increased with a higher sequencing depth, the inconsistency in significant transcript calls among the various normalization methods (i.e., the “union” total) has increased at a faster rate which resulted in a lower concordance rate despite the higher sequencing depth.
With the powerlaw correction, a higher sequencing depth correctly returns a higher concordance rate. Between the uncorrected and powerlaw corrected analysis, the improvement is somewhat asymmetrical where it was about 32% (44.6% ± 4.91% versus 76.25% ± 1.78%) for the higher sequencing depth AGS12p instance while this was about 13% (51.88% ± 7.26% versus 64.39% ± 3.65%) for the lower depth AGS3p instance. It remains that sufficient sequencing depth is necessary to generate enough information but when the condition is met, powerlaw correction will be able to extract any additional information content to increase significant detection.
Enhanced statistical conclusions elucidates the metastatic potential of the NUGC3 gastric cancer cell line
While both AGS and NUGC3 cell lines were commonly described as gastric adenocarcinoma according to the Cellosaurus database (version 22; http://web.expasy.org/cellosaurus/ ), NUGC3 was derived from a distal metastasis site  the Brachialis muscle of a male patient and AGS is presumably taken from the primary site of a female patient. Therefore, their comparison should elude the metastasis potential of the NUGC3 cell line beyond the common gastric adenocarcinoma. According to current literature, the common metastasis site of stomach cancer (in ascending order) is the liver, peritoneum, lung and bone [34, 35] while it is considerably rare to spread to the pancreas and skeletal muscle [36, 37]. When compared to generic adenocarcinoma which often spreads to the liver and lung [38], signetring adenocarcinoma frequently metastasizes within the peritoneum, bone, ovaries and sometimes to the breast [34, 39].
In our comparative study of the two gastric cell lines, the Bowtie1mapped concordance transcripts from Table 5 before and after powerlaw correction were independently subjected to geneset enrichment analysis (GSEA) via the MiEAA webserver to identify plausible disease groups from the collection of Human microRNA and Disease Database (HMDD). Briefly, using the Bowtie1mapped results from Table 5, the concordance transcripts across the 4 comparisons before powerlaw correction (see “intersect row”; columns 3–6) were compiled into a union set of concordance transcripts. The same was done for the powerlaw corrected comparisons (see “intersect row”; columns 7–10). Altogether, the uncorrected and powerlaw corrected union sets consist of 30 and 52 concordance precursor miRNA transcripts respectively (see Additional file 7: Table S3 columns 1 and 2). The uncorrected list exceeded the maximum intersect value of 28 (AGS12p versus NUGC3–12p) due to some slight variations among the 4 comparisons. Between the two concordance sets, the uncorrected set is almost a complete subset of the corrected set; one transcript is unique to the uncorrected set while this was 23 for the corrected set (See Additional file 7: Table S3 columns 3 and 4).
Thereafter, both lists were independently subjected to geneset enrichment analysis (GSEA) via the MiEAA webserver to identify plausible disease groups from the collection of Human microRNA and Disease Database (HMDD). For the powerlaw corrected list, the specific parameters are as follows: count≥10 and FDRadjusted p ≤ 0.05; This gives a maximum expected value of 0.5 for falsepositives (FP). To match the FP count of 0.5, the necessary parameters for the uncorrected list are: count≥5 and FDRadjusted p ≤ 0.1 (See Table 6 legend for detailed explanation).
Table 6 consolidates the identified HMDD categories of both analysis sorted by observed count, then by FDRadjusted pvalue. The expected baseline category  “adenocarcinoma” was used as the cutoff point for significance and hence, any categories beyond it were considered as insignificant hits (marked in black). Within the significant categories, there are two likely falsepositive hits (marked in blue). They are the “Leukemia, Myeloid, Acute” hit that should be grouped with the nonsignificant “Leukemia, Lymphocytic, Chronic” and the “Carcinoma, Squamous Cell” hit that should group with the nonsignificant “Esophageal Neoplasms” hit to explain esophageal cancer.
Between the uncorrected and powerlaw corrected result sets, the latter presents the stronger evidence of expected gastric adenocarcinoma through its more significant pvalues for both “stomach neoplasms” and “adenocarcinoma”. Likewise, the remaining significant hits suggest several neoplasms and carcinoma (“lung neoplasms”, “pancreatic neoplasm”, “ovarian neoplasm” “carcinoma, nonsmallcell lung”) as possible metastasis sites for NUGC3 with stronger statistical conclusions being drawn from powerlaw corrected analysis. In addition, powerlaw analysis discovers two more metastasis categories  “carcinoma, hepatocellular” and “breast neoplasms” with significant pvalues 0.015 and 0.023 respectively. Overall, the powerlaw corrected analysis concurs significantly better with the clinical evidence.
Conclusion
Specifically, our work has identified and mathematically quantified an important technical limitation of the sequencing technology for transcriptomics applications where finitesize effects due to undersampling [15, 29] can have profound effects on the reproducibility and statistical qualities of underlying transcript abundance distribution for its subsequent interpretation; This is independent of the advancement in sequencing technology since sampling is finite in the real world. With a simple distribution correction, the signaltonoise ratio and sensitivity of statistical detection in a typical comparative analysis can experience an instant and dramatic improvement that greatly impacts the reliability of the final biological interpretation of the study.
Methods
Property of type I Pareto distribution
When transcript abundance is being visualized in a rankfrequency plot, the Zip’s law [3,4,5,6,7] is specifically being singled out. Meanwhile, there exists a close relationship between the family of Pareto distributions (Type I, II, II and IV) to the Zip’s law; Type II to IV Pareto distributions varied from Type I mainly from the addition of a location and shape parameter that are irrelevant to the modelling of transcript abundance. Among the Pareto family, the Type I Pareto distribution remains the most mathematically compatible to the rankfrequency plot where their two axis can be shown to be interchangeable (See methods for details: Transformation between rankfrequency and Pareto distribution).
Mathematically, the probability (PDF) and cumulative (CDF) density function of the Type I Pareto distribution are defined as:
for the interval x ≥ x_{min} and x_{min} is the minimum value of the distribution and is necessarily positive (i.e. x_{min} > 0). In addition, the Pareto’s tail distribution (complementary CDF) is simply defined as P(X > x). Correspondingly, the mean and variance of the Pareto distribution are given as:
Therefore, for large values of the exponent term s, the corresponding mean μ and variance term σ^{2} converges towards smaller values for a fixed x_{min}.
Derivation of the alias term in the powerlaw 1/f ^{α} equation
Aliasing refers to a distortion or an artifact when a reconstructed signal differs from its original continuous signal. In this section, the alias term for the powerlaw equation 1/f^{α} is derived. Note that the main derivation originates from Kirchner [29] and this section provides only a concise adaptation.
Given a time series x(t), its Fourier transform of its discrete sampled time series y(t) is given as:
Furthermore, given that the sampling function III(t) is a periodic function at a sampling interval of Δt = 1/f_{ s }, it can be defined as:
where \( {c}_k=\frac{1}{\Delta t}\underset{\Delta t/2}{\overset{\Delta t/2}{\int }}\partial \left({f}_st\right){e}^{i2\Pi {kf}_st} dt=\frac{1}{\Delta t}\frac{1}{f_s}=1 \) for all k.
Combining Eqs. (5) and (6), one can reexpress the Fourier transform of y(t) into:
Also, given that the summation is taken over all k, the term −kf_{ s } can replace by kf_{ s }. Together with interchanging the summation and integration sign, one yields the following:
In addition, the sampled function Y(f) can be decomposed into its original signal X(f) and its alias components as follows:
Since x(t) is a real function, its Fourier transform X(f) is Hermitian. Therefore, X(−f) = X(f) and Eq. (9) can be written for positive frequencies only as follows:
Substituting the powerlaw equation X(f) = S_{ o }f^{−α} into (10) yields:
For Eq. (11) to converge mathematically, (i) the high frequency component (kf_{ s } + f) cannot be extended infinitely; In realworld, high frequency components fall off faster than 1/f^{α} way above the sampling frequency) and (ii) the condition where α > 1 needs to be satisfied. Hence, the Fourier transform of x(t) can be simplified to the following form:
Furthermore, for a bandlimited signal of 0 ≤ f ≤ f_{max}, the only relevant alias term is (f_{ s } − f_{max}) where k = 1, since (kf_{ s } − f_{max}) > 0 will satisfy the Nyquist sampling criterion of f_{max} < kf_{ s } for which k ≥ 2. In other words, aliasing will not occur for k ≥ 2. Finally, the powerlaw Fourier series of x(t) with the relevant alias term when undersampling occurs, is given as:
where Y(f) is the sampled function, S_{ o }f^{−α} is the original signal and S_{ o }(f_{ s } − f)^{−α} is the alias component.
Transformation between rankfrequency and Pareto (type I) distribution
The Pareto (Type I to IV) distribution belongs to the large family of powerlaw distributions; the subsequent derivation refers specifically to the Type I Pareto distribution. Given an observation, the Pareto’s tail distribution (complementary CDF) describes how many cases are seen greater than the observation in terms of cumulative density function (CDF). Meanwhile, the rankfrequency distribution is an inverse CDF (quantile function) seen in a reverse order with respect to the Pareto distribution, where it depicts the occurrence of the observation at a given rank.
First, let the rankfrequency equation be defined as:
where y is a y^{th} ranked value and x is the number of observed occurrences at y. One can further implies that there exists y number of values for which their corresponding x values are greater than C_{1}y^{−b}. As such, one can write a cumulative density function for random variable X for the number of observations larger than C_{1}y^{−b} in the form:
where C_{2} is a normalization constant such that P(X ≥ C_{1}y^{−b}) ≤ 1 must be satisfied. Then, rearranging Eq. (14) into \( y={\left[\frac{x}{C_1}\right]}^{\frac{1}{b}} \) and substituting it into Eq. (15) yields the Pareto’s tail distribution or complementary CDF:
For completeness sake, one can replace \( {x}_{\mathrm{min}}={C}_1{C}_2^b \) to obtain the usual Pareto’s tail distribution form of \( P\left(X>x\right)={\left[\frac{x}{x_{\mathrm{min}}}\right]}^{\frac{1}{b}} \) for x ≥ x_{min}. Meanwhile, to convert from the complementary CDF to the complementary cumulative total function (CTF), the expression can simply be rearranged as follows:
Hence, comparing terms in Eqs. (14) and (17), it can be seen that the Pareto’s tail distribution (in terms of complementary CTF) and rankfrequency distribution are inversely related.
Solving for sampling frequency f _{ s } to determine undersampling
Taking logarithm on both sides of Eq. (13), the sampled function Y(f) can be rewritten in logarithmic form as:
The second term on the right handside gives a distortion ratio between an aliased signal S_{ o }f^{−α} + S_{ o }(f_{ s } − f)^{−α} and original signal S_{ o }f^{−α}. As such, let the distortion ratio ΔY(f) be defined as:
Further simplification yields:
And solving for the sampling frequency f_{ s } gives:
For a rankfrequency plot where Zipf’s law holds (i.e., α = 1), f_{ s } can directly be evaluated when f = f_{max}, ΔY(f) = ΔY(f_{max}).
Derivation of the powerlaw correction factor
In an earlier section, the rankfrequency distribution and Pareto’s tail distribution has been proven to be inversely related to each other. For the purpose of estimating the exponent term in the rankfrequency plot, a better approach is to use Pareto’s tail distribution. This is because the largeranked tail of rankfrequency distribution tend to be clustered with small values of the same rank. As a result, this give a horizontal tail. In contrast, the same segment is always monotonicallyincreasing in Pareto. As such, let the count and rank of the i^{th} transcript be x and y respectively. Then the rankfrequency equation in its Pareto’s tail distribution form or complementary CTF can be written as.
where y = C_{2} ⋅ P(X ≥ x), \( k={C}_1^{s} \) and \( s=\frac{1}{b} \) from Eq. (17).
Taking logarithm on both sides, the expression is rewritten as:
where the slope and intercept are represented by m = − s and log_{ b }k respectively. Then, to convert the original slope and intercept (m, log_{ b }k) to a reference set of parameters (m_{ ref }, log_{ b }k_{ ref }), we let:
In the original scale, the rankfrequency equation can be reexpressed as:
Finally, the corrected count x^{'} is given as:
The powerlaw correction is implemented in PERL language and can be downloaded from the supplementary website [22].
Computation procedures for powerlaw correction of a count data set
The restoration of an observed distribution towards an uniform powerlaw entails that the slopes of all count segments to be the same. The reference powerlaw slope is taken from the highestcount segment since this segment is sampled from the higher abundance transcripts and should have the best mathematical convergence towards its real value. And with the correction towards a common slope, it is expected that all count segments will have similar variation among the replicates and that the overall heteroskedasticity should be dramatically reduced. Without the loss of generality, the proposed sequencing count correction will be, herein, named as the powerlaw correction.
In the actual implementation of the powerlaw correction procedure, there are two important computational aspects to note. Firstly, for the purpose of estimating the exponent term in a rankfrequency plot, the Pareto equation (see Eq. 21) is used rather than Zipf’s (see Eq. 14) because the largeranked tail of Zipf’s law tends to be clustered with small values of the same rank. As a result, this gives a horizontal tail which is suboptimal for slope estimation. In contrast, the same segment is always monotonically increasing in Pareto.
Secondly, the powerlaw correction is performed at a persample level. The total number of count segments in a Pareto plot is dependent on a fixed number of points per segment, herein, as pointspersegment (PPS). The partitioning of points will start from the highest count value. For each partitioned count segment, a set of slope and intercept (m, log_{ b }k) values will be solved using linear regression (see Eq. 22). The firstfitted count segment of the replicate which mimics the highestcount segment, will be used as the reference set of slope and intercept (m_{ ref }, log_{ b }k_{ ref }) values for the subsequent powerlaw correction via Eq. 26.
To find the optimum PPS setting that will yield the best overall fit between any replicate to a reference replicate in a Nsample dataset, the PPS parameter first needs to be permuted across a range of between 5 to 100 at an interval of 5. At a given PPS setting, two measures can be derived. First, the median of the N firstfitted count segment slopes of the data series can be taken. Secondly, a total of (N1) R^{2} (i.e., coefficient of determination) values can be derived from the linear regression results between the N1 replicates against the reference replicate. Consequently, a median R^{2} can also be taken.
The preceding computational procedures were then applied to the original BWAmapped spikein background and Bowtie1mapped NUGC3 dilution data. Additional file 8: Figure S4A and S4B show the median slope of the firstfitted segments versus the median R^{2} value of the spikein background set and the NUGC3 dilution set respectively. The PPS values are indicated besides the data points in the plots. Like before, the reference replicate was taken as the replicate with the largest total reads within the data series for the necessary R^{2} computations. For both Figures, the refined solution space of the optimum PPS is indicated by the error margins defined by the slope of the first highestcount segments from Table 1. Within this margin, the optimum PPS value is determined by the largest median R^{2} value. As such, the optimum PPS settings for the spikein background set and the NUGC3 dilution set are 20 and 45 respectively. The subsequent analysis is then based on the powerlaw corrected data sets using these PPS settings and their associated median slopes as the reference slope values for the respective data series. Similarly, the procedures were also applied to the BWAmapped spikein and Bowtie1mapped full dilution data sets to obtain the optimum parameters (see Additional file 6: Figure S5A and S5B). The parameter sets were subsequently used on the Bowtie2(global)mapped, Novoalignmapped and BWAmapped full dilution data sets to generate the results in Table 3.
The dilution dataset
Overview of design: The dilution series was created for two gastric cancer cell lines  AGS and NUGC3. The NUGC3 set consists of 8 replicates and spans across 4 concentration points of 12p, 6p, 3p and 1.5p so that each concentration contains exactly two technical replicates. Meanwhile, the AGS set is similarly designed except that it consists of 4 replicates across 2 concentrations of 12p and 3p. The varying concentration design aims to simulate the different sequencing depth (i.e., the total mapped reads) that mimics a system of various sizes to study its finitesize effects. The original sequencing files (in FASTQ format) of this dilution dataset can be downloaded from the supplementary website [22].
Sample preparation (Total RNA extraction): Isolation of total RNA from AGS and NUGC3 was performed using a Qiagen miRNeasy mini kit (Qiagen). Briefly, 5× volume of QIAzol lysis reagent was added to 1 million cells, incubated at room temperature for 5 min to disrupt and homogenize the cells. 1 volume of chloroform is then added to the tube, shaking vigorously for 15 s and incubates at room temperature for 2–3 min. Mixture is then transferred to a 2 ml Qiagen MaXtract high density tube and centrifuged for 15 min at 12,000 g for phase separation. Upper aqueous phase is carefully transferred to a new collection tube and 1.5 volume of 100% ethanol is added to aqueous phase for precipitation of total RNA in aqueous phase. The mixture is then pass into the RNeasy mini elute spin column (700ul each time) placed in a 2 ml collection tube. The column is spin at ≥8000 g for 15 s at room temperature and flow through is discarded. Process is repeated until all mixture has pass through column. Column is washed with 700ul of Buffer RWT and centrifuged at ≥8000 g for 15 s at room temperature Column is further washed with 500ul of Buffer RPE, spin at ≥8000 g for 15 s at room temperature. Lastly, column is washed with 500ul of 100% ethanol, centrifuge for 2 min at ≥8000 g. Column is transferred to a new collection tube and spin at ≥8000 g for 5 min at room temperature to remove residual ethanol and total RNA elute in 10ul of RNasefree water.
TruSeq small RNA library construction and sequencing: 6 (4 for NUGC3 and 2 for AGS) small RNA libraries were prepared in parallel for both NUGC3 and AGS cell lines using the Illumina TruSeq small RNA sample preparation kit according to manufacturer’s instruction. The 6 samples were uniquely indexed to enable sequencing of all 6 libraries in one MiSeq flow cell. Briefly, 1μg of total RNA was ligated with 5′ and 3′ adapter, cDNA was converted with SuperScript II Reverse Transcriptase and RT Primer. The cDNA was PCR amplified for 12 cycles with RNA PCR Primer and unique PCR Primer Index provided; It is important to note that indexing during PCR amplification minimizes the issue of barcoding bias [40] which masks significant expression differences between miRNA libraries. Amplified cDNA construct were first purified using QIagen MinElute PCR Purification kit and the construct were then size selected for fragments ranging between 145 bp to 150 bp using 10% TBE PAGE Gel. The indexed libraries were quantified individually by qPCR using KAPA SYBR FAST qPCR Kit (Kapa Biosciences, inc). To stimulate differences in sequencing depth in a multiplex sequencing experiment, the small RNA libraries for the NUGC3 cell line were pooled such that there was a 1, 2, 4 and 8× difference in concentration between the four unique libraries (12pM, 6pM, 3pM, 1.5pM). Small RNA libraries for AGS was pooled such that there is a 4× difference in concentration between the two unique libraries (12pM and 3pM). The libraries from both cell lines were pooled to yield a single pooled library and sequenced twice on the MiSeq instrument using MiSeq Reagent v2 for 1 × 40 + 6 (index) sequencing cycle (Illumina Inc., CA, USA).
Generalized NGS comparative workflow
Read mapping:
Raw data in FASTQ format was preprocessed using Trimmomatic [41] version 0.33 by trimming adapter sequences, removing trailing or leading low quality bases (base quality below 3). Subsequently, scan the reads with a 4base wide sliding window and trim when the average base quality drops below 15. Specifically, the command for Trimmomatic is:
The preprocessed reads were then aligned to miRBase v21 primary sequences using three different aligners, i.e. Bowtie (version 1.1.1 and 2.3.0) [30], Novoalign ( www.novocraft.com ; version V3.04.06) and BWA (version 0.7.12r1039) [31, 32] with the specific parameters as shown below:
Aligned reads in BAM format is then quantified using BEDtools [42] by counting how many reads map to each of the miRNA transcript. The respective mapped count files can be downloaded from the supplementary website [22].
For normalization, the EdgeR, DESeq and preprocessCore R packages were used in this work. Prior to normalization, the data is first organized into its specific cell lines (NUGC3, AGS) and concentration (12pM, 6pM, 3pM, 1.5pM) groups of 2 technical replicates via the following command:
Next, the data is read from an input file to perform the specific normalization. At the same time, an EdgeR DGElist object and the associated normalization factors for the proper scaling of the raw library sizes will also be created.
For DESeq normalization, the combined commands are as follows:
For Quantile normalization, the combined commands are as follows:
For CPM normalization, the combined commands are as follows:
For TMM, RLE, upperquartile normalization where m takes one of the following values “TMM”,“RLE”,“upperquartile”, the commands are as follows:
For performing statistical analysis, the generalized linear model (GLM) [33] from the EdgeR package was used. First, the count data is first fitted to the negative binomial model in the EdgeR package [26] for the purpose of estimating the common and tag dispersion. This is achieved through the CoxReid profileadjusted likelihood methods via the following commands:
Next, to allow for multiple contrasts in the comparison of AGS cell line against NUGC3 cell line, the GLM design matrix is to be set up. Specific to the dilution data set, the 4 treatment contrasts are AGS12p versus NUGC12p, AGS12p versus NUGC3p, AGS3p versus NUGC12p and AGS3p versus NUGC3p while the 6 control contrasts are NUGC36p versus NUGC3–12p, NUGC3–3 versus NUGC3–12p, NUGC3–1.5p versus NUGC3–12p, NUGC3–3p versus NUGC36p, NUGC3–1.5p versus NUGC36p and NUGC3–1.5p versus NUGC3–3p. This translates to the following commands:
To perform the GLM likelihood test for the 4 treatments and 6 controls, the following commands were issued:
Reviewers’ comments
Reviewer’s report 1: Oliviero Carugo, University of Vienna, Austria
The manuscript submitted by Wong and coworkers describes a computational technique that minimizes finitesize effects in NGS datasets and robustly improves the reproducibility of the results. It is an interesting example of how statistical tools may distort reality (see for example an article on Nature today: https://www.nature.com/articles/d4158601707522z) and should be used with extreme caution. It is also a nice example of how statistics begins when science ends. The methodology is described with high accuracy as well as the tests performed with both inhouse and publicly available NGS data. Although very long and perhaps prolix and although the math level is probably inaccessible to most of the Biology Direct readers, I think that this manuscript deserves publication because it might inspire further research in this field.
Authors’ response: We thank the reviewer for his positive comments. The concept behind the observed powerlaw distortion required a rigorous treatment as it has never been addressed in current literature and therefore, the length of the article. At the same time, we agree that the mathematics seems complex yet it was necessary for a complete treatment of the topic. Interestingly, even specialized bioinformatics journals shy away from our findings due to its lack of perceived appeal to readers attributed by the heavy mathematical contents; Regrettably, the mathematics cannot be further simplified. Taken together, we deeply appreciate the reviewer for his support of this manuscript.
Reviewer’s report 2: Thomas Dandekar, Department of Bioinformatics, University of Wuerzburg, Germany
I have the following comments: At present I would think the normal reader (non mathematician) realizes: "yes, this could be an important correction, but I am not sure.".
1) So I think everything which makes the article easier to understand and more accessible would be nice. First of all, explain Zipf’s law. It is a power law probability distribution. Thus the frequency of any word is inversely proportional to its rank in the frequency table (at least like this the linguist Zipf stumbled upon it). Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.: the rankfrequency distribution is an inverse relation.
Authors’ response: We have expanded the Zipf’s law explanation in the first paragraph of the “Background” section to give the readers a better understanding of the origin and characteristics of Zipf’s law.
2) I recommend I would start the article results section with a figure explaining and showing the assumed Zipf distribution regarding the sequence count data and then illustrate in the same figure how now the corrected distribution looks like (the property of type I Pareto distribution). Furthermore, it is critical to show now how the observed distribution of tag counts for the sequencing data set looks like. Ideally for the reader then it should be readily to grasp that the new function really fits better the observed data and this message should be transported by the introductory figure of the results.
Authors’ response: Although less intuitive than the reviewer’s suggestion, we have added the Zipf’s distribution to show how the original observed distribution deviates from Zipf’s law (see dashed lines in Fig. 1a and b) and how the corrected observed distribution now coincides with the Zipf’s law (see dashed lines in Fig. 3a and b). The necessary text has also been added to the associated section where the figures were being discussed.
Mainly, what we wanted to achieve in the introductory message of the results section is to (i) show the observed distribution which suffers from a curvature will fit an undersampled powerlaw equation Y(f) = S_{ o }f^{−α} + S_{ o }(f_{ s } − f)^{−α}(eq. 13) and that (ii) correcting the alias noise reverts the distribution to the form Y(f) = S_{ o }f^{−α}(analogous to x = C_{1}y^{−b}(eq. 14) of the rankfrequency plot). As a consequence, the corrected observed data now fits better with Zipf’s law (i.e., x = C_{1}y^{−b}where b ≈ 1) as shown in Fig. 3a and b.
3) Another point is whether that correction is the best possible correction: could it for instance not be possible to find the best distribution by some datadriven modelling? 3b) Or are there some analytical results available why for instance a type II Pareto distribution would perform less?
Authors’ response: The reviewer brought up an interesting issue on the interplay between the datadriven approach and modeldriven (i.e., analytical forms) approach. On one hand, current sequencingbased transcriptome data suffers from inherent undersampling issue which has a direct impact on the distributional shape and hence, a purely datadriven approach is not optimal. Meanwhile, a purely modelbased approach to force all segments in a distribution towards a strict Zipf’s law without a good justification can be overbearing and might lead to overfitting. In our work, we balance between both datadriven and modeldriven approaches by correcting the middle and tail segment of the distribution (i.e., modeldriven) towards the exponent value of the fitted (i.e, datadriven) highabundance segment of the distribution which incidentally and approximately obeys the Zipf’s law.
As a side note, Pareto Type I has a direct 1:1 relationship to the Zipf’s law and has a support from x ∈ [x_{min}, ∞). For modelling transcript count which necessarily starts from at least one (i.e., x_{min} ≥ 1), Pareto Type I (or Zipf’s law) seems to be the most apt distribution within the Pareto family. Meanwhile, Pareto Type II (or Lomax distribution) is simply a shifted Type I such that its support starts from 0. Mathematically, it is as follows:
For modelling transcript count, the extra range of 0 to 1 has no relevance.
4) The confidence of the reader would increase if you can claim that you present the current dataset but you have the correction on e.g. ten other, unrelated data sets and each time the type I Pareto distribution was the best. 4b) Even better would be to rationalize the assumed correction by the typical distribution of sequences. p.10 does something in this direction, but what I was thinking of is more a physical explanation and best taking into account specifics of the used NGS technique, for instance may be with pacific biosciences sequencing the correction should be completely different, right?
Authors’ response: To recapitulate, SAGEbased messenger RNA data fits Type I Pareto distribution, particularly the Zip’s law relatively well [3,4,5,6,7] other than the low abundance tail segments. Independently of previous findings, we also found that NGSbased microRNA data follows the same trend in this work. When we investigated the NGSbased messenger RNA (GSE47774) of the Universal Human Reference (UHR), we found that Zipf’s law holds approximately for both the middle segments of the observed distributions (see Additional file 9: Figure S6) despite the differences in count quantification approach between HTSeq [43] and RSEM [44] (i.e., conservative versus greedy mapping approach). Expectedly, the low abundance segments exhibit curvatures albeit different in their slope trends.
Of particular interest is that the highest and high segments in NGSbased messenger RNA data tends to exhibit steeper slopes than the Zipf’s law which characterizes the SAGEbased messenger RNA data. Preliminary conclusions suggests that this is attributed to transcriptlength bias in NGSbased sequencing that is absent in SAGEbased sequencing for the messenger RNA species [9]. In other words, these high and highest NGSbased segments suffer from overestimated counts that arise from abundant transcripts with multiple pairend reads due to longer transcript lengths. As a side note, the differences in the slope trends for the low, high and highest segments between the HTSeq and RSEM quantified distributions implies that quantification algorithms generally do introduce bias in the count estimates and impacts on distributional shapes.
Nevertheless, regardless of the differences in technology (SAGE versus NGS), RNA species (microRNA versus messenger RNA) and count quantification algorithms (HTSeq versus RSEM), there exists common segments in the distribution that seems to follow the Zip’s law (i.e., a specific instance of Type I Pareto distribution where its exponent term equals to 1) in our preliminary investigations. However, a generalization of Zipf’s law on transcript distribution over all types of conditions will require a separate and more thorough investigation that is beyond the scope of this manuscript.
5) Apart from the questions I raise here I personally am convinced that such a correction is important and basically does the right thing. So another good point to spread the word would be to make some material (just the script used, page 24–26) available for download together with a tutorial, best of course integrated into R or some other gene expression analysis standard.
Authors’ response: The code is currently available at the supporting data website at http://mendel.bii.astar.edu.sg/SEQUENCES/PLSDBC/, but it is likely that we will rewrite the code in R language and to provide tutorial for future releases.
6) General final comment: the better understandable the language, the easier and intuitive clear figures, the more the people will understand your nice findings and actually APPLY them (which currently does not happen so often and hence leads then to wrong conclusions).
Authors’ response: We thank the reviewer for his positive comments of our work and his constructive suggestions to improve this manuscript.
Reviewer’s report 3: Sandor Pongor, International Centre for Genetic Engineering and Biotechnology (ICGEB), Italy
To the discretion of the authors: The authors may want to show results on more datasets or just preliminarily indicate how the findings generalize to other datasets. Also, instructions for practical use, availability of codes would be useful provided the authors do not plan to publish these data elsewhere.
Authors’ response: We thank the reviewer for his positive comments. In fact, the concept has been generalized beyond microRNA to messenger RNA sequencing data (see Additional file 9: Fig. S6) where we found the general trend of Zipf’s law in transcript abundance. In an ongoing work, we were able to show an increase in sensitivity of a miRNAmRNA analysis that leads to enhanced biological conclusion when the finitesize effect or powerlaw correction is applied; This is in a current working manuscript.
Also, although the code is already currently available at the supporting data website i.e. http://mendel.bii.astar.edu.sg/SEQUENCES/PLSDBC/, it is likely that we will rewrite the code in R language for future releases.
Abbreviations
 CDF:

Cumulative density function
 CPM:

Count Per Million
 CTF:

Cumulative total function
 NGS:

Nextgeneration sequencing
 PDF:

Probability density function
 RLE:

Relative Log Expression
 SNR:

Signaltonoise ratio
 TMM:

Trimmed Mean of Mvalues
 UQ:

UpperQuartile
References
Stumpf MP, Porter MA: Mathematics. Critical truths about power laws. Science 2012, 335:665–666.
Newman MEJ. Power laws, Pareto distributions and Zipf's law. Contemp Phys. 2005;46:323–51.
Luscombe NM, Qian J, Zhang Z, Johnson T, Gerstein M: The dominance of the population by a selected few: powerlaw behaviour applies to a wide variety of genomic properties. Genome Biol 2002, 3:RESEARCH0040.
Ogasawara O, Kawamoto S, Okubo K. Zipf’s law and human transcriptomes: an explanation with an evolutionary model. C R Biol. 2003;326:1097–101.
Konishi T. Threeparameter lognormal distribution ubiquitously found in cDNA microarray data and its application to parametric data treatment. BMC Bioinformatics. 2004;5:5.
Ueda HR, Hayashi S, Matsuyama S, Yomo T, Hashimoto S, Kay SA, Hogenesch JB, Iino M. Universality and flexibility in gene expression from bacteria to human. Proc Natl Acad Sci U S A. 2004;101:3765–9.
Furusawa C, Kaneko K. Zipf’s law in gene expression. Phys Rev Lett. 2003;90:088102.
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNASeq. Nat Methods. 2008;5:621–8.
Oshlack A, Wakefield MJ. Transcript length bias in RNAseq data confounds systems biology. Biol Direct. 2009;4:14.
Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, Veyrieras JB, Stephens M, Gilad Y, Pritchard JK. Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature. 2010;464:768–72.
Tam S, Tsao MS, McPherson JD. Optimization of miRNAseq data preprocessing. Brief Bioinform. 2015;16:950–63.
Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methods for normalization and differential expression in mRNASeq experiments. BMC Bioinformatics. 2010;11:94.
Garmire LX, Subramaniam S. Evaluation of normalization methods in mammalian microRNASeq data. RNA. 2012;18:1279–88.
Dillies MA, Rau A, Aubert J, HennequetAntier C, Jeanmougin M, Servant N, Keime C, Marot G, Castel D, Estelle J, et al. A comprehensive evaluation of normalization methods for Illumina highthroughput RNA sequencing data analysis. Brief Bioinform. 2013;14:671–83.
Laherrere J, Sornette D. Stretched exponential distributions in nature and economy: ‘Fat tails’ with characteristic scales. The European Physical Journal B. 1998:525–39.
Fontanelli O, Miramontes P, Yang Y, Cocho G, Li W. Beyond Zipf’s law: the Lavalette rank function and its properties. PLoS One. 2016;11:e0163241.
Lu L, Zhang ZK, Zhou T. Zipf’s law leads to Heaps' law: analyzing their relation in finitesize systems. PLoS One. 2010;5:e14139.
Bolstad BM, Irizarry RA, Astrand M, Speed TP. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003;19:185–93.
Zar JH. Twosample hypotheses. In: Biostatistical Analysis. 4th edition. Prentice hall; 1998. p. 122–60.
Wong WC, Loh M, Eisenhaber F. On the necessity of different statistical treatment for Illumina BeadChip and Affymetrix GeneChip data and its significance for biological interpretation. Biol Direct. 2008;3:23.
Kanitz A, Gypas F, Gruber AJ, Gruber AR, Martin G, Zavolan M. Comparative assessment of methods for the computational inference of transcript isoform abundance from RNAseq data. Genome Biol. 2015;16:150.
Wong WC, Ng HK, Tantoso E, Soong R, Eisenhaber F. Finitesize effects in miRNA transcript sequencing count distribution website. http://mendel.bii.astar.edu.sg/SEQUENCES/PLSDBC/.
Clauset A: Inference, models and simulation for complex systems. 2011.
Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11:R106.
Anders S, Huber W: Differential expression of RNASeq data at the gene level  the DESeq package (version 1.24.0). 2016.
Chen Y, McCarthy D, Ritchie M, Robinson M, Smyth GK: edgeR: differential expression analysis of digital gene expression data. 2016.
Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNAseq data. Genome Biol. 2010;11:R25.
Bolstad BM. preprocessCore: A collection of preprocessing functions. R package version 1.36.0. 2016.
Kirchner JW. Aliasing in 1/f(alpha) noise spectra: origins, consequences, and remedies. Phys Rev E Stat Nonlinear Soft Matter Phys. 2005;71:066110.
Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memoryefficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10:R25.
Li H, Durbin R. Fast and accurate short read alignment with burrowswheeler transform. Bioinformatics. 2009;25:1754–60.
Li H, Durbin R. Fast and accurate longread alignment with burrowswheeler transform. Bioinformatics. 2010;26:589–95.
Nelder JA, Wedderburn RWM. Generalized linear models. Journal of the Royal Statistical Society Series A (General). 1972;135:370–84.
Riihimaki M, Hemminki A, Sundquist K, Sundquist J, Hemminki K. Metastatic spread in patients with gastric cancer. Oncotarget. 2016;7:52307–16.
Ushijima T, Sasako M. Focus on gastric cancer. Cancer Cell. 2004;5:121–5.
Wente MN, Bergmann F, Frohlich BE, Schirmacher P, Buchler MW, Friess H. Pancreatic metastasis from gastric carcinoma: a case report. World J Surg Oncol. 2004;2:43.
Jin SS, Jeong HS, Noh HJ, Choi WH, Choi SH, Won KY, Kim DP, Park JC, Joung MK, Kim JG, et al. Gastrointestinal stromal tumor solitary distant recurrence in the left brachialis muscle. World J Gastroenterol. 2015;21:6404–8.
Cichowitz A, Thomson BN, Choong PF. GIST metastasis to adductor longus muscle. ANZ J Surg. 2011;81:490–1.
lesato A, Oba T, Ono M, Hanamura T, Watanabe T, Ito T, Kanai T, Maeno K, Ishizaka K, Kitabatake H, et al. breast metastases of gastric signetring cell carcinoma: a report of two cases and review of the literature. Onco Targets Ther. 2015;8:91–7.
Alon S, Vigneault F, Eminaga S, Christodoulou DC, Seidman JG, Church GM, Eisenberg E. Barcoding bias in highthroughput multiplex sequencing of miRNA. Genome Res. 2011;21:1506–11.
Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30:2114–20.
Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–2.
Anders S, Pyl PT, Huber W. HTSeqa python framework to work with highthroughput sequencing data. Bioinformatics. 2015;31:166–9.
Li B, Dewey CN. RSEM: accurate transcript quantification from RNASeq data with or without a reference genome. BMC Bioinformatics. 2011;12:323.
Acknowledgements
Not applicable
Funding
The authors acknowledge support of this research by A*STAR Singapore and the grant “IAF CAT3 Integrated Genomics Platform”.
Availability of data and materials
All supporting data is made available at http://mendel.bii.astar.edu.sg/SEQUENCES/PLSDBC/
Author information
Authors and Affiliations
Contributions
WCW came up with the concept and designed the study. WCW performed the computations. WCW and FE analyzed the results. ET performed the NGS mapping. HKN and RS designed and performed the dilution experiment. All wrote the manuscript. All authors read and approved the final manuscript
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable
Consent for publication
Not applicable
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional files
Additional file1:
Figure S1. Concentration versus total mapped reads of the dilution data set. Figures S1A to D shows the concentrations of the AGS cell line (12pM, 3pM) and NUGC3 cell line (12pM, 6pM, 3pM, 1.5pM) versus the respective total mapped reads by the 4 mapping methods: Bowtie1, Bowtie2(global), Novoalign and BWA. Regardless of the mapping methods, the sequencing depth (i.e., the total mapped reads) is shown to be linearly proportional to the system size (in terms of transcript concentration) in the logarithmic scale. Overall, the dilution data set attempts to mimic a system of various sizes of finitesize effects. (PNG 371 kb)
Additional file 2:
Figure S2. Pareto distributions and scatterplots of spikein background data set. Figure S2A to F show the Pareto plots (left column) and supplementary Figure S2G to L show the scatterplots (right column) of the spikein background set where each applied normalization methods (i.e., DESeq, RLE, TMM, UQ, CPM and Quantile) are arranged rowwise. Generally speaking, the characteristics of these Pareto plots of the normalized spikein background set are very comparable to that of Fig. 1A and C, where only a simple intrasample scaling has been applied. Despite the application of normalization, two characteristics remain unchanged. Firstly, the nonuniform slope values and its decreasing trend from the highest to lowestcount segment indicate that heteroskedasticity among the replicates will remain. Secondly, for those count segment with slope values far from “1”, their mathematical moments are infinite and hence, large variation among the replicates will be expected for these segments. (PNG 1662 kb)
Additional file 3:
Figure S3. Pareto distributions and scatterplots of NUGC3 dilution data set. Figures S3A to F show the Pareto plots (left column) and Figure S3G to L show the scatterplots (right column) of the spikein background set where each applied normalization methods (i.e., DESeq, RLE, TMM, UQ, CPM and Quantile) are arranged rowwise. Likewise, the same conclusion can be made of the Pareto and scatterplots of the NUGC3 dilution set (Fig. 3AF) versus Fig. 1B and D where both the exaggerated spitend among the Pareto plots and the extreme heteroskedasticity of the scatter plots in the NUGC3 dilution set remain. (PNG 1415 kb)
Additional file 4:
Table S1. Signaltonoise characteristics of the comparative dilution analysis (AGS versus NUGC3) before and after powerlaw correction. (DOCX 21 kb)
Additional file 5:
Table S2. Significant transcripts calls of comparative dilution analysis (AGS versus NUGC3) before and after powerlaw correction. (DOCX 16 kb)
Additional file 6:
Figure S5. Medians of Regressed slopes of firstfitted segment versus R^{2} fit for the full dilution and spikein datasets. Figure S5A and 5B show the median slope of the firstfitted segments versus the median R^{2} value of the dilution and the spikein data set respectively. In both plots, the refined solution space of the optimum pointspersegment (PPS; as indicated besides the data points) is indicated by the error margins defined by the slope of the first highestcount segments from Table 1 like before. Consequently, the optimum PPS value is determined by the largest average R^{2} value where it is 55 for the dilution set and 10 for the spikein set. Note that due to the lack of replicates for the spikein transcripts, only the background of the spikein set was used for the parameter estimation. (PNG 315 kb)
Additional file 7:
Table S3. Concordance list of miRNA transcripts before and after powerlaw correction. (DOCX 12 kb)
Additional file 8:
Figure S4. Medians of Regressed slopes of firstfitted segment versus R^{2} fit for NUGC3 dilution and spikein background datasets. Figure S4A and 4B show the median slope of the firstfitted segments versus the median R^{2} value of the spikein background set and the NUGC3 dilution set respectively. For the necessary R^{2} computations, the reference replicate was taken as the replicate with the largest total reads within the data series. In both plots, the refined solution space of the optimum pointspersegment (PPS; as indicated besides the data points) is indicated by the error margins defined by the slope of the first highestcount segments from Table 1. Within this margin, the optimum PPS value is determined by the largest average R^{2} value where it is 20 for the spikein background set and 45 for the NUGC3 dilution data sets. (PNG 316 kb)
Additional file 9:
Figure S6. Pareto distributions of Universal Human Reference (UHR) mRNA HTSeqmapped and RSEMmapped sequencing count data. Figures S6A and B show the Pareto distributions of the Universal Human Reference (UHR) mRNA data set from the publicly available source  GSE47774 that has been quantified by HTSeq and RSEM respectively. Generally, Zipf’s law holds approximately for the middle segments of the observed distributions despite the differences in abundance quantification approach between HTSeq [43] and RSEM [44]; HTSeq tends to be more conservative than RSEM by limiting quantification to uniquely mapped reads. Meanwhile, the low abundance segments exhibit different trends. Of particular interest is that the highest and high segment in NGSbased mRNA data seems to exhibit a higher slope than the Zipf’s law that characterized SAGEbased mRNA data. Preliminary findings suggests that this might be attributed to transcriptlength bias in NGSbasedsequencing that is absent in SAGEbased sequencing [9]. Nevertheless, Type I Pareto distribution (or approximately Zip’s law) seemingly holds true for transcript abundance distributions despite the differences in technology (SAGE versus NGS) and RNA species (miRNA and mRNA). (PNG 313 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Wong, WC., Ng, Hk., Tantoso, E. et al. Finitesize effects in transcript sequencing count distribution: its powerlaw correction necessarily precedes downstream normalization and comparative analysis. Biol Direct 13, 2 (2018). https://doi.org/10.1186/s130620180204y
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s130620180204y
Keywords
 Finitesize effects
 Nyquist sampling criterion
 Aliasing noise
 Pareto distribution
 Zip’s law
 Transcript abundance
 Sequencing
 Normalization
 Heteroskedasticity