Generalization of DNA microarray dispersion properties: microarray equivalent of t-distribution.

BACKGROUND
DNA microarrays are a powerful technology that can provide a wealth of gene expression data for disease studies, drug development, and a wide scope of other investigations. Because of the large volume and inherent variability of DNA microarray data, many new statistical methods have been developed for evaluating the significance of the observed differences in gene expression. However, until now little attention has been given to the characterization of dispersion of DNA microarray data.


RESULTS
Here we examine the expression data obtained from 682 Affymetrix GeneChips with 22 different types and we demonstrate that the Gaussian (normal) frequency distribution is characteristic for the variability of gene expression values. However, typically 5 to 15% of the samples deviate from normality. Furthermore, it is shown that the frequency distributions of the difference of expression in subsets of ordered, consecutive pairs of genes (consecutive samples) in pair-wise comparisons of replicate experiments are also normal. We describe a consecutive sampling method, which is employed to calculate the characteristic function approximating standard deviation and show that the standard deviation derived from the consecutive samples is equivalent to the standard deviation obtained from individual genes. Finally, we determine the boundaries of probability intervals and demonstrate that the coefficients defining the intervals are independent of sample characteristics, variability of data, laboratory conditions and type of chips. These coefficients are very closely correlated with Student's t-distribution.


CONCLUSION
In this study we ascertained that the non-systematic variations possess Gaussian distribution, determined the probability intervals and demonstrated that the K(alpha) coefficients defining these intervals are invariant; these coefficients offer a convenient universal measure of dispersion of data. The fact that the K(alpha) distributions are so close to t-distribution and independent of conditions and type of arrays suggests that the quantitative data provided by Affymetrix technology give "true" representation of physical processes, involved in measurement of RNA abundance.


REVIEWERS
This article was reviewed by Yoav Gilad (nominated by Doron Lancet), Sach Mukherjee (nominated by Sandrine Dudoit) and Amir Niknejad and Shmuel Friedland (nominated by Neil Smalheiser).


Open peer review
Reviewed by Yoav Gilad (nominated by Doron Lancet), Sach Mukherjee (nominated by Sandrine Dudoit) and Amir Niknejad and Shmuel Friedland (nominated by Neil Smalheiser). For the full reviews, please go to the Reviewers' comments section.

Background
DNA microarrays provide large quantities of data for the study of diseases and biological processes in various organisms. However, microarray studies are subject to potential variations including biological and technical variability. Usually, the existence of a large dispersion makes it very difficult to draw any meaningful conclusions from the differences between the experimental and control groups [1,2]. Alison et al. [1] give the most recent general evaluation of the approaches and methods, summarizing the items where consensus has been established as well as outstanding questions; they underline the need for replicates and the usefulness of drawing information from neighboring genes ("shrinkage"), which is discussed at length here, provide the overview of clustering methods, etc. Many methods have been developed to deal with the problem of separation of systematic and random or pseudorandom components of the signal. For example, in the case of arrays using multi-probe sets, such as Affymetrix GeneChips ® , we first have to derive a representative value of gene expression from the signals of individual probes ("low-level" analysis). The Affymetrix MAS 5 and GCOS use Tukey's biweight algorithm and yield an absolute expression value for each probe set (Affymetrix, 2005, GeneChip Expression Analysis Algorithm Tutorial, Part Number 700285, Rev. 1). The method of low-level analysis, developed by Li and Wong (dChip; [3,4]) is designed to assess the observed differences in expressions of genes on the arrays under comparison. It is based on fitting data to a simplified model, assuming that the noise variable is independent of the signal. A different model, called Robust Multiarray Analysis (RMA), was proposed by Speed, Bolstad, Irizarry and co-workers [5][6][7] (see also Bolstad, B.M., 2004, PhD Thesis, University of California, Berkeley). It uses a log-transform of the data implicitly assuming that the error is proportional to the signal intensity. In reality, the error variable has both, constant and proportional components. Once the representative value of the gene expression is known, standard statistical methods of comparison can be used for "high level" analysis of the observed differences. Nonparametric methods, such as the Mann-Whitney U-test (Wilcoxon test) or analysis of variance on ranks, are generally preferable, although the parametric t-test and ANOVA are also frequently used. It should be pointed out that the statistical methods can only separate the systematic variations from the random or pseudo-random component. Random errors are recognizable because they conform to some known frequency distribution, usually Gaussian distribution. However, occasionally, one or several samples exhibit spurious differences from the rest of the data, due to changes in the biological state of the examined cells, quality of RNA etc. Such undesirable effects are often significant and can be detected only by detailed comparisons of the individual replicate samples.
So far, very little attention has been given to the general properties of the dispersion of gene expression levels.
With respect to applicability of various statistical methods it is useful to know how the standard deviation behaves across the expression range and whether this behavior is consistent from one assay to another and among the different types of arrays. Verification of normality of the frequency distribution of random fluctuations is particularly relevant. All parametric methods are based on concordance of the observed frequency distribution with the normal (Gaussian) distribution. Most physical and chemical systems, where random variations result mainly from collective interactions of large ensembles of particles, exhibit frequency distributions close to the Gaussian. The underlying mechanisms of microarray data variability are certainly of the same nature as the collective phenomena in physical systems but the ensemble of the processes involved is so complex that one would expect some compound distribution, far from the simple form expressed by the Gaussian prototype.
The object of the present study is to examine the frequency distributions, general properties of the standard deviations and coefficients of the probability intervals. It was found that the general characteristics of dispersion are useful for quality control, reduction of a system dimension and other purposes. Firstly an overview of the frequency distributions is given for both replicate arrays (five or more replicates) and consecutive sampling of the expression difference in the ordered pairs of genes in twoarray comparisons. Subsequently, we describe the consecutive sampling analysis and evaluation of the linear characteristic function, approximating the standard deviation of the data variability across the arrays. The standard deviation function is then employed to define the probability intervals encompassing specific percentages of the observed values. The boundaries of these intervals are defined by probability coefficients K α . It was found that the values of K α coefficients obtained using various arrays are, at least in the first approximation, invariant. Finally, we compare the probability of coefficients K α with the corresponding values of inverse t-distribution.

Results
In the present investigation we analyzed 682 Affymetrix microarrays of 22 different types. Our main objective was to study the microarray data derived from particular biological investigations, generated in many different micro-array core laboratories, rather than the sets of arrays produced in the context of technology development or testing methods of analysis. Only a few "testing" sets were included. We evaluated the CEL files using MAS 4 (Affymetrix, 2002, Statistical Algorithm Description Document. Part Number 701137, Rev. 3.) and employed the "Average Difference" as expression signal value. Because MAS 5 and GCOS distort the frequency distributions in the near-zero region by ignoring the negative values, MAS 5 and GCOS outputs are not suitable. Prior to the analysis, we verified the linearity and quality of the data, in particular, the absence of clusters with significantly different expressions. All data on each array were normalized to 100% of the array mean; all Affymetrix control genes were excluded.

Frequency distributions
In the case of experiments with five or more replicates, we tested the distributions of the expressions of individual genes. In addition, in all pair-wise comparisons we performed the Kolmogorov-Smirnov normality test on consecutive samples (Table 1). Based on our several thousands of tests, it was found that the Gaussian distribution was characteristic of the expression data obtained using the Affymetrix GeneChips ® . Typically, for goodquality data, between 85 and 95 percent of samples passed the test. Moreover, a limited number of tests using the data obtained from fiberoptic bead-based oligonucleotide microarrays by Illumina led to the same conclusion [8].
For illustration, Table 2 shows the results of the Kolmogorov-Smirnov test for six studies using Affymetrix GeneChips ® with five to 11 replicates and two studies using Illumina arrays with four replicates each. The mean percentage of probe sets across the arrays failing the Kolmogorov-Smirnov test was 6.9 using the algorithm of Sokal and Rohlf [9] (intrinsic hypothesis, P = 0.05). Usually, but not always, it was found that larger percentages of failures occur in the near-zero region. We did not examine systematically reasons for the failures, but it was often noted that there were outliers and, occasionally, a change of the slope or a discontinuity, noticeable in the quantilequantile plots. Generally, we performed our analysis in the positive range of values above a small, arbitrary threshold. However, in the several tests, the percentage of failures above and below the threshold was practically identical. Figure 1 illustrates the similarity of the normal distribution and distribution of the expressions measured by typical probe sets in the human cell line IMR90 (11 replicates) in the high range (from 1000 to maximum of 6681, panel A), near-zero range (from -0.4 to 0.4, panel B) and negative range (from a minimum of -923 to -20, panel C; data Ref. [10]). A "typical" probe set is defined as a probe set with the Kolmogorov-Smirnov distance D at or ------------6.9 6.9 6.9 Percentage of samples failing the Kolmogorov-Smirnov normality test at the level P = 0.05. All arrays are normalized to 100% of the mean value. The columns %failure (above) and %failure (below) give percentage of failures above and below the specified threshold.
[f] lymphoblast cell line GM10469 [8].  U01691_s_at  576  630  54  603  265  X17620_at  605  594  -11  600  266  U10323_at  562  617  56  590  267  AJ001421_at  413  766  354  589  268  X62654_rna1_at  576  602  26  589  269  D64142_at  666  510  -156  588  270  D21063_at  562  613  51  588  271  X16560_at  588  580  -8  584  272  D26600_at  580  586  6  583  273  M19267_s_at  599  566  -33  583  274 J02621_s_at Furthermore, we observed that the probe sets with the mean expressions within a "reasonably small" range had, on average, a similar variance. Figure 2A shows pooled data of the 62 probe sets in the expression range from -0.1 to 0.1 (cell line IMR90, 11 replicates, Ref. [10]) in Q-Q plot in comparison to the inverse normal cumulative distribution with good agreement except for about six outliers. The picture changes when we scan probe sets with a wide range of mean expressions. Figure 2B shows the Q-Q plot of 185 probes sets in the range of means from 500 to 1000; the lower part of the graph deviates substantially from the straight line. When we plotted the relative expression (i.e. expressions of the individual probe sets divided by the mean of 11 arrays; Figure 2C), we got all the points, except for about ten outliers, back on the 45°l ine. This implies that the standard deviation is linearly proportional to the mean expression level.
Based on the evidence of Figure 2, we hypothesize that approximately the same standard deviation can be obtained by scanning the data vertically, i.e. looking at expressions of the neighboring probe sets, or horizontally, i.e. looking at the series of arrays for each probe set. In other words, the probability that we will observe a difference d between the measurements M 1 and M 2 of the probe set Pr 1 on the arrays A 1 and A 2 is, at least in the first approximation, about the same as the probability that we will observe such difference between the measurement M 3 of the probe set Pr 1 on the array A 1 and the measurement M 4 of the probe set Pr 2 on the array A 2 , provided that the mean expression of both populations is the same. It further follows that an estimate of mean standard deviation of a group of genes with approximately same mean expression can be obtained from comparison of two arrays. We need to rank the probe sets according to the mean expression and evaluate the standard deviation from the differences in gene expressions in samples of k consecutive genes; the range of the means within a sample must be small. Furthermore, in this arrangement we can also obtain the standard deviation by using the ranked probe sets of each individual array (Ref. [10], Supplementary Material). Note that the standard deviation derived from the difference converges to √2σ, where σ is a standard deviation of a given population. Figure 3 shows a comparison of the frequency distribution of the difference in expression of two consecutive samples with the corresponding inverse normal cumulative distribution (cell line IMR90).

Consecutive sampling analysis
Assume, as a working hypothesis, that we can estimate the standard deviation of the gene expression variability of series replicate arrays from two-array comparisons. Since the evidence derived from the frequency distribution suggests that the standard deviation is linearly proportional to the expression level (at least in the first approximation), we assume that a representative estimate of the standard deviation can be obtained in the form of a linear function of the mean expression. A similar model was proposed on a basis of theoretical considerations by Rocke and coworkers [11][12][13][14]. The consecutive sampling program (see Methods) takes k pairs of expression values Y 1i and Y 2i ranked according to the mean (Y 1i , Y 2i ) and calculates the standard deviation from the difference Y 2i -Y 1i , where the subscripts 1 and 2 denote the array number and i signifies the probe set rank; typically we set k = 12, 25 or 50, depending on the size of the array. The standard deviation function is then determined by fitting the logarithmically transformed values to the logarithm of the linear function of the mean expression (see the Methods section). For illustration, Figure 4A shows the dispersion plot and boundaries of the 0.8 and 0.95 probability intervals for the murine array MG U74Av2 (lung tissue, AKR mice; Table 4), whereas Figure 4B shows standard deviations of the consecutive samples consisting of 12 ordered pairs of probe sets and the regression curve, representing the standard deviation function.
To verify our working hypothesis stated above, we also evaluated the regression function using the standard deviations calculated from dispersion of expressions recorded by replicates of the individual probe sets. Table 3 shows the comparison for five assays with the number of replicates ranging from four to 11. The values of the coefficients a 1 and a 2 ranged from 2.1 to 6.0 and from 0.076 to 0.161, respectively. Since the standard deviation calculated from the difference is √2 times larger than the standard deviation of a given population, we compared the values obtained from the individual genes to the results of the consecutive sampling divided by √2. The average difference of the coefficient a 1 for the Affymetrix arrays was 5.4% and for the Illumina arrays 7.4%, whereas the differences for the coefficient a 2 were 7.5% and 11.2%, respectively. The total average difference for a 1 and a 2 was 6.2% and 9.0%, respectively. We observed that in all cases except one (Focus Arrays) the values obtained from the consecutive sampling were above the results obtained from individual genes. This is to be expected, because the expressions in the consecutive samples belong to populations with different, albeit very similar, means. Since the standard deviation increases with increasing average, the differences among the means introduce an additional variability. Figure 5 shows an example of the standard deviation derived from 9 replicates of the Focus array. The points represent the standard deviations of the expressions of individual probe sets and the solid line represents the standard deviation function derived from the consecutive sampling.

Probability intervals and correlation of the K α coefficients with t-distribution
Once we evaluate the standard deviation function, we can determine the limits of the probability intervals, i.e. the boundaries corresponding to a distance from the 45° axis of symmetry equal to a constant number of standard deviations. Equations defining these limits are given in the Methods section (Eqs. (2) and (3)). The coefficient K α is equivalent to the standardized or "standard" deviate of the normal distribution, representing the distance from the mean, expressed in standard deviations. In case of the z-distribution or t-distribution the standard deviates corresponding to specific probability intervals can be derived from the cumulative distribution function. Since the theoretical distribution function corresponding to the probability intervals of the microarray dispersion is unknown, we determined the coefficients K α empirically. First we calculated the standard deviation function and then used Eqs. (2) and (3) to define the limits of the standard deviate intervals ("probability intervals"; see Figure 4A, note that the boundary lines appear in the log-log plot as curves). To determine the K α coefficients corresponding to specific probabilities we counted the points lying outside a given interval. For example, if the number of points in a given expression range examined was, say, 10000, we determined the K α value corresponding to the interval 0.995 by finding the interval containing 9950 points (99.5%), leaving the 50 points outside. More precisely, the K α is calculated as the average of the values corresponding to the integers above and below the number equal to the given fraction.
The K α coefficients are standardized with respect to the mean and standard deviation of given populations. As such, they are a universal measure of the probability of occurrence, function only of the shape of the distribution function. Considering the complexity of the processes involved in microarray experiments, we did not expect that the coefficient would be constant even for just a variety of RNA samples of a given type of array. Nonetheless, examination of 42 microarray studies with two to 11 replicates comprising 682 arrays and 22 Affymetrix array types revealed that values of the K α coefficients were very close for all tested comparisons (note that multiple chip arrays are counted as multiple types). The coefficients were invariant for a wide range of dispersions, invariant with respect to different laboratory conditions, different tissues and different species and across all the types of arrays we tested. Table 4 shows a summary of the average values of K α coefficients for 900 pair-wise comparisons. The average coefficient a 1 varied from 1.8 to 54.4 and coefficient a 2 from 0.08 to 0.69, with total coefficients of variation 1.05 and 0.54, respectively. In spite of such a wide range, the differences in the coefficient K α were small: the coefficient of variation ranged from the minimum 0.031 at the probability p = 0.9 to the maximum 0.101 at p = 0.995.
We examined the relationship between the coefficients K α and the inverse cumulative t-distribution. We found a very close linear correlation between the K α values and the tdistribution values corresponding to the degree of freedom df = 6. The adjusted R 2 coefficient was 0.99993, with the intercept of 0.039 and the coefficient of proportionality of 0.855. Figure 6 shows the graph of the K α values plotted against the t-distribution parameters in the range of probability intervals from 0.5 to 0.995; the solid line represents the regression line for df = 6 and the bars indicate the standard deviation. We also compared directly the K α intervals and t-distribution. Figure 7 shows the probability values corresponding to the K α coefficients and t-distribution probability, represented by the solid curve. In the direct comparison we obtain better agreement for df = 12 than for df = 6.
A further examination of the results shown in Table 4 seemed to indicate that the older GeneChips ® had a some- Columns pair-wise a1 and pair-wise a2 are the coefficients of the standard deviation characteristic function derived from the consecutive sampling. Columns individual genes a1 and a2 show the values derived from the individual probe sets and difference is the difference in % between the two methods.
what broader distribution. For example, the mean K α at 0.995 for the array HuGene FL was 4.11, while these values for the later versions HG-U95A and HG-U133A were 3.48 and 3.56, respectively. To assess the correlation between the developing technology and shape of the K α distribution, we need a quantitative parameter, reflecting the technological advancement. One possibility is the feature size and number of probe pairs per set, which have been systematically decreasing with time. Table 5 shows the overview of the selected K α values correlated with the technical factor TF, defined as the sum of the feature size and number of the probe pairs per probe set. In Figure 8 we present the K α values at 0.95 and 0.995, plotted against TF. The regression line showed a slight decreasing tendency of the K α values at 0.995 with the decreasing TF, but the graph was not very convincing; the adjusted R 2 was only 0.31. No trend was discernible at the probability of 0.95.
Comparison of the observed frequency distribution to the inverse normal cumulative distribution We found the probability intervals useful for estimating the significance of the observed differences, in particular in assays with small numbers of replicates (four or less). The K α coefficients representing the number of standard deviates that separates the measured values from the reference mean values provide an objective measure of dissimilarity between the populations under consideration. For the single normal population the interpretation is straightforward. However, in case of the microarray data we deal with the multitude of populations and the theoretical K α function is unknown; our correlation results though indicate that a universal function, encompassing all GeneChip ® types, exists. We could use the K α values obtained from correlations instead of the theoretical values; however, the experience has shown that the results are not reliable. First, considering the large number of values on the arrays even small differences in the K α function translate into substantial differences in number of candidates. Second, quite frequently the unplanned differences between the samples cause deviations from the expected behavior and render comparison with the general function unsuitable. Therefore, in practice, we use the K α coefficients only for ranking.
To determine the best candidate genes differentially expressed, we search for the genes with the largest K α in all or most of the comparisons. We named this method "consecutive sampling and coincidence test." Briefly, we calculate the K α coefficients in all possible N pair-wise comparisons and select the probe sets with expressions beyond a given probability interval in at least M comparisons; the upper limit of probability of observing f false positives can be calculated theoretically, assuming random selection. Detailed discussion is beyond the scope of this study (a particular example of application to the analysis of five-replicate assay of murine lung tissue can be found in Ref. [15]). The main advantages of this approach are that: 1) it is a nonparametric method; 2) applicable to assays with small number of replicates (as small as two); 3) it examines all pair-wise comparisons and makes easy to identify and automatically flag problematic arrays; 4) the probability of false positives can be easily calculated from the binomial distributions or estimated by straightforward simulations [8]. Here, as a brief illustration of the consistency of this approach, Table 6a shows the analysis of five replicates of murine GeneChips MG-U74Av2, labeled as mg1 to mg5 (data Ref. [15]     No. of probe sets is approximate number of the probe sets on array, Cons. samp. size is number of the probe pairs in a consecutive sample, No. of arrays is a number of arrays tested, No. of comp. is the number of pair-wise comparisons among the replicates, coefficients a 1 and a 2 are the coefficients of standard deviation function and Kalpha is the coefficient determining probability interval; "h." stands for "human," "m." for "murine.   [50]. x -E. Nambara and K. Nakabayashi, ATH1, Arabidopsis thaliana (L.) Heynh of ecotype Columbia [51]. y -E. Nambara and K. Tatematsu, ATH1, Arabidopsis thaliana (L.) Heynh of ecotype Columbia [52]. subsequently counted the common genes found in any two particular subsets. The mean values of all possible comparisons are shown in the fourth row of Table 6a. The values shown in the last row represent the ratio of the mean number of common genes relative to the mean number of the genes that passed the test for each subset (third row) in percent. In the case of the t-test, the average for the over-and under-expressed genes was 23 and 29 percent, respectively. By comparison, the coincidence test for the over-expressed genes yielded 75% and RMA 81%; in the case of the coincidence and RMA, the mean numbers of under-expressed genes were below ten and the comparisons were considered unreliable (data not shown). In only this example we used MAS 5 generated values. Table 6b shows the results of similar tests carried out using the Illumina fiberoptic bead-based oligonucleotide arrays. In this case the average percentages of agreement for the coincidence tests were 89.1, to compare to 48.2%, obtained for the t-test. A more detailed comparison under slightly different assumptions, which includes also the CyberT and Tusher's method, can be found in Ref. [8].

Discussion
In our practice we adopted the approach of Affymetrix, which estimates the background from 2% of the probes with the lowest signals, uses the MM probes for the estimate of the non-specific component and yields an estimate of an "absolute" value of the RNA abundance. We adhered to the Affymetrix philosophy in spite of popularity of the global fitting methods, such as dChip [3,4] and RMA [6,7], because it provide us with a representative expression values independently for each array, enables us to assess consistency of the observed values and detect irregularities and outliers. This is an important advantage, considering how frequently we detect "atypical" arrays among replicates. Furthermore, consistency checks have  shown similar rates of coincidence for both RMA and coincidence testing ( showed the best performance. The authors also noted that the coefficient of variation in replicate experiments in the case of MAS 5 increases with a decreasing mean signal, but remains approximately constant for PM only of dChip and RMA. Invariance of the coefficient of variation raises a certain concern: percentage of contribution of the nonspecific signal increases with the decreasing concentration and one would expect that at low concentrations it would be harder to separate it from the specific component. Choe et al. [19] compared various combinations of the six steps in the differential expression analysis: background subtraction, probe-level normalization, PM adjustment (correction for the non-specific signal), expression summary (derivation of the representative gene expression from the multiple probe signals), probe set-level normalization and statistical evaluation. This was a particularly interesting comparative study, since their experimental design was much closer to real conditions than spiked sets of arrays used in other publications. The authors report that the combination of the MAS5 for background correction and PM adjustment, median Polish method or, marginally inferior, MAS5 for expression summary, loess for normalization and CyberT for statistical evaluation [20] yielded the best results. They also emphasized that, under their particular conditions, MM signals provided the best estimate of the non-specific component. Furthermore, they concluded that in the statistical evaluation it is important to account for variation of the standard deviation with the mean expression (see also [21]). They adopted the CyberT model proposed by Baldi

and Lang
Comparison of the K α distribution and inverse t-distribution Figure 7 Comparison of the K α distribution and inverse t-distribution. K α values correspond to probabilities from 0.5 to 0.995. The degree of freedom for the inverse t-distribution (solid lines) is 6 and 12.  [20], which uses consecutive samples to estimate the expression-dependent component of the standard deviation, similarly to our approach.
In the present analysis of the frequency distributions and properties of the K α we used the MAS 4 software, instead of MAS 5 or GCOS. The reason is that these more recent versions distort the frequency distribution and standard deviation function in the near-zero region. In the case of the Affymetrix arrays the estimate of additive signal, caused by nonspecific binding and other spurious phenomena, is based on the mismatch signal. The estimate of the "true" gene expression is then derived from the difference between perfect match (PM) and mismatch (MM). However, in such system the variability of this difference is a "true" measure of the absolute gene expression variability. Negative difference does not mean that the gene expression is negative, but simply that the MM signal is larger than PM. It is perfectly logical that in absence of a given RNA the MM signal would exceed PM in about 50% of cases. The frequency distribution of the PM -MM difference in the absence of a specific RNA is the best measure of the constant component of spurious signal, added to the "true signal" value. Such estimate cannot be derived from MAS 5 or GCOS data. Replacing the negative values resulting from the signals actually measured by the PM and MM probes by arbitrary numbers introduces inconsistency in the method of evaluation and leads to decrease of the standard deviation with decreasing signal level in near-zero region (unpublished observation). In the low expression region it also leads to a substantial increase in number of probe sets that deviate from the normal distribution [22]. Nevertheless, at the expression levels above about 50 (normalized to 100% of the mean) our observations and conclusions hold even for the data analyzed with MAS 5 or GCOS. Some methods of analysis, such as RMA and one variant of the dChip, avoid the negative values without introducing inconsistency in evaluation by using the PM values only.
Average K α coefficients at the intervals 0.95 and 0.995   In the preceding section, we demonstrated that the frequency distribution of the random and pseudo-random fluctuations of microarray data is predominantly normal. The normal frequency distribution is a useful property, allowing straightforward identification of outliers, a convenient quality check and simple characterization of the observed data. Normality of the error term is an important assumption of various global models used for the analysis of measured probe signals, such as dChip [3,4], RMA [5][6][7] and other approaches [23][24][25]. Among these only Pavelka et al. [25] demonstrated that the assumption is justified. Normality is also a necessary condition for application of the parametric methods. Here we observed that on average over 5% of samples deviate from the normal distribution (using the test threshold of 0.05). It is agreed that the t-test and ANOVA are rather robust with respect to normality (e.g. SigmaStat software [SPSS inc.] uses for ANOVA the threshold of 0.01), nonetheless the noted deviations call for caution when using parametric methods, in particular considering that every analysis involves multiple testing. Our conclusion differs from that of Gilles and Kipling [22], who studied normality of Gene-Chip data using a set of 59 Affymetrix HG-U95A microarrays with human pancreatic cRNA. The authors concluded that "...data provide strong support for the application of parametric tests to GeneChip data sets without the need for data transformation." However, Shapiro-Wilks test, applied to the MAS 4 evaluated data, detected 28% of probe sets deviating from normality at the level P < 0.05. The authors argued that the Shapiro-Wilks test is, perhaps, too sensitive, since the Q-Q plots of the observed and nor-mal values show high correlation. In our opinion, correlation is not a reliable measure of normality. The correlation coefficient can be high in spite of a small number of outlying points that might sufficiently affect variance to lead to false positive conclusions. Gilles and Kipling also observed an excessively high percentage of deviations from normality at low expression levels in data evaluated using MAS 5 and deduced that the most likely reason is MAS 5 treatment of negative values.
The probability of any value in normally distributed populations can be expressed as a number of standard deviates. For example, expressing the difference between the mean of a given population and a particular measurement in standard deviates enables us to compare this difference to the standardized z-distribution and determine, among other things, the cumulative probability of occurrence. For example, the standard deviate of 3.09 corresponds to the cumulative probability of measurements in the tails of the distribution function P = 0.001, a conventional threshold for identifying outliers in small-size samples. In the case of microarrays, we do not have single standard deviate values but standard deviate functions, defined by the K α coefficients. Nonetheless, the same reasoning applies. The necessary and sufficient condition for "standardization" of microarray dispersion is that the K α coefficients must be invariant. Under such condition differences expressed in K α variable are universal, independent of the particular properties of RNA samples, type of array, etc. This is of a practical significance for comparative studies, such as studies comparing results obtained  [15]). The data were subject to one-tail t-test at the level 0.01, coincidence test and RMA. The coincidence and RMA tests were not carried out for the cases below the interval, since the numbers of occurrences were too small. The means of positive cases in five four-sample tests are given. The means of genes common to any two trials are shown. Ratio of the means is given in percent. b) The t-test and coincidence test, Illumina (four samples; data Ref. [8]). The second and third column list the number of genes identified by the coincidence method for the interval 0.9 and 0.8, respectively. The last column shows the numbers of genes that satisfied the t-test. The first and second rows of data give the mean number of genes that passed three-sample sets and the mean of the genes passing concurrently in two particular tests, respectively.
Analysis of significance in assays with less than five replicates always represents a problem. Parametric methods are not reliable in the case of small samples and the nonparametric Mann-Whitney test and ANOVA on ranks provide a very crude estimate for three or four samples and are not very reliable either. Before asserting the invariance of K α values, we used probability intervals in pair-wise case-control comparisons and selected as candidate genes the genes that fell outside a given interval in predetermined number of comparisons [8,15]. We refer to this method as the "consecutive sampling and coincidence test." A more appropriate approach would be to estimate K α coefficients representing the random variability from replicate arrays and apply the coincidence test to the initial sets of genes lying outside the intervals defined by these values.
Besides the significance estimates, we found that the probability intervals determined by K α coefficients are very useful for filtering out the random probe sets prior to the clustering analysis, in particular when hierarchical clustering or principal component analysis is employed. Another straightforward application is to select the relevant set of genes for pathway analysis. Finally, disproportionate K α coefficients indicate a problematic pair of arrays, usually with nonlinear behavior or large clusters of outlying genes.

Conclusion
We provide evidence that the majority of microarray samples, typically between 85 and 95 percent, conform to a Gaussian distribution. Monitoring excessive number of consecutive samples that fail the Kolmogorov-Smirnov normality test is a useful method of quality control in automated analysis of gene expressions.
We used the consecutive sampling method to determine K α coefficients defining the probability intervals in pairwise comparisons. Subsequently, we demonstrated that these coefficients are, in the first approximation, independent of the nature of sample, the laboratory conditions and the type of array. The K α coefficients within the range of probabilities from 0.5 to 0.995 correlate very well with t-distribution. Filtering out the genes with expressions within the probability intervals defined by K α coefficients can significantly enhance the performance for clustering methods, especially for hierarchical clustering and principal component analysis. Finally, selecting the genes that fall outside a specific probability interval in a specific number of pair-wise comparisons provides a convenient, nonparametric method for estimating the signif-icance of observed differences, advantageous, in particular, in case of assays with a small number of replicates.
Our main objective in studying the invariant properties of K α distribution was to examine the arrays from many different experiments in different laboratories, rather than replicate assays, to verify technology or method of analysis. The fact that even under such diversity of data the K α distribution is so stable and so close to t-distributions implies that the Affymetrix technology provides "true" representation of quantitative phenomena, involved in measurement of the abundance of RNA in studied media. However, improving the precision and devising the most effective methods of evaluation still remain a challenge for future development.

Consecutive sampling program
The first version of the consecutive sampling method was published by Novak et al. [10]. Briefly, the program ranks the probe sets of two arrays under comparison (say array A 1 and A 2 ) according to the mean expression and defines the samples of k consecutive pairs of values ("consecutive samples"; typically k = 12, 25 or 50, depending on the size of the array). Then it calculates the standard deviation of samples from the difference of expressions and fits logarithm of the linear function SD = a 1 + a 2 Y mean (1) to the logarithm of calculated values; here Y mean is the sample mean and a 1 and a 2 are the intercept and coefficient proportionality, respectively. The logarithmic transform prior to the regression is used solely to equalize the residua. Without the transform the high-expression residua greatly outweigh the low-end values and lead to an inaccurate approximation in the near-zero range. After the fitting the standard deviation function is inverse-transformed back to the original scale. Since the range of mean values within the samples must be small, we exclude an adequate number of the probe sets below the maximum expression, where the density of the probe sets per unit of expression is low (see the identity test below). This is necessary to avoid inaccurate values caused by large differences of the mean values within the samples. Once the standard deviation function is determined, it is assumed it can be extrapolated to the maximum expression. Table 1 illustrates the procedure. The column "Rank" shows the rank from the highest mean expression, columns "Sample" give the expression values of the arrays A 1 and A 2 , "Y 2 -Y 1 " is the expressions difference, "(Y2+Y1)/2" is the expression mean, "Sample Mean" is the mean expression of the sample, "SD(Y2-Y1)" is the standard deviation obtained from the difference of expressions and "SD(Y1)+SD(Y2)" is the sum of the standard deviations calculated from the values Y1 and Y2, respectively. The first 250 probe sets were excluded from the regression procedure to ensure that the variation of the mean expression of two arrays in a given consecutive sample is small.
When the standard deviation function is determined, the program calculates the boundaries of chosen probability intervals as functions of the mean expression. The upper and lower limits in the dispersion plot Y 2 versus Y 1 are defined as and , where K α is a constant corresponding to the probability interval α (see Additional file 1).
Three reliability checks were incorporated into the consecutive sampling program. First is the identity test, which verifies the equality where SD(Y diff ) and SD(Y i ) are the standard deviations calculated from the expression difference and from the expression values of the individual (first or second) arrays, respectively [10]. It provides a good verification of variability of the mean values within samples; we usually require the mean discrepancy of the ten consecutive samples below 1%. The authors present an empirical study of the distributional characteristics of data from Affymetrix gene expression microarrays. One of the questions posed at the outset concerns the relationship between the mean and variance of microarray data ("it is useful to know how the standard deviation behaves across the expression range...", [Background, §2]) and is subsequently answered in the following way: "...the standard deviation is linearly proportional to the mean expression level" [Results, Frequency distributions, §3]. However, this latter finding seems widely recognized already, and has been discussed in some detail in the literature (e.g. Rocke  Author Response: It appears that our procedure was not clearly described and we revised the text accordingly. We actually use the logarithmic transform only in the regression procedure to balance the residuals, i.e. to prevent the residuals of the high-expression genes to outweigh the low-expression range. Thus instead of regressing SD(Y mean ) → a 1 + a 2 Y mean we fit log(SD(Y mean )) → log(a 1 + a 2 Y mean ) Consequently, the determined characteristic function represents the standard deviation of the original (normalized) data and not log-transformed data. This is the only occasion when we use the log-transform, in all other procedures we employ non-transformed normalized data.
This reviewer found the approach taken to "consecutive sampling" in studying the mean-variance relationship in paired arrays somewhat ad hoc. For example, the 250 probe sets having highest expression level are excluded from the analysis. What effect does this exclusion have on the analysis? Is it appropriate to leave out data (arguably some of the most interesting data) from an empirical study of this kind? This issue is not really discussed. The authors also state that "we usually require the mean discrepancy of the ten consecutive samples below 1%" [Methods, §1]. Does this mean the data are ignored if the discrepancy is higher than this threshold? The presentation of mathematical details is not always very clear in the paper. Equations (2) and (3) would benefit from either a derivation or a reference. Equally, some of the phrasing is somewhat difficult to interpret, e.g. "...we can calculate an estimator of the standard deviation of gene expressions variability of a population of replicate arrays from two-array comparisons" [Results, Consecutive sampling analysis, §1 (the text before revision)].
Author Response: Derivation of equations (2) and (3) is described in Additional file 1. The sentence in question was reformulated.
Finally, this reviewer found the introduction of a new methodology for finding differentially expressed genes [Results, Probability intervals and correlation of the K α coefficients with t-distribution, §5 onwards] puzzling inasmuch as it did not relate to, or strengthen, any of the main arguments of the paper. The case presented was also far from convincing: given that there are so many existing methods for detecting differential expression, it is surely reasonable to expect any new method to be accompanied by strong empirical evidence and/or theoretical arguments in its favor.
Author Response: Actually, the application of the probability intervals to the differential expression analysis had not been included in the earlier versions of the manuscript. However, during the internal reviews we frequently encountered a question "how can the dispersion analysis and probability intervals help biologist to analyze data and to detect significant differences in gene expression" (see also comment of the third referee). To answer this question we included a brief description of the consecutive sampling and coincidence analysis, which we use as a standard procedure, usually in combination with the RMA and/or other approaches. To provide better description we revised the text and included an additional reference.
In conclusion, the basic idea behind the paper, of characterizing microarray data distributions using a large set of real-life experimental data, is a very good one, but the paper is not well tied to the literature and suffers at times from a somewhat ad hoc approach. The paper addresses issues related to analysing DNA Microarrays data focusing on differences of gene expression. The paper is an extension of previous paper of J.P. Novak (reference# 8) by employing various parametric and nonparametric statistics tools and extensive use of statistical packages for very large data sets. The premise of the paper is that the standard deviation of samples of difference of gene expression in DNA microarrays is a linear function of their mean. The paper is a very good work in the area of quality control of Data in DNA Microarrays and certainly a contribution to the field. There are several points that the authors should clarify: