A nitty-gritty aspect of correlation and network inference from gene expression data

Background All currently available methods of network/association inference from microarray gene expression measurements implicitly assume that such measurements represent the actual expression levels of different genes within each cell included in the biological sample under study. Contrary to this common belief, modern microarray technology produces signals aggregated over a random number of individual cells, a "nitty-gritty" aspect of such arrays, thereby causing a random effect that distorts the correlation structure of intra-cellular gene expression levels. Results This paper provides a theoretical consideration of the random effect of signal aggregation and its implications for correlation analysis and network inference. An attempt is made to quantitatively assess the magnitude of this effect from real data. Some preliminary ideas are offered to mitigate the consequences of random signal aggregation in the analysis of gene expression data. Conclusion Resulting from the summation of expression intensities over a random number of individual cells, the observed signals may not adequately reflect the true dependence structure of intra-cellular gene expression levels needed as a source of information for network reconstruction. Whether the reported effect is extrime or not, the important point, is to reconize and incorporate such signal source for proper inference. The usefulness of inference on genetic regulatory structures from microarray data depends critically on the ability of investigators to overcome this obstacle in a scientifically sound way. Reviewers This article was reviewed by Byung Soo KIM, Jeanne Kowalski and Geoff McLachlan


Introduction
Inferring gene regulatory networks from microarray data has become a popular activity in recent years, resulting in an ever increasing volume of publications. There are many pitfalls in network analysis that remain either unnoticed or scantily understood. A critical discussion of such pitfalls is long overdue. In the present paper, we discuss one feature of microarray data the investigators need to be aware of when embarking on a study of putative associations between elements of networks and pathways. We believe that the present discussion pinpoints the crux of the difficulty in correlation analysis of microarray data and network inference based on correlation measures. The same caveat is of even greater concern in reference to more sophisticated methodologies that are designed to extract more information from the joint distributions of expression signals, Bayesian network inference being a relevant example. In a paper published in 2003, Chu et al.
[1] pointed out the important fact that the measurements of mRNA abundance produced by microarray technology represent aggregated expression signals and, as such, may not adequately reflect the molecular events occurring within individual cells. To illustrate this conjecture, the authors proceeded from the observation that each gene expression measurement produced by a microarray is of the sum of the expression levels over many cells.

Aggregated expression intensities
Let ν be the number of cells contributing to the observed expression signal U (see Remark 1 below) and denote by X i the expression level of a given gene the expression level of a given gene in the ith cell. The notation Y i is used for the second gene in a given pair of genes. A simplistic model of the observed expression signals in this pair is given by where X i and Y i are two sequences of independent and identically distributed (i.i.d.) random variables (r.v.s), while X i and Y i in each pair (X i , Y i ) may be dependent with joint distribution function F(x, y). Limiting themselves to the case where ν is non-random, Chu et al. [1] showed that, except for some very special and biologically irrelevant cases, the Markov factorization admitted by the expression levels within individual cells does not survive the summation (aggregation) in formula (1), thereby stymieing any network inference based on the joint distribution. The importance of this observation cannot be emphasized enough. However, as apparent from the relevant literature, it went entirely unnoticed.
In their concluding remarks, Chu et al. [1] note that the mean vector and covariance matrix remain "invariant under aggregation up to a simple linear transformation". The same is obviously true for the correlation matrix. They saw some hope in that fact as reflected in the following quote from their paper: "Thus, while waiting for the technologies capable of measuring efficiently the expression levels in single cells, in experimental studies, we can still make valid -although probably more limited -inferences about the regulatory networks based only on the first two moments of the joint distribution and the independence relations." Unfortunately, this hope is deflated when considering the case of random ν. Indeed, let each X i have the same distribution as X, while each Y i is distributed as Y. Then the following formula holds for the correlation coefficient ρ(U, Remark 1. If the hybridization reaction reaches equilibrium, an assumption widely adopted in the physical chemistry of microarrays [2], the random variable (r.v.) ν can be interpreted as the total number, N, of cells from which the total RNA is extracted. In the practical use of microarray technology, however, the reaction is typically stopped before equilibrium has been reached. In the latter case, the r.v. ν represents the number of cells that collectively yield the ultimate number of bound target-probe duplexes. Therefore, the random parameter ν is unobservable and should be thought of as a virtual number of cells associated with each batch of target RNA produced by them. This notion provides a constructive way of bridging the processes of gene expression at the genomic and tissue levels, which is the main thrust of our discussion. The conventional protocol of a microarray experiment implies that it is the total amount of RNA that is controlled (kept constant) across the arrays (subjects) rather than the number of cells ending up on each array. Therefore, the random fluctuations of ν cannot be controlled directly.
Even if a tight control of N could be provided in experiments, it is unclear whether this would have had a diminishing effect on the variance of ν.
An upper bound for the deviation between ρ(U, V) and ρ(X, Y) is given by This result follows from formula (3) and the following chain of inequalities: , Recall that the equality ρ(U, V) = ρ(X, Y) holds when τ = 0. Considering R = ρ(U, V) as a function of τ, one can verify that R(τ) either increases monotonically or attains a minimum before starting to increase with increasing τ. In both cases, R → 1 when τ → ∞. The function R(τ) is smooth at τ = 0, but its initial slope may be quite high as our sample computations show. An additional quantitative insight into the potential impact of this unobservable variation on the correlation structure of microarray data is possible as described in Section 4.

An alternative representation of ρ(X, Y) and its implications
Recalling the model given by (1), we give a formula that allows us to better understand the principal difficulty brought about by the random nature of the parameter ν.
In the [Additional file 1] we find the correlation between the unobservable r.v.s U and V: This formula implies that estimating the correlation between the unobservable variables X and Y in each gene pair amounts to estimating the correlation between their averages over a random number of cells, thereby showing the earlier-mentioned nonidentifiability aspect of the problem in terms of the basic random variables. Note that the model given by (1) can be represented as where the correlation between and is the same as that between X and Y, albeit the distributions of the corresponding vectors can be arbitrarily dissimilar. The above representation shows that the r.v. ν can be interpreted as a multiplicative random noise as long as the main focus is on pairwise correlations. However, this interpreation is to no avail. The noise ν and the signals and are inherently dependent under this model. Therefore, the popular model of independent random effect is unlikely to serve a good approximation to the aggregated signals. In Section 6, we will invoke formula (5) in our discussion of the utility of the Law of Large Numbers within the framework of model (1). Formula (5) also illustrates one restrictive assumption behind the model that may have gone unnoticed in its construction. Specifically, the assumption that (X i , Y i ) are i.i.d. random vectors implies exchangeability of these vectors across cells and subjects so that the joint distribution of (X, Y) exhaustively describes both types of variability in formula (1). Put another way, the baseline joint distribution of expression levels of all genes introduced at the cellular level is implicitly compounded with respect to a latent random parameter describing the inter-subject variability. In this case, the correlation between expression signals within each cell appears to be the same as the correlation between their random averages (as formula (5) shows), both correlations being computed across subjects. If one wants to separate the two types of biological variability in a mechanistic model, e.g., by incorporating a random effect into the expression signals associated with single cells and thus making them dependent within each subject, the resultant formulas will become quite cumbersome and contain additional unobservable parameters.

Assessing the effect of signal aggregation
While our discussion at the end of the previous section suggests that model (1) is quite simplistic, we presently have no better vehicle to assess the potential deviation of the correlation between X and Y from that between U and V. To gain an idea of how strong the effect of the parameter ν variability can be, let us first compute the coefficient R = ρ(U, V) for some parameter values, assuming that gene expressions within single cells are stochastically  independent (ρ(X, Y) = 0). By way of example, suppose σ ν /μ ν = 0.23 and μ ν = 2 × 10 5 cells. From formula (3), we obtain R = 0.999942 for a = 1, b = 2 and R = 0.999952 for a = 1, b = 5. When setting ρ(X, Y) = 0.5 or ρ(X, Y) = 0.9, the values of R change only in the fifth digit. The same magnitude of R still stands for ρ(X, Y) = 0 and even when ρ(X, Y) = -0.9. Notwithstanding arbitrariness of the chosen parameters, this indicates an extremely serious problem arising in studies of dependence structures in general and regulatory networks in particular.
Do our calculations imply that the true correlations between gene expressions are absent or weak? The answer is definitely "No" for the following three reasons. First, the assumption of gene independence is biologically implausible and in conflict with a large body of independent experimental evidence, including the known effects of noncoding RNAs and involvement of genes in biochemical pathways. Second, the situation observed in real data is not as severe as in our sample computations: positive correlations tend to be lower and even a small proportion of negative correlations has been documented. It would appear reasonable that many strong negative correlations are hidden in the overwhelmingly positive correlation structure of microarray data. Third, the unobservable parameters chosen in our computations may be very far from reality. Therefore, we have to base our assessment on real gene expression data rather than imaginary parameters of the model. One possible approach to real data analysis is presented below.

Remark 2.
It should be noted that negative correlations are typically much more prevalent in normalized versus not normalized data. This does not mean, however, that the commonly used normalization procedures can restore the true correlations. A profound effect of such procedures on the correlation structure of microarray data is well-documented [3,4]. This effect is hardly beneficial as normalization procedures distort the aggregated signals in an unpredictable way [5] and interfere in the true correlation structure [3]. There are also other theoretical reasons for the fact that data normalization does not provide a satisfactory solution to the problem; these reasons will be discussed at length in another paper.

From formula (2) it follows that
where z ν = σ ν /μ ν . As a function of z, the coefficient ρ(X, Y) either decreases monotonically or attains a maximum at the point where Therefore, the effect of signal aggregation is not unidirectional -the correlation coefficient ρ(X, Y) may be smaller or higher than the observed coefficient ρ(U, V). Formula (6) can be represented in a more concise form All the parameters entering formulas (6) or (8) can be estimated from microarray data except for z ν , which is unobservable. However, there are natural mathematical constraints that must be imposed on z ν . First of all, we have to require that z ν <ξ u for any gene, i.e., where , j = 1,..., m, is the variation coefficient for the jth gene and m is the total number of genes. However, condition (9) does not ensure that |ρ(X, Y)| ≤ 1. To meet the second condition, we derive from (6) the following requirement: for all pairs of genes simultaneously.
The above conditions allow us to deduce a realistic range of possible values of the unobservable variation coefficient z from a specific set of microarray data. If ρ(X, Y) appears to be a monotonically decreasing function of z ν , which property can be verified with real data, then we can use formula (6) to estimate its maximal deviation from ρ(U, V) by evaluating ρ(X, Y) at the right extreme of z ν yielded by conditions (9) and (10). In this case, we obtain a reasonably realistic upper estimate of the actual effect of signal aggregation in accordance with model (1). If ρ(X, Y) passes through a maximum as a function of z ν , this estimate will become conservative to shifts towards lower values of the true correlation coefficients. Such estimates need to be produced for all gene pairs, of course. More accurate estimates of the effect in both directions (up and down) can be obtained by evaluating the behavior of ρ(X, Y) over the whole range of admissible values of z ν in each gene pair, but this approach is computationally extremely expensive and requires parallel computations.
The mean and minimal (across genes) variation coefficients of gene expression were estimated from the following five sets of microarray data: BCC: Breast cancer cells cultured in vitro (represented solely by "vehicle" control samples that were treated with the medium used to solubilize the inhibitor) with HG U133A Affymetrix Chip used to produce microarray measurements [6]; TELL and HYPERDIP: two types of childhood leukemia, U95A Affymetrix Chip [7]; PCTUM: prostate cancer, U95Av2 Affymetrix Chip [8]; PCNORM: normal prostate tissue obtained from prostate cancer patients, U95Av2 Affymetrix Chip [8].
The results are shown in Table 1. These estimates are consistent with the earlier reported observation that the variation coefficients of gene expression are virtually constant across genes [9]. Using the above-described approach, we analyzed all gene pairs in the HYPERDYP data set reporting expression levels of m = 7084 genes for n = 88 patients with a specific type of childhood leukemia. In this case,  (10). Therefore, we used the latter value as the conservative estimate of when computing the correla- (6). Testing for monotonicity was performed by partitioning the admissible range of (given by condition (10)) into four intervals and using formula (6) to compute the corresponding increments of ρ(X, Y) for each interval. If at least one increment happened to be positive in a given pair, this event was recorded as a "monotonicity violation". There were less than 0.2% of all gene pairs that could be suspected for such violations in the HYPERDYP data. While this frequency of monotonicity violation may be reckoned as quite small, it should be kept in mind that possible shifts in ρ(X, Y) towards values higher than the observed ρ(U, V) were entirely ignored in this analysis.
Let us now evaluate the numerical results of this study. for assessing the deviation Δ ρ may still be overly simplistic discussed in the previous section. Much more research needs to be done, both theoretically and experimentally, to shed more light on this methodological difficulty.

Signal aggregation and technical noise
Our estimates in Table 1 and those resulted from condition (10) give only a rough idea of the magnitude of σ ν /μ ν and making them more accurate is highly desirable. We discuss one possibility to attain these ends in the present section. Consider an experimental design that supposedly eliminates the biological variation, thereby yielding the information on measurement errors only. Suppose that a sample of n arrays is available that consists solely of technical replicates representing gene expression measurements taken from one and the same subject. Proceeding from the traditional multiplicative noise model, where m is the total number of genes (probe sets), is the observed random signal, and j is an independent random technical (both gene-and array-specific) noise, one would model this situation as where C j are nonrandom constants. If the expression levels are log-transformed, we have Therefore, so that, relying on model (12), one can measure the variance, Var(log j ), of the log-transformed technical noise directly from technical replicates. In particular, one can estimate the variance of the mean error across all probesets, i.e., We resorted to the above line of reasoning in [10] when reanalyzing the Microarray Quality Control Study (MAQC) [11]. For this data set, the estimated is equal to 0.09. Since the overwhelming majority of genes have typically much larger (> 0.3) standard deviations of their log-expression signals in biological replicates (different subjects), this level of technical noise can be deemed negligibly small. This estimate also leads us to conclude that the true correlation between the unobservable signals log X j is really strong. Indeed, the contribution of to the variance of log-expressions observed in biological data is much larger than the contribution of estimated independently from the MAQC data, while a strong correlation between true biological signals (i.e., their values in the absence of measurement errors) is the only explanation for such a discrepancy when the number m of genes is very large. This also explains why the Law of Large Numbers (LLN) is not met in microarray data when applied to log-expression levels across genes [12,13].
The situation is no longer the same when we proceed from model (1) in an effort to measure the technical noise stemming from the random nature of the parameter ν. For any gene j, formula (1) gives and it is the parameter ν that plays the role of the technical noise here. It is clear from (13) that the biological variability cannot be entirely eliminated from gene expression signals even when they are produced by purely technical replicates. Designed to assess the technical variability, the experiment described above may only reduce the variance of the r.v.s X ij by eliminating the inter-subject variability, but there will always be some residual biological variability associated with different cells, i.e. "cell to cell" variability. Under such experimental conditions, we have where are i.i.d. r.v.s representing the expression levels of the jth gene in different cells obtained from the same subject and their common (conditional) variance is expected to be smaller than that of X 1j in (13). Formula (14) also suggests that the MAQC data are far from ideal for the purposes of noise assessment because the technical replicates in this study were produced from a mix of many dissimilar tissue sources, this heterogeneity of samples may inflate the variance of while it should be kept as low as possible.
To remove the scaling factor C j from the model (12), when deriving the variance of its noise component, we logtransformed the observed expression signals . This trick does not work for model (14) and this significantly complicates the noise assessment. More complications arise when extending the model represented by formula (14) to include an additive term that describes sources of variation other than ν. Under the extended model, the original expression level of the jth gene in technical replicates is given by ,..., , 1 X ij X ij X j In the presence of the noise component attributable to ν, the error term η does not not need to be array-specific as it essentially reflects the equipment-related optical noise.
Since the overall variance of is expected to be much lower than that of U j [10], one can use technical replicates to make the range of admissible values of z ν (see Section 4) much narrower, thereby providing more accurate estimates of ρ(X, Y) and Δ ρ in accordance with formulas (8) and (11), respectively. This idea of combining information from biological and technical replicates deserves a careful consideration and even a generous investment in specially-designed experiments because it offers an improved experimental protocol that may make the usual correlation analysis, as well as the network inference based on correlation measures, more meaningful. If the idea works in general, the new protocol will require producing a separate set of technical replicates in each biological experiment in order to estimate the range of z ν and then using this estimate to reconstruct ρ(X, Y) in each gene pair. This suggestion is based on a plausible assumption that the variation coefficient z ν is the same for the biological and technical replicates produced by a given biological experiment. To make the estimate of the range of z ν as accurate as possible, it is imperative that the technical replicates be produced from a homogeneous biological material derived either from one and the same subject or at least from the same type of a tissue (initially collected from several subjects) that is used to produce the corresponding biological replicates. While more laborious and expensive, the experiments thus designed may provide a practically workable solution to the problem discussed in the present paper. Some additional thoughts of this kind are offered in Section 7.

The law of large numbers and random summation
The following claims seem to be natural in the context of the model given by formulas (1): 1. The observed expression signal U is a result of summation of the inter-cellular signals X i over a random number of cells ν, thereby defining the basic model structure represented by formulas (1). In what follows, we examine some indirect corroborative evidence for the above claims.
Suppose for a moment that the number of summands ν = k is nonrandom. Then the distribution of the corresponding sum in (1) is M-divisible, i.e., it can be represented as the convolution of M distribution functions. In this particular case, the fouth central moment μ 4 (U) satisfies the inequality [14]: For infinitely divisible distributions, the condition (15) assumes the form Under mild conditions, these inequalities hold in the case of random ν as well [14]. If the inequality (15) is met in real biological data, this fact will lend additional support to the presence of signal summation in microarray technology. When testing the corresponding inequalities for empirical counterparts of the moments μ 4 (U) and σ u in (15) and (16), we observed the event of their violation to be of relatively rare occurrence. For example, the inequality (16) was violated for 18.6% of the 7084 genes in the HYPERDIP data. As expected, this proportion was lower for any finite M in (15). Although there is no objective criterion for declaring this frequency consistent with the property of infinite divisibility, we deem it quite low in view of the fact that μ 4 (U) and σ u in (16) were replaced with their sample counterparts. To corroborate our perception, we generated 7000 independent samples of size n = 88 from a log-normal distribution with parameters (log U) = 0.7 and Var(log U) = 0.09. The experiment was repeated 1000 times. The mean proportion of "inconsistent" cases was equal to 23.3%, suggesting that the random chance of the event under observation may be high even when the underlying distribution is known to be infinitely divisible.
Yet another underpinning for the presence of signal summation is provided by considering the accompanying distributions of random sums. In the classical summation E scheme, the notion of accompanying infinitely divisible distributions was introduced by Gnedenko [15]. This idea was later extended to the random summation by Klebanov and Rachev [16]. Consider the random sum where {ν k , k ∈ Θ}, Θ ⊂ (1, ∞) is a family of positive integer-valued r.v.s independent of X i , i ≥ 1. The r.v. The r.v. ν k is assumed to have finite expectation equaling k for all k.
It is known that the random sum U k can be approximated by its accompanying ν-infinitely divisible random variable S k under the condition of non-negativity of X i only.
Note that the definitions of ν-infinitely divisible and accompanying ν-infinitely divisible random variables can be found, for exmple, in [17]. In this case, it can be shown [17] that the Laplace transform of S k converges to the Laplace transform of U k in the uniform metric as k → ∞.
Since the number of cells ν is expected to be large, it is tempting to apply the Law of Large Numbers (LLN) to the normalized random sum where k is positive integer, and make some predictions based on its behavior as ν k → ∞ (k → ∞) in probability. As before, we will assume that the sequence of nonnegative integer-valued r.v.s ν k is independent of X i , i ≥ 1 and ν k → ∞ (in probability) as k → ∞. The continuous r.v.s X i are i.i.d. and positive. If μ x is finite, it is known [18] that Z k → μ x as ν k → ∞, with both limit relations holding in probability as k → ∞. This is the LLN for random sums.
There is no way of ascertaining whether the LLN is met in real microarray data because the r.v. ν is unobservable.
However, we intend to use this powerful tool to predict certain properties of expression signals and then verify them with real data. In doing so, we rely on the following simple result. The fact that the multivariate limit distribution of Z k is a degenerate one is consistent with the asymptotic behavior of Cov(U/ν, V/ν) (and consequently Var(U/ν), Var(V/ν)) considered in Section 3. Indeed, we have It is easy to show that → 0 as k → ∞. Therefore, the covariance in (20) tends to zero when ν k is large in probability. The same is true for the variances of U k /ν k and V k /ν k of course. While this behavior of the two central moments is consistent with the convergence established in Assertion 1, the correlation coefficient ρ is not welldefined for degenerate random vectors. At the same time, the fact that all components of Z k are asymptotically independent is not in conflict with formula (5) because the latter is valid for any value of ν. Nor does it come into conflict with the observation that the intra-cellular gene expression levels are strongly correlated. When deriving formula (5), we divide Cov(U k /ν k , V k /ν k ) by the product of the corresponding standard deviations of U k /ν k and V k /ν k , which is why the proportionality coefficient cancels out and the uncertainty manifesting itself in the limit distribution becomes resolved.
The results given above imply that, while the r.v.s U/ν and V/ν are asymptotically independent, the correlation between the components of each pair (X i , Y i ) may be arbitrarily strong even when ν takes on large values with high probability. Such "paradoxical" situtions are not uncommon in the theory of probability. It should be emphasized that the above assertion is valid for the random sum of ν i.i.d. random summands normalized by the same random variable ν, and not for other possible ways of normalization. For limit theorems related to the random sums normalized by sequences of nonrandom numbers we refer the reader to [19,20].
Now we are in a position to make and test the following two predictions: Prediction 1. The ratios of the observed expression levels U j and U r , j ≠ r, where j, r = 1,..., m and m is the total number of genes, tend to have small variances. The covariance between different ratios U j /U r is expected to be small as well.
Indeed, proceeding from the LLN, we expect the asymptotic relation to hold true as ν → ∞ in probability. This suggests that every ratio U j /U r is virtually constant (across arrays). The above-proven assertion also suggests that every two ratios of the form:U j /U r and U l /U q (with different indices) have small covariances.
To verify Prediction 1, we formed all pairs from 1000 randomly selected genes. The mean (over the gene pairs) standard deviation of the ratios U j /U r (j ≠ r) in the HYPER-DIP data was equal to 0.102, which value is very small compared to the corresponding mean of the estimated expectations (U j /U r ), the latter value being equal to 1.044. The histogram of the standard deviations in Figure  1 illustrates this point further. Shown in Figure 2 is the histogram of the estimated covariances between U j /U r and U l /U q in all quadruples formed from 100 randomly selected genes in the HYPERDIP data set. It is clear that they tend to be small as well. This observation explains the most salient properties of the so-called δ -sequence [12], as well as a remarkable success of significance testing for differential expression of genes when the relevant methods are applied to the elements of this sequence rather than to the original expression levels [12,13].

Prediction 2.
The average (expectation) of the ratio U j /U r is approximately equal to the ratio of the averages of U j and U r , j ≠ r.
Invoking the LLN, we can assert that Proceeding from the representation Histogram of the differences between (U j /U r ) and (U j )/(U r ) estimated by replacing the expected values with the corresponding sample means Figure 3 Histogram of the differences between (U j /U r ) and (U j )/ (U r ) estimated by replacing the expected values with the corresponding sample means. The HYPERDIP data set.  Histogram of covariances between the ratios of gene expres-sions in all quadruples from a subset of 100 genes Figure 2 Histogram of covariances between the ratios of gene expressions in all quadruples from a subset of 100 genes. The HYPERDIP data set.
we can claim that whenever ν is large with high probability.
Replacing the expected values with their sample counterparts, we computed the absolute difference between the left and right hand sides of the equality (22) for all gene pairs formed from 1000 randomly selected genes in the HYPERDIP data set. The resultant histogram (Figure 3) clearly indicates that such differences are very small with the mean (across the gene pairs) being equal to 0.006.
Hence, both predictions appear to be consistent with real data. Similar analyses of the other data sets referred to in Section 3 have confirmed this conjecture.

Discussion and concluding remarks
Methods of network reconstruction from designed gene perturbation experiments are beyond the scope of this paper. The fact that the latter strategy can be limited to mean expression levels makes it fundamentally different from the inference based on genome-wide expression measurements. Some limitations of the network inference from gene perturbation experiments have been discussed by other authors (see, e.g., [1]). The multiple testing aspect of the problem will not be touched upon either despite its direct bearing on this type of data analysis.
The world of stochastic phenomena is complex and uncanny. Intuition is not the best guide in that world. Some stochastic effects may seem to defy common sense but nevertheless they may be very real from the theoretical and practical perspectives.
In the present paper, we describe and explore to the best of our ability the impact of random signal aggregation on the correlation structure of microarray gene expression data. While our analysis of real data suggests that this impact may be deemed reasonably moderate in some situations, the main concern still remains because the estimates employed are not sufficiently stable and the underlying model may still need further refinements. A similar concern arises in regard to other standard measures of dependence such as the mutual information. The latter characteristic is applied extensively to the same data structure for the purposes of relevance network inference [21], thereby calling for a similar investigation of its properties.
We overlooked the phenomenon of signal aggregation when discussing the correlation structure of microarray data in our earlier publications [12,9]. The results of [12] connected to the use if δ-sequences find now theoretical basis (see Prediction 2). The influence of signal aggregation on Type-A dependency remains not completely clear.
In [9], we tried to make the case that the observed strong and long-ranged correlation between gene expression levels are of a biological nature rather than a technical flaw of the microarray technology. Our belief was based on the premise that the effects of the technical noise [10] and multiple targeting [9] on the correlation structure of microarray data appeared to be negligible. There is no reason to revise this premise. However, the effect of random summation of expression signals reported in the present paper is a drastically different story. While technical in nature, this effect represents a serious obstacle standing in the way of correlation analysis and network inference. At the same time, the estimates reported in the present paper still indicate the presence of strong correlations between the expression signals produced by different genes at the level of individual cells.
There are statistical questions, other than the estimation of correlation coefficients, that may be relatively insusceptible to the effect of signal aggregation. For example, we hypothesize that it may still be sensible to compare correlation vectors associated with each gene in two different phenotypes in order to extract more information on pathogenesis of some diseases or responses to drug therapies. However, this conjecture invites a special investigation. It is clear that the crux of the difficulty has to do with a natural desire to make inferences about "microscopic" processes of transcription within individual cells from "macroscopic" observations yielded by gene expression measurements. From this perspective, the mixing effect caused by signal summation should be considered as confounding [22] and, as such, is undesirable. Needless to say, one can employ the correlation coefficients between observed expression signals as more global characteristics of the cell system under study rather than associations between gene activities within each cell. Such characteristics still represent a source of useful information. From this viewpoint, the results of correlation analysis of gene expression data can be interpreted in terms of aggregated (over the cells) genes, an obvious departure from the interpretation that has been in wide use among molecular biologists and bioinformaticians. Since tissue-specific mechanisms regulating cell functions are not well-understood, it is premature to judge whether or not this cautious interpretation is of biological interest.
The most critical question still remains: How can the true correlation be extracted from observed expression levels despite the masking effect of signal aggregation? At present, we have no satisfactory answer to this question. However, some practical expedients mitigating the adverse consequences of signal aggregation can be envi- sioned. As discussed in Section 5, one approach is to combine the information provided by technical and biological replicates using the mathematical treatment of the problem presented in this paper. Yet another possibility is to modify the experimental protocol so that the total DNA rather than the total RNA be kept constant across the arrays (see Remark 1). The rationale for this suggestion is that the correlation between the parameter ν and the total number of cells (gauged by the DNA content) in a given biological sample may well be stronger than that between ν and the total RNA. The main problem with hybridiza- It is easy to see see, that the inequality implies because μ ν and ξ 2 (ν) are the same for all genes. Therefore, the coefficient ρ(U, V) ξ u ξ v defined by (23) preserves inequalities between the corresponding coefficients for (X, Y). In this connection, it is important to recall that the variation coefficient of observed expression levels is almost constant across genes, a fact mentioned in Section 4. Under such conditions, the inequalities (24) and (25) imply the same inequalities for the corresponding correlation coefficients. This suggests that the ranking of gene pairs by the correlation coefficient may still be possible and such inference can probably be improved by stratifying the population of genes by the value of the variation coefficient. Whether this observation is of real utility in studying relationships between genes within the network paradigm has yet to be explored.
The future of the whole research area dealing with regulatory networks hinges on our ability to surmount the obstacle described in the present paper either by means of mathematics (including the recourse to parametric methods) or through radical technological improvements.

Reviewer 1 (B.S. Kim)
The authors (K & Y, hereafter) revisited an old issue of statistics, i.e, the problem of aggregation, in a new biotechnology area with a more complicated mode. It is an old issue in statistics that the correlation at the aggregated level may be quite different from the correlation at the individual level. This phenomenon is often referred to as the Simpson's paradox (Simpson, 1951), or the ecological fallacy (Robinson, 1950). Yule and Kendall (1950) also dealt with this issue in Chapter 13. The primary difference of K&Y's approach is that the number of components in the aggregation is regarded as random, because the number of cells in a tissue, a target material on the microarray slide, is not controlled to be fixed under the current technology and hence subject to the random fluctuations.
As K&Y indicate, making inference on the genetic regulatory network (GRN) depends heavily on the finding the true correlation on the individual cell level, not on the aggregated level.
This paper deals with one of the basic and fundamental issues in statistics and biology.  Table I next to "Average ξ n " column, say.

Minor Points
6. p. 11 lines 8-9 from the bottom. The measurement error is another source of errors in the microarray experiment in addition to technical and biological variations (Churchill, 2002). It is better to distinguish the measurement error from the technical noise here and throughout the manuscript.
opposed to what was done and why. One suggestion would be to include the beginning part of section 2 as part of the introduction, explaining the observation noted by Chu et al., as section 2 is a bit lengthy in its current form.
3) Aggregated Expression Intensities. In remark 2, the discussion about heterogeneous experiments and PCC, thought important, in its current form, appears a bit disjoint from the rest of the text and perhaps could be either greatly shortened or removed.

4)
Equations. The number of models/formula within the manuscript severely detracts at time from the equally important biological content. One suggestion may be to devise an appendix for some formula and calculations presented. Of note, I did not check the formula and assumed that they were correct.
Minor Comments 1) p.2. section 2, line 3:may consider removing the word 'adequately' since it assumes that technology in its current form does reflect intra-cellular signal but is deficient.
2) Table 1. it may be useful to include the number of genes examined in each dataset to obtain estimates.
The authors' responses are provided in the text.

Reviewer 3 (G. McLachlan)
This paper provides a theoretical account of signal aggregation on the correlation between the measured expression levels between pairs of genes.
The approach and the results derived are quite novel and I recommend its publication in the Journal.
Authors: Thank you so much!