Harnessing the complexity of gene expression data from cancer: from single gene to structural pathway methods

High-dimensional gene expression data provide a rich source of information because they capture the expression level of genes in dynamic states that reflect the biological functioning of a cell. For this reason, such data are suitable to reveal systems related properties inside a cell, e.g., in order to elucidate molecular mechanisms of complex diseases like breast or prostate cancer. However, this is not only strongly dependent on the sample size and the correlation structure of a data set, but also on the statistical hypotheses tested. Many different approaches have been developed over the years to analyze gene expression data to (I) identify changes in single genes, (II) identify changes in gene sets or pathways, and (III) identify changes in the correlation structure in pathways. In this paper, we review statistical methods for all three types of approaches, including subtypes, in the context of cancer data and provide links to software implementations and tools and address also the general problem of multiple hypotheses testing. Further, we provide recommendations for the selection of such analysis methods. Reviewers This article was reviewed by Arcady Mushegian, Byung-Soo Kim and Joel Bader.


Background
The early driving forces in biology were reductionist approaches. In general, a reductionist approach tries to break-down a complex system into its parts list and explains its properties as the sum of its individual components. Hence, the individual constituents of a system inform its higher level functions [1][2][3][4]. However, the 'one gene, one protein, one function' working hypothesis [5] is not sufficient in order to explain the many emergent properties such as the phenotypic variability of organisms or the heterogeneity of cancer [6]. For this reason, nowadays, it is generally acknowledged that for achieving a functional understanding of biological systems, the genes in a cell need to be studied as a functioning collective [2,3,7]. In such a system, the collective functioning of groups of genes results in, for instance, signaling pathways *Correspondence: v@bio-complexity.com Computational Biology and Machine Learning Laboratory, Queen's University Belfast, Belfast, UK or protein complexes that regulate cell differentiation, transcription regulation or growth.
A systems integration at the cellular level has the potential to answer many, until now, unsolved questions about biological systems and their collective functioning, regulatory programs for growth, development, phenotypic variability and the causality of many complex diseases [8][9][10]. Due to the enormous complexity of a cellular system, where many processes and interactions at different levels inside a cell work in harmony to assure the vital functioning of a cell, we need to understand key properties of biological systems like its robustness or modularity [2,8] in order to enhance our understanding of complex diseases. These complex interactions occurring within a cell can be described by networks [11][12][13], including gene regulatory networks [14,15], protein-protein interaction (PPI) networks [16,17], metabolic networks [18] and transcription regulatory networks [19,20]. The networks are organized at different cellular levels and enable the functionality of the cell. The question now arising is http://www.biology-direct.com/content/7/1/44 how can the complexity inside a cell be understood, and analyzed?
The development of information processing technologies in the post genomic era enabled the generation of huge amounts of data. In this review, we focus on gene expression data from microarray platforms and summarize three major types of analysis strategies: (I) Identification of changes in single genes, (II) identification of changes in gene sets or pathways, and (III) identification of changes in the correlation structure within pathways. We discuss these methods in the context of cancer data sets to emphasize their biological meaning, implications and expressiveness.

Large-scale gene expression data
In the next section, we briefly review high-throughput technologies that enable the generation of large-scale gene expression data [21][22][23].

Gene expression data from microarray
A microarray experiment measures genome-wide gene expression levels of mRNA in a cell or a tissue sample under a particular condition. A microarray chip quantifies the hybridization of fluorecsent labeled target nucleotide sequences to defined complementary probe sequences that are spotted on a glass or silicon slide. For different microarray platforms the spotted probes are synthetic oligonucleotides ranging from 25 to 80 nucleotides or long cDNA transcripts. Different microarray platforms were designed for a single-channel or a multi-channel experimental setting. For single-channel arrays each condition sample is hybridized separately on individual arrays using a single dye. For multi-channel arrays multiple conditions are hybridized together on individual arrays using multiple dyes. For example Affymetrix is a single-channel platform, where multiple oligonucleotide probes (probeset) of 25 bases are used to measure the concentration of a mRNA transcript. The target mRNAs of expressed genes are extracted from a treatment or a control sample, reverse transcribed to cDNAs, labeled with a fluorescent dye and then hybridized to a microarray. An image of the microarray captures laser induced emitted fluoresent intensities of the probes at each spot. The intensities give a proportional measure of the corresponding mRNA concentration for each gene that was defined on the microarray.
RNA-seq has a wide variety of applications such as the measurement of gene expression levels from transcribed mRNA sequences [25]. In the first step of the procedure RNA is extracted from a given condition sample, fragmented, reverse transcribed to cDNA that is ligated to adapters. In the second step a library of reads is generated from the ligated fragments that are sequenced. In the third step the reads are mapped to known exon sequences of genes. The expression level of a gene is measured from the normalized number of mapped sequences that mapped to the known set of exon sequences of a gene. The RNA-seq transcriptome sequencing approach overcomes several limitations of microarrays for measuring gene expression. For example, RNA-seq measures large ranges of expression levels from very low to highly expressed genes and is able to consider unknown transcribed sequences. Since the novelty of the methodology, gold standard procedures for the management and processing of the data are currently being established.

Gene expression data and cancer
Cancer is a multifactorial disease, i.e., the detection of one mutation in one gene cannot explain the phenotypic plurality of carcino-and pathogenesis by a one-to-one relationship between genotype and phenotype. Instead, cancer can be induced by a multitude of genetic and environmental factors and the accumulation of such events. The intervening of such complex factors makes in general the characterization of complex diseases difficult. For this reason it is astonishing that the seminal work by Weinberg et al. [6,26] presented a relative simple, systematic functional framework for cancer and the role different biological key processes are playing. In this paper, the so called hallmarks of cancer have been defined. According to [6], the hallmarks of cancer (see Figure 1) are: • self sufficiency in growth signals • insensitivity to anti-growth signals • evading apoptosis • limitless replicative potential • sustained angiogenesis • tissue invasion and metastasis Later, this list has been extended by adding two further hallmarks [26] • deregulating cellular energetics • avoiding immune destruction and also two enabling characteristics • genome instability • mutation and tumor-promoting inflammation http://www.biology-direct.com/content/7/1/44

Avoiding immune destruction
Limitless replicative potential enabling characteristics extended hallmarks Figure 1 The hallmarks of cancer and enabling characteristics [6,26].
It has been recognized that these hallmarks are gradually acquired by different types of cancers, potentially, in a variable order. This variability in the acquiring of these disease-bearing processes is one of the indicators of the complexity of cancer.
The biological processes in a cell are controlled and regulated by signaling pathways that are activated by internal and external signaling receptors and factors. The signaling pathways governing growth and cell proliferation are likely dysregulated in their functioning in cancer. For example, they become insensitive to anti-growth signals, or they are dys-regulated in growth signaling pathways by gaining autonomy in their growth. It is assumed that interaction changes at various levels (genetic, mRNA or protein) lead to the unlimited growth of cells instead of the up-regulation or down-regulation of a single gene. Further, sometimes, even a moderate change in the expression of a group of genes can lead to a significant change in the biological function of an organism [27].
Currently, the underlying processes that contribute to cancer are being intensively investigated. However, so far, the molecular causes that initiate and maintain cancer are not well understood. For this reason, the understanding of gene expression profiles, which provide signatures of all the active genes and their interconnections in a cell, contain valuable information about the functioning of key pathways, as expressed by the hallmarks of cancer and, hence, enable a practical investigation of functional mechanisms thereof [28][29][30][31][32][33][34][35][36]. Despite the different focus of many studies of different cancer types, common themes in the form of 'key pathways' can be found throughout. For instance, the NF-κB pathway involved in the cellular responses to external stimuli like cytokines or free radicals, and immune response to infection [29,[37][38][39]; the MAPK signaling pathway responsible for regulating growth factor signaling including the RAF, MEK, and MAPK cascade [34,39,40]; the p53 signaling pathways involved in DNA damage control, apoptosis and inhibition of angiogenesis [37,41,42]; or the Wnt signaling pathway involved in cell differentiation, and cell polarity [31,34,36,38].

Formulating biological hypotheses
A main goal of high-throughput gene expression analysis is to identify differentially expressed genes or gene sets between two or more conditions to enable a functional interpretation of the underlying condition-specific mechanisms. The biological processes at the gene level are complex in nature as they dynamically interact with each other. A single gene can participate in different biological processes and regulate different genes at different time points. The identification of key genes or pathways is a difficult task, because their interactions are unknown. We only observe the phenotypic outcome of test conditions http://www.biology-direct.com/content/7/1/44 and the corresponding gene expression patterns measured from a tissue or cell culture. Univariate and multivariate statistical methods can be applied in order to understand such differences from a statistical perspective. The first type of approach that has been used to identify changes in the gene expression is a differential gene expression analysis. This approach is commonly used to compare different conditions of microarray samples to identify differences between them. As a result, a single gene analysis approach gives a list of genes that show a statistically significant difference between two conditions. For cancer, such genes may correspond to oncogenes or tumor suppressor genes.
If we consider the underlying network where different biological functions are being described by groups of interacting genes, a single gene analysis does not resolve the biological functions that are affected primarily in disease conditions and are causal factors of the disease. In order to get a systematic understanding of the disease or phenotypes we have to first understand what biological functions contribute to these changes, and perform a comparison between conditions using groups of genes defined by biological pathways. This approach leads to comparing gene expression data at the pathway level where sets of genes are tested for differential expression.
Another interesting property that can be extracted from gene expression data is the correlation structure of gene expression profiles between all genes. This correlation structure shows associations between genes which directly or indirectly interact with each other [8,[43][44][45]. Comparative analyses of gene pathways that consider the correlation structure of expression data can provide a suitable test for the hypothesis of changes in the underlying network.
In summary a gene expression data set can be used to (I) identify differentially altered single genes, (II) identify differentially expressed gene sets or pathways, and (III) identify differentially correlated pathways. In the following sections, we review statistical methods that have been introduced to study the three problems (I-III) above. In Figure 2 we give a graphical overview of the such methods.
Before we procede, we would like to point out that all of these methods test statistical hypotheses [46]. That implies that in order to understand a particular method biologically, i.e., one is capable of providing a biological interpretation, one needs to understand the underlying  The data are used to test differential expression and co-expression between two conditions for single gene and gene set based approaches. http://www.biology-direct.com/content/7/1/44 null hypothesis. In our opinion, it is helpful to approximately categorize all statistical hypotheses into three categories with respect to their biological interpretability, whereas each category represents a different degree of difficulty to find a biological interpretation for a hypothesis.
In the following, we provide a brief discussion of these three categories because it enables a better, potentially, more plausible understanding of the methods presented in the next sections.
In category one belong all hypotheses for which it is relatively easy to find a meaningful biological interpretation. An example from this category are tests that compare mean values (μ), e.g., to identify the differential expression of genes (section 'Differential expression of a gene'). That means these tests use the mean as a test statistic. Due to the fact that the underlying (probability) distribution of the genes represents, biologically, the activity of the gene expressions, the interpretation of a null hypothesis is directly derived thereof. For this reason the biological interpretation of the rejection of the null hypothesis given in Eqn. 2, is intuitively clear and appealing, because it implies a change in the (mean) expression of genes which may indicate a change in a biological function because the number of available proteins may be altered.
In category two fall tests for which there are several alternative biological interpretations. This makes the interpretation of such tests ambivalent from a biological perspective. As example for such a test, we consider the detection of the differential variance of a gene (section 'Differential variance of a gene'). Despite the fact that the underlying probability distribution of the expression of genes has a clear biological interpretation, the biological interpretation for the rejection of the null hypothesis in Eqn. 4 is not unique. For instance, a gene could have a different variance in two conditions because, e.g., in condition one it is periodically expressed, whereas in condition two it is constantly expressed on an intermediate level. The former condition may be related to the cell cycle or the circadian rhythm, or periodically triggered by an external signaling factor that is released by the administration of a medication that is regularly taken. A second equally plausible interpretation could be that in one condition the cell utilizes parallel pathways to transfer a signal whereas in the other condition only one signaling chain is used. The reason for the utilization of parallel pathways could be triggered by stress factors, e.g., in the presence of an infection, so that the cell is 'running' full power in order to execute all necessary programs that have been initiated by the presence of the intruder.
Lastly, for tests in category three it is very difficult to find sufficiently precise biological interpretations because, statistically, these methods test 'complex' expressions. An example for a test from this category is the N-statistic (section 'N-statistic'). The null hypothesis is based on the comparison of two distributions rather than scalar test statistics. In order to clarify the crucial difference between the comparison of two distributions and scalar test statistics we note that, theoretically, every probability distribution can be written as a series expansion in its moments [47]. This means a test for a distribution, compares implicitly the moments of this distribution. Here the (k-th) moment is defined as the expectation value of a random variable (to power k), i.e., An example for a moment is the mean (which is the first moment), other examples of entities that can be expressed as a function of moments are the variance and the kurtosis. This means whenever the null hypothesis in Eqn. 39 is rejected it could be because of a difference in any moment of which there are, theoretically, infinite many. Put differently, this kind of unspecificity makes this test very powerful in the sense that it may detect any possible difference two distributions can exhibit. On the other hand, if the null hypothesis is rejected it is very difficult to identify a precise reason for its rejection. For instance, this could be related to a difference in the mean, variance, kurtosis or any higher moment or function thereof. These combinatorial factors do usually not allow to find a concise biological interpretation. Nevertheless, such a test can be of valuable use, e.g., for diagnostic purposes.

Single-gene analysis
Single-gene based methods can be subdivided into three major classes. A) Methods for detecting differential gene expression, B) methods for detecting differential correlation, and C) methods for detecting a differential variance.

Differential expression of a gene
The analysis of differential gene expression is based on the mean expression change of individual genes. Suppose for a gene g i in a microarray data set the mean expression value for two conditions are μ 1 and μ 2 respectively. Then the null hypothesis for the differential expression of the gene g i is defined as A gene is called differentially expressed when H 0 is rejected. Figure 3A shows an example where the samples for two conditions are drawn from two normal distributions with different mean values, i.e., N(μ 1 = 0, σ 1 = 1) and N(μ 2 = 1, σ 2 = 1).
The first published studies for gene expression analysis selected differentially expressed genes based on a foldchange criteria between a treatment and control condition [48]. For example, an early application of this measure was http://www.biology-direct.com/content/7/1/ Examples for different biological hypothesis that can be formulated based on different properties of gene expression data (green -control group (condition 1), red -treatment group (condition 2)). E: corresponds to the covariance matrix, corresponding to a constant but different matrix for the two conditions. http://www.biology-direct.com/content/7/1/44 used to compare normal colon epithelium and primary colon cancers [49]. Since then, many statistical approaches have been developed to provide more robust measures. Among the most popular methods are, e.g., SAM [50,51], limma [52], and the empirical Bayes approach from Efron et al. [53].

Differential variance of a gene
The analysis of differential variability (DV) aims to detect a change in the variance of the gene expression values [54]. Suppose in a microarray expression data set, the mean expression value of a gene i is μ c and its variance σ 2 c , for condition c = {1, 2}. Then, the null and alternative hypothesis tested are: (5) A gene is called differentially variable when the null hypothesis H 0 is rejected. The DV analysis in [54] tests H 0 by using a F-test. In Figure 3B we show an example for a gene with a constant mean, but a changed variance in the two conditions. The samples for the two conditions are drawn from a standard normal distribution with the same mean but different variances for the conditions, i.e., N(μ 1 = 0, σ 1 = 1), and N(μ 2 = 0, σ 2 = 2).

Differential correlation of a gene
The analysis of differential correlation aims to detect changes in the dependency structure of a single gene [55]. Suppose r i = r i1 , ..., r ip denotes a p − 1 dimensional correlation vector, whereas each component corresponds to the correlation between gene i and one of the other p − 1 genes in a data set. Then for r i one obtains distribution functions, denoted by F A r i and F B r i , for condition A and B and the following hypotheses:

Gene-pair analysis
The functional activities of genes, as measured by gene expression values, reflect the interplay of the genes and their products in the underlying gene network. The objective of a gene-pair analysis is to identify either differential co-correlated or differential co-expressed pairs of genes, instead of individual genes. The reason for looking for pairs of genes is that the concerted changes in genes is due to their common membership in biological pathways.
The principle idea to detect correlation changes in genepairs is visualized in Figure 3D. The data are sampled from a multivariate normal distribution with a constant mean vector for both conditions, μ 1 = μ 2 = (0, 1), but a different correlation of ρ 1 = 0.8 and ρ 1 = −0.2. The point is despite no difference in the mean expression of the gene-pair, there is a difference in their correlation.
In [56] a method (CorScor) has been proposed to identify such gene-pairs. In Figure 4 we show three cases of the joint distribution of expression values of two genes, for two conditions. In this Figure we are showing simulated data for three possible changes in the co-expression of a pair of genes in two conditions. The samples in Figure 4A and B are drawn from a multivariate normal distribution with μ 1 = {5, 5} and μ 2 = {5, 7}. For Figure A the correlation between gene-pairs is ρ 1 = ρ 2 = 0.9 and for Figure  B it is ρ 1 = ρ 2 = −0.9. For Figure 4C the samples are generated from a multivariate normal distribution with μ 1 = {5, 5} and μ 2 = {5, 5} and the average correlation between gene-pairs is ρ 1 = 0.9 and ρ 2 = −0.3.
In the first two cases ( Figure 4A and B) the correlation of the gene-pairs show a condition specific shift, in [56] denoted as a gap and substitution. In the third case ( Figure 4C), the gene-pairs show a reversed correlation between the two conditions, denoted as on/off case. To identify gene-pairs in these two types of conditions, two scoring functions have been suggested in [56] given by: Here each of the three correlation coefficients are estimated for a gene-pair between gene i and j, i.e., ρ A = ρ A (i, j) etc. The value of ρ corresponds to the global correlation coefficient of the gene-pair over the two conditions (A, B) and ρ A , ρ B are the correlation coefficients of the gene-pair for condition A and condition B. In Eqn. 8, α is a tuning parameter that governs the balance between separation and parallel alignment. In [56] it was argued to use a value of α = 1.5. The null and alternative hypotheses tested are: In [57] the 'expected conditional F-statistic' (ECFstatistic) has been introduced to measure the differential co-expression of gene pairs (X, Y ). The method is based on a modified F-statistic, where the variance and the mean parameter of the statistic are estimated from a mixture of two normal distributions.
The R package R/EBcoexpress provides an empirical Bayesian implementation to identify the differential coexpression of gene pairs [58].
Another method called liquid association (LA) has been proposed in [59] to identify co-expressed gene pairs. In contrast to pairwise correlation measures, the LA method considers the presence of a mediator gene, Z, for observing a co-expression between two genes at a given cellular http://www.biology-direct.com/content/7/1/44 state. Let X, Y and Z be gene-expression profiles. We say that X and Y form a liquid association pair (LAP), if the cellular state of X and Y is correlated with Z. The LA score of X and Y with respect to Z is estimated from rank transformed expression profiles of X, Y , Z given by Here m corresponds to the number of samples. The LA method uses a permutation test for the identification of significant LA gene pair values. Due to the high computational burden of the method that would require N 3 (N is the number of genes) evaluations of Eqn 11 plus additional permutations of the data, which is even for only N = 10 3 genes intractable because it requires already more than 10 9 evaluations. For this reason the method is only used to (A) find the gene Z for a given pair of genes or (B) find the LAP, X and Y , for a given gene Z.

Gene set and pathway analysis methods
Generally, a pathway is a group of interacting genes (a gene set) that deploy a cellular function. In a biological system the biological processes are coordinated functions of sets of genes which make the organism work. Some general pathways are, e.g., metabolic pathways, signaling pathways or regulation pathways that represent minimal functioning units of a cellular system. The consideration of pathways or gene sets for a comparative gene expression analysis is an important step toward the exploration of relevant functional mechanisms of a cell.
So far, many multivariate and univariate tests have been proposed for a gene set analysis, see Figure 2. Finding differentially expressed pathways, instead of individual genes, is not straight forward from a statistical and biological perspective and there are several hurdles to this approach. The first is presented by the data themselves, because the number of variables is usually (much) larger than the number of samples, i.e., n << p, that leads to many estimation problems. The second hurdle is our incomplete information about the constitution of biological pathways and the potentially high overlap of genes between different pathways. For example, databases like GO [60] provide valuable information about genes for a large variety of different organisms. However, http://www.biology-direct.com/content/7/1/44 this information is not static but continuously expanding leaving us at the moment with a snap-shot of knowledge. This makes it difficult to find precise definitions for particular pathways of interest. The third problem comes from the underlying gene network structure that describes the true interactions between genes in a pathway. Here, the problem is that as a result of such interactions among genes it is usually not appropriate to assume their independence, as frequently done for statistical ease.
A motivating example for the general idea underlying gene set methods is shown in Figure 3C. For condition 1 (green), the samples are drawn from a multivariate normal distribution with μ 1 = {2, 2.2, 2.4, 2.6, 5, 8} and for condition 2 (red) μ 2 = {2, 7, 7.5, 2.6, 5, 8}. The covariance matrix, 1 = 2 is for both conditions the same. In this Figure, only 2 of the 6 gene are differentially expressed. This reflects biological situations because, usually, only of fraction of the genes belonging to a pathway is found to be differentially expressed. However, due to the fact that gene set methods are based on the expression of a set of genes such methods borrow strength from the combined analysis of the genes.

Null hypothesis for gene set analysis
Gene set analysis methods can be broadly divided into two major categories, depending on what null hypothesis is tested. The first type of methods are called competitive methods, and the second type self-contained methods [66]. Briefly, self-contained tests use only the data from a target gene set under investigation, whereas competitive tests use, in addition, also data outside the target gene set (background data). In the following we describe popular competitive and self-contained pathway methods.

Competitive gene set and pathway methods
In Table 1 we show an overview of gene set and pathway methods, described in the following.

GSEA
The gene set enrichment analysis method (GSEA) [27,68] is one of the most widely used competitive test based method. The test uses a Kolmogorov-Smirnov test statistic to identify differential expressed gene sets. The gene set and background data set are defined in the following. Let W be the target gene set to be tested and W c its complement in a way that the union of both sets defines all genes, i.e., V = W ∪ W c , in the data set. The hypotheses tested by GSEA with respect to an enrichment score (ES) are: Briefly, GSEA consists of the following steps, applied to each pathway: (1) Estimation of gene-level test statistics.
(2) Rank ordering of the test statistics.
(3) Calculation of an enrichment score (ES) for a pathway based on the gene-level test statistics. (4) Permutation of the gene-labels to estimate the significance of the enrichment score for the pathway.

GSEArot
GSEArot (gene set enrichment analysis rotation) [70] is very similar to GSEA, but uses a different approach to randomize data in order to assess the significance of a target pathway. More specifically, a data matrix X is randomized by, first, rotating X around a random angle δ, resulting in a matrix X(δ). Second, from the matrix X(δ), the randomization matrix is obtained by a QR decomposition [76]. In [70] it is argued that this procedure has an advantage for small sample sizes, when only very few permutations are achievable from sample-label permutations. The null hypothesis tested by GSEArot is the same as for GSEA.

Random set
The random set method introduced in [73] is a parametric test that is a generalization of Fisher's exact test in the sense that enrichment scores of gene sets are compared with randomly formed sets. The enrichment scores are based on single gene-level test statistics reflecting their differential expression.
1. Estimate the enrichment score of a target gene set W , Here s i are gene-level scores, e.g., t-scores, and m = |W | is the number of genes in the target pathway. 2. Estimate the enrichment score and its variance of the background gene set with p = |W ∪ W c |. 3. Estimate the standardized enrichment score The Z score follows a standard normal distribution under the null hypothesis H 0 given by: It is notable that Z can be calculated without a numerical randomization of the data. Further, the background data consist of all genes V , including the ones in the target pathway W . In [77] this method has been applied to head and neck and cervical cancer for human papillomavirusespositive and -negative samples.

GAGE
GAGE [71] (generally applicable gene set enrichment) is also a parametric test that, similarly to GSEA, compares the expression in a target gene set with that of the background. But instead of using a Kolmogorov-Smirnov like test [78] it employes a two-sample t-test. The principle steps of the method are as follows: 1. Estimate the mean fold change f and its standard deviation σ f for the m genes in the target pathway W . 2. Estimate the mean fold change f and its standard deviation σ f for all p genes in the background gene set V = W ∪ W c . 3. Estimate the t-score: degrees of freedom.
Also GAGE employs all genes V in the background gene set, including the ones in the target set W . The underlying assumption of GAGE is that the (mean) fold changes of genes are independent and identically distributed. The null hypothesis tested by GAGE is:

GSA
Another method is GSA (gene set analysis) [69]. The method, first, calculates z-scores, z i , with i ∈ {1, . . . , m}, for all m genes in a given target pathway W . Then each z-score is transformed into two scores, assessing the sign of z i . This results in two sets of non-zero scores S + = {s + (z 1 ), . . . , s + (z m )} and S − = {s − (z 1 ), . . . , s − (z m )} from which their mean value is calculated, Finally, the maxmean test statistic is defined by s mm = max{s + ,s − }, giving the test statistic for the target pathway.
The null distribution is assessed by a restandardization, combining a sample-and gene-label permutation.

Self-contained gene set and pathway methods
In Table 2 we show an overview of self-contained gene set and pathway methods that are describe in the following in more detail.

Sum of t-square
The sum of t-square test is an univariate test that is based on t-scores, {t i }, individually obtained for each of the m genes in a given set [95], see also [79]. That means that each t-score assesses the difference of the mean expression between the two conditions, Non-linear tests for identifying differentially expressed genes or genetic networks [87] Pathway-express [90] Signaling Pathway Impact Analysis [91] SPIA (Bioconductor) SEPEA [92] PARADIGM [93] Gene set analysis exploiting the topology of a pathway [94] IPS (available upon request) with s i the pooled standard deviation. The test statistic for a pathway is based on the individual t-scores given by Because for each gene μ i = 0 should hold if a gene is not differentially expressed, the null and alternative hypothesis can be formulated as: The significance of TS is assessed from sample-label permuted data.

SAM-GS
The method SAM-GS (Significance Analysis of Microarray for gene sets) [81] uses the test statistics, with d k =x 1k −x 2k s k +s 0 . Herex 1k andx 2k are the sample means of the control and treatment condition of gene k, s k corresponds to its pooled standard deviation and s 0 is a constant for a sensitivity adjustment. The null and alternative hypothesis can be formulated as: Statistical significance of SAM-GS is again assessed from sample-label permuted data.

Hotelling's T 2
The Hotelling T 2 test is a self-contained test that is a multivariate generalization of the univariate t-test. Its null and alternative hypothesis can be formulated as: Suppose we have two groups with n C samples from the control group and n T samples for the treatment group, each consisting of m genes. Let the expression level of the i th sample of the control group and treatment group be given by respectively. The pooled covariance matrix S is then defined by where S C and S T are the covariance matrices for the control and treatment group. Hotelling s T 2 is defined as The inverse of the covariance matrix is estimated via the shrinkage estimator [96][97][98][99]. The statistical significance of the test statistic T 2 is estimated from sample-label permuted data.

N-statistic
The N-statistic is a non-parametric test that is used to test the equality of two distributions. Suppose the expression level of the i th sample of the control group, n C , and the treatment group, n T , is given by . . n C } correspond to the control data-set and i ∈ {1 . . . n T } to the treatment data-set. The null and alternative hypothesis tested by the N-statistic can be formulated as: whereas F C (x) and F T (x) are two multivariate distribution functions from the control and the treatment condition. The N-statistic itself is defined as follows: Here , is the Euclidean Kernel serving as distance function between the expression values in the two conditions.

Linear model-based pathway methods
There are also several approaches that utilize either a linear or a generalized linear modeling framework for a gene set analysis. Examples for such methods are Global test [82], Extension of GSEA [80] or GlobalAncova [83].

Topological pathway methods based on existing network information
Some recent univariate methods, for instance, Pathwayexpress [90], SPIA [91] or SEPEA [92], use instead of correlation measures to estimate interactions among genes, predefined topological information as provided, e.g., by the KEGG database [100]. These methods assign each gene in a pathway a score that is based on the position of a gene in the given network structure and, finally, aggregate these individual gene scores to obtain a score for the pathway itself. Yet another approach is provided by PARADIGM [93]. This method uses a factor graph model combining gene copy number variation data with gene expression data for the identification of differentially expressed pathways.

Iterative Proportional Scaling: IPS
IPS (Iterative Proportional Scaling) [94,101] is another method that utilizes the topology of pathways of a given gene set by testing the hypotheses: (43) In this method, the covariance matrices, c 1 , c 2 , are estimated from the data, for both conditions, using the Iterative Proportional Scaling (IPS) algorithm. The inverse of the estimated covariance matrices are positive definite (concentration) matrices for which it is assumed that the non-zero elements in −1 c 1 and −1 c 2 are identical; this is the meaning of −1 c 1 , −1 c 2 ∈ S + (G) where S + indicates the class of all symmetric positive definite matrices with non-zeros elements given by the binary matrix G. This means that the concentration matrices, −1 c 1 and −1 c 2 , have identical zero element, but are allowed to have different non-zero entries. In other words, it is assumed that the underlying topology of a pathway is the same for condition c 1 and c 2 , given by G, whereas G ij = 0 corresponds to an 'absent' interactions among the genes i and j. Since the structure of G is not estimated from the data, it is necessary to obtain it from an independent source, e.g., from the KEGG database or Reactome [100,102]. In [94] it is shown that a log likelihood (log( )) ratio test can be used to test for the equality of the concentration matrices for the two conditions and that asymptotically the log likelihood ratio follows a Chi-square distribution with r + m degrees of freedom, i.e., log( ) ∼ χ 2 r+m , whereas m is the number of genes in the pathway and r is the number of non-vanishing edges in G corresponding to the fixed interaction structure of the pathway.
The IPS method has been used in [94] to study acute lymphocytic leukemia with and without BCR/ABL gene rearrangement. As a result, the JUN oncogene with RAS/MAPK/JNK followed by NFAT and NFKB seem to be crucial in distinguishing BCR/ABL positive and negative patients.

Differential correlation/interaction methods
In the previous sections, we discussed different gene set and pathway-based methods for the identification of differentially expressed pathways. These methods focused either only on the expression of genes, or considered an underlying interaction topology among the genes as taken from an independent source. However, even when these http://www.biology-direct.com/content/7/1/44 methods considered an underlying network structure, this structure was assumed to be the same for the 'treatment' and 'control' group.
In contrast, in this section we discuss methods that estimate the correlation/interaction structure of the genes within pathways, for each experimental condition. The underlying rationale for these approaches is to assume that the expression profiles of genes are dependent on each other [103,104] as the genes in a pathway interact, either directly or indirectly [105]. This assumption results from the observation that genes with similar functions or cellular localization are often co-expressed and cluster together. The methods discussed in this section bear a similarity to the statistical methods for the estimation of differential correlated gene-pairs (see section 'Gene-pair analysis'). However, the extension of such gene-pair measures to the pathway level allows the identification of pathways that show, e.g., a condition specific correlation change.
In Figure 3E we show a simulated example scenario for condition specific correlation changes of the expression profiles for a gene set. In Figure 3E the correlation between all gene-pairs of a gene set is aggregated by a summary statistic. In this example, the mean values between the genes is of a comparable order, whereas the correlation of the gene set in the treatment condition is reduced.
A variety of different pathway methods have been developed that integrate the estimated gene correlations or co-expression structures with gene expression data. A summary of different methods that are used for the identification of differential correlation/interaction changes in pathways is shown in Table 3. In the following these methods are described in more detail.

Graph Edit Distance: GED
Among the first approaches that estimate the interaction structure for a pathway analysis to identify differentially correlated pathways (DCP) is a method introduced in [106]. This method uses the graph edit distance (GED) score as a test statistic.
More precisely, for a given pathway containing m genes an association graph, G, also called pseudo-pathway, is inferred for each condition. That means the resulting network comprises the m genes of this pathway only. The inference method estimates correlation and partial correlation coefficients and tests their statistical significance. That means, if either the correlation or partial correlation coefficient for two genes i and j in this pathway vanishes, then the resulting network will not have an interaction between gene i and j, i.e., E ij = 0, otherwise there is an interaction, E ij = 1. Here E corresponds to the adjacency matrix of the network. Suppose G c 1 and G c 2 are two networks that have been inferred from gene expression data for condition c 1 and c 2 . Further, assume that M 1 , M 2 , . . . , M n are all possible transformations that map G c 1 into G c 2 , i.e, M i (G c 1 ) = G c 2 . Then the optimal cost of the optimal transformation, M , is given by This value is used to define a dissimilarity measure d GED (G c 1 , G c 2 ) = c(M ) between the two networks G c 1 and G c 2 , called the graph edit distance (GED) score [112]. For arbitrary networks, G c 1 and G c 2 , the estimation of d GED (G A , G B ) is numerically challenging. However, for our specific problem it can be efficiently calculated based on the adjacency matrices, E c 1 and E c 2 , of the two networks, In [106] the GED score has been used as a test-statistic for the formulation of the hypotheses: In order to assess statistical significance, sample label permutations are performed to obtain the null distribution.
Extensions of this method can be found in [113] where mutual information values have been used to capture nonlinear relations among gene expression values. Further, in [114] a methods based on a relevance value (RV) has been defined for integrating different types of genomics data sets which has also a resemblance to the GED.

Gene set co-expression analysis: GSCA
A method that is based on (zero-order) correlation coefficients is gene set co-expression analysis (GSCA) [107]. This method uses as test statistic the dispersion index, which is defined as follows: Here ρ c k , with c ∈ {c 1 , c 2 }, is the k-th correlation coefficient for a gene pair, which can be formed among the total number, P = m 2 , of such pairs for a pathway consisting of m genes. The null and alternative hypotheses tested are: From the definition of the dispersion index follows that also this method aims at detecting at differential correlation among pathways, despite its name emphasizing co-expression. Interestingly, the dispersion index corresponds to the GED score if its components in Eqn. 47 are re-labeled and one defines the components of the adjacency matrices E c 1 , E c 2 as the correlation coefficients rather than the outcome of the hypotheses tests [106].
A visualization of the underlying idea of GSCA is shown in Figure 3E. The gene expression values are sampled from multivariate normal distribution N(μ 1 , 1 ) and N(μ 2 , 2 ) with μ 1 = μ 2 , and the average covariance between gene-pairs is 1 = 0.8 and 2 = −0.2. Despite the fact that there is neither a difference in the individual expression of genes nor the the expression of a set of genes, condition 1 and 2 can be distinguished by using a measure based on a correlation change.

Partial least squares based scores: PLS
A statistical framework based on a partial least squares score is proposed in [115]. Similar to the above methods, two matrices for the two conditions are inferred. These matrices can be seen as weighted networks, whereas an edge weight corresponds to the strength of the association between two genes. In this paper, three different types of tests are introduced that allow (A) testing for changes in the module structure of the two networks, (B) testing for changed in the connectivity of a particular gene set, and (C) testing for changes in the connectivity of a particular gene.

Differentially co-expressed gene sets: dCoxS
In [108] the differentially co-expressed gene sets (dCoxS) algorithm is proposed. This is an entropy-based method that uses the interaction score (IS) to measure the difference between two pathways. The IS is estimated by the correlation coefficient between the entropies of the two pathways. The entropies themselves are estimated by using the Rényi relative entropy, which is defined by: Here α is a parameter andf h (S i ) andf h (S j ) are expression densities of the samples i and j, estimated by using a multiplicative kernel for the density estimation. Further, p and q are the probability density functions of P and Q. From this, the IS is estimated by: In this equation, RE P 1 and RE P 2 are entropy matrices of two gene sets, P 1 and P 2 , for condition c i , whereas each component of the entropy matrices is proportional . That means, strictly IS = IS(P 1 (c i ), P 2 (c i )).
Application of Fisher s Z-transformation to IS results in a z-score, z IS(P 1 (c i ), P 2 (c i )) , for condition c i . Combination of both z-scores for condition c 1 and c 2 leads to, Here n 1 and n 2 correspond to the number of samples for condition c 1 and c 2 . The interpretation of the null hypothesis tested can be stated as: In [108] dCoxS has been applied to gene expression data from lung cancer. Their analysis identified the Thrombin signaling and protease-activated receptors pathway, which is known to be involved in the angiogenesis of lung cancer, as the most frequently changed pathway. Another interesting result found is that all significant pathway pairs had a lower interaction score in lung cancer than in the normal control group. This might indicate that the variability in http://www.biology-direct.com/content/7/1/44 form of exploited parallel pathways is in cancer lower than in normal cells.

Gene regulatory networks
Finally, we would like to mention that also gene regulatory network inference methods have also been used in this context. More precisely, several attempts have been made to identify disease networks [110,111] that corresponds to particular pathways. For instance, in [110] the C3NET inference method [116,117] has been used to infer pathway specific networks for prostate cancer. A structural comparison between the pathway-specific networks, similar to [106] which is based on testing the hypothesis in Eqn. 45, allowed to identify growth and cell cycle related pathways.
On a side note, we would like to add that Gaussian graphical models (GGM), also known as Markov random fields [118][119][120], are also frequently used to infer gene regulatory networks. This model assumes that all variables follow a multivariate normal distribution with a specific structure of the inverse of the covariance matrix, = −1 , whereas is called the precision or concentration matrix. Network inference methods based on GGM make use of the relation, connecting the partial correlation coefficient of full-order (LHS) with the elements of , ω ij ∈ . The partial correlation is of full-order (with respect to the number of genes) because V \{ij} is the set of all genes excluding i and j, i.e., the largest possible set of genes not considering i and j. Several methodological improvements have been suggested to infer gene regulatory networks based on GGM [121][122][123]. These methods differ in the way the inverse of the covariance matrix, −1 , is estimated and in the statistical tests employed to assess significance. The reason for these technical variants comes from a variety of problems. For instance, if the number of samples is smaller than the number of genes, which is typically the case for a microarray data set, the sample covariance matrix is not positive definite and, hence, not invertible. This means that Eqn. 55 cannot be exploited. In order to overcome such practical estimation problems, recently, several extensions based on the LASSO (least absolute shrinkage and selection operator) have been suggested [124][125][126][127][128].

Importance of multiple hypotheses testing and sample size: An example for differentially expressed genes
Typical microarray experiments measure the concentration of thousands of mRNAs simultaneously. For this reason, usually, one does not just test one statistical hypotheses but dozens, hundreds or even thousands. This makes it mandatory to control the overall error rate for all the tests, because the probability to make at least one error, Pr(V ≥ 1|α, t) = (1 − (1 − α) t ), for a test with a false positive rate of α, increases rapidly with the number of tests t, as can be seen in Figure 5. Here, V corresponds to the number of false positives. From this one can see that even for a moderate number of tested hypotheses, e.g., 300, this probability is already almost 100%. Hence, each of the three principle types of hypotheses tests discussed in the previous sections are severely effected by this problem.
Classically, a Bonferroni correction is used controlling the Family wise Error Rate (FWER) [129,130], Unfortunately, this method is often too stringent, which may give no significant results at all. For this reason, alternative error measures and control procedures have been introduced. A recent, very popular measure is the false discovery rate (FDR) [131], controlled by a procedure introduced by Benjamini & Hochberg (BH) [131]. Subsequently, various related error measures have been proposed like pFDR [132], local FDR [133,134] and a variety of other control procedures [129,135]. Also extensions have been suggested [136] that  ≥ 1|α = 0.01, t) on the number t of tested hypotheses. http://www.biology-direct.com/content/7/1/44 allow the control of an error measure in cases where the underlying tests are not independent from each other. This is particularly important for microarray data that contain a none neglectable correlation structure among the genes.
In order to demonstrate the importance and the influence of different error measures and control procedures, we provide a numerical example, shown in Figure 6. Here, 'BH' means the Benjamini & Hochberg procedure to control the FDR [131], 'pFDR' indicates the positive false positive rate controlled with a method introduced in [132], 'Bonferroni' corresponds to the FWER controlled with a Bonferroni correction and 'localFDR' corresponds to the local FDR [134,137]. The local FDR is defined as, localFDR = Prob(null |test statistic).
that means the local FDR is the probability that the null model is true conditioned on the observed test statistic. The data we used for this analysis correspond to simulated gene expression data sampled from a normal distribution  Figure 6 Power and false discovery rate in dependence on the sample size and the correlation in the data. We compare the methods: pFDR (red) [138], BH (blue) [131], Bonferroni (green) and localFDR (purple) [134,137]. http://www.biology-direct.com/content/7/1/44 with different mean values for the two conditions. More precisely, we simulate 2000 genes of which 400 are differentially expressed (true positives). Further, we study three different (constant) correlation structures with ρ = {0.0, 0.2, 0.5}. The results shown in Figure 6 are for each sample size averaged over 50 independent runs.
As one can see, 'BH' and 'pFDR' give more significant results and, hence, have a higher power than the Bonferroni correction and the local FDR when there is either no or only a moderate correlation among the genes. However, it is important to note that the utility of these methods depends on the characteristics of the data. For example, if the average correlation in the data is ρ = 0.0, then 'pFDR' tends to perform best (see Figure 6A). However, when the average correlation in the data increases (ρ = 0.5) then the 'localFDR' [134,137] becomes preferable. We want to note, for a sample size of 5, the power of the methods is usually very low because only a couple of genes test significant. In addition, a large fraction of these can be false positives. This seems to be especially for the local FDR method a problem.

Recommendations
In general, there is a trade-off between a high power of a statistical method on one side, which requires a large number of samples, and low experimental costs on the other. For the identification of differentially expressed genes the results in Figure 6 provide some guidelines. Even for the most favorable condition (for ρ = 0.0) a study will usually be underpowered for ≤ 20 samples, however, on the other hand, even for 10 samples the Type I error will be well-controlled.
For gene set and pathway-based methods such recommendation are more delicate. In [105] two self-contained (sum of t-square and Hotelling's T 2 [84,95]) and one competitive test (GSEA [27]) have been analyzed. As a results, it is suggested not to apply a method unconditionally to all pathways in a given data set, but to filter them in order to eliminate conditions for which a method is more likely to cause problems. This can be seen as a reflection of the heterogeneity of cancer, as discussed above in the section 'Gene expression data and cancer' .
In [105] it has been suggested to filter pathways according to the following criteria: Hotelling's T 2 should only be applied to pathways with less than 35 genes and a sample size larger than 30. The sum of t-square test should only be used for pathways with DC > 10% (DC is detection call; the percentage of differentially expressed genes in a pathway) and a sample size of 25 or larger. GSEA should only be used for pathways with DC > 10% and a sample size larger than 25. That means for the sum of t-square test and GSEA, at least 10% of the genes in a pathway should be differentially expressed for the method to work. However, this is not independent of the correlation structure of the data. In general, in the presence of high correlations a larger number of differentially expressed genes is beneficial for these methods.
It is important to emphasize that these sample sizes are different to the minimal sample sizes necessary in order to avoid in addition that a study is underpowered. For the minimal sample sizes [105] predict a sample size of 59 for Hotelling's T 2 and 57 for the sum of t-square test and 83 for GSEA. Further, in [95] it was found that using the N-statistic with 40 samples (or more) leads to a good control of the Type I error and a satisfactory power for a variety of differing conditions, including different correlations of the data and DC values in the pathways. Further studies reviewing related methods can found in [61,62,65,139,140].
We would like to emphasize that the above recommendations are data dependent. That means it is not possible to judge solely based on the number of samples which method to use. Instead, one needs to estimate characteristics from a particular data set in order to select an appropriate test. This implies, e.g., to estimate the correlation structure and the detection call. In the context of cancer there is an additional problem that needs to be considered. It is known that a tumor is a heterogeneous collection of cells rather than a homogeneous one [26,141]. This translates into the heterogeneity of gene expression data [142] making it even more dangerous to provide general recommendations without considering a particular data set.
On a general note, we would like to highlight that whenever a given data set allows to (I) identify changes in single genes, (II) identify changes in gene sets or pathways, and (III) identify changes in the correlation structure in pathways, then methods from each of the three categories (I-III) should be applied and there is no need to focus on just one of these. The reason for this is that despite the fact that gene set or pathway methods have more explanatory power than methods to identify changes in single genes [64,95] it does not mean that there are no conditions for which single gene methods reveal interesting biological information that may not be obtained by the other types of methods. For instance, the differential expression of a single gene based on changes in the mean (rather than the variance) may be an indicator for the presence of a single signaling chain rather than of many parallel pathways. Hence, this could provide information about the presence of a Mendelian trait or a complex trait that contains a strong monogenetic component. It appears that for such conditions single gene methods have an advantage over gene set or pathway methods, although, the latter methods may be adaptable to such question as well. However, this may require additional effort. In summary, we recommend to use all different approaches (I-III) side-by side, whenever this is permitted by the data, to interrogate the http://www.biology-direct.com/content/7/1/44 data in the broadest way, because this translates into a diverse set of different biological questions.
Our recommendation complements a common line of thought asking for the combination of different types of data. Although it is certainly true that combining different types of high-throughput data, e.g., from DNA microarray and ChIP-chip experiments, is in general more informative, it is also more time and cost intensive to generate such data combinations. For this reason, frequently, only gene expression data are available. Hence, our review provides a survey of method to get the most out of expression data sets.
Finally, we would like to emphasize that all methods require an appropriate filtering and normalization of the data in order to obtain robust and statistically sensible results [143,144].

Conclusions and discussion
In the post genomic era, biology transitioned from a 'gene-centric' to a systems-focused field. This change is also reflected in the transition from methods to identify 'differentially expressed single genes' to approaches for finding 'differentially changed pathways' . Such a transition is natural, because a systems view is required to understand the complex biological functions inside a cell that are responsible for the observable phenotypic outcomes [9,11,145].
As recent findings in cancer research demonstrate, cancer is a heterogeneous disorder, even within a particular cancer type. For example, breast cancer is currently subcategorized into four major tumor subtypes [146]: basal-like, HER2-enriched, luminal (which can be further reduced) and normal-like tumors. Considering the fact that these results have been achieved by using high-throughput data one can expect further refinements when data from different hight-throughput technologies become available and being combined with each other. For this reason, it appears sensible to assume such a heterogeneity not only on the global, phenotype level, but also within the cells, on the pathway-level. This implies that a pathway-based filtering, as suggested in [105], is necessary to apply a method only selectively, and not unconditional, to cancer pathways.
Regarding potential future directions, we expect to see an increase in methods that target changes in the correlation structure in pathways for three reasons. First, genes and their products do interact with each other. This implies that there exists a correlation structure among these entities that represents, potentially, useful biological information that may be missed by co-expression based methods [106]. Second, the costs to generate highthroughput data are declining, which makes it easier for the experimenter to generate a sufficiently large number of samples that enables such an analysis. This is an important point, since the required sample sizes for a pathway analysis is considerably larger than for single gene analyses. Third, biologically, the hallmarks of cancer point to a few pathways as pivotal elements in the molecular elucidation of carcinogenesis, e.g., Wnt/Notch signaling, Hedgehog signaling or DNA damage control [147][148][149]. Hence, semantically, pathway studies enable the systematic connection of oncogenes, tumor-suppressor genes and stability genes [150] to provide fundamental insights into causal mechanisms underlying cancer. Unfortunately, the temporary literature especially of methodological papers discuss their results rarely in the framework of the hallmark pathways. For this reason, we suggest that future studies aim for a conceptual discussion of their results within this enlightening framework. Not because it provides the final answers to understand cancer [151], but due to the fact that it enables a systematic approach to the emperor of all maladies [152].

Reviewers' comments
First of all, we would like to thank all referees for their fruitful suggestions and comments. In the following, we kept our answers to the raised issues short but included our responses in the main text.

Referee 1: Dr. Arcady Mushegian
The manuscript by Emmert-Streib and colleagues is a review of statistical methods for analysis of gene expression data, but it is also much more than that. It is relatively rare for the statisticians to review all classes of such methods and to give an eminently logical classification not only of the techniques on which the methods are based, but also of the kinds of questions that are asked when applying these methods. This, certainly, is a strength of the work and the reason why it should appeal to the biologists that would like to have a deeper insight into which methods are appropriate to which task at hand. I have, however, several comments that rank somewhere between suggestions and concerns. Most importantly, the authors propose to distinguish three groups of methods: those that identify changes in single genes, those that identify changes in gene sets or pathways, and those that identify changes in the correlation structure in pathways. (By the way, in the Abstract and elsewhere, the description of the groups is almost the same as above, but "changes" are substituted by "differential changes"is it not a tautology, in particular when there are only two samples?). Then, in discussing the first two classes of methods, the authors almost in every case give a clear formulation of the question that is being asked of the data, in the form of the statistical hypothesis about the data that is being tested. This is an excellent way of explaining things. Unfortunately, it is not consistently applied: http://www.biology-direct.com/content/7/1/44 even among these classes of methods, the hypotheses are not mentioned, and then, upon discussing the differentialcorrelation methods (pp. [15][16][17][18], the hypotheses are not explicated at all, except for the IPS method. I think this need to be changed, and the null hypotheses need to be stated for all methods for which this is possible; and if the framework is such that no explicit null hypothesis exist, this needs to be discussed, and the applicable intuitive formulation be given.

Reply:
We appreciate this suggestion and added to all methods the definition of their null hypothesis. In addition, we extended the discussion in section 'Formulating biological hypotheses' explaining why it can be difficult to find a biological interpretation for a null hypothesis and we offer some explanation for this.

Question:
My other concern is abut Figures 3 and 4. The authors never state what the data points there represent. They must be expression values for two genes, but how are these data collected -are they technical replicates? biological replicates? some kind of ordered series? unordered series such as for example different drug treatments? Does it matter what of the above they are?

Reply:
We added an explanation of the data, which are simulated data to visualize the principle idea underlying some methods, to the corresponding methods.

Question:
The third shortcoming of the paper is that there is a significant disconnect between well-covered methodology and the stated goal of discussing the application to cancer biology. In fact, the short discussion about cancer hallmarks is an excellent introduction that points out the way in which analysis of gene expression can lead to the understanding of changes in expression of particular ("hallmark") pathways. This theme, however, is not followed through. Though occasionally we read that such and such method was applied to analysis of a particular type of cancer, there is never any discussion of what was found in gene expression data that allowed an insight into cancer biology. What happens to the hallmark pathways at the level of gene expression programs? Which methods have been used to support (or maybe question?) which aspects of the hallmark hypothesis? Which pathways were predicted or shown to be differentially regulated at the transcription or mRNA concentration level?

Reply:
We agree with the reviewer that 'Which methods have been used to support (or maybe question?) which aspects of the hallmark hypothesis? ' is an important questions. Unfortunately, the methodology oriented literature does rarely touch this topic in a clear manner. That means in order to extend the paper in this direction we could not survey these issues but would need to establish such results. Instead, the concern in our paper is to propagate such an approach in the context of the presented methods. A discussion has been added to 'Discussion and conclusions' .

Question:
Finally, there is the question of, if you will, general biology of transcriptional response. It stands to reason, and indeed has been occasionally shown, that in order for a pathway to be regulated, it may not be necessary to regulate all its components at the same (in this case, mRNA concentration) level. One may find that the gene product amounts are regulated at different levels, or maybe even only one or a few, e.g., rate-limiting, components are regulated at all. This would argue that single genebased methods may in these cases provide a better clue to the process than pathway-based or gene set enrichmentbased methods. It would be interesting to know whether this has been observed in the cancer datasets. A related question is about the rules of thumb in pathway analysis: for example, if a typical pathway (network module?) has a size of N genes, what is the number of genes in this pathway m < N that would still register as an enrichment in some of the tests that the authors discuss?

Reply:
This is an important point. We included a discussion of this to section 'Recommendations' . We added also a discussion of the danger of general suggestions and motivate this by known characteristics of gene expression data from cancer. The problem is twofold. First, each method has its own characteristics under what conditions it works best. Second, data sets from cancer are very heterogeneous so that two data sets containing about the same number of samples can exhibit a very different correlation structure and expression patterns. This holds potentially also for different grades of one cancer type.
Regarding the first question, it appears to us that this is related to the presence or absence of parallel pathways conveying a molecular signal. If for example no parallel pathways exist the detection of differentially expressed genes can provide a robust way to detect functional changes. On the other hand, if there are many alternatives this may not be the case and gene set methods appear to be better suited for such a situation. In general, this kind of cross-method comparisons are not well studied and we are not aware that this has been systematically addressed for cancer or other data sets. One reason for this is that until recently, most data sets http://www.biology-direct.com/content/7/1/44 contained less than 20 samples per condition, which usually does not permit a robust analysis of gene set or pathway methods and once larger data sets became available the detection of differentially expressed genes was neglected, potentially, due to the erroneous assumption that differentially expressed gene set methods include the former tests.
In order to emphasize that it is desirable to apply methods from all three different levels simultaneously ((I) identify changes in single genes, (II) identify changes in gene sets or pathways, and (III) identify changes in the correlation structure in pathways) whenever a given data sets allows this, rather than to focus on just one of these levels, we added a discussion to section 'Recommendations' .
Thank you for your suggestions and comments.

Referee 2: Dr. Byung-Soo Kim
General comments This is a well organized review of recent statistical methods of analyzing microarray experiment data sets, particularly on cancers, from single gene analysis to identifying differential changes in pathway, and finally to comparing a given pathway under two different conditions. However, I would like to indicate following four points for the possible improvement. (1) Gaussian graphical model: From the methodological point of view, it is desired to include the sparse Gaussian graphical model (GGM) approach for estimating the gene network under the multivariate normal assumption from a microarray data set. For the recent development of GGM approach one can include glasso (Friedman, Hastie and Tibshirani, 2008; Witten, Friedman and Simon, 2011) [125,127], SCAD penalty of Fan, Feng and Wu (2009) [124], adaptive lasso of Zou (2006) [128] and Kiiveri (2012) [126], among others. (2) Effect of inter-gene correlations on the single gene analysis. A series of Efron's recent work (Efron 2007a(Efron , 2007b [134,137] discussed in detail on how inter-gene correlations could affect the detection of differentially expressed (DE) genes in a single gene analysis? By including Efron's recent work and his R package "locfdr" authors can show how FDR can be used in the real data analysis in their Section on "Importance of multiple hypotheses testing and sample size: An example for differentially expressed genes". (3) Some of the reviews are misleading. These are the few examples. (i) The sentence, at the middle of page 12, "However, in order to use a two-sample t-test with equal size of the two samples it is assumed that the mean fold change f ' and its standard deviation σ f would be the same for a randomly selected background set consisting of only m genes, see Eqn. 10". Actually ( [99], Luo et al., 2009) assumes the i.i.d of the fold change of genes to make Eqn 10 have a t distribution. Here the key assumption was the independence, which was missing in the aforementioned sentence.
(ii) p. 14. Eqn 16 (3) what Renyi entropy was when the underlying random variables were continuous. (iv) I would suggest authors to allocate more space on the work of ( [90], Massa et al, 2010) which was methodologically sound and deserve more coverage than just the IPS algorithm. (4) Inconsistency of notations. In page 11 authors defined p and m to be sizes of the background genes and a target gene set, respectively. However, in line 2 of page 15 "p genes" (which should have been m genes according to page 11 definition) was incorrectly labeled. This inconsistency was repeated in N-statistic section of p.15, and also in Eqn 16 in p. 14 and Eqn 22 of p. 17. The "p-dimensional... " should be "m-dimensional... " at the bottom two lines of p. 14.

Minor Comments
1. p.2 "Gene expression data from next generation sequencing (RNA-seq)". This is an important issue.
There is no direct relevance, however, with statistical methods reviewed in this paper. 2. p.4. For detecting differential correlation and differential variance, it would be better to explain why these approaches were taken. For example, in ( [54], Ho et al., 2008) it was clearly indicated that changes in expression variability were associated with changes in coexpression pattern, which implied that DV was a signal rather than a noise. 3. Legend of Figure 2. "The data is..." should be "The data are... ". 4. p.7. There is no reference of Figures A, B in the main text. Also indicate in the legend of Figure 3 what is in Figure 3E. 5. p.8. In the legend of Figure 4 what the symbols in the outer-panel represent? What do the lines represent? It is better to use different notation (A, B) to avoid confusion in the main text of the second paragraph of p. 9. 6. p.9 What is "alpha" in Equation (4) 9. p. 14. Eq. 16 Under the null, it seems that t,2 should approach 1/ √ (p) rather than 0. 10. Eq. 16 How is the significance of SAM-GS calculated? 11. p. 18 Eq. 24 and text after, use log in math mode rather than log.

Reply:
All comments have been addressed and we revised the main text correspondingly. Thank you for your suggestions and comments.