 Review
 Open Access
 Published:
Harnessing the complexity of gene expression data from cancer: from single gene to structural pathway methods
Biology Directvolume 7, Article number: 44 (2012)
Abstract
Highdimensional gene expression data provide a rich source of information because they capture the expression level of genes in dynamic states that reflect the biological functioning of a cell. For this reason, such data are suitable to reveal systems related properties inside a cell, e.g., in order to elucidate molecular mechanisms of complex diseases like breast or prostate cancer. However, this is not only strongly dependent on the sample size and the correlation structure of a data set, but also on the statistical hypotheses tested. Many different approaches have been developed over the years to analyze gene expression data to (I) identify changes in single genes, (II) identify changes in gene sets or pathways, and (III) identify changes in the correlation structure in pathways. In this paper, we review statistical methods for all three types of approaches, including subtypes, in the context of cancer data and provide links to software implementations and tools and address also the general problem of multiple hypotheses testing. Further, we provide recommendations for the selection of such analysis methods.
Reviewers
This article was reviewed by Arcady Mushegian, ByungSoo Kim and Joel Bader.
Review
Background
The early driving forces in biology were reductionist approaches. In general, a reductionist approach tries to breakdown a complex system into its parts list and explains its properties as the sum of its individual components. Hence, the individual constituents of a system inform its higher level functions [1–4]. However, the ‘one gene, one protein, one function’ working hypothesis [5] is not sufficient in order to explain the many emergent properties such as the phenotypic variability of organisms or the heterogeneity of cancer [6]. For this reason, nowadays, it is generally acknowledged that for achieving a functional understanding of biological systems, the genes in a cell need to be studied as a functioning collective [2, 3, 7]. In such a system, the collective functioning of groups of genes results in, for instance, signaling pathways or protein complexes that regulate cell differentiation, transcription regulation or growth.
A systems integration at the cellular level has the potential to answer many, until now, unsolved questions about biological systems and their collective functioning, regulatory programs for growth, development, phenotypic variability and the causality of many complex diseases [8–10]. Due to the enormous complexity of a cellular system, where many processes and interactions at different levels inside a cell work in harmony to assure the vital functioning of a cell, we need to understand key properties of biological systems like its robustness or modularity [2, 8] in order to enhance our understanding of complex diseases. These complex interactions occurring within a cell can be described by networks [11–13], including gene regulatory networks [14, 15], proteinprotein interaction (PPI) networks [16, 17], metabolic networks [18] and transcription regulatory networks [19, 20]. The networks are organized at different cellular levels and enable the functionality of the cell. The question now arising is how can the complexity inside a cell be understood, and analyzed?
The development of information processing technologies in the post genomic era enabled the generation of huge amounts of data. In this review, we focus on gene expression data from microarray platforms and summarize three major types of analysis strategies: (I) Identification of changes in single genes, (II) identification of changes in gene sets or pathways, and (III) identification of changes in the correlation structure within pathways. We discuss these methods in the context of cancer data sets to emphasize their biological meaning, implications and expressiveness.
Largescale gene expression data
In the next section, we briefly review highthroughput technologies that enable the generation of largescale gene expression data [21–23].
Gene expression data from microarray
A microarray experiment measures genomewide gene expression levels of mRNA in a cell or a tissue sample under a particular condition. A microarray chip quantifies the hybridization of fluorecsent labeled target nucleotide sequences to defined complementary probe sequences that are spotted on a glass or silicon slide. For different microarray platforms the spotted probes are synthetic oligonucleotides ranging from 25 to 80 nucleotides or long cDNA transcripts. Different microarray platforms were designed for a singlechannel or a multichannel experimental setting. For singlechannel arrays each condition sample is hybridized separately on individual arrays using a single dye. For multichannel arrays multiple conditions are hybridized together on individual arrays using multiple dyes. For example Affymetrix is a singlechannel platform, where multiple oligonucleotide probes (probeset) of 25 bases are used to measure the concentration of a mRNA transcript. The target mRNAs of expressed genes are extracted from a treatment or a control sample, reverse transcribed to cDNAs, labeled with a fluorescent dye and then hybridized to a microarray. An image of the microarray captures laser induced emitted fluoresent intensities of the probes at each spot. The intensities give a proportional measure of the corresponding mRNA concentration for each gene that was defined on the microarray.
Gene expression data from next generation sequencing (RNAseq)
The transcriptome of a cell comprises mRNA, tRNA, rRNA, and short regulatory RNAs. RNAseq is a transcriptome sequencing approach that uses deep sequencing techniques such as 454 (Roche), genome analyzer (Illumina solexa), SOLiD (support oligonucleotide ligation detection), Polonator G.007, HeliScope (Helicos BioSciences) and SMRT (single molecule real time sequencing) [24].
RNAseq has a wide variety of applications such as the measurement of gene expression levels from transcribed mRNA sequences [25]. In the first step of the procedure RNA is extracted from a given condition sample, fragmented, reverse transcribed to cDNA that is ligated to adapters. In the second step a library of reads is generated from the ligated fragments that are sequenced. In the third step the reads are mapped to known exon sequences of genes. The expression level of a gene is measured from the normalized number of mapped sequences that mapped to the known set of exon sequences of a gene. The RNAseq transcriptome sequencing approach overcomes several limitations of microarrays for measuring gene expression. For example, RNAseq measures large ranges of expression levels from very low to highly expressed genes and is able to consider unknown transcribed sequences. Since the novelty of the methodology, gold standard procedures for the management and processing of the data are currently being established.
Gene expression data and cancer
Cancer is a multifactorial disease, i.e., the detection of one mutation in one gene cannot explain the phenotypic plurality of carcino and pathogenesis by a onetoone relationship between genotype and phenotype. Instead, cancer can be induced by a multitude of genetic and environmental factors and the accumulation of such events. The intervening of such complex factors makes in general the characterization of complex diseases difficult. For this reason it is astonishing that the seminal work by Weinberg et al. [6, 26] presented a relative simple, systematic functional framework for cancer and the role different biological key processes are playing. In this paper, the so called hallmarks of cancer have been defined. According to [6], the hallmarks of cancer (see Figure 1) are:

self sufficiency in growth signals

insensitivity to antigrowth signals

evading apoptosis

limitless replicative potential

sustained angiogenesis

tissue invasion and metastasis
Later, this list has been extended by adding two further hallmarks [26]

deregulating cellular energetics

avoiding immune destruction
and also two enabling characteristics

genome instability

mutation and tumorpromoting inflammation
It has been recognized that these hallmarks are gradually acquired by different types of cancers, potentially, in a variable order. This variability in the acquiring of these diseasebearing processes is one of the indicators of the complexity of cancer.
The biological processes in a cell are controlled and regulated by signaling pathways that are activated by internal and external signaling receptors and factors. The signaling pathways governing growth and cell proliferation are likely dysregulated in their functioning in cancer. For example, they become insensitive to antigrowth signals, or they are dysregulated in growth signaling pathways by gaining autonomy in their growth. It is assumed that interaction changes at various levels (genetic, mRNA or protein) lead to the unlimited growth of cells instead of the upregulation or downregulation of a single gene. Further, sometimes, even a moderate change in the expression of a group of genes can lead to a significant change in the biological function of an organism [27].
Currently, the underlying processes that contribute to cancer are being intensively investigated. However, so far, the molecular causes that initiate and maintain cancer are not well understood. For this reason, the understanding of gene expression profiles, which provide signatures of all the active genes and their interconnections in a cell, contain valuable information about the functioning of key pathways, as expressed by the hallmarks of cancer and, hence, enable a practical investigation of functional mechanisms thereof [28–36]. Despite the different focus of many studies of different cancer types, common themes in the form of ‘key pathways’ can be found throughout. For instance, the NFκB pathway involved in the cellular responses to external stimuli like cytokines or free radicals, and immune response to infection [29, 37–39]; the MAPK signaling pathway responsible for regulating growth factor signaling including the RAF, MEK, and MAPK cascade [34, 39, 40]; the p53 signaling pathways involved in DNA damage control, apoptosis and inhibition of angiogenesis [37, 41, 42]; or the Wnt signaling pathway involved in cell differentiation, and cell polarity [31, 34, 36, 38].
Formulating biological hypotheses
A main goal of highthroughput gene expression analysis is to identify differentially expressed genes or gene sets between two or more conditions to enable a functional interpretation of the underlying conditionspecific mechanisms. The biological processes at the gene level are complex in nature as they dynamically interact with each other. A single gene can participate in different biological processes and regulate different genes at different time points. The identification of key genes or pathways is a difficult task, because their interactions are unknown. We only observe the phenotypic outcome of test conditions and the corresponding gene expression patterns measured from a tissue or cell culture. Univariate and multivariate statistical methods can be applied in order to understand such differences from a statistical perspective. The first type of approach that has been used to identify changes in the gene expression is a differential gene expression analysis. This approach is commonly used to compare different conditions of microarray samples to identify differences between them. As a result, a single gene analysis approach gives a list of genes that show a statistically significant difference between two conditions. For cancer, such genes may correspond to oncogenes or tumor suppressor genes.
If we consider the underlying network where different biological functions are being described by groups of interacting genes, a single gene analysis does not resolve the biological functions that are affected primarily in disease conditions and are causal factors of the disease. In order to get a systematic understanding of the disease or phenotypes we have to first understand what biological functions contribute to these changes, and perform a comparison between conditions using groups of genes defined by biological pathways. This approach leads to comparing gene expression data at the pathway level where sets of genes are tested for differential expression.
Another interesting property that can be extracted from gene expression data is the correlation structure of gene expression profiles between all genes. This correlation structure shows associations between genes which directly or indirectly interact with each other [8, 43–45]. Comparative analyses of gene pathways that consider the correlation structure of expression data can provide a suitable test for the hypothesis of changes in the underlying network.
In summary a gene expression data set can be used to (I) identify differentially altered single genes, (II) identify differentially expressed gene sets or pathways, and (III) identify differentially correlated pathways. In the following sections, we review statistical methods that have been introduced to study the three problems (IIII) above. In Figure 2 we give a graphical overview of the such methods.
Before we procede, we would like to point out that all of these methods test statistical hypotheses [46]. That implies that in order to understand a particular method biologically, i.e., one is capable of providing a biological interpretation, one needs to understand the underlying null hypothesis. In our opinion, it is helpful to approximately categorize all statistical hypotheses into three categories with respect to their biological interpretability, whereas each category represents a different degree of difficulty to find a biological interpretation for a hypothesis. In the following, we provide a brief discussion of these three categories because it enables a better, potentially, more plausible understanding of the methods presented in the next sections.
In category one belong all hypotheses for which it is relatively easy to find a meaningful biological interpretation. An example from this category are tests that compare mean values (μ), e.g., to identify the differential expression of genes (section ‘Differential expression of a gene’). That means these tests use the mean as a test statistic. Due to the fact that the underlying (probability) distribution of the genes represents, biologically, the activity of the gene expressions, the interpretation of a null hypothesis is directly derived thereof. For this reason the biological interpretation of the rejection of the null hypothesis given in Eqn. 2, is intuitively clear and appealing, because it implies a change in the (mean) expression of genes which may indicate a change in a biological function because the number of available proteins may be altered.
In category two fall tests for which there are several alternative biological interpretations. This makes the interpretation of such tests ambivalent from a biological perspective. As example for such a test, we consider the detection of the differential variance of a gene (section ‘Differential variance of a gene’). Despite the fact that the underlying probability distribution of the expression of genes has a clear biological interpretation, the biological interpretation for the rejection of the null hypothesis in Eqn. 4 is not unique. For instance, a gene could have a different variance in two conditions because, e.g., in condition one it is periodically expressed, whereas in condition two it is constantly expressed on an intermediate level. The former condition may be related to the cell cycle or the circadian rhythm, or periodically triggered by an external signaling factor that is released by the administration of a medication that is regularly taken. A second equally plausible interpretation could be that in one condition the cell utilizes parallel pathways to transfer a signal whereas in the other condition only one signaling chain is used. The reason for the utilization of parallel pathways could be triggered by stress factors, e.g., in the presence of an infection, so that the cell is ‘running’ full power in order to execute all necessary programs that have been initiated by the presence of the intruder.
Lastly, for tests in category three it is very difficult to find sufficiently precise biological interpretations because, statistically, these methods test ‘complex’ expressions. An example for a test from this category is the Nstatistic (section ‘Nstatistic’). The null hypothesis is based on the comparison of two distributions rather than scalar test statistics. In order to clarify the crucial difference between the comparison of two distributions and scalar test statistics we note that, theoretically, every probability distribution can be written as a series expansion in its moments [47]. This means a test for a distribution, compares implicitly the moments of this distribution. Here the (kth) moment is defined as the expectation value of a random variable (to power k), i.e.,
An example for a moment is the mean (which is the first moment), other examples of entities that can be expressed as a function of moments are the variance and the kurtosis. This means whenever the null hypothesis in Eqn. 39 is rejected it could be because of a difference in any moment of which there are, theoretically, infinite many. Put differently, this kind of unspecificity makes this test very powerful in the sense that it may detect any possible difference two distributions can exhibit. On the other hand, if the null hypothesis is rejected it is very difficult to identify a precise reason for its rejection. For instance, this could be related to a difference in the mean, variance, kurtosis or any higher moment or function thereof. These combinatorial factors do usually not allow to find a concise biological interpretation. Nevertheless, such a test can be of valuable use, e.g., for diagnostic purposes.
Singlegene analysis
Singlegene based methods can be subdivided into three major classes. A) Methods for detecting differential gene expression, B) methods for detecting differential correlation, and C) methods for detecting a differential variance.
Differential expression of a gene
The analysis of differential gene expression is based on the mean expression change of individual genes. Suppose for a gene g_{ i } in a microarray data set the mean expression value for two conditions are μ_{1} and μ_{2} respectively. Then the null hypothesis for the differential expression of the gene g_{ i } is defined as
A gene is called differentially expressed when H_{0} is rejected. 3 Figure 3A shows an example where the samples for two conditions are drawn from two normal distributions with different mean values, i.e., N(μ_{1} = 0,σ_{1} = 1) and N(μ_{2} = 1,σ_{2} = 1).
The first published studies for gene expression analysis selected differentially expressed genes based on a foldchange criteria between a treatment and control condition [48]. For example, an early application of this measure was used to compare normal colon epithelium and primary colon cancers [49]. Since then, many statistical approaches have been developed to provide more robust measures. Among the most popular methods are, e.g., SAM [50, 51], limma [52], and the empirical Bayes approach from Efron et al. [53].
Differential variance of a gene
The analysis of differential variability (DV) aims to detect a change in the variance of the gene expression values [54]. Suppose in a microarray expression data set, the mean expression value of a gene i is μ_{ c } and its variance $\left(\right)close="">{\sigma}_{c}^{2}$, for condition c = {1,2}. Then, the null and alternative hypothesis tested are:
A gene is called differentially variable when the null hypothesis H_{0} is rejected. The DV analysis in [54] tests H_{0} by using a Ftest. In Figure 3B we show an example for a gene with a constant mean, but a changed variance in the two conditions. The samples for the two conditions are drawn from a standard normal distribution with the same mean but different variances for the conditions, i.e., N(μ_{1} = 0,σ_{1} = 1), and N(μ_{2} = 0,σ_{2} = 2).
Differential correlation of a gene
The analysis of differential correlation aims to detect changes in the dependency structure of a single gene [55]. Suppose r_{ i } = r_{i1},…,r_{ ip } denotes a p − 1 dimensional correlation vector, whereas each component corresponds to the correlation between gene i and one of the other p − 1 genes in a data set. Then for r_{ i } one obtains distribution functions, denoted by $\left(\right)close="">{F}_{{r}_{i}}^{A}$ and $\left(\right)close="">{F}_{{r}_{i}}^{B}$, for condition A and B and the following hypotheses:
A gene i is called differentially correlated when H_{0} is rejected.
Genepair analysis
The functional activities of genes, as measured by gene expression values, reflect the interplay of the genes and their products in the underlying gene network. The objective of a genepair analysis is to identify either differential cocorrelated or differential coexpressed pairs of genes, instead of individual genes. The reason for looking for pairs of genes is that the concerted changes in genes is due to their common membership in biological pathways.
The principle idea to detect correlation changes in genepairs is visualized in Figure 3D. The data are sampled from a multivariate normal distribution with a constant mean vector for both conditions, μ_{ 1 } = μ_{ 2 } = (0,1), but a different correlation of ρ_{1} = 0.8 and ρ_{1} = −0.2. The point is despite no difference in the mean expression of the genepair, there is a difference in their correlation.
In [56] a method (CorScor) has been proposed to identify such genepairs. In Figure 4 we show three cases of the joint distribution of expression values of two genes, for two conditions. In this Figure we are showing simulated data for three possible changes in the coexpression of a pair of genes in two conditions. The samples in Figure 4A and B are drawn from a multivariate normal distribution with μ_{1} = {5,5} and μ_{2} = {5,7}. For Figure A the correlation between genepairs is ρ_{1} = ρ_{2} = 0.9 and for Figure B it is ρ_{1} = ρ_{2} = −0.9. For Figure 4C the samples are generated from a multivariate normal distribution with μ_{1} = {5,5} and μ_{2} = {5,5} and the average correlation between genepairs is ρ_{1} = 0.9 and ρ_{2} = −0.3.
In the first two cases (Figure 4A and B) the correlation of the genepairs show a condition specific shift, in [56] denoted as a gap and substitution. In the third case (Figure 4C), the genepairs show a reversed correlation between the two conditions, denoted as on/off case. To identify genepairs in these two types of conditions, two scoring functions have been suggested in [56] given by:
Here each of the three correlation coefficients are estimated for a genepair between gene i and j, i.e., ρ_{ A } = ρ_{ A }(i j) etc. The value of ρ corresponds to the global correlation coefficient of the genepair over the two conditions (A B) and ρ_{ A }, ρ_{ B } are the correlation coefficients of the genepair for condition A and condition B. In Eqn. 8, α is a tuning parameter that governs the balance between separation and parallel alignment. In [56] it was argued to use a value of α = 1.5. The null and alternative hypotheses tested are:
In [57] the ‘expected conditional Fstatistic’ (ECFstatistic) has been introduced to measure the differential coexpression of gene pairs (X Y). The method is based on a modified Fstatistic, where the variance and the mean parameter of the statistic are estimated from a mixture of two normal distributions.
The R package R/EBcoexpress provides an empirical Bayesian implementation to identify the differential coexpression of gene pairs [58].
Another method called liquid association (LA) has been proposed in [59] to identify coexpressed gene pairs. In contrast to pairwise correlation measures, the LA method considers the presence of a mediator gene, Z, for observing a coexpression between two genes at a given cellular state. Let X, Y and Z be geneexpression profiles. We say that X and Y form a liquid association pair (LAP), if the cellular state of X and Y is correlated with Z. The LA score of X and Y with respect to Z is estimated from rank transformed expression profiles of X, Y, Z given by
Here m corresponds to the number of samples. The LA method uses a permutation test for the identification of significant LA gene pair values. Due to the high computational burden of the method that would require N^{3} (N is the number of genes) evaluations of Eqn 11 plus additional permutations of the data, which is even for only N = 10^{3} genes intractable because it requires already more than 10^{9} evaluations. For this reason the method is only used to (A) find the gene Z for a given pair of genes or (B) find the LAP, X and Y, for a given gene Z.
Gene set and pathway analysis methods
Generally, a pathway is a group of interacting genes (a gene set) that deploy a cellular function. In a biological system the biological processes are coordinated functions of sets of genes which make the organism work. Some general pathways are, e.g., metabolic pathways, signaling pathways or regulation pathways that represent minimal functioning units of a cellular system. The consideration of pathways or gene sets for a comparative gene expression analysis is an important step toward the exploration of relevant functional mechanisms of a cell.
So far, many multivariate and univariate tests have been proposed for a gene set analysis, see Figure 2. Finding differentially expressed pathways, instead of individual genes, is not straight forward from a statistical and biological perspective and there are several hurdles to this approach. The first is presented by the data themselves, because the number of variables is usually (much) larger than the number of samples, i.e., n << p, that leads to many estimation problems. The second hurdle is our incomplete information about the constitution of biological pathways and the potentially high overlap of genes between different pathways. For example, databases like GO [60] provide valuable information about genes for a large variety of different organisms. However, this information is not static but continuously expanding leaving us at the moment with a snapshot of knowledge. This makes it difficult to find precise definitions for particular pathways of interest. The third problem comes from the underlying gene network structure that describes the true interactions between genes in a pathway. Here, the problem is that as a result of such interactions among genes it is usually not appropriate to assume their independence, as frequently done for statistical ease.
A motivating example for the general idea underlying gene set methods is shown in Figure 3C. For condition 1 (green), the samples are drawn from a multivariate normal distribution with μ_{1} = {2,2.2,2.4,2.6,5,8} and for condition 2 (red) μ_{2} = {2,7,7.5,2.6,5,8}. The covariance matrix, Σ_{1} = Σ_{2} is for both conditions the same. In this Figure, only 2 of the 6 gene are differentially expressed. This reflects biological situations because, usually, only of fraction of the genes belonging to a pathway is found to be differentially expressed. However, due to the fact that gene set methods are based on the expression of a set of genes such methods borrow strength from the combined analysis of the genes.
Reviews that focus entirely on gene set and pathway analysis methods can be found in [61–65].
Null hypothesis for gene set analysis
Gene set analysis methods can be broadly divided into two major categories, depending on what null hypothesis is tested. The first type of methods are called competitive methods, and the second type selfcontained methods[66]. Briefly, selfcontained tests use only the data from a target gene set under investigation, whereas competitive tests use, in addition, also data outside the target gene set (background data). In the following we describe popular competitive and selfcontained pathway methods.
Competitive gene set and pathway methods
In Table 1 we show an overview of gene set and pathway methods, described in the following.
GSEA
The gene set enrichment analysis method (GSEA) [27, 68] is one of the most widely used competitive test based method. The test uses a KolmogorovSmirnov test statistic to identify differential expressed gene sets. The gene set and background data set are defined in the following. Let W be the target gene set to be tested and W^{c} its complement in a way that the union of both sets defines all genes, i.e., V = W ∪ W^{c}, in the data set. The hypotheses tested by GSEA with respect to an enrichment score (ES) are:
Briefly, GSEA consists of the following steps, applied to each pathway:

(1)
Estimation of genelevel test statistics.

(2)
Rank ordering of the test statistics.

(3)
Calculation of an enrichment score (ES) for a pathway based on the genelevel test statistics.

(4)
Permutation of the genelabels to estimate the significance of the enrichment score for the pathway.
GSEArot
GSEArot (gene set enrichment analysis rotation) [70] is very similar to GSEA, but uses a different approach to randomize data in order to assess the significance of a target pathway. More specifically, a data matrix X is randomized by, first, rotating X around a random angle δ, resulting in a matrix X(δ). Second, from the matrix X(δ), the randomization matrix is obtained by a QR decomposition [76]. In [70] it is argued that this procedure has an advantage for small sample sizes, when only very few permutations are achievable from samplelabel permutations. The null hypothesis tested by GSEArot is the same as for GSEA.
Random set
The random set method introduced in [73] is a parametric test that is a generalization of Fisher’s exact test in the sense that enrichment scores of gene sets are compared with randomly formed sets. The enrichment scores are based on single genelevel test statistics reflecting their differential expression.

1.
Estimate the enrichment score of a target gene set W,
$$\stackrel{\u0304}{s}=\frac{1}{m}\sum _{i\in W}{s}_{i}.$$(12)Here s_{ i } are genelevel scores, e.g., tscores, and m = W is the number of genes in the target pathway.

2.
Estimate the enrichment score and its variance of the background gene set V = W ∪ W ^{c},
$$\mu =\frac{1}{p}\sum _{i\in V}{s}_{i},$$(13)$${\sigma}^{2}=\frac{1}{m}\left(\frac{pm}{p1}\right)\left(\frac{\sum _{i\in V}{s}_{i}^{2}}{p}{\left(\frac{\sum _{i\in V}{s}_{i}}{p}\right)}^{2}\right),$$(14)with p = W ∪ W^{c}.

3.
Estimate the standardized enrichment score
$$Z=\frac{\stackrel{\u0304}{s}\mu}{\sigma}.$$(15)
The Z score follows a standard normal distribution under the null hypothesis H_{0} given by:
It is notable that Z can be calculated without a numerical randomization of the data. Further, the background data consist of all genes V, including the ones in the target pathway W. In [77] this method has been applied to head and neck and cervical cancer for human papillomavirusespositive and negative samples.
GAGE
GAGE [71] (generally applicable gene set enrichment) is also a parametric test that, similarly to GSEA, compares the expression in a target gene set with that of the background. But instead of using a KolmogorovSmirnov like test [78] it employes a twosample ttest. The principle steps of the method are as follows:

1.
Estimate the mean fold change f and its standard deviation σ _{ f } for the m genes in the target pathway W.

2.
Estimate the mean fold change f ^{′}and its standard deviation $\left(\right)close="">{\sigma}_{{f}^{\prime}}$ for all p genes in the background gene set V = W ∪ W ^{c}.

3.
Estimate the tscore:
$$t=\frac{f{f}^{\prime}}{\sqrt{{\sigma}_{f}^{2}/m+{\sigma}_{{f}^{\prime}}^{2}/m}}$$(18)
with
degrees of freedom.
Also GAGE employs all genes V in the background gene set, including the ones in the target set W. The underlying assumption of GAGE is that the (mean) fold changes of genes are independent and identically distributed. The null hypothesis tested by GAGE is:
GSA
Another method is GSA (gene set analysis) [69]. The method, first, calculates zscores, z_{ i }, with i ∈ {1,…,m}, for all m genes in a given target pathway W. Then each zscore is transformed into two scores, assessing the sign of z_{ i }.
This results in two sets of nonzero scores ${\mathcal{S}}^{+}=\{{s}^{+}({z}_{1}),\dots ,{s}^{+}({z}_{m})\}$ and ${\mathcal{S}}^{}=\{{s}^{}({z}_{1}),\dots ,{s}^{}({z}_{m})\}$ from which their mean value is calculated,
Finally, the maxmean test statistic is defined by ${s}_{\mathrm{mm}}=max\{{\stackrel{\u0304}{s}}^{+},{\stackrel{\u0304}{s}}^{}\}$, giving the test statistic for the target pathway.
The null distribution is assessed by a restandardization, combining a sample and genelabel permutation.
Selfcontained gene set and pathway methods
In Table 2 we show an overview of selfcontained gene set and pathway methods that are describe in the following in more detail.
Sum of tsquare
The sum of tsquare test is an univariate test that is based on tscores, {t_{ i }}, individually obtained for each of the m genes in a given set [95], see also [79]. That means that each tscore assesses the difference of the mean expression between the two conditions,
with s_{ i } the pooled standard deviation. The test statistic for a pathway is based on the individual tscores given by
Because for each gene Δ μ_{ i } = 0 should hold if a gene is not differentially expressed, the null and alternative hypothesis can be formulated as:
The significance of TS is assessed from samplelabel permuted data.
SAMGS
The method SAMGS (Significance Analysis of Microarray for gene sets) [81] uses the test statistics,
with ${d}_{k}=\frac{{\stackrel{\u0304}{x}}_{1k}{\stackrel{\u0304}{x}}_{2k}}{{s}_{k}+{s}_{0}}$. Here $\left(\right)close="">{\stackrel{\u0304}{x}}_{1k}$ and $\left(\right)close="">{\stackrel{\u0304}{x}}_{2k}$ are the sample means of the control and treatment condition of gene k, s_{ k } corresponds to its pooled standard deviation and s_{0} is a constant for a sensitivity adjustment. The null and alternative hypothesis can be formulated as:
Statistical significance of SAMGS is again assessed from samplelabel permuted data.
Hotelling’s T^{2}
The Hotelling T^{2} test is a selfcontained test that is a multivariate generalization of the univariate ttest. Its null and alternative hypothesis can be formulated as:
Suppose we have two groups with n_{ C } samples from the control group and n_{ T } samples for the treatment group, each consisting of m genes. Let the expression level of the i^{th} sample of the control group and treatment group be given by ${X}_{i}^{C}={\left({X}_{i1}^{C},{X}_{i2}^{C},\dots {X}_{\mathrm{im}}^{C}\right)}^{t}$ and ${X}_{i}^{T}={\left({X}_{i1}^{T},{X}_{i2}^{T},\dots {X}_{\mathrm{im}}^{T}\right)}^{t}$, respectively. The pooled covariance matrix Sis then defined by
where S_{ C } and S_{ T } are the covariance matrices for the control and treatment group. Hotelling’s T^{2} is defined as
The inverse of the covariance matrix is estimated via the shrinkage estimator [96–99]. The statistical significance of the test statistic T^{2} is estimated from samplelabel permuted data.
Nstatistic
The Nstatistic is a nonparametric test that is used to test the equality of two distributions. Suppose the expression level of the i^{th} sample of the control group, n_{ C }, and the treatment group, n_{ T }, is given by ${X}_{i}^{C}={\left({X}_{i1}^{C},{X}_{i2}^{C},\dots {X}_{\mathrm{im}}^{C}\right)}^{t}$ and ${X}_{i}^{T}={\left({X}_{i1}^{T},{X}_{i2}^{T},\dots {X}_{\mathrm{im}}^{T}\right)}^{t}$, respectively. Let i ∈ {1…n_{ C }} correspond to the control dataset and i ∈ {1…n_{ T }} to the treatment dataset. The null and alternative hypothesis tested by the Nstatistic can be formulated as:
whereas F_{ C }(x) and F_{ T }(x) are two multivariate distribution functions from the control and the treatment condition.
The Nstatistic itself is defined as follows:
Here $K\left({x}_{i}^{C},{x}_{j}^{T}\right)$, defined as $K\left({x}_{i}^{T},{x}_{j}^{C}\right)={\u2225{x}_{i}^{T}{x}_{j}^{C}\u2225}_{2}$, is the Euclidean Kernel serving as distance function between the expression values in the two conditions.
Linear modelbased pathway methods
There are also several approaches that utilize either a linear or a generalized linear modeling framework for a gene set analysis. Examples for such methods are Global test [82], Extension of GSEA [80] or GlobalAncova [83].
Topological pathway methods based on existing network information
Some recent univariate methods, for instance, Pathwayexpress [90], SPIA [91] or SEPEA [92], use instead of correlation measures to estimate interactions among genes, predefined topological information as provided, e.g., by the KEGG database [100]. These methods assign each gene in a pathway a score that is based on the position of a gene in the given network structure and, finally, aggregate these individual gene scores to obtain a score for the pathway itself. Yet another approach is provided by PARADIGM [93]. This method uses a factor graph model combining gene copy number variation data with gene expression data for the identification of differentially expressed pathways.
Iterative Proportional Scaling: IPS
IPS (Iterative Proportional Scaling) [94, 101] is another method that utilizes the topology of pathways of a given gene set by testing the hypotheses:
In this method, the covariance matrices, $\left(\right)close="">{\mathit{\Sigma}}_{{c}_{1}},{\mathit{\Sigma}}_{{c}_{2}}$, are estimated from the data, for both conditions, using the Iterative Proportional Scaling (IPS) algorithm. The inverse of the estimated covariance matrices are positive definite (concentration) matrices for which it is assumed that the nonzero elements in $\left(\right)close="">{\mathit{\Sigma}}_{{c}_{1}}^{1}$ and $\left(\right)close="">{\mathit{\Sigma}}_{{c}_{2}}^{1}$ are identical; this is the meaning of ${\mathit{\Sigma}}_{{c}_{1}}^{1},{\mathit{\Sigma}}_{{c}_{2}}^{1}\in {S}^{+}(G)$ where S^{+} indicates the class of all symmetric positive definite matrices with nonzeros elements given by the binary matrix G. This means that the concentration matrices, ${\mathit{\Sigma}}_{{c}_{1}}^{1}$ and ${\mathit{\Sigma}}_{{c}_{2}}^{1}$, have identical zero element, but are allowed to have different nonzero entries. In other words, it is assumed that the underlying topology of a pathway is the same for condition c_{1} and c_{2}, given by G, whereas G_{ ij } = 0 corresponds to an ‘absent’ interactions among the genes i and j. Since the structure of G is not estimated from the data, it is necessary to obtain it from an independent source, e.g., from the KEGG database or Reactome [100, 102]. In [94] it is shown that a log likelihood (log(Λ)) ratio test can be used to test for the equality of the concentration matrices for the two conditions and that asymptotically the log likelihood ratio follows a Chisquare distribution with r + m degrees of freedom, i.e., $log(\Lambda )\sim {\chi}_{r+m}^{2}$, whereas m is the number of genes in the pathway and r is the number of nonvanishing edges in G corresponding to the fixed interaction structure of the pathway.
The IPS method has been used in [94] to study acute lymphocytic leukemia with and without BCR/ABL gene rearrangement. As a result, the JUN oncogene with RAS/MAPK/JNK followed by NFAT and NFKB seem to be crucial in distinguishing BCR/ABL positive and negative patients.
Differential correlation/interaction methods
In the previous sections, we discussed different gene set and pathwaybased methods for the identification of differentially expressed pathways. These methods focused either only on the expression of genes, or considered an underlying interaction topology among the genes as taken from an independent source. However, even when these methods considered an underlying network structure, this structure was assumed to be the same for the ‘treatment’ and ‘control’ group.
In contrast, in this section we discuss methods that estimate the correlation/interaction structure of the genes within pathways, for each experimental condition. The underlying rationale for these approaches is to assume that the expression profiles of genes are dependent on each other [103, 104] as the genes in a pathway interact, either directly or indirectly [105]. This assumption results from the observation that genes with similar functions or cellular localization are often coexpressed and cluster together. The methods discussed in this section bear a similarity to the statistical methods for the estimation of differential correlated genepairs (see section ‘Genepair analysis’). However, the extension of such genepair measures to the pathway level allows the identification of pathways that show, e.g., a condition specific correlation change.
In Figure 3E we show a simulated example scenario for condition specific correlation changes of the expression profiles for a gene set. In Figure 3E the correlation between all genepairs of a gene set is aggregated by a summary statistic. In this example, the mean values between the genes is of a comparable order, whereas the correlation of the gene set in the treatment condition is reduced.
A variety of different pathway methods have been developed that integrate the estimated gene correlations or coexpression structures with gene expression data. A summary of different methods that are used for the identification of differential correlation/interaction changes in pathways is shown in Table 3. In the following these methods are described in more detail.
Graph Edit Distance: GED
Among the first approaches that estimate the interaction structure for a pathway analysis to identify differentially correlated pathways (DCP) is a method introduced in [106]. This method uses the graph edit distance (GED) score as a test statistic.
More precisely, for a given pathway containing m genes an association graph, G, also called pseudopathway, is inferred for each condition. That means the resulting network comprises the m genes of this pathway only. The inference method estimates correlation and partial correlation coefficients and tests their statistical significance. That means, if either the correlation or partial correlation coefficient for two genes i and j in this pathway vanishes, then the resulting network will not have an interaction between gene i and j, i.e., E_{ ij } = 0, otherwise there is an interaction, E_{ ij } = 1. Here E corresponds to the adjacency matrix of the network. Suppose $\left(\right)close="">{G}_{{c}_{1}}$ and $\left(\right)close="">{G}_{{c}_{2}}$ are two networks that have been inferred from gene expression data for condition c_{1} and c_{2}. Further, assume that M_{1},M_{2},…,M_{ n } are all possible transformations that map $\left(\right)close="">{G}_{{c}_{1}}$ into $\left(\right)close="">{G}_{{c}_{2}}$, i.e, $\left(\right)close="">{M}_{i}({G}_{{c}_{1}})={G}_{{c}_{2}}$. Then the optimal cost of the optimal transformation, M^{′}, is given by c(M^{′}) = min{c(M_{ i })1 ≤ i ≤ n}. This value is used to define a dissimilarity measure ${d}_{\text{GED}}({G}_{{c}_{1}},{G}_{{c}_{2}})=c({M}^{{}^{\prime}})$ between the two networks $\left(\right)close="">{G}_{{c}_{1}}$ and $\left(\right)close="">{G}_{{c}_{2}}$, called the graph edit distance (GED) score [112]. For arbitrary networks, $\left(\right)close="">{G}_{{c}_{1}}$ and $\left(\right)close="">{G}_{{c}_{2}}$, the estimation of d_{GED}(G_{ A }G_{ B }) is numerically challenging. However, for our specific problem it can be efficiently calculated based on the adjacency matrices, $\left(\right)close="">{E}^{{c}_{1}}$ and $\left(\right)close="">{E}^{{c}_{2}}$, of the two networks,
In [106] the GED score has been used as a teststatistic for the formulation of the hypotheses:
In order to assess statistical significance, sample label permutations are performed to obtain the null distribution.
Extensions of this method can be found in [113] where mutual information values have been used to capture nonlinear relations among gene expression values. Further, in [114] a methods based on a relevance value (RV) has been defined for integrating different types of genomics data sets which has also a resemblance to the GED.
Gene set coexpression analysis: GSCA
A method that is based on (zeroorder) correlation coefficients is gene set coexpression analysis (GSCA) [107]. This method uses as test statistic the dispersion index, which is defined as follows:
Here $\left(\right)close="">{\rho}_{k}^{c}$, with c ∈ {c_{1},c_{2}}, is the kth correlation coefficient for a gene pair, which can be formed among the total number, $P=\left(\genfrac{}{}{0ex}{}{m}{2}\right)$, of such pairs for a pathway consisting of m genes. The null and alternative hypotheses tested are:
From the definition of the dispersion index follows that also this method aims at detecting at differential correlation among pathways, despite its name emphasizing coexpression. Interestingly, the dispersion index corresponds to the GED score if its components in Eqn. 47 are relabeled and one defines the components of the adjacency matrices $\left(\right)close="">{E}^{{c}_{1}},{E}^{{c}_{2}}$ as the correlation coefficients rather than the outcome of the hypotheses tests [106].
A visualization of the underlying idea of GSCA is shown in Figure 3E. The gene expression values are sampled from multivariate normal distribution N(μ_{ 1 },Σ_{1}) and N(μ_{2},Σ_{2}) with μ_{1} = μ_{2}, and the average covariance between genepairs is Σ_{1} = 0.8 and Σ_{2} = −0.2. Despite the fact that there is neither a difference in the individual expression of genes nor the the expression of a set of genes, condition 1 and 2 can be distinguished by using a measure based on a correlation change.
Partial least squares based scores: PLS
A statistical framework based on a partial least squares score is proposed in [115]. Similar to the above methods, two matrices for the two conditions are inferred. These matrices can be seen as weighted networks, whereas an edge weight corresponds to the strength of the association between two genes. In this paper, three different types of tests are introduced that allow (A) testing for changes in the module structure of the two networks, (B) testing for changed in the connectivity of a particular gene set, and (C) testing for changes in the connectivity of a particular gene.
Differentially coexpressed gene sets: dCoxS
In [108] the differentially coexpressed gene sets (dCoxS) algorithm is proposed. This is an entropybased method that uses the interaction score (IS) to measure the difference between two pathways. The IS is estimated by the correlation coefficient between the entropies of the two pathways. The entropies themselves are estimated by using the Rényi relative entropy, which is defined by:
Here α is a parameter and $\left(\right)close="">{\widehat{f}}_{h}({S}_{i})$ and $\left(\right)close="">{\widehat{f}}_{h}({S}_{j})$ are expression densities of the samples i and j, estimated by using a multiplicative kernel for the density estimation. Further, p and q are the probability density functions of P and Q. From this, the IS is estimated by:
In this equation, $\left(\right)close="">R{E}^{{P}_{1}}$ and $\left(\right)close="">R{E}^{{P}_{2}}$ are entropy matrices of two gene sets, P_{1} and P_{2}, for condition c_{ i }, whereas each component of the entropy matrices is proportional to $\left(\right)close="">\sim log\frac{{\widehat{f}}_{h}({S}_{i})}{{\widehat{f}}_{h}({S}_{j})}$. That means, strictly IS = IS(P_{1}(c_{ i }),P_{2}(c_{ i })). Application of Fisher’s Ztransformation to IS results in a zscore, z(IS(P_{1}(c_{ i }),P_{2}(c_{ i }))), for condition c_{ i }. Combination of both zscores for condition c_{1} and c_{2} leads to,
Here n_{1} and n_{2} correspond to the number of samples for condition c_{1}and c_{2}. The interpretation of the null hypothesis tested can be stated as:
In [108] dCoxS has been applied to gene expression data from lung cancer. Their analysis identified the Thrombin signaling and proteaseactivated receptors pathway, which is known to be involved in the angiogenesis of lung cancer, as the most frequently changed pathway. Another interesting result found is that all significant pathway pairs had a lower interaction score in lung cancer than in the normal control group. This might indicate that the variability in form of exploited parallel pathways is in cancer lower than in normal cells.
Gene regulatory networks
Finally, we would like to mention that also gene regulatory network inference methods have also been used in this context. More precisely, several attempts have been made to identify disease networks [110, 111] that corresponds to particular pathways. For instance, in [110] the C3NET inference method [116, 117] has been used to infer pathway specific networks for prostate cancer. A structural comparison between the pathwayspecific networks, similar to [106] which is based on testing the hypothesis in Eqn. 45, allowed to identify growth and cell cycle related pathways.
On a side note, we would like to add that Gaussian graphical models (GGM), also known as Markov random fields[118–120], are also frequently used to infer gene regulatory networks. This model assumes that all variables follow a multivariate normal distribution with a specific structure of the inverse of the covariance matrix, Ω = Σ^{−1}, whereas Ω is called the precision or concentration matrix. Network inference methods based on GGM make use of the relation,
connecting the partial correlation coefficient of fullorder (LHS) with the elements of Ω, ω_{ ij } ∈ Ω. The partial correlation is of fullorder (with respect to the number of genes) because V∖{ij} is the set of all genes excluding i and j, i.e., the largest possible set of genes not considering i and j.
Several methodological improvements have been suggested to infer gene regulatory networks based on GGM [121–123]. These methods differ in the way the inverse of the covariance matrix, Σ^{−1}, is estimated and in the statistical tests employed to assess significance. The reason for these technical variants comes from a variety of problems. For instance, if the number of samples is smaller than the number of genes, which is typically the case for a microarray data set, the sample covariance matrix is not positive definite and, hence, not invertible. This means that Eqn. 55 cannot be exploited. In order to overcome such practical estimation problems, recently, several extensions based on the LASSO (least absolute shrinkage and selection operator) have been suggested [124–128].
Importance of multiple hypotheses testing and sample size: An example for differentially expressed genes
Typical microarray experiments measure the concentration of thousands of mRNAs simultaneously. For this reason, usually, one does not just test one statistical hypotheses but dozens, hundreds or even thousands. This makes it mandatory to control the overall error rate for all the tests, because the probability to make at least one error, Pr(V ≥ 1α,t) = (1 − (1 − α)^{t}), for a test with a false positive rate of α, increases rapidly with the number of tests t, as can be seen in Figure 5. Here, V corresponds to the number of false positives. From this one can see that even for a moderate number of tested hypotheses, e.g., 300, this probability is already almost 100%. Hence, each of the three principle types of hypotheses tests discussed in the previous sections are severely effected by this problem.
Classically, a Bonferroni correction is used controlling the Family wise Error Rate (FWER) [129, 130],
Unfortunately, this method is often too stringent, which may give no significant results at all. For this reason, alternative error measures and control procedures have been introduced. A recent, very popular measure is the false discovery rate (FDR) [131],
controlled by a procedure introduced by Benjamini & Hochberg (BH) [131]. Subsequently, various related error measures have been proposed like pFDR [132], local FDR [133, 134] and a variety of other control procedures [129, 135]. Also extensions have been suggested [136] that allow the control of an error measure in cases where the underlying tests are not independent from each other. This is particularly important for microarray data that contain a none neglectable correlation structure among the genes.
In order to demonstrate the importance and the influence of different error measures and control procedures, we provide a numerical example, shown in Figure 6. Here, ‘BH’ means the Benjamini & Hochberg procedure to control the FDR [131], ‘pFDR’ indicates the positive false positive rate controlled with a method introduced in [132], ‘Bonferroni’ corresponds to the FWER controlled with a Bonferroni correction and ‘localFDR’ corresponds to the local FDR [134, 137]. The local FDR is defined as,
that means the local FDR is the probability that the null model is true conditioned on the observed test statistic. The data we used for this analysis correspond to simulated gene expression data sampled from a normal distribution with different mean values for the two conditions. More precisely, we simulate 2000 genes of which 400 are differentially expressed (true positives). Further, we study three different (constant) correlation structures with ρ = {0.0,0.2,0.5}. The results shown in Figure 6 are for each sample size averaged over 50 independent runs.
As one can see, ‘BH’ and ‘pFDR’ give more significant results and, hence, have a higher power than the Bonferroni correction and the local FDR when there is either no or only a moderate correlation among the genes. However, it is important to note that the utility of these methods depends on the characteristics of the data. For example, if the average correlation in the data is ρ = 0.0, then ‘pFDR’ tends to perform best (see Figure 6A). However, when the average correlation in the data increases (ρ = 0.5) then the ‘localFDR’ [134, 137] becomes preferable. We want to note, for a sample size of 5, the power of the methods is usually very low because only a couple of genes test significant. In addition, a large fraction of these can be false positives. This seems to be especially for the local FDR method a problem.
Recommendations
In general, there is a tradeoff between a high power of a statistical method on one side, which requires a large number of samples, and low experimental costs on the other. For the identification of differentially expressed genes the results in Figure 6 provide some guidelines. Even for the most favorable condition (for ρ = 0.0) a study will usually be underpowered for ≤ 20 samples, however, on the other hand, even for 10 samples the Type I error will be wellcontrolled.
For gene set and pathwaybased methods such recommendation are more delicate. In [105] two selfcontained (sum of tsquare and Hotelling’s T^{2}[84, 95]) and one competitive test (GSEA [27]) have been analyzed. As a results, it is suggested not to apply a method unconditionally to all pathways in a given data set, but to filter them in order to eliminate conditions for which a method is more likely to cause problems. This can be seen as a reflection of the heterogeneity of cancer, as discussed above in the section ‘Gene expression data and cancer’.
In [105] it has been suggested to filter pathways according to the following criteria: Hotelling’s T^{2} should only be applied to pathways with less than 35 genes and a sample size larger than 30. The sum of tsquare test should only be used for pathways with DC > 10% (DC is detection call; the percentage of differentially expressed genes in a pathway) and a sample size of 25 or larger. GSEA should only be used for pathways with DC > 10% and a sample size larger than 25. That means for the sum of tsquare test and GSEA, at least 10% of the genes in a pathway should be differentially expressed for the method to work. However, this is not independent of the correlation structure of the data. In general, in the presence of high correlations a larger number of differentially expressed genes is beneficial for these methods.
It is important to emphasize that these sample sizes are different to the minimal sample sizes necessary in order to avoid in addition that a study is underpowered. For the minimal sample sizes [105] predict a sample size of 59 for Hotelling’s T^{2} and 57 for the sum of tsquare test and 83 for GSEA. Further, in [95] it was found that using the Nstatistic with 40 samples (or more) leads to a good control of the Type I error and a satisfactory power for a variety of differing conditions, including different correlations of the data and DC values in the pathways. Further studies reviewing related methods can found in [61, 62, 65, 139, 140].
We would like to emphasize that the above recommendations are data dependent. That means it is not possible to judge solely based on the number of samples which method to use. Instead, one needs to estimate characteristics from a particular data set in order to select an appropriate test. This implies, e.g., to estimate the correlation structure and the detection call. In the context of cancer there is an additional problem that needs to be considered. It is known that a tumor is a heterogeneous collection of cells rather than a homogeneous one [26, 141]. This translates into the heterogeneity of gene expression data [142] making it even more dangerous to provide general recommendations without considering a particular data set.
On a general note, we would like to highlight that whenever a given data set allows to (I) identify changes in single genes, (II) identify changes in gene sets or pathways, and (III) identify changes in the correlation structure in pathways, then methods from each of the three categories (I III) should be applied and there is no need to focus on just one of these. The reason for this is that despite the fact that gene set or pathway methods have more explanatory power than methods to identify changes in single genes [64, 95] it does not mean that there are no conditions for which single gene methods reveal interesting biological information that may not be obtained by the other types of methods. For instance, the differential expression of a single gene based on changes in the mean (rather than the variance) may be an indicator for the presence of a single signaling chain rather than of many parallel pathways. Hence, this could provide information about the presence of a Mendelian trait or a complex trait that contains a strong monogenetic component. It appears that for such conditions single gene methods have an advantage over gene set or pathway methods, although, the latter methods may be adaptable to such question as well. However, this may require additional effort. In summary, we recommend to use all different approaches (I III) sideby side, whenever this is permitted by the data, to interrogate the data in the broadest way, because this translates into a diverse set of different biological questions.
Our recommendation complements a common line of thought asking for the combination of different types of data. Although it is certainly true that combining different types of highthroughput data, e.g., from DNA microarray and ChIPchip experiments, is in general more informative, it is also more time and cost intensive to generate such data combinations. For this reason, frequently, only gene expression data are available. Hence, our review provides a survey of method to get the most out of expression data sets.
Finally, we would like to emphasize that all methods require an appropriate filtering and normalization of the data in order to obtain robust and statistically sensible results [143, 144].
Conclusions and discussion
In the post genomic era, biology transitioned from a ‘genecentric’ to a systemsfocused field. This change is also reflected in the transition from methods to identify ‘differentially expressed single genes’ to approaches for finding ‘differentially changed pathways’. Such a transition is natural, because a systems view is required to understand the complex biological functions inside a cell that are responsible for the observable phenotypic outcomes [9, 11, 145].
As recent findings in cancer research demonstrate, cancer is a heterogeneous disorder, even within a particular cancer type. For example, breast cancer is currently subcategorized into four major tumor subtypes [146]: basallike, HER2enriched, luminal (which can be further reduced) and normallike tumors. Considering the fact that these results have been achieved by using highthroughput data one can expect further refinements when data from different hightthroughput technologies become available and being combined with each other. For this reason, it appears sensible to assume such a heterogeneity not only on the global, phenotype level, but also within the cells, on the pathwaylevel. This implies that a pathwaybased filtering, as suggested in [105], is necessary to apply a method only selectively, and not unconditional, to cancer pathways.
Regarding potential future directions, we expect to see an increase in methods that target changes in the correlation structure in pathways for three reasons. First, genes and their products do interact with each other. This implies that there exists a correlation structure among these entities that represents, potentially, useful biological information that may be missed by coexpression based methods [106]. Second, the costs to generate highthroughput data are declining, which makes it easier for the experimenter to generate a sufficiently large number of samples that enables such an analysis. This is an important point, since the required sample sizes for a pathway analysis is considerably larger than for single gene analyses. Third, biologically, the hallmarks of cancer point to a few pathways as pivotal elements in the molecular elucidation of carcinogenesis, e.g., Wnt/Notch signaling, Hedgehog signaling or DNA damage control [147–149]. Hence, semantically, pathway studies enable the systematic connection of oncogenes, tumorsuppressor genes and stability genes [150] to provide fundamental insights into causal mechanisms underlying cancer. Unfortunately, the temporary literature especially of methodological papers discuss their results rarely in the framework of the hallmark pathways. For this reason, we suggest that future studies aim for a conceptual discussion of their results within this enlightening framework. Not because it provides the final answers to understand cancer [151], but due to the fact that it enables a systematic approach to the emperor of all maladies [152].
Reviewers’ comments
First of all, we would like to thank all referees for their fruitful suggestions and comments. In the following, we kept our answers to the raised issues short but included our responses in the main text.
Referee 1: Dr. Arcady Mushegian
The manuscript by EmmertStreib and colleagues is a review of statistical methods for analysis of gene expression data, but it is also much more than that. It is relatively rare for the statisticians to review all classes of such methods and to give an eminently logical classification not only of the techniques on which the methods are based, but also of the kinds of questions that are asked when applying these methods. This, certainly, is a strength of the work and the reason why it should appeal to the biologists that would like to have a deeper insight into which methods are appropriate to which task at hand.
I have, however, several comments that rank somewhere between suggestions and concerns. Most importantly, the authors propose to distinguish three groups of methods: those that identify changes in single genes, those that identify changes in gene sets or pathways, and those that identify changes in the correlation structure in pathways. (By the way, in the Abstract and elsewhere, the description of the groups is almost the same as above, but “changes” are substituted by “differential changes”  is it not a tautology, in particular when there are only two samples?). Then, in discussing the first two classes of methods, the authors almost in every case give a clear formulation of the question that is being asked of the data, in the form of the statistical hypothesis about the data that is being tested. This is an excellent way of explaining things. Unfortunately, it is not consistently applied: even among these classes of methods, the hypotheses are not mentioned, and then, upon discussing the differentialcorrelation methods (pp. 1518), the hypotheses are not explicated at all, except for the IPS method. I think this need to be changed, and the null hypotheses need to be stated for all methods for which this is possible; and if the framework is such that no explicit null hypothesis exist, this needs to be discussed, and the applicable intuitive formulation be given.
Reply
We appreciate this suggestion and added to all methods the definition of their null hypothesis. In addition, we extended the discussion in section ‘Formulating biological hypotheses’ explaining why it can be difficult to find a biological interpretation for a null hypothesis and we offer some explanation for this.
Question
My other concern is abut Figures 3 and 4. The authors never state what the data points there represent. They must be expression values for two genes, but how are these data collected  are they technical replicates? biological replicates? some kind of ordered series? unordered series such as for example different drug treatments? Does it matter what of the above they are?
Reply
We added an explanation of the data, which are simulated data to visualize the principle idea underlying some methods, to the corresponding methods.
Question
The third shortcoming of the paper is that there is a significant disconnect between wellcovered methodology and the stated goal of discussing the application to cancer biology. In fact, the short discussion about cancer hallmarks is an excellent introduction that points out the way in which analysis of gene expression can lead to the understanding of changes in expression of particular (“hallmark”) pathways. This theme, however, is not followed through. Though occasionally we read that such and such method was applied to analysis of a particular type of cancer, there is never any discussion of what was found in gene expression data that allowed an insight into cancer biology. What happens to the hallmark pathways at the level of gene expression programs? Which methods have been used to support (or maybe question?) which aspects of the hallmark hypothesis? Which pathways were predicted or shown to be differentially regulated at the transcription or mRNA concentration level?
Reply
We agree with the reviewer that ‘Which methods have been used to support (or maybe question?) which aspects of the hallmark hypothesis? ’ is an important questions. Unfortunately, the methodology oriented literature does rarely touch this topic in a clear manner. That means in order to extend the paper in this direction we could not survey these issues but would need to establish such results. Instead, the concern in our paper is to propagate such an approach in the context of the presented methods. A discussion has been added to ‘Discussion and conclusions’.
Question
Finally, there is the question of, if you will, general biology of transcriptional response. It stands to reason, and indeed has been occasionally shown, that in order for a pathway to be regulated, it may not be necessary to regulate all its components at the same (in this case, mRNA concentration) level. One may find that the gene product amounts are regulated at different levels, or maybe even only one or a few, e.g., ratelimiting, components are regulated at all. This would argue that single genebased methods may in these cases provide a better clue to the process than pathwaybased or gene set enrichmentbased methods. It would be interesting to know whether this has been observed in the cancer datasets. A related question is about the rules of thumb in pathway analysis: for example, if a typical pathway (network module?) has a size of N genes, what is the number of genes in this pathway m < N that would still register as an enrichment in some of the tests that the authors discuss?
Reply
This is an important point. We included a discussion of this to section ‘Recommendations’. We added also a discussion of the danger of general suggestions and motivate this by known characteristics of gene expression data from cancer. The problem is twofold. First, each method has its own characteristics under what conditions it works best. Second, data sets from cancer are very heterogeneous so that two data sets containing about the same number of samples can exhibit a very different correlation structure and expression patterns. This holds potentially also for different grades of one cancer type.
Regarding the first question, it appears to us that this is related to the presence or absence of parallel pathways conveying a molecular signal. If for example no parallel pathways exist the detection of differentially expressed genes can provide a robust way to detect functional changes. On the other hand, if there are many alternatives this may not be the case and gene set methods appear to be better suited for such a situation. In general, this kind of crossmethod comparisons are not well studied and we are not aware that this has been systematically addressed for cancer or other data sets. One reason for this is that until recently, most data sets contained less than 20 samples per condition, which usually does not permit a robust analysis of gene set or pathway methods and once larger data sets became available the detection of differentially expressed genes was neglected, potentially, due to the erroneous assumption that differentially expressed gene set methods include the former tests.
In order to emphasize that it is desirable to apply methods from all three different levels simultaneously ((I) identify changes in single genes, (II) identify changes in gene sets or pathways, and (III) identify changes in the correlation structure in pathways) whenever a given data sets allows this, rather than to focus on just one of these levels, we added a discussion to section ‘Recommendations’.
Thank you for your suggestions and comments.
Referee 2: Dr. ByungSoo Kim
General comments This is a well organized review of recent statistical methods of analyzing microarray experiment data sets, particularly on cancers, from single gene analysis to identifying differential changes in pathway, and finally to comparing a given pathway under two different conditions. However, I would like to indicate following four points for the possible improvement. (1) Gaussian graphical model: From the methodological point of view, it is desired to include the sparse Gaussian graphical model (GGM) approach for estimating the gene network under the multivariate normal assumption from a microarray data set. For the recent development of GGM approach one can include glasso (Friedman, Hastie and Tibshirani, 2008; Witten, Friedman and Simon, 2011) [125, 127], SCAD penalty of Fan, Feng and Wu (2009) [124], adaptive lasso of Zou (2006) [128] and Kiiveri (2012) [126], among others. (2) Effect of intergene correlations on the single gene analysis. A series of Efron’s recent work (Efron 2007a, 2007b) [134, 137] discussed in detail on how intergene correlations could affect the detection of differentially expressed (DE) genes in a single gene analysis? By including Efron’s recent work and his R package “locfdr” authors can show how FDR can be used in the real data analysis in their Section on “Importance of multiple hypotheses testing and sample size: An example for differentially expressed genes”. (3) Some of the reviews are misleading. These are the few examples. (i) The sentence, at the middle of page 12, “However, in order to use a twosample ttest with equal size of the two samples it is assumed that the mean fold change f’ and its standard deviation σ_{ f } would be the same for a randomly selected background set consisting of only m genes, see Eqn. 10”. Actually ([99], Luo et al., 2009) assumes the i.i.d of the fold change of genes to make Eqn 10 have a t distribution. Here the key assumption was the independence, which was missing in the aforementioned sentence. (ii) p. 14. Eqn 16. In ([126], Tian et al., 2005) no tsquare statistic was employed. (iii) Eqn 24 of p. 18 does not make sense. Authors of ([20], Cho et al., 2009) didn’t make it clear in their equation (3) what Renyi entropy was when the underlying random variables were continuous. (iv) I would suggest authors to allocate more space on the work of ([90], Massa et al, 2010) which was methodologically sound and deserve more coverage than just the IPS algorithm. (4) Inconsistency of notations. In page 11 authors defined p and m to be sizes of the background genes and a target gene set, respectively. However, in line 2 of page 15 “p genes” (which should have been m genes according to page 11 definition) was incorrectly labeled. This inconsistency was repeated in Nstatistic section of p.15, and also in Eqn 16 in p. 14 and Eqn 22 of p. 17. The “pdimensional..” should be “mdimensional..” at the bottom two lines of p. 14.
Minor Comments

1.
p.2 “Gene expression data from next generation sequencing (RNAseq)”. This is an important issue. There is no direct relevance, however, with statistical methods reviewed in this paper.

2.
p.4. For detecting differential correlation and differential variance, it would be better to explain why these approaches were taken. For example, in ([54], Ho et al., 2008) it was clearly indicated that changes in expression variability were associated with changes in coexpression pattern, which implied that DV was a signal rather than a noise.

3.
Legend of Figure 2. “The data is..” should be “The data are.. ”.

4.
p.7. There is no reference of Figures A, B in the main text. Also indicate in the legend of Figure 3 what Σ is in Figure 3E.

5.
p.8. In the legend of Figure 4 what the symbols in the outerpanel represent? What do the lines represent? It is better to use different notation (A, B) to avoid confusion in the main text of the second paragraph of p. 9.

6.
p.9 What is “alpha” in Equation (4)?

7.
p.9 line 9. You may include two specific patterns of dependence of two genes, namely, type A dependence of Klevanov, Jordan and Yakovlev (2006), and hidden regulator dependence of Lim, Kim and Kim (2011).

8.
p.15 line 13. “euclidean Kernel” should be “Euclidean kernel” (9)

9.
p.15 line 10. “a either” should be “either”.

10.
p.15 line 8. Author may want to include Tsai and Chen (2009) for another reference of Hotelling’s Tsquare statistic.

11.
p.17. line 15, What are “A” and “B”?

12.
p. 18 line 2. Better to include Lauritzen (1996) as a reference of IPS algorithm.

13.
p. 22. It would be more beneficial for the read to move the last paragraph of p. 22 (extended to p. 23) to Introduction section.
References Efron B. (2007a). Correlation and largescale simultaneous signifance testing, J. Amer. Statist. Assoc. 102:93103. Efron B. (2007b). Size, power and false discovery rates, Annals of Statist. 35:13511377. Fan J, Feng Y, Wu Y. (2009). Network exploration via the adaptive lasso and SCAD penalties. Ann. Statist. 3:521541. Friedman J, Hastie T, Tibshirani R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9:432441. Lauritzen SL. (1996), Graphical models, Oxford: Clarendon Press. Lim J, Kim J, Kim BS (2011). An alternative model of type A dependence in a gene set of correlated genews, Statist. Appl. Genet. Mol. Biol. Vol. 9, Article 12. Kiiver H, de Hoog F. (2012). Fitting very large Gaussian graphical models. Comp. Statist. Data Anal. 56:26262636. Klebanov L, Jordan C, Yakovlev A. (2006). A new type of stochastic dependence revealed in gene expression data, Statist. Appl. Genet. Mol. Biol. Vol. 5, Article 7. Tsai CA, Chen J. (2009). Multivariate analysis of variance test for gene set analysis, Bioinformatics, 25:897903. Witten DM, Friedman JH, Simon N. (2011). New insights and faster computation for the graphical lasso. J. Comp. Graph. Stat. 20:892900. Zou H. (2006). The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc. 101:14181429.
Reply
We revised our text correspondingly and addressed all your suggestions. We would like to point out that the major goal of our review is not a full coverage of statistical details but to provide sufficient information for the reader to acquire a basic understanding of major principles and assumptions that underly the methods. The problem is that if too many detail are presented the paper would turn quickly into a formal description which may not be appreciated by a biology oriented readership.
Minor Comments

1.
p.12. line 1: What is N?

2.
p. 15. line 7 5: “two i.i.d samples of genes..” is rather confusing. Luo et al. (2005) assumed the i.i.d of the fold change of genes, which was much stronger than just assuming equal mean and variance. It is better to rewrite this sentence to convey the original material.

3.
p.17. line 1: “p genes” should be “m genes”.

4.
p.22. Eqn. (50): What are p and q? What are S _{ i } and S _{ j }?

5.
p.37. Reference 30, p. 38. References 40, 59; The journal title should be consistent with Reference 27 or vise versa.

6.
p.40. References 90, 108: Location of the publisher is missing.

7.
p.40. References 93,94; The journal title should be consistent with Reference 119.

8.
p.40. Reference 109: Author was duplicated at the end. The location and the publisher were missing.

9.
p.41. Reference 117. The article title is missing.

10.
p.41. Reference 118: The location of the publisher is missing.

11.
p.41. Reference 133. The journal title should be consistent with Reference 27.
Reply
All comments have been addressed and we revised the main text correspondingly.
Thank you for your suggestions and comments.
Referee 3: Dr. Joel Bader
This manuscript reviews methods for analyzing gene expression data with tests of individual genes, gene pairs, gene sets, and networks. The manuscript is strong in covering many methods. It would be more helpful if the authors also provided a point of view or evaluation of methods. Can anything be said about the relative power of different approaches, or which have proven to be more useful in practice? What about the tradeoff between robustness, power, and speed for realistic data? Most of the discussion of method choice is generally about sample size requirements for all methods rather than method choice given sample size. The two parts of the manuscript, gene expression and cancer, don’t really mesh. Most of the methods review is not cancer specific. Possibly of greater relevance to cancer are methods that combine different types of data.
The manuscript is generally well written and easy to understand, with ample references to the original work and to previous reviews.
Minor corrections

1.
p. 1 ‘one gene, −> should be ‘ for openquote in latex, here and elsewhere

2.
p. 2 differnt microarray −> spelling

3.
p. 2 comprises, e.g., mRNAs −> ‘e.g.’ doesn’t sound right here. How about providing a full list: mRNA, tRNA, rRNA, and short regulatory RNAs

4.
p. 2 ‘In the third step the reads are mapped to known exon sequences of genes.’ Are there also de novo assembly methods that don’t require a template? ‘allows to overcome’ −> overcomes

5.
p. 3 allows to measure −> measures. Can also mention other advantages: splice variants, sequence polymorphisms, no need to design and build a custom chip

6.
‘correspond to: self sufficiency’ −> no colon between preposition and noun phrase. Can the hallmarks be parallel, all start with noun or verb?

7.
p. 9 Eq. 4. How is alpha calculated?

8.
Eq. 5 need i = 1 underneath the summation

9.
p. 14. Eq. 16 Under the null, it seems that Σ _{t,2}should approach $1/\sqrt{(}p)$ rather than 0.

10.
Eq. 16 How is the significance of SAMGS calculated?

11.
p. 18 Eq. 24 and text after, use log in math mode rather than log.
Reply
All comments have been addressed and we revised the main text correspondingly.
Thank you for your suggestions and comments.
References
 1.
Bock G, Goode J: Novartis Foundation Symposium. 1998, John Wiley & Sons
 2.
Van Regenmortel M: Reductionism and complexity in molecular biology. EMBO reports. 2004, 5 (9): 10161020.
 3.
Mazzocchi F: Complexity in biology. EMBO Rep. 2008, 9: 1014. 10.1038/sj.embor.7401147.
 4.
von Bertalanffy L: An outline of general systems theory. Br J Philosophy Sci. 1950, 1 (2): 134165.
 5.
Beadle GW, Tatum EL: Genetic control of biochemical reactions in neurospora. Proc Natl Acad Sci USA. 1941, 27 (11): 499506. 10.1073/pnas.27.11.499.
 6.
Hanahan D, Weinberg RA: The hallmarks of cancer. Cell. 2000, 100: 5770. 10.1016/S00928674(00)816839.
 7.
Noble D: Genes and causation. Phil Trans R Soc A. 2008, 366: 300103015. 10.1098/rsta.2008.0086.
 8.
Kitano H: Systems biology: a brief overview. Science. 2002, 295 (5560): 16621664. 10.1126/science.1069492.
 9.
Han JDJ: Understanding biological functions through molecular networks. Cell Res. 2008, 18 (2): 224237. 10.1038/cr.2008.16.
 10.
MacDougallShackleton SA: The levels of analysis revisited. Phil Trans R Soc B: Biol Sci. 2011, 366 (1574): 20762085. 10.1098/rstb.2010.0363.
 11.
Barabasi AL, Oltvai ZN: Network biology: understanding the cell’s functional organization. Nat Rev. 2004, 5: 101113. 10.1038/nrg1272.
 12.
Brazhnik P, de la Fuente A, Mendes P: Gene networks: how to put the function in genomics. Trends Biotechnol. 2002, 20 (11): 467472. 10.1016/S01677799(02)02053X.
 13.
EmmertStreib F, Glazko G: Network biology: a direct approach to study biological function. Wiley Interdiscip Rev Syst Biol Med. 2011, 3 (4): 379391. 10.1002/wsbm.134.
 14.
Davidson E, Levin M: Gene regulatory networks. Proc Natl Acad Sci USA. 2005, 102 (14): 493510.1073/pnas.0502024102.
 15.
de Matos Simoes R, Tripathi S, EmmertStreib F: Organizational structure of the peripheral gene regulatory network in Bcell lymphoma. BMC Syst Biol. 2012, 6: 3810.1186/17520509638.
 16.
Jones S, Thornton JM: Principles of proteinprotein interactions. Proc Nat Acad Sci. 1996, 93: 1320. 10.1073/pnas.93.1.13.
 17.
Maslov S, Sneppen K: Specificity and stability in topology of protein networks. Science. 2002, 296 (5569): 910913. 10.1126/science.1065103.
 18.
Jeong H, Tombor B, Albert R, Olivai Z, Barabasi A: The largescale organization of metabolic networks. Nature. 2000, 407: 651654. 10.1038/35036627.
 19.
Babu MM, Luscombe NM, Aravind L, Gerstein M, Teichmann SA: Structure and evolution of transcriptional regulatory networks. Curr Opin Struct Biol. 2004, 14: 283291. 10.1016/j.sbi.2004.05.004.
 20.
Lee TI, et al: Transcriptional regulatory networks in saccharomyces cerevisiae. Science. 2002, 298 (5594): 799804. 10.1126/science.1075090.
 21.
Allison DB: Microarray data analysis: from disarray to consolidation and consensus. Nat Rev Genet. 2006, 7: 5565. 10.1038/nrg1749.
 22.
Dehmer M, EmmertStreib F, Graber A, Salvador A(Eds): Applied Statistics for Network Biology: Methods for Systems Biology. 2011, Weinheim: WileyBlackwell
 23.
Quackenbush J: Computational analysis of microarray data. Nat Rev Genet. 2001, 2 (6): 418427. 10.1038/35076576.
 24.
Metzker ML: Sequencing technologies  the next generation (With NOTES). Nat Rev Genet. 2010, 11: 3146. 10.1038/nrg2626.
 25.
Wang Z, Gerstein M, Snyder M: RNASeq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009, 10: 5763. 10.1038/nrg2484.
 26.
Hanahan D, Weinberg RA: Hallmarks of cancer: the next generation. Cell. 2011, 144 (5): 646674. 10.1016/j.cell.2011.02.013.
 27.
Subramanian A, Tamayo P, Mootha V, Mukherjee S, Ebert B, Gillette M, Paulovich A, Pomeroy S, Golub T, Lander E, Mesirov J: Gene set enrichment analysis: a knowledgebased approach for interpreting genomewide expression profiles. Proc Natl Acad Sci USA. 2005, 102 (43): 1554550. 10.1073/pnas.0506580102.
 28.
Chuang HY, Lee E, Liu YT, Ideker T: Networkbased classification of breast cancer metastasis. Mol Syst Biol. 2007, 3: 140
 29.
Compagno M, Lim WK, Grunn A, Nandula SV, Brahmachary M, Shen Q, Bertoni F, Ponzoni M, Scandurra M, Califano A, et al: Mutations of multiple genes cause deregulation of NFkappaB in diffuse large Bcell lymphoma. Nature. 2009, 459 (7247): 717721. 10.1038/nature07968.
 30.
Horvath S, Zhang B, Carlson M, Lu KV, Zhu S, Felciano RM, Laurance MF, Zhao W, Qi S, Chen Z, et al: Analysis of oncogenic signaling networks in glioblastoma identifies ASPM as a molecular target. Proc Natl Acad Sci USA. 2006, 103 (46): 1740217407. 10.1073/pnas.0608396103.
 31.
Krivtsov AV, Twomey D, Feng Z, Stubbs MC, Wang Y, Faber J, Levine JE, Wang J, Hahn WC, Gilliland DG, et al: Transformation from committed progenitor to leukaemia stem cell initiated by MLLAF9. Nature. 2006, 442 (7104): 818822. 10.1038/nature04980.
 32.
Oskarsson T, Acharyya S, Zhang XHF, Vanharanta S, Tavazoie SF, Morris PG, Downey RJ, ManovaTodorova K, Brogi E, Massague J: Breast cancer cells produce tenascin C as a metastatic niche component to colonize the lungs. Nat Med. 2011, 17 (7): 867874. 10.1038/nm.2379.
 33.
Mavrakis KJ, Wolfe AL, Oricchio E, Palomero T, De Keersmaecker K, McJunkin K, Zuber J, James T, Khan AA, Leslie CS, et al: Genomewide RNAmediated interference screen identifies miR19 targets in Notchinduced Tcell acute lymphoblastic leukaemia. Nat Cell Biol. 2010, 12 (4): 372379. 10.1038/ncb2037.
 34.
Nam S, Park T: Pathwaybased evaluation in early onset colorectal cancer suggests focal adhesion and immunosuppression along with EpithelialMesenchymal transition. PLoS ONE. 2012, 7 (4): e3168510.1371/journal.pone.0031685.
 35.
Guedj M, Marisa L, De Reynies A, Orsetti B, Schiappa R, Bibeau F, Macgrogan G, Lerebours F, Finetti P, Longy M, et al: A refined molecular taxonomy of breast cancer. Oncogene. 2011, 31 (July 2011): 11961206.
 36.
Lehmann BD, Bauer JA, Chen X, Sanders ME, Chakravarthy AB, Shyr Y, Pietenpol JA: Identification of human triplenegative breast cancer subtypes and preclinical models for selection of targeted therapies. J Clin Invest. 2011, 121 (7): 27502767. 10.1172/JCI45014.
 37.
Fabbri G, Rasi S, Rossi D, Trifonov V, Khiabanian H, Ma J, Grunn A, Fangazio M, Capello D, Monti S, et al: Analysis of the chronic lymphocytic leukemia coding genome: role of NOTCH1 mutational activation. J Exp Med. 2011, 208 (7): 13891401. 10.1084/jem.20110921.
 38.
Ooi CH, Ivanova T, Wu J, Lee M, Tan IB, Tao J, Ward L, Koo JH, Gopalakrishnan V, Zhu Y, Cheng LL, Lee J, Rha SY, Chung HC, Ganesan K, So J, Soo KC, Lim D, Chan WH, Wong WK, Bowtell D, Yeoh KG, Grabsch H, Boussioutas A, Tan P: Oncogenic pathway combinations predict clinical prognosis in gastric cancer. PLoS Genet. 2009, 5 (10): e100067610.1371/journal.pgen.1000676.
 39.
Setlur SR, Royce TE, Sboner A, Mosquera JM, Demichelis F, Hofer MD, Mertz KD, Gerstein M, Rubin MA: Integrative microarray analysis of pathways dysregulated in metastatic prostate cancer. Cancer Res. 2007, 67 (21): 1029610303. 10.1158/00085472.CAN072173.
 40.
Nucera C, Porrello A, Antonello ZA, Mekel M, Nehs MA, Giordano TJ, Gerald D, Benjamin LE, Priolo C, Puxeddu E, et al: BRaf(V600E) and thrombospondin1 promote thyroid cancer progression. Proc Natl Acad Sci USA. 2010, 107 (23): 1064910654. 10.1073/pnas.1004934107.
 41.
Shah MA, Khanin R, Tang L, Janjigian YY, Klimstra DS, Gerdes H, Kelsen DP: Molecular classification of gastric cancer: a new paradigm. Clin Cancer Res. 2011, 17 (9): 26932701. 10.1158/10780432.CCR102203.
 42.
Perroud B, Lee J, Valkova N, Dhirapong A, Lin PY, Fiehn O, Kultz D, Weiss R: Pathway analysis of kidney cancer using proteomics and metabolic profiling. Mol Cancer. 2006, 5: 6410.1186/14764598564.
 43.
Trewavas A: A Brief History of Systems Biology: “Every object that biology studies is a system of systems.” Francois Jacob (1974). Plant Cell. 2006, 18 (10): 24202430. 10.1105/tpc.106.042267.
 44.
EmmertStreib F, Dehmer M: Networks for systems biology: conceptual connection of data and function. IET Syst Biol. 2011, 5 (3): 18510.1049/ietsyb.2010.0025.
 45.
Macneil LT, Walhout AJM: Gene regulatory networks and the role of robustness and stochasticity in the control of gene expression. Genome Res. 2011, 21 (5): 64557. 10.1101/gr.097378.109.
 46.
Lehman E: Testing Statistical Hypotheses. 2005, New York: Springer
 47.
DasGupta A: Probability for Statistics and Machine Learning. 2011, New York: Springer
 48.
Chen Y, Dougherty ER, Bittner ML: Ratiobased decisions and the quantitative analysis of cDNA microarray smages. J Biomed Optics. 1997, 2 (4): 364374. 10.1117/12.281504.
 49.
Zhang L, Zhou W, Velculescu VE, Kern SE, Hruban RH, Hamilton SR, Vogelstein B, Kinzler KW: Gene expression profiles in normal and cancer cells. Science. 1997, 276 (5316): 12681272. 10.1126/science.276.5316.1268.
 50.
Tusher V, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA. 2001, 98 (18): 51165121.
 51.
Chu G, Narasimhan B, Tibshirani R, Tusher V: Significance analysis of microarrays (SAM) software. Nature. 2002, 5: 436442.
 52.
Smyth GK: Limma: linear models for microarray data. Bioinformatics and Computational Biology Solutions using R and Bioconductor. Edited by: Gentleman R, Carey V, Dudoit S, Irizarry R, Huber W. 2005, New York: Springer, 397420.
 53.
Efron B, Tibshirani R, JD S, Tusher V: Empirical Bayes analysis of a microarray experiment. J Am Stat Assoc. 2001, 96 (456): 11511160. 10.1198/016214501753382129.
 54.
Ho JWK, Stefani M, Dos Remedios CG, Charleston MA: Differential variability analysis of gene expression and its application to human diseases. Bioinformatics. 2008, 24 (13): i390i398. 10.1093/bioinformatics/btn142.
 55.
Hu R, Qiu X, Glazko G, Klebanov L, Yakovlev A: Detecting intergene correlation changes in microarray analysis: a new approach to gene selection. BMC Bioinformatics. 2009, 10: 2010.1186/147121051020.
 56.
Dettling M, Gabrielson E, Parmigiani G: Searching for differentially expressed gene combinations. Genome Biol. 2005, 6 (10): R8810.1186/gb2005610r88.
 57.
Lai Y, Wu B, Chen L, Zhao H: A statistical method for identifying differential genegene coexpression patterns. Bioinformatics. 2004, 20 (17): 31463155. 10.1093/bioinformatics/bth379.
 58.
Dawson JA, Ye S, Kendziorski C: R/EBcoexpress: an empirical Bayesian framework for discovering differential coexpression. Bioinformatics. 2012, 28 (14): 19391940. 10.1093/bioinformatics/bts268.
 59.
Li KC: Genomewide coexpression dynamics: theory and application. Proc Natl Acad Sci USA. 2002, 99: 1687516880. 10.1073/pnas.252466999.
 60.
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, IsselTarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The gene ontology consortium. Nature Genet. 2000, 25: 2529. 10.1038/75556.
 61.
Ackermann M, Strimmer K: A general modular framework for gene set enrichment analysis. BMC Bioinformatics. 2009, 10: 4710.1186/147121051047.
 62.
Dinu I, Potter JD, Mueller T, Liu Q, Adewale AJ, Jhangri GS, Einecke G, Famulski KS, Halloran P, Yasui Y: Geneset analysis and reduction. Brief Bioinform. 2009, 10: 2434.
 63.
EmmertStreib F, Glazko G: Pathway analysis of expression data: deciphering functional building blocks of complex diseases. PLoS Comput Biology. 2011, 7 (5): e100205310.1371/journal.pcbi.1002053.
 64.
Khatri P, Sirota M, Butte A J: Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol. 2012, 8 (2): e100237510.1371/journal.pcbi.1002375.
 65.
Liu Q, Dinu I, Adewale A, Potter J, Yasui Y: Comparative evaluation of geneset analysis methods. BMC Bioinformatics. 2007, 8: 43110.1186/147121058431.
 66.
Goeman J, Buhlmann P: Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics. 2007, 23 (8): 9807. 10.1093/bioinformatics/btm051.
 67.
Huang DW, Sherman BT, Lempicki RA: Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucl Acids Res. 2009, 37: 113. 10.1093/nar/gkn923.
 68.
Mootha V, Lindgren C, Eriksson KFea: PGC1alpharesponsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genet. 2003, 34: 267273. 10.1038/ng1180.
 69.
Efron B, Tibshiran R: On testing the significance of sets of genes. Ann Appl Stat. 2007, 1: 107129. 10.1214/07AOAS101.
 70.
Dørum G, Snipen L, Solheim M, Sæbø S: Rotation testing in gene set enrichment analysis for small direct comparison experiments. Stat App Genet Mol Biol. 2009, 8: 34
 71.
Luo W, Friedman M, Shedden K, Hankenson K, Woolf P: GAGE: generally applicable gene set enrichment for pathway analysis. BMC Bioinformatics. 2009, 10: 16110.1186/1471210510161.
 72.
Kim SY, Volsky D: PAGE: Parametric Analysis of Gene Set Enrichment. BMC Bioinformatics. 2005, 6: 14410.1186/147121056144.
 73.
Newton M, Quintana F, den Boon Jea: Randomset methods identify distinct aspects of the enrichment signal in geneset analysis. Ann Appl Stat. 2007, 1: 85106. 10.1214/07AOAS104.
 74.
Freudenberg JM, Sivaganesan S, Phatak M, Shinde K, Medvedovic M: Generalized random set framework for functional enrichment analysis using primary genomics datasets. Bioinformatics. 2011 Jan 1, 27 (1): 707. 10.1093/bioinformatics/btq593.
 75.
Rafael IA, Chi W, Yun Z, Terence SP: Gene set enrichment analysis made simple. Stat Methods Med Res. 2009, 18 (6): 565575. 10.1177/0962280209351908.
 76.
Lange K: Numerical Analysis for Statisticians. Statistics and Computing. 2010, Springer
 77.
Pyeon D, Newton MA, Lambert PF, den Boon JA, Sengupta S, Marsit CJ, Woodworth CD, Connor JP, Haugen TH, Smith EM, Kelsey KT, Turek LP, Ahlquist P: Fundamental differences in cell cycle deregulation in human Papillomaviruspositive and human Papillomavirusnegative head/neck and cervical cancers. Cancer Res. 2007, 67 (10): 46054619. 10.1158/00085472.CAN063619.
 78.
Sheskin DJ: Handbook of Parametric and Nonparametric Statistical Procedures. 3rd edition. Boca Raton: RC Press; 2004.
 79.
Tian L, Greenberg SA, Kong SW, Altschuler J, Kohane IS, Park PJ: Discovering statistically significant pathways in expression profiling studies. Proc Natl Acad Sci USA. 2005, 102 (38): 1354413549. 10.1073/pnas.0506577102.
 80.
Jiang Z, Gentleman R: Extensions to gene set enrichment. Bioinformatics. 2007, 23 (3): 306313. 10.1093/bioinformatics/btl599.
 81.
Dinu I, Potter JD, Mueller T, Liu Q, Adewale AJ, Jhangri GS, Einecke G, Famulski KS, Halloran P, Yasui Y: Improving gene set analysis of microarray data by SAMGS. BMC Bioinformatics. 2007, 8: 24210.1186/147121058242.
 82.
Goeman JJ, van de Geer SA, de Kort F, van Houwelingen HC: A global test for groups of genes: testing association with a clinical outcome. Bioinformatics. 2004, 20: 9399. 10.1093/bioinformatics/btg382.
 83.
Hummel M, Meister R, Mansmann U: GlobalANCOVA: exploration and assessment of gene group effects. Bioinformatics. 2008, 24: 7885. 10.1093/bioinformatics/btm531.
 84.
Lu Y, Liu P, Xiao P, Deng H: Hotelling’s T2 multivariate profiling for detecting differential expression in microarrays. Bioinformatics. 2005, 21 (14): 31053113. 10.1093/bioinformatics/bti496.
 85.
Kong S, Pu W, Park P: A multivariate approach for integrating genomewide expression data and biological knowledge. Bioinformatics. 2006, 22 (19): 23732380. 10.1093/bioinformatics/btl401.
 86.
Tsai C, Chen J: Multivariate analysis of variance test for gene set analysis. Bioinformatics. 2009, 25 (7): 897903. 10.1093/bioinformatics/btp098.
 87.
Xiong H: Nonlinear tests for identifying differentially expressed genes or genetic networks. Bioinformatics. 2006, 22 (8): 919923. 10.1093/bioinformatics/btl034.
 88.
Klebanov L, Glazko G, Salzman P, Yakovlev A, Xiao Y: A multivariate extension of the gene set enrichment analysis. J Bioinform Comput Biol. 2007, 5 (5): 11391153. 10.1142/S0219720007003041.
 89.
Yates P, Reimers M: RCMAT: a regularized covariance matrix approach to testing gene sets. BMC Bioinformatics. 2009, 10: 30010.1186/1471210510300.
 90.
Draghici S, Khatri P, Tarca AL, Amin K, Done A, Voichita C, Georgescu C, Romero R: A systems biology approach for pathway level analysis. Genome Res. 2007, 17 (10): 15371545. 10.1101/gr.6202607.
 91.
Tarca AL, Draghici S, Khatri P, Hassan SS, Mittal P, Kim JS, Kim CJ, Kusanovic JP, Romero R: A novel signaling pathway impact analysis. Bioinformatics. 2009, 25: 7582. 10.1093/bioinformatics/btn577.
 92.
Thomas R, Gohlke JM, Stopper GF, Parham FM, Portier CJ: Choosing the right path: enhancement of biologically relevant sets of genes or proteins using pathway structure. Genome Biol. 2009, 10 (4): R4410.1186/gb2009104r44.
 93.
Vaske CJ, Benz SC, Sanborn JZ, Earl D, Szeto C, Zhu J, Haussler D, Stuart JM: Inference of patientspecific pathway activities from multidimensional cancer genomics data using PARADIGM. Bioinformatics. 2010, 26 (12): i237—i245
 94.
Massa M, Chiogna M, Romualdi C: Gene set analysis exploiting the topology of a pathway. BMC Syst Biol. 2010, 4: 121
 95.
Glazko G, EmmertStreib F: Unite and conquer: univariate and multivariate approaches for finding differentially expressed gene sets. Bioinformatics. 2009, 25 (18): 23482354. 10.1093/bioinformatics/btp406.
 96.
Ledoit O, Wolf M: Improved estimation of the covariance matrix of stock returns with an application to portfolio selection. J Empir Finance. 2003, 10: 603621. 10.1016/S09275398(03)000070.
 97.
Ledoit O, Wolf M: A well conditioned estimator for largedimensional covariance matrices. J Multiv Anal. 2004, 88: 365411. 10.1016/S0047259X(03)000964.
 98.
Ledoit O, Wolf M: Honey, I shrunk the sample covariance matrix. J Portfolio Manage. 2004, 30: 110119. 10.3905/jpm.2004.110.
 99.
Schäfer J, Strimmer K: A shrinkage approach to largescale covariance matrix estimation and implications for functional Genomics. Stat Appl Genet Mol Biol. 2005, 4: 32
 100.
Kanehisa M, Goto S: KEGG: kyoto encyclopia of genes and genomes. Nuclei Acids Res. 2000, 28: 2730. 10.1093/nar/28.1.27.
 101.
Lauritzen S: Graphical Models. 1996, New York: Oxford Science Publications, Clarendon Press
 102.
Matthews L, Gopinath G, Gillespie M, Caudy M, Croft D, de Bono B, Garapati P, Hemish J, Hermjakob H, Jassal B, Kanapin A, Lewis S, Mahajan S, May B, Schmidt E, Vastrik I, Wu G, Birney E, Stein L, D’Eustachio P: Reactome knowledgebase of human biological pathways and processes. Nucleic Acids Res. 2009, 37 (suppl 1): D619—D622
 103.
Klebanov L, Jordan C, Yakovlev A: A new type of stochastic dependence revealed in gene expression data. Stat Appl Genet Mol Biol. 2006, 5 (05/11): Article7
 104.
Lim J, Kim J, Kim B: An alternative model of type A dependence in a gene set of correlated genes. Stat Appl in Genet Mol Biol. 2010, 9: Article 12
 105.
Tripathi S, EmmertStreib F: Assessment method for a power analysis to identify differentially expressed pathways. PLoS ONE. 2012, 7 (5): e3751010.1371/journal.pone.0037510.
 106.
EmmertStreib F: The chronic fatigue syndrome: a comparative pathway analysis. J Comput Biol. 2007, 14 (7): 961972. 10.1089/cmb.2007.0041.
 107.
Choi Y, Kendziorski C: Statistical methods for gene set coexpression analysis. Bioinformatics. 2009, 25 (21): 27802786. 10.1093/bioinformatics/btp502.
 108.
Cho SB, Kim J, Kim JH: Identifying setwise differential coexpression in gene expression microarray data. BMC Bioinformatics. 2009, 10: 10910.1186/1471210510109.
 109.
Tesson BM, Breitling R, Jansen RC: DiffCoEx: a simple and sensitive method to find differentially coexpressed gene modules. BMC Bioinformatics. 2010, 11: 49710.1186/1471210511497.
 110.
Altay G, Asim M, Markowetz F, Neal DE: Differential C3NET reveals disease networks of direct physical interactions. BMC Bioinformatics. 2011, 12: 29610.1186/1471210512296.
 111.
Watkinson J, Wang X, Zheng T, Anastassiou D: Identification of gene interactions associated with disease from gene expression data using synergy networks. BMC Syst Biol. 2008, 2: 1010.1186/17520509210.
 112.
Bunke H: What is the distance between graphs?. Bull EATCS. 1983, 20: 3539.
 113.
Fuite J, Vernon S, Broderick G: Neuroendocrine and immune network remodeling in chronic fatigue syndrome: an exploratory analysis. Genomics. 2008, 92: 393399. 10.1016/j.ygeno.2008.08.008.
 114.
Wang YC, Lan CY, Hsieh WP, Murillo L, Agabian N, Chen BS: Global screening of potential Candida albicans biofilmrelated transcription factors via network comparison. BMC Bioinformatics. 2010, 11: 5310.1186/147121051153.
 115.
Gill R, Datta S, Datta S: A statistical framework for differential network analysis from microarray data. BMC Bioinformatics. 2010, 11: 9510.1186/147121051195.
 116.
Altay G, EmmertStreib F: Inferring the conservative causal core of gene regulatory networks. BMC Syst Biol. 2010, 4: 13210.1186/175205094132.
 117.
Altay G, EmmertStreib F: Structural influence of gene networks on their inference: analysis of C3NET. Biol Direct. 2011, 6: 3110.1186/17456150631.
 118.
Dempster A: Covariance selection. Biometrics. 1972, 28: 157175. 10.2307/2528966.
 119.
Koller D, Friedman N: Probabilistic Graphical Models: Principles and Techniques. 2009, Cambridge: The MIT Press
 120.
Whittaker J: Graphical Models in Applied Multivariate Statistics. 1990, Chichester: Wiley
 121.
Li H, Gui J: Gradient directed regularization for sparse Gaussian concentration graphs, with applications to inference of genetic networks. Biostatistics. 2006, 7 (2): 302317.
 122.
Schäfer J, Strimmer K: An empirical Bayes approach to inferring largescale gene association networks. Bioinformatics. 2005, 21: 754764. 10.1093/bioinformatics/bti062.
 123.
Wille A, Zimmermann P, Vranova E, Furholz A, Laule O, Bleuler S, Hennig L, Prelic A, von Rohr P, Thiele L, Zitzler E, Gruissem W, Buhlmann P: Sparse graphical Gaussian modeling of the isoprenoid gene network in Arabidopsis thaliana. Genome Biol. 2004, 5 (11): R9210.1186/gb2004511r92.
 124.
Fan J, Feng Y, Wu Y: Network exploration via the adaptive lasso and SCAD penalties. Ann Appl Stat. 2009, 3 (2): 521541. 10.1214/08AOAS215.
 125.
Friedman J, Hastie T, Tibshirani R: Sparse inverse covariance estimation with the graphical lasso. Biostatistics Oxford England. 2008, 9 (3): 432441. 10.1093/biostatistics/kxm045.
 126.
Kiiveri H, de Hoog F: Fitting very large sparse Gaussian graphical models. Comput Stat & Data Anal. 2012, 56 (9): 26262636. 10.1016/j.csda.2012.02.007.
 127.
Witten DM, Friedman JH, Simon N: New insights and faster computations for the graphical Lasso. J Comput Graphical Stat. 2011, 20 (4): 892900. 10.1198/jcgs.2011.11051a.
 128.
Zou H: The adaptive Lasso and its oracle properties. J Am Stat Assoc. 2006, 101 (476): 14181429. 10.1198/016214506000000735.
 129.
Dudoit S, van der Laan: Multiple Testing Procedures with Applications to Genomics. 2007, New York: Springer
 130.
Dudoit S, van der Laan M, Pollard K: Multiple testing. part I. singlestep procedures for control of general type I error rates. Stat App Genet Mol Biol. 2004, 3: 13
 131.
Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc, Ser B (Methodological). 1995, 57: 125133.
 132.
Storey J: A direct approach to false discovery rates. J R Stat Soc, Ser B. 2002, 64: 479498. 10.1111/14679868.00346.
 133.
Aubert J, BarHen A, Daudin J, Robin S: Determination of the differentially expressed genes in microarray experiments using local FDR. BMC Bioinformatics. 2004, 5: 12510.1186/147121055125.
 134.
Efron B: Correlation and largescale simultaneous signifance testing. J Am Stat Assoc. 2007, 102 (477): 93103. 10.1198/016214506000001211.
 135.
Pounds S, Morris SW: Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of pvalues. Bioinformatics. 2003, 19 (10): 12361242. 10.1093/bioinformatics/btg148.
 136.
Benjamini Y, Yekutieli D: The control of the false discovery rate in multiple testing under dependency. Ann Stat. 2001, 29 (4): 11651188. 10.1214/aos/1013699998.
 137.
Efron B: Size, power and false discovery rates. Ann Stat. 2007, 35 (4): 13511377. 10.1214/009053606000001460.
 138.
Storey J, Tibshirani R: Statistical significance for genomewide studies. Proc Natl Acad Sci USA. 2003, 100 (16): 94409445. 10.1073/pnas.1530509100.
 139.
Hung JH, Yang TH, Hu Z, Weng Z, DeLisi C: Gene set enrichment analysis: performance evaluation and usage guidelines. Briefings in Bioinformatics. 2012, 13 (3): 281291. 10.1093/bib/bbr049.
 140.
Nam D, Kim S: Geneset approach for expression pattern analysis. Brief Bioinform. 2008, 9 (3): 189197. 10.1093/bib/bbn001.
 141.
Weinberg RA: The Biology of Cancer. 2007, New York: Garland Science
 142.
Leek JT, Storey JD: Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 2007, 3 (9): e16110.1371/journal.pgen.0030161.
 143.
McClintick JN, Edenberg HJ: Effects of filtering by Present call on analysis of microarray experiments. BMC Bioinformatics. 2006, 7: 4910.1186/14712105749.
 144.
Bourgon R, Gentleman R, Huber W: Independent filtering increases detection power for highthroughput experiments. Proc Natl Acad Sci USA. 2010, 107 (21): 95469551. 10.1073/pnas.0914005107.
 145.
Carter GW: Inferring network interactions within a cell. Briefings in Bioinformatics. 2005, 6 (4): 380389. 10.1093/bib/6.4.380.
 146.
Perou CM, Sorlie T, Eisen MB, Van De Rijn M, Jeffrey SS, Rees CA, Pollack JR, Ross DT, Johnsen H, Akslen LA, et al: Molecular portraits of human breast tumours. Nature. 2000, 406 (6797): 747752. 10.1038/35021093.
 147.
Gerstung M, Eriksson N, Lin J, Vogelstein B, Beerenwinkel N: The temporal order of genetic and pathway alterations in Tumorigenesis. PLoS ONE. 2011, 6 (11): e2713610.1371/journal.pone.0027136.
 148.
Jones S, Zhang X, Parsons DW, Lin JC, Leary RJ, Angenendt P, Mankoo P, Carter H, Kamiyama H, Jimeno A, et al: Core signaling pathways in human pancreatic cancers revealed by global genomic analyses. Science. 2008, 321 (5897): 18011806. 10.1126/science.1164368.
 149.
Wood LD, Parsons DW, Jones S, Lin J, Sjöblom T, Leary RJ, Shen D, Boca SM, Barber T, Ptak J, et al: The genomic landscapes of human breast and colorectal cancers. Science. 2007, 318 (5853): 11081113. 10.1126/science.1145720.
 150.
Vogelstein B, Kinzler KW: Cancer genes and the pathways they control. Nature Med. 2004, 10 (8): 789799. 10.1038/nm1087.
 151.
Lazebnik Y: What are the hallmarks of cancer?. Nature Rev Cancer. 2010, 10 (4): 232233. 10.1038/nrc2827.
 152.
Mukherjee S: The Emperor of All Maladies: A Biography of Cancer. 2011, London: Fourth Estate
Acknowledgements
ST is supported by a studentship from the National Institute of Immunology. FES and RDMS are supported by the Engineering and Physical Sciences Research Council (EPSRC) and DEL.
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Author’s contributions
ST simulated and visualized the data. ST, RDMS and FES analyzed and interpreted the results. ST, RDMS and FES wrote the paper. FES conceived the paper. All authors read and approved the final manuscript.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Received
Accepted
Published
DOI
Keywords
 Gene expression data
 Cancer data
 Statistical analysis methods
 Pathway methods
 Correlation structure
 Cancer genomics