Harnessing the complexity of gene expression data from cancer: from single gene to structural pathway methods
 Frank Emmert-Streib^{1},
 Shailesh Tripathi^{1} and
 Ricardo de Matos Simoes^{1}
https://doi.org/10.1186/17456150744
© Emmert-Streib et al.; licensee BioMed Central Ltd. 2012
Received: 30 July 2012
Accepted: 1 October 2012
Published: 10 December 2012
Abstract
High-dimensional gene expression data provide a rich source of information because they capture the expression level of genes in dynamic states that reflect the biological functioning of a cell. For this reason, such data are suitable for revealing systems-related properties inside a cell, e.g., in order to elucidate molecular mechanisms of complex diseases like breast or prostate cancer. However, what can be revealed depends strongly not only on the sample size and the correlation structure of a data set, but also on the statistical hypotheses tested. Many different approaches have been developed over the years to analyze gene expression data to (I) identify changes in single genes, (II) identify changes in gene sets or pathways, and (III) identify changes in the correlation structure in pathways. In this paper, we review statistical methods for all three types of approaches, including subtypes, in the context of cancer data, provide links to software implementations and tools, and address the general problem of multiple hypothesis testing. Further, we provide recommendations for the selection of such analysis methods.
Reviewers
This article was reviewed by Arcady Mushegian, Byung-Soo Kim and Joel Bader.
Keywords
Gene expression data; Cancer data; Statistical analysis methods; Pathway methods; Correlation structure; Cancer genomics
Review
Background
The early driving forces in biology were reductionist approaches. In general, a reductionist approach tries to break down a complex system into a list of its parts and explains its properties as the sum of its individual components. Hence, the individual constituents of a system inform its higher level functions [1–4]. However, the ‘one gene, one protein, one function’ working hypothesis [5] is not sufficient to explain the many emergent properties such as the phenotypic variability of organisms or the heterogeneity of cancer [6]. For this reason, it is nowadays generally acknowledged that a functional understanding of biological systems requires the genes in a cell to be studied as a functioning collective [2, 3, 7]. In such a system, the collective functioning of groups of genes results in, for instance, signaling pathways or protein complexes that regulate cell differentiation, transcription regulation or growth.
A systems integration at the cellular level has the potential to answer many as yet unsolved questions about biological systems and their collective functioning, regulatory programs for growth, development, phenotypic variability and the causality of many complex diseases [8–10]. Due to the enormous complexity of a cellular system, where many processes and interactions at different levels inside a cell work in harmony to assure the vital functioning of a cell, we need to understand key properties of biological systems, such as their robustness or modularity [2, 8], in order to enhance our understanding of complex diseases. These complex interactions occurring within a cell can be described by networks [11–13], including gene regulatory networks [14, 15], protein-protein interaction (PPI) networks [16, 17], metabolic networks [18] and transcription regulatory networks [19, 20]. The networks are organized at different cellular levels and enable the functionality of the cell. The question now arising is how the complexity inside a cell can be understood and analyzed.
The development of information processing technologies in the post genomic era enabled the generation of huge amounts of data. In this review, we focus on gene expression data from microarray platforms and summarize three major types of analysis strategies: (I) Identification of changes in single genes, (II) identification of changes in gene sets or pathways, and (III) identification of changes in the correlation structure within pathways. We discuss these methods in the context of cancer data sets to emphasize their biological meaning, implications and expressiveness.
Large-scale gene expression data
In the next section, we briefly review high-throughput technologies that enable the generation of large-scale gene expression data [21–23].
Gene expression data from microarray
A microarray experiment measures genome-wide gene expression levels of mRNA in a cell or a tissue sample under a particular condition. A microarray chip quantifies the hybridization of fluorescently labeled target nucleotide sequences to defined complementary probe sequences that are spotted on a glass or silicon slide. Depending on the microarray platform, the spotted probes are synthetic oligonucleotides ranging from 25 to 80 nucleotides or long cDNA transcripts. Different microarray platforms were designed for a single-channel or a multi-channel experimental setting. For single-channel arrays each condition sample is hybridized separately on individual arrays using a single dye. For multi-channel arrays multiple conditions are hybridized together on individual arrays using multiple dyes. For example, Affymetrix is a single-channel platform, where multiple oligonucleotide probes (a probe set) of 25 bases are used to measure the concentration of an mRNA transcript. The target mRNAs of expressed genes are extracted from a treatment or a control sample, reverse transcribed to cDNAs, labeled with a fluorescent dye and then hybridized to a microarray. An image of the microarray captures the laser-induced fluorescent intensities emitted by the probes at each spot. The intensities give a proportional measure of the corresponding mRNA concentration for each gene that was defined on the microarray.
Gene expression data from next generation sequencing (RNA-seq)
The transcriptome of a cell comprises mRNA, tRNA, rRNA, and short regulatory RNAs. RNA-seq is a transcriptome sequencing approach that uses deep sequencing techniques such as 454 (Roche), Genome Analyzer (Illumina Solexa), SOLiD (support oligonucleotide ligation detection), Polonator G.007, HeliScope (Helicos BioSciences) and SMRT (single molecule real time sequencing) [24].
RNA-seq has a wide variety of applications such as the measurement of gene expression levels from transcribed mRNA sequences [25]. In the first step of the procedure, RNA is extracted from a given condition sample, fragmented and reverse transcribed to cDNA, which is ligated to adapters. In the second step, a library of reads is generated by sequencing the ligated fragments. In the third step, the reads are mapped to known exon sequences of genes. The expression level of a gene is measured from the normalized number of reads that map to the known set of exon sequences of that gene. The RNA-seq transcriptome sequencing approach overcomes several limitations of microarrays for measuring gene expression. For example, RNA-seq measures large ranges of expression levels, from very lowly to highly expressed genes, and is able to consider unknown transcribed sequences. Because the methodology is still new, gold standard procedures for the management and processing of the data are currently being established.
Gene expression data and cancer

Six hallmarks of cancer were described initially:

self-sufficiency in growth signals

insensitivity to anti-growth signals

evading apoptosis

limitless replicative potential

sustained angiogenesis

tissue invasion and metastasis
Later, this list was extended by two further hallmarks [26]

deregulating cellular energetics

avoiding immune destruction
and also two enabling characteristics

genome instability and mutation

tumor-promoting inflammation
It has been recognized that these hallmarks are gradually acquired by different types of cancers, potentially, in a variable order. This variability in the acquiring of these diseasebearing processes is one of the indicators of the complexity of cancer.
The biological processes in a cell are controlled and regulated by signaling pathways that are activated by internal and external signaling receptors and factors. The signaling pathways governing growth and cell proliferation are likely dysregulated in cancer. For example, cells become insensitive to anti-growth signals, or their growth signaling pathways are dysregulated so that the cells gain autonomy in their growth. It is assumed that interaction changes at various levels (genetic, mRNA or protein), rather than the up-regulation or down-regulation of a single gene, lead to the unlimited growth of cells. Further, sometimes even a moderate change in the expression of a group of genes can lead to a significant change in the biological function of an organism [27].
Currently, the underlying processes that contribute to cancer are being intensively investigated. However, so far, the molecular causes that initiate and maintain cancer are not well understood. For this reason, gene expression profiles, which provide signatures of all the active genes and their interconnections in a cell, contain valuable information about the functioning of key pathways, as expressed by the hallmarks of cancer, and hence enable a practical investigation of the functional mechanisms thereof [28–36]. Despite the different focus of many studies of different cancer types, common themes in the form of ‘key pathways’ can be found throughout. Examples are the NFκB pathway, involved in the cellular responses to external stimuli like cytokines or free radicals and in the immune response to infection [29, 37–39]; the MAPK signaling pathway, responsible for regulating growth factor signaling including the RAF, MEK, and MAPK cascade [34, 39, 40]; the p53 signaling pathways, involved in DNA damage control, apoptosis and inhibition of angiogenesis [37, 41, 42]; and the Wnt signaling pathway, involved in cell differentiation and cell polarity [31, 34, 36, 38].
Formulating biological hypotheses
A main goal of high-throughput gene expression analysis is to identify differentially expressed genes or gene sets between two or more conditions to enable a functional interpretation of the underlying condition-specific mechanisms. The biological processes at the gene level are complex in nature as they dynamically interact with each other. A single gene can participate in different biological processes and regulate different genes at different time points. The identification of key genes or pathways is a difficult task, because their interactions are unknown. We only observe the phenotypic outcome of test conditions and the corresponding gene expression patterns measured from a tissue or cell culture. Univariate and multivariate statistical methods can be applied in order to understand such differences from a statistical perspective. The first type of approach that has been used to identify changes in gene expression is a differential gene expression analysis. This approach is commonly used to compare different conditions of microarray samples to identify differences between them. As a result, a single-gene analysis approach gives a list of genes that show a statistically significant difference between two conditions. For cancer, such genes may correspond to oncogenes or tumor suppressor genes.
If we consider the underlying network where different biological functions are being described by groups of interacting genes, a single gene analysis does not resolve the biological functions that are affected primarily in disease conditions and are causal factors of the disease. In order to get a systematic understanding of the disease or phenotypes we have to first understand what biological functions contribute to these changes, and perform a comparison between conditions using groups of genes defined by biological pathways. This approach leads to comparing gene expression data at the pathway level where sets of genes are tested for differential expression.
Another interesting property that can be extracted from gene expression data is the correlation structure of gene expression profiles between all genes. This correlation structure shows associations between genes which directly or indirectly interact with each other [8, 43–45]. Comparative analyses of gene pathways that consider the correlation structure of expression data can provide a suitable test for the hypothesis of changes in the underlying network.
Before we proceed, we would like to point out that all of these methods test statistical hypotheses [46]. That implies that in order to understand a particular method biologically, i.e., to be capable of providing a biological interpretation, one needs to understand the underlying null hypothesis. In our opinion, it is helpful to approximately categorize all statistical hypotheses into three categories with respect to their biological interpretability, where each category represents a different degree of difficulty in finding a biological interpretation for a hypothesis. In the following, we provide a brief discussion of these three categories because it enables a better, and potentially more plausible, understanding of the methods presented in the next sections.
Category one comprises all hypotheses for which it is relatively easy to find a meaningful biological interpretation. Examples from this category are tests that compare mean values (μ), e.g., to identify the differential expression of genes (section ‘Differential expression of a gene’). That means these tests use the mean as a test statistic. Because the underlying (probability) distribution of the genes represents, biologically, the activity of the gene expressions, the interpretation of a null hypothesis is directly derived from it. For this reason the biological interpretation of the rejection of the null hypothesis given in Eqn. 2 is intuitively clear and appealing: it implies a change in the (mean) expression of genes, which may indicate a change in a biological function because the number of available proteins may be altered.
Category two comprises tests for which there are several alternative biological interpretations. This makes the interpretation of such tests ambiguous from a biological perspective. As an example of such a test, we consider the detection of the differential variance of a gene (section ‘Differential variance of a gene’). Despite the fact that the underlying probability distribution of the expression of genes has a clear biological interpretation, the biological interpretation for the rejection of the null hypothesis in Eqn. 4 is not unique. For instance, a gene could have a different variance in two conditions because, e.g., in condition one it is periodically expressed, whereas in condition two it is constantly expressed on an intermediate level. The former condition may be related to the cell cycle or the circadian rhythm, or periodically triggered by an external signaling factor that is released by the administration of a medication that is regularly taken. A second, equally plausible interpretation could be that in one condition the cell utilizes parallel pathways to transfer a signal whereas in the other condition only one signaling chain is used. The utilization of parallel pathways could be triggered by stress factors, e.g., in the presence of an infection, so that the cell is ‘running’ at full power in order to execute all necessary programs that have been initiated by the presence of the intruder.
An example of a moment is the mean (which is the first moment); other examples of entities that can be expressed as a function of moments are the variance and the kurtosis. This means that whenever the null hypothesis in Eqn. 39 is rejected it could be because of a difference in any moment, of which there are, theoretically, infinitely many. Put differently, this kind of unspecificity makes this test very powerful in the sense that it may detect any possible difference two distributions can exhibit. On the other hand, if the null hypothesis is rejected it is very difficult to identify a precise reason for its rejection. For instance, this could be related to a difference in the mean, variance, kurtosis or any higher moment or function thereof. These combinatorial factors usually do not allow a concise biological interpretation. Nevertheless, such a test can be valuable, e.g., for diagnostic purposes.
Single-gene analysis
Single-gene based methods can be subdivided into three major classes: (A) methods for detecting differential gene expression, (B) methods for detecting differential correlation, and (C) methods for detecting a differential variance.
Differential expression of a gene
The first published studies for gene expression analysis selected differentially expressed genes based on a fold-change criterion between a treatment and a control condition [48]. For example, an early application of this measure compared normal colon epithelium and primary colon cancers [49]. Since then, many statistical approaches have been developed to provide more robust measures. Among the most popular methods are, e.g., SAM [50, 51], limma [52], and the empirical Bayes approach from Efron et al. [53].
Differential variance of a gene
A gene is called differentially variable when the null hypothesis H_{0} is rejected. The DV analysis in [54] tests H_{0} by using an F-test. In Figure 3B we show an example of a gene with a constant mean but a changed variance in the two conditions. The samples for the two conditions are drawn from normal distributions with the same mean but different standard deviations, i.e., N(μ_{1} = 0, σ_{1} = 1) and N(μ_{2} = 0, σ_{2} = 2).
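As a minimal sketch of the idea, assuming that H_{0} states the equality of the two variances (the equation itself is not reproduced in this text), the following Python code simulates the scenario of Figure 3B and computes the variance ratio that underlies an F-test. The function names are illustrative, and the p-value is obtained here from a permutation of the pooled samples rather than from the F-distribution used in [54]; this shortcut is only reasonable because both conditions share the same mean.

```python
import random
import statistics

def variance_ratio(x, y):
    """Ratio of the two sample variances, larger over smaller (two-sided F statistic)."""
    vx, vy = statistics.variance(x), statistics.variance(y)
    return max(vx, vy) / min(vx, vy)

def permutation_pvalue(x, y, n_perm=1000, seed=0):
    """Permutation p-value for the variance ratio (assumes equal means in both groups)."""
    rng = random.Random(seed)
    observed = variance_ratio(x, y)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if variance_ratio(pooled[:len(x)], pooled[len(x):]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Simulated gene: same mean, standard deviation 1 in condition 1 and 2 in condition 2.
rng = random.Random(1)
cond1 = [rng.gauss(0.0, 1.0) for _ in range(50)]
cond2 = [rng.gauss(0.0, 2.0) for _ in range(50)]
f_stat = variance_ratio(cond1, cond2)
p_val = permutation_pvalue(cond1, cond2)
print(f"variance ratio = {f_stat:.2f}, permutation p-value = {p_val:.4f}")
```

A mean-based test applied to the same data would typically not flag this gene, which is the point of the example.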
Differential correlation of a gene
A gene i is called differentially correlated when H_{0} is rejected.
Genepair analysis
The functional activities of genes, as measured by gene expression values, reflect the interplay of the genes and their products in the underlying gene network. The objective of a gene-pair analysis is to identify either differentially co-correlated or differentially coexpressed pairs of genes, instead of individual genes. The reason for looking for pairs of genes is that concerted changes in genes are due to their common membership in biological pathways.
The principal idea of detecting correlation changes in gene-pairs is visualized in Figure 3D. The data are sampled from a multivariate normal distribution with a constant mean vector for both conditions, μ_{1} = μ_{2} = (0,1), but a different correlation of ρ_{1} = 0.8 and ρ_{2} = −0.2. The point is that, despite no difference in the mean expression of the gene-pair, there is a difference in their correlation.
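The scenario of Figure 3D can be sketched in a few lines of Python. The code below samples a gene-pair with the stated means and correlations and compares the two sample correlations with Fisher's z-transform. This test is chosen here only for illustration and is not one of the gene-pair methods reviewed in this section; all function names are hypothetical, and unit variances are assumed because the covariance matrix is not given in the text.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def sample_pair(rho, n, rng, mean=(0.0, 1.0)):
    """Draw n samples from a bivariate normal with unit variances and correlation rho."""
    xs, ys = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        xs.append(mean[0] + z1)
        ys.append(mean[1] + rho * z1 + math.sqrt(1 - rho ** 2) * z2)
    return xs, ys

def fisher_z_test(r1, n1, r2, n2):
    """z-score and two-sided p-value for H0: rho1 == rho2 (requires |r| < 1)."""
    z = (math.atanh(r1) - math.atanh(r2)) / math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return z, math.erfc(abs(z) / math.sqrt(2))

rng = random.Random(0)
x1, y1 = sample_pair(0.8, 100, rng)    # condition 1
x2, y2 = sample_pair(-0.2, 100, rng)   # condition 2
r1, r2 = pearson(x1, y1), pearson(x2, y2)
z, p = fisher_z_test(r1, len(x1), r2, len(x2))
print(f"r1 = {r1:.2f}, r2 = {r2:.2f}, z = {z:.2f}, p = {p:.2e}")
```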
In [57] the ‘expected conditional F-statistic’ (ECF-statistic) has been introduced to measure the differential coexpression of gene pairs (X, Y). The method is based on a modified F-statistic, where the variance and the mean parameter of the statistic are estimated from a mixture of two normal distributions.
The R package R/EBcoexpress provides an empirical Bayesian implementation to identify the differential coexpression of gene pairs [58].
Here m corresponds to the number of samples. The LA method uses a permutation test to identify significant LA values of gene pairs. The method carries a high computational burden: an exhaustive search would require N^{3} evaluations of Eqn 11 (N is the number of genes) plus additional permutations of the data, which is intractable even for only N = 10^{3} genes because it already requires more than 10^{9} evaluations. For this reason the method is only used to (A) find the gene Z for a given pair of genes or (B) find the LAP, X and Y, for a given gene Z.
Gene set and pathway analysis methods
Generally, a pathway is a group of interacting genes (a gene set) that deploy a cellular function. In a biological system the biological processes are coordinated functions of sets of genes which make the organism work. Some general pathways are, e.g., metabolic pathways, signaling pathways or regulation pathways that represent minimal functioning units of a cellular system. The consideration of pathways or gene sets for a comparative gene expression analysis is an important step toward the exploration of relevant functional mechanisms of a cell.
So far, many multivariate and univariate tests have been proposed for a gene set analysis, see Figure 2. Finding differentially expressed pathways, instead of individual genes, is not straightforward from a statistical and biological perspective and there are several hurdles to this approach. The first is presented by the data themselves, because the number of variables is usually (much) larger than the number of samples, i.e., n << p, which leads to many estimation problems. The second hurdle is our incomplete information about the constitution of biological pathways and the potentially high overlap of genes between different pathways. For example, databases like GO [60] provide valuable information about genes for a large variety of different organisms. However, this information is not static but continuously expanding, leaving us at the moment with a snapshot of knowledge. This makes it difficult to find precise definitions for particular pathways of interest. The third problem comes from the underlying gene network structure that describes the true interactions between genes in a pathway. Here, the problem is that as a result of such interactions among genes it is usually not appropriate to assume their independence, as is frequently done for statistical ease.
A motivating example for the general idea underlying gene set methods is shown in Figure 3C. For condition 1 (green), the samples are drawn from a multivariate normal distribution with μ_{1} = {2,2.2,2.4,2.6,5,8} and for condition 2 (red) with μ_{2} = {2,7,7.5,2.6,5,8}. The covariance matrix is the same for both conditions, Σ_{1} = Σ_{2}. In this figure, only 2 of the 6 genes are differentially expressed. This reflects biological situations because, usually, only a fraction of the genes belonging to a pathway is found to be differentially expressed. However, because gene set methods are based on the expression of a set of genes, such methods borrow strength from the combined analysis of the genes.
Reviews that focus entirely on gene set and pathway analysis methods can be found in [61–65].
Null hypothesis for gene set analysis
Gene set analysis methods can be broadly divided into two major categories, depending on what null hypothesis is tested. The first type of methods are called competitive methods, and the second type self-contained methods [66]. Briefly, self-contained tests use only the data from a target gene set under investigation, whereas competitive tests use, in addition, also data outside the target gene set (background data). In the following we describe popular competitive and self-contained pathway methods.
Competitive gene set and pathway methods
Overview of different competitive gene set methods
Method | Reference | Test type | Software
Over-representation analysis (hypergeometric test) | [67] | parametric | GOstats
GSEA | [68] | non-parametric | GSEABase
 | [27] | non-parametric | 
GSA | [69] | non-parametric | 
GSEArot | [70] | non-parametric | limma
GAGE | [71] | parametric | GAGE
PAGE | [72] | parametric | PGSEA, GAGE
Random Set | [73] | parametric | part of CLEAN
Generalized Random Sets | [74] | parametric | 
Gene set enrichment analysis made simple | [75] | parametric | 
GSEA
 (1) Estimation of gene-level test statistics.
 (2) Rank ordering of the test statistics.
 (3) Calculation of an enrichment score (ES) for a pathway based on the gene-level test statistics.
 (4) Permutation of the gene labels to estimate the significance of the enrichment score for the pathway.
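The four steps above can be sketched as follows. This is a minimal illustration and not the reference implementation of [68]: it assumes that the gene-level statistics of step (1) are already available, uses a weighted running sum as the enrichment score, and realizes step (4) by drawing random gene sets of the same size. All function and variable names are hypothetical, and the sketch assumes that the gene set is a proper subset of the genes with at least one non-zero score.

```python
import random

def enrichment_score(scores, gene_set, weight=1.0):
    """Running-sum enrichment score; scores maps gene name -> gene-level statistic."""
    ranked = sorted(scores, key=scores.get, reverse=True)   # step (2): rank ordering
    hits = [g for g in ranked if g in gene_set]
    n_miss = len(ranked) - len(hits)
    norm = sum(abs(scores[g]) ** weight for g in hits)
    running, best = 0.0, 0.0
    for g in ranked:                                         # step (3): walk down the list
        if g in gene_set:
            running += abs(scores[g]) ** weight / norm
        else:
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running                                   # keep the maximal deviation
    return best

def gene_label_pvalue(scores, gene_set, n_perm=1000, seed=0):
    """Step (4): compare the observed ES with the ES of random gene sets of equal size."""
    rng = random.Random(seed)
    genes = list(scores)
    size = sum(1 for g in genes if g in gene_set)
    observed = enrichment_score(scores, gene_set)
    hits = 0
    for _ in range(n_perm):
        random_set = set(rng.sample(genes, size))
        if abs(enrichment_score(scores, random_set)) >= abs(observed):
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Toy example: 20 genes with decreasing scores; the pathway consists of the top three.
toy_scores = {f"g{i}": 10.0 - 0.5 * i for i in range(20)}
toy_pathway = {"g0", "g1", "g2"}
es, p_value = gene_label_pvalue(toy_scores, toy_pathway)
print(f"ES = {es:.3f}, p = {p_value:.4f}")
```

Because all three pathway genes sit at the top of the ranked list, the running sum reaches its maximum after the third gene.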
GSEArot
GSEArot (gene set enrichment analysis rotation) [70] is very similar to GSEA, but uses a different approach to randomize data in order to assess the significance of a target pathway. More specifically, a data matrix X is randomized by, first, rotating X around a random angle δ, resulting in a matrix X(δ). Second, from the matrix X(δ), the randomization matrix is obtained by a QR decomposition [76]. In [70] it is argued that this procedure has an advantage for small sample sizes, when only very few permutations are achievable from sample-label permutations. The null hypothesis tested by GSEArot is the same as for GSEA.
Random set
 1. Estimate the enrichment score of a target gene set W, $\bar{s} = \frac{1}{m} \sum_{i \in W} s_i.$ (12)
Here s_{i} are gene-level scores, e.g., t-scores, and m = |W| is the number of genes in the target pathway.
 2. Estimate the enrichment score and its variance of the background gene set V = W ∪ W^{c}, $\mu = \frac{1}{p} \sum_{i \in V} s_i,$ (13) $\sigma^2 = \frac{1}{m} \left( \frac{p - m}{p - 1} \right) \left( \frac{\sum_{i \in V} s_i^2}{p} - \left( \frac{\sum_{i \in V} s_i}{p} \right)^2 \right),$ (14)
with p = |W ∪ W^{c}|.
 3. Estimate the standardized enrichment score $Z = \frac{\bar{s} - \mu}{\sigma}.$ (15)
It is notable that Z can be calculated without a numerical randomization of the data. Further, the background data consist of all genes V, including the ones in the target pathway W. In [77] this method has been applied to head and neck and cervical cancer for human papillomavirus-positive and -negative samples.
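Since Eqns. 12 to 15 are closed-form expressions, the standardized enrichment score can be computed directly. The following sketch implements the three steps for gene-level scores stored in a dictionary; the function name and the toy scores are illustrative assumptions.

```python
import math

def random_set_z(scores, target):
    """Standardized enrichment score Z of Eqns. 12-15.

    scores: dict mapping every gene in V to its gene-level score s_i
    target: collection of gene names forming the target gene set W
    """
    p = len(scores)                                   # number of genes in V
    m = len(target)                                   # number of genes in W
    s_bar = sum(scores[g] for g in target) / m        # Eqn. 12
    mu = sum(scores.values()) / p                     # Eqn. 13
    second_moment = sum(s * s for s in scores.values()) / p
    sigma2 = (1.0 / m) * ((p - m) / (p - 1)) * (second_moment - mu ** 2)  # Eqn. 14
    return (s_bar - mu) / math.sqrt(sigma2)           # Eqn. 15

toy = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0}
z_score = random_set_z(toy, ["c", "d"])
print(f"Z = {z_score:.4f}")
```

The factor (p − m)/(p − 1) is a finite-population correction: it shrinks the variance to zero when the target set covers the whole background, in which case Z is undefined.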
GAGE
 1. Estimate the mean fold change f and its standard deviation σ_{f} for the m genes in the target pathway W.
 2. Estimate the mean fold change f′ and its standard deviation $\sigma_{f'}$ for all p genes in the background gene set V = W ∪ W^{c}.
 3. Estimate the t-score: $t = \frac{f - f'}{\sqrt{\sigma_f^2 / m + \sigma_{f'}^2 / m}}$ (18)
degrees of freedom.
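The t-score of Eqn. 18 can be sketched as follows. The code follows the three steps literally, including the division of both variances by the pathway size m; the use of sample (n − 1) variances is an assumption, and the degrees of freedom are not computed because the corresponding expression is not reproduced in this text. The function name and toy data are illustrative.

```python
import math
import statistics

def gage_t_score(fold_changes, target):
    """t-score of Eqn. 18 for a target pathway W against the background V.

    fold_changes: dict mapping every gene in V to its fold change
    target: collection of gene names forming the target pathway W
    """
    m = len(target)
    in_set = [fold_changes[g] for g in target]
    background = list(fold_changes.values())
    f, f_bg = statistics.mean(in_set), statistics.mean(background)
    var_f = statistics.variance(in_set)        # sample variance (assumption)
    var_bg = statistics.variance(background)
    return (f - f_bg) / math.sqrt(var_f / m + var_bg / m)

toy_fc = {"a": 0.0, "b": 0.0, "c": 2.0, "d": 2.0}
t_score = gage_t_score(toy_fc, ["c", "d"])
print(f"t = {t_score:.4f}")
```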
GSA
The null distribution is assessed by a restandardization, combining a sample-label and a gene-label permutation.
Self-contained gene set and pathway methods
Overview of self-contained gene set and pathway methods
Method | Reference | Software
Average of single-gene statistics | [79] | sigPathway
Linear Model Toolset for GSEA | [80] | GSEAlm
SAM-GS | [81] | 
Globaltest | [82] | globaltest
GlobalANCOVA | [83] | GlobalAncova
Hotelling’s T^{2} |  | PCOT2
N-statistic | [88] | cramer
RCMAT | [89] | 
Nonlinear tests for identifying differentially expressed genes or genetic networks | [87] | 
Pathway-Express | [90] | 
Signaling Pathway Impact Analysis | [91] | SPIA (Bioconductor)
SEPEA | [92] | 
PARADIGM | [93] | 
Gene set analysis exploiting the topology of a pathway | [94] | IPS (available upon request)
Sum of t-square
The significance of TS is assessed from sample-label permuted data.
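A sample-label permutation test of this kind can be sketched as follows. The definition of TS is not reproduced in this text; the sketch assumes, as the name of the method suggests, that TS is the sum of the squared gene-wise two-sample t-statistics over the genes of the pathway. The Welch form of the t-statistic, the function names and the simulated data are illustrative assumptions.

```python
import math
import random
import statistics

def t_stat(x, y):
    """Welch two-sample t-statistic for the expression values of one gene."""
    se = math.sqrt(statistics.variance(x) / len(x) + statistics.variance(y) / len(y))
    return (statistics.mean(x) - statistics.mean(y)) / se

def sum_t_square(data, labels):
    """Assumed TS: sum of squared t-statistics; data holds one list of values per gene."""
    total = 0.0
    for gene in data:
        x = [v for v, lab in zip(gene, labels) if lab == 0]
        y = [v for v, lab in zip(gene, labels) if lab == 1]
        total += t_stat(x, y) ** 2
    return total

def sample_label_pvalue(data, labels, n_perm=500, seed=0):
    """Null distribution of TS from permuted sample labels (self-contained test)."""
    rng = random.Random(seed)
    observed = sum_t_square(data, labels)
    permuted = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(permuted)
        if sum_t_square(data, permuted) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Toy pathway: 5 genes, 10 samples per condition, 2 genes shifted in condition 1.
rng = random.Random(2)
labels = [0] * 10 + [1] * 10
data = []
for g in range(5):
    shift = 1.5 if g < 2 else 0.0
    data.append([rng.gauss(0.0, 1.0) + (shift if lab == 1 else 0.0) for lab in labels])
ts, ts_p = sample_label_pvalue(data, labels)
print(f"TS = {ts:.2f}, p = {ts_p:.4f}")
```

Permuting the sample labels keeps the correlation among the genes of the pathway intact, which is why this scheme is preferred over gene-label permutations for self-contained tests.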
SAM-GS
Statistical significance of SAM-GS is again assessed from sample-label permuted data.
Hotelling’s T^{2}
The inverse of the covariance matrix is estimated via the shrinkage estimator [96–99]. The statistical significance of the test statistic T^{2} is estimated from sample-label permuted data.
N-statistic
where F_{C}(x) and F_{T}(x) are two multivariate distribution functions from the control and the treatment condition.
Here $K\left(x_i^T, x_j^C\right) = \left\| x_i^T - x_j^C \right\|_2$ is the Euclidean kernel, which serves as a distance function between the expression values in the two conditions.
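To illustrate how the Euclidean kernel enters such a test, the following sketch computes one common form of a kernel-based two-sample distance: twice the mean between-condition distance minus the two mean within-condition distances. The exact definition and normalization of the N-statistic in [88] are not reproduced in this text, so this form and all names in the code are assumptions.

```python
import math

def euclid(a, b):
    """Euclidean kernel K(a, b) = ||a - b||_2 for two expression vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def mean_kernel(xs, ys):
    """Average kernel value over all pairs (x_i, y_j)."""
    return sum(euclid(a, b) for a in xs for b in ys) / (len(xs) * len(ys))

def kernel_distance(treatment, control):
    """Assumed form: sqrt(2 E K(T, C) - E K(T, T') - E K(C, C'))."""
    value = (2.0 * mean_kernel(treatment, control)
             - mean_kernel(treatment, treatment)
             - mean_kernel(control, control))
    return math.sqrt(max(value, 0.0))   # guard against tiny negative rounding errors

# Each sample is the expression vector of the gene set in one individual.
control = [(0.0, 0.0), (1.0, 0.0)]
treatment = [(3.0, 4.0), (4.0, 4.0)]
d_same = kernel_distance(control, control)
d_diff = kernel_distance(treatment, control)
print(f"identical samples: {d_same:.3f}, shifted samples: {d_diff:.3f}")
```

As for the other self-contained methods, significance would be assessed by recomputing the distance on sample-label permuted data.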
Linear modelbased pathway methods
There are also several approaches that utilize either a linear or a generalized linear modeling framework for a gene set analysis. Examples for such methods are Global test [82], Extension of GSEA [80] or GlobalAncova [83].
Topological pathway methods based on existing network information
Some recent univariate methods, for instance, Pathway-Express [90], SPIA [91] or SEPEA [92], do not use correlation measures to estimate interactions among genes but instead use predefined topological information as provided, e.g., by the KEGG database [100]. These methods assign each gene in a pathway a score that is based on the position of the gene in the given network structure and, finally, aggregate these individual gene scores to obtain a score for the pathway itself. Yet another approach is provided by PARADIGM [93]. This method uses a factor graph model combining gene copy number variation data with gene expression data for the identification of differentially expressed pathways.
Iterative Proportional Scaling: IPS
In this method, the covariance matrices, $\Sigma_{c_1}, \Sigma_{c_2}$, are estimated from the data, for both conditions, using the Iterative Proportional Scaling (IPS) algorithm. The inverses of the estimated covariance matrices are positive definite (concentration) matrices for which it is assumed that the positions of the nonzero elements in $\Sigma_{c_1}^{-1}$ and $\Sigma_{c_2}^{-1}$ are identical; this is the meaning of $\Sigma_{c_1}^{-1}, \Sigma_{c_2}^{-1} \in S^{+}(G)$, where S^{+}(G) indicates the class of all symmetric positive definite matrices with nonzero elements given by the binary matrix G. This means that the concentration matrices, $\Sigma_{c_1}^{-1}$ and $\Sigma_{c_2}^{-1}$, have identical zero elements, but are allowed to have different values for their nonzero entries. In other words, it is assumed that the underlying topology of a pathway is the same for conditions c_{1} and c_{2}, given by G, where G_{ij} = 0 corresponds to an ‘absent’ interaction between the genes i and j. Since the structure of G is not estimated from the data, it is necessary to obtain it from an independent source, e.g., from the KEGG database or Reactome [100, 102]. In [94] it is shown that a log likelihood ratio (log(Λ)) test can be used to test for the equality of the concentration matrices for the two conditions and that asymptotically the log likelihood ratio follows a chi-square distribution with r + m degrees of freedom, i.e., $\log(\Lambda) \sim \chi_{r+m}^{2}$, where m is the number of genes in the pathway and r is the number of nonvanishing edges in G corresponding to the fixed interaction structure of the pathway.
The IPS method has been used in [94] to study acute lymphocytic leukemia with and without BCR/ABL gene rearrangement. As a result, the JUN oncogene with RAS/MAPK/JNK followed by NFAT and NFKB seem to be crucial in distinguishing BCR/ABL positive and negative patients.
Differential correlation/interaction methods
In the previous sections, we discussed different gene set and pathway-based methods for the identification of differentially expressed pathways. These methods focused either only on the expression of genes, or considered an underlying interaction topology among the genes as taken from an independent source. However, even when these methods considered an underlying network structure, this structure was assumed to be the same for the ‘treatment’ and ‘control’ group.
In contrast, in this section we discuss methods that estimate the correlation/interaction structure of the genes within pathways, for each experimental condition. The underlying rationale for these approaches is to assume that the expression profiles of genes are dependent on each other [103, 104] as the genes in a pathway interact, either directly or indirectly [105]. This assumption results from the observation that genes with similar functions or cellular localization are often coexpressed and cluster together. The methods discussed in this section bear a similarity to the statistical methods for the estimation of differentially correlated gene-pairs (see section ‘Gene-pair analysis’). However, the extension of such gene-pair measures to the pathway level allows the identification of pathways that show, e.g., a condition-specific correlation change.
In Figure 3E we show a simulated example scenario for condition-specific correlation changes of the expression profiles for a gene set. In Figure 3E the correlation between all gene-pairs of a gene set is aggregated by a summary statistic. In this example, the mean values of the genes are of a comparable order, whereas the correlation of the gene set in the treatment condition is reduced.
Overview of methods for the identification of differential correlation/interaction changes in pathways
| Method | Reference | Software |
| --- | --- | --- |
| Graph edit distance | [106] | |
| Gene set coexpression analysis (GSCA) | [107] | |
| Differential coexpression (dCoxS) between gene sets | [108] | |
| DiffCoEx | [109] | R code is provided in the paper |
| Differential disease network using C3NET | [110] | c3net (http://cran.r-project.org/web/packages/c3net/index.html) |
| Disease associated interactions using Synergy network | [111] | MATLAB code is provided in the paper |
Graph Edit Distance: GED
Among the first approaches that estimate the interaction structure for a pathway analysis to identify differentially correlated pathways (DCP) is a method introduced in [106]. This method uses the graph edit distance (GED) score as a test statistic.
In order to assess statistical significance, sample label permutations are performed to obtain the null distribution.
Extensions of this method can be found in [113], where mutual information values have been used to capture nonlinear relations among gene expression values. Further, in [114] a method based on a relevance value (RV) has been defined for integrating different types of genomics data sets, which also resembles the GED.
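To make the permutation procedure concrete, the following Python sketch compares the inferred networks of two conditions and assesses the difference by permuting sample labels. It is an illustration under stated assumptions, not the implementation of [106]: the edges are obtained by thresholding absolute Pearson correlations (a stand-in for the per-edge hypothesis tests), the GED score is taken to be the number of edges present in exactly one of the two networks, and all function names, the threshold and the simulated data are hypothetical.

```python
import numpy as np

def adjacency(x, threshold=0.5):
    """Binary adjacency matrix for data x of shape (samples, genes).

    Assumption: an edge is drawn when |Pearson correlation| > threshold,
    replacing the per-edge hypothesis tests of the original method.
    """
    corr = np.corrcoef(x, rowvar=False)
    adj = (np.abs(corr) > threshold).astype(int)
    np.fill_diagonal(adj, 0)
    return adj

def ged(x1, x2, threshold=0.5):
    """Assumed GED score: number of edges found in exactly one network."""
    diff = np.abs(adjacency(x1, threshold) - adjacency(x2, threshold))
    return int(np.triu(diff, k=1).sum())

def ged_permutation_test(x1, x2, n_perm=200, threshold=0.5, seed=0):
    """Null distribution of the GED score from sample label permutations."""
    rng = np.random.default_rng(seed)
    observed = ged(x1, x2, threshold)
    pooled = np.vstack([x1, x2])
    n1 = x1.shape[0]
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(pooled.shape[0])
        if ged(pooled[idx[:n1]], pooled[idx[n1:]], threshold) >= observed:
            exceed += 1
    # Add-one correction so that the estimated p-value is never zero.
    return observed, (exceed + 1) / (n_perm + 1)

# Hypothetical pathway with 5 genes: correlated in condition 1,
# independent in condition 2, identical mean expression in both.
rng = np.random.default_rng(42)
p = 5
cov1 = np.full((p, p), 0.8) + 0.2 * np.eye(p)
x1 = rng.multivariate_normal(np.zeros(p), cov1, size=40)
x2 = rng.multivariate_normal(np.zeros(p), np.eye(p), size=40)
observed, p_value = ged_permutation_test(x1, x2)
print(observed, p_value)
```

In this toy setting the two conditions differ only in their correlation structure, so a test on mean expression would not be expected to separate them, whereas the GED-based test should.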
Gene set coexpression analysis: GSCA
From the definition of the dispersion index it follows that this method also aims at detecting differential correlation among pathways, despite its name emphasizing coexpression. Interestingly, the dispersion index corresponds to the GED score if its components in Eqn. 47 are relabeled and one defines the components of the adjacency matrices $E^{c_1}, E^{c_2}$ as the correlation coefficients rather than the outcome of the hypothesis tests [106].
A visualization of the underlying idea of GSCA is shown in Figure 3E. The gene expression values are sampled from multivariate normal distributions N(μ_{1},Σ_{1}) and N(μ_{2},Σ_{2}) with μ_{1} = μ_{2}, and the average covariance between gene pairs is Σ_{1} = 0.8 and Σ_{2} = −0.2. Despite the fact that there is neither a difference in the individual expression of genes nor in the expression of a set of genes, conditions 1 and 2 can be distinguished by using a measure based on a correlation change.
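The simulated scenario described above can be reproduced approximately with a few lines of Python. This is a sketch under assumptions: the gene set has only 4 genes so that a constant covariance of −0.2 still yields a valid (positive definite) covariance matrix, the variances are set to 1, and the dispersion index is taken to be the root mean squared difference of the pairwise correlations between the two conditions. Since the defining equation is not reproduced in this section, this form and the helper names are assumptions.

```python
import numpy as np

def constant_cov(p, rho):
    """Covariance matrix with unit variances and constant covariance rho."""
    return np.full((p, p), rho) + (1.0 - rho) * np.eye(p)

def pairwise_correlations(x):
    """Vector of Pearson correlations for all gene pairs; x is (samples, genes)."""
    corr = np.corrcoef(x, rowvar=False)
    return corr[np.triu_indices_from(corr, k=1)]

def dispersion_index(x1, x2):
    """Assumed form: root mean squared difference of pairwise correlations."""
    d = pairwise_correlations(x1) - pairwise_correlations(x2)
    return float(np.sqrt(np.mean(d ** 2)))

rng = np.random.default_rng(1)
p, n = 4, 200
mu = np.zeros(p)  # identical mean expression in both conditions
x1 = rng.multivariate_normal(mu, constant_cov(p, 0.8), size=n)
x2 = rng.multivariate_normal(mu, constant_cov(p, -0.2), size=n)

mean_gap = float(np.max(np.abs(x1.mean(axis=0) - x2.mean(axis=0))))
d_index = dispersion_index(x1, x2)
print(mean_gap, d_index)
```

The largest difference in mean expression stays close to zero, while the dispersion index is close to the difference of the two covariance levels, which is the signal a correlation-based method can exploit.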
Partial least squares based scores: PLS
A statistical framework based on a partial least squares score is proposed in [115]. Similar to the above methods, two matrices for the two conditions are inferred. These matrices can be seen as weighted networks, where an edge weight corresponds to the strength of the association between two genes. In this paper, three different types of tests are introduced that allow (A) testing for changes in the module structure of the two networks, (B) testing for changes in the connectivity of a particular gene set, and (C) testing for changes in the connectivity of a particular gene.
Differentially coexpressed gene sets: dCoxS
In [108] dCoxS has been applied to gene expression data from lung cancer. The analysis identified the Thrombin signaling and protease-activated receptors pathway, which is known to be involved in the angiogenesis of lung cancer, as the most frequently changed pathway. Another interesting result is that all significant pathway pairs had a lower interaction score in lung cancer than in the normal control group. This might indicate that the variability, in the form of exploited parallel pathways, is lower in cancer than in normal cells.
Gene regulatory networks
Finally, we would like to mention that gene regulatory network inference methods have also been used in this context. More precisely, several attempts have been made to identify disease networks [110, 111] that correspond to particular pathways. For instance, in [110] the C3NET inference method [116, 117] has been used to infer pathway-specific networks for prostate cancer. A structural comparison between the pathway-specific networks, similar to [106], which is based on testing the hypothesis in Eqn. 45, allowed the identification of growth and cell cycle related pathways.
connecting the partial correlation coefficient of fullorder (LHS) with the elements of Ω, ω_{ ij } ∈ Ω. The partial correlation is of fullorder (with respect to the number of genes) because V∖{ij} is the set of all genes excluding i and j, i.e., the largest possible set of genes not considering i and j.
Several methodological improvements have been suggested to infer gene regulatory networks based on GGM [121–123]. These methods differ in the way the inverse of the covariance matrix, Σ^{−1}, is estimated and in the statistical tests employed to assess significance. The reason for these technical variants comes from a variety of problems. For instance, if the number of samples is smaller than the number of genes, which is typically the case for a microarray data set, the sample covariance matrix is not positive definite and, hence, not invertible. This means that Eqn. 55 cannot be exploited. In order to overcome such practical estimation problems, recently, several extensions based on the LASSO (least absolute shrinkage and selection operator) have been suggested [124–128].
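The relation between full-order partial correlations and the elements of Ω can be illustrated numerically. The sketch below assumes the standard GGM identity, ρ_{ij|V∖{ij}} = −ω_{ij}/√(ω_{ii} ω_{jj}); since the equation itself is not reproduced in this section, this form is an assumption based on that standard result. The 3-gene precision matrix and the function name are hypothetical. The second part shows the estimation problem mentioned above: with fewer samples than genes the sample covariance matrix is rank deficient and cannot be inverted.

```python
import numpy as np

def partial_correlations(cov):
    """Full-order partial correlations from a covariance matrix.

    Uses the standard identity rho_ij = -omega_ij / sqrt(omega_ii * omega_jj),
    where omega is the inverse covariance (precision) matrix.
    """
    omega = np.linalg.inv(cov)
    d = np.sqrt(np.diag(omega))
    pcor = -omega / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor

# Hypothetical chain of three genes, 0 - 1 - 2: genes 0 and 2 interact
# only through gene 1, so omega[0, 2] is zero.
omega = np.array([[1.0, -0.5, 0.0],
                  [-0.5, 1.25, -0.5],
                  [0.0, -0.5, 1.0]])
cov = np.linalg.inv(omega)
pcor = partial_correlations(cov)
marginal_02 = cov[0, 2] / np.sqrt(cov[0, 0] * cov[2, 2])
print(marginal_02, pcor[0, 2])  # marginal correlation is nonzero, partial is ~0

# With n = 5 samples and p = 10 genes the sample covariance has rank < p.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 10))
rank = int(np.linalg.matrix_rank(np.cov(x, rowvar=False)))
print(rank)
```

The example shows why partial correlations are used for network inference: the indirect association between genes 0 and 2 is visible in the marginal correlation but vanishes once gene 1 is conditioned on.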
Importance of multiple hypotheses testing and sample size: An example for differentially expressed genes
controlled by a procedure introduced by Benjamini & Hochberg (BH) [131]. Subsequently, various related error measures have been proposed, like pFDR [132] and local FDR [133, 134], as well as a variety of other control procedures [129, 135]. Extensions have also been suggested [136] that allow the control of an error measure in cases where the underlying tests are not independent of each other. This is particularly important for microarray data, which contain a non-negligible correlation structure among the genes.
that means the local FDR is the probability that the null model is true conditioned on the observed test statistic. The data we used for this analysis correspond to simulated gene expression data sampled from a normal distribution with different mean values for the two conditions. More precisely, we simulate 2000 genes of which 400 are differentially expressed (true positives). Further, we study three different (constant) correlation structures with ρ = {0.0,0.2,0.5}. The results shown in Figure 6 are for each sample size averaged over 50 independent runs.
As one can see, ‘BH’ and ‘pFDR’ give more significant results and, hence, have a higher power than the Bonferroni correction and the local FDR when there is either no or only a moderate correlation among the genes. However, it is important to note that the utility of these methods depends on the characteristics of the data. For example, if the average correlation in the data is ρ = 0.0, then ‘pFDR’ tends to perform best (see Figure 6A). However, when the average correlation in the data increases (ρ = 0.5), the ‘localFDR’ [134, 137] becomes preferable. We note that for a sample size of 5 the power of the methods is usually very low because only a couple of genes test significant. In addition, a large fraction of these can be false positives. This seems to be a problem especially for the local FDR method.
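The difference in power between the Bonferroni correction and the BH procedure can be illustrated with a small simulation that mirrors the setting above (2000 genes, 400 of them differentially expressed, ρ = 0.0). The following Python sketch is not the analysis behind Figure 6. The effect size (a mean shift of 1.0), the sample size of 20 per condition, the significance level of 0.05 and the use of a z-test with known unit variance are simplifying assumptions, and the function names are hypothetical.

```python
import math
import numpy as np

def two_sided_z_pvalues(x1, x2):
    """Per-gene two-sided p-values for a difference in means.

    Assumption: every gene has known unit variance, so a z-test applies.
    x1 and x2 have shape (samples, genes).
    """
    n1, n2 = x1.shape[0], x2.shape[0]
    z = (x2.mean(axis=0) - x1.mean(axis=0)) / math.sqrt(1.0 / n1 + 1.0 / n2)
    return np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])

def bonferroni(pvals, alpha=0.05):
    """Reject when p <= alpha / m (controls the family-wise error rate)."""
    p = np.asarray(pvals)
    return p <= alpha / p.size

def benjamini_hochberg(pvals, alpha=0.05):
    """BH step-up procedure: reject the k smallest p-values, where k is the
    largest rank i with p_(i) <= alpha * i / m."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = int(np.max(np.nonzero(below)[0]))
        reject[order[:k + 1]] = True
    return reject

rng = np.random.default_rng(0)
n_genes, n_de, n_samples, shift = 2000, 400, 20, 1.0
x1 = rng.normal(size=(n_samples, n_genes))
x2 = rng.normal(size=(n_samples, n_genes))
x2[:, :n_de] += shift  # the first 400 genes are differentially expressed

pvals = two_sided_z_pvalues(x1, x2)
bonf = bonferroni(pvals)
bh = benjamini_hochberg(pvals)
is_de = np.arange(n_genes) < n_de
fdp = (bh & ~is_de).sum() / max(int(bh.sum()), 1)  # false discovery proportion
print(int(bonf.sum()), int(bh.sum()), float(fdp))
```

Every gene rejected by Bonferroni is also rejected by BH, so BH can only gain power; the price is that a controlled fraction of its discoveries may be false positives.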
Recommendations
In general, there is a trade-off between a high power of a statistical method on one side, which requires a large number of samples, and low experimental costs on the other. For the identification of differentially expressed genes the results in Figure 6 provide some guidelines. Even for the most favorable condition (ρ = 0.0), a study will usually be underpowered for ≤ 20 samples; on the other hand, even for 10 samples the Type I error will be well controlled.
For gene set and pathway-based methods such recommendations are more delicate. In [105] two self-contained tests (sum of t-square and Hotelling’s T^{2} [84, 95]) and one competitive test (GSEA [27]) have been analyzed. As a result, it is suggested not to apply a method unconditionally to all pathways in a given data set, but to filter them in order to eliminate conditions for which a method is more likely to cause problems. This can be seen as a reflection of the heterogeneity of cancer, as discussed above in the section ‘Gene expression data and cancer’.
In [105] it has been suggested to filter pathways according to the following criteria: Hotelling’s T^{2} should only be applied to pathways with fewer than 35 genes and a sample size larger than 30. The sum of t-square test should only be used for pathways with DC > 10% (DC is detection call; the percentage of differentially expressed genes in a pathway) and a sample size of 25 or larger. GSEA should only be used for pathways with DC > 10% and a sample size larger than 25. That means for the sum of t-square test and GSEA, at least 10% of the genes in a pathway should be differentially expressed for the method to work. However, this is not independent of the correlation structure of the data. In general, in the presence of high correlations a larger number of differentially expressed genes is beneficial for these methods.
It is important to emphasize that these sample sizes differ from the minimal sample sizes necessary to avoid, in addition, that a study is underpowered. For the minimal sample sizes, [105] predicts a sample size of 59 for Hotelling’s T^{2}, 57 for the sum of t-square test and 83 for GSEA. Further, in [95] it was found that using the N-statistic with 40 samples (or more) leads to a good control of the Type I error and a satisfactory power for a variety of differing conditions, including different correlations of the data and DC values in the pathways. Further studies reviewing related methods can be found in [61, 62, 65, 139, 140].
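The filtering criteria above can be written down as a small rule set. The following Python sketch only encodes the thresholds quoted in the text; the function name, the argument names and the returned labels are hypothetical, and the text does not state whether the sample size refers to the total or to each condition, so the caller has to decide which convention to use.

```python
def applicable_tests(n_genes, n_samples, dc_percent):
    """Return the tests that pass the pathway filter suggested in [105].

    n_genes    : number of genes in the pathway
    n_samples  : sample size (interpretation left to the caller)
    dc_percent : detection call, the percentage of differentially
                 expressed genes in the pathway
    """
    tests = []
    # Hotelling's T^2: fewer than 35 genes and more than 30 samples.
    if n_genes < 35 and n_samples > 30:
        tests.append("hotelling_t2")
    # Sum of t-square: DC > 10% and a sample size of 25 or larger.
    if dc_percent > 10 and n_samples >= 25:
        tests.append("sum_t_square")
    # GSEA: DC > 10% and a sample size larger than 25.
    if dc_percent > 10 and n_samples > 25:
        tests.append("gsea")
    return tests

print(applicable_tests(n_genes=20, n_samples=40, dc_percent=15))
print(applicable_tests(n_genes=50, n_samples=25, dc_percent=15))
print(applicable_tests(n_genes=20, n_samples=40, dc_percent=5))
```

Such a filter does not replace a power analysis; it only removes pathway and method combinations for which a method is more likely to cause problems.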
We would like to emphasize that the above recommendations are data dependent. That means it is not possible to judge solely based on the number of samples which method to use. Instead, one needs to estimate characteristics from a particular data set in order to select an appropriate test. This implies, e.g., to estimate the correlation structure and the detection call. In the context of cancer there is an additional problem that needs to be considered. It is known that a tumor is a heterogeneous collection of cells rather than a homogeneous one [26, 141]. This translates into the heterogeneity of gene expression data [142] making it even more dangerous to provide general recommendations without considering a particular data set.
On a general note, we would like to highlight that whenever a given data set allows one to (I) identify changes in single genes, (II) identify changes in gene sets or pathways, and (III) identify changes in the correlation structure in pathways, then methods from each of the three categories (I-III) should be applied, and there is no need to focus on just one of these. The reason is that although gene set or pathway methods have more explanatory power than methods to identify changes in single genes [64, 95], this does not mean that there are no conditions for which single gene methods reveal interesting biological information that may not be obtained by the other types of methods. For instance, the differential expression of a single gene based on changes in the mean (rather than the variance) may be an indicator for the presence of a single signaling chain rather than of many parallel pathways. Hence, this could provide information about the presence of a Mendelian trait or a complex trait that contains a strong monogenetic component. It appears that for such conditions single gene methods have an advantage over gene set or pathway methods, although the latter methods may be adaptable to such questions as well. However, this may require additional effort. In summary, we recommend using all approaches (I-III) side by side, whenever this is permitted by the data, to interrogate the data in the broadest way, because this translates into a diverse set of biological questions.
Our recommendation complements a common line of thought asking for the combination of different types of data. Although it is certainly true that combining different types of high-throughput data, e.g., from DNA microarray and ChIP-chip experiments, is in general more informative, it is also more time and cost intensive to generate such data combinations. For this reason, frequently, only gene expression data are available. Hence, our review provides a survey of methods to get the most out of expression data sets.
Finally, we would like to emphasize that all methods require an appropriate filtering and normalization of the data in order to obtain robust and statistically sensible results [143, 144].
Conclusions and discussion
In the post-genomic era, biology transitioned from a ‘gene-centric’ to a systems-focused field. This change is also reflected in the transition from methods to identify ‘differentially expressed single genes’ to approaches for finding ‘differentially changed pathways’. Such a transition is natural, because a systems view is required to understand the complex biological functions inside a cell that are responsible for the observable phenotypic outcomes [9, 11, 145].
As recent findings in cancer research demonstrate, cancer is a heterogeneous disorder, even within a particular cancer type. For example, breast cancer is currently subcategorized into four major tumor subtypes [146]: basal-like, HER2-enriched, luminal (which can be further subdivided) and normal-like tumors. Considering the fact that these results have been achieved by using high-throughput data, one can expect further refinements when data from different high-throughput technologies become available and are combined with each other. For this reason, it appears sensible to assume such a heterogeneity not only on the global, phenotype level, but also within the cells, on the pathway level. This implies that a pathway-based filtering, as suggested in [105], is necessary to apply a method only selectively, and not unconditionally, to cancer pathways.
Regarding potential future directions, we expect to see an increase in methods that target changes in the correlation structure in pathways for three reasons. First, genes and their products do interact with each other. This implies that there exists a correlation structure among these entities that represents potentially useful biological information that may be missed by coexpression-based methods [106]. Second, the costs to generate high-throughput data are declining, which makes it easier for the experimenter to generate a sufficiently large number of samples that enables such an analysis. This is an important point, since the required sample sizes for a pathway analysis are considerably larger than for single gene analyses. Third, biologically, the hallmarks of cancer point to a few pathways as pivotal elements in the molecular elucidation of carcinogenesis, e.g., Wnt/Notch signaling, Hedgehog signaling or DNA damage control [147–149]. Hence, semantically, pathway studies enable the systematic connection of oncogenes, tumor-suppressor genes and stability genes [150] to provide fundamental insights into causal mechanisms underlying cancer. Unfortunately, the contemporary literature, especially methodological papers, rarely discusses results in the framework of the hallmark pathways. For this reason, we suggest that future studies aim for a conceptual discussion of their results within this enlightening framework. Not because it provides the final answers to understand cancer [151], but due to the fact that it enables a systematic approach to the emperor of all maladies [152].
Reviewers’ comments
First of all, we would like to thank all referees for their fruitful suggestions and comments. In the following, we kept our answers to the raised issues short but included our responses in the main text.
Referee 1: Dr. Arcady Mushegian
The manuscript by Emmert-Streib and colleagues is a review of statistical methods for analysis of gene expression data, but it is also much more than that. It is relatively rare for the statisticians to review all classes of such methods and to give an eminently logical classification not only of the techniques on which the methods are based, but also of the kinds of questions that are asked when applying these methods. This, certainly, is a strength of the work and the reason why it should appeal to the biologists that would like to have a deeper insight into which methods are appropriate to which task at hand.
I have, however, several comments that rank somewhere between suggestions and concerns. Most importantly, the authors propose to distinguish three groups of methods: those that identify changes in single genes, those that identify changes in gene sets or pathways, and those that identify changes in the correlation structure in pathways. (By the way, in the Abstract and elsewhere, the description of the groups is almost the same as above, but “changes” are substituted by “differential changes”: is it not a tautology, in particular when there are only two samples?) Then, in discussing the first two classes of methods, the authors almost in every case give a clear formulation of the question that is being asked of the data, in the form of the statistical hypothesis about the data that is being tested. This is an excellent way of explaining things. Unfortunately, it is not consistently applied: even among these classes of methods, the hypotheses are not always mentioned, and then, upon discussing the differential-correlation methods (pp. 15-18), the hypotheses are not explicated at all, except for the IPS method. I think this needs to be changed, and the null hypotheses need to be stated for all methods for which this is possible; and if the framework is such that no explicit null hypothesis exists, this needs to be discussed, and the applicable intuitive formulation be given.
Reply
We appreciate this suggestion and added to all methods the definition of their null hypothesis. In addition, we extended the discussion in section ‘Formulating biological hypotheses’ explaining why it can be difficult to find a biological interpretation for a null hypothesis and we offer some explanation for this.
Question
My other concern is about Figures 3 and 4. The authors never state what the data points there represent. They must be expression values for two genes, but how are these data collected: are they technical replicates? biological replicates? some kind of ordered series? unordered series such as for example different drug treatments? Does it matter which of the above they are?
Reply
We added an explanation of the data, which are simulated data to visualize the principle idea underlying some methods, to the corresponding methods.
Question
The third shortcoming of the paper is that there is a significant disconnect between well-covered methodology and the stated goal of discussing the application to cancer biology. In fact, the short discussion about cancer hallmarks is an excellent introduction that points out the way in which analysis of gene expression can lead to the understanding of changes in expression of particular (“hallmark”) pathways. This theme, however, is not followed through. Though occasionally we read that such and such method was applied to analysis of a particular type of cancer, there is never any discussion of what was found in gene expression data that allowed an insight into cancer biology. What happens to the hallmark pathways at the level of gene expression programs? Which methods have been used to support (or maybe question?) which aspects of the hallmark hypothesis? Which pathways were predicted or shown to be differentially regulated at the transcription or mRNA concentration level?
Reply
We agree with the reviewer that ‘Which methods have been used to support (or maybe question?) which aspects of the hallmark hypothesis?’ is an important question. Unfortunately, the methodology-oriented literature rarely touches this topic in a clear manner. That means in order to extend the paper in this direction we could not survey these issues but would need to establish such results. Instead, the concern in our paper is to propagate such an approach in the context of the presented methods. A discussion has been added to ‘Conclusions and discussion’.
Question
Finally, there is the question of, if you will, general biology of transcriptional response. It stands to reason, and indeed has been occasionally shown, that in order for a pathway to be regulated, it may not be necessary to regulate all its components at the same (in this case, mRNA concentration) level. One may find that the gene product amounts are regulated at different levels, or maybe even only one or a few, e.g., rate-limiting, components are regulated at all. This would argue that single gene-based methods may in these cases provide a better clue to the process than pathway-based or gene set enrichment-based methods. It would be interesting to know whether this has been observed in the cancer datasets. A related question is about the rules of thumb in pathway analysis: for example, if a typical pathway (network module?) has a size of N genes, what is the number of genes in this pathway m < N that would still register as an enrichment in some of the tests that the authors discuss?
Reply
This is an important point. We included a discussion of this to section ‘Recommendations’. We added also a discussion of the danger of general suggestions and motivate this by known characteristics of gene expression data from cancer. The problem is twofold. First, each method has its own characteristics under what conditions it works best. Second, data sets from cancer are very heterogeneous so that two data sets containing about the same number of samples can exhibit a very different correlation structure and expression patterns. This holds potentially also for different grades of one cancer type.
Regarding the first question, it appears to us that this is related to the presence or absence of parallel pathways conveying a molecular signal. If, for example, no parallel pathways exist, the detection of differentially expressed genes can provide a robust way to detect functional changes. On the other hand, if there are many alternatives this may not be the case, and gene set methods appear to be better suited for such a situation. In general, such cross-method comparisons are not well studied and we are not aware that this has been systematically addressed for cancer or other data sets. One reason for this is that until recently most data sets contained fewer than 20 samples per condition, which usually does not permit a robust analysis with gene set or pathway methods, and once larger data sets became available the detection of differentially expressed genes was neglected, potentially due to the erroneous assumption that differentially expressed gene set methods include the former tests.
In order to emphasize that it is desirable to apply methods from all three different levels simultaneously ((I) identify changes in single genes, (II) identify changes in gene sets or pathways, and (III) identify changes in the correlation structure in pathways) whenever a given data set allows this, rather than to focus on just one of these levels, we added a discussion to section ‘Recommendations’.
Thank you for your suggestions and comments.
Referee 2: Dr. ByungSoo Kim
General comments
This is a well organized review of recent statistical methods of analyzing microarray experiment data sets, particularly on cancers, from single gene analysis to identifying differential changes in pathway, and finally to comparing a given pathway under two different conditions. However, I would like to indicate following four points for the possible improvement.
(1) Gaussian graphical model: From the methodological point of view, it is desired to include the sparse Gaussian graphical model (GGM) approach for estimating the gene network under the multivariate normal assumption from a microarray data set. For the recent development of GGM approach one can include glasso (Friedman, Hastie and Tibshirani, 2008; Witten, Friedman and Simon, 2011) [125, 127], SCAD penalty of Fan, Feng and Wu (2009) [124], adaptive lasso of Zou (2006) [128] and Kiiveri (2012) [126], among others.
(2) Effect of inter-gene correlations on the single gene analysis. A series of Efron’s recent work (Efron 2007a, 2007b) [134, 137] discussed in detail how inter-gene correlations could affect the detection of differentially expressed (DE) genes in a single gene analysis. By including Efron’s recent work and his R package “locfdr” authors can show how FDR can be used in the real data analysis in their Section on “Importance of multiple hypotheses testing and sample size: An example for differentially expressed genes”.
(3) Some of the reviews are misleading. These are the few examples. (i) The sentence, at the middle of page 12, “However, in order to use a two-sample t-test with equal size of the two samples it is assumed that the mean fold change f’ and its standard deviation σ_{f} would be the same for a randomly selected background set consisting of only m genes, see Eqn. 10”. Actually ([99], Luo et al., 2009) assumes the i.i.d of the fold change of genes to make Eqn 10 have a t distribution. Here the key assumption was the independence, which was missing in the aforementioned sentence. (ii) p. 14. Eqn 16. In ([126], Tian et al., 2005) no t-square statistic was employed. (iii) Eqn 24 of p. 18 does not make sense. Authors of ([20], Cho et al., 2009) didn’t make it clear in their equation (3) what Renyi entropy was when the underlying random variables were continuous. (iv) I would suggest authors to allocate more space on the work of ([90], Massa et al, 2010) which was methodologically sound and deserve more coverage than just the IPS algorithm.
(4) Inconsistency of notations. In page 11 authors defined p and m to be sizes of the background genes and a target gene set, respectively. However, in line 2 of page 15 “p genes” (which should have been m genes according to page 11 definition) was incorrectly labeled. This inconsistency was repeated in N-statistic section of p.15, and also in Eqn 16 in p. 14 and Eqn 22 of p. 17. The “p-dimensional..” should be “m-dimensional..” at the bottom two lines of p. 14.
Minor Comments
1. p.2 “Gene expression data from next generation sequencing (RNA-seq)”. This is an important issue. There is no direct relevance, however, with statistical methods reviewed in this paper.
2. p.4. For detecting differential correlation and differential variance, it would be better to explain why these approaches were taken. For example, in ([54], Ho et al., 2008) it was clearly indicated that changes in expression variability were associated with changes in coexpression pattern, which implied that DV was a signal rather than a noise.
3. Legend of Figure 2. “The data is..” should be “The data are.. ”.
4. p.7. There is no reference of Figures A, B in the main text. Also indicate in the legend of Figure 3 what Σ is in Figure 3E.
5. p.8. In the legend of Figure 4 what the symbols in the outer panel represent? What do the lines represent? It is better to use different notation (A, B) to avoid confusion in the main text of the second paragraph of p. 9.
6. p.9 What is “alpha” in Equation (4)?
7. p.9 line 9. You may include two specific patterns of dependence of two genes, namely, type A dependence of Klebanov, Jordan and Yakovlev (2006), and hidden regulator dependence of Lim, Kim and Kim (2011).
8. p.15 line 13. “euclidean Kernel” should be “Euclidean kernel” (9)
9. p.15 line 10. “a either” should be “either”.
10. p.15 line 8. Author may want to include Tsai and Chen (2009) for another reference of Hotelling’s T-square statistic.
11. p.17. line 15, What are “A” and “B”?
12. p. 18 line 2. Better to include Lauritzen (1996) as a reference of IPS algorithm.
13. p. 22. It would be more beneficial for the reader to move the last paragraph of p. 22 (extended to p. 23) to Introduction section.
References
Efron B. (2007a). Correlation and large-scale simultaneous significance testing, J. Amer. Statist. Assoc. 102:93-103.
Efron B. (2007b). Size, power and false discovery rates, Annals of Statist. 35:1351-1377.
Fan J, Feng Y, Wu Y. (2009). Network exploration via the adaptive lasso and SCAD penalties. Ann. Appl. Statist. 3:521-541.
Friedman J, Hastie T, Tibshirani R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9:432-441.
Lauritzen SL. (1996). Graphical models, Oxford: Clarendon Press.
Lim J, Kim J, Kim BS. (2011). An alternative model of type A dependence in a gene set of correlated genes, Statist. Appl. Genet. Mol. Biol. Vol. 9, Article 12.
Kiiveri H, de Hoog F. (2012). Fitting very large Gaussian graphical models. Comp. Statist. Data Anal. 56:2626-2636.
Klebanov L, Jordan C, Yakovlev A. (2006). A new type of stochastic dependence revealed in gene expression data, Statist. Appl. Genet. Mol. Biol. Vol. 5, Article 7.
Tsai CA, Chen J. (2009). Multivariate analysis of variance test for gene set analysis, Bioinformatics, 25:897-903.
Witten DM, Friedman JH, Simon N. (2011). New insights and faster computation for the graphical lasso. J. Comp. Graph. Stat. 20:892-900.
Zou H. (2006). The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc. 101:1418-1429.
Reply
We revised our text correspondingly and addressed all your suggestions. We would like to point out that the major goal of our review is not a full coverage of statistical details but to provide sufficient information for the reader to acquire a basic understanding of major principles and assumptions that underlie the methods. The problem is that if too many details are presented the paper would quickly turn into a formal description, which may not be appreciated by a biology-oriented readership.
Minor Comments
1. p.12. line 1: What is N?
2. p. 15. line 7 5: “two i.i.d samples of genes..” is rather confusing. Luo et al. (2005) assumed the i.i.d of the fold change of genes, which was much stronger than just assuming equal mean and variance. It is better to rewrite this sentence to convey the original material.
3. p.17. line 1: “p genes” should be “m genes”.
4. p.22. Eqn. (50): What are p and q? What are S_{i} and S_{j}?
5. p.37. Reference 30, p. 38. References 40, 59; The journal title should be consistent with Reference 27 or vice versa.
6. p.40. References 90, 108: Location of the publisher is missing.
7. p.40. References 93, 94; The journal title should be consistent with Reference 119.
8. p.40. Reference 109: Author was duplicated at the end. The location and the publisher were missing.
9. p.41. Reference 117. The article title is missing.
10. p.41. Reference 118: The location of the publisher is missing.
11. p.41. Reference 133. The journal title should be consistent with Reference 27.
Reply
All comments have been addressed and we revised the main text correspondingly.
Thank you for your suggestions and comments.
Referee 3: Dr. Joel Bader
This manuscript reviews methods for analyzing gene expression data with tests of individual genes, gene pairs, gene sets, and networks. The manuscript is strong in covering many methods. It would be more helpful if the authors also provided a point of view or evaluation of methods. Can anything be said about the relative power of different approaches, or which have proven to be more useful in practice? What about the tradeoff between robustness, power, and speed for realistic data? Most of the discussion of method choice is generally about sample size requirements for all methods rather than method choice given sample size. The two parts of the manuscript, gene expression and cancer, don’t really mesh. Most of the methods review is not cancer specific. Possibly of greater relevance to cancer are methods that combine different types of data.
The manuscript is generally well written and easy to understand, with ample references to the original work and to previous reviews.
 1.
p. 1 'one gene, −> should be ` for open quote in LaTeX, here and elsewhere
 2.
p. 2 differnt microarray −> spelling
 3.
p. 2 comprises, e.g., mRNAs −> ‘e.g.’ doesn’t sound right here. How about providing a full list: mRNA, tRNA, rRNA, and short regulatory RNAs
 4.
p. 2 ‘In the third step the reads are mapped to known exon sequences of genes.’ Are there also de novo assembly methods that don’t require a template? ‘allows to overcome’ −> overcomes
 5.
p. 3 allows to measure −> measures. Can also mention other advantages: splice variants, sequence polymorphisms, no need to design and build a custom chip
 6.
‘correspond to: self sufficiency’ −> no colon between preposition and noun phrase. Can the hallmarks be parallel, all start with noun or verb?
 7.
p. 9 Eq. 4. How is alpha calculated?
 8.
Eq. 5 needs i = 1 underneath the summation
 9.
p. 14. Eq. 16 Under the null, it seems that Σ_{t,2} should approach $1/\sqrt{p}$ rather than 0.
 10.
Eq. 16 How is the significance of SAMGS calculated?
 11.
p. 18 Eq. 24 and text after, use \log in math mode rather than plain log.
Reply
All comments have been addressed and we revised the main text correspondingly.
Thank you for your suggestions and comments.
Declarations
Acknowledgements
ST is supported by a studentship from the National Institute of Immunology. FES and RDMS are supported by the Engineering and Physical Sciences Research Council (EPSRC) and DEL.
Authors’ Affiliations
References
 Bock G, Goode J: Novartis Foundation Symposium. 1998, John Wiley & Sons
 Van Regenmortel M: Reductionism and complexity in molecular biology. EMBO Rep. 2004, 5 (9): 1016-1020.
 Mazzocchi F: Complexity in biology. EMBO Rep. 2008, 9: 10-14. 10.1038/sj.embor.7401147.
 von Bertalanffy L: An outline of general systems theory. Br J Philosophy Sci. 1950, 1 (2): 134-165.
 Beadle GW, Tatum EL: Genetic control of biochemical reactions in Neurospora. Proc Natl Acad Sci USA. 1941, 27 (11): 499-506. 10.1073/pnas.27.11.499.
 Hanahan D, Weinberg RA: The hallmarks of cancer. Cell. 2000, 100: 57-70. 10.1016/S00928674(00)816839.
 Noble D: Genes and causation. Phil Trans R Soc A. 2008, 366: 3001-3015. 10.1098/rsta.2008.0086.
 Kitano H: Systems biology: a brief overview. Science. 2002, 295 (5560): 1662-1664. 10.1126/science.1069492.
 Han JDJ: Understanding biological functions through molecular networks. Cell Res. 2008, 18 (2): 224-237. 10.1038/cr.2008.16.
 MacDougall-Shackleton SA: The levels of analysis revisited. Phil Trans R Soc B: Biol Sci. 2011, 366 (1574): 2076-2085. 10.1098/rstb.2010.0363.
 Barabasi AL, Oltvai ZN: Network biology: understanding the cell’s functional organization. Nat Rev Genet. 2004, 5: 101-113. 10.1038/nrg1272.
 Brazhnik P, de la Fuente A, Mendes P: Gene networks: how to put the function in genomics. Trends Biotechnol. 2002, 20 (11): 467-472. 10.1016/S01677799(02)02053X.
 Emmert-Streib F, Glazko G: Network biology: a direct approach to study biological function. Wiley Interdiscip Rev Syst Biol Med. 2011, 3 (4): 379-391. 10.1002/wsbm.134.
 Davidson E, Levin M: Gene regulatory networks. Proc Natl Acad Sci USA. 2005, 102 (14): 4935. 10.1073/pnas.0502024102.
 de Matos Simoes R, Tripathi S, Emmert-Streib F: Organizational structure of the peripheral gene regulatory network in B-cell lymphoma. BMC Syst Biol. 2012, 6: 38. 10.1186/17520509638.
 Jones S, Thornton JM: Principles of protein-protein interactions. Proc Natl Acad Sci USA. 1996, 93: 13-20. 10.1073/pnas.93.1.13.
 Maslov S, Sneppen K: Specificity and stability in topology of protein networks. Science. 2002, 296 (5569): 910-913. 10.1126/science.1065103.
 Jeong H, Tombor B, Albert R, Oltvai Z, Barabasi A: The large-scale organization of metabolic networks. Nature. 2000, 407: 651-654. 10.1038/35036627.
 Babu MM, Luscombe NM, Aravind L, Gerstein M, Teichmann SA: Structure and evolution of transcriptional regulatory networks. Curr Opin Struct Biol. 2004, 14: 283-291. 10.1016/j.sbi.2004.05.004.
 Lee TI, et al: Transcriptional regulatory networks in Saccharomyces cerevisiae. Science. 2002, 298 (5594): 799-804. 10.1126/science.1075090.
 Allison DB: Microarray data analysis: from disarray to consolidation and consensus. Nat Rev Genet. 2006, 7: 55-65. 10.1038/nrg1749.
 Dehmer M, Emmert-Streib F, Graber A, Salvador A (Eds): Applied Statistics for Network Biology: Methods for Systems Biology. 2011, Weinheim: Wiley-Blackwell
 Quackenbush J: Computational analysis of microarray data. Nat Rev Genet. 2001, 2 (6): 418-427. 10.1038/35076576.
 Metzker ML: Sequencing technologies - the next generation (With NOTES). Nat Rev Genet. 2010, 11: 31-46. 10.1038/nrg2626.
 Wang Z, Gerstein M, Snyder M: RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009, 10: 57-63. 10.1038/nrg2484.
 Hanahan D, Weinberg RA: Hallmarks of cancer: the next generation. Cell. 2011, 144 (5): 646-674. 10.1016/j.cell.2011.02.013.
 Subramanian A, Tamayo P, Mootha V, Mukherjee S, Ebert B, Gillette M, Paulovich A, Pomeroy S, Golub T, Lander E, Mesirov J: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA. 2005, 102 (43): 15545-50. 10.1073/pnas.0506580102.
 Chuang HY, Lee E, Liu YT, Ideker T: Network-based classification of breast cancer metastasis. Mol Syst Biol. 2007, 3: 140.
 Compagno M, Lim WK, Grunn A, Nandula SV, Brahmachary M, Shen Q, Bertoni F, Ponzoni M, Scandurra M, Califano A, et al: Mutations of multiple genes cause deregulation of NF-kappaB in diffuse large B-cell lymphoma. Nature. 2009, 459 (7247): 717-721. 10.1038/nature07968.
 Horvath S, Zhang B, Carlson M, Lu KV, Zhu S, Felciano RM, Laurance MF, Zhao W, Qi S, Chen Z, et al: Analysis of oncogenic signaling networks in glioblastoma identifies ASPM as a molecular target. Proc Natl Acad Sci USA. 2006, 103 (46): 17402-17407. 10.1073/pnas.0608396103.
 Krivtsov AV, Twomey D, Feng Z, Stubbs MC, Wang Y, Faber J, Levine JE, Wang J, Hahn WC, Gilliland DG, et al: Transformation from committed progenitor to leukaemia stem cell initiated by MLL-AF9. Nature. 2006, 442 (7104): 818-822. 10.1038/nature04980.
 Oskarsson T, Acharyya S, Zhang XHF, Vanharanta S, Tavazoie SF, Morris PG, Downey RJ, Manova-Todorova K, Brogi E, Massague J: Breast cancer cells produce tenascin C as a metastatic niche component to colonize the lungs. Nat Med. 2011, 17 (7): 867-874. 10.1038/nm.2379.
 Mavrakis KJ, Wolfe AL, Oricchio E, Palomero T, De Keersmaecker K, McJunkin K, Zuber J, James T, Khan AA, Leslie CS, et al: Genome-wide RNA-mediated interference screen identifies miR-19 targets in Notch-induced T-cell acute lymphoblastic leukaemia. Nat Cell Biol. 2010, 12 (4): 372-379. 10.1038/ncb2037.
 Nam S, Park T: Pathway-based evaluation in early onset colorectal cancer suggests focal adhesion and immunosuppression along with Epithelial-Mesenchymal transition. PLoS ONE. 2012, 7 (4): e31685. 10.1371/journal.pone.0031685.
 Guedj M, Marisa L, De Reynies A, Orsetti B, Schiappa R, Bibeau F, Macgrogan G, Lerebours F, Finetti P, Longy M, et al: A refined molecular taxonomy of breast cancer. Oncogene. 2011, 31 (July 2011): 1196-1206.
 Lehmann BD, Bauer JA, Chen X, Sanders ME, Chakravarthy AB, Shyr Y, Pietenpol JA: Identification of human triple-negative breast cancer subtypes and preclinical models for selection of targeted therapies. J Clin Invest. 2011, 121 (7): 2750-2767. 10.1172/JCI45014.
 Fabbri G, Rasi S, Rossi D, Trifonov V, Khiabanian H, Ma J, Grunn A, Fangazio M, Capello D, Monti S, et al: Analysis of the chronic lymphocytic leukemia coding genome: role of NOTCH1 mutational activation. J Exp Med. 2011, 208 (7): 1389-1401. 10.1084/jem.20110921.
 Ooi CH, Ivanova T, Wu J, Lee M, Tan IB, Tao J, Ward L, Koo JH, Gopalakrishnan V, Zhu Y, Cheng LL, Lee J, Rha SY, Chung HC, Ganesan K, So J, Soo KC, Lim D, Chan WH, Wong WK, Bowtell D, Yeoh KG, Grabsch H, Boussioutas A, Tan P: Oncogenic pathway combinations predict clinical prognosis in gastric cancer. PLoS Genet. 2009, 5 (10): e1000676. 10.1371/journal.pgen.1000676.
 Setlur SR, Royce TE, Sboner A, Mosquera JM, Demichelis F, Hofer MD, Mertz KD, Gerstein M, Rubin MA: Integrative microarray analysis of pathways dysregulated in metastatic prostate cancer. Cancer Res. 2007, 67 (21): 10296-10303. 10.1158/00085472.CAN072173.
 Nucera C, Porrello A, Antonello ZA, Mekel M, Nehs MA, Giordano TJ, Gerald D, Benjamin LE, Priolo C, Puxeddu E, et al: B-Raf(V600E) and thrombospondin-1 promote thyroid cancer progression. Proc Natl Acad Sci USA. 2010, 107 (23): 10649-10654. 10.1073/pnas.1004934107.
 Shah MA, Khanin R, Tang L, Janjigian YY, Klimstra DS, Gerdes H, Kelsen DP: Molecular classification of gastric cancer: a new paradigm. Clin Cancer Res. 2011, 17 (9): 2693-2701. 10.1158/10780432.CCR102203.
 Perroud B, Lee J, Valkova N, Dhirapong A, Lin PY, Fiehn O, Kultz D, Weiss R: Pathway analysis of kidney cancer using proteomics and metabolic profiling. Mol Cancer. 2006, 5: 64. 10.1186/14764598564.
 Trewavas A: A Brief History of Systems Biology: “Every object that biology studies is a system of systems.” Francois Jacob (1974). Plant Cell. 2006, 18 (10): 2420-2430. 10.1105/tpc.106.042267.
 Emmert-Streib F, Dehmer M: Networks for systems biology: conceptual connection of data and function. IET Syst Biol. 2011, 5 (3): 185. 10.1049/ietsyb.2010.0025.
 Macneil LT, Walhout AJM: Gene regulatory networks and the role of robustness and stochasticity in the control of gene expression. Genome Res. 2011, 21 (5): 645-57. 10.1101/gr.097378.109.
 Lehmann E: Testing Statistical Hypotheses. 2005, New York: Springer
 DasGupta A: Probability for Statistics and Machine Learning. 2011, New York: Springer
 Chen Y, Dougherty ER, Bittner ML: Ratio-based decisions and the quantitative analysis of cDNA microarray images. J Biomed Optics. 1997, 2 (4): 364-374. 10.1117/12.281504.
 Zhang L, Zhou W, Velculescu VE, Kern SE, Hruban RH, Hamilton SR, Vogelstein B, Kinzler KW: Gene expression profiles in normal and cancer cells. Science. 1997, 276 (5316): 1268-1272. 10.1126/science.276.5316.1268.
 Tusher V, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA. 2001, 98 (18): 5116-5121.
 Chu G, Narasimhan B, Tibshirani R, Tusher V: Significance analysis of microarrays (SAM) software. Nature. 2002, 5: 436-442.
 Smyth GK: Limma: linear models for microarray data. Bioinformatics and Computational Biology Solutions using R and Bioconductor. Edited by: Gentleman R, Carey V, Dudoit S, Irizarry R, Huber W. 2005, New York: Springer, 397-420.
 Efron B, Tibshirani R, Storey JD, Tusher V: Empirical Bayes analysis of a microarray experiment. J Am Stat Assoc. 2001, 96 (456): 1151-1160. 10.1198/016214501753382129.
 Ho JWK, Stefani M, Dos Remedios CG, Charleston MA: Differential variability analysis of gene expression and its application to human diseases. Bioinformatics. 2008, 24 (13): i390-i398. 10.1093/bioinformatics/btn142.
 Hu R, Qiu X, Glazko G, Klebanov L, Yakovlev A: Detecting intergene correlation changes in microarray analysis: a new approach to gene selection. BMC Bioinformatics. 2009, 10: 20. 10.1186/147121051020.
 Dettling M, Gabrielson E, Parmigiani G: Searching for differentially expressed gene combinations. Genome Biol. 2005, 6 (10): R88. 10.1186/gb2005610r88.
 Lai Y, Wu B, Chen L, Zhao H: A statistical method for identifying differential gene-gene coexpression patterns. Bioinformatics. 2004, 20 (17): 3146-3155. 10.1093/bioinformatics/bth379.
 Dawson JA, Ye S, Kendziorski C: R/EBcoexpress: an empirical Bayesian framework for discovering differential coexpression. Bioinformatics. 2012, 28 (14): 1939-1940. 10.1093/bioinformatics/bts268.
 Li KC: Genome-wide coexpression dynamics: theory and application. Proc Natl Acad Sci USA. 2002, 99: 16875-16880. 10.1073/pnas.252466999.
 Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The gene ontology consortium. Nat Genet. 2000, 25: 25-29. 10.1038/75556.
 Ackermann M, Strimmer K: A general modular framework for gene set enrichment analysis. BMC Bioinformatics. 2009, 10: 47. 10.1186/147121051047.
 Dinu I, Potter JD, Mueller T, Liu Q, Adewale AJ, Jhangri GS, Einecke G, Famulski KS, Halloran P, Yasui Y: Gene-set analysis and reduction. Brief Bioinform. 2009, 10: 24-34.
 Emmert-Streib F, Glazko G: Pathway analysis of expression data: deciphering functional building blocks of complex diseases. PLoS Comput Biol. 2011, 7 (5): e1002053. 10.1371/journal.pcbi.1002053.
 Khatri P, Sirota M, Butte AJ: Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol. 2012, 8 (2): e1002375. 10.1371/journal.pcbi.1002375.
 Liu Q, Dinu I, Adewale A, Potter J, Yasui Y: Comparative evaluation of gene-set analysis methods. BMC Bioinformatics. 2007, 8: 431. 10.1186/147121058431.
 Goeman J, Buhlmann P: Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics. 2007, 23 (8): 980-7. 10.1093/bioinformatics/btm051.
 Huang DW, Sherman BT, Lempicki RA: Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 2009, 37: 1-13. 10.1093/nar/gkn923.
 Mootha V, Lindgren C, Eriksson KF, et al: PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet. 2003, 34: 267-273. 10.1038/ng1180.
 Efron B, Tibshirani R: On testing the significance of sets of genes. Ann Appl Stat. 2007, 1: 107-129. 10.1214/07AOAS101.
 Dørum G, Snipen L, Solheim M, Sæbø S: Rotation testing in gene set enrichment analysis for small direct comparison experiments. Stat Appl Genet Mol Biol. 2009, 8: 34.
 Luo W, Friedman M, Shedden K, Hankenson K, Woolf P: GAGE: generally applicable gene set enrichment for pathway analysis. BMC Bioinformatics. 2009, 10: 161. 10.1186/1471210510161.
 Kim SY, Volsky D: PAGE: Parametric Analysis of Gene Set Enrichment. BMC Bioinformatics. 2005, 6: 144. 10.1186/147121056144.
 Newton M, Quintana F, den Boon J, et al: Random-set methods identify distinct aspects of the enrichment signal in gene-set analysis. Ann Appl Stat. 2007, 1: 85-106. 10.1214/07AOAS104.
 Freudenberg JM, Sivaganesan S, Phatak M, Shinde K, Medvedovic M: Generalized random set framework for functional enrichment analysis using primary genomics datasets. Bioinformatics. 2011, 27 (1): 70-7. 10.1093/bioinformatics/btq593.
 Irizarry RA, Wang C, Zhou Y, Speed TP: Gene set enrichment analysis made simple. Stat Methods Med Res. 2009, 18 (6): 565-575. 10.1177/0962280209351908.
 Lange K: Numerical Analysis for Statisticians. Statistics and Computing. 2010, Springer
 Pyeon D, Newton MA, Lambert PF, den Boon JA, Sengupta S, Marsit CJ, Woodworth CD, Connor JP, Haugen TH, Smith EM, Kelsey KT, Turek LP, Ahlquist P: Fundamental differences in cell cycle deregulation in human Papillomavirus-positive and human Papillomavirus-negative head/neck and cervical cancers. Cancer Res. 2007, 67 (10): 4605-4619. 10.1158/00085472.CAN063619.
 Sheskin DJ: Handbook of Parametric and Nonparametric Statistical Procedures. 3rd edition. 2004, Boca Raton: CRC Press
 Tian L, Greenberg SA, Kong SW, Altschuler J, Kohane IS, Park PJ: Discovering statistically significant pathways in expression profiling studies. Proc Natl Acad Sci USA. 2005, 102 (38): 13544-13549. 10.1073/pnas.0506577102.
 Jiang Z, Gentleman R: Extensions to gene set enrichment. Bioinformatics. 2007, 23 (3): 306-313. 10.1093/bioinformatics/btl599.
 Dinu I, Potter JD, Mueller T, Liu Q, Adewale AJ, Jhangri GS, Einecke G, Famulski KS, Halloran P, Yasui Y: Improving gene set analysis of microarray data by SAM-GS. BMC Bioinformatics. 2007, 8: 242. 10.1186/147121058242.
 Goeman JJ, van de Geer SA, de Kort F, van Houwelingen HC: A global test for groups of genes: testing association with a clinical outcome. Bioinformatics. 2004, 20: 93-99. 10.1093/bioinformatics/btg382.
 Hummel M, Meister R, Mansmann U: GlobalANCOVA: exploration and assessment of gene group effects. Bioinformatics. 2008, 24: 78-85. 10.1093/bioinformatics/btm531.
 Lu Y, Liu P, Xiao P, Deng H: Hotelling’s T2 multivariate profiling for detecting differential expression in microarrays. Bioinformatics. 2005, 21 (14): 3105-3113. 10.1093/bioinformatics/bti496.
 Kong S, Pu W, Park P: A multivariate approach for integrating genome-wide expression data and biological knowledge. Bioinformatics. 2006, 22 (19): 2373-2380. 10.1093/bioinformatics/btl401.
 Tsai C, Chen J: Multivariate analysis of variance test for gene set analysis. Bioinformatics. 2009, 25 (7): 897-903. 10.1093/bioinformatics/btp098.
 Xiong H: Nonlinear tests for identifying differentially expressed genes or genetic networks. Bioinformatics. 2006, 22 (8): 919-923. 10.1093/bioinformatics/btl034.
 Klebanov L, Glazko G, Salzman P, Yakovlev A, Xiao Y: A multivariate extension of the gene set enrichment analysis. J Bioinform Comput Biol. 2007, 5 (5): 1139-1153. 10.1142/S0219720007003041.
 Yates P, Reimers M: RCMAT: a regularized covariance matrix approach to testing gene sets. BMC Bioinformatics. 2009, 10: 300. 10.1186/1471210510300.
 Draghici S, Khatri P, Tarca AL, Amin K, Done A, Voichita C, Georgescu C, Romero R: A systems biology approach for pathway level analysis. Genome Res. 2007, 17 (10): 1537-1545. 10.1101/gr.6202607.
 Tarca AL, Draghici S, Khatri P, Hassan SS, Mittal P, Kim JS, Kim CJ, Kusanovic JP, Romero R: A novel signaling pathway impact analysis. Bioinformatics. 2009, 25: 75-82. 10.1093/bioinformatics/btn577.
 Thomas R, Gohlke JM, Stopper GF, Parham FM, Portier CJ: Choosing the right path: enhancement of biologically relevant sets of genes or proteins using pathway structure. Genome Biol. 2009, 10 (4): R44. 10.1186/gb2009104r44.
 Vaske CJ, Benz SC, Sanborn JZ, Earl D, Szeto C, Zhu J, Haussler D, Stuart JM: Inference of patient-specific pathway activities from multidimensional cancer genomics data using PARADIGM. Bioinformatics. 2010, 26 (12): i237-i245.
 Massa M, Chiogna M, Romualdi C: Gene set analysis exploiting the topology of a pathway. BMC Syst Biol. 2010, 4: 121.
 Glazko G, Emmert-Streib F: Unite and conquer: univariate and multivariate approaches for finding differentially expressed gene sets. Bioinformatics. 2009, 25 (18): 2348-2354. 10.1093/bioinformatics/btp406.
 Ledoit O, Wolf M: Improved estimation of the covariance matrix of stock returns with an application to portfolio selection. J Empir Finance. 2003, 10: 603-621. 10.1016/S09275398(03)000070.
 Ledoit O, Wolf M: A well conditioned estimator for large-dimensional covariance matrices. J Multiv Anal. 2004, 88: 365-411. 10.1016/S0047259X(03)000964.
 Ledoit O, Wolf M: Honey, I shrunk the sample covariance matrix. J Portfolio Manage. 2004, 30: 110-119. 10.3905/jpm.2004.110.
 Schäfer J, Strimmer K: A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Stat Appl Genet Mol Biol. 2005, 4: 32.
 Kanehisa M, Goto S: KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000, 28: 27-30. 10.1093/nar/28.1.27.
 Lauritzen S: Graphical Models. 1996, New York: Oxford Science Publications, Clarendon Press
 Matthews L, Gopinath G, Gillespie M, Caudy M, Croft D, de Bono B, Garapati P, Hemish J, Hermjakob H, Jassal B, Kanapin A, Lewis S, Mahajan S, May B, Schmidt E, Vastrik I, Wu G, Birney E, Stein L, D’Eustachio P: Reactome knowledgebase of human biological pathways and processes. Nucleic Acids Res. 2009, 37 (suppl 1): D619-D622.
 Klebanov L, Jordan C, Yakovlev A: A new type of stochastic dependence revealed in gene expression data. Stat Appl Genet Mol Biol. 2006, 5 (05/11): Article 7.
 Lim J, Kim J, Kim B: An alternative model of type A dependence in a gene set of correlated genes. Stat Appl Genet Mol Biol. 2010, 9: Article 12.
 Tripathi S, Emmert-Streib F: Assessment method for a power analysis to identify differentially expressed pathways. PLoS ONE. 2012, 7 (5): e37510. 10.1371/journal.pone.0037510.
 Emmert-Streib F: The chronic fatigue syndrome: a comparative pathway analysis. J Comput Biol. 2007, 14 (7): 961-972. 10.1089/cmb.2007.0041.
 Choi Y, Kendziorski C: Statistical methods for gene set coexpression analysis. Bioinformatics. 2009, 25 (21): 2780-2786. 10.1093/bioinformatics/btp502.
 Cho SB, Kim J, Kim JH: Identifying setwise differential coexpression in gene expression microarray data. BMC Bioinformatics. 2009, 10: 109. 10.1186/1471210510109.
 Tesson BM, Breitling R, Jansen RC: DiffCoEx: a simple and sensitive method to find differentially coexpressed gene modules. BMC Bioinformatics. 2010, 11: 497. 10.1186/1471210511497.
 Altay G, Asim M, Markowetz F, Neal DE: Differential C3NET reveals disease networks of direct physical interactions. BMC Bioinformatics. 2011, 12: 296. 10.1186/1471210512296.
 Watkinson J, Wang X, Zheng T, Anastassiou D: Identification of gene interactions associated with disease from gene expression data using synergy networks. BMC Syst Biol. 2008, 2: 10. 10.1186/17520509210.
 Bunke H: What is the distance between graphs?. Bull EATCS. 1983, 20: 35-39.
 Fuite J, Vernon S, Broderick G: Neuroendocrine and immune network remodeling in chronic fatigue syndrome: an exploratory analysis. Genomics. 2008, 92: 393-399. 10.1016/j.ygeno.2008.08.008.
 Wang YC, Lan CY, Hsieh WP, Murillo L, Agabian N, Chen BS: Global screening of potential Candida albicans biofilm-related transcription factors via network comparison. BMC Bioinformatics. 2010, 11: 53. 10.1186/147121051153.
 Gill R, Datta S, Datta S: A statistical framework for differential network analysis from microarray data. BMC Bioinformatics. 2010, 11: 95. 10.1186/147121051195.
 Altay G, Emmert-Streib F: Inferring the conservative causal core of gene regulatory networks. BMC Syst Biol. 2010, 4: 132. 10.1186/175205094132.
 Altay G, Emmert-Streib F: Structural influence of gene networks on their inference: analysis of C3NET. Biol Direct. 2011, 6: 31. 10.1186/17456150631.
 Dempster A: Covariance selection. Biometrics. 1972, 28: 157-175. 10.2307/2528966.
 Koller D, Friedman N: Probabilistic Graphical Models: Principles and Techniques. 2009, Cambridge: The MIT Press
 Whittaker J: Graphical Models in Applied Multivariate Statistics. 1990, Chichester: Wiley
 Li H, Gui J: Gradient directed regularization for sparse Gaussian concentration graphs, with applications to inference of genetic networks. Biostatistics. 2006, 7 (2): 302-317.
 Schäfer J, Strimmer K: An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics. 2005, 21: 754-764. 10.1093/bioinformatics/bti062.
 Wille A, Zimmermann P, Vranova E, Furholz A, Laule O, Bleuler S, Hennig L, Prelic A, von Rohr P, Thiele L, Zitzler E, Gruissem W, Buhlmann P: Sparse graphical Gaussian modeling of the isoprenoid gene network in Arabidopsis thaliana. Genome Biol. 2004, 5 (11): R92. 10.1186/gb2004511r92.
 Fan J, Feng Y, Wu Y: Network exploration via the adaptive lasso and SCAD penalties. Ann Appl Stat. 2009, 3 (2): 521-541. 10.1214/08AOAS215.
 Friedman J, Hastie T, Tibshirani R: Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008, 9 (3): 432-441. 10.1093/biostatistics/kxm045.
 Kiiveri H, de Hoog F: Fitting very large sparse Gaussian graphical models. Comput Stat & Data Anal. 2012, 56 (9): 2626-2636. 10.1016/j.csda.2012.02.007.
 Witten DM, Friedman JH, Simon N: New insights and faster computations for the graphical Lasso. J Comput Graphical Stat. 2011, 20 (4): 892-900. 10.1198/jcgs.2011.11051a.
 Zou H: The adaptive Lasso and its oracle properties. J Am Stat Assoc. 2006, 101 (476): 1418-1429. 10.1198/016214506000000735.
 Dudoit S, van der Laan M: Multiple Testing Procedures with Applications to Genomics. 2007, New York: Springer
 Dudoit S, van der Laan M, Pollard K: Multiple testing. part I. singlestep procedures for control of general type I error rates. Stat Appl Genet Mol Biol. 2004, 3: 13.
 Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc, Ser B (Methodological). 1995, 57: 125-133.
 Storey J: A direct approach to false discovery rates. J R Stat Soc, Ser B. 2002, 64: 479-498. 10.1111/14679868.00346.
 Aubert J, Bar-Hen A, Daudin J, Robin S: Determination of the differentially expressed genes in microarray experiments using local FDR. BMC Bioinformatics. 2004, 5: 125. 10.1186/147121055125.
 Efron B: Correlation and large-scale simultaneous significance testing. J Am Stat Assoc. 2007, 102 (477): 93-103. 10.1198/016214506000001211.
 Pounds S, Morris SW: Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of p-values. Bioinformatics. 2003, 19 (10): 1236-1242. 10.1093/bioinformatics/btg148.
 Benjamini Y, Yekutieli D: The control of the false discovery rate in multiple testing under dependency. Ann Stat. 2001, 29 (4): 1165-1188. 10.1214/aos/1013699998.
 Efron B: Size, power and false discovery rates. Ann Stat. 2007, 35 (4): 1351-1377. 10.1214/009053606000001460.
 Storey J, Tibshirani R: Statistical significance for genome-wide studies. Proc Natl Acad Sci USA. 2003, 100 (16): 9440-9445. 10.1073/pnas.1530509100.
 Hung JH, Yang TH, Hu Z, Weng Z, DeLisi C: Gene set enrichment analysis: performance evaluation and usage guidelines. Brief Bioinform. 2012, 13 (3): 281-291. 10.1093/bib/bbr049.
 Nam D, Kim S: Gene-set approach for expression pattern analysis. Brief Bioinform. 2008, 9 (3): 189-197. 10.1093/bib/bbn001.
 Weinberg RA: The Biology of Cancer. 2007, New York: Garland Science
 Leek JT, Storey JD: Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 2007, 3 (9): e161. 10.1371/journal.pgen.0030161.
 McClintick JN, Edenberg HJ: Effects of filtering by Present call on analysis of microarray experiments. BMC Bioinformatics. 2006, 7: 49. 10.1186/14712105749.
 Bourgon R, Gentleman R, Huber W: Independent filtering increases detection power for high-throughput experiments. Proc Natl Acad Sci USA. 2010, 107 (21): 9546-9551. 10.1073/pnas.0914005107.
 Carter GW: Inferring network interactions within a cell. Brief Bioinform. 2005, 6 (4): 380-389. 10.1093/bib/6.4.380.
 Perou CM, Sorlie T, Eisen MB, Van De Rijn M, Jeffrey SS, Rees CA, Pollack JR, Ross DT, Johnsen H, Akslen LA, et al: Molecular portraits of human breast tumours. Nature. 2000, 406 (6797): 747-752. 10.1038/35021093.
 Gerstung M, Eriksson N, Lin J, Vogelstein B, Beerenwinkel N: The temporal order of genetic and pathway alterations in tumorigenesis. PLoS ONE. 2011, 6 (11): e27136. 10.1371/journal.pone.0027136.
 Jones S, Zhang X, Parsons DW, Lin JC, Leary RJ, Angenendt P, Mankoo P, Carter H, Kamiyama H, Jimeno A, et al: Core signaling pathways in human pancreatic cancers revealed by global genomic analyses. Science. 2008, 321 (5897): 1801-1806. 10.1126/science.1164368.
 Wood LD, Parsons DW, Jones S, Lin J, Sjöblom T, Leary RJ, Shen D, Boca SM, Barber T, Ptak J, et al: The genomic landscapes of human breast and colorectal cancers. Science. 2007, 318 (5853): 1108-1113. 10.1126/science.1145720.
 Vogelstein B, Kinzler KW: Cancer genes and the pathways they control. Nat Med. 2004, 10 (8): 789-799. 10.1038/nm1087.
 Lazebnik Y: What are the hallmarks of cancer?. Nat Rev Cancer. 2010, 10 (4): 232-233. 10.1038/nrc2827.
 Mukherjee S: The Emperor of All Maladies: A Biography of Cancer. 2011, London: Fourth Estate
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.