Unexpected links reflect the noise in networks

Background Gene covariation networks are commonly used to study biological processes. The inference of gene covariation networks from observational data can be challenging, especially considering the large number of players involved and the small number of biological replicates available for analysis. Results We propose a new statistical method for estimating the number of erroneous edges in reconstructed networks that strongly enhances commonly used inference approaches. This method is based on a special relationship between sign of correlation (positive/negative) and directionality (up/down) of gene regulation, and allows for the identification and removal of approximately half of all erroneous edges. Using the mathematical model of Bayesian networks and positive correlation inequalities we establish a mathematical foundation for our method. Analyzing existing biological datasets, we find a strong correlation between the results of our method and false discovery rate (FDR). Furthermore, simulation analysis demonstrates that our method provides a more accurate estimate of network error than FDR. Conclusions Thus, our study provides a new robust approach for improving reconstruction of covariation networks. Reviewers This article was reviewed by Eugene Koonin, Sergei Maslov, Daniel Yasumasa Takahashi. Electronic supplementary material The online version of this article (doi:10.1186/s13062-016-0155-0) contains supplementary material, which is available to authorized users.


Introducing the concept of unexpected correlations
It is quite common, especially in biology, that in order to understand how a system transitions from one state to another (e.g. from health to disease), scientists compare how parameters such as gene expression, protein levels, or metabolite abundances differ between these states. One result of such a comparison is a list of parameters that are up- or downregulated (that is, some numerical value attributed to the parameter has either increased or decreased) from the first state to the second. The parameters are not regulated independently of each other; rather, they make up regulatory networks, each with a limited number of key drivers that govern the transition. A common approach to the reconstruction of regulatory network structure is the inference of a correlation network built from these parameters. In particular, correlation (or, for the purposes of this paper, co-variation) networks are widely used in gene expression analysis (see, for example, Butte et al., 2000; Opgen-Rhein and Strimmer, 2007, and references within). Any covariation network inference implies that every edge in the network (corresponding to a correlation between parameters/nodes) is an empirical result of either direct or indirect causal relationships, unless the edge is erroneously drawn. The primary question that drove this study was thus whether the causal nature of gene expression networks has any specific implication for their structure and organization. Furthermore, if such a relation between causality and network structure exists, we ask whether it can be used to improve gene network analysis.
In order to address this question we look to basic principles connecting correlation and causality. Causal effects have to follow Reichenbach's principles (Reichenbach, 1956; Pearl, 2009), which, in the example at hand, imply that if there is a correlation between the expression levels of two genes $g_1$ and $g_2$, and it is not a statistical artifact, at least one of three statements must hold: 1) $g_1$ regulates $g_2$; 2) $g_2$ regulates $g_1$; or 3) there is a common cause (perhaps another gene, $g_3$) that regulates (directly or indirectly) both $g_1$ and $g_2$ (Figure 1). Thus, in the particular situation under discussion, namely a system with two equilibrium states and two types of regulation (stimulation and inhibition), we propose a scheme in which the sign (positive or negative) of a correlation coefficient is associated with the direction of regulation of the correlated genes. Sign association follows a simple set of rules:
• If there is a correlation between two genes that are both "up" regulated or both "down" regulated, the sign associated with the link is positive.
• If there is a correlation between an "up" regulated gene and a "down" regulated gene, the sign associated with the link is negative.
We hypothesize that correlations whose sign disagrees with that associated with the corresponding link are erroneous (i.e. the result of noise or statistical error rather than causal relationships). We will hereafter call such correlations unexpected, and their rough proportion we abbreviate as PUC (the Proportion of Unexpected Correlations).
The fundamental reasoning motivating this hypothesis is that regulation mechanisms in biological systems (as well as in many other systems) are not generally a function of biological state. Though gene expression levels in a cancerous cell may differ from those in a healthy cell, gene function and regulation schemes in most cases remain constant. The differences in gene expression levels between two biological states should reflect the nature of their regulatory pathways. We expect positively correlated genes to increase or decrease in expression together, and negatively correlated ones to be regulated in opposite directions (i.e. up/down or down/up). Note that correlations are evaluated within each state independently, while differences in gene expression are evaluated between the two states. A deviation from this behavior suggests that a particular correlation between the expression levels of two genes is NOT due to a causal link.
A straightforward way to empirically test whether, as we hypothesize, unexpected correlations are erroneous is to analyze some real-world data and compare PUC, which we believe to be a measure of error in a correlation network, to a standard measure of network error, the false discovery rate (FDR) (Benjamini and Hochberg, 1995). For a proof-of-concept comparison, we used gene expression data from our recently published paper on network analysis in cervical cancer (Mine and Shulzhenko et al., 2013).
We felt that this network should provide excellent real data with which to test our prediction, as it was constructed from a robust meta-analysis of five cancer gene expression datasets and thus validated by large, independent datasets. To our great satisfaction and some surprise, under an FDR threshold of 5% we observed an identical PUC of 5% in this gene expression network (Figure 2; see section I.1. of the supporting material and Figure S1).
The fact that we observed similar levels of unexpected correlations and of erroneous edges in the network reconstructed from cervical cancer data suggests that this observation can be extrapolated to the whole field of gene-gene regulation and that PUC can potentially be used as a measure of error.
Encouraged by this result, and to better understand the properties of this new metric (PUC), we went further to establish a mathematical framework for its application. Indeed, although the concept of PUC can be formulated and tested empirically without mathematical theory, a rigorous mathematical formalization of PUC is necessary for its establishment as a widely applicable and powerful method of analysis.

Mathematical formalism relating causation and the sign of correlation
Our hypothesis that unexpected correlations are erroneous can be rigorously proven for systems that transit between two stable states with two types of relations between parameters: stimulation and inhibition. Herein, we provide a proof of our hypothesis in the domain of Bayesian networks (Pearl, 2009) with two equilibrium states and linear dependences between nodes (see proof for more general case in Supplementary Material, section II.2). In order to formulate our results we need to introduce some mathematical notation.
Consider some regulatory network, directed and without loops (i.e. a directed acyclic graph, DAG), represented by a graph $G = (V, E)$. Any edge $e \in E$ is an oriented pair of vertices (nodes) $e = (u, v) \in V^2$. The orientation of an edge represents the direction of causality in a regulatory network (that is, an orientation $(u, v)$ implies that $u$ regulates $v$). For any node $v$ we associate the set of its parents as $pa(v) := \{u \in V : (u, v) \in E\}$. We define the set of grandfathers $gr(G)$ for the graph $G$ as the set of all nodes without parents: $gr(G) := \{v \in V : pa(v) = \emptyset\}$. The graph $G$ is a weighted graph, meaning that every edge $e = (u, v) \in E$ has a label (weight) $w_e \in \mathbb{R}$. With any node $v \in V$ we associate a random variable $X_v$. The distribution of the random variables is given by their respective structural linear equations $X_v = \sum_{u \in pa(v)} w_{(u,v)} X_u + \varepsilon_v$, where the $\varepsilon_v$ are mutually independent and identically distributed with mean 0 and variance $\sigma^2$.
In the previously discussed biological framework, the graph $G$ represents the entire gene expression network. A node $v$ represents some gene, which has an expression level $X_v$. An edge $e = (u, v)$ represents a causal link between two genes $u$ and $v$ in which the expression of $v$ is regulated by $u$. The sign of $w_e$ reflects the direction of regulation: negative sign and positive sign correspond to inhibition and stimulation, respectively. The parents of $v$ are simply all genes which regulate $v$, and the grandfathers of $G$ are the primary regulators of the entire network, the genes at the top of the regulatory chain.
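The structural equations above can be sampled directly. The following is a minimal sketch, assuming Gaussian noise and a Gaussian grandfather whose mean defines the state; the function name `simulate_linear_dag`, the parameters `root_mean` and `root_sd`, and the example weights are all hypothetical choices for illustration and are not taken from the simulations reported in this paper.

```python
import numpy as np

def simulate_linear_dag(weights, order, root_mean, root_sd, sigma, n_samples, rng):
    """Draw samples from X_v = sum_{u in pa(v)} w_(u,v) * X_u + eps_v.

    weights   : dict mapping an edge (u, v) to its weight w_(u,v)
    order     : nodes in topological order, grandfather first
    root_mean : mean of the grandfather in the simulated state (assumption)
    root_sd   : standard deviation of the grandfather (assumption)
    sigma     : standard deviation of the intrinsic noise eps_v
    """
    samples = {}
    for v in order:
        # Parents of v are the sources of all edges that point at v.
        parents = [(u, w) for (u, target), w in weights.items() if target == v]
        if not parents:
            # A node without parents is a grandfather; it carries the state.
            samples[v] = rng.normal(root_mean, root_sd, size=n_samples)
            continue
        noise = rng.normal(0.0, sigma, size=n_samples)
        samples[v] = sum(w * samples[u] for u, w in parents) + noise
    return samples

# Hypothetical network: g stimulates a, g inhibits b, a stimulates c.
weights = {("g", "a"): 1.5, ("g", "b"): -0.5, ("a", "c"): 2.0}
order = ["g", "a", "b", "c"]
rng = np.random.default_rng(0)
state_1 = simulate_linear_dag(weights, order, 0.0, 1.0, 0.3, 500, rng)
state_2 = simulate_linear_dag(weights, order, 2.0, 1.0, 0.3, 500, rng)
```

The two calls differ only in the grandfather mean, which mirrors the idea that regulation (the weights) stays fixed while the state of the system changes.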
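For simplicity, we consider a regulatory network with only one grandfather ($|gr(G)| = 1$), denoted by the vertex $g$. Let $X_v^{(s)}$ denote the expression of node $v$ in state $s \in \{1, 2\}$, and let $\Delta X_v := \mathbb{E}[X_v^{(1)}] - \mathbb{E}[X_v^{(2)}]$ denote the change in the mean of $X_v$ between the two states. The mathematical definition of expected and unexpected links, given heuristically in the introduction, is formally expressed in the following way:
Definition. An edge $e \in E$ is called an expected link between nodes $u, v \in V$ if and only if $\mathrm{sign}(\Delta X_u \, \Delta X_v) = \mathrm{sign}\,\mathrm{cov}(X_u^{(s)}, X_v^{(s)})$ for each state $s \in \{1, 2\}$. Any edge which is not an expected link constitutes an unexpected link.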
This definition states that the directions of regulation of two genes between two states should agree with the sign of the correlation between them within each state.
It is straightforward to prove the following lemma (proven in section II.1 of the supporting material): Lemma 1. For any finite DAG with linear structural equations there exists some $\sigma_0^2$ such that for any variance $\sigma^2 < \sigma_0^2$ there are no unexpected links in the graph.
Lemma 1 implies that in regulatory networks unexpected correlations must have appeared as a result of noise within the network. Thus, the proportion of unexpected correlations reflects the noise level in a network.
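As a hedged illustration of Lemma 1, consider a hypothetical three-node network that is not taken from the data analysed here: a grandfather $g$ stimulates $a$ with weight $1$ and $b$ with weight $2$, and $a$ inhibits $b$ with weight $-1$. The grandfather variance $\sigma_g^2$ and the state means $\mu_1 \neq \mu_2$ of $X_g$ are notation assumed for this sketch, and $X_g$ is assumed independent of the noise terms.

```latex
\begin{align*}
X_a &= X_g + \varepsilon_a, \\
X_b &= 2X_g - X_a + \varepsilon_b = X_g - \varepsilon_a + \varepsilon_b, \\
\Delta X_a &= \Delta X_b = \mu_1 - \mu_2
  \;\Rightarrow\; \operatorname{sign}(\Delta X_a \, \Delta X_b) = +1, \\
\operatorname{cov}(X_a, X_b) &= \operatorname{var}(X_g) - \operatorname{var}(\varepsilon_a)
  = \sigma_g^2 - \sigma^2 .
\end{align*}
```

In this example the link between $a$ and $b$ is expected exactly when $\sigma^2 < \sigma_g^2$, so $\sigma_0^2 = \sigma_g^2$ plays the role of the threshold in the lemma. Above that threshold, the noise propagated from $a$ to $b$ is enough on its own to flip the sign of the covariance.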
As a side note, the linear relations between variables can be generalized by the expression $X_v = f_v(\{X_u\}_{u \in pa(v)}; \varepsilon_v)$, where $f_v$ is some monotonic function of its variables and $\varepsilon_v$ is the internal network noise. If the functions $f_v$ are not linear but monotonic, then the lemma still holds.

Unexpected correlations reflect the noise in real and simulated networks.
Mathematical models are restricted by the domain of their assumptions, which may sometimes correspond to only a fraction of real world situations, making them exceedingly limited in applicability. Thus, although we have empirically observed an appropriately small PUC at a low FDR threshold in cervical cancer data, we wanted to verify whether this correspondence would still hold in the gene regulation of an entirely different biological process.
For this we chose a more mundane physiological process than cancer: we analyzed the gene expression network perturbed as a result of colonization of intestinal tissue with normal microbiota (i.e. the mix of microorganisms that live in the gut). In these data, we again found that a low FDR threshold corresponds to a low PUC. Furthermore, PUC is highly correlated with FDR ( Figure 2), which provides additional support for our prediction that PUC, similarly to FDR, quantitatively reflects network error.
An important question, however, is whether PUC brings any advantage over the standard approach to measuring the proportion of erroneous edges in a reconstructed regulation network (i.e. FDR). Real data makes such a comparison difficult because though both methods of analysis will return values for network error, there is not necessarily any obvious way to determine which is more accurate; i.e. in real data, the "correct" level of network error is not known.
The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license. This version posted November 15, 2013; https://doi.org/10.1101/000497 (bioRxiv preprint).
To investigate the behavior of PUC in a "controlled environment" we simulated Bayesian networks as a model of gene regulation. We define as "true error" any correlation found between the nodes of disjoint, independent networks (Figure 3a).
In order to determine which method (FDR or PUC) better quantifies error, we look at all three measures of error (FDR, PUC, and the true error) and compare the accuracies of FDR and PUC relative to true error (Figure 3b). Simulation results demonstrate that PUC is more accurate than FDR in estimating true error.
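The "true error" scheme can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the simulation code used for Figure 3: each network is a simple chain with unit weights and standard normal noise, and an edge is drawn when the absolute sample correlation exceeds a fixed cut-off `r_threshold`, which stands in here for a p-value threshold. The function name is hypothetical.

```python
import numpy as np

def true_error_two_networks(n_nodes, n_samples, r_threshold, rng):
    """Fraction of reconstructed edges that join two independent networks."""
    # Each network is a chain X_k = X_(k-1) + eps_k with unit weights (assumption).
    net_a = np.cumsum(rng.normal(size=(n_samples, n_nodes)), axis=1)
    net_b = np.cumsum(rng.normal(size=(n_samples, n_nodes)), axis=1)
    data = np.hstack([net_a, net_b])

    # Correlation between every pair of nodes; columns are nodes.
    corr = np.corrcoef(data, rowvar=False)
    i, j = np.triu_indices(2 * n_nodes, k=1)

    # An edge is drawn when |r| passes the cut-off (stand-in for a p-value test).
    is_edge = np.abs(corr[i, j]) >= r_threshold
    # Pairs with one node in each network are erroneous by construction.
    crosses = (i < n_nodes) & (j >= n_nodes)

    n_edges = int(is_edge.sum())
    n_false = int((is_edge & crosses).sum())
    return n_edges, (n_false / n_edges if n_edges else float("nan"))

# Many samples: spurious cross-network edges are essentially absent.
n_edges, err = true_error_two_networks(5, 2000, 0.3, np.random.default_rng(1))
# Few samples: noise alone can push cross-network pairs over the cut-off.
n_edges_small, err_small = true_error_two_networks(5, 8, 0.3, np.random.default_rng(1))
```

Because the cross-network pairs are known to be independent, the returned fraction is a ground truth against which an estimate such as FDR or PUC could be compared.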
It is known that FDR is an overly conservative approach (i.e. it overestimates the number of false positives) in cases when the hypotheses of an analysis are interdependent. In the case of regulatory networks, each edge constitutes a hypothesis; interdependency of regulatory network hypotheses manifests in indirect regulation between genes. Indeed, this is exactly the case with co-variation networks, in which it is possible to find numerous indirect pathways with only a few direct links. Using PUC as a measure of error, however, does not require any assumption of hypothesis independence. PUC may thus be more applicable than FDR for reconstruction of networks with a large number of interconnected nodes. The degree of dependency between hypotheses also depends on the size and number of sub-networks that compose a network. A network made up of ten sub-networks consisting of ten nodes each should have a lower degree of hypothesis interdependency than a single network consisting of one hundred nodes lacking any well-defined sub-networks. PUC may thus similarly be more applicable than FDR for analyzing networks with a large edge density. In agreement with these presumptions, we found in simulation analyses that FDR initially provides an accurate estimate of real false positives for small networks (approximately 20-50 nodes, Figure 3c), but diverges from true error as the sizes of networks grow.
Figure 3: (a) In order to compare the effectiveness of PUC and FDR, two regulatory networks are constructed and simulated independently, and both networks' node expression levels are combined into one data set. In reconstructing a correlation network from the simulated data, any correlations between nodes from independent networks are known to be erroneous. This scheme allows for a true measure of network error against which to compare PUC and FDR analysis results. (b) Simulations suggest that PUC more accurately reflects network error than FDR as network size grows, which seems to be due to a more general mathematical feature of PUC (c).
We hypothesized that PUC reflects error independently of the size of the network. In order to test this prediction, we performed the same comparisons between the accuracies of FDR and PUC for networks of varying size. The results demonstrated that PUC is more accurate than FDR for larger networks, with the difference in accuracy becoming negligible at network sizes of approximately 20 nodes (Figure 3c).

Noise estimation and error correction.
Another very important property of PUC is that it represents approximately half of all erroneous correlations. A formal proof of this statement is given in section II.3 of the supporting material, together with an explanation of why it makes intuitive sense.
The identification of unexpected correlations has two primary impacts. Firstly, it provides a new method to estimate the proportion of erroneous links in a network. Secondly, it allows for the removal of approximately half of the erroneous edges in the network (namely, those that are unexpected), decreasing their proportion by a factor of two, thus improving the overall accuracy of the reconstructed network. The final value of network error consists of an estimated proportion of remaining false positive correlations.
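The factor of two can be traced through a short calculation. The symbols below are notation assumed for this sketch: $N$ is the number of edges within the chosen threshold, $F$ the (unknown) number of erroneous edges among them, and $U$ the number of unexpected correlations. The numbers in the last line are hypothetical and serve only to show the arithmetic.

```latex
\begin{align*}
U &\approx \tfrac{1}{2} F
  \;\Rightarrow\; \hat{F} = 2U,
  \qquad \text{error before removal} \approx \frac{2U}{N} = 2\,\mathrm{PUC}, \\
\text{error after removal} &\approx \frac{\hat{F} - U}{N - U} = \frac{U}{N - U}, \\
N = 1000,\; U = 50:&\quad \mathrm{PUC} = 0.05,\quad
  \frac{2U}{N} = 0.10,\quad \frac{U}{N - U} = \frac{50}{950} \approx 0.053 .
\end{align*}
```

Removing the unexpected edges therefore roughly halves the estimated proportion of erroneous edges, from about 10% to about 5% in this hypothetical case.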
The entire procedure for a correlation network is as follows: first, all correlations in a differential expression list are ranked by p-value. A network is constructed with edges consisting of correlations within an arbitrary p-value threshold (e.g. 0.01). Unexpected links are identified, counted, and removed from the network. The final error in the remaining network is given by $U/(N - U)$, where $U$ is the number of unexpected correlations and $N$ is the total number of correlations within the p-value threshold.
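The steps of this procedure can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `puc_filter`, its arguments, and the toy numbers are hypothetical, and the sketch takes one within-state correlation matrix with precomputed p-values, so it would be called once per state when correlations from both states are available.

```python
import numpy as np

def puc_filter(delta, corr, pvals, p_threshold=0.01):
    """Flag unexpected correlations and estimate the remaining network error.

    delta : (n,) change in mean expression between the two states, per gene
    corr  : (n, n) within-state correlation matrix
    pvals : (n, n) p-values of those correlations
    """
    delta = np.asarray(delta, dtype=float)
    corr = np.asarray(corr, dtype=float)
    pvals = np.asarray(pvals, dtype=float)

    # Each unordered gene pair is considered once (upper triangle).
    i, j = np.triu_indices(len(delta), k=1)
    keep = pvals[i, j] <= p_threshold
    i, j = i[keep], j[keep]

    # Expected sign: positive for up/up or down/down, negative for up/down.
    expected_sign = np.sign(delta[i] * delta[j])
    unexpected = np.sign(corr[i, j]) != expected_sign

    n_total = len(i)
    n_unexpected = int(unexpected.sum())
    puc = n_unexpected / n_total if n_total else float("nan")
    remaining = n_total - n_unexpected
    final_error = n_unexpected / remaining if remaining else float("nan")

    kept_edges = [(int(a), int(b)) for a, b in zip(i[~unexpected], j[~unexpected])]
    return kept_edges, puc, final_error

# Toy input: genes 0 and 1 are up-regulated, genes 2 and 3 are down-regulated.
delta = [1.0, 2.0, -1.0, -0.5]
corr = [[1.0, 0.9, -0.8, 0.7],
        [0.9, 1.0, -0.6, -0.5],
        [-0.8, -0.6, 1.0, 0.4],
        [0.7, -0.5, 0.4, 1.0]]
pvals = [[0.0, 0.001, 0.002, 0.005],
         [0.001, 0.0, 0.5, 0.001],
         [0.002, 0.5, 0.0, 0.2],
         [0.005, 0.001, 0.2, 0.0]]
edges, puc, final_error = puc_filter(delta, corr, pvals)
```

In the toy input, four pairs pass the threshold and one of them (genes 0 and 3, regulated in opposite directions but positively correlated) is unexpected, so it is dropped from the returned edge list.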

PUC in a non-biological system.
The fact that we could mathematically prove the relationship between unexpected correlations and network error suggests that this principle could be widespread beyond gene interactions in various biological systems. As a proof-of-concept of PUC's generality, we turned our attention to economics. The basis for this interest was the presumption that the economy, similarly to biology, is ruled by cause-effect relationships and, by extension, can be described with regulatory networks. We analyzed 1503 parameters (retrieved from World Bank economic databases) for the year 2008 in 193 countries in such areas as business, education, health, etc. Parameters with bimodal distributions (such as expenditure on primary education as a percent of GDP per capita) defined distinct states of economic networks for any given country. As expected, these networks also demonstrated a high concordance between the network errors given by PUC and FDR (Figure 2C, Figure S2). This result supports the idea that the concept of unexpected correlations can be extrapolated to a large variety of causal networks and that measurement of the proportion of unexpected correlations (PUC) can improve network analysis in many different fields of science.

Discussion
The growth of molecular biology has advanced such that we can measure the expression of thousands of genes simultaneously. Simply measuring the expression of multiple individual genes, however, is insufficient to describe a systems issue such as complex diseases. To relate gene expression to physiological states (e.g. disease) and other variables in an organism's environment we utilize gene expression networks. These networks enable more intelligent identification of molecular subtypes of diseases and molecular targets for treatment. The reconstruction of gene expression networks, however, is not easily accomplished. Constructing reliable gene expression networks with current methods requires obtaining large data sets and/or discarding sizeable portions of data to reduce false positive deductions.
Although the False Discovery Rate (FDR; Benjamini-Hochberg procedure, see Benjamini and Hochberg, 1995) is the most popular multiple hypothesis correction procedure, its application to network inference is conservative and makes the often unfitting assumption of independence between correlations in gene networks. There are less popular versions of FDR (for example Benjamini-Yekutieli) which take into account various dependence structures between the hypotheses under consideration, but the usage of these corrections does not demonstrate any significant advantage over PUC (data not shown). Consequently, these corrections tend to have a high false negative rate (i.e. low power) and require vast sample sizes in order to attain desirable degrees of certainty about reconstructed networks. There is thus a critical need for more powerful methods of estimation of false positive connections between genes in co-expression networks.
In this study we have revealed and mathematically proved a new feature of causal networks. This feature is based on the notion that any correlation has causal and noise components. In the case that causal components prevail over noise, the sign of a correlation between two genes should be related to the up- or down-regulation of those genes between two states (Figure 1). We proposed using this relation for identifying false connections in co-variation networks, increasing network accuracy, and estimating total network error. This approach demonstrates a clear advantage over the classic method (FDR) not only by providing better estimates of error in large reconstructed networks, but also by allowing the removal of approximately half of all erroneous edges. The fact that PUC demonstrates similar behavior to standard methods of analysis (i.e. PUC has a strong correlation with FDR) in both real and simulated Bayesian networks further supports the use of this adopted modeling approach. Indeed, certain questions can only be answered using a modeled system. We had to use simulated networks, where we know the exact number of false links, to compare FDR and PUC.
The concept of expected and unexpected correlations that we introduced is closely related to the concept of monotone causal effects and the covariance between them. The rules we proved for linear relations should therefore hold for any monotone relationships; this idea is expanded in section II.2. of the supporting material, and the framework of PUC extended to a broader class of networks than those mentioned thus far.
We must also address how non-monotonicity affects the notion and application of unexpected correlations. In our problem, non-monotonicity can be exemplified by relationships that differ between the two network states, such as a negative correlation between parameters in one biological state and a positive correlation in the other. In such cases, despite the violation of monotonicity, we expect unexpected correlations to arise primarily due to noise rather than due to the change in relationships. Moreover, we demonstrated (see section II.4. of the supporting material) that there is no evidence to suggest that the exceptionally rare non-erroneous unexpected correlations produced by non-monotonicity are in fact responsible for the observed changes in gene expression between states of a biological system. Therefore, because the ultimate goal of network inference is to model and understand the transition of a biological system from one state to another, we can safely remove these unexpected correlations from the reconstructed network for an independent reason (i.e. that they make no causal contribution to the system state transition).
We believe that this work introduces an entirely new way of dealing with error in regulatory network reconstruction. Indeed, statistical methods employed for such problems normally estimate an error, but cannot detect erroneous edges. We propose a method that besides (according to simulations, potentially superior) error estimation allows for identification and removal of approximately half of total network error. Thus, the identification and removal of unexpected correlations decreases the proportion of irrelevant and erroneous connections and strongly increases the power of network inferences.
Finally, our study provides a good example of the success of a systems approach. The collaboration between biologists and mathematicians resulted in the integration of fundamental principles of causality with real world findings (e.g. Figure 2a cervical cancer) to provide the scientific community with a powerful technique that improves the traditional task of network inference from observational data.
Supporting Material:

I.1. Statistically significant correlations between differentially expressed genes (DEGs) show expected signs
In our recent study (Nature Commun. 2013;4:1806) we have shown that key drivers of cervical carcinogenesis are located in regions of frequent chromosomal aberrations and that these genes cause most of the alteration in gene expression in cervical cancer. Therefore, in order to evaluate whether statistically significant correlations between DEGs which result from known causal relations follow our prediction, we performed the following analysis. First, we selected two groups of genes from the DEGs discovered in our previous study: 1) genes for which it has been determined that chromosomal aberrations are responsible for the change in regulation; and 2) genes located in regions in which aberrations are rare, defined by FqG - FqL between -0.1 and 0.1 (Figure S1). Next, we analyzed gene co-expression in tumor samples in order to find correlations between those two groups of DEGs. We found 626 correlated gene-gene pairs at an FDR of 5%. The results supported our hypothesis that significant correlations should have "expected" signs. Indeed, 95% (594 of 626 total pairs) of significant correlations had expected signs.

II. Theoretical basis.
Here we provide some formal definitions of concepts used in the paper and all necessary proofs. This section consists of four parts: 1) we introduce the mathematical machinery for PUC using Bayesian networks; 2) we generalize the previous formalism to handle a broader set of cases; 3) we demonstrate that PUC reflects half of total network error; and 4) we address concerns with network non-monotonicity.

II.1. PUC on Bayesian networks.
In order to apply the new concept of a noise estimator we use Bayesian networks as a convenient model for gene expression. Let $G = (V, E)$ be some network, which is a directed acyclic graph (DAG). Any edge $e \in E$ is an oriented pair of vertices $e = (u, v)$, and the direction of the edge is from the first vertex $u$ to the second vertex $v$. We assume that the graph is weighted: any edge $e = (u, v)$ has a label (weight) $w_e$, which is some real number, $w_e \in \mathbb{R}$. For any node $v$ we associate the set of parents of the node $v$: $pa(v) := \{u \in V : (u, v) \in E\}$. We define the set of grandfathers for the graph $G$: $gr(G) := \{v \in V : pa(v) = \emptyset\}$. With any node (gene) $v \in V$ we associate the random variable (gene expression) $X_v$. The random variables satisfy the following linear relations (structure equations): $X_v = \sum_{u \in pa(v)} w_{(u,v)} X_u + \varepsilon_v$ for any $v \notin gr(G)$, where the $\varepsilon_v$ are i.i.d. random variables (intrinsic noise) with mean 0 and variance $\sigma^2$. Moreover, for simplicity we suppose that there exists only one grandfather, $|gr(G)| = 1$, and denote it as the vertex $g$.
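A path $\pi(u, v)$ of length $n$ from a vertex $u$ to a vertex $v$ is a sequence of edges $e_i = (v_i, v_{i+1})$, $i = 0, \dots, n - 1$, with $v_0 = u$ and $v_n = v$. The weight of the path $w(\pi(u, v))$ is the product of the weights of the edges from this path: $w(\pi(u, v)) = \prod_{i=0}^{n-1} w_{e_i}$. Let $\Pi(u, v)$ be the set of all paths connecting nodes $u$ and $v$. And let $W(u, v) := \sum_{\pi \in \Pi(u, v)} w(\pi)$, with the convention $W(v, v) = 1$. Writing $X_v^{(s)}$ for the expression of node $v$ in state $s \in \{1, 2\}$, $\mu_s$ for the mean of the grandfather $X_g^{(s)}$ and $\sigma_g^2$ for its variance, the structure equations give
$X_v^{(s)} = W(g, v) X_g^{(s)} + \sum_{k \neq g} W(k, v) \varepsilon_k$, (6)
$\Delta X_v = \mathbb{E}[X_v^{(1)}] - \mathbb{E}[X_v^{(2)}] = W(g, v)(\mu_1 - \mu_2)$, (7)
$\mathrm{cov}(X_u^{(s)}, X_v^{(s)}) = W(g, u) W(g, v) \sigma_g^2 + \sigma^2 \sum_{k \neq g} W(k, u) W(k, v)$. (8)
A link between $u$ and $v$ is expected if and only if
$\mathrm{sign}(\Delta X_u \, \Delta X_v) = \mathrm{sign}\,\mathrm{cov}(X_u^{(s)}, X_v^{(s)})$ for each $s \in \{1, 2\}$, (9)
and unexpected otherwise.
Lemma. For any finite DAG with linear structural equations there exists some $\sigma_0^2$ such that for any variance $\sigma^2 < \sigma_0^2$ there are no unexpected links in the graph.
Proof. Direct from formulas (7), (8). By definition (9) and by representations (7), (8) we have $\mathrm{sign}(\Delta X_u \, \Delta X_v) = \mathrm{sign}(W(g, u) W(g, v))$, which is also the sign of the first term in (8). The second sum in (8) can be made as small as desired by decreasing $\sigma^2$. This proves the Lemma.
Formula (8) shows that any link/correlation between two nodes in a network can be represented as a sum of two parts, causal propagation from the causal node and noise propagation:
$\mathrm{cov}(X_u^{(s)}, X_v^{(s)}) = \underbrace{W(g, u) W(g, v) \sigma_g^2}_{\text{causal propagation}} + \underbrace{\sigma^2 \sum_{k \neq g} W(k, u) W(k, v)}_{\text{noise propagation}}$. (10)
Here, it is easy to see that if the grandfather variance $\sigma_g^2$ increases, then the causal propagation will determine the sign of the covariance after some threshold; that is, it determines whether a link is expected or unexpected.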
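The path weights in this section can be computed without enumerating paths. The sketch below relies on a standard fact rather than on anything stated in this paper: for a DAG with weighted adjacency matrix $A$, where $A_{uv}$ is the weight of edge $(u, v)$, the sum over all paths from $u$ to $v$ of the product of edge weights (written $W(u, v)$ here) is the $(u, v)$ entry of $(I - A)^{-1}$. The function names, the grandfather variance `root_var`, and the three-node example are hypothetical.

```python
import numpy as np

def total_path_weights(nodes, weights):
    """W[u, v] = sum over all directed paths u -> v of the product of edge weights."""
    index = {v: k for k, v in enumerate(nodes)}
    A = np.zeros((len(nodes), len(nodes)))
    for (u, v), w in weights.items():
        A[index[u], index[v]] = w
    # A is nilpotent for a DAG, so I + A + A^2 + ... equals (I - A)^(-1).
    return np.linalg.inv(np.eye(len(nodes)) - A)

def model_covariance(nodes, weights, root, root_var, noise_var):
    """Split the model covariance into causal and noise propagation parts."""
    W = total_path_weights(nodes, weights)
    r = nodes.index(root)
    # Causal part: variance of the grandfather carried along paths from it.
    causal = root_var * np.outer(W[r], W[r])
    # Noise part: intrinsic noise of every other node carried along its paths.
    others = [k for k in range(len(nodes)) if k != r]
    noise = noise_var * W[others].T @ W[others]
    return causal, noise

# Hypothetical network: g -> a (1), g -> b (2), a -> b (-1).
nodes = ["g", "a", "b"]
weights = {("g", "a"): 1.0, ("g", "b"): 2.0, ("a", "b"): -1.0}
W = total_path_weights(nodes, weights)
causal, noise = model_covariance(nodes, weights, "g", root_var=1.0, noise_var=0.25)
```

For the pair $(a, b)$ in this example the causal part is positive and the noise part is negative, so the sign of their sum depends on how the noise variance compares with the grandfather variance.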
Moreover, the Lemma says that if we observe unexpected correlations in such regulation networks (DAGs with linear relationships between variables), they appeared as a result of noise propagation within the network. Thus the proportion of unexpected correlations reflects the noise level in a network.
Estimation of noise. The error estimation is based on the following. If two genes belong to two unrelated subnetworks (see Figure 3a), then the correlation between their respective expression levels has to be equal to 0. However, the observable correlation can be significantly different from 0 due to noise, in which case the observable correlation is positive (or negative) in close to 50% of the cases (see formula (20)). Then, on average, half of all random correlations between any pair of genes from unrelated subnetworks can be classified as unexpected, as in (9). Thus $2 \cdot \mathrm{PUC}$ can be utilized as an error estimator.
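The sign symmetry behind this estimator is easy to check numerically. The following is a minimal sketch under the assumption of independent standard normal expression values; the function name and sample sizes are hypothetical. It draws many pairs of unrelated variables and reports the fraction of pairs whose sample covariance is positive.

```python
import numpy as np

def fraction_positive_spurious(n_pairs, n_samples, rng):
    """Fraction of independent variable pairs with a positive sample covariance."""
    a = rng.normal(size=(n_samples, n_pairs))
    b = rng.normal(size=(n_samples, n_pairs))
    # Center each column, then take the column-wise sample covariance.
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    cov = (a * b).sum(axis=0) / (n_samples - 1)
    return float((cov > 0).mean())

frac = fraction_positive_spurious(2000, 10, np.random.default_rng(0))
```

Since the sign of a spurious correlation is close to a fair coin flip, it disagrees with the sign expected from the direction of regulation about half the time, whichever sign is expected. This is why doubling the proportion of unexpected correlations gives an estimate of the proportion of erroneous ones.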
Moreover, it is possible to prove for tree-like graphs that within one network the noise propagation (see formula (11)) has the same property as stated in formula (20).
Indeed, the representation (6) means that any variable $X_v^{(s)}$ can be decomposed into the causal component $X_g^{(s)} W(g, v)$ and the noise component $\sum_{k \neq g} W(k, v) \varepsilon_k$, where $g$ is the grandfather and $W(k, v)$ is the sum of the weights of all paths from $k$ to $v$. Then the covariance between the noise components of $X_u^{(s)}$ and $X_v^{(s)}$ can be calculated exactly (compare with formula (10)):
$\sigma^2 \sum_{k \neq g} W(k, u) W(k, v)$. (11)
If the weights $w_e$ are mutually independent, identically distributed, with positive probabilities for being positive or negative, then the covariance (11) for any $s \in \{1, 2\}$ will be negative in approximately half of cases.

II.2. Definitions and generalization.
Here we study the concept of unexpected links in a more general framework. The positive and negative correlation inequalities are an active research direction in the field of probability and statistical mechanics. We believe these inequalities will allow us to generalize the concept of unexpected correlations in the PUC method. The following framework connects FKG (Fortuin-Kasteleyn-Ginibre) inequality in Statistical Mechanics to the concept of expected and unexpected links.
Let $\Omega$ be the underlying sample space of a biological system. As an example of a biological system we consider a gene regulatory network, in which case $\Omega$ can be considered as the set of all possible gene expression configurations. We suppose that the state space $\Omega$ has an ordering (or partial ordering) "$\prec$" assigned to pairs of its elements. Here, if $\omega, \omega', \omega'' \in \Omega$, and if $\omega \prec \omega'$ and $\omega' \prec \omega''$, then $\omega \prec \omega''$.
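In statistics and in statistical mechanical models the notion of an increasing random variable plays an important role.

Definition. A random variable $X = X(\omega)$ is said to be increasing if $X(\omega) \le X(\omega')$ whenever $\omega \prec \omega'$. It is said to be decreasing if $X(\omega) \ge X(\omega')$ whenever $\omega \prec \omega'$. Both types of random variables, increasing and decreasing, are said to be monotone random variables.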
In the field of statistical mechanics and probabilistic combinatorics, the FKG inequality (Fortuin-Kasteleyn-Ginibre inequality) explains most of the results involving monotone random variables and monotone (increasing or decreasing) events. It states that for two increasing random variables $X$ and $Y$,
$\mathbb{E}[XY] \ge \mathbb{E}[X]\,\mathbb{E}[Y]$. (12)
In some applications, such as percolation models, partial ordering of $\Omega$ is sufficient for the FKG inequality to hold. See reference [1]. Many important results in applied mathematics and physics, such as the exact value of critical probability in two-dimensional percolation models, would have been impossible without the FKG inequality.
Let $G = (V, E)$ be a graph (network) with vertices (nodes) $V$ and edges $E$. Nodes $v \in V$ represent the genes. Let $f_v(\omega)$ be monotone functions (random variables) assigned to each node $v \in V$. Here $f_v$ represents the noiseless gene expressions. In this framework it is convenient to represent the state of the system as a probability measure. Consider two probability measures $\mu_1$ and $\mu_2$ over $\Omega$ such that $\mu_1$ is stochastically dominated by $\mu_2$, and write $\Delta f_v := \mathbb{E}_{\mu_1}[f_v] - \mathbb{E}_{\mu_2}[f_v]$. Suppose that $\Delta f_u \, \Delta f_v \cdot \mathrm{cov}_{\mu_s}(f_u, f_v) \ge 0$ for $s = 1$ and for $s = 2$, in which case we say that the two gene expressions $f_u$ and $f_v$ have expected correlations. If one or both expected correlations inequalities are not satisfied, we say that $f_u$ and $f_v$ have unexpected correlations.
Proof. Indeed, if $f$ is an increasing (decreasing) variable, then $\Delta f \le 0$ ($\Delta f \ge 0$). Now, if both $f_u$ and $f_v$ are either increasing or decreasing, the FKG inequality (12) implies non-negative correlations, so that for any state $s \in \{1, 2\}$ we have $\mathrm{cov}_{\mu_s}(f_u, f_v) \ge 0$ and $\Delta f_u \, \Delta f_v \ge 0$.
Note that ( − ) < 0 regardless of the values of and (both of which are strictly positive). Thus in the case Δ > 0 the change Δ will still be negative. The sign of Δ will be positive only if Δ ≫ 0. ☐
Figure S2: PUC and FDR correlate strongly when reconstructing macroeconomic networks using various bimodal parameters to define system states. Parameters shown are: ADA - Duration of compulsory education; AIA - Cause of death, by communicable diseases and maternal, prenatal and nutrition conditions (% of total); AVS - Manufactures exports (% of merchandise exports); BEG - Educational expenditure in pre-primary as % of total educational expenditure; QZ - Private credit bureau coverage (% of adults); RW - Strength of legal rights index; UU - Passenger cars (per 1,000 people).