# 'Genome order index' should not be used for defining compositional constraints in nucleotide sequences - a case study of the Z-curve

- Eran Elhaik
^{1}Email author, - Dan Graur
^{2}and - Krešimir Josić
^{3}

**5**:10

https://doi.org/10.1186/1745-6150-5-10

© Elhaik et al; licensee BioMed Central Ltd. 2010

**Received: **7 February 2010

**Accepted: **17 February 2010

**Published: **17 February 2010

## Abstract

### Background

The Z-curve is a three dimensional representation of DNA sequences proposed over a decade ago and has been extensively applied to sequence segmentation, horizontal gene transfer detection, and sequence analysis. Based on the Z-curve, a "genome order index," was proposed, which is defined as *S* = *a*^{2}+ *c*^{2}+*t*^{2}+*g*^{2}, where *a*, *c*, *t*, and *g* are the nucleotide frequencies of A, C, T, and G, respectively. This index was found to be smaller than 1/3 for almost all tested genomes, which was taken as support for the existence of a constraint on genome composition. A geometric explanation for this constraint has been suggested. Each genome was represented by a point *P* whose distance from the four faces of a regular tetrahedron was given by the frequencies *a*, *c*, *t*, and *g*. They claimed that an inscribed sphere of radius *r* = 1/
contains almost all points corresponding to various genomes, implying that *S* <*r*^{2}. The distribution of the points *P* obtained by *S* was studied using the Z-curve.

### Results

In this work, we studied the basic properties of the Z-curve using the "genome order index" as a case study. We show that (1) the calculation of the radius of the inscribed sphere of a regular tetrahedron is incorrect, (2) the *S* index is narrowly distributed, (3) based on the second parity rule, the *S* index can be derived directly from the Shannon entropy and is, therefore, redundant, and (4) the Z-curve suffers from over dimensionality, and the dimension stands for GC content alone suffices to represent any given genome.

### Conclusion

The "genome order index" *S* does not represent a constraint on nucleotide composition. Moreover, *S* can be easily computed from the Gini-Simpson index and be directly derived from entropy and is redundant. Overall, the Z-curve and *S* are over-complicated measures to GC content and Shannon *H* index, respectively.

### Reviewers

This article was reviewed by Claus Wilke, Joel Bader, Marek Kimmel and Uladzislau Hryshkevich (nominated by Itai Yanai).

## Keywords

## Background

The nucleotide composition of genomes varies dramatically between and among taxa. The GC content is the primary measure to characterize genomic regions in terms of homogeneity, compositional bias, and compositional constraints [1].

*x, y*, and

*z*:

where *a*, *c*, *t*, and *g* are the frequencies of the four nucleotides in a sequence. For instance, in the case of the sequence ACGTCGCG, the three coordinates are (0,0,-0.5).

Since it was first proposed, the Z-curve has been used in many applications of sequence segmentation [3–5], horizontal gene transfer detection [6], isochoric domain inference [3, 5], and sequence analysis [7].

*S*, proposed to measure nucleotide composition. Zhang and Zhang [7] defined

*S*as

*a*,

*c*,

*t*, and

*g*are the frequencies of the four nucleotides in a sequence. It was observed [7, 8] that

*S*is negatively correlated with the Shannon entropy

*H*[9], defined as

*S*is also "a linear transformation" of the Gini-Simpson index

*GSI*, defined as

Since *GSI* = *S* + 1, for all intents and purposes, these two measures are the same. Zhang and Zhang [7] calculated *S* for 809 genomes of different species and found that *S* < 1/3 for all but two genomes. They claimed the limited observed range of *S*, from 1/4 to 1/3, supported the existence of a new constraint on nucleotide composition [8].

*P*inside a regular tetrahedron of height 1, and edge length /2 using simplicial coordinates. The four tetrahedral faces represent the four nucleotides, and the distances of

*P*to the four faces equal the frequencies of A, C, T, and G, respectively. The sum of the distances from

*P*to the four faces equals the height of the tetrahedron, a constant. For example, the sequence ACGTCGCG is mapped onto a point

*P*with the simplicial coordinates (0.125, 0.375, 0.125, 0.375), representing the distances between

*P*and the four tetrahedral faces. The position of each point

*P*was calculated again in a Cartesian coordinate system with its origin at the center of the tetrahedron (see also [2]). Finally,

*S*was calculated for all 809 points according to Eq. (4). For example,

*S*= 0.3125 for the sequence ACGTCGCG. Zhang and Zhang [7] claimed that almost all points are located within a sphere inscribed in a regular tetrahedron so that

*S*<

*r*

^{2}. The radius

*r*of the inscribed sphere in the case of a regular tetrahedron was calculated as 1/ (see Figure 1), so that

*S*< 1/3. This inequality was presented as a new biological constraint.

The "genome order index" was selected as a case study to the usefulness of the Z-curve method. We show that the inscribed sphere calculations were erroneous and that the "conversion" from simplicial coordinates to Z-curve coordinates is misleading. Further, we show that *S* is a narrowly distributed measure of nucleotide composition and use the second parity rule to show that it can be derived directly from the Shannon entropy. Therefore, any constraints on *S* follow from constraints on *H*. Finally, we show that the Z-curve suffers from over dimensionality and that the only informative dimension is equivalent to the GC content measure.

## Results and Discussion

### "Inscribed sphere" or "circumscribed sphere"?

We first note that a regular tetrahedron of height 1 has an inscribed sphere of radius 0.25 rather than 1/
≈ 0.58. The center of the tetrahedron is defined as the intersection of two space heights. A sphere of radius 1/
at the center nearly encompasses the tetrahedron (Figure 1). Hence, the conclusion that almost all genome mapping points *P* are located within an inscribed sphere of *r* = 1/
and thus follow *S* <*r*^{2} is a consequence of a mathematical error.

In Figure 1, we present the *x* and *z* coordinates of 235 full bacterial genomes, which have very diverse GC content ranging from 0.22 to 0.77. We found that 45% of the bacterial genomes fall outside the actual inscribed sphere. Moreover, points *P* in the simplicial space are unrelated to the points calculated by the Z-curve. Since the only relevant coordinates are the Z-curve coordinates, which can be calculated directly from the data, the graphical representation of points *P*, using a tetrahedron is misleading and provides no additional information.

*S*is narrowly distributed

*S*< 1/3. This result was surprising because, in theory, the values of

*S*can range between 0.25 and 1. We next show that the range of

*S*is expected to be much narrower. We calculated

*S*values for 235 bacterial genomes (Figure 2) and found that the distribution of

*S*values follows an exponential distribution, as in Zhang and Zhang [7].

*S*can range only between 0.25 and 0.5 (see details in Appendix). Thus, since the second parity rule implies that certain nucleotide frequencies, such as those used by Zhang [11], are not attainable in practice, the potential range of

*S*for real genomic sequences is much narrower than that proposed in Zhang and Zhang [7]. Figure 3 illustrates the narrow range of

*S*versus

*H*for 235 bacterial genomes and for simulated data. We note that the only detectable constraint is that determined by the empirical GC content of genomes, which in nature ranges approximately between 0.22 and 0.77. The GC content distribution delimits

*S*to a range between 0.25 (GC content of 0.50) and 0.33 (GC content of 0.22). The second parity rule was similarly used by Zhang [8] to study the relations between

*S*and

*H*. In agreement with Mitchell and Bridge [13] we note that, although it is unknown which rule dictates the genome composition of single stranded genomes, it cannot be assumed that they are constrained by

*S*.

*S* equals to the Gini-Simpson index and is equivalent to *H*

*S*is equal to

*GSI*- 1 (Eq. 4 and 6), and is, thus, redundant. Subsequently, we show that under the assumption of the second parity rule,

*S*can be derived directly from the Shannon entropy

*H*, which we express as

*a*and

*t*are equal, as are

*c*and

*g*. Using Eq. (A1-A6),

*S*is directly related to

*H*according to

where *S* ranges from 0 to 0.5. Zhang and Zhang [7] claimed that *S* "is a kind of negative *H* function," and that "*S* is negatively correlated with the Shannon *H* function [entropy], but with a simpler form and clear geometrical meanings." Eq. (8) shows that *S* is completely determined from the entropy and the two are correlated, contrary to Zhang [11], but in agreement with Zhang [8]. Moreover, *S* is less informative than *H* since it cannot be interpreted directly in an information theoretical or geometric sense, and it does not have the useful mathematical properties of *H*, such as additivity [11]. We note that the relation between *S* and *H* given in (Eq. 8) may not hold for DNA sequences that violate the second parity rule, such as organellar DNA and single stranded DNA sequences [13]. However, even these genomes obey a less stringent rule: that the number of *a*+*g*'s approximately equals the number of *t*+*c*'s, and therefore they cannot be used as evidence that *S* does not derive from *H*.

### The over dimensionality of the Z-curve

*S*using the Z-curve. Here we show that the Z-curve suffers from over dimensionality. In Figure 4, we show three histograms of the distributions of

*x*,

*y*, and

*z*in 235 bacterial genomes. We used principal components analysis to calculate the coefficient of variation [14] and reduce the dimensionality of the dataset. The principal component variance of

*z*is 0.07, compared to ~10

^{-5}for both

*x*and

*y*. Moreover, 99.91% of the variance is accounted for by the

*z*coordinate and the

*x*and

*y*coordinates accounted for 0.053% and 0.003% of the variance, respectively, as expected from Chargaff's second parity rule [[12], see also Appendix]. The

*z*axis is, therefore, the only meaningful coordinate for studying nucleotide composition. We note that the

*z*axis is perfectly correlated with the GC content measure:

## Conclusions

The Z-curve, a three dimensional representation of DNA sequences, was proposed to characterize isochoric segments [3, 15] and complete genomes [2, 7]. Recently, Zhang and Zhang [7] suggested that a measure of DNA composition, the genome order index *S*, is a constraint for genome compositions. The authors used the Z-curve with a geometric explanation to support their arguments. The relation between *S* and the Shannon entropy *H* were further studied by Zhang [8].

By showing that the geometric representation was in error and that the range of values for *S* is narrower than originally claimed, we conclude that *S* does not represent a constraint on nucleotide composition. Moreover, we point at the limited usefulness of the Z-curve method to support the calculations of *S*. Next, we show not only that *S* can be easily computed from the Gini-Simpson index, but it can also be directly derived from entropy and is redundant. Finally, using principal component analysis, we show that only one out of three components of the Z-curve is important. Not surprisingly, the most useful component of the Z-curve is the GC content measure (*z*) that is already being used widely in genomic studies. The other dimensions of the Z-curve (*x* and *y*) contribute less than 1% of the variance and would be of little use in most studies but can be employed to study deviations from the second parity rule. Overall, we must conclude that both the Z-curve and *S* are over-complicated measures to GC content and Shannon *H* index, respectively.

## Methods

We used two datasets of nucleotide frequencies based on real and surrogate data. For the first dataset, we used 235 bacterial genomes that were downloaded from NCBI website http://www.ncbi.nlm.nih.gov/. The list of species can be found in Table S1 (Additional file 1). For the second dataset, we generated 1,000 genomic sequences with nucleotide frequencies drawn from a uniform distribution ranging from 0 to 0.5 using Eq. A1 and A2.

## Appendix

### Detailed steps showing that *S*is narrowly distributed

*H*ranges from 1 to 2. Solving Eq. A5 with Eq. 4 solution gives an expression for

*H*that depends on only on

*S*

## Reviewers' comments

### Reviewer's report 1

Review by Claus Wilke, Institute for Cell and Molecular Biology, University of Texas

This is a brief theoretical work that addresses whether a barycentric coordinate system can provide a useful characterization of the four nucleotide frequencies in a DNA sequence. Specifically, the authors investigate whether such a coordinate system can provide insight into constraints on nucleotide frequencies beyond just GC content. The authors find that this is not the case, contrary to claims made in the literature.

The article is clearly written and convincing. I agree with all claims made in this article.

### Reviewer's report 2

Review by Uladzislau Hryshkevich and Itai Yanai, Department of Biology, Technion - Israel Institute of Technology

Elhaik et al. and Zhang & Zhang are involved in a dispute stemming from a publication by the latter group in 2004. Given the situation as it is presented by Elhaik *et al.* we find that the described results support the conclusions. Most convincingly for us is the lack of information derived from the x- and y- axes of the Z-curve. Given this lack of x and y variation for known genomes, we agree with Elhaik *et al.* that the Z-curve contains excessive dimensions. The z-axis is the only one that contains variation which Elhaik *et al.* show is reflective of the GC-content. Thus, while the Z-curve is a mathematically elegant approach to reveal compositional constraints, real genomes apparently do not scatter within this defined space sufficiently well to warrant its use. Rather, genome DNA composition follows Chargaff's 2nd parity rule which consequently makes the GC-content the simplest indicator of compositional variation.

### Reviewer's report 3

Review by Joel Bader, Department of Biomedical Engineering, Johns Hopkins University

This manuscript describes methods to analyze nucleotide content and appears technically correct. It reconciles results that have been reported by other groups by providing a more thorough treatment.

### Reviewer's report 4

Review by Marek Kimmel, Department of Statistics, Rice University

The paper makes a conclusive point concerning inaccuracies in the definition and evaluation of properties of the so-called Z-curve by its authors and other sources uncritically citing them. All mathematical point, which are made, seem to be correct, with a particular emphasis concerning geometric arguments about spheres inscribed in a regular tetrahedron and over-dimensionality of the Z-curve.

There seems to be one intriguing point: The analysis of 235 bacterial genomes hints at the distribution of *S* values following exponential distribution. However, inspection of Figure 2 suggests that the distribution is imperfect, and it has a "dip" around the value *S* = 0.27. This might be an artifact of sampling. If so, then taking more genomes might fix it. On the other hand, if this is such simple distribution, should it be not computable from "first-principles" concerning for example the distributions of the GC/AT composition of the genomes?

#### Author's response

This quadratic function achieves a minimum exactly at *a* = 1/4, at which point *S* = 1/4. Therefore the apparent exponential distribution of *S* results from the observation that between genomes the distribution of the frequency of *a* has mean 1/4, and falls off away from this mean. Under the quadratic transformation that defines *S*, this distribution appears exponential: it is vanishes to the left of *S = 1/4*, and its fall-off to the right of *S = 1/4* is accentuated. A distribution could be computed from a fitted distribution of nucleotide frequencies - something we do not attempt. Fluctuations due to limited sample size seem to be responsible for a dip in the distribution of frequencies of *a* and *t* around 0.32 and the frequencies of *c* and *g* around 0.18 (giving *S* = 0.27). This translates into a "dip" and adjacent "hill" observed in Figure 2. We are not aware of any such special properties of the actual nucleotide frequency distribution, so the trend observed in Figure 2 is likely due to sample bias.

## Declarations

### Acknowledgements

This work was supported by NSF grant DBI-0543342 to DG, and NSF grants DMS-0604429, DMS-0817649, and a Texas ARP/ATP grant to KJ. We thank Niv Sabath, Vasyl Pihur, and Emily Foster for their helpful comments on our manuscript.

## Authors’ Affiliations

## References

- Graur D, Li W-H: Fundamentals of molecular evolution. 2000, Sunderland, Mass.: Sinauer, 2Google Scholar
- Zhang CT, Zhang R: Analysis of distribution of bases in the coding sequences by a diagrammatic technique. Nucleic Acids Res. 1991, 19: 6313-6317. 10.1093/nar/19.22.6313.PubMedPubMed CentralView ArticleGoogle Scholar
- Zhang CT, Zhang R: An isochore map of the human genome based on the Z curve method. Gene. 2003, 317: 127-135. 10.1016/S0378-1119(03)00665-6.PubMedView ArticleGoogle Scholar
- Zhang CT, Wang J, Zhang R: A novel method to calculate the G+C content of genomic DNA sequences. J Biomol Struct Dyn. 2001, 19: 333-341.PubMedView ArticleGoogle Scholar
- Wen SY, Zhang CT: Identification of isochore boundaries in the human genome using the technique of wavelet multiresolution analysis. Biochem Biophys Res Commun. 2003, 311: 215-222. 10.1016/j.bbrc.2003.09.198.PubMedView ArticleGoogle Scholar
- Zhang R, Zhang CT: Identification of genomic islands in the genome of Bacillus cereus by comparative analysis with Bacillus anthracis. Physiol Genomics. 2003, 16: 19-23. 10.1152/physiolgenomics.00170.2003.PubMedView ArticleGoogle Scholar
- Zhang CT, Zhang R: A nucleotide composition constraint of genome sequences. Comput Biol Chem. 2004, 28: 149-153. 10.1016/j.compbiolchem.2004.02.002.PubMedView ArticleGoogle Scholar
- Zhang Y: Relations between Shannon entropy and genome order index in segmenting DNA sequences. Phys Rev E Stat Nonlin Soft Matter Phys. 2009, 79: 041918-PubMedView ArticleGoogle Scholar
- Shannon CE: A mathematical theory of communication. Bell System Technical Journal. 1948, 27: 379-423.View ArticleGoogle Scholar
- Elhaik E, Graur D, Josić K: 'Genome order index' should not be used for defining compositional constraints in nucleotide sequences. Comput Biol Chem. 2008, 32: 147-10.1016/j.compbiolchem.2007.11.003.PubMedView ArticleGoogle Scholar
- Zhang R: A rebuttal to the comments on the genome order index. Comput Biol Chem. 2008Google Scholar
- Rudner R, Karkas JD, Chargaff E: Separation of B. subtilis DNA into complementary strands. 3. Direct analysis. Proc Natl Acad Sci USA. 1968, 60: 921-922. 10.1073/pnas.60.3.921.PubMedPubMed CentralView ArticleGoogle Scholar
- Mitchell D, Bridge R: A test of Chargaff's second rule. Biochem Biophys Res Commun. 2006, 340: 90-94. 10.1016/j.bbrc.2005.11.160.PubMedView ArticleGoogle Scholar
- Sokal RR, Rohlf FJ: Biometry. 1995, W.H. Freeman and Company, NY, 3Google Scholar
- Zhang CT, Zhang R: Isochore structures in the mouse genome. Genomics. 2004, 83: 384-394. 10.1016/j.ygeno.2003.09.011.PubMedView ArticleGoogle Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.