'Genome order index' should not be used for defining compositional constraints in nucleotide sequences - a case study of the Z-curve
© Elhaik et al; licensee BioMed Central Ltd. 2010
Received: 7 February 2010
Accepted: 17 February 2010
Published: 17 February 2010
The Z-curve is a three-dimensional representation of DNA sequences that was proposed over a decade ago and has been extensively applied to sequence segmentation, horizontal gene transfer detection, and sequence analysis. Based on the Z-curve, a "genome order index" was proposed, defined as S = a² + c² + t² + g², where a, c, t, and g are the nucleotide frequencies of A, C, T, and G, respectively. This index was found to be smaller than 1/3 for almost all tested genomes, which was taken as support for the existence of a constraint on genome composition. A geometric explanation for this constraint has been suggested. Each genome was represented by a point P whose distances from the four faces of a regular tetrahedron were given by the frequencies a, c, t, and g. The authors of the index claimed that an inscribed sphere of radius r = 1/√3 contains almost all points corresponding to various genomes, implying that S < r². The distribution of the points P obtained by S was studied using the Z-curve.
In this work, we studied the basic properties of the Z-curve using the "genome order index" as a case study. We show that (1) the calculation of the radius of the inscribed sphere of a regular tetrahedron is incorrect, (2) the S index is narrowly distributed, (3) based on the second parity rule, the S index can be derived directly from the Shannon entropy and is, therefore, redundant, and (4) the Z-curve suffers from over-dimensionality, and the dimension that stands for GC content alone suffices to represent any given genome.
The "genome order index" S does not represent a constraint on nucleotide composition. Moreover, S can be easily computed from the Gini-Simpson index, can be derived directly from entropy, and is therefore redundant. Overall, the Z-curve and S are over-complicated alternatives to GC content and the Shannon H index, respectively.
This article was reviewed by Claus Wilke, Joel Bader, Marek Kimmel and Uladzislau Hryshkevich (nominated by Itai Yanai).
The nucleotide composition of genomes varies dramatically between and among taxa. The GC content is the primary measure used to characterize genomic regions in terms of homogeneity, compositional bias, and compositional constraints.
where a, c, t, and g are the frequencies of the four nucleotides in a sequence. For instance, in the case of the sequence ACGTCGCG, the three coordinates are (0,0,-0.5).
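The coordinates in this example can be reproduced with a short Python sketch. The defining equation is not reproduced in this text, so the sketch assumes the standard Z-curve component definitions, x = (a+g) − (c+t), y = (a+c) − (g+t), and z = (a+t) − (g+c). The function name `z_curve_coordinates` is hypothetical.

```python
from collections import Counter


def z_curve_coordinates(sequence):
    """Return the (x, y, z) Z-curve components of a DNA sequence.

    Assumption: standard Z-curve definitions,
    x = (a+g)-(c+t), y = (a+c)-(g+t), z = (a+t)-(g+c),
    where a, c, g, t are nucleotide frequencies.
    """
    counts = Counter(sequence.upper())
    n = sum(counts[base] for base in "ACGT")  # ignore non-ACGT symbols
    if n == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    a, c, g, t = (counts[base] / n for base in "ACGT")
    x = (a + g) - (c + t)  # purine vs. pyrimidine
    y = (a + c) - (g + t)  # amino vs. keto
    z = (a + t) - (g + c)  # weak vs. strong hydrogen bonding
    return x, y, z


print(z_curve_coordinates("ACGTCGCG"))
```

For ACGTCGCG the frequencies are a = t = 1/8 and c = g = 3/8, so the x and y components cancel and only z is non-zero.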
Since it was first proposed, the Z-curve has been used in many applications of sequence segmentation [3–5], horizontal gene transfer detection, isochoric domain inference [3, 5], and sequence analysis.
Since GSI = 1 − S, for all intents and purposes, these two measures are the same. Zhang and Zhang calculated S for 809 genomes of different species and found that S < 1/3 for all but two genomes. They claimed that the limited observed range of S, from 1/4 to 1/3, supported the existence of a new constraint on nucleotide composition.
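A minimal sketch of the two quantities, assuming the usual definition of the Gini-Simpson index as one minus the sum of squared frequencies. The function names are hypothetical.

```python
def genome_order_index(a, c, g, t):
    """S = a^2 + c^2 + g^2 + t^2 for nucleotide frequencies that sum to 1."""
    if abs((a + c + g + t) - 1.0) > 1e-9:
        raise ValueError("frequencies must sum to 1")
    return a * a + c * c + g * g + t * t


def gini_simpson(a, c, g, t):
    """Gini-Simpson index in its usual form: 1 - sum of squared frequencies."""
    return 1.0 - genome_order_index(a, c, g, t)


# Equal frequencies give the smallest value S can take.
print(genome_order_index(0.25, 0.25, 0.25, 0.25))
# Frequencies of the example sequence ACGTCGCG (a = t = 1/8, c = g = 3/8).
print(genome_order_index(0.125, 0.375, 0.375, 0.125))
```

Because one index is a fixed linear function of the other, any statement about the range of S is also a statement about the range of the Gini-Simpson index.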
The "genome order index" was selected as a case study of the usefulness of the Z-curve method. We show that the inscribed sphere calculations were erroneous and that the "conversion" from simplicial coordinates to Z-curve coordinates is misleading. Further, we show that S is a narrowly distributed measure of nucleotide composition and use the second parity rule to show that it can be derived directly from the Shannon entropy. Therefore, any constraints on S follow from constraints on H. Finally, we show that the Z-curve suffers from over-dimensionality and that the only informative dimension is equivalent to the GC content measure.
Results and Discussion
"Inscribed sphere" or "circumscribed sphere"?
We first note that a regular tetrahedron of height 1 has an inscribed sphere of radius 0.25 rather than 1/√3 ≈ 0.58. The center of the tetrahedron is defined as the intersection of two of its altitudes (space heights). A sphere of radius 1/√3 placed at the center nearly encompasses the tetrahedron (Figure 1). Hence, the conclusion that almost all genome mapping points P are located within an inscribed sphere of r = 1/√3 and thus follow S < r² is a consequence of a mathematical error.
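The radius of the inscribed sphere can be checked with elementary solid geometry. The sketch below uses the general relation r = 3V/A (volume over total surface area) for a regular tetrahedron. The function name is hypothetical.

```python
import math


def regular_tetrahedron_radii(height):
    """Return (inradius, circumradius) of a regular tetrahedron of given height."""
    edge = height * math.sqrt(3.0 / 2.0)           # height = edge * sqrt(2/3)
    volume = edge ** 3 / (6.0 * math.sqrt(2.0))
    face_area = math.sqrt(3.0) / 4.0 * edge ** 2   # one equilateral face
    inradius = 3.0 * volume / (4.0 * face_area)    # r = 3V / total surface area
    circumradius = height - inradius               # center lies on the altitude
    return inradius, circumradius


r_in, r_circ = regular_tetrahedron_radii(1.0)
print(r_in, r_circ, 1.0 / math.sqrt(3.0))
```

For height 1 the inscribed radius is one quarter of the height and the circumscribed radius is three quarters, so a radius of 1/√3 lies between the two.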
In Figure 1, we present the x and z coordinates of 235 full bacterial genomes, which have very diverse GC content ranging from 0.22 to 0.77. We found that 45% of the bacterial genomes fall outside the actual inscribed sphere. Moreover, points P in the simplicial space are unrelated to the points calculated by the Z-curve. Since the only relevant coordinates are the Z-curve coordinates, which can be calculated directly from the data, the graphical representation of points P using a tetrahedron is misleading and provides no additional information.
S is narrowly distributed
S equals the Gini-Simpson index and is equivalent to H
where S ranges from 0 to 0.5. Zhang and Zhang claimed that S "is a kind of negative H function," and that "S is negatively correlated with the Shannon H function [entropy], but with a simpler form and clear geometrical meanings." Eq. (8) shows that S is completely determined from the entropy and the two are correlated, contrary to Zhang, but in agreement with Zhang. Moreover, S is less informative than H since it cannot be interpreted directly in an information theoretical or geometric sense, and it does not have the useful mathematical properties of H, such as additivity. We note that the relation between S and H given in Eq. (8) may not hold for DNA sequences that violate the second parity rule, such as organellar DNA and single-stranded DNA sequences. However, even these genomes obey a less stringent rule: that the number of a+g's approximately equals the number of t+c's, and therefore they cannot be used as evidence that S does not derive from H.
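The dependence of both quantities on a single parameter can be illustrated numerically. The sketch below does not reproduce Eq. (8). It assumes exact compliance with the second parity rule (a = t and c = g), under which the composition is fixed by the GC content alone, and tabulates S and H over the GC range reported for the bacterial dataset. The function names are hypothetical.

```python
import math


def parity_frequencies(gc):
    """Frequencies (a, c, g, t) under exact second parity: a = t, c = g."""
    if not 0.0 <= gc <= 1.0:
        raise ValueError("GC content must lie in [0, 1]")
    at = 1.0 - gc
    return at / 2.0, gc / 2.0, gc / 2.0, at / 2.0


def order_index(freqs):
    """Genome order index S: sum of squared frequencies."""
    return sum(p * p for p in freqs)


def shannon_entropy(freqs):
    """Shannon entropy in bits; zero frequencies contribute nothing."""
    return -sum(p * math.log2(p) for p in freqs if p > 0.0)


for gc in (0.22, 0.35, 0.50, 0.65, 0.77):
    f = parity_frequencies(gc)
    print(f"GC={gc:.2f}  S={order_index(f):.4f}  H={shannon_entropy(f):.4f}")
```

Under this assumption, S is smallest and H is largest at a GC content of 0.5, and moving away from 0.5 in either direction raises S while lowering H, so each value of one determines the other.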
The over dimensionality of the Z-curve
The Z-curve, a three-dimensional representation of DNA sequences, was proposed to characterize isochoric segments [3, 15] and complete genomes [2, 7]. Recently, Zhang and Zhang suggested that a measure of DNA composition, the genome order index S, is a constraint for genome compositions. The authors used the Z-curve with a geometric explanation to support their arguments. The relation between S and the Shannon entropy H was further studied by Zhang.
By showing that the geometric representation was in error and that the range of values for S is narrower than originally claimed, we conclude that S does not represent a constraint on nucleotide composition. Moreover, we point to the limited usefulness of the Z-curve method in supporting the calculations of S. Next, we show that S not only can be easily computed from the Gini-Simpson index but can also be derived directly from entropy and is therefore redundant. Finally, using principal component analysis, we show that only one out of three components of the Z-curve is important. Not surprisingly, the most useful component of the Z-curve is the GC content measure (z) that is already being used widely in genomic studies. The other dimensions of the Z-curve (x and y) contribute less than 1% of the variance and would be of little use in most studies but can be employed to study deviations from the second parity rule. Overall, we must conclude that both the Z-curve and S are over-complicated alternatives to GC content and the Shannon H index, respectively.
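The role of the second parity rule can be made explicit with a short worked derivation. It assumes the standard Z-curve component definitions, which are not reproduced in this text, and exact compliance with the rule (a = t and c = g), together with a + c + g + t = 1:

```latex
\begin{aligned}
x &= (a+g)-(c+t) = (a+g)-(g+a) = 0,\\
y &= (a+c)-(g+t) = (a+c)-(c+a) = 0,\\
z &= (a+t)-(g+c) = 1 - 2\,(g+c).
\end{aligned}
```

Under these assumptions x and y vanish and z is a linear function of the GC content g + c, which is consistent with x and y being useful only for measuring deviations from the second parity rule.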
We used two datasets of nucleotide frequencies based on real and surrogate data. For the first dataset, we used 235 bacterial genomes that were downloaded from the NCBI website (http://www.ncbi.nlm.nih.gov/). The list of species can be found in Table S1 (Additional file 1). For the second dataset, we generated 1,000 genomic sequences with nucleotide frequencies drawn from a uniform distribution ranging from 0 to 0.5 using Eqs. A1 and A2.
Detailed steps showing that S is narrowly distributed
Reviewer's report 1
Review by Claus Wilke, Institute for Cell and Molecular Biology, University of Texas
This is a brief theoretical work that addresses whether a barycentric coordinate system can provide a useful characterization of the four nucleotide frequencies in a DNA sequence. Specifically, the authors investigate whether such a coordinate system can provide insight into constraints on nucleotide frequencies beyond just GC content. The authors find that this is not the case, contrary to claims made in the literature.
The article is clearly written and convincing. I agree with all claims made in this article.
Reviewer's report 2
Review by Uladzislau Hryshkevich and Itai Yanai, Department of Biology, Technion - Israel Institute of Technology
Elhaik et al. and Zhang & Zhang are involved in a dispute stemming from a publication by the latter group in 2004. Given the situation as it is presented by Elhaik et al., we find that the described results support the conclusions. Most convincing for us is the lack of information derived from the x- and y-axes of the Z-curve. Given this lack of x and y variation for known genomes, we agree with Elhaik et al. that the Z-curve contains excessive dimensions. The z-axis is the only one that contains variation, which Elhaik et al. show is reflective of the GC content. Thus, while the Z-curve is a mathematically elegant approach to reveal compositional constraints, real genomes apparently do not scatter within this defined space sufficiently well to warrant its use. Rather, genome DNA composition follows Chargaff's 2nd parity rule, which consequently makes the GC content the simplest indicator of compositional variation.
Reviewer's report 3
Review by Joel Bader, Department of Biomedical Engineering, Johns Hopkins University
This manuscript describes methods to analyze nucleotide content and appears technically correct. It reconciles results that have been reported by other groups by providing a more thorough treatment.
Reviewer's report 4
Review by Marek Kimmel, Department of Statistics, Rice University
The paper makes a conclusive point concerning inaccuracies in the definition and evaluation of properties of the so-called Z-curve by its authors and other sources uncritically citing them. All mathematical points that are made seem to be correct, with a particular emphasis on geometric arguments about spheres inscribed in a regular tetrahedron and the over-dimensionality of the Z-curve.
There seems to be one intriguing point: The analysis of 235 bacterial genomes hints at the distribution of S values following an exponential distribution. However, inspection of Figure 2 suggests that the distribution is imperfect, and it has a "dip" around the value S = 0.27. This might be an artifact of sampling. If so, then taking more genomes might fix it. On the other hand, if this is such a simple distribution, should it not be computable from "first principles" concerning, for example, the distributions of the GC/AT composition of the genomes?
This quadratic function achieves a minimum exactly at a = 1/4, at which point S = 1/4. Therefore, the apparent exponential distribution of S results from the observation that between genomes the distribution of the frequency of a has mean 1/4 and falls off away from this mean. Under the quadratic transformation that defines S, this distribution appears exponential: it vanishes to the left of S = 1/4, and its fall-off to the right of S = 1/4 is accentuated. A distribution could be computed from a fitted distribution of nucleotide frequencies, something we do not attempt. Fluctuations due to limited sample size seem to be responsible for a dip in the distribution of frequencies of a and t around 0.32 and the frequencies of c and g around 0.18 (giving S = 0.27). This translates into a "dip" and adjacent "hill" observed in Figure 2. We are not aware of any such special properties of the actual nucleotide frequency distribution, so the trend observed in Figure 2 is likely due to sample bias.
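The quadratic dependence described here can be checked numerically. The sketch assumes that the second parity rule holds exactly (a = t and c = g, hence c = 1/2 − a), which gives S = 2a² + 2(1/2 − a)² = 4a² − 2a + 1/2. This closed form is derived under that assumption rather than quoted from the text, and the function name is hypothetical.

```python
def s_from_a(a):
    """S under exact second parity (a = t, c = g = 1/2 - a), for 0 <= a <= 1/2."""
    if not 0.0 <= a <= 0.5:
        raise ValueError("a must lie in [0, 0.5]")
    c = 0.5 - a
    return 2.0 * a * a + 2.0 * c * c


# Scan a on a regular grid and locate the minimum of S.
grid = [i / 1000.0 for i in range(501)]
a_min = min(grid, key=s_from_a)
print(a_min, s_from_a(a_min))

# Frequencies quoted for the dip: a = t = 0.32, c = g = 0.18.
print(round(s_from_a(0.32), 2))
```

The scan places the minimum at a = 1/4 with S = 1/4, and the frequencies quoted for the dip give S of about 0.27, in line with the values stated in the response.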
This work was supported by NSF grant DBI-0543342 to DG, and NSF grants DMS-0604429, DMS-0817649, and a Texas ARP/ATP grant to KJ. We thank Niv Sabath, Vasyl Pihur, and Emily Foster for their helpful comments on our manuscript.
1. Graur D, Li W-H: Fundamentals of molecular evolution. 2000, Sunderland, Mass.: Sinauer, 2nd edition.
2. Zhang CT, Zhang R: Analysis of distribution of bases in the coding sequences by a diagrammatic technique. Nucleic Acids Res. 1991, 19: 6313-6317. 10.1093/nar/19.22.6313.
3. Zhang CT, Zhang R: An isochore map of the human genome based on the Z curve method. Gene. 2003, 317: 127-135. 10.1016/S0378-1119(03)00665-6.
4. Zhang CT, Wang J, Zhang R: A novel method to calculate the G+C content of genomic DNA sequences. J Biomol Struct Dyn. 2001, 19: 333-341.
5. Wen SY, Zhang CT: Identification of isochore boundaries in the human genome using the technique of wavelet multiresolution analysis. Biochem Biophys Res Commun. 2003, 311: 215-222. 10.1016/j.bbrc.2003.09.198.
6. Zhang R, Zhang CT: Identification of genomic islands in the genome of Bacillus cereus by comparative analysis with Bacillus anthracis. Physiol Genomics. 2003, 16: 19-23. 10.1152/physiolgenomics.00170.2003.
7. Zhang CT, Zhang R: A nucleotide composition constraint of genome sequences. Comput Biol Chem. 2004, 28: 149-153. 10.1016/j.compbiolchem.2004.02.002.
8. Zhang Y: Relations between Shannon entropy and genome order index in segmenting DNA sequences. Phys Rev E Stat Nonlin Soft Matter Phys. 2009, 79: 041918.
9. Shannon CE: A mathematical theory of communication. Bell System Technical Journal. 1948, 27: 379-423.
10. Elhaik E, Graur D, Josić K: 'Genome order index' should not be used for defining compositional constraints in nucleotide sequences. Comput Biol Chem. 2008, 32: 147. 10.1016/j.compbiolchem.2007.11.003.
11. Zhang R: A rebuttal to the comments on the genome order index. Comput Biol Chem. 2008.
12. Rudner R, Karkas JD, Chargaff E: Separation of B. subtilis DNA into complementary strands. 3. Direct analysis. Proc Natl Acad Sci USA. 1968, 60: 921-922. 10.1073/pnas.60.3.921.
13. Mitchell D, Bridge R: A test of Chargaff's second rule. Biochem Biophys Res Commun. 2006, 340: 90-94. 10.1016/j.bbrc.2005.11.160.
14. Sokal RR, Rohlf FJ: Biometry. 1995, W.H. Freeman and Company, NY, 3rd edition.
15. Zhang CT, Zhang R: Isochore structures in the mouse genome. Genomics. 2004, 83: 384-394. 10.1016/j.ygeno.2003.09.011.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.