### Orly Alter's review

R. Cangelosi and A. Goriely present two novel mathematical methods for estimating the statistically significant dimension of a matrix. One method is based on the Shannon entropy of the matrix, and is derived from fundamental principles of information theory. The other method is a modification of the "broken stick" model, and is derived from fundamental principles of probability. Also presented are computational estimations of the dimensions of six well-studied DNA microarray datasets using these two novel methods as well as ten previous methods.

Estimating the statistically significant dimension of a given matrix is a key step in the mathematical modeling of data, e.g., as the authors note, for data interpretation as well as for estimating missing data. How best to estimate the dimension of a matrix remains an open question, one that is faced in most analyses of DNA microarray data (and other large-scale modern datasets). The work presented here is not only an extensive analysis of this open question; it is also, to the best of my knowledge, the first work to address it in the context of DNA microarray data analysis. I expect it will have a significant impact on this field of research, and recommend its publication.

For example, R. Cangelosi and A. Goriely show that, in estimating the number of eigenvectors which are of statistical significance in the PCA analysis of DNA microarray data, the method of cumulative percent of variance should not be used. Unfortunately, this very method is used in an algorithm which estimates missing DNA microarray data by fitting the available data with cumulative-percent-of-variance-selected eigenvectors [Troyanskaya et al., Bioinformatics 17, 520 (2001)]. This might be one explanation for the superior performance of other PCA and SVD-based algorithms for estimating DNA microarray data [e.g., Kim et al., Bioinformatics 15, 187 (2005)].
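For readers unfamiliar with the cumulative-percent-of-variance rule mentioned above, the following is a minimal sketch of how such a rule is commonly implemented. The function name, the example eigenvalues, and the threshold values are illustrative assumptions and are not taken from the manuscript or from the cited algorithms. It shows how the retained dimension depends directly on an arbitrarily chosen threshold.

```python
import numpy as np

def cumulative_percent_variance(eigenvalues, threshold=0.9):
    """Smallest k such that the k largest eigenvalues reach `threshold` of the total variance.

    Hypothetical helper for illustration; the threshold is a free parameter.
    """
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    fractions = np.cumsum(lam) / lam.sum()
    # First position whose cumulative fraction reaches the threshold
    # (a small tolerance guards against floating-point round-off).
    k = int(np.searchsorted(fractions, threshold - 1e-12)) + 1
    return min(k, lam.size)

# Made-up eigenvalue spectrum, used only to show the threshold dependence.
spectrum = [5.0, 3.0, 1.0, 0.5, 0.5]
for t in (0.7, 0.9, 0.95):
    print(t, cumulative_percent_variance(spectrum, t))
```

On the same spectrum, different thresholds give different dimensions, which is one way to see why the choice of threshold matters for this rule.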

In another example, R. Cangelosi and A. Goriely estimate that there are two eigenvectors which are of statistical significance in the yeast cdc15 cell-cycle dataset of 799 genes × 15 time points. Their mathematical estimation is in agreement with the previous biological experimental [Spellman et al., MBC 9, 3273 (1998)] as well as computational [Holter et al., PNAS 97, 8409 (2000)] interpretations of this dataset.

Declaration of competing interests: I declare that I have no competing interests.

### John Spouge's review (John Spouge was nominated by Eugene Koonin)

This paper reviews several methods based on principal component analysis (PCA) for determining the "true" dimensionality of a matrix subject to statistical noise, with specific application to microarray data. It also offers two new candidates for estimating the dimensionality, called "information dimension" and the "modified broken stick model".

Section 2.1 nicely summarizes matrix methods for reducing dimensionality in microarray data. It describes why PCA is preferable to a singular value decomposition (a change in the intensities of microarray data affects the singular value decomposition, but not PCA).

Section 2.2 analyzes the broken stick model. Section 2.3 explains in intuitive terms the authors' "modified broken stick model", but the algorithm became clear to me only when it was applied to data later in the paper. The broken stick model has the counterintuitive property of determining dimensionality without regard to the amount of data, implicitly ignoring the ability of increased data to improve signal-to-noise. The modified broken stick model therefore has some intuitive appeal.
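As a point of reference, the classic broken stick rule (not the authors' modified version, whose details are not reproduced in this review) can be sketched as follows. The function names and example eigenvalues are illustrative assumptions. The expected share of the k-th longest of p pieces is b_k = (1/p) Σ_{i=k}^{p} 1/i, and a component is retained while its share of the total variance exceeds b_k.

```python
import numpy as np

def broken_stick_expectations(p):
    """Expected share of the k-th longest piece (k = 1..p) of a stick broken at random into p pieces."""
    inv = 1.0 / np.arange(1, p + 1)
    # Reverse cumulative sum yields sum_{i=k}^{p} 1/i for every k.
    return np.cumsum(inv[::-1])[::-1] / p

def broken_stick_dimension(eigenvalues):
    """Number of leading components whose variance share exceeds the broken stick expectation."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    share = lam / lam.sum()
    exceeds = share > broken_stick_expectations(lam.size)
    if exceeds.all():
        return int(lam.size)
    # Index of the first component that fails the test equals the count retained.
    return int(np.argmin(exceeds))

# Made-up spectrum for illustration only.
print(broken_stick_dimension([6.0, 3.0, 0.5, 0.5]))
```

Note that `broken_stick_expectations` takes only the number of components p as input, so the thresholds do not change with the number of observations. This is the property described above as determining dimensionality without regard to the amount of data.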

Section 2.4 explains the authors' information dimension. The derivation is thorough, but the resulting measure is purely heuristic, as the authors point out. In the end, despite the theoretical gloss, it is just a formula, without any desirable theoretical properties or intuitive interpretation.

The evaluation of the novel measures therefore depends on their empirical performance, found in the Results and Discussion. Systematic responses to variables irrelevant to the (known) dimensionality of synthetic data become of central interest. In particular, the authors show data that their information dimension increases systematically with noise, clearly an undesirable property. The authors also test the dimensionality estimators on real microarray data. They conclude that six dimensionality measures are in rough accord, with three outliers: Bartlett's test, cumulative percent of variation explained, and the information dimension (which tends to be higher than other estimators). They therefore propose the information dimension as an upper bound for the true dimensionality, with a consensus estimate being derived from the remaining measures.
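The noise dependence noted above can be illustrated with a small sketch. It rests on two assumptions that are not confirmed by this review: that the entropy-based measure behaves like the exponential of the Shannon entropy of the normalized eigenvalues, and that noise acts as a flat floor added to every eigenvalue. The function name and the numbers are hypothetical.

```python
import numpy as np

def entropy_dimension(eigenvalues):
    """exp(H) of the normalized eigenvalues; an assumed stand-in for an entropy-based dimension."""
    lam = np.asarray(eigenvalues, dtype=float)
    p = lam / lam.sum()
    p = p[p > 0]  # treat 0 * log(0) as 0
    return float(np.exp(-np.sum(p * np.log(p))))

# Hypothetical rank-2 signal spectrum in a 6-dimensional space.
signal = np.array([8.0, 4.0, 0.0, 0.0, 0.0, 0.0])

# Assumption: noise adds the same floor to every eigenvalue.
for noise_floor in (0.0, 0.1, 0.5, 2.0):
    print(noise_floor, round(entropy_dimension(signal + noise_floor), 3))
```

Under these assumptions, adding a flat floor moves the normalized spectrum toward the uniform distribution, so the entropy, and with it the estimated dimension, grows with the noise level even though the signal rank stays fixed at 2.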

The choice of dimensionality measure is purely empirical. While it is desirable to check all estimators (and report them in general accord, if that is the case), it is undesirable to report all estimators for any large set of results. The information dimension's property of increasing with noise makes it undesirable as an estimator, and it cannot be recommended. The main value of the paper therefore resides in its useful review and its software tools.

### Answers to John Spouge's review

The main point of the reviewer is the suggestion that the information dimension's property of increasing with noise makes it undesirable as an estimator. We analyzed the information dimension in detail and indeed reached the conclusion that its prediction increases with noise. In the preprint reviewed by Dr. Spouge, we only considered the effect of noise on the information dimension. It is crucial to note that ALL methods are functions of the noise level present in the data. In the new and final version of the manuscript, we study the effect of noise on two other methods (Jolliffe's modification of the Guttman-Kaiser rule and LEV). It clearly appears that in one case the estimator increases with noise and in the other it decreases with noise (both effects are undesirable and unavoidable). The message to the practitioner is the same: understand the signal-to-noise ratio of the data and act accordingly. We conclude that the information dimension could still be of interest as an estimator.

### David Horn and Roy Varshavsky joint review (both reviewers were nominated by O. Alter)

This paper discusses an important problem in data analysis using PCA. The term 'component retention' that the authors use in the title is usually referred to as dimensional truncation or, in more general terms, as data compression. The problem is to find the desired truncation level to assure optimal results for applications such as clustering, classification or various prediction tasks.

The paper contains a very exhaustive review of the history of PCA and describes many recipes for truncation proposed over the 100 years since PCA was introduced. The authors also propose one method of their own, based on the use of the entropy of correlation eigenvalues. A comparison of all methods is presented in Table 2, including 14 criteria applied to 6 microarray experiments. This table demonstrates that the results of their proposed 'information dimension' are very different from those of most other truncation methods.

We appreciate the quality of the review presented in this paper, and we recommend that it should be viewed and presented as such. But we have quite a few reservations regarding the presentation in general and their novel method in particular.

1. The motivation for dimensional reduction is briefly mentioned in the introduction, but this point is not elaborated later on in the paper. As a result, the paper lacks a target function according to which one could measure the performance of the various methods displayed in Table 2. We believe one should test methods according to how well they perform, rather than according to consensus. Performance can be measured on data, but only if a performance function is defined, e.g. the best Jaccard score achieved for classification of the data within an SVM approach. Clearly many other criteria can be suggested, and results may vary from one dataset to another, but this is the only valid scientific approach to decide on what methods should be used. We believe that it is necessary for the authors to discuss this issue before the paper is accepted for publication.

2. All truncation methods are heuristic. The new statistical method proposed here is also heuristic, as the authors admit. An example presented in Table 1 looks nice, and should be regarded as some justification; however, the novel method's disagreement with most other methods (in Table 2) raises the suspicion that the performance of the new method, once scrutinized by some performance criterion on real data, may be bad. The authors are aware of this point and they suggest using their method as an upper bound criterion, with which to decide if their proposed 'consensus dimension' makes sense. This, by itself, has a very limited advantage.

3. The abstract does not faithfully represent the paper. The new method is based on an 'entropy' measure but this is not really Shannon entropy because no probability is involved. It gives the impression that the new method is based on some 'truth' whereas others are ad hoc which, in our opinion, is wrong (see item 2 above). We suggest that once this paper is recognized as a review paper the abstract should reflect the broad review work done here.

4. Some methods are described in the body of the article (e.g., broken stick model), while others are moved to the appendix (e.g., portion of total variance). The reason for this separation is not clear. Unifying these two sections would improve the paper's readability.

In conclusion, since the authors admit that information dimension cannot serve as a stopping criterion for PCA compression, this paper should not be regarded as promoting a useful truncation method. Nevertheless, we believe that it may be very useful and informative in reviewing and describing the existing methods, once the modifications mentioned above are made. We believe this could then serve well the interested mathematical biology community.

### Answers to Horn's and Varshavsky's review

We would like to thank the reviewers for their careful reading of our manuscript and their constructive criticism. We have modified our manuscript and followed most of their recommendations.

Specifically, we answer each comment by the reviewers:

1. The status of the paper as a review or a regular article.

It is true that the paper contains a comprehensive survey of the literature and many references. Nevertheless, we believe that the article contains sufficiently many new results to be seen as a regular journal article. Both the modified broken-stick method and the information dimension are new and important results for the field of cDNA analysis.

2. The main criticism of the paper is that we did not test the performance of the different methods against some benchmark.

To address this problem, we performed extensive benchmarking of the different methods against noisy simulated data for which the true signal and its dimension were known. We have added this analysis to the paper, where we provide an explicit example. This example clearly establishes our previous claim that the information dimension provides a useful upper bound for the true signal dimension (whereas other traditional methods such as Velicer's underestimate the true dimension). Upper bounds are extremely important in data analysis as they provide a reference point with respect to which other methods can be compared.

3. The abstract does not represent the paper.

We did modify the abstract to clarify the relationship between the information dimension that we propose and the other methods (it is also a heuristic approach!). Now, with the added analysis and wording, we believe that the abstract is indeed a faithful representation of the paper.

4. Some methods appear in the appendix.

Indeed, the methods presented in the appendix are the ones that we review. Since we present a modification of the broken-stick method along with a new heuristic technique, we believe it is appropriate to describe the broken-stick method in the main body of the text while relegating other known approaches (only used for comparison) to the appendix. Keeping in mind that this is a regular article rather than a review, we believe this is justified.