Over the last decade, enormous culture-independent inventories of microbial taxa have allowed biologists to address long-standing questions regarding the global diversity of microorganisms. Using the Global Prokaryotic Census (GPC), a collection of 16S rRNA gene sequences from 492 studies (34,368 sites), Louca et al. [1] concluded that Earth contains 0.8–1.6 million microbial taxa. This estimate is six orders of magnitude lower than a prediction based on a comparably large data set of microbial communities [2]. Below, we demonstrate that the low estimate from [1] arises from violations of sampling theory and the misuse of biodiversity theory. After correcting for the misinterpretations of our previous work [2], we find that the GPC supports the prediction that there are at least 10¹² microbial taxa on Earth.
Louca et al. [1] estimated microbial richness (i.e., number of taxa) at the global scale using approaches based on sampling theory that account for the frequencies of low-abundance classes (e.g., singletons, doubletons, etc.). These statistical estimators make no assumptions about biological processes and use more available information than approaches based on models of biodiversity [2]. However, such statistical estimators assume that unobserved taxa are present during sampling and that samples are unbiased representatives of the study system. Despite being one of the largest compilations of 16S rRNA gene sequences to date, the vast majority of samples in the GPC were obtained from central North America, central Europe, and Eastern China, while vast swaths of Earth are barely represented (see S1 Fig in [1]). Perhaps this geographical bias explains why 125,780 of the observed taxa (17%) were recovered in only one or two samples. Regardless, neither intuition nor evidence suggests that the GPC is sufficiently representative of Earth’s microbiome to avoid the underestimation of global microbial diversity using statistical estimators. The authors did not acknowledge these violations of sampling theory, but instead concluded that “everything is everywhere” and then proceeded to use statistical richness estimators to predict global microbial diversity [1].
To demonstrate how violations of sampling theory can affect richness estimation, we simulated the spatial distribution of 10⁷ individuals belonging to 10⁵ species under realistic and unrealistic scenarios. We randomly resampled increasing numbers of sites up to and including the entire simulated landscape before calculating richness using two common estimators (Chao2, ICE). Under an “everything is everywhere” scenario where taxa are similar in abundance and uniformly distributed in space, estimates quickly converged on the true richness of the system (Fig. 1a). Under these conditions, diversity estimators used by Louca et al. [1] and others [3] are justifiable and may perform better than other approaches that use less sample-based information (e.g., [2]). However, when we simulated more realistic conditions where taxa have uneven abundances and are aggregated in space (see S1 Fig), richness was substantially underestimated even when all areas of the simulated landscape were sampled (Fig. 1b-d). Rather than explain the magnitude of discrepancy in real-world diversity estimates, our simulations simply illustrate why ecologists advise against using richness estimators when critical assumptions are violated [4] and why modern estimates for the global diversity of other taxa are rarely, if ever, based on such estimators.
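The contrast between the two scenarios can be sketched at a much smaller scale. The Python example below is illustrative only and is not the simulation code behind Fig. 1: the function names, the one-dimensional landscape, the Gaussian range kernel, the lognormal abundance parameters, and the community sizes are all assumptions made for this sketch. It implements only the bias-corrected form of Chao2 (ICE is omitted) and draws a fixed number of reads per site, so rare or spatially restricted taxa can go undetected even when every site is visited.

```python
import math
import random


def chao2(sites):
    """Bias-corrected Chao2 from a list of per-site sets of observed taxa."""
    m = len(sites)
    occurrences = {}
    for observed in sites:
        for taxon in observed:
            occurrences[taxon] = occurrences.get(taxon, 0) + 1
    s_obs = len(occurrences)
    q1 = sum(1 for c in occurrences.values() if c == 1)  # taxa found at one site
    q2 = sum(1 for c in occurrences.values() if c == 2)  # taxa found at two sites
    return s_obs + ((m - 1) / m) * q1 * (q1 - 1) / (2 * (q2 + 1))


def simulate(n_taxa, n_sites, reads, aggregated, rng):
    """Return per-site sets of detected taxa on a hypothetical 1-D landscape."""
    taxa = list(range(n_taxa))
    if aggregated:
        # Assumed parameters: strongly uneven abundances, narrow spatial ranges.
        abundance = [rng.lognormvariate(0.0, 3.0) for _ in taxa]
        center = [rng.uniform(0, n_sites) for _ in taxa]
    sites = []
    for s in range(n_sites):
        if aggregated:
            weights = [
                abundance[t] * math.exp(-0.5 * ((s - center[t]) / 2.0) ** 2)
                for t in taxa
            ]
        else:
            weights = [1.0] * n_taxa  # every taxon equally likely at every site
        sites.append(set(rng.choices(taxa, weights=weights, k=reads)))
    return sites


rng = random.Random(1)
for label, aggregated in (("even, uniform", False), ("uneven, aggregated", True)):
    sites = simulate(n_taxa=200, n_sites=50, reads=500, aggregated=aggregated, rng=rng)
    observed = len(set().union(*sites))
    print(f"{label}: observed = {observed}, Chao2 = {chao2(sites):.1f}, true = 200")
```

In the even, uniform case every taxon is detected at most sites, so the singleton and doubleton counts vanish and the estimate equals the true richness. How far the aggregated case departs from the truth depends on the assumed parameters, which is why no specific output is quoted here.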
Aware of the limitations of richness estimators, we predicted global-scale microbial diversity using a combination of empirical scaling laws and a well-vetted model of biodiversity [2]. We began by documenting diversity-abundance relationships (DARs), which are statements of how rarity, dominance (Nmax), evenness, and richness scale with the total number of individuals or sequence reads (N). These DARs are not simply phenomenological but instead have been shown to emerge from interactions between biological processes, energetic constraints, and species traits [5]. Because Louca et al. [1] did not test whether the GPC exhibited DARs, we performed this task using their publicly available data. We found that the GPC supported DARs and that their scaling exponents were similar to those in our previous study (Fig. 2). For example, richness in the GPC scaled with abundance at a rate comparable (0.51 vs. 0.47) to that of the Earth Microbiome Project (EMP), which comprised >70% of the microbial data in our study (Fig. S7-H in [2]). In addition, the GPC data supported the same nearly isometric scaling of Nmax with N (r² = 0.91), which held over 30 orders of magnitude [2].
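A DAR of this kind is a power law, y = c·N^z, so its scaling exponent z is the slope of a straight line on log-log axes. The Python sketch below shows one simple way to estimate such an exponent, assuming ordinary least squares on log10-transformed values; the function name and the sample values are hypothetical, and the fitting procedure used for Fig. 2 may differ.

```python
import math


def fit_power_law(n_values, y_values):
    """Fit y = c * N**z by least squares on log10-log10 axes.

    Returns (z, c, r_squared).
    """
    xs = [math.log10(n) for n in n_values]
    ys = [math.log10(y) for y in y_values]
    k = len(xs)
    mean_x = sum(xs) / k
    mean_y = sum(ys) / k
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    z = sxy / sxx                      # scaling exponent (slope)
    c = 10 ** (mean_y - z * mean_x)    # coefficient (back-transformed intercept)
    r_squared = sxy ** 2 / (sxx * syy)
    return z, c, r_squared


# Made-up reads-per-sample (N) and richness (S) values, for illustration only.
reads = [1.2e3, 8.5e3, 4.0e4, 2.1e5, 9.7e5]
richness = [95, 260, 510, 1300, 2600]
z, c, r2 = fit_power_law(reads, richness)
print(f"richness ~ {c:.2f} * N^{z:.2f} (r^2 = {r2:.2f})")
```

An exponent near 1 indicates nearly isometric scaling, as reported above for Nmax, while an exponent near 0.5 indicates that richness grows roughly with the square root of N.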
In our original study, we based our formal predictions of richness for the human gut, cow rumen, global ocean, and all of Earth on the lognormal model of biodiversity and independent data obtained from previously published studies. The lognormal model had been rederived for predicting richness in large microbial systems [6] and requires only two empirical inputs: N and Nmax. Because the values of these inputs are at the same inherent scale as the value of the prediction, this approach is not, as others have incorrectly claimed (i.e., [1, 7]), an extrapolation. Instead, our approach simply assumes that 1) estimates of N and Nmax from previous studies are reasonable and 2) the global distribution of abundance among microbial taxa is lognormal. After arriving at a prediction of ~10¹² microbial taxa on Earth, we then repeated the procedure using values of Nmax that were predicted via the dominance DAR. In both cases, we found that the data supported an estimate of ~10¹² species for Earth. In response to criticisms that were later raised [7], we reaffirmed the power of our approach by predicting global avian richness to within 6% of the modern estimate where, unlike microbes, the number of bird species is largely agreed upon [8]. A recent global analysis of bacteria from wastewater treatment plants lends further support to our approach and the prediction of 10¹² microbial taxa on Earth [9].
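To make the two-input calculation concrete, the Python sketch below works through one Preston-style formulation of a truncated lognormal. It is written under stated assumptions and is not a reproduction of the derivation in [6] or the code used in [2], which may differ in detail. The assumptions are: the number of taxa per abundance octave R follows S(R) = S0·exp(−a²R²); the distribution is truncated where S(R) = 1; and the rarest taxon has abundance Nmin = 1. Under these assumptions the modal abundance is √(Nmin·Nmax), the total abundance N fixes the spread parameter a, and richness follows as S = (√π/a)·exp[(a·Rmax)²] with Rmax = log2 √(Nmax/Nmin). The function names and input values are hypothetical, and the arithmetic is done in log space to avoid overflow for very large N.

```python
import math


def log_total_abundance(a, n_max, n_min=1.0):
    """Natural log of total abundance implied by spread parameter a."""
    r_max = math.log2(math.sqrt(n_max / n_min))
    c = math.log(2) / (2 * a)
    x = a * r_max - c
    y = a * r_max + c
    erf_sum = math.erfc(-x) - math.erfc(y)  # identical to erf(x) + erf(y)
    return (
        0.5 * math.log(math.pi * n_min * n_max)
        - math.log(2 * a)
        + (a * r_max) ** 2
        + c ** 2
        + math.log(erf_sum)
    )


def predict_richness(n_total, n_max, n_min=1.0):
    """Solve for a by bisection, then return (richness, a)."""
    target = math.log(n_total)
    lo, hi = 0.05, 5.0  # assumed search bracket for a
    if not (log_total_abundance(lo, n_max, n_min) < target
            < log_total_abundance(hi, n_max, n_min)):
        raise ValueError("N is outside the range reachable for this Nmax")
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if log_total_abundance(mid, n_max, n_min) < target:
            lo = mid
        else:
            hi = mid
    a = 0.5 * (lo + hi)
    r_max = math.log2(math.sqrt(n_max / n_min))
    log_s = 0.5 * math.log(math.pi) - math.log(a) + (a * r_max) ** 2
    return math.exp(log_s), a


# Hypothetical inputs for a small system (not values from any study).
s, a = predict_richness(n_total=1e9, n_max=1e6)
print(f"a = {a:.3f}, predicted richness = {s:.3g}")
```

Because both inputs describe the whole system, the prediction is made at the scale of the inputs, rather than by projecting a curve fitted to samples beyond the range of the data.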
Despite claiming to refute our prediction of global microbial richness, Louca et al. [1] neglected to apply any of our approaches to their data. While they did use a lognormal model, they did so in an inappropriate way. Instead of using a lognormal model that takes global-scale inputs (e.g., for N and Nmax) and returns global-scale richness, they fit a lognormal species abundance distribution (SAD) to randomized aggregations of GPC data and then integrated across their fitted SAD to arrive at an estimate of ~10⁶ taxa. There are three critical problems with this approach. First, random combinations of data generate entirely artificial SADs. Regardless of whether the resulting SAD is lognormal or not, the result of this exercise is disconnected from the original non-randomized, non-aggregated data. Second, even if permitted, the results would only be pertinent to the data the model was fitted to, not the under-sampled biosphere. Regardless of how great a value of N was achieved through haphazardly combining sample abundances, the fitted model is irrelevant outside the context of the data it was fitted to. Third, integrating across a fitted SAD can hardly yield an estimate of richness that is orders of magnitude greater than the number of species observed in the data. Consequently, it is not surprising that the estimated richness was of the same order of magnitude as the observed richness [1].
After reanalyzing the GPC dataset using the appropriate lognormal approach [2, 6], we arrived at a prediction for global richness of ~10¹⁴ microbial taxa. In regard to orders of magnitude, this value is closer to the 10¹² prediction of our previous study [2] than to the 10⁶ estimate from Louca et al. [1]. Consequently, the discrepancy between the census-based estimate of one million taxa [1] and the theoretically grounded prediction of one trillion taxa was not due to fundamental differences in the two comparably large data sets or even in the potential accuracy of the philosophically disparate approaches. Rather, the estimate that Earth’s microbiome comprises only 10⁶ taxa is the direct consequence of questionable assumptions and decisions that were made in the original analyses of the GPC data [1]. Given our findings and arguments in the current study, along with more recent estimates of massive microbiomes [9, 10] and the fact that immense regions of the planet remain unsampled, it is not beyond reason that Earth is home to 10¹² microbial taxa or, at least, orders of magnitude more than 10⁶.