Phylogenetic analysis based on single molecular markers
The 16S rRNA, recA and hisA genes are widely used as molecular markers to study BCC bacteria. To determine the impact of using these genes for identifying the BCC taxa, we performed phylogenetic analyses using available sequences from all 116 BCC strains. We included three outgroup strains, B. pseudomallei K96243, B. oklahomensis C6786, and B. glumae LMG 2196, in the analysis and added as many related type strains as possible (Additional file 3). Five strains were removed from the 16S rRNA-based phylogenetic analysis due to failure to extract the full-length 16S rRNA gene sequences. The hisA gene could not be extracted from one of the 116 strains (B. ubonensis MSMB0106), and this strain was excluded from the hisA analysis. All recA sequences from the 116 BCC strains were included for the construction of recA-based phylogeny. The phylogenetic trees based on 16S rRNA, recA and hisA are shown in Fig. 1.
The phylogenetic tree of 16S rRNA shows a poor and low bootstrap support overall (Fig. 1a). Many clades were condensed and nested to each other, especially in those lineages comprising B. cenocepacia. These observations revealed that the amount of phylogenetic signal presented by 16S rRNA is small, resulting in many short internal branches that are difficult to resolve. Pairwise comparison of sequences from BCC stains revealed that their identical levels were between 97.87 and 100%. Similarities of the BCC strains towards three outgroups were also in the ranges of 97.91–99.09%. Notably, some type strains of different BCC species share a nearly identical 16S rRNA sequence, which leads to their relationship being completely unresolved. For example, the type strains B. stabilis ATCC BAA-67T and B. pyrrocinia DSM 10685T share identical 16S rRNA gene sequences. In addition, B. lata 383T and B. contaminans LMG 23361T also share identical 16S rRNA gene sequences with five other strains, including those taxonomically annotated by Genebank as B. lata FL-7-5-30-S1-D0, B. contaminans 293K04B, B. contaminans FL-1-2-30-S1-D0, B. cenocepacia PC184 Mulks and B. cenocepacia VC12802. These results indicate that the 16S rRNA gene has low taxonomic resolution in the identification of strains with BCC, which is in line with other reports [13, 14, 47].
In contrast, the phylogenetic trees inferred from the recA and hisA genes were well resolved (Fig. 1b and c). The lineages were divided and grouped clearly with branches showing very strong bootstrap support. Similarity analysis demonstrated that the average identical levels of both genes were slightly larger than 95%. Specifically, the recA sequences of 116 BCC strains showed a range of 93.08–100% identity, while the hisA sequences of 115 strains ranged from 91.76 to 100%. Although these two trees revealed a better resolution than the 16S rRNA-based phylogeny, the trees surprisingly exhibit some extent of confusion and discordance. In both phylogenies, strain B. cepacia DWS 37UF10B-2 is far from the B. cepacia major clade represented by the type strain B. cepacia ATCC 25416T (Fig. 1b and c). In fact, B. cepacia DWS 37UF10B-2 is not clustered in any other taxa clade and forms a single branch. This phenomenon suggests that this strain probably represents a species different from the current BCC species in view of the recA and hisA phylogenies. In the recA-based phylogeny, strain B. cepacia GG4 formed a dependent branch different from the B. cepacia major clade with B. cepacia Bu72 (Fig. 1b). However, this strain formed a similar branch with B. cepacia JBK9 in the hisA-based phylogeny (Fig. 1c). The distinct independent branches suggest that the current taxonomic classification of B. cepacia may be problematic and need to be further divided. Similar situations can be observed in other BCC species, such as B. cenocepacia and B. ambifaria. Moreover, clade B. cenocepacia IIIA represented by B. cenocepacia J2315T and B. cenocepacia IIIB represented by B. cenocepacia AU 1054 were separated into two different clades in the recA-based phylogeny but shared a recently common ancestor in the hisA-based phylogeny (Fig. 1b and c). The cluster and topology difference suggest that the phylogeny based on different individual genes may conflict due to their different evolutionary history. All these contradictions also demonstrated that single gene-based phylogeny could hardly reconstruct the true phylogenetic relationship of BCC species. Despite the discordance exhibited by the recA and hisA trees, many BCC strains seem to be misidentified by a previous study according to the concordant result from the two phylogenetic trees. For example, B. multivorans FDAARGOS 496 clustered in the B. ubonensis clade is more likely to be B. ubonensis rather than B. multivorans. Two isolates previously identified as B. cenocepacia, DDS 22E-1 and DWS 37E-2, are more likely to be B. pseudomultivorans and B. latens, respectively. B. stabilis LA20W seems to be more similar to the type strain B. pyrrocinia than B. stabilis, which was also confirmed by another study [1]. Two additional strains, LO6 and DDS 7H-2, were previously identified as B. cepacia and probably belong to B. dolosa and B. cenocepacia, respectively. B. territorii A63 may be B. cepacia because it is more similar to type the strain B. cepacia ATCC 25416 in both trees (Fig. 1b and c).
Species tree based on genomes and comparison to MLSA
To overcome the defects of single molecular markers, we conducted a MLSA, which are widely used to differentiate BCC strains. Here, all seven loci (atpD, gltB, gyrB, recA, lepA, phaC and trpB) were successfully extracted for 114 of the 116 tested strains, and two strains at all but one (B. ubonensis MSMB0106 and MSMB0108). We still included these two strains for MSLA because the sequences from the other six genes were normally sufficient to identify a BCC isolate [11, 48]. The phylogenetic tree of seven concatenated housekeeping loci is shown in Fig. 2b.
To reconstruct the accurate genealogy of BCC species, we estimated a species tree using only single-copy orthologous genes. These genes are vertically inherited during evolution and thus preserved a more complete genealogical history. Trees inferred from the concatenation of single-copy protein sequences provide higher resolution than those obtained from a single phylogenetic-marker gene or multiple loci [32, 49, 50]. Studies have shown that using single-copy orthologous genes could minimize artifacts that result from the confounding effects of horizontal gene transfer [51, 52]. To recover the species tree of all BCC strains, a total of 1005 single-copy orthologous genes shared by all 116 strains were first identified. Then, we aligned each orthologous family and concatenated them to infer a maximum likelihood phylogeny (Fig. 2a).
The inferred species tree and the MLSA-based phylogenetic tree are displayed in Fig. 2. The MLSA tree and species tree were well resolved, and they revealed robust support with most branches having a maximum support value. The two trees showed a much more similar topology and consistent pattern with each other. Especially regarding the major clades and backbone branches, the MLSA phylogeny was completely congruent with the single-copy orthologous genes-based phylogeny. Although slight differences can be observed, strains of different BCC species were well organized and grouped regularly. Nine taxa, including B. ubonensis, B. pseudomultivorans, B. dolosa, B. multivorans, B. latens, B. vietnamiensis, B. metallica, B. contaminans and B. anthina, formed monophyletic groups with high support both in the MLSA phylogeny and species tree (Fig. 2). However, the cluster status of the two trees highlighted apparent taxonomic inconsistencies. First, the possible misidentification of the seven strains mentioned above was reconfirmed with the two phylogenies (Fig. 2). Second, the confusing circumstances were still observed between B. cepacia, B. cenocepacia and B. ambifaria, as well as between the B. pyrrocinia and B. stabilis groups.
BCC species demarcation based on dDDH and ANI
Our phylogenetic analyses revealed the confusing status of current BCC taxonomy and the possible misidentifications. Though the relationship of the different BCC clades was provided, they could not determine the species boundaries. As complementary methods, DDH and ANI values are widely used as a gold standard for the prokaryotic species definition. These two approaches evaluate the whole genomic similarity of bacteria, and dDDH is a fast and accurate replacement for the traditional laboratory-based DDH [25, 26, 53]. Here, we performed in silico dDDH and ANI analyses based on whole genome sequences and the results are listed in Additional file 4. We primarily used the pairwise dDDH values to cluster BCC species and their corresponding ANI values to cross reference and evaluate the congruence of the two approaches (Fig. 3). Although dDDH and ANI use different algorithms for the calculations, i.e., ANI evaluates the similarity of two genomes from the shared elements or fragments, while dDDH uses the sequence similarity of conserved regions between two genomes [54], the results were very consistent (Additional file 5). The ANI values were strongly correlated with the dDDH values (R2 = 0.9947). Based on the simulated exponential equation f(x) = 89.78*exp.(0.00107*x)-57.74*exp.(− 0.07575*x) for the entire dataset, the 70% dDDH threshold for species delineation corresponded to an ANI value of 96.48%, while ANI values of 95–96% corresponded to dDDH values of 59.193 to 66.29%, respectively (Additional file 5). This indicates that the traditional 70% dDDH threshold for BCC species demarcation is more stringent.
Pairwise dDDH and ANI values were calculated and are shown in Fig. 3 and Additional file 4. Previous research has shown that the dDDH species cutoff (70%) is generally more stringent than the ANI species cutoff (95%~ 96%) [55]. Considering the intricacies of BCC taxonomy, we used the 70% dDDH (0.0361 for the recommended GGDC setting) and upper boundary 96% ANI (0.04 for the genome distance) thresholds to reclassify the BCC strains for species delineation, which divided 116 strains into 38 clusters and 36 clusters, respectively. All strains belong to the unanimous clusters except for the strain NFIX32 and FERMP-21014. Specifically, strain NFIX32 and XXB-24 shared a high mutual ANI value of 96.4% and dDDH value of 69.8%; although the mutual dDDH value is slightly below 70% threshold, the high ANI value (> 96% ANI threshold) and the well-constructed monophyly in species tree indicated they should be merged into one cluster that represents a novel species in BCC (Fig. 2). Similarly, strain FERMP-21014 shared dDDH value of 69.7% with B. pyrrocinia DSM 10685T and were clustered into a different group according to the 70% dDDH threshold; however, they shared a high mutual ANI value of 96.4% (> 96% ANI threshold) and formed a highly supported clade in species tree (Fig. 2), which indicated that they should be merged into a single cluster. As a result, the 116 strains are reclassified into 36 clusters, labeled BCC01 through BCC36 (Fig. 4). Taking the type strain as the standard, we found that clusters BCC01, BCC04, BCC08, BCC14, BCC17, BCC22–23, BCC25, BCC29, BCC30, BCC32, BCC35 and BCC36 corresponded to the species B. ambifaria, B. catarinensis, B. cenocepacia, B. contaminans, B. lata, B. multivorans, B. paludis, B. puraquae, B. pyrrocinia, B. stabillis, B. cepacia, B. ubonensis and B. vietnamiensis well (Figs. 3 and 4). However, the taxonomy of the BCC strains was complicated and required further investigation.
Reclassification of the BCC based on species tree and dDDH/ANI
To better classify the BCC species and elucidate their relationship with the BCC clusters, we annotated 36 clusters on our species tree, which were estimated based on single copy orthologous genes (Fig. 5). Through this approach, we redefined the classification and clarifies all misidentifications of the BCC.
The reclassified taxonomy of BCC species is well consistent with the species tree topology with high support, suggesting that our core-genome species tree agrees with the pangenome-based taxonomy (i.e., dDDH/ANI-based clustering) and is suitable for comprehensive taxonomic analysis in the BCC (Fig. 5).
We found that the previously identified B. cepacia strains excluding the misidentified strains LO6 and DDS 7H-2 (Fig. 2) were distributed in six clusters (BCC07, BCC09–12 and BCC32). Cluster BCC32 represented B. cepacia as indicated by the presence of type strain ATCC 25416T. Notably, strain A63 in the BCC32 strains that was misidentified as B. territorii before and should be reclassified to B. cepacia. Clusters BCC07 and BCC09–12 each had a single member located far away from the type strain cluster BCC32 (Fig. 5). These five strains diverged so much that they may represented five separate novel species in the BCC rather than B. cepacia. This suggested that the current taxonomy of B. cepacia is not well elucidated, which signified a need for further division of previously identified B. cepacia species.
As for B. cenocepacia, cluster BCC08 should be the representation due to the presence of the type strain B. cenocepacia J2315T (=LMG 16656T). Again, we noted that strain DDS 7H-2 in cluster BCC08 was misidentified as B. cepacia and should be reclassified as B. cenocepacia on the basis of dDDH and ANI as well as phylogenetic analysis (Figs. 3 and 5). Specifically, BCC08 and BCC05 should represent B. cenocepacia genomovars IIIA and IIIB, respectively [23, 56, 57]. The dDDH and ANI estimations were above 79.8 and 97.7% among the cluster BCC05 and even above 89.2 and 98.8% within cluster BCC08, respectively. Between clusters BCC05 and BCC08, the dDDH values ranged from 59.7 to 60.9%, which is below the 70% threshold. In contrast, the ANI values ranged from 95.1 to 95.5%, which is near the threshold 95%~ 96% (Figs. 3 and 4). In the case of species delineation, dDDH is proven to be more discriminatory, as demonstrated in the study of Vibrio cidicii and Bradyrhizobium brasilense [58, 59]. Studies showed that when the species were compared against their closest relatives, ANI may be inconclusive, whereas the dDDH values were below the threshold [27]. Therefore, based on dDDH, clusters BCC05 and BCC08 should represent different but closely related species in the BCC. This finding indicated that the traditional B. cenocepacia genomovar IIIA represented classical B. cenocepacia and that genomovar IIIB should be divided as a novel species. Furthermore, BCC06, with only one strain formerly described as B. cenocepacia, should also be classified as a novel species because its dDDH and ANI estimations (56.5% and 94.4, respectively) with B. cenocepacia J2315T were both lower than the threshold for species delineation (Figs. 3 and 5).
The clade containing strains previously identified as B. stabillis and B. pyrrocinia was confused. Cluster BCC30 contained two B. stabillis strains, including the type strain ATCC BAA-67T. Strains in cluster BCC26 yielded dDDH values ≤45.8% and ANI values ≤92.5% with type strain B. pyrrocinia DSM 10685T and formed a separate branch in species tree (Figs. 3 and 5), indicating that they were previously misidentified and BCC26 should represent a putative novel species. Cluster BCC29 was represented by B. pyrrocinia DSM 10685T and contained another strain (LA20W) that was previously misidentified as B. stabillis, which is also supported by another study [1]. Core genome phylogeny and dDDH/ANI similarity suggests that strain previously named as B. stabillis FERMP-21014 in BCC29 also should be reclassified as B. pyrrocinia, because it shared a middle dDDH value of 49.4% and ANI value of 93.3% with B. stabillis ATCC BAA-67T that both lower than species delineation threshold. Cluster BCC27 (BCC27-mHSR5) in the clade represented a putative novel species that was previously misclassified as B. pyrrocinia as well. These results showed that traditional B. pyrrocinia species is more complicated than we thought and require further separation (Figs. 3, 4 and 5).
In cluster BCC35 representing B. ubonensis, strain FDAARGOS 496 was misclassified as B. multivorans (represented by BCC22). Cluster BCC34 contained only one strain that was formerly identified as B. ubonensis. However, this strain yielded a dDDH value of 59.2% and an ANI value of 95.7% with B. ubonensis MSMB22. Because dDDH is more discriminatory under such conditions, we believe that cluster BCC36 is likely to be a novel species that is closely related B. ubonensis (Fig. 3, 4 and 5).
In nine clusters (BCC03, BCC15–16, BCC20–21, BCC24, BCC28, BCC31 and BCC33), strains formed a monophyletic group, and their dDDH/ANI values satisfied the species delineation threshold. Despite a lack of a type strain, with the necessary reclassification of some isolates, these clusters probably represented B. anthina, B. diffusa, B. dolosa, B. latens, B. metallica, B. pseudomultivorans, B. seminalis, B. stagnalis and B. territorii, respectively.
Four clusters (BCC02, BCC13, BCC18 and BCC19), each formed by one strain, should be reclassified as four different putative new species.
Collectively, the current BCC species can be divided into 36 clusters. Twenty-two of the 36 clusters (BCC01, BCC03, BCC04, BCC08, BCC14–16, BCC17, BCC20–25, BCC28, BCC29, BCC30–33, BCC35 and BCC36) defined the current 22 known species with the appropriate correction of some strains. The other fourteen clusters (BCC02, BCC05–07, BCC09–13, BCC18–19, BCC26, BCC27 and BCC34) should be reclassified as 14 potential novel species (Fig. 5, Additional file 1).