Mycobacteriophage-drived diversification of Mycobacterium abscessus

Background Mycobacterium abscessus is an emerging opportunistic pathogen which diversity was acknowledged by the recent description of two subspecies accommodating M. abscessus, Mycobacterium bolletii and Mycobacterium massiliense isolates. Results Here, genome analysis found 1–8 prophage regions in 47/48 M. abscessus genomes ranging from small prophage-like elements to complete prophages. A total of 20,304 viral and phage proteins clustered into 853 orthologous groups. Phylogenomic and phylogenetic analyses based on prophage region homology found three main clusters corresponding to M. abscessus, M. bolletii and M. massiliense. Analysing 135 annotated Tape Measure Proteins found thirteen clusters and four singletons, suggesting that at least 17 mycobacteriophages had infected M. abscessus during its evolution. The evolutionary history of phages differed from that of their mycobacterial hosts. In particular, 33 phage-related proteins have been horizontally transferred within M. abscessus genomes. They comprise of an integrase, specific mycobacteriophage proteins, hypothetical proteins and DNA replication and metabolism proteins. Gene exchanges, loss and gains which occurred in M. abscessus genomes have been driven by several mycobacteriophages. Conclusions This analysis of phage-mycobacterium co-evolution suggests that mycobacteriophages are playing a key-role in the on-going diversification of M. abscessus. Reviewers This article was reviewed by Eric Bapteste, Patrick Forterre and Eugene Koonin.

There are a few data regarding mycobacteriophages in M. abscessus complex including a 81-kb prophage in the reference M. abscessus genome [1]. Also, we recently resolved the electron microscopy 3D structure of a M. bolletii mycobacteriophage named Araucaria [23]. However, the repertoire of M. abscessus phages and their evolutionary history within this bacterium is unknown and no systematic exploration for prophages and mycobacteriophages has been performed among additionally available sequenced M. abscessus genomes, leaving undetermined whether these initial observations were unique to some particular isolates or were representative of the M. abscessus species.
Here, exploiting genome sequence available for 48 M. abscessus mycobacteria by original bio-informatic analyses, we explored the repertoire of M. abscessus mycobacteriophages to gain insights into their evolution history compared to that of M. abscessus hosts.

Methods
Establishing the repertoire of M. abscessus phages The genomes of 48 M. abscessus mycobacteria available in June 2013 were downloaded from Genbank (Table 1, Additional file 1). As for 47 unfinished genomes, the contigs were reoriented based on the M. abscessus type strain genome (GenBank GCF_000069185.1) used as reference using MAUVE software [24]. The prophage regions were detected using PHAST software [25]. Protein sequences were predicted in all genomes using prodigal software [26] in order to normalize prediction. M. abscessus panproteome was annotated using BlastP search with a cutoff E-value < 0.001, percentage similarity > 30% and an alignment length > 50 amino-acids against a home-made database (including PHAST database, Mimivirus, Marseillevirus and additional mycobacteriophage proteins). We further analyzed the M. abscessus complex genomes for Clustered Regularly Interspaced Short Palindromic Repeats (CRISPRs) using CRISPRs finder program [27].
Determining M. abscessus phage phylogenies M. abscessus mycobacteria tree M. abscessus genomes were aligned using Muscle aligner implemented in Mauve software [24]. Mauve alignment generated an identity matrix which the identity scores range between 0 and 1, where 0 indicates that no identical homologous nucleotides were found, and 1 indicates that every homologous nucleotide was identical. This matrix was then used to construct M. abscessus split network using Neighbor-Net algorithm in the package SplitsTree4 [28].

M. abscessus phage tree
The annotated viral and phage proteins were classified using OrthoMCL software [29]. Only protein sequences > 50-amino acid residues were considered for further analyses. Homologous sequences were selected using the all-against-all BlastP algorithm [30] with an E value of <10 −5 . Then, clustering of the orthologous sequences was analyzed using the Markov Cluster algorithm [31]. The inflation index of 1.5 was used to regulate cluster tightness (granularity). The resulting orthologous groups were used to construct a whole-genome network using the Neighbor-Net algorithm based on a gene content matrix. The similarity between two species is defined as the number of phage genes in common divided by the total number of genes of the two species. [32,33]. Using this matrix, we constructed also a heatmap clusterization using R package [http://www.r-project.org/].

Detecting gene transfer events
The orthologous groups identified by OrthoMCL were submitted to PhyloPattern for the analysis and manipulation of phylogenetic trees (within the DAGOBAH framework) [34,35]. The M. abscessus tree was used as a reference to infer topologies in order to detect gene gain and lost as previously described [35]. The results were submitted to FIGENIX [36] for phylogenetic reconstruction within the DAGOBAH framework as previously described [35]. The output generated by FIGENIX was submitted to the multi-agent system DAGOBAH, in which horizontal gene transfer (HGT) events were detected using an in-house-built transfer filter called HGT agent, as previously described [35,36]. This filter uses PhyloPattern to annotate each internal duplication node of the tree with three tags, including the recipient species, the donor species and external species [34]. Then, it applies a special phyletic pattern and searches the gene tree to find recipient species that are closer to donor species than to other external species that would otherwise be placed between the recipient and donor species in the species tree. In other words, a "donor" subtree must contain only species of a specific group and not those from the "recipient" group and vice versa and there should be no common species between the donor and external groups. Using HGT agent, one can specify the name of the donor and recipient species according to their usage.

Results
The repertoire of M. abscessus phages Among 48 analysed genomes of M. abscessus, we found that only M. abscessus M154 encodes no prophage regions whereas the other 47 M. abscessus mycobacteria genomes harbour one to eight prophage regions. A total of 171 predicted prophage regions could be separated into four types i) intact prophages encoding structural proteins, lysis proteins, integration proteins and proteins necessary for replication and recombination ii) questionable prophages iii) incomplete prophage regions iv) small prophage-like elements ( Table 1, Additional file 1).
In order to estimate the number of phages infecting M. abscessus, a phylogenetic tree was constructed based on Tape Measure Proteins (TMP) (Figure 1). The TMP was selected because it is typically the longest gene in mycobacteriophage genomes and because regions within the TMP gene are conserved [37].We could annotate 135 TMPs which clustered into thirteen groups and four singletons using orthoMCL. The TMP-based phylogenetic

M. abscessus phage tree
The split network based on prophage gene content (i.e., the presence or absence of orthologous proteins) showed an organization that differed from that of the M. abscessus tree ( Figure 4B). The organization of splits in the M. abscessus phage split network differs from M. abscessus split network. M. massiliense type strain clusters with M. abscessus, while M. abscessus 4S strain clusters with M. massiliense strains 2B and M159. The phage split network shows clusterization of M. massiliense 1S strains with M. bolletii while M. abscessus 6G strains forming a different cluster from M. abscessus. This phylogenomic analysis showed that M. abscessus viral and phage gene repertoires have different evolutionary histories. Also, a heatmap clusterization was constructed using a matrix of presence/absence of orthologous proteins. The heatmap clusterization showed a species organization different from that of the M. abscessus tree suggesting that M. abscessus may have been infected by several phages during their evolution ( Figure 5). Likewise, using the tree based on whole-genome content, individual phylogenetic analysis for the different orthologous proteins groups revealed many topologies that differed from that of the M. abscessus tree. These results suggested that gene loss and HGT are relevant for all gene functions. Interestingly, Araucaria TMP clusters with M. massiliense strains 1S and M172, suggesting that mycobacteriophages infecting M. massiliense mycobacteria may have features similar to Araucaria.

Detection of HGT cases
Among the 853 orthologous groups, phylogenetic trees were successfully reconstructed for 213 (25%) of the cases, 156 transfer events (Additional file 3) were detected out of which 33 cases were associated with strong boostrap support for HGT. A 45.45% proportion of the transferred proteins are homologous to mycobacteriophages proteins, 12.12% homologous to viral proteins, 21.21% to proteins of phages infecting gram-positive bacteria and 21.21% to proteins of phages infecting gram-negative bacteria. The probable sources are environmental bacteria in 33 cases, M. abscessus in 13 cases, M. bolletii in 6 cases and M. massiliense in 4 cases.

Discussion
Analysing 171 prophage regions in 47 M. abscessus complex genomes indicated that M. abscessus complex has been infected by at least 17 different mycobacteriophages, including Araucaria, the sole available mycobacteriophage that we recently isolated from M. bolletii [23]. Noteworthy, M. abscessus M154, an isolate from Malaysia [38] is the only strain lacking any evidence for phage. Our previous analysis revealed no unique genes in this strain [39], which nevertheless encodes four possible CRISPRs, one cmr1 family and one cmr4 family, with potential immunity against phage infection [40]. Alternatively, no phage was detected in this strain because of database limitations. For example, Araucaria did not yield significant homology with any other M. abscessus complex phage, as confirmed by phylogenetic studies using whole viral and phage proteins clusterization and TMP protein sequence based tree. Sequencing additional mycobacteriophages may next reveal prophage regions in M. abscessus M154.  We further observed that some M. abscessus phage proteins had homology to other mycobacteriophages and to phages infecting environmental bacteria. M. abscessus complex mycobacteria are opportunistic pathogens, but these observations suggest that environments, rather than host microbiota, are sources of evolution for M. abscessus complex mycobacteriophages. Noteworthy, potential sources are living in amoeba (Additional file 4) where M. abscessus complex mycobacteria including M. massiliense [22] and M. bolletii [41] are also residing. Amoeba are a place for DNA exchanges between sympatric organisms and the amoeba themselves [42][43][44][45][46][47][48]. Data here reported suggest that amoeba are a likely place for mycobacteriophage exchanges and therefore, may be a place to look at for the discovery of new mycobacteriophages.
Accordingly, a striking feature of M. abscessus complex phage genomes is their pervasive mosaicism, a previously reported hallmark of mycobacteriophages [49,50]. Our phylogenomic and phylogenetic analyses revealed a different split network topology between the hosts and the phages. This probably reflects reciprocal genome evolution through a dynamic co-evolutionary process [51]. M. abscessus complex was infected by at least 17 phages and these infections contributed to differentiate M. abscessus complex into several clusters of mycobacteria. Widespread occurrence of phage sequences in almost all studied M. abscessus complex isolates suggests that the rate of prophage invasion is faster than the rate of mutation, implying rapid evolution of M. abscessus. Also, in M. abscessus complex a total of 6/33 (19%) gene transfers (See figure on previous page.) Figure 4 M. abscessus phylogeny. A-M. abscessus species split network. B-M. abscessus phage split network based on phage gene content matrix. Every edge is associated with a split of the taxa, but there may be a number of parallel edges associated with each split. The edges separate taxa on one side of the split from the taxa on the other side of the split. The length of an edge in the network is proportional to the weight of the associated split. This is analogous to the length of a branch in a phylogenetic tree. occurred between a set of donor species (at least two donors) and a recipient species and 27/33 (81%) gene transfers occurred between a single species donor and a recipient species, clearly indicating it is an on-going process.

Conclusions
Excluding the prophage-free strain M154, phages account for only 6.7-9.6% of M. abscessus complex genomic content, but they profoundly impact their hosts, participating to their on-going diversification. The general topic of this research and the reported findings are very relevant for Biology Direct, however the current version of this MS is certainly not yet ready for publication.

Reviewers' comments
Even though I am willing to trust the authors about their main conclusions, I strongly recommend major revisions, because it is currently hardly possible to evaluate most of the evidence on which they based their observations.
In short: -The trees supporting lateral gene transfers should be presented in an organized fashion in a Supp. Mat.
Authors' response: The reviewer is right, all 605 trees are now provided as Additional File and trees supporting LGT are highlighted.
-Many current figures are not of sufficient quality to be printed in a journal (gene/taxon names are impossible to read, etc.) Authors' response: Authors improved the quality of documents.
-The methodology used for the tree reconstruction is not sufficiently detailed: crucial information such as the number of positions retained or the substitution models used are lacking.
-Many figures presented in the text are underinterpreted, and not critically discussed.
Elements requiring significantly more details: p.4. l. 86: The authors report that 'M. abscessus proteomes were aligned using Mauve software [24]'. I am not familiar with ProgressiveMauve, but does this software really align proteomes, or is it rather a tool to align genomes based on their content and gene order? If so, the next sentence in the MS is hard to understand Authors' response: M. abscessus genomes were aligned using Mauve software. Progressive Mauve uses Muscle or clustlW to perform alignment. Here we used Muscle. The authors corrected and explained the methodology (P.4 line 86).
'Then M. abscessus tree was constructed using Neighbor-Net algorithm in the package SplitsTree'. What distance matrix was provided to Splitstree? What was this distance reflecting?
Authors' response: Mauve alignment generates an identity matrix file which calculates the identity score range between 0 and 1; 0 indicates no homologous nucleotides and 1 indicates that every homologous nucleotide was identical. The authors explained this part in the text (P.5 line 88).
If the proteomes alignments evoked above were 'classic' protein alignments, then what happened to poorly aligned sites? How many positions were retained? For how many proteins? The material and methods must be much more detailed for the reader to really understand the analysis and the results. Please note that this criticism also applies for the trees that show some LGT.
Authors' response: The authors corrected that M. abscessus genomes was aligned not proteomes in the text (P.4 line 86).
p.5. l, 95-96: The analysis described here faces a potential pitfall if 'presence' and 'absence' were treated in a symmetric fashion, especially if there were lots of 'absences' in this matrix. While 'presences' can be used to group genomes sharing some features, more caution is required in the use of shared 'absences'.
If the groupings of genomes are firstly caused by the lack of shared features, then these groupings can be very artefactual (sharing 'losses' is different from sharing nothing. The potential problem here is that two genomes that have positively nothing, or not much in common, may still be grouped due to their lack of genes, while sometimes it is a better idea not to group genomes that share nothing in common!). How did the authors address this potential issue?
Authors' response: Here we constructed a matrix based on the similarity between two species which is defined as the number of genes that they have in common divided by their total number of genes (P.5 line l00-104).
p.5, l. 98-108: Incongruence between trees is considered to be synonym of lateral gene transfer, and the possibility of tree reconstruction artefacts is not discussed. Since the tree reconstruction methods are poorly described, it is difficult to evaluate this part of the work.
Authors' response: The methods of tree reconstruction are now described in the materials and methods section. (P.5 line 105-124).
p.6., l. 119: The authors write that 'a phylogenetic tree was constructed based on Tape Measure Proteins (TMP)' What are 'Tape Measure Proteins'? Why this marker? What methods/positions/models were used to reconstruct this tree?
Authors' response: The TMP was selected because it is typically the longest gene in mycobacteriophage genomes and because regions within TMP gene are conserved (P.6 line 125).
p.7, l. 156. Figure  Overall, the exploitation of these 2 figures is a bit vague. The authors only write about it that: ' A phylogenomic tree based on prophage gene content (i.e., the presence or absence of orthologous proteins) showed an organization that differed from that of the M. abscessus tree ( Figure 4B).' Please, increase the descriptions of what these differences are (or use a metrics to compare these two split networks).
Authors' response: The authors performed a better figure quality and more description in the text (P.8 line 186-190).
p.8, l.165: Figure 5 is neither described nor exploited in a way that allows to make sense of the main text about it. Please, give more time to a careful critical description of the figure.
p.8, l. 172: The 214 phylogenetic trees mentioned here (reconstructed how, please precise) should be logically classified and presented as Supp. Mat, or made available somewhere. Currently, it is simply impossible to review this part of the MS without being able to look at the evidence.
Authors' response: The phylogenetic trees are provided in Additional file 4.
p.10. l-210-211: ' Also, M. abscessus complex phages further shuttled gene transferts, 16/29 (55%) of which occurred between different M. abscessus complex clusters but 13/29 (45%) of which occurred between strains of the same cluster, clearly indicating it is an on-going process'. Where do these numbers come from? How were they obtained? Where is the evidence?
Authors' response: The authors clarified this point in the text (P.10 line 241-246).
p.19, l. 379: Figure 2 legend: What is 'an interacting map'? How is one supposed to read such a map? What are the nodes? What are the edges? More descriptions are required.
There are also some minor typos/issues: p.3, l. 56 'three different organisms': do you really mean organisms, or species, or strains here?
Authors' response: We mean organisms. p.4, l.86 (and in some other places in the text): the authors refer to the splitsnetwork as the 'M. abscessus tree'. Elswhere, as in the legend of Figure 4, they call this type of graphs 'network trees'. This wording is confusing. Is it a network or is it a tree? To me, each of this graph should be called a split network.
Authors' response: The authors corrected network tree to split network (P.5 line 91).
p.5., l. 95: Likewise, what the authors call 'a wholegenome phylogenetic tree' looks very much like a network.
Authors' response: The authors corrected phylogenetic tree to network (P.5 line 100). Figure 2: 'Few exceptions were observed: strains M139 and 1S_51_0915 showed prophage region homology with the M. abscessus cluster while M. bolletii M24 showed prophage region homology with the M. massiliense cluster'. Please help the reader more to see this, it is impossible to guess where the strains discussed here are in this map, add some arrows. Figure 3 is likely too large in its current format for publication.
Authors' response: The authors improved the quality of Figures. p.6, l.136: 'Interestingly". Why? Please explain why it is interesting.
Authors' response: The authors clarified this point (P.7 line 137). Authors' response: The authors corrected this point. p.34. Table three: what is the difference between a 'parent species' and a 'donor species?' ('donor' takes only 1 'n'). What does the column 'Nb Duplications before parent' refers to?
Authors' response: The parent species is the node which contains the two sub-trees: recipient and donor species and the number of duplication before parent refers to the number of gene duplication before the HGT event.
Second revision requested be the Reviewer 1: Dr. Eric Bapteste (UPMC, Institut de Biologie Paris Seine, France) The revised version of the MS by Sassi et al. is improved. I am still uncertain whether the quantification of HGT using gene trees means much biologically. I suppose this is because I doubt that trees alone convey that kind of evidence anyway. I am more convinced by studies of synteny showing prophages with similar genes inserted at various positions of Mycobacterium genomes. I remain also unconvinced (to be honest somewhat skeptical) about the quality of the figures. I suppose this latter possible issue would be addressed by the publisher then.
p.9. l. 195. The content of Additional file 4 is useful, also not yet perfect for its purpose: both trees with and without candidate HGT are present in this file (i.e. there are around 214 trees in it, not 156 trees), making it difficult to evaluate the trees with HGT only. The legend for this file (p.27, l.614) is confusing as it seems to announce 214 trees with candidate HGT. When one looks at some of these tree files however, one finds the following associated description: 'None horizontal gene transfert event'. (So some of these trees should be removed from Additional file 4, and the English of this final description could be improved).
Authors' response: In the additional file 4 we changed the report by trees as figures and it presents 75 trees representing a total of 156 transfer events. The HGT is represented by yellow squares in the figures. The legend is corrected accordingly to the reviewer's comment.
p.9. l.199-201: When discussing the sources of HGT, I find it strange that the category 'unknown' is not quantified, nor discussed. In the few trees I have looked at from additional file 4, 'unknown' was the major HGT donor… Authors' response: The category "unknown" in the report files represents the taxonomy used in the project which is not that same used by NCBI. As the species name is too long we removed it and only the strain name is presented, meaning that the program we run does not recognize the names we gave in data. Here we present as additional file 4 the figures of the trees to avoid any confusion.
p.10. l. 227. 'a different topology' , sure, but a topology of what? I suppose of split networks. Indeed, the networks look different, but note that their difference is still not assessed by any formal distance computed between them, nor by any statistical test. Some might find that this aspect of the study would have deserved to be improved.
Authors' response: Indeed a different split network topology. This is corrected in the text Page10, line 227.
All the minor comments were corrected.  Tables 1 and 2 Table  (how many genomes  There are several minor points Lane 123; define prophage region homology. Authors' response: This methodology section has been re-written (Lines 112-123).
Lane 128: 242,067 proteins (which proteins?) in general be more precise Authors' response: The authors clarified this point (P.7 line 139).
Lane 140 and elsewhere, gram-positive bacteria is no more a valid taxonomic grouping, better to indicate Firmicutes Authors' response: The authors corrected this point (P.7 line 151).
Lane 193: gram-negative and gram positive bacteria! i.e. all bacteria except mycobacteria?? So environmental bacteria is sufficient.
Authors' response: The authors corrected this point (P.9 line 206).  The importance of bacteriophage contribution to the evolution of bacterial genomes is increasingly recognized. Here Sassi and coworkers conclude that bacteriophages drive the evolution of the Mycobacterium abscessus complex. I find this appealing and credible idea but fail to see how the data presented in the manuscript, even assuming that the identification of prophages is accurate (no specific evidence of that is provided), support such a strong conclusion. I can agree that the authors demonstrate differences in the prophage content between the bacteria in the complex. Then, I suppose, the argument would be that the trees of the bacteria and phages are different, suggesting that there has been some exchange of prophages and individual genes. The robustness of the trees is a concern because the trees for phages can be notoriously difficult. But, even assuming they are correct, this argument seems to fall far short of the ambitious claim of the paper. It could be very helpful if the authors made an effort to carefully present their logic.