A family of GFP-like proteins with different spectral properties in lancelet Branchiostoma floridae

Background: Members of the green fluorescent protein (GFP) family share sequence similarity and the 11-stranded β-barrel fold. Fluorescence or bright coloration, observed in many members of this family, is enabled by the intrinsic properties of the polypeptide chain itself, without the requirement for cofactors. Amino acid sequence of fluorescent proteins can be altered by genetic engineering to produce variants with different spectral properties, suitable for direct visualization of molecular and cellular processes. Naturally occurring GFP-like proteins include fluorescent proteins from cnidarians of the Hydrozoa and Anthozoa classes, and from copepods of the Pontellidae family, as well as non-fluorescent proteins from Anthozoa. Recently, an mRNA encoding a fluorescent GFP-like protein AmphiGFP, related to GFP from Pontellidae, has been isolated from the lancelet Branchiostoma floridae, a cephalochordate (Deheyn et al., Biol Bull, 2007 213:95).
Results: We report that the nearly-completely sequenced genome of Branchiostoma floridae encodes at least 12 GFP-like proteins. The evidence for expression of six of these genes can be found in the EST databases. Phylogenetic analysis suggests that a gene encoding a GFP-like protein was present in the common ancestor of Cnidaria and Bilateria. We synthesized and expressed two of the lancelet GFP-like proteins in mammalian cells and in bacteria. One protein, which we called LanFP1, exhibits bright green fluorescence in both systems. The other protein, LanFP2, is identical to AmphiGFP in amino acid sequence and is moderately fluorescent. Live imaging of the adult animals revealed bright green fluorescence at the anterior end and in the basal region of the oral cirri, as well as weaker green signals throughout the body of the animal. In addition, red fluorescence was observed in oral cirri, extending to the tips.
Conclusion: GFP-like proteins may have been present in the primitive Metazoa. Their evolutionary history includes losses in several metazoan lineages and expansion in cephalochordates that resulted in the largest repertoire of GFP-like proteins known thus far in a single organism. Lancelet expresses several of its GFP-like proteins, which appear to have distinct spectral properties and perhaps diverse functions.
Reviewers: This article was reviewed by Shamil Sunyaev, Mikhail Matz (nominated by I. King Jordan) and L. Aravind.


Background
Genetically encoded fluorescent probes are indispensable tools for in vivo imaging of molecules, cells and whole organisms. Among the fluorescent proteins that have been developed as reporters, members of the GFP family, with the green fluorescent protein of the hydroid Aequorea victoria as its founding member [1,2], are unique in that their chromophore/fluorophore is formed solely from the polypeptide chain itself, and their maturation as well as light emission does not require any cofactors other than oxygen. Members of the GFP family are found in many cnidarians, where they display a range of excitation and emission spectra, from fluorescence to bright coloration in the visible light. GFPs have proved to be extraordinarily amenable to genetic manipulation: some of the useful traits of the engineered GFP derivatives include shifts in the maxima of excitation and/or emission; timed responses, such as kindling or color change after excitation; photoactivation; assembly of functional monomers from the fragments of the molecule; and others [3][4][5]. Despite all this knowledge about structure-function relationships in the GFP family, there is nonetheless considerable interest in naturally occurring GFP-like proteins with novel properties.
The evolutionary history of the GFP family and its relationship to other proteins are not well understood. In addition to Cnidaria, fluorescent proteins with significant sequence similarity to GFPs were found several years ago in marine crustaceans of the Pontellidae family [6]. In an unrelated work, the structure of one domain (G2) within nidogen, a protein component of basement membranes in various groups of metazoan animals, was found to share spatial similarity with GFP and GFP-like proteins [7]. The common fold, described as an 11-stranded β-barrel, is so close in the nidogen G2 domain and in GFPs that the superimposition of these structures is possible with a root mean square deviation of 2.5 Å between 195 carbon alpha atoms (out of approximately 225 in GFP). The lack of discernible sequence similarity between G2 domains and GFPs, however, leaves open the evolutionary questions, i.e., whether the two β-barrels descend from a common ancestral gene, and, if such a common ancestor existed, what its function was and in which organism it resided.
In this work, we present the results of our analysis of a family of GFP-like proteins from a cephalochordate, the lancelet Branchiostoma floridae. The imaging of adult animals reveals anatomically discrete areas of green and red fluorescence surrounding the oral aperture. The nearly completely sequenced genome of B. floridae encodes at least 12 GFP-like proteins, which appear to have arisen by duplication after separation of the protostome and deuterostome lineages. Several of these genes are represented in the public EST libraries obtained from eggs and from different stages of embryo development. To assess the utility of these genes as reporters, we expressed the humanized versions of two of these proteins in a mammalian system.

Results and Discussion
We were interested in the evolutionary provenance of the β-barrel proteins and searched for the homologs of various β-barrels in the EST databases. When GFP and GFP-like proteins were used as queries in the TBLASTN searches of the non-human, non-mouse EST database at NCBI, we detected a few trivial matches with perfect identity to several cloned GFPs, apparently originating from the unfiltered fragments of the cloning vectors (data not shown). To our surprise, however, the overwhelming majority of the matches with significant sequence similarity were from the EST libraries corresponding to various developmental stages of the lancelet B. floridae. The lancelet ESTs represented evolutionarily distinct sequences not found in other organisms. Translations of these ESTs were 30-40% identical to copepod GFPs, their nearest database homologs, and were more distant from the cnidarian GFPs and GFP-like proteins. At this level of similarity, nonetheless, the matches had high statistical significance (E-value < 10⁻¹⁰), and the residues that are involved in the chromophore formation in other GFP-like proteins appeared to be well conserved, suggesting that the lancelet GFP-like sequences may represent fluorescent proteins.
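The significance cutoff mentioned above can be applied mechanically to tabular search output. The following is a minimal sketch, assuming the hits were saved in the default 12-column tabular BLAST layout; the function name `filter_hits` and the three example hit lines are hypothetical and are not data from this study.

```python
# Minimal sketch: keep tabular BLAST hits below an E-value cutoff.
# Assumes the default 12-column tabular layout:
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore

def filter_hits(lines, max_evalue=1e-10):
    """Return (subject_id, percent_identity, evalue) for hits with evalue < max_evalue."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        subject_id = fields[1]
        identity = float(fields[2])
        evalue = float(fields[10])
        if evalue < max_evalue:
            kept.append((subject_id, identity, evalue))
    return kept

# Hypothetical hit lines, invented for illustration only.
EXAMPLE_HITS = [
    "query1\thitA\t35.2\t210\t130\t4\t5\t214\t12\t641\t3e-25\t110",
    "query1\thitB\t28.0\t90\t60\t3\t10\t99\t40\t309\t0.002\t38.1",
    "query1\thitC\t39.8\t205\t118\t3\t8\t212\t30\t644\t1e-31\t128",
]

significant = filter_hits(EXAMPLE_HITS)
```

In this invented example, only the first and third hits pass the cutoff; the second is discarded because its E-value is far above 10⁻¹⁰.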
A more detailed analysis of the ESTs indicated that there may be several distinct, closely related members of the GFP family in lancelets, and searches in the DNA trace archive of the B. floridae genome suggested that it may encode additional homologs of GFPs not represented in the EST databases. Most recently, we took advantage of the DOE Joint Genome Institute release v.1.0 of the annotated genome assembly of B. floridae and collected a nonredundant set of genes encoding full-length homologs of GFP (Table 1 and Figure 1). In addition, we examined the first release of the genome assembly of cnidarian Nematostella vectensis, the only other sequenced genome that is known to encode GFP-like proteins.
Comparative analysis of genomes and of protein sequences paints a picture of an ancient origin of the GFP-family proteins and their evolution by vertical descent followed by frequent gene loss (Figure 1, Table 1, and Additional file 1). The phylogenetic tree inferred from the aligned protein sequences indicates that all cnidarian GFPs form one well-supported clade in the tree, all copepod GFPs form another, and the set of lancelet GFP-like proteins forms the third clade (Additional file 1). The branching order in the midpoint-rooted tree follows the Metazoan phylogeny, with copepod and lancelet clades being closest to each other. This is compatible with the presence of an ancestral GFP in the common ancestor of Metazoa, followed by loss of this gene in some of the present-day species and lineage-specific expansion in the others.
A further indication of the ancient ancestry of GFPs in Metazoa comes from the comparison of the intron positions in cnidarian and cephalochordate genes. GFP genes in B. floridae and N. vectensis appear to share at least one intron in the homologous position of codon 32 (Figure 1). The probability of independent insertion of an intron into a homologous site within the orthologous genes in two lineages of eukaryotes is thought to be less than 20% [8,9] and may be less than 10% within Metazoa [9], suggesting that the intron in this codon is much more likely to be ancestral than convergently inserted. The position of this conserved intron close to the 5' termini of the GFP genes is compatible with the recently documented 5'-to-3' bias towards retention of ancestral introns and the opposite bias towards intron gain and loss in multicellular organisms [10]. Moreover, the presence of another intron in the start codon of almost all genes in lancelet and of two genes in corals also supports this view, though the sequence conservation at the beginning of the coding region is lower and the alignment there is more ambiguous.
Six of the genes encoded by the genome of B. floridae are represented in the EST libraries made from eggs, various stages of embryo development and adult animals. We wondered whether any of these genes may encode proteins that would confer fluorescence to the animals.
Figure 1. Sequence conservation in the GFP family. Multiple alignment of protein sequences of GFP-like proteins from cnidarians, copepods, and lancelet. Protein sequences of gene products predicted from the genome assembly of B. floridae were clustered at the 90% identity cutoff, and one representative per cluster that did not contain internal deletions was included into the alignment (see Table 1 for details). Identifier of each sequence in JGI genome browser or in GenBank is given after each sequence. The consensus secondary structure derived from multiple known three-dimensional structures of GFP-like proteins is shown below the alignment. Red type indicates conserved small or kinky side chains (G, S, A, or P), yellow shading indicates conserved bulky hydrophobic residues (I, L, V, M, F, Y, or W), blue type indicates conserved acidic or amidic residues (D, E, N, or Q), blue shading indicates conserved basic residues (K or R), purple type with gray shading indicates the tripeptide directly participating in rearrangement that leads to the chromophore formation, and white type on black indicates the amino acid whose codon contains an intron in the known genome sequence.
Reports of yellow or yellow-green fluorescence in various tissues of lancelets, most notably in fixed neural cells, have been published in the past, but were attributed to fluorescence of small molecules, such as retinol derivatives or monoamines [11]. To get a more up-to-date picture of fluorescence of live animals, we surveyed the animals in captivity. Adult lancelets B. floridae gathered off the Florida coast could be housed successfully for almost a year in a salt-water aquarium. We observed typical burrowing behaviour: usually only the anterior end of the animal could be seen protruding above the surface of the substrate, but was quickly withdrawn on disturbance.
Using both widefield and confocal fluorescence microscopy, we observed green fluorescence, distributed diffusely around most of the body but most clearly pronounced in the anterior part, in almost every animal (Figure 2A). The fluorescent signal was concentrated in the oral cirri [12], the semi-circle of tentacles that surrounds the buccal cavity of the animal (Figure 2A-C). The cirri overlay a web-like sheet of integument and specialized muscles, and together with the oral hood they form the pre-oral structures that feed into the long inner vestibule. The strongest green fluorescence signal was emitted by the linked L-shaped structures that appear to support each tentacle (Figure 2B-C). This fluorescence pattern corresponds to the position of cirral skeletal rods [12], and it appears that fluorescent cells are located mostly within (or sheathed around) the horizontal part and the lower portion of the vertical part of the "L" formed by each skeletal rod. These observations are in broad agreement with imaging results of Deheyn and co-authors [13], except that the bases of skeletal rods, which in their hands displayed fluorescence only in Asian species of Branchiostoma, are brightly fluorescent in the B. floridae adults that we studied. Interestingly, we also observed a clear red fluorescence signal that was similarly concentrated around the oral cirri, and strongly overlapped with the areas of green fluorescence (Figure 2A-C and 2E). Most of the red signal was restricted to the cirral skeletal ring and oral cirrus just above the cirral skeletal rods, but some of it extended to the distal tips of the oral cirri (Figure 2C).
Analysis of full spectral information at each pixel in the image indicates that there was little pixel-to-pixel overlap between the areas associated with the green and red fluorescence, suggesting spatial segregation of red and green emitters, perhaps even cell-specific expression of each (though higher optical resolution than our 3 μm would be needed for a definitive proof of this).
In order to further study the properties of lancelet GFP-like genes and to relate them to the fluorescence observed in live animals, we synthesized cDNAs for two of the lancelet genes that corresponded to the largest number of database ESTs, with codon frequencies optimized for expression in the mammalian system, and transfected them into HEK 293 cells (Figure 2D). Cells expressing each construct emitted green light when excited with blue light. LanFP1 and LanFP2 were also expressed in E. coli and affinity-purified: both preparations exhibited maximal absorption at 500 nm, and the emission spectra of both proteins were similar (λmax = 510 nm for LanFP1, and 516 nm for LanFP2). Molecular brightness of the two fluorescent proteins in transfected HEK 293 cells and in purified samples was, however, significantly different, with LanFP1 much brighter than LanFP2. Analyses of the protein preparations purified from E. coli showed that the quantum efficiency of LanFP1 was two orders of magnitude higher than that of LanFP2, and that its brightness was about half that of the potent Venus derivative of YFP (Additional file 2).
We exploited the fact that LanFP1 and LanFP2 have slightly different excitation and emission spectra and compared them with endogenous fluorescent signals from the cirri. The emission spectrum from B. floridae has a maximum at 521 nm, which is much closer to that of LanFP2-expressing HEK-293 cells and of affinity-purified LanFP2 than to the corresponding values of LanFP1 (Figure 2F).
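The comparison of emission maxima can be made explicit with a few lines of arithmetic. This is only a sketch using the three peak positions quoted in the text; the helper name `closest_protein` is hypothetical, and peak position alone is a crude proxy for agreement between full spectra.

```python
# Emission maxima (nm) quoted in the text.
EMISSION_MAX_NM = {"LanFP1": 510, "LanFP2": 516}
ANIMAL_EMISSION_MAX_NM = 521

def closest_protein(reference_nm, candidates):
    """Return the candidate whose emission maximum is nearest to reference_nm."""
    return min(candidates, key=lambda name: abs(candidates[name] - reference_nm))

# Absolute offset of each protein's peak from the peak measured in the animal.
offsets = {
    name: abs(peak - ANIMAL_EMISSION_MAX_NM)
    for name, peak in EMISSION_MAX_NM.items()
}
best_match = closest_protein(ANIMAL_EMISSION_MAX_NM, EMISSION_MAX_NM)
```

With these values the offsets are 11 nm for LanFP1 and 5 nm for LanFP2, so LanFP2 is the nearer match.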
Analysis of the EST libraries indicates that transcripts encoding both LanFP1 and LanFP2 are abundant in eggs and embryos, though thus far no matching ESTs have been found in the libraries from adult animals. We did not detect any fluorescent signals in the red spectrum in mammalian cells singly or doubly transfected with LanFP1 or LanFP2.
Days before the submission of this manuscript, 21 cDNA sequences encoding proteins of the GFP family from a European species, B. lanceolatum, were released by GenBank. They apparently include naturally occurring and engineered cDNAs, most of which display either green or red fluorescence [14]. There is no one-to-one correspondence between these sequences and the products of GFP-like genes encoded by the B. floridae genome.

Conclusion
In this work, we identified a family of genes encoding GFP-like proteins in lancelet B. floridae and expressed two of them in bacterial and mammalian cell cultures. Both proteins, LanFP1 and LanFP2, exhibit green fluorescence of different brightness and distinct spectral properties when expressed in mammalian cells, and they, along with at least four other GFP-like proteins, appear to be expressed at various stages of lancelet development as judged from the analysis of the EST libraries. Moreover, adult lancelets display overlapping but distinct patterns of green and red fluorescence, with green fluorescence shifted towards more basal regions of oral cirri and red fluorescence more prominent at the cirri tips.
Recently, Deheyn and co-authors reported the sequence of LanFP2 and the pattern of green fluorescence in lancelet eggs, larvae and adults [1], and Israelsson [14] reported a similarly diverse family of cDNAs expressing red and green fluorescent proteins in B. lanceolatum. It remains to be investigated which of the GFP-like proteins encoded by the Florida lancelet genome are responsible for the fluorescence observed by these authors and by us. Intriguingly, LanFP1 appears to be a bright GFP, whereas LanFP2 fluoresces only dimly in vitro and in heterologous expression systems. At the same time, the emission spectrum of LanFP2 is the closest match to the emission of the bright fluorescence observed in live adult animals. Gene-specific probes now available for individual LanFPs will help to establish the identity of green and red emitters at the various stages of lancelet development.
Phylogenetic inference and analysis of conserved intron positions allow us to tentatively place a GFP-like protein into the common ancestor of metazoan animals. The relationship between GFPs and structurally similar G2 domains, however, remains unclear. The similarity between the spatial structures of representatives of the two families makes them amenable to structure-based superimposition, and phylogenetic trees based on such alignments have been presented [13]. These trees, however, do not in themselves prove sequence homology. Using HHpred, which is one of the most sensitive sequence comparison programs and does not rely explicitly on spatial information [15], and the sequence model of the G2 domain family derived from pfam07474, we observed several matches to cnidarian fluorescent proteins, with borderline HHpred P-values from 0.0017 to 0.05. If this borderline sequence similarity, with additional consideration of the same number of β-strands in the two families, is taken as an indication of their common ancestry, then the origin of the nidogen G2 domain appears to postdate the emergence of GFPs, as G2 domain homologs are detected in nematodes and insects but do not appear in primitive metazoans, such as the nearly completely sequenced cnidarian N. vectensis and the extensively covered flatworm Schmidtea mediterranea. G2 domains are also more diverse in sequence than the GFP-like proteins. If the former have evolved from the latter, the constraints on the sequence of a GFP-like protein, which has to enable the maturation of a chromophore or a fluorophore, have apparently been lifted in the G2 domains, which became co-opted into a multidomain polypeptide that mediates protein-protein interactions in the basement membrane of metazoa. Alternatively, GFP and G2 families may share a common β-barrel ancestor and may have other, still unrecognized relatives in present-day organisms.
Biological functions of GFPs remain elusive. It has been speculated that GFPs in cnidaria may be involved in quenching the UV irradiation [16] and/or in camouflaging coloration against predators [17]. Both hypotheses are compatible with the fact that the zone of the brightest fluorescence in lancelets corresponds to the anterior part of the adult animal that is exposed during feeding. On the other hand, lancelets feed on planktonic organisms, which display phototaxis, and it is possible that the signal emitted by these proteins serves as bait. On a more practical note, the study of lancelet ESTs and live imaging of animals expands the repertoire of GFP-like proteins, potential sensors and reporters for biological experimentation.

Spectroscopy
Absorption spectra for purified proteins were obtained using a SOFTmax UV/VIS spectrometer (Molecular Devices). Emission spectra were obtained using a Horiba Jobin Yvon FluoroMax fluorescence spectrometer. Emission scans were recorded by exciting the sample at λmax for absorption with a 5 nm slit. Molecular absorptivity and quantum yield were measured as in [18] and [19]. The protein concentration for absorptivity measurements was determined by BCA assay and by fluorescence correlation spectroscopy [20].
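For orientation, the relation between measured absorbance, protein concentration and absorptivity is the Beer-Lambert law. The sketch below is not the procedure of [18] or [19], which is not reproduced here. It assumes a 1 cm path length and takes brightness as the product of absorptivity and quantum yield. The function names and all numerical values are hypothetical.

```python
def molar_absorptivity(absorbance, concentration_molar, path_cm=1.0):
    """Beer-Lambert law: A = epsilon * c * l, so epsilon = A / (c * l), in M^-1 cm^-1."""
    if concentration_molar <= 0 or path_cm <= 0:
        raise ValueError("concentration and path length must be positive")
    return absorbance / (concentration_molar * path_cm)

def brightness(epsilon, quantum_yield):
    """Brightness taken here as absorptivity multiplied by quantum yield (an assumption)."""
    return epsilon * quantum_yield

# Hypothetical numbers, not measurements from this study:
# absorbance of 0.5 at the absorption maximum for a 5 micromolar sample.
eps = molar_absorptivity(absorbance=0.5, concentration_molar=5e-6)
example_brightness = brightness(eps, quantum_yield=0.5)
```

Under this definition, a protein with a high absorptivity but a very low quantum yield can still be dim, which is why the two quantities are measured separately.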

Bioinformatics
Sequence similarity searches were performed using the gapped BLAST and PSI-BLAST family of programs [21]. Multiple sequence alignments were done using the MUSCLE program [22]. The JTT distance matrix of protein sequences was constructed using the PROTDIST program and the neighbour-joining phylogenetic tree was constructed using the NEIGHBOR program of the Phylip package [23]. The statistical support of the nodes was assessed by making 100 bootstrap replicates of the aligned sequences, building 100 trees, making the consensus tree, and marking all nodes in the original tree that were supported by more than 50% of the bootstrapped replicates. HMM-to-HMM matching was done using the HHpred server [15].
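The bootstrap procedure described above can be sketched in two steps: resampling alignment columns with replacement, and counting how often each grouping of the original tree recurs among the replicate trees. The code below is an illustration only. It does not build trees (that step was done with PROTDIST and NEIGHBOR), and it assumes each tree has already been reduced to a set of splits, each split written as a frozenset of the taxon names on the side that excludes a fixed reference taxon. All function names are hypothetical.

```python
import random

def bootstrap_columns(alignment, rng):
    """Resample alignment columns with replacement; all rows must have equal length."""
    length = len(next(iter(alignment.values())))
    picks = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}

def split_support(original_splits, replicate_splits):
    """Percent of replicate trees that contain each split of the original tree."""
    n = len(replicate_splits)
    return {
        split: 100.0 * sum(split in rep for rep in replicate_splits) / n
        for split in original_splits
    }

def supported(original_splits, replicate_splits, threshold=50.0):
    """Splits recovered in more than `threshold` percent of the replicates."""
    support = split_support(original_splits, replicate_splits)
    return {split for split, pct in support.items() if pct > threshold}
```

A typical use would call `bootstrap_columns` 100 times with one `random.Random` instance, build a tree from each replicate with an external program, and pass the resulting split sets to `supported`.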

Cloning and expression of LanFP1 and LanFP2
LanFP1 and LanFP2 coding sequences were initially assembled from the B. floridae EST libraries as two of the proteins supported by the largest number of ESTs. Prior to the release of the lancelet genome assembly, there were six nucleotide positions in the LanFP1 coding sequence showing variations between the ESTs in the database, and we used the most frequent nucleotide at each of these positions. The sequences were subsequently verified by comparing them to the genome assembly. To optimize the codons for expression in mouse and other mammalian systems, we reverse-translated the LanFP1 and LanFP2 protein sequences into DNA using a standard mouse codon set. For ease of cloning, we included HindIII and BamHI restriction sites at the 5' and 3' ends of both sequences. The genes were synthesized by Bioclone, Inc. (San Diego, CA). The synthesized genes were cloned into the HindIII and BamHI sites of Sigma's p3XFlag-Myc-CMV plasmid, which contained a 5' Flag tag and a 3' Myc tag plus a stop codon on the vector backbone. The optimized sequences can be found in Additional file 3.
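The reverse-translation step can be illustrated as follows. This is a sketch under stated assumptions. The codon table below lists one frequently used mammalian codon per amino acid and is not the codon set used for the actual constructs. No stop codon is added, since the stop codon is supplied by the vector. The reading frame relative to the vector tags is not handled. The function names are hypothetical.

```python
# Illustrative codon table: one frequently used mammalian codon per amino acid.
# This is an assumption for the sketch, not the codon set used by the authors.
PREFERRED_CODON = {
    "A": "GCC", "R": "AGA", "N": "AAC", "D": "GAC", "C": "TGC",
    "Q": "CAG", "E": "GAG", "G": "GGC", "H": "CAC", "I": "ATC",
    "L": "CTG", "K": "AAG", "M": "ATG", "F": "TTC", "P": "CCC",
    "S": "AGC", "T": "ACC", "W": "TGG", "Y": "TAC", "V": "GTG",
}

HINDIII = "AAGCTT"  # HindIII recognition site
BAMHI = "GGATCC"    # BamHI recognition site

def reverse_translate(protein):
    """Map each residue to its preferred codon; reject unknown residues."""
    try:
        return "".join(PREFERRED_CODON[aa] for aa in protein.upper())
    except KeyError as err:
        raise ValueError(f"unsupported residue: {err.args[0]}") from None

def add_cloning_sites(coding_dna):
    """Flank with HindIII (5') and BamHI (3') sites; reject internal copies of either."""
    for name, site in (("HindIII", HINDIII), ("BamHI", BAMHI)):
        if site in coding_dna:
            raise ValueError(f"internal {name} site in coding sequence")
    return HINDIII + coding_dna + BAMHI
```

Checking for internal restriction sites before adding the flanking sites matters because an internal HindIII or BamHI site would cause the insert to be cut during cloning.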

Cell Culture and Transfection
Human embryonic kidney (HEK) 293 cells were cultured in the minimum essential medium (MEM) (Invitrogen) supplemented with 5% fetal bovine serum (FBS) and 2 mM glutamine and were maintained at 37°C in a humidified environment of 5% CO2. The cells were plated on 25 mm round coverslips coated with poly-D-lysine (Sigma) for 24 to 48 hours and transfected with DNA plasmids when cells reached 70-80% confluence. Plasmid DNA used for transfection was obtained using the HiSpeed maxi-prep kit (Qiagen) and repurified by sodium acetate and ethanol precipitation. 2 μg of plasmid DNA was mixed with 12 μg of Nupherin (Biomol Research Laboratories, Plymouth Meeting, PA) in 300 μl of MEM containing no FBS or antibiotics for 15 min and then combined with 300 μl of MEM containing 6 μl of LipofectAMINE 2000 (Invitrogen) for another 15 min at room temperature. The culture medium was replaced with 600 μl of transfection medium containing the LipofectAMINE-Nupherin-DNA complex. After incubating for 0.5 to 1 hour, the transfection medium was replaced with 2 ml of culture medium. Cells were imaged 24-72 hours post-transfection.

Protein purification
BL-21 E. coli cells were transformed with LanFP1 or LanFP2 subcloned into the pRSET-B bacterial expression vector (Invitrogen, Carlsbad, CA) in frame with a six-histidine tag. Overnight cultures were grown to OD 0.4 prior to induction. Bacterial cultures were induced by the addition of 1 mM IPTG, grown to OD 0.8, and lysed, and the His-tagged protein was extracted using a Ni-agarose beads solution (Qiagen cat no. 30210). Concentration of purified protein was established using a BCA assay (Pierce, cat no 23225) and fluorescence correlation spectroscopy.

List of abbreviations
GFP: Green fluorescent protein.
rather generic GFP, LanFP2 is dimmer because of its low quantum yield but instead possesses impressively high molar extinction coefficient, on par with anthozoan purple-blue chromoproteins that are known for their striking color appearance. I would mention in the discussion that this difference in spectral properties may be the result of functional sub-specialization that maintained both genes in the lancelet genome after duplication. Also, given its very high molar extinction, I would expect that LanGFP2 protein should look quite colorful (not necessarily green!) in solution. Is that so?
still do not have a robust statistical validation of the GFP-G2 monophyly at the sequence level. Thus, interested as we are in discussing scenarios that assume a common ancestor, we can do it only hypothetically. We have modified the Conclusions section with this in mind.
What is the evidence that the red fluorescence comes from a GFP like protein and not some other fluorochrome?
Authors' response: This question was also raised by M. Matz; see our response (the short answer is: no evidence, but see ref. 14).