pkaPS: prediction of protein kinase A phosphorylation sites with the simplified kinase-substrate binding model
© Neuberger et al. 2007
Received: 09 January 2007
Accepted: 12 January 2007
Published: 12 January 2007
Skip to main content
© Neuberger et al. 2007
Received: 09 January 2007
Accepted: 12 January 2007
Published: 12 January 2007
Protein kinase A (cAMP-dependent kinase, PKA) is a serine/threonine kinase, for which ca. 150 substrate proteins are known. Based on a refinement of the recognition motif using the available experimental data, we wished to apply the simplified substrate protein binding model for accurate prediction of PKA phosphorylation sites, an approach that was previously successful for the prediction of lipid posttranslational modifications and of the PTS1 peroxisomal translocation signal.
Approximately 20 sequence positions flanking the phosphorylated residue on both sides have been found to be restricted in their sequence variability (region -18...+23 with the site at position 0). The conserved physical pattern can be rationalized in terms of a qualitative binding model with the catalytic cleft of the protein kinase A. Positions -6...+4 surrounding the phosphorylation site are influenced by direct interaction with the kinase in a varying degree. This sequence stretch is embedded in an intrinsically disordered region composed preferentially of hydrophilic residues with flexible backbone and small side chain. This knowledge has been incorporated into a simplified analytical model of productive binding of substrate proteins with PKA.
The scoring function of the pkaPS predictor can confidently discriminate PKA phosphorylation sites from serines/threonines with non-permissive sequence environments (sensitivity of ~96% at a specificity of ~94%). The tool "pkaPS" has been applied on the whole human proteome. Among new predicted PKA targets, there are entirely uncharacterized protein groups as well as apparently well-known families such as those of the ribosomal proteins L21e, L22 and L6.
The supplementary data as well as the prediction tool as WWW server are available at http://mendel.imp.univie.ac.at/sat/pkaPS.
Erik van Nimwegen (Biozentrum, University of Basel, Switzerland), Sandor Pongor (International Centre for Genetic Engineering and Biotechnology, Trieste, Italy), Igor Zhulin (University of Tennessee, Oak Ridge National Laboratory, USA).
This article was reviewed by Erik van Nimwegen (Biozentrum, University of Basel, Switzerland), Sandor Pongor (International Centre for Genetic Engineering and Biotechnology, Trieste, Italy) and Igor Zhulin (University of Tennessee, Oak Ridge National Laboratory, USA). For the full reviews, please go to the Reviewers' comments section.
Phosphorylation is one of the biologically most important post-translational modifications known today. Eukaryote kinases, which are the enzymes that are responsible for this type of chemical alteration, transfer phosphate moieties onto the hydroxyl groups of serines, threonines or tyrosines of substrate peptides. Phosphorylation plays a key role in a large set of signal transduction pathways and is known to regulate the functions of a vast number of different proteins. Not only are substrate motifs for phosphorylation found in proteins from various cellular contexts, there are also >500 kinases  with at least partly non-overlapping substrate specificities encoded in each of the higher eukaryote genomes. This broad distribution, coupled with the potential medical applications, makes them interesting research targets with regard to their role in signaling cascades. Therefore, it is important to determine the complete protein substrate set for each kinase. The sheer number of yet uncharacterized proteins implies that a lot of phosphorylation motifs remained undetected so far. Accurate in silico predictors recognizing kinase substrates from their amino acid sequences are desirable to bring this task closer to a solution. A low false-positive prediction rate is especially important in this context.
Protein kinase A (PKA), alternatively called cAMP-dependent protein kinase, is one of the best studied members of the kinase group of enzymes and, therefore, appears among the most attractive targets for substrate site predictor development. It is actually the first kinase for which the crystal structure has been resolved [2, 3]. PKA acts on serine and, to a lesser extent, threonine residues that are embedded in a specific recognition motif. In its first characterizations, the PKA motif was described as consisting of arginines at the 3rd and 2nd positions prior to the phosphorylation site, and of a large hydrophobic amino acid immediately thereafter [4–7].
Several groups already applied various approaches for predicting PKA phosphorylation sites from primary protein sequence. NETPHOS  was one of the first to outperform simpler PROSITE-like approaches [9–11] by applying artificial neural networks. A more recent version, NETPHOSK , makes kinase-specific predictions. SCANSITE 2.0 uses position-specific scoring matrices (PSSM) to predict phosphorylation motifs for 62 different kinases, again including PKA . PREDPHOSPHO is a kinase-specific predictor that uses support vector machines . GPS does not use standard machine learning approaches but implements a so-called group-based scoring technique, which makes use of the BLOSUM62 matrix to score distances between query sequences and known clusters of kinase substrate peptides [15, 16]. As GPS focuses on straight sequence similarity traits, the likelihood for GPS to recognize query peptides as phosphorylation targets that are similar to known sites is especially high whereas GPS might have difficulties if it is confronted with unusual substrate examples of the same kinase that are not reflected in the learning set. Among these tools, GPS [15, 16] and PREDPHOSPHO  appear to have highest accuracy. Although the sequence sets used for testing are limited, their sensitivities are clearly below 90% for specificities estimated to be close to 90%. As more than 10% of the query sites are expected to be misclassified, database-wide studies that rely solely on current predictors cannot produce reliable results.
In order to achieve higher sensitivity and specificity, major improvements are needed. In this work, we implemented two new aspects: (i) Since there is no "average phosphorylation site", high prediction accuracy can only be achieved if the function for scoring of putative phosphorylation sites is specific for each kinase system. In our approach, the scoring function is thought to estimate the probability of productive binding of the respective substrate protein segment with the binding site of PKA; thus, the scoring function is a simplified physical model of the binding process [17–22]. (ii) The motif regions that are used to discriminate between true sites and non-permissive targets should be as long as possible. These shall include all substrate sequence stretches that influence the binding process and should not be restricted to the region of the motif that is most conserved in terms of amino acid types. It is also necessary to consider properties of correlated motif positions .
It should be noted that, for most post-translational modifications, only a handful of substrate proteins per modifying enzyme is known. Even for the better studied cases, the available experimental information can only reliably parameterize a scoring function with a small number of fitted values. In similar cases of predictor development such as for GPI lipid anchoring , N-terminal N-myristoylation , prenylation  and peroxisomal targeting [22, 24], our simplified substrate protein binding model has been successfully applied. It should be noted that, in all these cases, the sequence signal coding for the posttranslational modification or the translocation is located either at the N- or C-terminal end of the polypeptide chain. In this work, we wanted to test the approach for an internal sequence signal.
PKA-dependent phosphorylation is an excellent example in this context since the rich experimental data allow for the derivation of a quite accurate qualitative binding site model as we show in this work. Not only are there more than 200 documented phosphorylation sites for PKA. The available sequence data is also accompanied by other valuable heterogeneous information such as 3D data and mutation experiments [3, 25].
The whole work consist of two major parts – first, the derivation of the property pattern that characterizes sequence segments with PKA phosphorylation sites and, second, the development and the validation of a prediction tool for the recognition of PKA phosphorylation sites in query sequences.
The following four sections of the Results ("The motif length", "Positive charge in the N-terminal flank", "Polarity and flexibility in the C-terminal flank", "Phylogenetic variation of the substrate binding site of PKA") are dedicated to the derivation of the sequence motif coding for PKA phosphorylation sites. This work is based on analyses of the sequence environment of known phosphorylation sites in substrate proteins and of the PKA sequences and structures. We correlate amino acid compositions at various alignment positions with physical properties of amino acid residues. As major results, we obtain the sequence length of the motif and the pattern of physical properties in various sequence segments surrounding the phosphorylation site. Moreover, if several phosphorylation sites occur in one protein, they tend to be sequentially clustered.
The next three sections ("Predictor description and the self-consistency test", "Neighbor-jackknife test", "Summary of the prediction performance and comparison to other tools") describe the development of the prediction tool and its validation with the self-consistency test and a rigorous cross-validation procedure called neighbor-jackknife test (exclusion of groups of sequentially similar proteins). The specificity and the sensitivity values are close to 95% and, thus, superior compared with previously published predictors.
The succeding section of the Results ("Prediction of PKA targets within the human proteome") describes the application of the predictor to the human proteome. Among new predicted PKA targets, there are entirely uncharacterized protein groups as well as apparently well-known families such as those of the ribosomal proteins L21e, L22 and L6. The last section of the Results ("Description of the associated WWW site") supplies information about the PKA WWW server.
The deduction of accurate motif boundaries is not straightforward, as this region also comprises positions that make only minor contributions to substrate recognition by PKA. For example, these include residues that interact only weakly with the receptor or which are context-dependent upon neighboring positions. As a consequence, it is helpful to base such estimations on a standard model which has already been successfully applied in related situations.
The concept of a linker-embedded binding motif is utterly suited for this task. The underlying assumption is that the peptide stretch which binds to the receptor enzyme and which is buried in the catalytic cleft must first be made accessible for interaction: as part of an intrinsically disordered region, through a permanent native location on the surface of the globular part of the substrate protein or via exposure after an induced conformational change. As a consequence, the flanking regions which connect the sequence segment that fits into the catalytic cavity and the rest of the substrate protein must have sufficient conformational flexibility and hydrophilicity. Such a motif structure has already been observed and successfully applied in predictor development [21–24, 26]. Recent work by Dunker and co-workers further confirms the applicability of this model to protein phosphorylation motifs as they find evidence for inherently disordered regions surrounding phosphorylated residues. They used a similar formulation of the concept for "disorder enhanced" prediction of phosphorylation sites .
The calculated values deviate from the database averages over a sequence stretch that covers about twenty positions both the N- and at the C-terminal side of the documented phosphorylation site. The curves fall back to the average database values with increasing distance from the phosphorylated site. Moreover, similar behavior is exhibited by many other hydrophobicity- and flexibility-related properties (data not shown). It appears also interesting that this region is slightly longer on the C-terminal side than on the N-terminal one. This might be a result of the more hydrophobic nature of the residues that lie adjacent to the phosphorylated site on the C-terminal side. As depicted in Figure 1, the motif boundaries cannot be boiled down unambiguously to a unique position. We set the edges well into the regions where the property mean values do not fall below the steady level of approximately ± 20%. The resulting region is defined from positions -18 to +23 and, thus, we estimate the total length of the sequence signal for PKA-dependent phosphorylation as 42 positions.
Historically, charge requirements were the first observed characteristics of the PKA motif. Kinetic studies at the end of the 1970s revealed a cluster of positive residues directly N-terminally of the phosphorylated site as main determinant for PKA substrate specificity. The main constituents of this cluster are the 2nd and 3rd positions prior to the phosphorylated serine or threonine. Kemp et al.  postulated that at least one arginine should be present at one of these locations. Moreover, replacement of the arginine by lysine was reported to cause less activity loss than substitutions by other amino acids. In another study , the adjacent arginines were positioned at various distances from the phosphorylated site and activity measurements were performed. The results demonstrated that the binding affinity is indeed highest at positions -3/-2 and decreases with increasing distance from the site.
Physico-chemical preferences in the region prior to the phosphorylation site are complemented with flexibility and polarity requirements, e.g. for the property VINM940103 (normalized flexibility parameters , R ≥ 0.62) at positions -8 to -6 and -4, or for the hydrophilicity-related scales EISD840101  (R ≥ -0.66 at position -3 and -0.69 at position -2) and KRIW790102  (R = 0.60 at positions -7, -6 and -4). Although these might be a remnant of charge requirements, it seems clear that a substitution of arginine by hydrophilic residues is less disfavored than an exchange by bulky, apolar amino acids.
Interestingly, correlation effects can be detected between positions +1 and +4, as indicated by an F-value of 1.38 (96.6%) for the property GEIM800107 (β-strand indices for β-proteins ). Few data about the role of residue +4 is available from the literature, as this position is missing in the currently resolved 3D structures. It has no clear amino acid preferences, although it is preferentially less polar than the clearly hydrophilic surrounding positions (data not shown). Its spatial location in vicinity of the hydrophobic patch (Figure 5) combined with the correlations with residue +1 could suggest that positions +1 and +4 both may interact with the apolar surface loop of PKA. However, alternative conformations which involve an apolar residue at position +3 also appear possible.
The intermediary positions +2 and +3 can be characterized by a preference for small residues. Numerous size-related scales such as FASG760101  (R of -0.62 and -0.63 for positions +2 and +3, respectively) produce significant correlation coefficients. To some extent, position +3 also seems to favor flexible amino acids, as indicated by a correlation coefficient R of 0.65 for VINM940102 . The respective substrate positions indeed lie in a spatially constrained region at the mouth of the binding cavity (Figure 5), which explains the appearance of size restrictions.
The motif structure that was presented in the preceding sections served as a basis for the generation of a prediction tool. The final version of the predictor, called "pkaPS", uses one profile over 13 sequence positions and 14 physico-chemical property terms. In the self-consistency test, the pkaPS predictor generates scores S ≥ 0 for 236 out of 239 (98.7%) positive examples from the learning set, and, thus, correctly predicts these sequences as potential substrates for PKA-dependent phosphorylation.
Results of the self-consistency test
T 1 ,T 2
Kishimoto et al. 1985 
T 1 -T 4
Ekdahl 1987 
Sewing & Müller 1994 
The expected rate of false-positive predictions can directly be estimated using the set of 1026 negative examples. For a given serine or threonine residue of a query sequence, the probability of true-negative prediction lies at 93.5% (Fp-rate of 6.5%). This set was used to generate an empirical score distribution of negative examples. In order to obtain a value for the false-positive rate for any generated score S, an analytical approximation of this score distribution was determined (Materials and Methods).
Thorough cross-validation tests are needed in order to assess whether the score function is stably parameterized by the learning set. The pkaPS tool was subjected to a strict cross-validation test where the query sequence in addition to sequences which share more than 30% of identical amino acids with the query were excluded from the parameterization procedure (neighbor-jackknife test, Materials and Methods).
Results of the neighbor-jackknife test
T 7 ,T 8
Kishimoto et al. 1985 
T 2 ,T 4
Thorens et al. 1996 
T 4 ,T 7
Tullai et al. 2000 
Huang et al. 1974 
T 2 ,T 10
Li et al. 2001 
T 4 ,T 12
Blind et al. 1996 
T 2 ,T 12
Wang et al. 1993 
T 1 ,T 2
Kishimoto et al. 1985 
T 1 -T 4
Ekdahl 1987 
Sewing & Müller 1994 
All of the seven entries that were predicted in the self-consistency test but not in the neighbor-jackknife test have only marginally negative scores between zero and -0.5. Only the three entries that had scores below zero in the self-consistency test also had an S < -0.5 in the neighbor-jackknife test (see Tables 1 and 2). We think that the score interval between 0.0 and -0.5 represents a twilight zone.
To conclude, the prediction performance of the pkaPS tool is considerable. Its sensitivity lies in the range of at least 95.8% as estimated from the neighbor jackknife test, and is as high as 98.7% in the self-consistency test. At the same time, a specificity of 93.5% could be achieved.
Prediction performances of available algorithms compared to pkaPS.
S n [%]
S p [%]
Iakoucheva et al. 2004 
Zhou et al. 2004 
Blom et al. 2004 
Xue et al. 2005 
Kim et al. 2004 
As judged from the published predictor performance ratings, pkaPS provides a better specificity and sensitivity than all other currently available tools. Among these methods, only DISPHOS  is a predictor for "average phosphorylation" sites without considering kinase specificity. All other tools have implementations for specific kinases including PKA. The algorithms that come closest to the performance of pkaPS are PREDPHOSPHO, which uses a support vector machine based implementation , and GPS, which rests upon a group-based scoring method .
In addition to thorough cross-validation tests, the performance of the pkaPS tool was studied by analyzing predicted PKA-dependent phosphorylation sites in the human proteome. The human protein sequences were retrieved from the NCBI FTP-site (40877 sequences, September 14th 2006 at ). From a total of 2,485,866 serines and threonines, 258,271 (10.4%) were predicted as putative phosphorylation sites with scores S > 0. In our understanding, the list of predicted sites contains (a) true phosphorylation PKA sites, (b) sites that are phosphorylated in vitro by PKA but not in vivo due to the absence of biological context (see the comment on hidden signals in the discussion and in ref. ) and (c) real false-positive predictions of phosphorylated serines/threonines that are in sequence stretches without capability of productive interaction with the catalytic site of PKA. The consideration of additional functional sequence regions has a significant impact on the rate of predicted PKA-dependent phosphorylation sites. For example, there are 649979 ST-sites in 10195 proteins with predicted signal peptides (with any of the taxonomic versions of SIGNALP 3.0 ). pkaPS generates hits for 56970 sites (8.8%), a considerably lower value than that for the full proteome.
The probability of wrongly predicting a site within a generally non-phosphorylated protein appears to dramatically increase with the number of S/T sites in its sequence. Among the 40877 sequences that are included in the retrieved file, only 4860 entries (11.9%) do not have a single predicted site for PKA-dependent phosphorylation. This result strongly emphasizes the main difficulty of predicting post-translational modifications that can occur in a query protein with multiple suitable serines/threonines. In such cases, a single false-positive site may be responsible for an incorrect assignment of the entire protein.
Prediction of the clustered human proteome.
Predicted S/T (%)
Predicted entries (%)
High mobility group
Splicing factor, predicted RNA-binding
Ribosomal protein L21e
Ribosomal protein L22
60S ribosomal protein L6
HIV-1 Vpr-binding, High mobility group
Ras GTPase-activating protein
RNA-binding protein TIA-1/TIAR (RRM superfamily)
Uncharacterized conserved protein (KOG4791)
The prediction of phosphorylation sites in ribosomal proteins such as L21e, L22 and L6 deserves special attention in context with the recent discovery of phosphorylation of some ribosomal proteins by specific kinases (such as the ribosomal protein S6 kinase (S6K)) and the important biological role of this phosphorylation [47, 48].
Supplementary data as well as the pkaPS WWW-server are available at the Mendel WWW-site . The pkaPS server currently accepts up to 500 sequences in fasta-format (with no more 10000 S/T residues). For analyzing larger sets, we recommend contacting the authors. In interpreting the results, we advise to consider scores above 0 as good predictions; the twilight zone limit is -0.5. The predictor pkaPS analyzes the capability of the query sequence to productively interact with PKA. Additional information should be gathered from the literature or from predictors for other sequence properties to decide whether the prediction is not a hidden signal and makes sense in the biological context of the query. Additionally, we provide (i) access to the learning set, (ii) detailed results of the self-consistency and the neighbor-jackknife tests, (iii) downloads of the predictions for the human proteome both in plain and MCL-clustered forms .
Despite considerable algorithmic advances in the field, none of the prediction tools for PKA-dependent phosphorylation previously described in the literature achieves specificity and sensitivity rates both above 90%. In our view, several biological and computational aspects contribute to this development. Among them, there are several problems: (a) with the availability of experimental data, (b) with serine/threonine-rich regions, (c) with the incorporation of the available physico-chemical and biological knowledge into the scoring function used to discriminate between productively interacting substrates from non-permissive sequence stretches, (d) with the issue of accessibility of the phosphorylation motif within the protein's three-dimensional structure (intrinsically disordered regions surrounding the site) [27, 50] and (e) with the issue of hidden signals (proteins that would be phosphorylated if they were in contact with PKA but which never have the appropriate biological context during their life cycle) .
With regard to point (a), little experimental data is available for most kinases with regard to the sequence variability of substrates, structural detail of kinase-substrate complexes, kinetic or energetic aspects of the interaction. Only a few kinases including PKA are reasonably well studied in this respect. For example, the number of sequentially dissimilar substrate sequences for the reliably parameterization of the profile term was estimated at least 200 in . This number is reached for PKA even in the neighbor jackknife test (among the 239 sequence examples, the number of excluded sequences has never been above 10) and the results of this test show that stable profile parameterization has almost been achieved.
The issue of many serines/threonines in the sequence (b) is especially challenging since ST-rich regions are common in intra- and extracellular proteins. To detect phosphorylated proteins on a large-scale basis, every single potential site in a sequence must be taken into consideration. If S p (measured as value between 0 and 1) is the rate of correct rejection of a site and if there are n serine/threonine residues in a query sequence, the specificity of the task for classification of query proteins decreases significantly (to ). Considering the difficulties associated with prediction of potentially multiple sites in ST-rich regions, it is clear that very high accuracies are needed if such algorithms are to be applied routinely on a large-scale proteome basis.
The incorporation of the heterogeneous knowledge about the PKA-substrate protein relationship into the scoring function (issue c) is a non-trivial problem. Experimental reports usually do not provide the knowledge in the form that is necessary for formulating algorithms. There are two ways to deal with this problem – either to take the information as is and to hope that machine learning procedures filter the aspects of the data that are relevant for prediction, or to formulate a physically reasonable model of productive binding events with the kinase directly. Machine learning approaches have shown their usability in a variety of applications, especially in cases where lots of uniform data are available. The classical example is SIGNALP, for the derivation of which learning sets in the size of thousands of substrate proteins were collected .
In many other biological applications, the data situation is by far not that comfortable. In such circumstances, the usage of machine learning software packages as "black boxes" for autonomous extraction of score function parameters without human interference and without explicitly considering the physico-chemical and biological realities of the problem under study can become dangerous. In their letter to the editors of the Biophysics Journal in 1994 , Frank Darius and Raul Rojas analyze the difficulties arising from the discrepancy between the very high dimension of the parameter space in modern machine learning approaches and scarce data when exemplarily criticizing an alternative signalpeptide predictor. To summarize, the problem is that the calculated parameters are not reliable and it is not clear whether the correlations found are numerical noise of the data or biologically meaningful. If the data are scarce and no one tells the "black box" how the substrate protein interacts with the receptor, then the box would indeed need to be a "magic box" to know about it in order to pick the correct significant parameters.
As a direct consequence, human involvement and additional biological knowledge are indispensable for dimensionality reduction. In contrast to machine learning approaches, a physically justified model of productive binding with the kinase already provides a reasonable analytical form of scoring function terms. In this context, it is not so important whether this form can be further improved. We wish to emphasize that the number of parameters to be determined with the help of learning data is dramatically reduced.
In our approach, we consequently follow these considerations and try to incorporate all biologically relevant information into the analytical form of the prediction function. For example, it is utterly important that among all sequence positions, which carry relevant information, as many as possible are considered for the prediction procedure. In the case of PKA, we found the region to occupy the segment -18...+23. Also, we have very few parameters: a profile term over 13 positions centered around the putative site and 14 physical property terms T j (typically involving 3 parameters: the mean and the standard deviation of an amino acid index averaged over some sequence region as well as a weight factor for the whole term). Especially the latter set of parameters is determined with high significance given the 239 positive examples. We think that the simplicity of our algorithm is its big strength since the output of the decision function clearly indicates what kind of property supports or prevents the prediction of a query as PKA substrate. In the process of predictor development, human interference can assure that only the biologically meaningful among the significant correlations enter the decision function.
The influence of the structure of the protein on the accessibility of the phosphorylation motif to the kinase (issue d) is difficult to estimate at present. By demanding an excess of hydrophilic, small and flexible residues in the region -18...+23 with the physical property terms, it becomes quite unlikely that a sequence region hit by our predictor is actually part of a 3D structure but rather represents an intrinsically disordered segment. Nevertheless, it cannot be excluded that sequence stretches that are not included in the motif definition might cause the entire protein to fold in such a way that the potential phosphorylation site is not accessible to its modifying kinase.
Finally, the cellular context is important (issue e). In the case of some translocation signals, it has been experimentally shown that even the presence of an in vivo functional motif does not mean that the carrying protein is also imported into the corresponding subcellular compartment . This discovery of hidden sequence signals highlights the significance of cellular hierarchies for small functional protein motifs. Hence, current phosphorylation predictors including pkaPS do not really predict phosphorylation, but the potential of a sequence stretch to interact productively with the modifying kinase. This seemingly small detail may appear negligible but it is important to be considered in all predictions. E.g. a targeting signal located far away from a functional phosphorylation site on the same protein may lead to a removal from the cellular compartment of the respective kinase, thereby overriding the phosphorylation motif. This means that the analyzed motif is not the sole sequence stretch on the protein which is responsible for the modification.
These considerations mean that a phosphorylation predictor is not the only source of information that must be consulted when evaluating the phosphorylation state of a protein. Moreover, the number of apparently wrong predictions (if only the physiologically relevant cases are counted) of an algorithm is not only determined by the imperfection of its design since the predictor focuses on the query sequence stretch. Hence, even if all permissive amino acid permutations of the substrate motif are known, the theoretical accuracy of any post-translational modification predictor will have an upper limit clearly below 100%.
The refinement of the PKA phosphporylation motif showed that approximately 20 sequence positions flanking the phosphorylated residue on both sides are restricted in their sequence variability. The conserved physical pattern can be rationalized in terms of a qualitative binding model with the catalytic cleft of the protein kinase A. The pkaPS predictor based on this motif description confidently discriminates PKA phosphorylation sites from serines/threonines with non-permissive sequence environments (sensitivity of ~96% at a specificity of ~94%).
UNIPROT [52–54] accessions and positions of sites which are phosphorylated via PKA were retrieved from the Phospho.ELM database version 4.0 . Subsequently, an alignment was generated which contains the 81-residue long sequences that span the phosphorylated residue in addition to the 40 flanking amino acids on each side. Positions outside of the N- or C-terminal ends were treated as non-occupied (without amino acid) in further calculations. The original sequences of the substrate proteins were obtained from the UNIPROT database . Initiator methionines were removed according to the 1.29Å rule [57, 58].
Previous analyses of typical annotation errors in databases [17, 20, 22, 42] emphasized the importance of learning set curation. As expected, a couple of entries were inaccurate or had unclear verifications. Therefore, the following modifications were introduced (protein sequences are indicated by UNIPROT accession numbers):
Position 137 of the Casein-B precursor sequence (P02666) contains arginine, not serine. According to the paper where the experimental verification is reported , phosphorylation occurs in a "variant B" which contains this mutation.
The experimental verification in the paper cited for the entry P11168 was actually performed for the rat (RINm5F cell line), not the human protein . As a consequence, the entry was replaced by the corresponding rat sequence P12336, with the reported phosphorylation sites located at positions 489, 501, 503 and 510.
The exact localization of the PKA-dependent phosphorylation sites in the PTH/PTHrP type I receptor (P25107) was performed using the rat protein, not the one from opossum . Therefore, P25107 was replaced by P25961 (positions 491, 473 and 475).
Phosphorylation of the sites in P00698 appears to occur only in the denaturized protein . As the experimental verification status of the entry is not entirely clear, it was excluded from the dataset.
Entry P13280 is removed from the dataset because the experimental verification for this extremely unusual motif is unclear .
Serine 259 from P04049 is phosphorylated  but not listed in Phosph.ELM and was, therefore, added subsequently.
According to the corresponding paper  and the UNIPROT entry, the phosphorylated residue of P20020 lies at position 1216 and not 1178.
Phosphorylation of γ-aminobutyric-acid receptor β1 is originally reported for the mouse protein instead of the human counterpart . As a consequence, entry P18505 was replaced by P50571.
The reported phosphorylation sites for P32245 are only proposed to be potential sites for PKA and GRK. They actually lack any direct experimental verification  for PKA-dependent phosphorylation. The corresponding entries were removed from the learning set as a consequence.
Phosphorylation of the metallopeptidase EP24.15 is reported for the rat instead of the human protein . Entry P52888 was, therefore, replaced by P24155 including the correct location of the phosphorylated serine.
The phosphorylated serine in P07101 is located at position 71, not 40. Although the original paper reports Ser40 as site , the PKA-motif is shifted in direction of the C-terminus in the UNIPROT entry as a result of additional N-terminal regions from splice variants. Moreover, the experimental verification was performed using the rat protein, and the phosphorylated serine is annotated as "by similarity" in the UNIPROT sequence. Hence, the entry was removed from the learning set.
P01233 can theoretically be phosphorylated at three sites. However, the post-translational modification states depend on whether the implicated β-subunit is free and in its native form . Therefore, the corresponding entry was removed from the learning set.
To generate a set of negative examples, the references of a set of learning set sequences were screened. If it could be deduced that the phosphorylated S/T-sites reported in a publication were the sole amino acids that are modified by PKA, then all remaining S/T-sites were added to the set of negative examples.
The final learning sets consist of 143 sequences with 239 phosphorylated sites and 28 sequences with 1026 non-phosphorylated serines and threonines. Although the set of positive examples contains entries from various taxonomical groups, it is mostly centered on mammalian species. Around one half of the 239 sites originate from H. sapiens (120 sites). Together with the other mammalian entries (93 sites), they make up 89% of the learning set. From the remaining entries, 21 originate from other metazoan species, and only 5 are from yeast and viridiplantae.
Distribution of the number of phosphorylated sites per sequence in the learning set.
Sites per sequence
The phosphorylated serines/threonines of a learning set together with their flanking sequences are represented in a gapless multiple alignment of npos = 81 positions. The nseq = 239 phosphorylated sites occupy a single column that acts as reference location with an assigned position number of 0. Sequence positions further N-terminally have negative values, positions further C-terminally have positive values.
To assess physical and chemical requirements at specific motif positions, we make use of 20-dimensional property vectors v which assign characteristic values v a to each amino acid a. These values have been measured in various experimental setups and quantify amino acid properties such as e.g. hydrophobicity, volume or charge. We used a pre-compiled property database [23, 28] for the motif analysis. Here, single property vectors are typically specified by short identifiers such as EISD840101. We use only properties which have defined values for all 20 amino acids.
t α is the argument of the Student's distribution function for a one-sided criterion with the confidence level α, and 3 stands for the number of conditions (two for the linear regression and one for the sum of all residue type frequencies being unity).
The obtained F-value follows an F-distribution with n seq – 1 degrees of freedom . For weighted sequences, n seq needs to be replaced by the sum of the weights of all sequences that are included in F-value calculation.
Mean values and standard deviations (equations 2, 3 and 4) as well as property correlations (equations 5 and 6) and F-tests (equation 7) have been routinely used in the derivation of the physical property pattern surrounding the phosphorylation sites (see first three sections of Results). In the Results, we often write , σ, R and F without positional arguments when we describe the positions in the text.
Each term T j has an associated property v j that represents the type of physico-chemical requirement, e.g. hydrophobicity or size. Its mean value in the query sequence over a set of n pos motif positions l i , together with the respective mean value over the learning set , the learning set dispersion and a freely selectable parameter t j are used as a basis for calculation of T j .
We use two different types of penalties: (i) fixed ones and (ii) gauss-type penalties. Fixed penalties are simple penalties that are applied if the property mean value in the query sequence exceeds the predefined threshold t j , without taking the learning set into account. T j is then either 0 or -1.
Summary of the physical terms T j in the scoring function of the pkaPS predictor.
(+) H, K, R
-6 to -2
Isoelectric point (positive charge), long range
(+) H, K, R; (-) D, E
-6 to -2
Total charge, long range
β-strand preference, compensated
-9 to -4
Minimal linker – flexibility
-9 to -4
Minimal linker – hydrophilicity
+4 to +9
Minimal linker – hydrophilicity
+4 to +9
Minimal linker – flexibility
-18 to -6, +6 to +23
Avoid buried regions – hydrophilicity
-18 to -6, +6 to +23
Avoid buried regions – flexibility
Alternatively, one can use the "false-negative" (F n ) and "false-positive" (F p ) rates. They express the opposite of sensitivity and specificity, namely the amount of wrongly classified sequences for each prediction class, and are equal to 1 minus the respective S n or S p values.
To assess false-positive rates "on the fly" for obtained total scores S, a previously described estimation methodology [17, 20, 79, 80] is used that follows the spirit of BLAST p-values . This allows an easier interpretation of the total score S and provides the possibility for a better comparison with outputs from other prediction programs. The probability of false-positive prediction is approximated to the empirical distribution of sequences that are known not to carry the feature of interest. If a set of negative examples exists, it can be directly used for this task. If none is available, the function can be extrapolated from the distribution of low scores.
Approximations of the empirical distributions are calculated using iterative non-linear curve fitting implemented in the XMGRACE tool .
The predictor for protein kinase A (PKA) dependent phosphorylation, pkaPS, integrates the motif-related knowledge presented in the preceding sections. The profile term Sprofile is calculated using positions -6 to +6. The implemented terms T j reflect the structure of the substrate motif as deduced from the available sequence, structural and kinetic data. The main determinants for substrate specificity are the residues that interact with the enzyme in its binding pocket, and the adjacent positions at the mouth of the cavity. Various terms analyze these amino acids and combinations thereof for deviations from the typical physico-chemical motif fingerprint. Another group of terms evaluates the quality of the linkers that flank this region. These must have a minimal length to ensure that the phosphorylation site and its adjacent positions are sufficiently separated from the core of the respective substrate protein. The last set of terms is calculated over a region that extends further than the minimal linker length. The purpose of these functions is to exclude hydrophobic domains that might fold to protein cores, and thereby become inaccessible for substrate recognition. A summary of these terms, including the utilized physico-chemical properties, the implicated positions and references to the underlying rationales is presented in Table 6.
Erik van Nimwegen, Biozentrum, University of Basel, Switzerland.
This is a very thorough description of an algorithm for identifying phosphorylation sites of Protein Kinase A (PKA). It is clear that the authors put a lot of effort in deciding which physical features to use and how to use them. I am generally quite convinced that the pkaPS predictor provides the current state-of-the-art for PKA phosporylation site prediction. Therefore this is clearly a very worthwhile paper. I have two main points of criticism:
• The paper is too long. I appreciate all the information that the authors provide but I think the paper could be made much more readable by moving a lot of the material to supplementary materials and leaving a much more condensed and structured description of the key points. Right now there is just too much material to wade through for the reader to get a good overview of what is being done.
Author response: Our previous predictor developments have always been described in a pair of papers – one for the analysis of the property pattern near the modification site, another for the description and validation of the predictor. Thus, two different but related scientific tasks have been composed into one text. Further, we wish to supply all information that an interested reader can recreate the whole work and the implementation of the predictor. We feel that none of the information provided is dispensable. Nevertheless, we understand the concerns of the reviewer and decided to add an introductory overview section to the Results that summarizes the purpose of the respective sections and the major results described therein.
• I have concerns about over-fitting. There are a lot of parameters that go into the method that seem to have been set by hand (actually it is not entirely clear from the text how the parameters were set. This could be better explained). Examples are the collections of α weights and the thresholds t j . Given this moderately large set of parameters that have been tuned by the authors one wonders about over-fitting. In the description of the neighbor-jackknife test there is mention of "the parameterization procedure (neighbor-jackknife test, Materials and Methods)" but I did not see any description of this parameterization procedure. To address over-fitting, I propose that the authors do something like randomly dividing both the data set of positive examples as well as the set of negative examples in half. The parameters of the model should then be tuned independently on these two half-sets and false positive/negative rates can then be estimated by applying the two predictors to the half-sets not used in the training.
Author response: The revised version of the Methods section clarifies that the values t j have not been used as adjustable parameters but as a concept to formally unify Gaussian-type physical property terms and fixed penalties. The values t j have been described in the legend of Table 6 . The only adjustable parameters in the physical property term part of the score are the 14 α ppt,j . These are listed in Table 6 and the procedure for their determination is specified in the legend of Table 6 . The exact values of the α ppt,j are not critical since the physical property terms never generate a positive score (their purpose is to penalize non-permissive queries) and it is only important that the physical property terms do generate values close to zero for most of the learning set sequences. In the initial versions of the score function, each physical property term was even checked individually against the learning set to find maximal values for the α ppt,j . Simple linear kernel support vector machines were used to obtain optimized guesses and, thus, to reduce the size of the twilight zone, a zone of scores indicating unclear hits. The question of parameter overfitting has already been answered in the neighbor jack-knife test when whole homologous groups of sequences have been taken out of the learning set. This is a more rigorous approach compared with the random selection of sequences since the score function can be biased already due to the occurrence of a single homologue in the set.
• page 3: "a prototypic model for the kinase group". In what sense is PKA prototypic?
Author response:The reformulation emphasizes that PKA is the one of the best studied kinases and, therefore, well suited for substrate site predictor development.
• page 30: "The fact that phosphorylation frequently occurs .... is shown in table 4." I don't understand the reasoning here. In table 5 the number of phosphorylation sites per protein just seems to fall exponentially suggesting that the distribution might simple be a Poisson distribution, i.e no particular bias toward having multiple sites per protein.
Author response:The respective paragraph has been expanded to clarify that we just wish to describe the phosphorylation site distribution of the learning set. We do not intend to postulate a specific bias except for the observation that, if multiple sites do occur in one protein, they tend to cluster together (see Results).
• Page 33: Quantity R(l). Is this quantity ever used in the predictor? If not, what is the use of introducing it here?
Author response: The equations described in the Methods section "Sequence analysis part 2. Derivation of physical property characteristics" are used to filter the physical property pattern (see first three sections of the Results) prior to predictor development. We added text to this section to clarify this issue.
• page 35: "using the PSIC algorithm..." I am confused because on page 31 it was mentioned that the "sum of mismatches" method of Vingron and Argos is used.
Author response:Redundancy removal due to the occurrence of homologous sequences in the learning set is carried out differently for the physical property terms (with a modification of the Vingron-Argos procedure [74, 75]) and for the profile term (with the PSIC method ). The PSIC method is more sensitive but requires independent consideration of alignment positions. In physical property terms, we regularly consider multiple positions and the PSIC concept is formally not applicable in this context.
• page 39: "S th is a polynomial..." The expression in (16) is not a polynomial.
Author response:True, f(S th ) is the polynomial function considered here. We reformulated the respective part.
Sandor Pongor, International Centre for Genetic Engineering and Biotechnology, Trieste, Italy.
The manuscript of Neuberger et al "pkaPS: Prediction of Protein Kinase A Phosphorylation Sites with the Simplified Kinase-Substrate Binding Model" describes an heuristic method for describing PKA phosphorylation sites based on the distribution of various physicochemical parameters in the region flanking the phosphorylated residue as well as information on the foldedness of the polypeptide region. They present a scoring function that can confidently discriminate PKA phosphorylation sites from S/T residues in other environments. The predictor is made publicly available on a website. The description of the work is detailed and reproducible, and is in line with the groups previous works on similar subjects. The improvement over the other existing methods is convincing, and the idea of combining a physically reasonable model with statistical learning is an attractive one. I suggest the manuscript be published without modifications.
Igor Zhulin, University of Tennessee, Oak Ridge National Laboratory, USA.
In this paper, authors present the development of a prediction tool termed "pkaPS" for the purpose of identifying substrate proteins for the serine/threonine kinase PKA. Through a very thorough sequence/structure analysis, authors built a PKA-specific binding motif model, which can discriminate between PKA phosphorylation sites and other potential serine/threonine sites.
In my opinion, both the strength and the weakness of this paper are in its very detailed format. The manuscript is quite long and jam-packed with information even though the authors moved most technical details into the methods section. I am sure that bioinformaticians will find this paper very interesting, whereas most biologists are unlikely to reach the third page of the results section. This is unfortunate, because some of the derived predictions would be quite interesting for them (see below). I recommend adding information on biologically relevant predictions to the abstract at the expense of some technical details. This may capture attention of those to whom this information is addressed.
While skipping some details, I managed to follow authors' logic, which eventually resulted in building the analytical model for the kinase binding motif. I have to admit that this is a very difficult and noble task. The tendency to produce large numbers of false positives is a signature of most "sequence-only" motif predictors, and any attempt to overcome this problem inevitably leads to the need to incorporate chemistry and structure into the model. The authors did just that.
Ultimately, the success of the new predictive method and associated tool will be measured by the number of correctly identified targets (although shown measures of sensitivity and specificity are important). Authors indicate that when applied to the human proteome, the predictor ranked most highly protein families that are known PKA targets, such as histone H2A. I found it very intriguing that the top scorers also include ribosomal proteins L21e, L22, L6 that are not known to undergo phosphorylation or interact with protein kinases. However, there is a new body of evidence that some ribosomal proteins, for example S6, can be phosphorylated by specific protein kinases (Ruvinsky & Meyuhas, 2006 Trends Biochem Sci 31:342-8). Thus, predictions look very exciting and indeed produce testable hypotheses that might lead to novel discoveries in eukaryotic signal transduction.
Author response: Similar to the first reviewer, this referee expresses his concern with respect to readability of the article. We think that the new introductory overview section of the Results removes these concerns. We are grateful for the hint to the Ruvinsky & Meyuhas article that supports some of the predictions in this work. We complement the summary with some of our biological results.
The prediction tool is available as WWW server at http://mendel.imp.univie.ac.at/sat/pkaPS/ and it is thought for fair use. Please contact the authors if large sets (>500 sequences) are planned to be analyzed. The access is possible with any web-browsing tool.
The authors are grateful for generous support from Boehringer Ingelheim. This project has been partly funded by the Austrian Gen-AU bioinformatics integration network (BIN phase II) sponsored by BM-BWK. The computational facilities have been supported by SUN Microsystems, Inc. within an academic Center of Excellence.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.