SVM training, parameter tuning, and validation
The SVM is a statistical learning method originally developed by Vapnik [18, 19, 52]. SVMs are based on rigorous statistical principles and perform well when making predictions on many types of large genomic datasets. The algorithm seeks a maximal separation between two groups of binary-labelled (e.g., 0/1 or negative/positive) training examples [53]. The training examples are feature vectors x of individual genes, each vector populated by measurements taken on genome-scale datasets (see below). These measurements are the attributes of the data. A single classifier based on these features is then constructed to predict targets for each TF. Positives are genes which are known targets of the TF, and negatives are a randomly chosen subset of genes (equal in size to the positive set) which are least likely to be targets. The positives are taken from ChIP-chip experiments [23, 54], Transfac 6.0 Public [24], and a list curated by Young et al., from which we have excluded indirect evidence such as sequence analysis and expression correlation [25]. ChIP-chip interactions with p-value ≤ 10⁻³ are considered positives [54] and are not filtered using any other criteria. Because some TFs are tested under multiple experimental conditions, the negatives are the genes with the highest average binding p-values across all experiments (and which are not targets under any condition).
The separation of targets from non-targets is accomplished by an optimization which finds a hyperplane separating the two classes [18, 19]. Aside from the number of features on which to base the classifier, only one parameter must be set in our framework before training the SVM: the C parameter, which controls the tradeoff between the so-called margin and the classification error. Since it would be computationally expensive to choose these parameters during the training of every classifier (and doing so would also be more likely to over-fit the data), they are first optimized on the classifier for one TF. The learned values are then applied to the remaining classifiers.
The C parameter adjusts the tolerance of the algorithm for misclassifications. Lower values indicate a higher tolerance for errors or noisy data, trading classification errors on the training set for a wider margin around the SVM separating hyperplane. As with feature selection (described below), the classifier for YIR018W (Yap5) was used as the prototype for parameter selection since it is known to regulate ~70 genes, which is close to the average for all TFs analyzed. A grid search was performed on the training set for YIR018W over many values of C, and classifier accuracy was measured with 5-fold cross-validation. The SVM was insensitive to the choice of C, with most values less than 1 showing similar performance. Tested values were [2⁻⁷, 2⁻⁵, 2⁻³, 2⁻¹, 1, 1.5, 2, 2², 2³, 2⁴, 2⁵, 2⁶]. The value 0.0078 (= 2⁻⁷) was chosen as this was the value reported by the SPIDER machine learning package [55] as having the best performance. In tests with other transcription factors the same value was chosen in most cases.
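For illustration, a minimal sketch of this grid search in Python, assuming scikit-learn's LinearSVC in place of the SPIDER/MATLAB implementation actually used; X and y denote the Yap5 feature matrix and binary labels.

```python
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# The C grid reported in the text.
C_GRID = [2**-7, 2**-5, 2**-3, 2**-1, 1, 1.5, 2, 2**2, 2**3, 2**4, 2**5, 2**6]

def select_C(X, y):
    """Return the C with the best mean 5-fold cross-validation accuracy."""
    scores = []
    for C in C_GRID:
        acc = cross_val_score(LinearSVC(C=C), X, y, cv=5).mean()
        scores.append((acc, C))
    return max(scores)[1]
```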
In previous work [17] we experimented with several varieties of SVM and found the linear SVM to be largely superior on the datasets examined here (unpublished work). Reanalyzing that original data shows that in one initial experiment, where the type of SVM was allowed to vary between linear, quadratic, and cubic, the linear SVM had a higher cross-validation accuracy in 90 out of 104 cases. Similarly, using cross-validation on 104 TF classifiers, the PPV obtained by the linear SVM was significantly higher by t-test than those obtained using RBF (p = 3.62e-4) or Gaussian (p = 8.5e-68) SVMs. The formulas for the RBF and Gaussian kernels are quite similar and can be seen in [17]. They are used as implemented in the SPIDER machine learning toolbox [55].
Choosing negatives for TF target prediction can be difficult, since there is no defined set of genes known not to be targets. As in our previous work, ChIP-chip results can serve as a guide. For every TF, ChIP-chip results are used to identify genes which have the highest p-values (least significant) for binding under all tested conditions. From these least significant binders, the negative gene pool is chosen to be at least three times the size of the positive set, or 600 genes, whichever is larger. Recall that for each individual classifier, the number of positives and negatives is balanced. This is done by randomly under-sampling from the negative pool. Classifiers constructed on different randomly chosen negatives may give different results, since some unknown targets may be incorrectly assigned to the negative set. To smooth out these fluctuations, 50 classifiers are constructed for each transcription factor using random resamplings from the negative pool, each resampling equal in size to the positive set, and all 50 classifiers are tested using leave-one-out cross-validation (LOOCV). The final performance statistics (Accuracy, PPV, etc.) are averages over the 50 trials. It is important to keep in mind that, although balanced datasets are often used in machine learning, using balanced sets means that an Accuracy of 50% is equivalent to random chance. Although many metrics exist by which to evaluate classifiers, we report Accuracy and Positive Predictive Value (PPV). Accuracy is defined as the ratio of correctly classified examples to all examples classified:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
where TP = true positive, TN = true negative, FP = false positive, and FN = false negative. PPV is defined as the ratio of correctly predicted positives to all positive predictions:

PPV = TP / (TP + FP)
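For concreteness, a minimal sketch of these two metrics in Python (assuming binary 0/1 label vectors; not the MATLAB code used in this study):

```python
import numpy as np

def accuracy_and_ppv(y_true, y_pred):
    """Compute Accuracy and PPV from binary (0/1) label arrays."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    ppv = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    return accuracy, ppv
```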
The scheme used here for classifier construction is outlined in Figure 1. To illustrate the construction and validation more concretely, a short outline is provided below (a code sketch of the full procedure follows the list). For a particular TF:
1. Assemble the positive set of n known targets. Sample n genes randomly from the negative pool (see above) to construct the negative set.
2. Split the data for LOOCV (all but one gene in the training set, one gene in the test set).
3. Use SVM-RFE to rank all features in the training set.
4. Construct SVM classifier on top 1500 features. Save full feature ranking.
5. Classify the left-out gene.
6. Repeat steps 2–5 to complete LOOCV. Save all feature rankings.
7. Calculate performance statistics (Accuracy, PPV, etc.)
8. Repeat steps 1–7 50 times.
9. Calculate final performance statistics (i.e., mean Accuracy, mean PPV).
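The procedure above can be compressed into a minimal sketch, assuming scikit-learn in place of the SPIDER/MATLAB implementation used here. `X_pos` is the feature matrix of the n known targets and `X_negpool` the negative gene pool; only Accuracy is computed for brevity, and scikit-learn's fractional-step RFE approximates the elimination schedule described in the feature selection section below.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.feature_selection import RFE
from sklearn.model_selection import LeaveOneOut

N_TRIALS, N_FEATURES, C = 50, 1500, 0.0078

def one_trial(X_pos, X_negpool, rng):
    """Steps 1-7: one LOOCV trial on a balanced, resampled negative set."""
    n = len(X_pos)
    neg = X_negpool[rng.choice(len(X_negpool), size=n, replace=False)]
    X, y = np.vstack([X_pos, neg]), np.array([1] * n + [0] * n)
    correct, rankings = [], []
    for train, test in LeaveOneOut().split(X):
        # Steps 3-4: rank features with SVM-RFE on the training fold only,
        # then classify with an SVM built on the top N_FEATURES features.
        rfe = RFE(LinearSVC(C=C), n_features_to_select=N_FEATURES, step=0.5)
        rfe.fit(X[train], y[train])
        rankings.append(rfe.ranking_)                   # save full ranking
        correct.append(rfe.predict(X[test])[0] == y[test][0])   # step 5
    return float(np.mean(correct)), rankings

def run_tf(X_pos, X_negpool, n_trials=N_TRIALS, seed=0):
    """Steps 8-9: repeat the trial 50 times and average the statistics."""
    rng = np.random.default_rng(seed)
    accs = [one_trial(X_pos, X_negpool, rng)[0] for _ in range(n_trials)]
    return float(np.mean(accs))
```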
A new gene can be classified by applying all 50 TF-specific classifiers to the feature vector for that gene. Each classification produces a Platt score (see below), and the mean of all 50 scores is used: if the mean P > 0.95, the gene is classified as a target of the TF. The full set of feature rankings from every training set is used to calculate the final feature rank (see below).
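A minimal sketch of this scoring step, assuming scikit-learn's Platt scaling (SVC with probability=True); `classifiers` is an assumed list holding the 50 fitted models with their per-trial feature selections.

```python
import numpy as np

def mean_platt_score(classifiers, x_new):
    """Average Platt probability of the positive class over the 50 models.

    `classifiers` is a list of (fitted SVC(probability=True) model,
    selected-feature index array) pairs; `x_new` is the gene's full
    feature vector.
    """
    scores = [clf.predict_proba(x_new[idx].reshape(1, -1))[0, 1]
              for clf, idx in classifiers]
    return float(np.mean(scores))

# The gene is called a target of the TF when the mean score exceeds 0.95.
```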
Genomic feature selection and ranking
The SVM algorithm can be used to select and rank data features. An important output of the algorithm is the vector w, which contains the learned weights of each feature. This vector points in a direction perpendicular to the separating hyperplane, and thus defines its orientation. Features with larger components in w are more useful in separating the positive and negative classes. The SVM recursive feature elimination (SVM-RFE) algorithm uses w to select features useful for classification [39]. The original SVM-RFE algorithm trains an SVM on a training set, and the components (attributes) of the feature vector x which have the smallest weights are discarded [39]. The w vector is then recomputed and the process is repeated until the desired number of attributes remains. If feature selection were performed only once for each TF prior to cross-validation, there would naturally be a risk of over-fitting, since both training and test information would be used to choose the best features. Instead we perform feature selection on each training set during cross-validation, ensuring that any over-fitting would be detected as low accuracy on the test set.
In our study, features from all genomic datasets are concatenated to produce large attribute sets for each gene. SVM-RFE is used to select relevant features during classifier training, and various feature subset sizes are tested using leave-one-out cross-validation. Thus we allow the datasets to adjust, automatically selecting the most important features irrespective of the datasets from which they originated. Once again Yap5 was used as a prototype. Figure 9 shows the effect that changes in feature number have on Yap5 classifier accuracy. Although as few as five features achieve 70% accuracy, adding more features continues to improve accuracy until 1500 features are selected, where accuracy is approximately 85%. 1500 is therefore the number chosen for the remaining TFs; the specific 1500 features are, of course, chosen individually for each TF. A better procedure would be to also choose the best number of features for each classifier, as this may vary from TF to TF, but such a procedure would be computationally prohibitive. In our procedure half of the features are removed during each iteration of SVM-RFE until at least 1550 features remain; features are then removed one at a time until the target of 1500 is reached.
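A minimal sketch of this two-stage elimination schedule, assuming scikit-learn's RFE (whose fractional step removes a percentage of the surviving features per iteration) rather than the SPIDER implementation:

```python
from sklearn.svm import LinearSVC
from sklearn.feature_selection import RFE

def rfe_to_1500(X_train, y_train, C=0.0078):
    """Halve the feature set per iteration down to 1550 features,
    then remove one feature at a time down to the final 1500."""
    coarse = RFE(LinearSVC(C=C), n_features_to_select=1550, step=0.5)
    coarse.fit(X_train, y_train)
    X_coarse = coarse.transform(X_train)
    fine = RFE(LinearSVC(C=C), n_features_to_select=1500, step=1)
    fine.fit(X_coarse, y_train)
    return coarse, fine   # chain their masks to map back to original features
```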
It is important to appreciate the difference between the feature selection performed during classifier construction and the feature ranking which is later used to assess the overall usefulness of features for predicting targets of a TF. Feature selection for each classifier is simply an application of the SVM-RFE procedure. The feature rankings mentioned in the Discussion are created by compiling the individual rankings from all 50 TF classifiers. This provides a more accurate selection of features by choosing those which are ranked highly in a majority of training sets.
The final feature rank is computed in the following way. After a single application of the SVM-RFE algorithm, the w-vector for the top 1500 features is used to determine the rank of the features in that training set, with higher-weighted features ranked higher. These rankings are accumulated over every training set during cross-validation of all 50 classifiers created for a TF. The result is a large set of feature rankings for a particular factor. The top 40 features (the number 40 was chosen arbitrarily) from each ranking are collected into a list, and a count is taken of the number of times each feature appears. The final rank is established by sorting the features by the frequency of their appearance. Features which are consistently ranked high across all cross-validation trials are therefore given a high final rank; such features are reliably important for separation and robust to changes in the training set. In keeping with our example of Swi6, there are a total of 7100 feature rankings available for Swi6 features (142 examples times 50 cross-validation repetitions). Figure 4 shows a plot of the features for Swi6 sorted by their occurrence in the top 40 ranked features within the 7100 rankings.
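A minimal sketch of this aggregation, assuming each per-fold ranking is an array of feature indices ordered from best to worst (e.g., 7100 such arrays for Swi6):

```python
from collections import Counter

def final_feature_rank(rankings, top=40):
    """Rank features by how often they appear in the top `top` of the
    per-fold rankings accumulated over all cross-validation trials."""
    counts = Counter()
    for r in rankings:
        counts.update(r[:top])        # the 40 best features of this fold
    # Most frequently top-ranked features first.
    return [feat for feat, _ in counts.most_common()]
```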
Classifying new targets and prediction significance
As described in [27], SVMs can provide a probabilistic output, which in this case measures the likelihood that any given gene is a target. Here this output is referred to simply as a Platt score or enrichment score. The intuition behind this method of assigning scores is that data points lying deeper in the positive region (i.e., further from negative examples) are the most likely to be true positives.
Because the prior probabilities of each class (positive and negative) for a transcription factor are unknown, we choose the classes to be of equal size. Thus, the Platt probabilities are strictly accurate only on the training set rather than on the entire genome, where the classes would be very imbalanced. Using a balanced training set has the advantage that classifiers trained on balanced data often perform better (as measured by ROC analysis) than classifiers trained on imbalanced data. This can vary, however, depending on the data sampling technique used; see [56] for an insightful discussion and comparison of sampling strategies. It should thus be kept in mind when evaluating the results herein that, with balanced datasets, an Accuracy of 50% is random. As discussed above, randomized controls may be used to assess the significance of any single classifier.
Platt scores can nevertheless still be used to rank new predictions, and only genes with a score of ≥ 0.95 are considered potential targets in this study. The fact that our framework requires any new target to achieve an average score of ≥ 0.95 across 50 classifiers offers a partial intrinsic correction and increases confidence in the predictions made. Randomized simulations can then be used to determine whether any classifier performs significantly better than random.
Nevertheless, it may be of interest in future work to correct these raw Platt scores to account for the imbalances known to exist in the full genome (where, e.g., 90% or more of the genes may be non-binders for any TF). As a conservative assumption about the proportion of the genome (π) which will be bound by a TF, take the number of genes which are bound to be 10%, so π = 0.1. The p-value for any one gene, corrected for genomic imbalance, is then given by

p_full = p(1 − π) / (p(1 − π) + (1 − p)π)
where p is the p-value (1 − Platt score) on the sample and p_full is the corrected p-value for the genome. As an example, if a TF is predicted to bind a gene with a Platt score of 0.99, this conditional probability is equivalent to an uncorrected p-value of 0.01. The correction above transforms the p-value of 0.01 to a p_full of approximately 0.1. Note that this is a very conservative correction, since it does not take into account the fact that our Platt score is the average over 50 classifiers.
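A minimal sketch of this correction; the exact algebraic form above is a reconstruction consistent with the worked example, not a verbatim quotation of the original formula.

```python
def p_full(p, pi=0.1):
    """Correct a balanced-set p-value (1 - Platt score) for a genome in
    which only a fraction `pi` of genes is expected to be bound."""
    return p * (1 - pi) / (p * (1 - pi) + (1 - p) * pi)

print(p_full(0.01))   # ~0.083, i.e. roughly the 0.1 quoted in the text
```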
Here and elsewhere we refer to the average, uncorrected Platt output using the upper case P (e.g., P > 0.99), whereas p-values measured by other means are shown in lower case (e.g., p < 0.01).
Feature Datasets
Eight different types of features were used to describe genes. The first six feature sets have been used previously, and their full descriptions along with many relevant references can be found in [17]. The remaining two datasets have been modified or are novel. Altogether these datasets comprise 15516 features. The k-mer based kernels are inspired by the spectrum kernel [57], the (g, k)-gappy kernel [58], and the mismatch kernel [59], which have been proposed for sequence classification. In cases where computations were made on promoter regions (datasets 1–3, 5, 7, 8), promoters were defined as the 800 base pairs upstream of the coding region (translation start site), and S. cerevisiae promoters were downloaded from RSA tools [60]. Before analysis all features are normalized to mean 0 and standard deviation 1. Each of the datasets is available for download at [28].
1. k-mers (KMER) – Feature vectors are formed by enumerating all possible strings of nucleotides of length 4, 5, and 6. The number of occurrences of each string is counted in a gene's promoter region, and this vector of counts is the gene's feature vector. The k-mer counting used in part for datasets 1, 2, and 8 was performed using code modified from a script kindly provided by Dr. William Stafford Noble of the University of Washington.
2. k-mers with Mismatch (M01) – Similar to the k-mer counts, occurrences of all strings of length 4, 5, and 6 are counted. In addition, any window which matches a string with exactly one mismatch is also considered a hit, but is given a count of 0.1 rather than 1.
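A minimal pure-Python sketch of the KMER and M01 counting for a single promoter (not the modified script actually used); it is written for clarity rather than speed.

```python
from itertools import product

def kmer_features(promoter, k=6, mismatch_weight=0.0):
    """Count k-mer occurrences; with mismatch_weight > 0 (e.g. 0.1 for
    the M01 features), windows at Hamming distance 1 from a k-mer also
    contribute that weight to its count."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0.0)
    for i in range(len(promoter) - k + 1):
        window = promoter[i:i + k]
        if window in counts:
            counts[window] += 1.0               # exact hit
        if mismatch_weight:
            for kmer in kmers:                  # one-mismatch hits
                if sum(a != b for a, b in zip(window, kmer)) == 1:
                    counts[kmer] += mismatch_weight
    return counts
```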
3. Melting Temperature Profile (MT) – It is possible that TF binding is facilitated by conformational adjustments in promoter DNA, which depends on the stability of the helix. Some recent evidence shows correlation between sites of promoter melting, regulatory sites, and transcription initiation sites [61]. The EMBOSS [62] toolbox is used to calculate the melting temperature profiles of all yeast promoters using a sliding window of 20 bp. The feature vectors are the same as described in [17]. A temperature is calculated for each 20 bp window in the promoter, and this vector of window temperatures comprises the MT profile. 20 bp is the default window size in the EMBOSS tool.
4. Homolog Conservation (HC) – BLASTP [63] is used to compare proteins in yeast to those in 180 prokaryotic genomes. The best-hit E-values to each genome are discretized by placing them into one of six bins using empirically determined E-value cut-offs. Bin numbers range from 0 (no significant hit) to 5 (very significant). Each gene then has 180 features, one for each genome, with values ranging from 0–5 signifying the strength of the best BLASTP hit of that gene's protein in the other genome.
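A minimal sketch of the binning step; the cut-offs below are illustrative placeholders, since the empirically determined values are not given here.

```python
import numpy as np

CUTOFFS = [1e-50, 1e-20, 1e-10, 1e-5, 1e-2]   # hypothetical bin edges

def hc_bin(e_value):
    """Map a best-hit BLASTP E-value to a bin from 5 (very significant)
    down to 0 (no significant hit)."""
    return 5 - int(np.searchsorted(CUTOFFS, e_value))
```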
5. k-mer Median Positions (Kpo) – For each possible k-mer (k = 4, 5, and 6) we record its median distance from the transcription start in each gene.
6. Expression (EXP) – Normalized log2 ratios for each gene across 1011 experiments [64] are used as features. Each gene's expression profile is normalized to a mean of 0 and standard deviation of 1. For each gene, a vector of length 1011 (one feature per expression experiment) is included in the dataset.
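A minimal sketch of the per-gene normalization, assuming an expression matrix of shape (genes, 1011 experiments):

```python
import numpy as np

def normalize_expression(expr):
    """Z-score each gene's profile: mean 0, standard deviation 1."""
    mu = expr.mean(axis=1, keepdims=True)
    sd = expr.std(axis=1, keepdims=True)
    sd[sd == 0] = 1.0          # guard against constant profiles
    return (expr - mu) / sd
```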
7. k-mer Overrepresentation (Kev) – This method counts the occurrences of each k-mer in a promoter and calculates the significance of its occurrence. The method is the same as that reported in our previous work [17], except that the binomial distribution is used to calculate p-values rather than the Poisson distribution, in line with the calculations described in RSA tools [40, 60]. A higher significance corresponds to a more relevant k-mer.
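A minimal sketch of a binomial overrepresentation score, assuming a background probability p0 for the k-mer (e.g., estimated from all promoters); the exact RSA-tools formulation may differ in detail.

```python
import numpy as np
from scipy.stats import binom

def kmer_significance(count, n_windows, p0):
    """-log10 of the binomial p-value of observing >= `count` occurrences
    of a k-mer among `n_windows` promoter positions with background
    probability `p0`."""
    p_value = binom.sf(count - 1, n_windows, p0)   # P(X >= count)
    return -np.log10(max(p_value, 1e-300))         # avoid log(0)
```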
8. Conserved k-mers – This method for constructing a k-mer conservation matrix is based on output generated by the PhastCons algorithm [41, 65]. PhastCons is a two-state phylogenetic hidden Markov model. The underlying idea is that conserved elements evolve more slowly than non-conserved elements; thus, the model has a "slow" state for conserved DNA and a "fast" state for non-conserved, more rapidly changing sites. Given DNA sequence alignments from multiple species, PhastCons outputs a probability score for each base pair in the alignment indicating from which state the sequence arises. This probability can be interpreted as the likelihood that the base pair is part of a conserved element. Genomic alignments for seven yeast species are used to generate the probability scores, which are available for download from the UCSC Genome Browser website [66, 67]. Note that the conservation scores shown in Figure 6 are from the PhastCons algorithm and were taken directly from the UCSC Genome Browser website.
During k-mer counting, each k-mer is given a unique weight depending on the average PhastCons score of its nucleotide positions. Simply weighting by the probabilities would result in missing data, since some genomic regions have no alignments. Instead we introduce a weighting scheme which increases the weight of a k-mer according to its conservation. Our weighting metric is:
w = 1 / (1 − β·P_c)

where P_c is the average PhastCons score for a particular k-mer and β is an adjustable parameter which controls how much the conservation of a k-mer increases its count. In this study we choose β = 0.75, so that an element with a maximum conservation of 1 has a count of 1/(1 − 0.75) = 4, while an element which shows no conservation has the default count of 1. Increasing β will further emphasize the effect of conservation. This method based on PhastCons is inspired by the "marginalized motif kernel for phylogenetic shadowing" introduced in [68]. Their method uses promoter alignments and a probabilistic model of fast and slow evolution to assess conserved elements. While their method can be considered more robust when good sequence alignments are available, we adopt the approach described here so that all yeast sequences may be included in our analysis.
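A minimal sketch of the conservation-weighted count using the weight reconstructed above; `phastcons` is assumed to give a per-base conservation probability aligned with `promoter`.

```python
def weighted_kmer_count(promoter, phastcons, kmer, beta=0.75):
    """Count occurrences of `kmer`, weighting each hit by 1/(1 - beta*Pc)."""
    k, total = len(kmer), 0.0
    for i in range(len(promoter) - k + 1):
        if promoter[i:i + k] == kmer:
            pc = sum(phastcons[i:i + k]) / k    # average PhastCons score
            total += 1.0 / (1.0 - beta * pc)    # fully conserved hit counts 4x
    return total
```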
Functional Analysis, Software, and Expression Data for the Swi6 Analysis
Statistical enrichment of GO biological process terms in gene sets was performed using the GO Term Finder on the Saccharomyces Genome Database website [69, 70]. Most of the analysis was performed in MATLAB [71] using custom scripts along with the SPIDER software package [55].
Expression data for Swi6 was taken from [32] and their associated website [72]. Expression data for YMR279C showing cell cycle induction was from [34] and was explored using the Expression Connection [35] at the SGD website.
VisAnt Networks
The networks (such as Figure 3) created with the VisAnt toolkit [20, 21] show links drawn from many publications. Any particular type of link (e.g., protein-protein interaction) may represent a collection of data from several genomic datasets. Each link type is referred to in VisAnt as a "method", and each method has a unique identifier. The method IDs for the link types in this paper include: M0037 (phylogenetic profile), M0013 (copurification), M0040 (screened yeast two-hybrid), M0031 (other biophysical), M0046 (Bayesian predicted interaction), M0045 (affinity chromatography). Complete references and datasets are available in the VisAnt suite, accessible from the VisAnt website [73].