Experimental setup
We compare the performance of the following approaches:

1. SWC+DECOMP+SSS: integrated approach consisting of SWC, DECOMP, and SSS
2. SWC: Supervised Weighting of Composite network, using six clustering algorithms combined with majority voting
3. DECOMP: decomposition of PPI network, using six clustering algorithms combined with majority voting
4. SSS: Size-Specific Supervised Weighting
5. PPI+COMBINE: PPI network weighted by reliability, using six clustering algorithms combined with majority voting
6. PPI+clustering algorithm: PPI network weighted by reliability, using a single clustering algorithm
We perform random subsampling cross-validation, repeated over ten rounds, using manually curated complexes as reference complexes for training and testing. For yeast, we use the CYC2008 [13] set which consists of 408 complexes. For human, we use the CORUM [14] set which consists of 1829 complexes. Previously, we had tested our approaches on only large complexes (for SWC [9] and DECOMP [10]), or only small complexes (for SSS [12]); here, we test our integrated approach on all complexes, large and small. In each cross-validation round, t% of the complexes are selected for testing, while all the remaining complexes are used for training. Thus we use a large percentage of test complexes, t%=90%, giving 41 training complexes in yeast, and 183 training complexes in human. Each edge (u,v) in the network is given a class label co-complex if u and v are in the same training complex, otherwise its class label is non-co-complex. For SSS, the edges labeled co-complex are further split into two subclasses, small-co-complex and large-co-complex, for edges in small complexes (composed of two or three distinct proteins) and large complexes (composed of at least four distinct proteins), respectively. For the supervised approaches, learning is performed using these labels, and the edges of the entire network are then weighted using the learned models. The top-weighted k edges from the network are then used by the clustering algorithms to predict complexes. In our experiments we use k=20000 for SWC and DECOMP, and k=10000 for SSS (as described in their respective papers).
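The split and the edge labeling described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `split_complexes` and `label_edges` are hypothetical, and the rule for a protein pair that occurs in both a small and a large training complex (here, the first complex encountered decides) is an assumption, since the text does not specify it.

```python
import random
from itertools import combinations

def split_complexes(complexes, test_percent=90, seed=0):
    """Randomly hold out test_percent % of the complexes; train on the rest."""
    rng = random.Random(seed)
    shuffled = list(complexes)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * (100 - test_percent) / 100)
    return shuffled[:n_train], shuffled[n_train:]  # (training, test)

def label_edges(edges, training_complexes, small_max=3):
    """Label each edge with the SSS classes; SWC and DECOMP would merge
    the two co-complex subclasses into a single co-complex class."""
    pair_label = {}
    for members in training_complexes:
        members = set(members)
        label = "small-co-complex" if len(members) <= small_max else "large-co-complex"
        for u, v in combinations(members, 2):
            # Assumption: if a pair shares several complexes, the first one wins.
            pair_label.setdefault(frozenset((u, v)), label)
    return {
        (u, v): pair_label.get(frozenset((u, v)), "non-co-complex")
        for u, v in edges
    }
```

With t%=90%, rounding 10% of 408 and of 1829 complexes gives the 41 and 183 training complexes quoted above.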
We use precision-recall graphs to evaluate how well the predicted clusters match the test complexes. Each cluster P is ranked by its score. To obtain a precision-recall graph, we calculate and plot the precision and recall of the predicted clusters at various cluster-score thresholds. Given a set of predicted clusters P={P_{1},P_{2},…}, a set of test reference complexes C={C_{1},C_{2},…}, and a set of training reference complexes T={T_{1},T_{2},…}, the recall and precision at score threshold s are defined as follows:
$$\begin{array}{c} Recall_{s} = \frac{\left| \{ C_{i} \mid C_{i} \in \textbf{C} \wedge \exists P_{j} \in \textbf{P},\, score(P_{j}) \geq s,\, P_{j} \; matches \; C_{i} \} \right|} {\left| \textbf{C} \right|} \end{array} $$
$$\begin{array}{l} Precision_{s} = \frac{\left| \{ P_{j} \mid P_{j} \in \textbf{P},\, score(P_{j}) \geq s \wedge \exists C_{i} \in \textbf{C}, C_{i} \, matches \, P_{j} \} \right|} {\left| \{ P_{k} \mid P_{k} \in \textbf{P},\, score(P_{k}) \geq s \wedge (\nexists T_{i} \in \textbf{T}, T_{i} \, matches \, P_{k} \vee \exists C_{i} \in \textbf{C}, C_{i} \; matches \, P_{k}) \} \right|} \end{array} $$
$$C \, matches \, P =\left\{ \begin{array}{ll} \text{true} & \text{if}\,\, size(C) > 3 \wedge size(P) > 3 \wedge Jaccard(P,C) \geq lg\_match \\ & \text{or}\,\, size(C) \leq 3 \wedge size(P) \leq 3 \wedge Jaccard(P,C) \geq sm\_match \\ \text{false} & \text{otherwise} \end{array} \right. $$
The precision of clusters is calculated only among those clusters that do not match a training complex, to eliminate the bias of the supervised approaches for predicting training complexes well. We require small complexes to be matched perfectly, as a mismatch of just one protein in a small complex may render the prediction less useful; on the other hand, we allow a slight tolerance for mismatch for large complexes. Thus we require that small complexes must be matched by small clusters with a match threshold of sm_match, and large complexes must be matched by large clusters with a different threshold of lg_match. We define lg_match=0.75 for large yeast complexes, lg_match=0.5 for large human complexes (since they are more challenging to predict), and sm_match=1 for small complexes in both yeast and human.
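The definitions above translate fairly directly into code. The following is a minimal sketch, assuming clusters and complexes are represented as sets of protein identifiers and predictions as (cluster, score) pairs; the function names are hypothetical, and returning 0 when a denominator is empty is an assumption.

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def matches(c, p, lg_match=0.75, sm_match=1.0):
    """Size-specific matching: large with large, small with small."""
    c, p = set(c), set(p)
    if len(c) > 3 and len(p) > 3:
        return jaccard(p, c) >= lg_match
    if len(c) <= 3 and len(p) <= 3:
        return jaccard(p, c) >= sm_match
    return False

def precision_recall(scored_clusters, test, train, s, lg_match=0.75, sm_match=1.0):
    """Precision and recall at score threshold s."""
    kept = [p for p, score in scored_clusters if score >= s]

    def hits(refs, p):
        return any(matches(c, p, lg_match, sm_match) for c in refs)

    recalled = sum(any(matches(c, p, lg_match, sm_match) for p in kept) for c in test)
    true_pos = sum(hits(test, p) for p in kept)
    # Denominator: clusters that match no training complex, or match a test complex.
    eligible = sum((not hits(train, p)) or hits(test, p) for p in kept)
    recall = recalled / len(test) if test else 0.0
    precision = true_pos / eligible if eligible else 0.0
    return precision, recall
```

Sweeping s over the observed cluster scores and plotting the resulting (recall, precision) pairs gives one curve of a precision-recall graph; passing lg_match=0.5 corresponds to the human setting.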
Complex prediction
Figure 2 shows the precision-recall graphs for complex prediction in yeast. Figure 2a shows that SWC and DECOMP both attain higher precision than PPI+COMBINE, demonstrating the benefits of supervised weighting and PPI decomposition (note that all three of these approaches use the COMBINE strategy). As SSS's predictions are limited to small complexes, which are moreover difficult to predict under the perfect-matching requirement, it has lower precision than PPI+COMBINE. However, the integrated approach, SWC+DECOMP+SSS, is able to predict both large and small complexes, and achieves much higher recall as well as precision. Figure 2b shows that individual clustering algorithms (used with the PPI network) give lower precision and recall compared to PPI+COMBINE, showing the utility of combining the clusters from multiple clustering algorithms.
We noticed that the generated small clusters may depress the precision, as many of them are false positives. Figures 2c and 2d show the performance when these small clusters are removed. As expected, recall drops substantially, as the small complexes can no longer be predicted: for example, for PPI+COMBINE, recall drops from over 40% to about 20%. However, precision is improved, as the many false-positive small clusters are removed. For our integrated approach (SWC+DECOMP+SSS), the removal of small clusters means removing those clusters generated by SSS. We still achieve higher precision and recall than the other approaches, showing that our integrated approach still outperforms other approaches when considering large complexes only. Moreover, without removing small clusters, our integrated approach maintains high precision as it uses a specialized approach, SSS, to predict small complexes.
Figure 3 shows the corresponding precision-recall graphs for complex prediction in human. Figure 3a shows that SWC and DECOMP both attain higher precision than PPI+COMBINE, showing the benefits of supervised weighting and PPI decomposition. SSS shows poor performance as it is limited to predicting small complexes, which is especially challenging in human. The integrated approach, SWC+DECOMP+SSS, is able to predict both large and small complexes, and achieves higher recall as well as precision. Figure 3b shows that most of the individual clustering algorithms (used with the PPI network) give lower precision or recall compared to PPI+COMBINE, showing the utility of combining the clusters from multiple clustering algorithms. The exception is Coach, which attains high precision as it does not generate small clusters by design, thereby cutting down on its false-positive predictions.
Figures 3c and 3d show the performance when the generated small clusters are removed. Compared to yeast, here the recall does not drop as much: for example, for PPI+COMBINE, recall drops by only about 5%. However, the improvement in precision is substantial: for example, PPI+COMBINE sees a more than fivefold increase in precision at many points in the graph. This reveals an issue in complex prediction which is more obvious in human but still apparent in yeast: predicting small complexes alongside large ones means accepting a drop in precision due to large numbers of false-positive small clusters, while improving precision by excluding small clusters means that no small complexes can be predicted. In contrast, our integrated approach uses a specialized approach, SSS, to generate the small clusters separately from the large ones, which allows effective prediction of the small complexes while still maintaining high precision levels.
To investigate the performance of our integrated approach with respect to the three challenges that we highlighted, we stratify the reference complexes in terms of their sizes, extraneous edges, and densities. First, to quantify whether a complex is embedded within a highly connected region of the PPI network, we derive EXT, the number of external proteins that are highly connected to it, where an external protein is highly connected if it is connected to at least half of the proteins in the complex. Second, to quantify how sparse a complex is, we derive DENS, the density of each complex, defined as the number of PPI edges in the complex divided by the total number of possible edges in the complex. In our analysis, we stratify the complexes into large and small complexes, and further stratify the large complexes into low, medium, and high DENS (corresponding to DENS of [0, 0.35], (0.35, 0.7], and (0.7, 1] respectively), and low and high EXT (corresponding to EXT ≤3 and >3 respectively), to give seven strata in total (one for small complexes, and six for large complexes). Figures 4 and 5 show the size distribution of the yeast and human complexes, and the DENS and EXT analysis strata of the large complexes. Note that the less challenging complexes to predict are the large complexes with high DENS and low EXT, and these correspond to only around 15% and 5% of complexes in yeast and human respectively; the remaining complexes are challenging in some way, with the majority of them falling into the small-complex category.
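As an illustration of the stratification, EXT and DENS can be computed from an undirected PPI edge list as sketched below. The helper names (`build_adjacency`, `dens`, `ext`, `stratum`) and the stratum label strings are hypothetical; the sketch assumes an unweighted network without self-loops.

```python
from collections import defaultdict

def build_adjacency(edges):
    """Undirected adjacency sets; self-loops are ignored."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def dens(members, adj):
    """PPI edges inside the complex divided by the number of possible edges."""
    members = set(members)
    n = len(members)
    if n < 2:
        return 0.0
    internal = sum(len(adj.get(u, set()) & members) for u in members) // 2
    return internal / (n * (n - 1) / 2)

def ext(members, adj):
    """External proteins connected to at least half of the complex's proteins."""
    members = set(members)
    counts = defaultdict(int)
    for u in members:
        for v in adj.get(u, set()):
            if v not in members:
                counts[v] += 1
    return sum(1 for c in counts.values() if 2 * c >= len(members))

def stratum(members, adj):
    """One of the seven strata: small, or large with a DENS and an EXT level."""
    if len(set(members)) <= 3:
        return "small"
    d = dens(members, adj)
    dens_level = "low" if d <= 0.35 else "medium" if d <= 0.7 else "high"
    ext_level = "low" if ext(members, adj) <= 3 else "high"
    return f"large, {dens_level} DENS, {ext_level} EXT"
```

For example, a four-protein complex with four of its six possible internal edges has DENS of about 0.67 and falls in the medium-DENS stratum.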
We take the top 1000 clusters generated by each approach, and determine how well the reference complexes in the different strata are matched by these clusters. Figures 6 and 7 show the average improvements in matching scores among the stratified complexes for our approaches versus PPI+COMBINE, in yeast and human respectively.
Among yeast and human large complexes, SWC gives the biggest improvements among complexes with low to medium density: it uses data integration and supervised learning to fill in missing edges of sparse complexes to allow them to be predicted. Among sparse complexes, even those with high EXT see an improvement, showing that SWC's supervised weighting can effectively reduce the number of spurious edges in the PPI network. DECOMP gives the biggest improvements among complexes with high EXT, within each density stratum. This is because it decomposes the PPI network into spatially and temporally coherent subnetworks, in which complexes may become disconnected from their original densely connected neighbourhoods, allowing their borders to be better delimited by clustering algorithms. As expected, SSS improves the performance among small complexes. Our integrated approach (SWC+DECOMP+SSS) spreads out the improvements among the complexes in the different strata, showing that the different approaches complement each other to predict different types of challenging complexes.
Novel complexes
Here we investigate the number and quality of novel complexes predicted by our approaches. For the supervised approaches, we use the entire sets of reference complexes for training. We keep only predicted complexes that are novel, unique, and high-confidence. First, predicted complexes that are similar to each other are filtered to keep only the highest-scoring one. Next, we keep only the top-scoring predictions such that the precision of these predictions (i.e. the proportion of predictions that match a reference complex) is greater than 0.4. Finally, we keep only novel predictions by removing those that match a reference complex. We use a Jaccard similarity threshold of 0.5 in the above procedure for matching.
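The three filtering steps can be sketched as follows. This is one possible reading rather than the authors' code: the function name `novel_predictions` is hypothetical, duplicates are removed greedily in descending score order, and the precision cutoff is interpreted as the longest prefix of the ranked list whose precision exceeds 0.4. Both of these choices are assumptions.

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def novel_predictions(scored_clusters, references, match=0.5, min_precision=0.4):
    """Return novel, unique, high-confidence predictions as (cluster, score) pairs."""
    ranked = sorted(scored_clusters, key=lambda cs: cs[1], reverse=True)

    # Step 1: among mutually similar predictions, keep the highest-scoring one.
    unique = []
    for cluster, score in ranked:
        if all(jaccard(cluster, kept) < match for kept, _ in unique):
            unique.append((cluster, score))

    # Step 2: longest top-scoring prefix whose precision exceeds min_precision.
    known = [any(jaccard(c, r) >= match for r in references) for c, _ in unique]
    cutoff, hits = 0, 0
    for i, is_known in enumerate(known, start=1):
        hits += is_known
        if hits / i > min_precision:
            cutoff = i

    # Step 3: drop predictions that match a reference complex.
    return [cs for cs, is_known in zip(unique[:cutoff], known) if not is_known]
```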
We measure the quality of these novel predictions by their semantic coherence in each of the three GO classes: biological process (BP), cellular component (CC), and molecular function (MF). First, we use the most informative common ancestor method to calculate the semantic similarity between two GO terms [15]. Then we define the semantic coherence between two proteins as the highest semantic similarity between their two sets of annotated GO terms, for each GO class. Finally, the semantic coherence of a set of proteins is the average of their pairwise semantic coherence, for each GO class.
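The aggregation from term similarity to complex-level coherence can be sketched as below, for a single GO class. The term-level similarity of [15] is not reimplemented here: it is passed in as a function `term_sim`, a hypothetical parameter. Treating a pair with an unannotated protein as having coherence 0 is also an assumption.

```python
from itertools import combinations, product

def protein_coherence(terms_u, terms_v, term_sim):
    """Highest term-to-term similarity between two proteins' GO annotations."""
    if not terms_u or not terms_v:
        return 0.0  # assumption: unannotated proteins contribute zero
    return max(term_sim(a, b) for a, b in product(terms_u, terms_v))

def set_coherence(proteins, annotations, term_sim):
    """Average pairwise coherence over all protein pairs in a predicted complex."""
    pairs = list(combinations(sorted(set(proteins)), 2))
    if not pairs:
        return 0.0
    total = sum(
        protein_coherence(annotations.get(u, ()), annotations.get(v, ()), term_sim)
        for u, v in pairs
    )
    return total / len(pairs)
```

Running `set_coherence` once per GO class, with annotations restricted to BP, CC, or MF terms, gives the three coherence values reported for each set of predictions.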
Figure 8a shows the number and quality of novel predictions in yeast. Each of our individual approaches (SWC, DECOMP, and SSS) predicts more novel complexes compared to the baseline (PPI+COMBINE), while the integrated approach generates the highest number of novel complexes. The novel complexes from our individual approaches attain higher semantic coherence in one or more of the GO classes, compared to the baseline. The novel predictions from the integrated approach attain semantic coherence that is averaged out between its three constituent approaches, which gives it higher coherence than the baseline across all three GO classes.
Figure 8b shows the number and quality of novel predictions in human. As described above, PPI+COMBINE generates a great number of small clusters in human, most of which are false positives; this gives it a greater number of novel predictions compared to each of our individual approaches. Nonetheless, our integrated approach still generates the greatest number of novel complexes. As in yeast, our individual approaches generate novel complexes with greater semantic coherence compared to PPI+COMBINE; the integrated approach achieves greater semantic coherence, in all three GO classes, in its predictions compared to the baseline. Thus, in both yeast and human, our integrated approach generates the greatest number of novel predictions, with higher quality compared to the baseline approach of combined clustering with a PPI network.
An example novel human complex discovered by our integrated approach consists of proteins PELP1, SENP3, TEX10, and LAS1L. This complex is not part of our reference complexes, but is validated by the literature: its proteins are four of the five members of the recently characterized 5FMC complex [16]. This complex is also predicted by SWC and DECOMP at low scores (ranked 103 and 12 respectively); through voting, our integrated approach rescores it much higher (second-highest rank).