A novel framework for horizontal and vertical data integration in cancer studies with application to survival time prediction models

Mihaylov, Iliyan; Kańduła, Maciej; Krachunov, Milko; Vassilev, Dimitar

doi:10.1186/s13062-019-0249-6

Research
Open access
Published: 21 November 2019

Biology Direct volume 14, Article number: 22 (2019)

Abstract

Background

Recently, high-throughput technologies have been widely used alongside clinical tests to study various types of cancer. Data generated in such large-scale studies are heterogeneous, of different types and formats. Because effective integration strategies are lacking, novel models are necessary for efficient and operative data integration, where both clinical and molecular information can be effectively joined for storage, access and ease of use. Such models, combined with machine learning methods for accurate prediction of survival time in cancer studies, can yield novel insights into disease development and lead to precise personalized therapies.

Results

We developed an approach for intelligent data integration of two cancer datasets (breast cancer and neuroblastoma), provided in the CAMDA 2018 ‘Cancer Data Integration Challenge’, and compared models for prediction of survival time. We developed a novel semantic network-based data integration framework that utilizes NoSQL databases, where we combined clinical and expression profile data, using both raw data records and external knowledge sources. Utilizing the integrated data we introduced the Tumor Integrated Clinical Feature (TICF), a new feature for accurate prediction of patient survival time. Finally, we applied and validated several machine learning models for survival time prediction.

Conclusion

We developed a framework for semantic integration of clinical and omics data that can borrow information across multiple cancer studies. By linking data with external domain knowledge sources our approach facilitates enrichment of the studied data by discovery of internal relations. The proposed and validated machine learning models for survival time prediction yielded accurate results.

Reviewers

This article was reviewed by Eran Elhaik, Wenzhong Xiao and Carlos Loucera.

Background

In the last decade, high-throughput technologies have been massively used alongside clinical tests to study various diseases in order to decipher the underlying biological mechanisms and devise novel therapeutic strategies. The generated high-throughput data often correspond to measurements of different biological entities (e.g., transcripts, proteins), represent various views on the same entity (e.g., genetic, epigenetic) and are created through different technologies (e.g., microarrays, RNA-Sequencing). The data are heterogeneous, of different types and formats. There is an obvious necessity to integrate the data, in order to store, access, relate, analyse and mine them easily.

Data integration is understood as a means of combining data from different sources, creating a unified view and improving their accessibility to a potential user [1–3]. Data integration and biomedical analyses are separate disciplines and have evolved in relative isolation. There is a general agreement that uniting both these disciplines in order to develop more sustainable methods for analysis is necessary [4, 5]. Data integration fundamentally involves querying across different data sources. These data sources could be, but are not limited to, separate relational databases or semi-structured data sources distributed across a network. Data integration facilitates dividing the whole data space into two major dimensions, referring to where data or knowledge about metadata reside and to the representation of data and data models. Biomedical experiments take advantage of a vast number of different analytical methods that facilitate mining relevant data from the dispersed information. Some of the most frequent experiments are related to gene expression profiling, clinical data analytics [6] and rational drug design [7, 8], which attempt to use all available biological and clinical knowledge to make informed development decisions. Moreover, machine learning-based approaches for finding and highlighting the useful knowledge in the vast space of abundant and heterogeneous data are applied for improving these analytics. Metadata, in particular, are gaining importance, being captured explicitly or inferred with the help of machine learning models. Some examples include the use of machine learning methods for the inference of data structure, data distribution, and common value patterns.

The heterogeneity of data makes any integrative analysis highly challenging. Data generated with different technologies include different sets of attributes. Where data are highly heterogeneous and weakly related, two interconnected integrative approaches are applied: horizontal and vertical integration (Fig. 1). Horizontal data integration unites information of the same type, but from different data sources and, potentially, in different formats. It facilitates uniting heterogeneous data, like clinical information, from many different sources in one data model. Vertical data integration, on the other hand, means relating different analyses and knowledge across multiple types of data, helping to manage links between the patient’s gene expression, clinical information, available chemical knowledge, and existing ontologies. Most existing approaches for data integration focus on one type of data or one disease and cannot facilitate cross-type or cross-disease integration [9, 10].

Related work

In this work horizontal integration is considered to be a management approach in which the raw data (patients, clinical records, expression profiles, etc.) can be “owned” and managed by one network. Usually, each type of raw data can define different semantics for common management purposes. In contrast, vertical integration semantically combines the attributes of each separate type of data that are related to one another. Additional information, in particular for the molecular data, can be found in external domain knowledge sources. With this newly added information the missing parts of the studied data can be filled in. In this way relations between attributes of the different records can be learnt. Currently, there are many established algorithms that address single-track data analysis [7, 8, 11, 12], and some recent successful approaches to integrative exploration [13]. These, however, usually only focus on one of the integration applications, either horizontal or vertical, underutilizing the entireness of the available information and the latent relations. We propose a novel framework that employs both these integration views. We show its value on a first example application to machine learning-based survival time prediction.

Novel model

In this study we combine data from neuroblastoma (NB) and breast cancer (BC). Via our data integration approach whole datasets are joined, but the semantic integrity of the data is kept and enriched. Through combining data from multiple cancers in this way we create a network of data where entities, like proteins, clinical features and expression features, are linked with each other [14]. Data can often be represented as networks, where nodes indicate biologically relevant entities (typically genes or proteins) and edges represent relationships between these entities (e.g., regulation, interaction). In our generated network, nodes represent patients and edges represent similarities between the patients’ profiles, consisting of clinical data, expression profiles and copy number information. Such a network can be used to group similar patients and to associate these groups with distinct features [15]. The main challenges here are: (1) building an appropriate linked data network, discovering a semi-structure of the data model [16] and mapping assertions by the applied model for data integration [17]; and (2) data cleaning, combined into a formal workflow for data integration.

We focus on two aspects of data integration: horizontal and vertical. As explained, horizontal data integration means combining data within the same data source. In the datasets analysed here, the data sources are, specifically: clinical information, expression profiles and copy number data. Each type of data is measured by a different technology and potentially available in various data formats. As an example, we treat clinical data from two cancers as one data source, or one entity, even if it is in different formats. These entities are, however, semantically similar. Vertical data integration, on the other hand, is applied to creating relations between all horizontally integrated objects. This vertical data integration provides a connection between all different types of entities. This connection covers relations between patients through clinical information, expression and copy number profiles. Based on these relations we can easily detect all patients closely related to each other by, for instance, protein mutations, diagnosis and/or therapy.

Different databases are required for horizontal and for vertical data integration because each of these approaches addresses different aspects of the integration problem. Horizontal data integration deals with unstructured and heterogeneous data. Thus, we use a document-based database (such as MongoDB), which can handle different data types and formats. For vertical data integration a graph-based database is applied, as it is suitable for representing relations, which are crucial in this case. In this study, all relations are established between existing records for each entity, and represented by a semi-structure.

A data integration model built on a NoSQL database can potentially unite data from medical studies, as an alternative to the most frequently used statistical/machine learning methods. Most NoSQL database systems share common characteristics, supporting scalability, availability and flexibility, and ensuring fast access times for storage, data retrieval and analysis [18, 19]. Very often, when cluster analysis methods are applied for grouping or joining data, issues occur, mainly with outliers, small classes, and above all with data whose relatedness changes dynamically. To overcome these problems a NoSQL database integration model can be applied. Further, we extend the potential of the model by using multiple datasets, regardless of the level of heterogeneity, formats, types of data, etc., all of which are very relevant in cancer studies [20].

Our integrative framework facilitates direct analyses of the data. We first focus on a specific clinically relevant application: modeling and prediction of the survival time of cancer patients. This consists of applying both conventional classification methods and machine learning algorithms. Via data integration a new integrated and universal feature for survival time prediction, i.e. one applicable to both cancers, is introduced. This feature is built from three clinical features which are most related to survivability. This integrated feature, further, provides a connection to the newly developed linked data network. The feature is used in a conventional k-neighbours classification method to find patients that are related most closely to the studied one. After that, via the linked data we find other patients who may not have the new integrative feature but are still related by different types of data, like gene expression or CNV. Machine learning models, based on support vector and decision tree regression, are then used for survival time prediction and cross-validation.

Material and methods

Our multilayer model for data integration consists of linked and internal networks built for both of the studied types of cancer: neuroblastoma and breast cancer. Both of these cancers include several types of data for each patient, such as expression data, copy number data and corresponding clinical information. In order to find common mutated proteins and to provide common therapies, we gain insight about the clinical outcome by detecting relations between these multiple types of data. By using such built relations we can find patients closest to the studied patient of interest, based on semantic similarity of diagnosis, applied therapy and gene expression profile. With this data integration model, which contains linked and relevant knowledge, we can build a specific network for each studied patient.

Modeling relations between molecular data sources and the linked information (clinical data, molecular data sources, patient records, etc.) is a crucial aspect of data integration in our study. In this regard two basic approaches have been proposed. The first approach, called here ‘internal data network’, requires data to be expressed in terms of internal relations. These relations can be found directly in the raw data. The second approach, called ‘linked data network’, requires the data to be “enriched” by using external domain knowledge sources [21].

Specifically, molecular data can be linked with external domain knowledge sources, like pathway and protein databases, by a general approach known as Linked Data schema. Linked Data is a method of publishing structured data so that it can be interlinked and become more informative through semantic queries. It is built upon standard Web technologies such as Hypertext Transfer Protocol (HTTP, [1]), Representational State Transfer (RESTful) and Uniform Resource Identifiers (URIs) and extends them to share information in a way that can be read automatically by computers, mostly via RESTful APIs [22].

The structure of Linked Data is based on a set of principles and standard recommendations created by the W3C. Single data points are identified with HTTP [1] URIs. Similar to how a web page can be retrieved by resolving its HTTP URI (e.g., ‘http://en.wikipedia.org/wiki/Presenilin’), data including a single entity in the Linked Data space can be retrieved by resolving its HTTP URI (e.g., ‘http://dbpedia.org/resource/Presenilin’). In order to “impute” missing parts of the integrated data, like protein annotations, protein relationships, mutations, finding hidden protein motifs, etc., it is necessary to use Linked Data from different domain knowledge sources, like UniProt, Ensembl, GO databases [23, 24]. This is defined as another network layer over the already built one in the data integration step. In Linked Data space all entities are interlinked. This results in one large overarching network where objects are interrelated. The challenge here is to apply this network to finding more complete and reliable information for each of the studied patients, as well as to be able to use this information for survival time prediction modeling.

Data description

Two datasets, neuroblastoma (NB) [12] and breast cancer (BC) [25], are used in this study. Data were provided by the CAMDA 2018 challenge [26]. Similar types of information are provided by different sources in different formats. The neuroblastoma dataset contains RNA-Seq gene expression profiles of 498 patients as well as Agilent microarray expression and aCGH copy number data for a matched subset of 145 patients each, and corresponding clinical information. The breast cancer set contains profiles for microarray expression and CNV copy number data, and clinical information (survival time, multiple prognostic markers, therapy data) for about 2,000 patients. The types of data and information sources are shown in Fig. 2. We integrate all data both horizontally and vertically.

Data preprocessing

For initial data preprocessing we developed a programming module in Python (version 3.7) with the scikit-learn library [27, 28] for reading in the raw files. The module automatically discovers the delimiter which separates each attribute in the raw data files. Each file has a header with rows containing specific information about the file, the technology applied for generation of this file, types and number of attributes, and references to other files (clinical data files have a reference to expression files via file ID). Our programming module reads this information in and uses it to create a so-called semi-structure. This semi-structure contains attributes which exist in each type of data. Data types include: clinical information, expression and copy number profiles (Fig. 1). This module is used to build a semi-structure repeatedly and iteratively, record by record. Each record is built from fields/attributes (all values from one record). For each record we store aggregated information for all fields in one data structure, which contains two parameters: field name and count of repeated fields [29]. When new fields are added to the data semi-structure they are imported into our database. The database consists of two layers: first, a non-relational document-based database; and second, a graph-based database. This way the workflow is completed and raw data are integrated into the database as a data semi-structure. The fields in each record represent a small subset of all fields/attributes. In the document-based database we apply a restriction (called ‘data schema’) based on the generated semi-structure. The applied data schema over each record for each type of data joins data in different formats and from different sources. For each type of data this data schema always contains the ID and the Sample ID (representing the name of the subject, as provided in the clinical information).
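The delimiter discovery and the field-count semi-structure described above can be sketched with the Python standard library. This is a minimal illustration, not the authors' module: the function names, the candidate delimiter set and the sample record layout are assumptions, and "count of repeated fields" is interpreted here as the number of records in which a field carries a value.

```python
import csv
import io
from collections import Counter


def sniff_delimiter(sample: str) -> str:
    """Guess the column delimiter from a text sample of a raw data file."""
    # The candidate set is an assumption; extend it if other separators occur.
    return csv.Sniffer().sniff(sample, delimiters=",;\t|").delimiter


def build_semi_structure(text: str) -> Counter:
    """Map each field name to the number of records that provide a value for it."""
    delimiter = sniff_delimiter(text)
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    fields = Counter()
    for record in reader:  # the semi-structure grows record by record
        for name, value in record.items():
            if value not in (None, ""):
                fields[name] += 1
    return fields


# Hypothetical tab-separated clinical file with missing values.
raw = "sample_id\tage\tstage\nP1\t54\t2\nP2\t\t4\nP3\t61\t\n"
semi_structure = build_semi_structure(raw)
```

Fields that appear in only a few records remain visible in the counter, which is what allows a document database schema to be derived from the data rather than fixed in advance.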

Data integration

Utilizing the semi-structure, heterogeneous data are integrated into one database, where the final goal is to create a network of relations between all types of data. In these networks, nodes represent patients and edges represent similarities between patient profiles. The similarity means that two patients are related to each other by multiple proteins, based on expression profiles and copy number changes. These networks of relations facilitate grouping of the patients. Patient groups can then be associated with distinct clinical outcome.

The network has two layers. The first layer, covering internal relationships, is built with raw data, i.e. clinical information, expression data, and copy number variants. These are transformed into relationships between patients and proteins. The second layer includes semantically linked data from external domain knowledge sources. These sources provide information about additional proteins related to those existing in our dataset. These new relations are stored in our graph-based database. In order to utilize the additional information from the external knowledge sources we link them within our network via hyperlinks (URLs). This way we can avoid the visual incomprehensibility that would be caused by the redundancy of information. These two layers are combined into one network, where each relation is weighted. Our approach to data integration consists of the following steps (Fig. 3).

All the data from the experimental datasets are integrated horizontally with NoSQL (MongoDB) technology and represented as a semi-structure. This results in a semi-structure per data type, i.e. all clinical data are united in a semi-structure, all expression data in another semi-structure, and all copy number data in a semi-structure. All the raw and metadata are stored in MongoDB in JSON format. In order to integrate the data further, vertically, we first need to find relations between already built semi-structures for clinical records, expression profiles and copy number data. These relationships are managed in the graph-based database − Neo4j. For example, patient A with semi-structure {ID, [attributes]} is related to patient B with semi-structure {ID, [attributes]}. In this relation ID is the important key, while the attributes provide general information about the type of data record (clinical, expression, copy number). Such relations facilitate building a network, different for each studied patient. This network includes expression profiles, copy number, and the mutated proteins. In this way we can detect and link all patients through a specific set of expressed and mutated proteins.

Linking external data sources

Through semantic data integration, specifically via HTTPS RESTful endpoints (programming access points), we are able to find additional relationships between proteins from the external domain knowledge sources (EDKS), like the Gene Ontology compendium (GO), UniProt and Ensembl [23, 24, 30]. Through EDKS, proteins can be found that are closely related to the ones available in the expression profiles. The strength of relation of proteins is established via a score mechanism [30]. For each protein, before importing it into the graph-based database for vertical integration, we search for related proteins. As a result, a list of proteins containing ‘Hugo symbols’ (protein identifiers) is obtained. We use these ‘Hugo symbols’ to find a semi-structure of proteins in our database. The semi-structure is then used to create relationships between the proteins. Thus, relationships between proteins found in our database are generated based also on data from EDKS.

Usually, the number of relationships generated with help of EDKS is unfeasibly large (over a billion), increasing dimensionality of such data. To account for that, we developed a strategy to continue working only with so-called “trusted relationships”. These “trusted relationships” are found by a scoring mechanism. This scoring mechanism is introduced to rank, i.e. score, the most relevant relations (based on semi-structures) originating from our datasets. Internal relationships, based on raw data, have higher score than the relations derived from linked data. We, furthermore, define them as trusted relationships when they occur more than 10 times among different patients. This is necessary for differentiating the significant links between the proteins and for reducing the noise of the relationships between the patients through the added protein information. The noise is introduced by the external knowledge sources, where, potentially, all proteins can be related. The scoring mechanism also ranks the relations originating from external knowledge sources. Naturally, these should have lower scores, compared to the ones derived from the real datasets. In the process of scoring we can also improve the scores of the relations stemming from the external domain knowledge sources on the basis of the frequency with which the certain relation appears. In the next step we classify the already integrated datasets by tumor-related properties. Specifically, we normalize the data by removing the mean and scaling to unit variance [20]. After that, a k-neighbours classification mechanism is applied to split the data into relatively equal groups. Classified data are further used to remove redundant records of the analysed patients. We then normalize the data again by removing the mean and scaling to unit variance.
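The filtering of "trusted relationships" can be illustrated with a short sketch. The threshold of more than 10 patients comes from the text; everything else is an assumption made for illustration: the base weights (the text only says that internal relations score higher than linked ones), the multiplication by frequency, the tuple layout of the observations and the protein names in the example.

```python
from collections import defaultdict

# Assumed base weights; only their ordering is taken from the text.
BASE_SCORE = {"internal": 1.0, "linked": 0.5}
MIN_PATIENTS = 10  # trusted when seen in more than 10 different patients


def trusted_relationships(observations):
    """Score protein pairs and keep those observed in enough patients.

    observations: iterable of (patient_id, protein_a, protein_b, source),
    where source is "internal" or "linked".
    """
    patients = defaultdict(set)
    sources = {}
    for patient_id, a, b, source in observations:
        key = tuple(sorted((a, b)))  # treat the protein pair as undirected
        patients[key].add(patient_id)
        # If a pair is seen via both routes, keep the higher-scoring source.
        if BASE_SCORE[source] > BASE_SCORE.get(sources.get(key), 0.0):
            sources[key] = source
    scored = {}
    for key, seen_in in patients.items():
        if len(seen_in) > MIN_PATIENTS:
            # Assumed scoring rule: base weight scaled by patient frequency.
            scored[key] = BASE_SCORE[sources[key]] * len(seen_in)
    return scored


# Hypothetical observations: two frequent pairs and one rare pair.
obs = [(f"P{i}", "TP53", "MDM2", "internal") for i in range(12)]
obs += [(f"P{i}", "MYCN", "ALK", "linked") for i in range(11)]
obs += [(f"P{i}", "BRCA1", "BARD1", "linked") for i in range(3)]
trusted = trusted_relationships(obs)
```

With these inputs the rare pair is dropped, and the internally derived pair outranks the linked one even though both pass the frequency threshold.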

Novel integrated tumor-specific feature

For survival time prediction in breast cancer the Nottingham prognostic index (NPI) is usually applied. It helps to determine prognosis following surgery. Its value is calculated using three pathological criteria: the size of the lesion, the number of involved lymph nodes, and the grade of the tumor. The NPI can be used to stratify patients into groups and is used to predict five-year survival (in accordance with the more commonly used time scales for survival in other types of cancers) [31]. We do not utilize NPI in our framework because it only applies to one specific disease, breast cancer. In our case a universal predictor is essential, in order to account for other cancers, e.g., neuroblastoma. Thus, we develop a novel and universal predictive parameter, the Tumor Integrated Clinical Feature (TICF). To predict patient survival time (in both cancer studies combined) we select specific informative clinical features. We tested different features, their combinations and order, and established the optimal setup (not shown). Specifically, the TICF feature is built by numerically concatenating tumor stage, tumor size and age at diagnosis (Fig. 3) in this exact order. The order of concatenation of the clinical data also reflects the importance of each piece of clinical information for tumor development and its relevance to the patient survival rate. A patient with a tumor in stage four, naturally, will have a shorter survival time compared to patients with a tumor in stage two. The next feature, tumor size, is added second because the survival rate of a patient decreases as the tumor size increases. It is also less important to the survival time than the stage of the tumor. Age at the time of diagnosis is concatenated third, and indicates that older patients have a lower survival rate. If the order of concatenation of these TICF-composing features differed, patients with distant survival-related features would be incorrectly grouped. In this manner, we provide a normalized distance between patients, essential in our subsequent machine learning approaches to survival time prediction.
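One way to read "numerically concatenating" is sketched below. The text does not specify field widths, units or how stages are encoded, so the zero-padded layout (one digit for stage, three for size in millimetres, three for age in years) and the function name are assumptions for illustration only.

```python
def build_ticf(stage: int, size_mm: int, age_years: int) -> int:
    """Concatenate stage, tumor size and age at diagnosis into one number.

    Assumed layout: S MMM AAA (stage, size in mm, age in years).
    Fixed widths keep the most significant digits tied to the stage.
    """
    if not (0 <= stage <= 9 and 0 <= size_mm <= 999 and 0 <= age_years <= 999):
        raise ValueError("value outside the assumed field width")
    return int(f"{stage}{size_mm:03d}{age_years:03d}")


# Hypothetical patients.
ticf_a = build_ticf(stage=2, size_mm=25, age_years=54)
ticf_b = build_ticf(stage=4, size_mm=18, age_years=3)
```

Under this layout the stage occupies the most significant digit, so two patients in different stages are always further apart than two patients in the same stage, regardless of size or age. Swapping the order would let age or size dominate the distance instead.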

Classification and data enrichment

We normalize the TICF feature by subtracting the mean and scaling it to unit variance. Centering and scaling are done independently for each record by computing the relevant statistics on the samples. Mean and standard deviation are then stored to be used in later data analysis with the transform method. Patients are then stratified into groups with regard to the TICF similarity using a k-neighborhood approach.
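The normalisation and neighbour search can be sketched as follows. This is a standard-library stand-in that mirrors a fit/transform split (statistics are computed once, stored, and reused); it is not the authors' code, and the function names and TICF values are hypothetical.

```python
from statistics import fmean, pstdev


def fit_scaler(values):
    """Compute and return the mean and (population) standard deviation."""
    mean = fmean(values)
    std = pstdev(values) or 1.0  # guard against a constant feature
    return mean, std


def transform(values, mean, std):
    """Center and scale values with previously stored statistics."""
    return [(v - mean) / std for v in values]


def nearest_patients(scaled, patient_ids, target_id, k):
    """Return the k patients whose scaled TICF is closest to the target's."""
    target = scaled[patient_ids.index(target_id)]
    others = [
        (abs(z - target), pid)
        for z, pid in zip(scaled, patient_ids)
        if pid != target_id
    ]
    return [pid for _, pid in sorted(others)[:k]]


# Hypothetical TICF values for five patients.
ticf = {"P1": 2025054, "P2": 2030060, "P3": 4018003, "P4": 2024050, "P5": 3040045}
ids = list(ticf)
mean, std = fit_scaler(list(ticf.values()))
scaled = transform(list(ticf.values()), mean, std)
neighbours = nearest_patients(scaled, ids, target_id="P1", k=2)
```

Storing `mean` and `std` separately means a new patient can later be scaled with the same statistics instead of recomputing them.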

Using the TICF we find, for every studied patient, the group of most relevant patients and build an individual dataset (Fig. 4). In the first step this dataset contains only patients from the found group. It contains information about the TICF and the related mutated proteins. In the already semantically integrated datasets we search for other relations between mutated proteins and patients. These relations can be found within the vertically integrated data. Within each of the defined patient groups we detect relations of these patients to certain proteins. Using these proteins we find relations to other patients, who have the same mutated proteins as in the selected group. These relations are all based on internal relationships. We, thus, enrich each defined group with new related patient records. The next step is to extend the number of related proteins of the selected group by using linked data, based on external knowledge sources. We, again, enrich the defined group of patients through new relations to proteins, and then to other related patients. To avoid redundancy of the linked data relationships we apply the scoring mechanism. This generates a massive dataset, which is different for each patient.
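The two enrichment steps (first through internal patient-protein relations, then through linked protein-protein relations) can be sketched with plain dictionaries and sets. The data layout, the function name and the example identifiers are assumptions; in the described framework these relations live in the graph database and are filtered by the scoring mechanism, which is omitted here.

```python
def enrich_group(group, patient_proteins, linked_proteins=None):
    """Extend a patient group with patients that share a mutated protein.

    patient_proteins: patient id -> set of mutated proteins (internal relations).
    linked_proteins:  protein -> set of related proteins from external sources.
    """
    linked_proteins = linked_proteins or {}
    # Proteins related to the initial group through the raw data.
    proteins = set()
    for pid in group:
        proteins |= patient_proteins.get(pid, set())
    # Optionally add proteins related through external knowledge.
    for protein in list(proteins):
        proteins |= linked_proteins.get(protein, set())
    # Any patient sharing at least one of these proteins joins the group.
    enriched = set(group)
    for pid, mutated in patient_proteins.items():
        if mutated & proteins:
            enriched.add(pid)
    return enriched


# Hypothetical relations.
patient_proteins = {
    "P1": {"TP53"},
    "P2": {"TP53", "ALK"},
    "P3": {"MYCN"},
    "P4": {"MDM2"},
    "P5": {"BRCA1"},
}
internal_only = enrich_group({"P1"}, patient_proteins)
with_linked = enrich_group({"P1"}, patient_proteins, {"TP53": {"MDM2"}})
```

In the example, the internal step adds P2 (shared TP53), and the linked step additionally brings in P4 through the assumed TP53-MDM2 relation.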

Survival time prediction models

Next we apply machine learning models to predict and validate survival time of the patients. Artificial intelligence, and in particular machine learning models, has been regularly used in cancer research, with practical implementations [32]. Artificial neural networks and decision trees, for example, have been used in cancer detection and diagnosis for nearly 30 years [33]. Various models, applying Support Vector Machine (SVM) to cancer prognosis, have been successfully used for approximately two decades [34].

Machine learning models used in our study are based on Support Vector Regression (SVR) with different kernels: Radial Basis Function (RBF), Linear and Poly, and Decision Tree Regression model (DTR). Similar models were shown to perform well for survival prediction in cancer studies [35, 36]. Moreover, using these models facilitates a seamless cross-validation.

The TICF features are built for a selected group of patients. We extend the selected group of patients with new closer patients from internal networks and linked data. This newly built set of patients includes enriched TICF features. This set of already enriched TICF features for the selected group and the respective relations are used as the input, i.e. the first parameter to the machine learning models. The second parameter is a number which represents a patient’s survivability. For the survivability prediction model we use the count of months after cancer is diagnosed, information available for most of the patients in the studied datasets. As a result, the machine learning models return an approximate value which represents survival time in months. As mentioned, these first models are based on Support Vector Regression with different kernels. SVR with RBF can be defined as a simple single-layer type of an artificial neural network called an RBF network. This RBF is used as an interpolation approach which ensures that the fitting set covers the entire data equidistantly. SVR-Linear represents a function for transforming the data into a higher dimensional feature space in order to enable a linear separation. SVR-Poly represents the similarity of vectors (training samples) in a feature space over polynomials of the original variables, allowing learning of non-linear models. The feature space of a polynomial kernel is equivalent to that of polynomial regression. In DTR, a decision tree represents a regression or classification model in the form of a tree structure. It breaks down a dataset into smaller subsets while at the same time an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes.

These models fit the features selected for survival time prediction from our integrated dataset and the yielded results are directly comparable.
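A minimal sketch of fitting the four model types with scikit-learn is given below. The data are synthetic stand-ins generated for illustration (a single feature in place of the enriched TICF, and a made-up linear relation to survival months), and the hyperparameters are library defaults or arbitrary choices, not the settings used in the study.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: one TICF-like feature, survival time in months.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(60, 1))
y = 120 - 80 * X[:, 0] + rng.normal(0, 5, size=60)

# One regressor per model type named in the text.
models = {
    "SVR-RBF": make_pipeline(StandardScaler(), SVR(kernel="rbf")),
    "SVR-Linear": make_pipeline(StandardScaler(), SVR(kernel="linear")),
    "SVR-Poly": make_pipeline(StandardScaler(), SVR(kernel="poly", degree=3)),
    "DTR": DecisionTreeRegressor(max_depth=3, random_state=0),
}

# Each model receives the feature matrix and the survival times,
# and returns an estimated survival time in months per patient.
predictions = {}
for name, model in models.items():
    model.fit(X, y)
    predictions[name] = model.predict(X[:5])
```

Because all four estimators share the same `fit`/`predict` interface, they can be trained on the same input and their outputs compared directly.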

(Cross-)validation

We validate the outcomes of the applied machine learning models by using randomly selected smaller subsets of both raw and integrated data, in a cross-validation setup. Specifically, a k-fold cross-validation is applied, where the original sample is randomly partitioned into k equal-size subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds can then be averaged to produce a single estimate. The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used, but in general, k remains an unfixed parameter. This validation model can be used to estimate any quantitative measure that is appropriate for the data and the model.
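The partitioning scheme described above can be written out in a few lines of standard-library Python. The function name and the fixed seed are illustrative choices; when the sample size is not divisible by k, fold sizes differ by at most one.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random partition of the sample
    folds = [indices[i::k] for i in range(k)]  # k near-equal subsamples
    for i, validation in enumerate(folds):
        # The remaining k - 1 folds form the training data.
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


splits = list(k_fold_indices(n_samples=10, k=5))
```

Each index appears in exactly one validation fold, so averaging the k per-fold scores uses every observation once for validation.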

Results

Semantic data network

We developed a novel network-based data integration model, where we combine clinical and molecular data, using both raw data records and external knowledge sources. Relations derived from the raw data represent the internal network, and relations based on external domain knowledge sources (EDKS) are represented as a semantically linked network. Our semantically linked network is connected to EDKS via RESTful API endpoints. These endpoints are different for each type of EDKS. As a result we use two types of EDKS data. The first type consists of proteins from GO that are related to the studied protein, based on scores provided in GO. The second type includes additional information about proteins in the raw data, e.g., ‘Hugo Symbol’. These proteins are often not completely defined by families and domains, so we use the Hugo symbols to search for their domains and families through the EDKS. Using similar proteins from EDKS (GO) we semantically enrich our internal network with new knowledge about relations between proteins, which cannot be derived from the raw data. The resulting high-dimensional network, consisting of more than one billion relations, includes redundant information, which we reduce via our scoring mechanism. Technically, the fusion of the two studied types of cancer involves both horizontal and vertical data integration, using two different database models. The first is a document database model where all heterogeneous raw data are integrated. The second is a graph database model where all different types of relations between patients and proteins are joined. For the purpose of survival time prediction, combining clinical information, we developed a novel universal Tumor Integrated Clinical Feature (TICF). The TICF features are first identified using the raw data, based on three existing clinical features: tumor stage, tumor size and age at diagnosis. The TICF features are then used to create a patient similarity network that, in the next step, is further extended with molecular information.
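To make the two database models more concrete, the sketch below mimics them with plain Python structures: a list of heterogeneous documents stands in for the document store, and a list of typed edges stands in for the graph store. All field names, identifiers, values and the `EXPRESSES` relation label are hypothetical and serve only as illustration; the framework itself relies on dedicated NoSQL databases rather than in-memory structures.

```python
from collections import defaultdict

# Document model: heterogeneous raw records are stored side by side, each
# document carrying only the fields its source provides (hypothetical data).
documents = [
    {"patient_id": "P1", "cancer": "breast", "tumor_stage": 2, "age_at_diagnosis": 54},
    {"patient_id": "P2", "cancer": "neuroblastoma", "age_at_diagnosis": 3},
    {"patient_id": "P1", "hugo_symbol": "GENE_A", "expression": 1.8},
    {"patient_id": "P2", "hugo_symbol": "GENE_A", "expression": 0.4},
    {"patient_id": "P2", "hugo_symbol": "GENE_B", "expression": 2.9},
]


def build_edges(docs):
    """Graph model: derive (source, relation, target) edges from the documents."""
    edges = []
    for doc in docs:
        if "hugo_symbol" in doc:
            edges.append((doc["patient_id"], "EXPRESSES", doc["hugo_symbol"]))
    return edges


def patients_by_protein(edges):
    """Index patients by the protein they are linked to."""
    index = defaultdict(set)
    for patient, relation, protein in edges:
        if relation == "EXPRESSES":
            index[protein].add(patient)
    return index


if __name__ == "__main__":
    index = patients_by_protein(build_edges(documents))
    print(sorted(index["GENE_A"]))
```

In this toy setting, records from both cancers sit in one collection (horizontal integration), while the edges connect patients to molecular entities across record types (vertical integration).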

Figure 5 shows an example of a network of patients that are semantically related to a studied patient, i.e. the patient we are interested in analysing. We build a TICF for all the patients for whom the necessary clinical information is available. Focusing on a patient of interest, we then find patients related to her by TICF similarity. Specifically, we use the k-neighbours model to split patients into 5 classes. This initial group contains a small set of patients because not every patient has a TICF feature. In the subsequent steps of the study, we use the molecular data and find all proteins related to this selected group of patients. Considering relationships between these proteins, based on the molecular data, we can link patients with each other. Next, we find all semantically related proteins using the external, i.e. linked, data (EDKS) to find additional related patients. We can then combine all information, both internal and linked relations, from the found group into one new extended, semantically enriched dataset.
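The expansion steps above can be sketched as a simple set-based traversal. The example assumes the initial TICF-based group is already given and treats protein relations as undirected, one-hop links; the function name `expand_patient_group` and all identifiers in the toy data are hypothetical, and the real framework queries a graph database instead.

```python
def expand_patient_group(seed_patients, patient_proteins, protein_links):
    """Return patients reachable from the seed group through related proteins.

    patient_proteins: dict mapping patient id -> set of protein symbols
    protein_links: iterable of (protein_a, protein_b) pairs, internal or external
    """
    # Proteins observed in the seed group.
    proteins = set()
    for patient in seed_patients:
        proteins |= patient_proteins.get(patient, set())

    # Add proteins directly related to them (one hop, treated as undirected).
    related = set(proteins)
    for a, b in protein_links:
        if a in proteins:
            related.add(b)
        if b in proteins:
            related.add(a)

    # Any patient sharing at least one of these proteins joins the group.
    linked = {p for p, prots in patient_proteins.items() if prots & related}
    return linked | set(seed_patients)


if __name__ == "__main__":
    patient_proteins = {"P1": {"A"}, "P2": {"A", "B"}, "P3": {"C"}, "P4": {"D"}}
    internal_links = [("A", "B")]   # relations derived from the raw data
    external_links = [("A", "C")]   # relations taken from external sources
    print(sorted(expand_patient_group({"P1"}, patient_proteins, internal_links)))
    print(sorted(expand_patient_group({"P1"}, patient_proteins,
                                      internal_links + external_links)))
```

Adding the external links enlarges the group, which mirrors how linked data contributes patients that the raw data alone would not connect.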

Machine learning models for survival time prediction can then be applied to any patient within this dataset who has a defined TICF feature.

Machine learning models for survival time prediction

Groups obtained via the TICF feature can, naturally, be unbalanced. For example, some groups include patients with a smaller number of data records, which presents an obstacle for predicting the survival time. For validation we focus on smaller datasets (approximately 25% of the whole dataset) from the raw data, which are split into 5 subgroups by using the k-fold algorithm.

After the dataset is normalised and patients stratified into groups we apply several machine learning models for survival time prediction: Support Vector Regression (SVR with RBF, Linear and Polynomial kernels) as well as Decision Tree Regression (DTR).
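As an illustration of how such a comparison could be set up, the sketch below fits the four regressors on synthetic data with 5-fold cross-validation. It assumes NumPy and scikit-learn are available; the text does not state which library the study used, and the data, hyperparameters (library defaults) and scoring choice here are assumptions for the example only.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: 60 "patients", 3 features, noisy linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = 24.0 + X @ np.array([5.0, -3.0, 2.0]) + rng.normal(scale=0.5, size=60)

models = {
    "SVR-RBF": SVR(kernel="rbf"),
    "SVR-Linear": SVR(kernel="linear"),
    "SVR-Polynomial": SVR(kernel="poly"),
    "DTR": DecisionTreeRegressor(random_state=0),
}

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = {}
for name, model in models.items():
    # Features are standardised inside each fold to avoid information leakage.
    pipeline = make_pipeline(StandardScaler(), model)
    scores[name] = cross_val_score(pipeline, X, y, cv=cv, scoring="r2").mean()

for name, score in scores.items():
    print(f"{name}: mean R^2 = {score:.3f}")
```

The numbers printed for this synthetic example say nothing about the cancer datasets; the sketch only shows the shape of the comparison.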

Fig. 6 shows the performance (accuracy) of the machine learning models applied to survival time prediction. Survival time is predicted using the data of both cancers combined and with our framework used for processing and integrating the data. Decision Tree Regression (DTR) and Support Vector Regression with linear kernel (SVR-Linear) perform best, the latter yielding the most accurate results for survival time prediction. The potential of these models is in improving the accuracy of survival time prediction by improving iteratively the training dataset over the whole integrated dataset. Specifically, with every new studied patient we iterate over, we enrich the training dataset with new trusted relations from our linked relationships, defined by the increased frequency of their use.
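One possible reading of the frequency-based notion of trust is sketched below: each relation's usage is counted across studied patients, and a relation is treated as trusted once its count reaches a threshold. The function name `update_trusted`, the threshold value and the toy relations are hypothetical; the text does not specify the actual criterion.

```python
from collections import Counter


def update_trusted(usage, used_relations, threshold=3):
    """Record relation usage for one patient and return the trusted relations.

    usage: Counter mapping relation -> number of patients it has been used for
    used_relations: relations that contributed to the current patient's group
    threshold: hypothetical minimum usage count for a relation to be trusted
    """
    usage.update(set(used_relations))  # count each relation once per patient
    return {rel for rel, count in usage.items() if count >= threshold}


if __name__ == "__main__":
    usage = Counter()
    per_patient = [
        [("A", "B"), ("C", "D")],
        [("A", "B")],
        [("A", "B"), ("C", "D")],
    ]
    for relations in per_patient:
        trusted = update_trusted(usage, relations)
    print(trusted)
```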

Next we compare the applied models in a cross-validation approach (Tab. 1). The validation is based on four parameters for error evaluation: trained R² (coefficient of determination) and trained explained variance are related to the accuracy of the model used, while trained negative mean squared log error and negative mean absolute error are related to the noise (error) level. The resulting accuracies (Fig. 7) again confirm that the SVR-Linear and DTR models using TICF outperform other models, i.e. SVR-RBF, with regard to accuracy. This shows that SVR-Linear and DTR are more suitable, among the four compared models, for accurate survival time prediction.
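For reference, the four measures can be written out directly from their standard definitions. The sketch below uses only the Python standard library; the function name and the dictionary keys are chosen for the example, and the error measures are negated so that higher values are better for all four scores.

```python
import math


def mean(values):
    return sum(values) / len(values)


def regression_scores(y_true, y_pred):
    """Return the four scores; error measures are negated so higher is better.

    Values must be greater than -1 for the log-based error to be defined,
    which holds for non-negative survival times.
    """
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    y_mean = mean(y_true)
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)   # total sum of squares
    ss_res = sum(r ** 2 for r in residuals)           # residual sum of squares
    r_mean = mean(residuals)
    var_res = mean([(r - r_mean) ** 2 for r in residuals])
    var_true = ss_tot / len(y_true)
    msle = mean([(math.log1p(t) - math.log1p(p)) ** 2
                 for t, p in zip(y_true, y_pred)])
    mae = mean([abs(r) for r in residuals])
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "explained_variance": 1.0 - var_res / var_true,
        "neg_mean_squared_log_error": -msle,
        "neg_mean_absolute_error": -mae,
    }


if __name__ == "__main__":
    # Predictions with a constant bias of 2 time units.
    print(regression_scores([10, 20, 30, 40], [12, 22, 32, 42]))
```

With a constant bias, the explained variance stays at 1 while R² drops below 1, which shows why the two accuracy-related measures are reported separately.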

Table 1 Aggregated results of cross-validation

To confirm that our integrative framework is indeed necessary to obtain the best performance, we examine alternative approaches (see Additional files 1, 2, 3 and 4). First we show cross-validation results using our relational network but without the TICF integrated clinical feature (Additional file 1). Instead we use the clinical features as separate regressors in the experiment. Next we use the TICF feature but this time we do not extend the patient similarity search with the relational network (Additional file 2). Finally, we look at predictions when only separate clinical features are used and no relational network (Additional file 3). Execution times of the applied ML models are given in Additional file 4. As is evidenced, models building on our novel fully integrative framework outperform the alternatives.

Discussion

In this work we introduce a novel unified and universal approach for integration of data generated in independent cancer studies. We demonstrate its application to breast cancer and neuroblastoma datasets. Our model is built to facilitate application and extension to multiple different diseases with different types of multi-omics data. Subsequently, we highlight the clinical relevance of our data integration method by applying it to survival time prediction, using machine learning models.

The original contribution of our work is the data integration model. A number of interesting and different approaches to a similar problem were presented in the previous CAMDA challenges [26]. We developed our strategy for data integration by using a semantically defined network approach based on different database models. The major objectives of our integrative framework are to integrate and utilize information, including latent information, available in whole and dynamically growing datasets for multiple diseases. In addition to its potential extensibility, our data integration model facilitates seamless integration with external knowledge sources. The data integration challenge was solved by using models for horizontal and for vertical integration. Specifically, we applied new database technologies: a document database for horizontal integration and a graph database for vertical integration (MongoDB and Neo4j, respectively). Such software technology facilitates finding relations between the records in the integrated datasets. The main merit of our approach is that we are able to add more data and relations, also dynamically. We explore these opportunities by adding new semantically defined relations from the external knowledge sources. Such an approach not only gives us a solution to the particular task of the CAMDA challenge, but can also be applied in similar research and practical projects. Our software platform can be easily extended and supported.

Moreover, we apply and compare the performance of multiple machine learning models that use the semantically linked data. Specifically, we develop a new classification feature for survival time prediction.

The new feature, TICF, is an integrated parameter, allowing semantic enrichment through the semi-structure generated by our novel data integration model. TICF can be used for the application of certain machine learning models in order to find patients closely related to the studied one. Inclusion of related patients with different clinical and expression parameters, as we show, is essential for improving the accuracy of survival prediction models.

For survival time prediction we apply supervised regression models [35, 36]. Models used in this study utilize the TICF feature to improve the accuracy of patient survival time prediction. Moreover, application of these specific machine learning algorithms ensures a reliable validation of our semantic data integration approach. Cross-validation of these models showed stable results with regard to achieved accuracy – both in the context of success and error rates, in survival time prediction.

Conclusions

We developed models for defining and enriching relations by integrating data of various types and from disparate sources (two different cancer datasets), and consolidating them into meaningful and valuable information by the use of semantic technologies. The use of linked and overlaid NoSQL database technologies allowed us to aggregate the non-structured, heterogeneous cancer data with their various relationships. The applied semantic integration of different cancer datasets facilitates an enrichment of the studied data by discovery of mutual internal relations and relations with external domain knowledge sources.

We developed machine learning based models for survival time prediction in two types of cancer: breast cancer and neuroblastoma. We proposed a novel universal and integrative feature for classification and analysis, investigated its performance in a cross-validation setup with four machine learning models, and showed that the best results are obtained with our integrative framework, specifically when using Support Vector Regression with a linear kernel and Decision Tree Regression.

Reviewers’ comments

Reviewer’s report 1: Eran Elhaik, Ph.D

The proposed framework is indeed novel but I found the manuscript long and difficult to read. It should be shortened and more figures should be employed to better explain it. This can be done by adding figures showing the pipeline and how the framework works and revise the legends to be more informative. Typos should be corrected. The manuscript is original and significant for works in this field.

Author’s response: We agree with the suggestions and have improved the text. Figures illustrating the pipeline and how the framework works have been adjusted for better clarity. Typos have been corrected and the text has been optimized.

Reviewer’s report 2: Eran Elhaik, Ph.D

Is there a link to the system?

Author’s response: The system is developed internally and, at the moment, not for public use. However, we uploaded the latest version of the source code into a repository on GitHub, accessible after requesting permission. Our aim is to develop the system as a publicly accessible tool.

Reviewer’s report 3: Eran Elhaik, Ph.D

What are the results of the framework that has been applied to the 2 cancers? You wrote: “The potential of these models is in improving the accuracy of survival time prediction by improving iteratively the training dataset over the whole integrated dataset.” Where can we see the improving of the results accuracy?

Author’s response: The potential of these models is shown in Figs. 6 and 7 and in the now additionally introduced table. The more similar the data between two patients, the more accurate the prediction is. We have now added a table with aggregated numerical results with regard to accuracy, as indicated by R².

Reviewer’s report 4: Eran Elhaik, Ph.D

Figure 4 is unclear.

Author’s response: In Fig. 4 we present the idea of TICF and how it is organized.

Reviewer’s report 5: Eran Elhaik, Ph.D

Figure 5 is most unhelpful. Consider adding more information from proteins, etc. to increase clarity.

Author’s response: Fig. 5 shows how the Linked Data concept is applied for patient data integration. Linked data is a method of publishing structured data so that it can be interlinked and explored via semantic queries. The queries in our case can be done by using the information on the relation between a mutated base (e.g., CNVs) and a patient.

Reviewer’s report 6: Eran Elhaik, Ph.D

The legends of Figs. 6 and 7 are unclear. If I understand correctly, Fig. 6a demonstrates the accuracy of the model; this should be emphasized and more numerical results should be provided, but for which cancer are they? Also, these figures have subplots that are not mentioned in the legend.

Author’s response: We introduced Tab. 1 showing the numerical results. We give an answer to the rest of the question in our answer to question 3.