Prediction of Adverse Biological Effects of Chemicals Using Knowledge Graph Embeddings

Tracking #: 2658-3872

Authors: 
Erik B. Myklebust
Ernesto Jimenez-Ruiz
Jiaoyan Chen
Raoul Wolf
Knut Erik Tollefsen

Responsible editor: 
Guest Editors DeepL4KGs 2021

Submission type: 
Full Paper
Abstract: 
Semantic web technologies enable the interoperability of disparate data sources. We have created a knowledge graph based on major data sources used in ecotoxicological risk assessment. This facilities the use of the extensive library of semantic web tools. We have applied this knowledge graph to an important task in risk assessment, namely chemical effect prediction. Our extensive evaluation shows that by using knowledge graph embeddings we can increase the accuracy of effect prediction over a simple baseline. Furthermore, we have implemented a fine-tuning architecture which adapts the knowledge graph embeddings to the effect prediction task and leads to a better performance.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Adrien Coulet submitted on 16/Feb/2021
Suggestion:
Accept
Review Comment:

Summary:
This article presents two contributions:
+ an original knowledge graph (KG) named TERA that groups data and elements of knowledge about three types of entities: chemicals, living species, and effects of chemicals on living species;
+ an original experiment that uses supervised machine learning models, trained on the TERA KG, to predict the effects of chemicals on living species.

Major comments:
The article is well written, scientifically sound, well motivated by challenges in the field of ecotoxicology, and its contributions are original.
It is rather long (32 pages, including references and appendices). This length is in part due to the fact that it describes both the construction of the KG, and its use for the task of effect prediction.
Despite this length, I did not find any useless or boring sections, but rather many building blocks that are useful for understanding the whole.

I recommend accepting the article, which I found to be of both quality and interest.
I would recommend enriching the paper with answers to the following questions, to increase the benefit a reader gets from the article.

+ In the end, it is unclear what the authors demonstrate with the prediction experiment.
The benefit of KGE appears mixed, and the advantages of KGE models are only lightly discussed.
Results seem highly dependent on the data (in particular on the data selected for the experiment).
Is this consistent with other experiments? What is the balance between the effort to build TERA and the effort to find a subset suitable for predictive tasks?
Is the prediction easier, or of better quality, once the data is aggregated within TERA?

+ The rationale behind the choice of, and the distinction between, decomposition, geometric and convolutional models would be of interest.

+ “For each chemical in the effect data, we extract all triples connected to them using a directed crawl.”
The transformation of the RDF graph into model input is not made explicit. There are many alternatives, and this deserves a few words.
I understand that the chosen strategy is rather simple, but, for instance, I am not sure whether the predicate type is considered.

+ Do you think that taking more of the KG's complexity into account (a larger neighborhood of chemicals and species, transitivity) may affect prediction results?

+ Why choose sensitivity and specificity? Why not add F-measure and precision?

+ Important choices in the design of the prediction task are well described, but not discussed: the choice of Ŷ > 0.5, the choice of oversampling for class balancing, and the size of the entity neighborhood considered for the KGE.

+ Could you think of additional linked open data sources that could be easily connected and bring additional features to help discriminate between examples?
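
The question about the “directed crawl” can be made concrete with a minimal sketch of one possible reading. It assumes the graph is available as (subject, predicate, object) tuples, that the crawl follows only outgoing edges from each seed chemical, and that it stops after a fixed number of hops. The function name `directed_crawl`, the `max_depth` parameter and the example identifiers are hypothetical; this is not the authors' implementation.

```python
from collections import deque


def directed_crawl(triples, seeds, max_depth=2):
    """Collect triples reachable from `seeds` by following subject -> object edges.

    Hypothetical illustration only: edge direction, depth limit and the
    handling of predicates are assumptions, not taken from the manuscript.
    """
    # Index outgoing edges by subject so each expansion is a dictionary lookup.
    outgoing = {}
    for s, p, o in triples:
        outgoing.setdefault(s, []).append((p, o))

    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    collected = []
    seen = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        node, depth = queue.popleft()
        if depth >= max_depth:
            continue  # nodes at the depth limit are not expanded further
        for p, o in outgoing.get(node, []):
            collected.append((node, p, o))  # the predicate is kept in the output
            if o not in seen:
                seen.add(o)
                queue.append((o, depth + 1))
    return collected


example_triples = [
    ("chem:A", "ex:partOf", "chem:Group"),
    ("chem:Group", "ex:subClassOf", "chem:Root"),
    ("chem:Root", "ex:subClassOf", "chem:Top"),
    ("chem:B", "ex:partOf", "chem:Group"),
    ("sp:X", "ex:affectedBy", "chem:A"),
]
subgraph = directed_crawl(example_triples, ["chem:A"], max_depth=2)
```

Under these assumptions the crawl from `chem:A` keeps the two outgoing hops and ignores both the incoming edge from `sp:X` and the sibling `chem:B`. Whether incoming edges, predicate types and deeper neighborhoods are included is exactly the kind of choice the manuscript could state explicitly.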

Minor comments:
General
+ Section 2: it is unclear to me whether the risk assessment pipeline is standard in the ecotoxicology community or an original proposal of the authors.
+ Fig. 2 and main text: NCBI is used as a short name for NCBI Taxonomy. I found this confusing, since NCBI hosts many data resources.
+ Table 4: I found the indexing with Roman numerals heavy. Perhaps Arabic numbers with a prefix “t” (t1, t2, …)?
+ In 6.1.2 Sampling: why 78/11/11?

Phrasing/typo
+ This facilities the use
+ binary mortality effect prediction (a long compound noun, to my mind)
+ there is > 0 probability of lethality to test organisms (?)
+ Sensitivity measures the number of true positive classification (this is reductive, since other metrics, such as precision, also count TPs)
+ the Youden’s index is near zero (“near zero” is subjective; to my understanding, this happens in only one setting)
+ fail completely to capture the semantics chemicals and species
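
For reference, the metrics named in these comments can be written out from the confusion-matrix counts. The sketch below uses the standard definitions, with scores binarized at a threshold of 0.5 as in the manuscript's Ŷ > 0.5 rule; the function name `binary_metrics` and the toy labels and scores are illustrative assumptions, not data from the paper.

```python
def binary_metrics(y_true, y_score, threshold=0.5):
    """Compute standard binary classification metrics from labels and scores."""
    tp = fp = tn = fn = 0
    for truth, score in zip(y_true, y_score):
        pred = 1 if score > threshold else 0  # strict inequality, as in Y-hat > 0.5
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            fn += 1

    def ratio(num, den):
        # Return 0.0 for an empty denominator instead of raising.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)  # true positive rate (recall)
    specificity = ratio(tn, tn + fp)  # true negative rate
    precision = ratio(tp, tp + fp)
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "youden_j": sensitivity + specificity - 1,
    }


# Toy data (illustrative only): four positives followed by four negatives.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.2, 0.1, 0.5]
metrics = binary_metrics(labels, scores)
```

Sensitivity is a rate (TP / (TP + FN)), not a count, and precision shares the same numerator. Sensitivity and specificity are each computed within a single true class, so they do not change with the class balance of the evaluation set, whereas precision and F-measure do.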

Review #2
Anonymous submitted on 21/Feb/2021
Suggestion:
Minor Revision
Review Comment:


(1) originality:
This manuscript details improvements built upon the authors’ previous publications on a knowledge graph in the ecotoxicology domain.

The paper starts by defining ecotoxicology and explaining its importance. The challenges listed for datasets in the field center on interoperability across various data sources.
Knowledge graphs (RDF) and semantic web technologies are suggested as a solution for orchestrating these datasets.
For the sake of completeness, the manuscript provides many details. On the other hand, this makes the paper difficult to follow, especially in the methods part. The general quality of the manuscript is appropriate.
The main contributions of the work are the investigation of KG embedding methods and the addition of new datasets to the previously published KG. Therefore, novelty over the previous work is limited. However, the overall quality is fairly acceptable.

(2) significance of the results,
The manuscript provides appropriate details on KG embeddings for the proposed KG. The results are reported in sufficient detail.

(3) quality of writing:

Notes and minor improvement suggestions follow:

Contributions of the work:
1. Consolidation of information relevant to the ecotoxicology domain as a knowledge graph. Integration covers tabular data, RDF files, and SPARQL queries over public linked datasets such as Wikidata, and LogMap.
Biological: ECOTOX, 1M experiments, 12K chemicals, and 13K species.
Chemical: ECOTOX, Wikidata, PubChem, ChEMBL, MeSH.
Taxonomy: ECOTOX, NCBI.
Species traits: Encyclopedia of Life (EOL).

2. A prediction model using an MLP (multi-layer) and KG embedding models is presented.

3. The paper investigates the prediction performance of various embeddings, namely:
Decomposition models: DistMult, ComplEx, HolE
Geometric models: TransE, RotatE, pRotatE, HAKE
Convolutional models: ConvKB, ConvE.
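
To illustrate the difference between the first two families, here is a minimal sketch of the standard scoring functions of one decomposition model (DistMult) and one geometric model (TransE), written in plain Python on toy vectors. The vectors are made-up values chosen for illustration; the manuscript's dimensions, norms and training setup may differ, and convolutional models are omitted because they need learned filters.

```python
import math


def distmult_score(head, relation, tail):
    """DistMult: trilinear product sum_i h_i * r_i * t_i (higher is more plausible)."""
    return sum(h * r * t for h, r, t in zip(head, relation, tail))


def transe_score(head, relation, tail):
    """TransE: negative L2 distance between (head + relation) and tail."""
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


# Toy two-dimensional embeddings (illustrative values only).
h = [1.0, 0.0]
r = [0.5, 1.0]
t = [1.5, 1.0]

forward_distmult = distmult_score(h, r, t)
reverse_distmult = distmult_score(t, r, h)
forward_transe = transe_score(h, r, t)
reverse_transe = transe_score(t, r, h)
```

DistMult gives the same score when head and tail are swapped, so it cannot distinguish the direction of a relation, whereas TransE treats a relation as a translation and scores the reversed triple differently. A short discussion of such properties in the manuscript would help motivate the choice of models.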

Problems found in the manuscript are:
- The abstract is too generic. The methods should be mentioned as much as possible.
- The reasons for and benefits of adding new datasets could be detailed more in terms of usability. The novelty gained by adding these datasets is limited.
- The KG embedding methods used are fairly standard and offer limited novelty.
- The comparison of different embedding methods in a single evaluation is useful and a valuable effort.
- The MLP is a critical component in the prediction; however, it is not described in enough detail in the manuscript.
- The comparison between pre-trained and fine-tuned embeddings is unnecessary. The reader is generally interested in the performance of the final fine-tuned models. However, no action is necessary on this comment.

Review #3
Anonymous submitted on 10/Apr/2021
Suggestion:
Major Revision
Review Comment:

Summary
The paper proposes a knowledge-driven approach to address the problem of predicting adverse biological effects of chemicals on organisms. The proposed approach enables the integration of heterogeneous data sources into a knowledge graph, named TERA, and embeddings of the integrated entities enable the prediction of links among these entities. The predicted links correspond to binary chemical effects. The paper is an extension of an in-use paper from ISWC, in which new data sources, a prediction model, and an evaluation are included.
The benefits of using TERA embeddings to solve the prediction task are empirically evaluated; state-of-the-art embedding methods are assessed.

The addressed problem is relevant and the proposed solution innovative. The proposed methods are exhaustively evaluated in different configurations. The observed experimental results demonstrate the potential of exploiting knowledge integrated from different data sources for accurately predicting adverse biological effects of chemicals on organisms.
The manuscript is understandable but needs to be reorganized to present the results clearly. Moreover, many decisions made during the creation of the knowledge graph and the implementation of the prediction models are not justified. These issues reduce the value of this version of the work and impede the reproducibility of the reported results.

Detailed Comments

The paper needs to be restructured. Rather than presenting an analysis of the state of the art and positioning the proposed approach with respect to existing approaches, the related work section just presents preliminary concepts. Given the need to introduce all these concepts, a section on Preliminaries should be added. This section should define all the concepts required to understand the methods implemented in TERA. For example, what an embedding is and the different types of embeddings should be included as preliminaries; RDF and OWL should also be part of this section.
In the related work, the existing approaches need to be analysed in terms of how well they can solve the prediction problem addressed in this paper; for example, which of the existing geometric models is most appropriate for generating embeddings that represent the entities integrated in TERA.
Also, the problem of knowledge graph creation should be presented in one section and the problem of prediction in a different section.

The process of knowledge graph creation is not explained. In particular, it is not described how the data sources are mapped to the integrated schema and if mapping languages like RML are utilized for knowledge graph creation.

The research questions that guided the experimental study need to be presented. The characteristics of TERA that affect the quality of the outcome should be discussed in more detail.

Questions to the authors
Q1) What is the meaning of database orchestration?
Q2) How are embeddings represented in TERA? How is additional information integrated into the embeddings?
Q3) Why are ConvKB and ConvE suitable for the prediction problem addressed in this paper?
Q4) What process is followed to perform entity alignment among the integrated data sources?
Q5) What formalism is used to specify the mappings between the different data sources and the integrated schema?
Q6) What interoperability issues exist among the integrated data sources? What are the challenges of integrating new data sources? What are the benefits?
Q7) What criteria are followed in TERA to integrate taxonomy and trait data from NCBI and EOL, and chemical data from PubMed? How is data from PubMed represented?
Q8) Why were the methods LogMap, AML, and the similarity measure selected? What are the characteristics of the gathered mappings?
Q9) What is the completeness of the alignments with Wikidata, and between NCBI and EOL?
Q10) What is the meaning of the metrics reported in Table 6 in the context of TERA?
Q11) Why was the baseline model selected?
Q12) Why was the threshold for the similarity measure set to 0.9? What would happen if a higher value were chosen?
Q13) What benefits do the integrated data sources bring to the quality of the predictive models?
Q14) Which of the predictive models should be used to generate new hypotheses that can be investigated in a laboratory or studied in an environment?

Minor comments:
These sentences are not clear and should be rewritten:
S1) The formats of these data sources vary from tabular, to RDF files and SPARQL queries over public linked data.
S2) this is different for the species KGS For sampling strategies (i) and (ii), HAKE is extensively used in the top models to embed KGS.

Recommendations: The paper presents a relevant problem and the proposed approach has great potential. However, the current version suffers from several drawbacks that require a major revision of the work.