Review Comment:
The paper provides three tightly related contributions in the domain of knowledge graph maintenance:
(1) a method, called PaTyBRED, for the detection of erroneous facts in knowledge graphs;
(2) the CoCKG method to automatically fix those facts;
(3) a method to induce relation constraints in SHACL.
The first two contributions have already been published; the paper thus positions them as subtasks of the greater goal of knowledge graph quality maintenance. The experimental evaluation shows that the presented techniques (except for CoCKG) significantly outperform the state of the art on their corresponding tasks. Besides, the presented techniques seem sound, and the paper's assumptions reasonable.
1) Legibility and presentation
The paper is well-written and fairly legible. Its structure is adequate in my opinion.
2) Scientific contribution
The paper tackles an important problem: automatic knowledge base correction. It first focuses on the detection of erroneous facts with PaTyBRED. Once detected, those facts can be automatically corrected with CoCKG. Furthermore, the detected errors can be used to learn constraints that can help detect problems a priori. While PaTyBRED and constraint induction show quite satisfactory results, the results of CoCKG's evaluation are rather unsatisfactory. Nevertheless, I must acknowledge that automatic KG correction, when limited to internal dataset features, is a very hard task.
3) Detailed Review
Section 4.2.
- Q: Can two atoms in a path share both variables, i.e., residenceCountry(x1, y1), nationality(x1, y1)? If not, why? In any case, the paper should clarify, via examples, what a path looks like.
- I would appreciate more details about how the metrics inter(A, B), m1(A, B) and m2(A, B) are used to prune irrelevant paths. Is there any thresholding involved?
Section 4.3
- The examples "child(Trump, Ivanka), child(William, George), child(Kate, George), spouse(Trump, Melania)" overflow the column space.
Section 5.2
- The example after equations 4-7 is not really clear.
Section 5.4
- The authors claim to have done the evaluation on DBpedia and NELL, but Figure 3 shows results for YAGO too. It would be great to explain in a couple of sentences why PaTyBRED performs poorly on YAGO with more features (yago25).
Section 6.
- Given the low precision of CoCKG, I wonder why the authors have not considered reducing the number of candidate corrections with a link prediction approach, i.e., among the surviving subject/object correction candidates, keeping those that a link predictor rates as likely to occur in the triple. Moreover, by adding link prediction to the formula, the authors could test the performance of CoCKG when dropping the assumption that confusions tend to use more general IRIs.
- While the authors have stated their preference for endogenous features, they could use external sources such as search engines or other KBs (via the co-occurrence of the entities) to test the viability of a correction candidate.
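To make the link-prediction suggestion above concrete, here is a minimal, purely illustrative sketch (not the paper's method): candidate corrections for a suspect triple are re-ranked by a plausibility score, and only the top-rated ones survive. The toy `score_triple` function, the 2-d embeddings, and the entity names are all hypothetical stand-ins for any trained link predictor (e.g., a TransE-style model):

```python
# Illustrative only: filter CoCKG-style candidate corrections with a
# link-prediction score. All embeddings and names below are made up.

def score_triple(s, p, o, emb):
    # Toy TransE-style plausibility: -||emb[s] + emb[p] - emb[o]||_1
    # (higher score = more plausible triple).
    return -sum(abs(es + ep - eo) for es, ep, eo in zip(emb[s], emb[p], emb[o]))

def filter_candidates(candidates, emb, k=1):
    """Keep the k candidate (s, p, o) corrections the predictor rates highest."""
    ranked = sorted(candidates, key=lambda t: score_triple(*t, emb), reverse=True)
    return ranked[:k]

# Hypothetical 2-d embeddings for correcting a confused fact child(Trump, ?).
emb = {
    "Trump":   (0.0, 0.0),
    "child":   (1.0, 1.0),
    "Ivanka":  (1.0, 1.0),   # exactly fits emb[s] + emb[p]
    "George":  (3.0, -2.0),
    "Melania": (0.5, 1.5),
}
candidates = [("Trump", "child", o) for o in ("Ivanka", "George", "Melania")]
best = filter_candidates(candidates, emb, k=1)
# best == [("Trump", "child", "Ivanka")]
```

Such a filter would compose with the authors' existing pruning steps, since it only re-ranks candidates that already survived them.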
Section 8.2
- The example depicted in Figure 5 is not clear. I would recommend that the authors replace the labels c_i with actual examples.
- The a-posteriori pruning of the decision trees raises the question of why parameter tuning was not applied when learning the tree.
- It would be great to publish all the learned SHACL constraints.
I would suggest a major revision for this paper. In particular, the contributions of Section 6 are not convincing and need to be improved.