Correcting Assertions and Alignments of Large Scale Knowledge Bases

Tracking #: 2723-3937

Jiaoyan Chen
Ernesto Jimenez-Ruiz
Ian Horrocks
Xi Chen
Erik Bryhn Myklebust

Responsible editor: 
Guest Editors KG Validation and Quality

Submission type: 
Full Paper
Various knowledge bases (KBs) have been constructed via information extraction from encyclopedias, text and tables, as well as alignment of multiple sources. Their usefulness and usability are often limited by quality issues. One common issue is the presence of erroneous assertions and alignments, often caused by lexical or semantic confusion. We study the problem of correcting such assertions and alignments, and present a general correction framework which combines lexical matching, context-aware sub-KB extraction, semantic embedding, soft constraint mining and semantic consistency checking. The framework is evaluated using three representative large-scale KBs: DBpedia, an enterprise medical KB and a music KB constructed by aligning Wikidata, Discogs and MusicBrainz.
Major Revision

Solicited Reviews:
Review #1
By Petr Křemen submitted on 28/Mar/2021
Minor Revision
Review Comment:

The paper deals with the problem of correcting incorrect assertions in large KBs. The problem is well-motivated and definitely important. Authors present a general framework for error correction of KB assertions, consisting of multiple techniques aiming at improving the quality of a KB. The techniques presented in the paper are well-chosen and evaluated.

The paper is readable and written in understandable language. The paper structure follows common standards.

- p.3. - Authors mention that section 2.1 is not directly related to the paper topic, and I agree with them. For me, section 2.1 is rather distracting and could be significantly reduced or removed.
- p.4. - Authors claim their framework is general and does not assume any additional KB meta information or external data. Although the primary focus of the paper is the error correction technique itself, insight into how the approach performs compared to alternative techniques (in terms of accuracy and overall performance) on the major KBs would be very helpful, also to justify whether a general technique like this is beneficial enough.
- p.6. - Can a correction cause (hard) inconsistency of the KB? For example, 'a R b' corrected to 'a R a', might cause inconsistency for R irreflexive. Can similar situations happen in real KBs given their expressiveness?
- p.17 - When looking at fig.4. I was wondering how the performance gains actually boost the particular different KB use-cases mentioned in the intro (search, QA data integration, etc.) and whether some of these use-cases can benefit more from the introduced techniques.

Minor Comments:
- p.5. - Authors refer to "input KB", "aligned KB", "original KB" - these seem to denote the same thing - terminology unification would help.
- p.7. - Authors mention that for lexical matching they take into account "labels and name-like attributes". Were you considering combinations of these attributes as well (e.g. firstName + surName cf. fullName)?
- p.8. - Lookup Service Based => "Lexical matching (Lookup Service)" to keep it consistent with intro to sec. 4.2.
- p.9. - consider delimiting the concrete example in section 4.3.2 to make the paper more structured and readable. This applies to the concrete examples in the whole paper.
- p.9. - lines 13-14 of Algorithm 1 seem to be easily simplifiable to a one-liner
- p.10 - Multiple Layer Perception => Multiple Layer Perceptron
- p.10 - syntax of equation (1) was quite confusing to me. I don't see huge value in presenting this equation.
- p.13 - Sections 5 and 5.1 should not be empty.

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

Review #2
By Heiko Paulheim submitted on 30/Mar/2021
Major Revision
Review Comment:

The authors introduce a framework for correcting assertions in knowledge bases. Various components can be plugged in both for generating as well as for scoring candidate replacement statements. The approach is evaluated on three different datasets.

I have two major points of critique on the paper, which I would like to see addressed in a revised version.

The first point concerns the selection of components for candidate generation and for candidate scoring. While the overall approach works, the selection of components and their assignment to the two stages could also have been done in another way. For example, the authors generate candidates by finding entities with similar names, and then score them based on connecting paths. This could also have been done the other way around: considering all entities in a two-hop neighborhood as candidates, and then scoring them by syntactic similarity. The same holds for most of the other components assigned to either one of the two stages. Here, I would like to propose either to provide more evidence (e.g., by computing the recall graphs in Fig. 3 also for the other approaches, and showing that the ones chosen for the candidate generation stage actually have the highest recall@k), or to give a clearer argumentation why the components were assigned to the stages in the way given, e.g., based on a thorough analysis of classes of errors and their frequency.

The second point concerns the evaluation. In my opinion, the evaluation metric chosen is too optimistic. The authors compute performance (correction rate, empty rate, accuracy) only on the subset of errors where there is a suitable replacement candidate for a literal. However, when applied to a real world knowledge graph, the error rate on the total set of statements would be interesting. For example, on the DBP-Lit dataset, there are 725 target assertions, 499 out of which have a GT entity. This means that for the remaining 226 ones, it is best not to replace the literal (i.e., it is desired to have a low empty rate on the literals with a GT entity, but a high empty rate on the rest).

The way the authors report accuracy overestimates the results. For example, if an approach always replaces a literal at an accuracy of 80%, it would also erroneously replace all 226 literals that have no GT entity. In that case, the error rate would be (20%*499 + 226)/725 = 44.9%, which corresponds to an accuracy of only 55.1%, not 80%. Hence, I would like to propose a fairer evaluation, which also takes into account those statements without a GT entity.
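The reviewer's arithmetic can be checked with a few lines, using the DBP-Lit numbers from the review (725 target assertions, 499 of which have a ground-truth entity) and the hypothetical assumption of 80% accuracy on the GT subset:

```python
# Sketch of the fairness argument above; the 80% figure is the review's
# hypothetical, not a number reported by the paper.

TOTAL = 725                    # target assertions in DBP-Lit
WITH_GT = 499                  # assertions with a ground-truth entity
WITHOUT_GT = TOTAL - WITH_GT   # 226 assertions best left unreplaced

acc_on_gt = 0.80               # accuracy measured on the GT subset only

# Errors: 20% of the GT subset is replaced wrongly, and every assertion
# without a GT entity is replaced when it should have been left alone.
errors = (1 - acc_on_gt) * WITH_GT + WITHOUT_GT
overall_error_rate = errors / TOTAL
overall_accuracy = 1 - overall_error_rate

print(f"overall error rate: {overall_error_rate:.1%}")   # 44.9%
print(f"overall accuracy:   {overall_accuracy:.1%}")     # 55.1%
```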

Further comments and questions:

Section 4.2.1: lexical matching is named as a technique for creating candidates, using edit distance. I am not sure how this is implemented, but searching through an entire large-scale knowledge graph for finding entities which have a small edit distance to an entity at hand sounds pretty costly. Are there any heuristics and/or special index structures involved?
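One common heuristic for this problem (not necessarily what the authors implemented) is to index entity labels by character n-grams, retrieve only labels sharing n-grams with the query, and compute the exact edit distance on that small candidate set. A minimal sketch with illustrative labels:

```python
# Filter-and-verify fuzzy lookup: n-gram index for candidate generation,
# exact Levenshtein distance for verification. Labels are illustrative.

from collections import defaultdict

def ngrams(s, n=3):
    s = f"#{s.lower()}#"   # pad so short strings still yield grams
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[-1] + 1,            # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

class NGramIndex:
    def __init__(self, labels):
        self.index = defaultdict(set)
        for label in labels:
            for g in ngrams(label):
                self.index[g].add(label)

    def lookup(self, query, max_dist=2, min_shared=1):
        # Candidate generation: labels sharing >= min_shared grams.
        counts = defaultdict(int)
        for g in ngrams(query):
            for label in self.index[g]:
                counts[label] += 1
        candidates = [l for l, c in counts.items() if c >= min_shared]
        # Verification: exact edit distance only on the candidate set.
        return sorted(l for l in candidates
                      if levenshtein(query.lower(), l.lower()) <= max_dist)

idx = NGramIndex(["Manchester", "Manchester United", "Lancaster", "Winchester"])
print(idx.lookup("Mancester", max_dist=1))   # ['Manchester']
```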

Section 4.2.2: the authors claim that "entity misuse is often not caused by semantic confusion, but by similarity of spelling and token composition". I would like to see some evidence for that statement. Semantic confusion is also not too rare; metonymy is a typical case here (e.g., using the city "Manchester" instead of the soccer team). So I would like to see some statistics here on the typical error sources. These could also help motivate the selection of candidate generation and scoring mechanisms, see above.

Section 4.2.3: to the best of my knowledge, the lookup service uses so-called surface forms, i.e., anchor text links extracted from Wikipedia, and scores based on the conditional probability that for a search string s, the surface form s links to an entity e. Later in section 5.2, the authors mention that DBpedia Lookup also uses the abstract of an entity, which I think it does not (but I am not 100% sure either). The authors should double check the inner workings of DBpedia Lookup and clarify the description of the service.

Section 4.2.3: the description for merging two entity lists seems to assume that both have k results, but result lists in DBpedia Lookup may have different lengths. How are lists of different lengths merged? Moreover, DBpedia Lookup also gives a score (called "RefCount" in the API), so I wonder why the score is discarded in favor of list position, instead of ordering the merged list by the score.
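One natural way to handle lists of unequal length, position-based as the paper seems to intend, is a round-robin merge with deduplication. A sketch with illustrative entity names:

```python
# Round-robin merge of two ranked entity lists of different lengths,
# skipping duplicates so neither list's top results dominate.
# The entity names below are illustrative, not the paper's output.

from itertools import zip_longest

def merge_ranked(list_a, list_b):
    merged, seen = [], set()
    for a, b in zip_longest(list_a, list_b):   # pads the shorter list with None
        for e in (a, b):
            if e is not None and e not in seen:
                seen.add(e)
                merged.append(e)
    return merged

by_label = ["dbr:Manchester", "dbr:Manchester_United_F.C.", "dbr:Lancaster"]
by_anchor = ["dbr:Manchester_United_F.C.", "dbr:Manchester"]
print(merge_ranked(by_label, by_anchor))
# ['dbr:Manchester', 'dbr:Manchester_United_F.C.', 'dbr:Lancaster']
```

Ordering the merged list by a shared score such as RefCount, as suggested above, would replace the round-robin interleaving with a single sort.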

Section 4.3.1: Lines 9-11 in Algorithm 1 could be more simply rephrased as E = E ∪ {o | ⟨s, p, o⟩ ∈ script(E), o is an entity}.

Section 4.3.1: Algorithm 1 seems to extract neighborhoods only from statements with the same predicate as the target assertions. As a consequence, the neighborhood graph of a target assertion would not contain statements using any other predicate. Is that really intended?

Section 4.3.2: At the point where sampling is done, there is already some relatedness/similarity notions in place. Did the authors also consider weighted sampling using those relatedness/similarity scores as weights?

Section 5.1.1: The authors describe the access to DBpedia, but I miss similar statements on how the other two datasets are accessed.

Section 5.1.1: "literals containing multiple entity mentions are removed" -> how exactly? what do you consider a multiple entity mention? for example, would "University of London" be considered a multiple entity mention, since it mentions both "London" and "University of London"?

Section 5.1.1: likewise, "properties with insufficient literal objects are complemented with more literals from DBpedia" -> how exactly is that done? For both this and the previous item, please provide a more detailed description of what is happening here, plus a discussion on how that eases/complicates the task at hand.

Section 5.2: could you also combine multiple related entity estimation mechanisms? what would be the results then?

Section 5.3.1: the comparison methods (like AttBiRNN) and their configurations should be explained more thoroughly. Likewise, how exactly is RDF2vec exploited as a baseline?

Overall, as can be seen from that list, there are a lot of open questions on this paper. I am confident that if those are addressed in a revised version, this paper will be a really interesting contribution to be published in SWJ.

Minor points:
p.1: canonicalizaiton -> canonicalization
p.15: MSU-Map -> MUS-Map

Review #3
By José María Álvarez Rodríguez submitted on 12/Apr/2021
Minor Revision
Review Comment:

The authors present an approach to recognize and reconcile entities in large knowledge graphs such as DBpedia or MusicBrainz, with the main aim of improving the quality of such large datasets. To do so, they introduce a correction framework including a process and a set of techniques based on natural language processing (embeddings) and semantics (for consistency checking) to ensure that the proposed corrections are valid. They mainly focus on the literals of RDF statements. After presenting the process and the framework, they have conducted experiments to validate the different techniques on real data from large datasets. This work is an extended version of a previous work presented at an international conference.

In the first section, Introduction, the authors describe some of the existing problems in large datasets constructed through different methods (e.g. extraction, collaboration, etc.). They identify the problem of quality (and the cost to repair) in such datasets, with special focus on the inconsistencies (e.g. labels) generated by the construction method. They also introduce the main contributions: 1) the correction framework (process) and 2) the techniques at different levels of abstraction: lexical and semantic. However, it should be specified that the proposed method is oriented to fixing problems in knowledge bases based on a particular representation mechanism (RDF graphs), as is also commented in section 3.1.

In the Related work section, the authors categorize the types of problems that are being faced and review the main works in these topics. However, some works in the field of entity recognition, matching and reconciliation could be added, especially those applying deep learning techniques and embeddings as representation mechanisms.

Works on knowledge graph embeddings could also be added to the review.

The next section, Background, describes the main theoretical concepts behind knowledge graphs represented as RDF statements. It also presents the problem of KB alignment in this context, focusing on the A-Box and making references to previous works.

In the methodology section, the authors introduce the model and process to correct problems in statements arising from names (e.g. of properties) and the ontological structure. However, there are some points to clarify/extend:

-Section 4.2.1: when explaining name-like attributes, which are these attributes?

-Section 4.2.2: here it is not clear why it is necessary to eliminate stop words. In some cases, stop words are representative for a name in some open domain. Is there any strategy to do this properly without losing meaning?
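The concern can be made concrete with a toy example; the stop-word list below is a small illustrative subset, not the one used in the paper:

```python
# Naive stop-word removal can destroy names entirely: "The Who"
# consists solely of common stop words. Illustrative stop-word list.

STOP_WORDS = {"the", "a", "an", "of", "who", "in", "on"}

def strip_stop_words(name):
    kept = [t for t in name.split() if t.lower() not in STOP_WORDS]
    return " ".join(kept)

print(strip_stop_words("University of London"))  # 'University London'
print(strip_stop_words("The Who"))               # '' (name is lost)
```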

-Section 4.3: when training a classifier, what type of classifier is used? Are the features those in 4.3.3? What is the configuration to ensure a balanced dataset? Furthermore, if possible it would be nice to see some implementation details (technology in use and hyper-parameter configuration).

-Section 4.4.3 explains the correction process based on the previous types of checking. However, to perform this type of checking it would be necessary to have the description of the properties (and assertions), as explained in the previous section, through SPARQL queries. Would it be possible to extend the approach to use any other type of schema behind? (e.g. metamodel semantics or a SHACL/ShEx schema)

-In the next section the authors introduce the experimentation with the different datasets. They extensively explain the whole process step by step in terms of data, performance metrics, analysis and robustness of results. Results show promising values that are properly commented on by the authors, who establish some limitations and future extensions. However, it would be nice to have an enumerated set of possible extensions at both levels: conceptual and technological. On the other hand, it would be nice to make this research as FAIR as possible; as a recommendation, try to make the datasets, experiment configuration and source code available under the principles of Open Science.

Other comments:

-Title is adequate regarding the paper contents. However, more detail on the type of knowledge base, the techniques in use and the type of approach would be better. For instance: A framework for…, graph-based knowledge bases, etc.

-Abstract summarizes the problem and the approach. However, some numbers about the results of the experiments (improvements) would help readers to have a complete overview of the paper (approach, methodology and results).

-Figures are correct, clear and properly cited and explained in the text.

-Tables are correct.