On the relation between keys and link keys for data interlinking

Tracking #: 2153-3366

Authors: 
Manuel Atencia
Jérôme David
Jérôme Euzenat

Responsible editor: 
Aidan Hogan

Submission type: 
Full Paper
Abstract: 
Both keys and their generalisation, link keys, may be used to perform data interlinking, i.e. finding identical resources in different RDF datasets. However, the precise relationship between keys and link keys has not been fully determined yet. A common formal framework encompassing both keys and link keys is necessary to ensure the correctness of data interlinking tools based on them, and for determining their scope and possible overlapping. In this paper, we provide a semantics for keys and link keys within description logics. We determine under which conditions they are legitimate to generate links. We provide conditions under which link keys are logically equivalent to keys. In particular, we show that data interlinking with keys and ontology alignments can be reduced to data interlinking with link keys, but not the other way around.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 29/Sep/2019
Suggestion:
Major Revision
Review Comment:

The paper explores the interactions between various key constructs and
alignment assertions in the data alignment task. The paper maps all
these constraints to a Description Logic (DL)-based formalism and,
based on this formalism, studies the interplay between the various
constructs.

However, the authors have chosen a very weak DL for the formalization:
a DL whose TBoxes (called "schemata" here) only capture concept and
role hierarchies (as defined on top of page 4). The choices of
"key"-like constructs are not motivated sufficiently: for example, why
does one require the intersections of the ranges of the properties
involved in an "in-key" to be non-empty? (and later reevaluating this
choice on page 8 top?). Many, if not most, of the properties of the
constraints studied in the paper seem to be simple consequences of
the definitions, perhaps in combination with the standard Armstrong
axioms for functional dependencies (which probably should be cited).

The paper indeed lacks any general reasoning algorithm, does not
discuss completeness of inference, nor does it provide complexity
analysis of reasoning with the proposed constructs: While the authors
state that "the relation between keys and link keys is more subtle
than one may think", it looks like pretty much all the results could
be obtained by running DL-Lite(A,id) based reasoner (e.g., Calvanese
et al.: Path-based identification constraints in description
logics. KR 2008).

In a revision the authors need to address (at least) the following
questions:

1) is the choice of key-like constructs orthogonal with the remainder
of their schema language? moreover, are the constructs themselves
independent? Is there any notion of "completeness"? (cf. Armstrong
Axioms for FDs)

2) how is reasoning performed: do we simply rely on the translation to
an underlying DL? if so, which one? can we pick any DL? what is the
effect on the complexity of reasoning?

Also, the "Related Work" (section 2) simply lists papers that also use
various notions of key constructs, but provides little insight into
the design of the various definitions (e.g., statements of the form "The
key-based approaches to data interlinking proposed in [6,7] are
different from [5] and closer to link keys [8]" may be meaningful to
the authors, but the readers are likely to be confounded).

Minor comments:

In Def 2 you use "\approx" to stand for equality between constants
("links") while elsewhere you use plain "="; is there any technical
difference?

In Definition 4 the precondition is "p_1^I(delta)=p_1^I(delta')!=0";
does that mean that *both sides* of the equality must be non-empty?
or is it only used when they are non-empty? (ought to be
disambiguated);

In Proposition 2 you use nominal(s) and value restrictions (that are
not part of your DL syntax);

In Proposition 5(9): is this true? what happens if the q_i's are empty
(depends on the answer to the above question about Def 4);

In Definition 8 you seem to be using "=" instead of "approx" (C and D
come from the "linked" ontologies, right?); and the same questions
about "..!=0" as for Def 4; and Theorem 5(13) seems to be problematic
too;

Theorem 9: as is, isn't this just the same as Thm 8? (shouldn't it
use "key_eq"?)

Review #2
Anonymous submitted on 12/Oct/2019
Suggestion:
Major Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

This technical paper proposes a comparison between the notions of keys and link keys. Keys are sets of properties that can be declared for a class c in a given ontology to express that these properties uniquely identify every instance of the class c. Instead of considering one ontology, link keys consider two ontologies: they are sets of pairs of properties from both ontologies, and they express that every pair of instances of two classes of the two ontologies that coincide on these properties should be considered the same.

The proposal of this paper, while inspired by the work in [21], is to compare keys and link keys theoretically. To do so, six different kinds of link keys are defined by distinguishing between two different semantics: eq-link keys, which require equality of the value sets of multi-valued properties (like F-keys in [21]), and in-link keys, which require a non-empty intersection between the value sets of multi-valued properties (like S-keys in [21]). They also introduce the notions of weak, plain and strong link keys, based on whether the link keys involve properties that are truly keys or not. The authors have also used Description Logic interpretations to formally ground the usage of keys and link keys for data interlinking.
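As a reader's aid, the eq/in distinction over multi-valued properties can be sketched in a few lines of Python. This is a toy illustration only, not code from the paper: the instances, property names and values are invented, and value sets stand in for RDF property extensions.

```python
# Toy sketch (not from the paper): eq- vs in-link key semantics.
# Instances are dicts mapping property names to sets of values.

def eq_link_key_links(inst1, inst2, prop_pairs):
    """Link two instances iff, for every property pair (p, q),
    the two value sets are equal and non-empty (eq-semantics)."""
    return all(inst1.get(p, set()) == inst2.get(q, set()) != set()
               for p, q in prop_pairs)

def in_link_key_links(inst1, inst2, prop_pairs):
    """Link two instances iff, for every property pair (p, q),
    the two value sets share at least one value (in-semantics)."""
    return all(inst1.get(p, set()) & inst2.get(q, set())
               for p, q in prop_pairs)

# Hypothetical descriptions of the same town in two datasets.
a = {"name": {"Grenoble"}, "postcode": {"38000", "38100"}}
b = {"nom":  {"Grenoble"}, "cp":       {"38000"}}

pairs = [("name", "nom"), ("postcode", "cp")]
print(eq_link_key_links(a, b, pairs))  # False: {"38000","38100"} != {"38000"}
print(in_link_key_links(a, b, pairs))  # True: the value sets intersect
```

The example shows why the choice of semantics matters for recall: the in-semantics links the two descriptions despite the extra postcode, whereas the eq-semantics does not.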

As fundamental results, the authors theoretically prove that the usage of keys and ontology alignments is equivalent to the use of strong link keys (i.e., they yield the same set of links), but the converse is not true.

The paper is well written, even though it sometimes needs some high-level explanations. The proposal seems to be technically sound: all the propositions and theorems are formally proved. However, my main concern is about empirical evaluation: there is no experimental evaluation showing the significance and evidence of the proposed theorems on real datasets. The example on Insee and IGN is not sufficient to convince on the exhibited relations between keys and link keys. Moreover, sometimes more intuitive explanations are necessary to follow the technical parts of the paper. For example, it is not clear how the notions of plain and weak link keys can be useful and usable in a data interlinking context. One may also wonder whether there are any data interlinking tools that already implement the proposed new definitions of link keys.

Minor remarks:
- to follow Example 1, a graph schema representation of the considered ontologies may be more suitable.
- the paragraph after Definition 8 is unclear and misleading. Clarifying it may help readers understand the relevance of the notion of weak link keys.
- in the example of Fig. 1, there is no reference to 'Y'; "depending on the value of y, it will ..." is to be checked.

Review #3
Anonymous submitted on 16/Oct/2019
Suggestion:
Major Revision
Review Comment:

In this paper, the authors redefine, using description logics, three kinds of existing semantics of keys that have been proposed in the Semantic Web setting. These keys vary depending on how multivalued properties are taken into account, and on how semantic mappings between classes and properties are handled (i.e., the mappings are either considered in the linkage rule or handled in some rewriting step). The authors formally compare the properties of each semantics, introduce the definition of generalized hybrid keys, and propose new types of keys that vary depending on their validity on subsets of instance pairs (i.e., strong keys, weak keys or plain keys). One of the main contributions is that the authors show that link keys are more general than keys. More precisely, data interlinking with link keys cannot be reduced to data interlinking with keys and ontology alignments.

Strong points :
- The main existing definitions of keys have been published in international conferences, but in this paper the authors clarify these notions, unify their formalization using description logics, give and prove their different properties (w.r.t. equivalence, class and property subsumption, class intersection or union), and theoretically compare them.
- Significant extensions are proposed to adapt their usage considering different scenarios and validity constraints.
- The paper is well structured and clearly written.

Weak points :
- The main definitions are illustrated by examples taken from a real dataset (INSEE).
However, there is no experimental evaluation showing that the number of correct links that can be discovered in real datasets using link keys is higher than in a scenario where alignments and keys are exploited. It would be very convincing to show the impact of the described theoretical differences using at least two heterogeneous datasets.

- More precisely, the relevance of defining weak, strong and plain keys is not so convincing. Strong and detailed arguments are needed. Here again, some experimental results (even simple ones) evaluating the impact of using these three types of keys would be really welcome, to support some of the hypotheses and to justify the claim (at the end) that link keys may lead to better results.

Minor :
- It would be really informative to describe the key semantics that can be used in the existing linkage platforms mentioned in the related work section, such as SILK and LIMES.

- You say on p. 3 that "F-keys are keys in relational databases". But existing approaches have exploited such keys in data graphs. Furthermore, you have just explained in the same section that the problem in relational databases is different since properties are functional. I think that this sentence has to be reformulated.

- In Section 4, p. 6, it is said that "eq-keys require some sort of local closed world assumption". Then, on p. 11, it is said that "local completeness is necessary for interlinking with eq-keys". Since emptiness is not considered for eq-keys, you have to define what kind of local or partial completeness is needed, and whether it is needed at the discovery step (validity).

- In Section 4.2, p. 9, give a simple counterexample showing that (9) is not fulfilled for eq-keys. I agree, but it is not so obvious.

- Generalized hybrid keys are introduced but not used in the following definitions and properties. However, many properties are valid for both key semantics, and thus for generalized keys.

- p. 12: the sentence used to introduce plain link keys, i.e. "a set of property pairs is a plain key for a pair of classes if it is ... will be linked", is not clear.

- the fact that a link key can involve classes that are not semantically mapped (no subsumption or equivalence relations, only intersection) could be stated sooner.

- You have to introduce weak, strong and plain keys by first illustrating the reasons why they will be efficient and meaningful in practice. I agree that sometimes strong keys are not numerous enough, or not usable, to interlink two datasets. However, if a name is not a key for persons in one dataset, why would it be correct to define and exploit it when the key is only valid on the subset of persons shared by the two datasets? Is it because of their coverage? How will you know that a weak or plain key is valid for new datasets (i.e., that the key condition will hold for the instances that will be linked)? Give intuitions, other examples or scenarios, and maybe explain how to take dataset characteristics into account to decide.