Separability and its Approximations in Ontology-based Data Management

Tracking #: 3054-4268

Authors: 
Gianluca Cima
Federico Croce
Maurizio Lenzerini1

Responsible editor: 
Guest Editors Ontologies in XAI

Submission type: 
Full Paper
Abstract: 
Given two datasets, i.e., two sets of tuples of constants, representing positive and negative examples, logical separability is the reasoning task of finding a formula in a certain target query language that separates them. As already pointed out in previous works, this task turns out to be relevant in several application scenarios such as concept learning and generating referring expressions. Besides, if we think of the input datasets of positive and negative examples as composed of tuples of constants classified, respectively, positively and negatively by a black-box model, then the separating formula can be used to provide global post-hoc explanations of such a model. In this paper, we study the separability task in the context of Ontology-based Data Management (OBDM), in which a domain ontology provides a high-level, logic-based specification of a domain of interest, semantically linked through suitable mapping assertions to the data source layer of an information system. Since a formula that properly separates (proper separation) two input datasets does not always exist, our first contribution is to propose (best) approximations of the proper separation, called (minimally) complete and (maximally) sound separations. We do this by presenting a general framework for separability in OBDM. Then, in a scenario that uses by far the most popular languages for the OBDM paradigm, our second contribution is a comprehensive study of three natural computational problems associated with the framework, namely Verification (check whether a given formula is a proper, complete, or sound separation of two given datasets), Existence (check whether a proper, or best approximated separation of two given datasets exists at all), and Computation (compute any proper, or any best approximated separation of two given datasets).
Full PDF Version: 
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Click to Expand/Collapse
Review #1
Anonymous submitted on 04/May/2022
Suggestion:
Minor Revision
Review Comment:

The authors investigate the separability problem considering an OBDM scenario based on the DL-LiteR ontology language. They provide complexity results for the verification, computation, and existence problems associated with query separability. They also study a notion of characterization.
Overall, the quality of writing is good. The results are original and relevant (query separability was either investigated in learning settings or the ontology language differed from the one in this paper), however, the authors missed important related works. Please check the references pointed out in this review.
------
Related works
The reference to Angluin’s paper (number 16) needs to be fixed. It should be the paper “Queries and Concept Learning” because this is the paper where she introduces the exact learning model.
Please cite the papers on learning ontologies within the exact learning model, in particular, those related to query inseparability:
- A Model for Learning Description Logic Ontologies Based on Exact Learning (AAAI 2016)
- Learning Query Inseparable ELH Ontologies (AAAI 2020)
Also, there is this on GAV active learning:
- Active Learning of GAV Schema Mappings (PODS 2018)
For query inseparability in a non-learning setting:
- Games for query inseparability of description logic knowledge bases. (AI 2016)
- Query inseparability for ALC ontologies. (AI 2019)
-----
Preliminaries
It is unusual to define the domain of an interpretation as a set of constants (in many DL papers the domain is an abstract set and then there is a mapping from individual names to the domain, without UNA, in many database papers we have the constants and the unnamed elements or unknowns). Please add a comment about your approach and those in the literature.
It would also be good to add a note about OBDA (Ontology-based Data Access) and OBDM.

Typos and minor
Naming of sections (make it a bit longer): instead of “Existence” -> “The Existence Problem”
Line 28-29 in input -> in the input?

Review #2
Anonymous submitted on 17/Jun/2022
Suggestion:
Minor Revision
Review Comment:

The paper introduced a query-by-example problem in the ontology-based data management (OBDM) setting. Possibly, this is the first work in the area, where data mappings are given full attention. The authors study two classical query-by-example problems, separation (of positive from negative examples) and characterization, by unions of conjunctive queries. It is observed that proper separation/characterization is not always possible. Then the authors consider its approximations, such as sound, complete, minimally-complete and maximally sound separation/characterization. Using DL-Lite_R as an ontology language, the authors consider three types of mappings: LAV, GAV, and GLAV. The authors obtain a variety of interesting complexity results for 1) separation/characterisation existence problem; 2) verification problem of checking if a given UCQ separates/characterises given data instances under ontology and mappings, or if a given UCQ is an approximation. Moreover, the authors propose an algorithm for computing minimally-complete and maximally sound separators.

(1) originality

This work is similar in nature to several recent papers studying the query-by-example problem (and its versions) in the ontology-mediated querying setting. However, I think this is the first work where the mappings play a central role and their form is shown to affect the complexity of the problems.

(2) significance of the results

The work presents a set of new results. In particular, the coNExpTime-hardness proof is very involved and seems very novel to me.

(3) quality of writing

The paper is well written.

Suggestions:

1. If possible I would recommend inserting a simple example illustrating the setting (OBDM system, including mappings, and certain answers) and separation/characterisation into the Introduction.

2. For readability purposes, perhaps you could simplify the naming convention for your computational problems. If I understand correctly, you only show significant results for L_O = DL-Lite_R and Q = UCQ, so you could drop them from the problem names.

Review #3
Anonymous submitted on 11/Aug/2022
Suggestion:
Minor Revision
Review Comment:

This paper presents a complexity analysis and algorithms for the logical separability problem in the context of OBDM: Given two sets of tuples interpreted as of positive and negative examples, one looks for a UCQ query q such that each positive example is contained in the set of certain answers for q under the given ontology and mappings, and none of the negative examples intersects with the set of certain answers.
The separation problem has previously been considered in the pure database context (when the goal is to find a database query from a particular class of queries that separates the positive and negative examples) and for OBDA. The key difference is that in OBDA, the vocabulary of an ontology directly contains the database schema, while in OBDM the mapping is given by GLAV schema mappings between ontologies and databases. These mappings make things significantly more complicated.

The key contributions of this paper are the tight complexity bounds for the separator existence and verification problems, and the introduction of two approximations of the notion of separability. A complete approximation requires that all positive examples are contained in the set of certain answers, but this set may also contain some negative examples; a sound approximation requires that the query returns none of the negative examples, but some positive examples may also be omitted.
Then the authors define the very natural notions of minimally complete and maximally correct approximations.

It turns out that the complexity of verifying if a given query is a separator is NP-complete for complete separation, coNP-complete for sound separation, and DP-complete for proper separation. The complexity of checking if a proper separator exists varies from coNP- to coNextTime-complete depending on the underlying mapping language. Finally, both minimally sound and maximally correct approximations always exist, and exponential-time algorithms are given.

The separation problem is of interest for the general KR audience, and is in scope of the Semantic Web Journal. The paper if reasonably well written, and the main concepts are illustrated with examples.

I recommend accepting the paper with minor revisions.

Some comments and suggestions.
P3, Lines 43-51. A paragraph with results being listed is difficult to process. I would suggest summarising the main results of this paper in a table or a better structured list.
P6, L 10. “an S-databases” -> “an S-database” or “S-databases”
P6, L 13. “n is the arity q_s” -> “n is the arity of q_s”
P6, L 17 and further in the text, should i=[1, p] be i\in [1,p]? If equality is intended, please introduce the notation.
P6, L26, I think the inclusion “q_S^D\subseteq \lambda^+” should be the other way, “\lambda^+\subseteq…”
P6, L31. h(C_1)\subseteq h(C_2) does not make sense. Should it be h(C_1)\subseteq C_2?
P7, Lines 22-23. Please explain what PerfectRef does.
Section 4 presents a case of an empty ontology and non-trivial mappings. Could you please remind the reader what happens if the mappings are trivial, that is, there’s a one-to-one correspondence between the ontology alphabet and the DB schema.
Def 3. Is a proper separation always minimally complete and maximally sounds? Please comment.
P10, L 6. “conference paper [46]”. Reference [46] is not a conference paper.
Def. 4 introduces a name clash between the case of positive/negative examples and positive only, which might be confusing.
The entire Section 4.1 is about the related work. There are no results and the definition is not used anywhere else. At least mention in the related work that a comparison with Abstraction is given in Section 4.1 as definitions are needed, and better move to the related work section.
Section 5. \sigma(x) for the size of x is unconventional. Why not use #x or |x|?
P14, lines 9 and 11-14. Again there should be “\in” in place of “=”
P16 presents exponential algorithms for computing min complete/max sound approximations. What is known about the lower bounds on their sizes? Do you know of any particular OBDMs where separations are necessarily exponential in size? Are there any tractable cases where separations are guaranteed to be small (apparently yes, as they are considered further when analysing GAV/LAV cases)
P18 Corollary 3 states that either q_O is a proper separation or no proper separation exists. This statement immediately give a naïve algorithm for existence: compute q_O and check (Thm 2 and Corollary 1) if it is a separator. This is probably already in NEXTIME. Is your algorithm better?
The proof of Thm 8 would benefit from either moving some parts in an appendix or splitting the proof into several lemmas. Claims within claims are hard to read.

Review #4
Anonymous submitted on 04/Sep/2022
Suggestion:
Major Revision
Review Comment:

The paper studies the separability problem in the framework of ontology-based data management. The problem is formalized as follows. The input consists of a database D, mappings M, a source signature S, and an ontology O, and sets of tuples lambda+ and lambda- and the task is to decide whether there is a query q such that all tuples in lambda+ are certain answers to q over (D,M,S,O) (defined in the standard way), but none of the tuples in lambda- are certain answers. This problem has been studied a lot in the past few years because it has numerous applications in query reverse engineering, generating referring expressions, and explainable AI. Compared to previous work, the main difference is that the mappings come into play. Of course, the problem is also parametrized by the ontology language, the mapping language, and the language in which q is formulated in. The present paper studies the ontology language DL-Lite(R), the mapping languages GAV, LAV, GLAV, and GAV n LAV, and the query language UCQ (union of conjunctive queries). The main results are tight complexity bounds for the separability problem (as above), the verification problem (given q, is it is separating) and the computation (compute q if it exists). The second main contribution is the investigation of two approximate notions of separability: sound and complete separability, and the complexity of the induced problems. I find it interesting that the mappings have such a bit influence on the complexity in some cases.
The results seem to be correct; related work is covered very nicely (except: there is a conference version in IJCAI 22 of [32]).

From a technical point of view, one must remark that admitting UCQs as separating queries simplifies a lot the setting (in fact, studying all this for CQs seems much harder). The reason is that a separating UCQ exists (essentially) iff the union of all queries describing single positive examples achieves separation. I say “essentially” because one has to consider the perfect rewriting over the source schema instead, a well-known rewriting technique for query answering over DL-Lite knowledge bases. This observation is used implicitly throughout the paper. In fact, I would have liked to see this very early on, because most proofs rely on it. Most importantly, Theorem 1 relies on it but does not mention it in its proof and the observation is made only in the subsequent section 6. In the current form, the proof of Theorem 1 is not understandable. Something similar applies to Theorem 2; the proof also relies on some characterization but does not spell it out – which property exactly is decided? In fact, the proof of Theorem 1 also relies on more specific properties of the perfect rewriting procedure which are also not mentioned: the length of rewriting steps is not bounded in the proof, so polynomiality is not ensured. I think they should be spelled out earlier. Finally, I believe that the discussion about inconsistency in the beginning of Section 6 should be even earlier; somehow it does not make sense to work with inconsistent situations, and it would simplify future proofs.

A minor comment is that the lower bounds for characterization (instead of separability) are often a bit more difficult because one cannot choose lambda-. This seems currently not reflected in the proof of Theorem 3.

Overall, I vote for accepting the paper but urge the authors make the central observation regarding UCQ separability more prominent as such model-theoretic characterizations often help the experienced reader and make the situation much more transparent. Also the other smaller comments mentioned above should be addressed.