Hybrid Reasoning on OWL RL

Tracking #: 443-1610

Authors: 
Jacopo Urbani
Robert Piro
Frank van Harmelen
Henri Bal

Responsible editor: 
Pascal Hitzler

Submission type: 
Full Paper
Abstract: 
Both materialization and backward-chaining as different modes of performing inference have complementary advantages and disadvantages. Materialization enables very efficient responses at query time, but at the cost of an expensive up front closure computation, which needs to be redone every time the knowledge base changes. Backward-chaining does not need such an expensive and change-sensitive precomputation, and is therefore suitable for more frequently changing knowledge bases, but has to perform more computation at query time. Materialization has been studied extensively in the recent semantic web literature, and is now available in industrial-strength systems. In this work, we focus instead on backward-chaining, and we present a general hybrid algorithm to perform efficient backward-chaining reasoning on very large RDF datasets. To this end, we analyze the correctness of our algorithm by proving its completeness using the theory developed in deductive databases and we introduce a number of techniques that exploit the characteristics of our method to execute efficiently (most of) the OWL RL rules. These techniques reduce the computation and hence improve the response time by reducing the size of the generated proof tree and the number of duplicates produced in the derivation. We have implemented these techniques in an experimental prototype called QueryPIE and present an evaluation on both realistic and artificial datasets of a size that is between five and ten billion of triples. The evaluation was performed using one machine with commodity hardware and it shows that (i) with our approach the initial precomputation takes few minutes against the hours (or even days) necessary for a full materialization and that (ii) the remaining overhead introduced by reasoning still allows single pattern queries to be processed with an interactive response time. To the best of our knowledge our method is the first that demonstrates complex rule-based reasoning at query time over an input of several billion triples and it takes a step forward towards truly large-scale reasoning by showing that complex and large-scale OWL inference can be performed without an expensive distributed hardware architecture.
Full PDF Version: 
Revised Version:
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Click to Expand/Collapse
Review #1
By Matthias Knorr submitted on 03/Apr/2013
Suggestion:
Minor Revision
Review Comment:

Overall, the paper has been considerably improved and is easier to follow (in the technical parts, e.g., I like Example 5). There are still a few issues but I believe that none of them will constitute a major problem.

1. One issue that caused some misunderstandings in previous reviews and that still persists in the paper but also, e.g., in the some replies of the cover letter it seems is the matter of backward-chaining with and without pre-materialization. It seems to be agreed on the fact that backward-chaining can be done without pre-materialization (on the original rule set of course), while pre-materialization is introduced as a significant improvement for efficiency (which is why it is integral part of the implementation). Why not split Sects. 3 and 4 accordingly and adjust all related descriptions to avoid some confusion? I would suggest to restrict Sect. 3 to backward-chaining alone and join all material on pre-computation (including 3.4) into Sect. 4. Then the following text parts can be simplified/improved:
- The last paragraph in Sect. 2;
- The introductory paragraph on Sect. 3;
- The paragraph before the last in 3.3: you could say the same about termination and soundness, I would suggest to simply drop it to avoid any confusion, since you are not saying anything new/surprising anyway (backward-chaining is correct; if you introduce pre-materialization, then you have to ensure that it is correct as well, but that is completely irrelevant to the discussion/correctness of backward-chaining alone - sure, you are not saying it is relevant, but the way the phrase is put starting with "However", it confuses things without introducing anything beyond the obvious);
- The beginning of Sect. 4: the purpose of pre-materialization is certainly not what is claimed there;
- The beginning of Sect. 5: "pre-materialization procedure which ensures the completeness of our approach" - again, this is not true: the procedure improves efficiency but you could have programmed QueryPie entirely without it, and it would still be complete (but much slower, I guess).

2. While the proof structure is now fine in my opinion, there are still some details that have to be corrected:
- Lemma 1, item 1 apparently also shows p \circ Var(r) \subseteq V' but that is not mentioned in item 1;
- same Lemma, item 2, is proven by induction, however, if the induction hypothesis is that infer(Q',infer) where Q' \in \mathcal{Q}, then nothing needs to be proven (note that the second argument of infer is wrong);
- Lemma 3 is proven by induction without specifying on what;
- The hypothesis for Lemma 3, is written in a way such that it actually suffices already, i.e., it is that strong, that nothing else needs to be proven, which is certainly not the intention in a proof by induction, the problem seems to be related to spelling out how the induction proof actually proceeds;
- "As remarked below the definition of joint…" - I cannot find that, is that really meant to be there?
- The restriction on Datalog ("requires all variables of the head to be covered by some atom in the body") seems crucial to me and should be mentioned right away in 3.1 and not only in the proof of a Lemma!

3. Overall, I think the evaluation section is fine apart from one issue that I do not understand which is related to 6.2.2. QueryPie is written in Java, you have the source code, so I naively wonder, how it cannot be possible to pickup the source code, comment all calls to pre-materialization (and other optimizations based on that) away and use backward-chaining directly thus creating a correspondent to Table 5? You loose all the pre-materialization time, but query responses are most likely worse than there, but this way the comparison is much more meaningful than Table 6. Then, you could also show whether the idea on how to pre-materialize is better than the naive solution as described in the beginning of 4.1. Again, if this has been done in the past, then it is sufficient to cite, otherwise some experimental data would be really nice and most likely be really convincing.

Some further minor suggestions/mistakes/language-related issues (with the usual indicator page, column):

- 2, l: "Our method relies… to calculate the inference." - Which inference?

- 2, l: it still says that "OWL RL fragment is the most recently standardized OWL profile designed to work on a large scale". Now, does this mean the other profiles were developed earlier or do not work on (such a) large scale, or is the phrase simply misleading and should better be rephrased?

- 2, l: "We thus tackle several problems. The first is to show that our reasoning algorithm…" - I guess, you mean the adapted backward chaining algorithm, which should be spelled out explicitly. Then, no second or further problem is mentioned, so one wonders what happened to the several problems. Instead it is said that the overall method is applied (the one from a paragraph before) without mentioning any related problems. Then, however, "we have implemented these techniques…" and the reader may guess that you mean the two steps of pre-computing and then backward-chaining which was mentioned two paragraphs before (and not as techniques). Again, that is not a problem of content but of writing down a sound outline of what you do/are going to present.

- 2, r: Please add a pointer to Sect. 5 in footnote 1 regarding the OWL RL rules you do not cover.

- 3, l: Before, you establish that your approach combines queries and (pre-)materialization. Then, "…we use backward-chaining in both instances…" and further "…a backward-chaining algorithm which exploits…that some triple patterns are pre-materialized…" I know what you want to say, but currently, it might leave the impression of a loop - before backward-chaining you do backward-chaining - please rephrase.

- 4, r: You might consider adding a small example to joints of substitutions, similar to Example 2 and 3.

- 4, r: Better use R instead of R0 and R1 when talking about unifiers. If R0 and R1 are indeed different, then you cannot unify them anyway.

- 4, r: In the Datalog query definition, Ri(ti) is an atom for each i in {1,…,n}, not {0,…,n}.

- 4, r: Is it intended that an answer is first defined as a tuple, and 4 lines below as atoms?

- 5: Var(r) is defined twice, once on the left, and once on the right.

- 5, l: "T_P(I) is I extended by all facts that could be inferred…" - This is related to my previous request of unifying the tense, where I failed to explain properly what I meant. Sure, all facts could be inferred, but they still can, i.e. Simple Past should be used for describing things that happened in the past and are finished, I believe for such technical descriptions Present is more suited. Similar, p.7, l: "We introduced", "chose to"; p.10, l: "since \delta_h was the unifier"; p.11: "R'(b) was an answer"

- 5, l: It is o.k. to define graphs and then trees, but since all you need is trees, would it not be enough to only introduce trees?

- 5, r: "otherwise it the maximal height of all trees, where … plus 1." Please switch "plus 1" before the comma.

- 5, r: What is meant by arbitrary trees, for which the height is not defined - infinite ones?

- 5, r: Please avoid pointing to Algorithm 1 and line 13 when defining lookup: that is the occurrence of usage, but the function should be general, so it is fine to define it in general without a concrete pointer to some usage 3 pages later.

- 6, l: You introduced the convention in the beginning of 3.1 for variables, here "?a" reoccurs. I guess you want to unify notation here as well? Also page 18 contains numerous occurrences of variables with "?".

- 6, r: The two rule applications cax-sco and pro-spo1 are both wrong in Fig. 1: the children in the third line should, e.g., be T(s,rdfs:subClassOf,Person), T(x,ref:type,s), T(y,rdfs:subPropertyOf,rdf:type), and T(x,y,Person) (which needs to be adjusted in their children as well); by the way, since the caption says "Fig. 1" you might consider referring to it by Fig. 1 instead of Figure 1.

- 7, r: I cannot see any reason, why Lemma 1 is presented before explaining Alg. 1, please move it.

- 7, r: You certainly mean A to be all atoms occurring in P union {Q} not unified with {Q}.

- 8, l: \lookup (twice) and \Mat is probably not intended this way in Alg. 1.

- 8, l: Lemma 1 sounds rather general. What I mean is, it sounds like something that was already there for previous instances of QSQ. If so, then this should be mentioned.

- 9, r: "discuss the correctness of our algorithm, which are termination, soundness and completeness." - please rephrase; also "we will furnish further lemmas and definitions" sounds a bit arbitrary: I think it would be better to say what you introduce (and briefly with which purpose).

- 9, r: Minimal model in Lemma 3 has never been used before as a term.

- 10, l: Termination is fine, but I guess Lemma 1 and the fact that \mathcal{Q} is finite would suffice.

- 10, r: The phrase on "New in line 5" should probably appear further down in the paragraph and not as first statement regarding Soundness.

- 11, l: In Prop. 1, you mean Q defined above Lemma 1, not 2.

- 12, r: "rule set… different from standard OWL" - do you mean OWL RL or are you saying there is a rule set for full OWL 2? I cannot make the logical step to "Our approach thus relies on a pre-materalization phase…" - the two issues seem completely unrelated.

- 13, l: End of proof Prop.2, it should be "R(a) \not\in \bigcup …", right, because that is the contradiction to the initial assumption?

- 13, l: In Algorithm 2, line 7, the condition i>0 is missing, otherwise the head is also overwritten, which apparently is not intended.

- 15, r: It is probably better in the induction step to say "either R(a)\in T^{k}_P(I)" than T^0_P(I).

- 16, r: In the caption of Table 2, better use pre-materialized instead of precalculated.

- 17, r: "Using Proposition twice more" - please correct.

- 17, r and 18, l: "First type of duplicates source" - how about First source of duplicates? Similar the second one.

- 18, l: the INV rules are wrong: if S_{INV}(p, INV, q) (or vice versa) and T(x,q,y), then
T(y,p,x), and not T(x,p,y).

- 22, Table 7: I do not understand what "queries changing the rule set" means in the caption.

- 23, l: "relatively modest computer architecture" is maybe not the best description for the DAS-4 cluster?

- 23, l: "all the features of the OWL 2 language". I guess, you mean OWL 2 RL?

- 25: Inconsistent capitalization of names in the references (OWL, RDFS,…) - please check them all and unify (to the capitalized versions, I suppose).

Some typos:

- 3, r: algoritms;
- 5, r: "leads form" - from;
- 6, r: contruct;
- 7, l: redudancy; dept-first; "drawback that THEY are very difficult to BE implemented; QSQ require - requires; "termination soundness" - comma missing;
- 8, r: procedure; "that were inferred in the process, AND saved in the global…";
- 10, r: assingment (twice);
- 11, l: Queries - queries
- 11, r: "If there is a Datalog proof tree…, then" - comma missing before then
- 12, r: query (a TYPE, w) - should be "u" instead of "w" according to Ex. 5; "to the variable w so that in A subsequent call…" - article "a" missing;
- 13, l: harmeless;
- 13, r: "where I0 IS the database I…" - "is" missing
- 14, r: "satisfies these condition…" - conditions
- 15, r: "…it is the rule where Ri(ti) has been substituted…" - better "a rule"

Review #2
By Aidan Hogan submitted on 10/Apr/2013
Suggestion:
Accept
Review Comment:

Thanks to the authors for their revision. I missed the previous round of reviews, so I will defer to the other reviewers to see that the last round of revisions satisfy their concerns. I'll focus mainly on the core concerns I raised in the first review.

In summary, I'm satisfied that the authors have addressed my original concerns and minor comments. Here were my major points-to-address in the original review:

"(1) Make explicit what parts of the OWL 2 RL/RDF standard you tackle or don't tackle."

This has been clarified, though I think when you first state that you do not tackle all of the OWL 2 RL/RDF rules (in the introduction), you should quickly characterise the rules you do/do not support.

"(2) Relatedly, avoid overselling. Keep the conclusions grounded in the results you present."

The addition of Section 6.3 addresses this issue (though I would prefer to see this discussion integrated directly with the sub-sections it refers to rather than serving as a post-script).

"(3) Add more examples for formal parts of the papers, esp. the algorithms. Add more intuitive discussion as to what the proofs show. Introduce all notation and concepts used, even if only briefly."

The paper does a much better job of defining all of its terms than the first version I read did. The examples are also quite useful. That said, I still struggled to digest the formal parts of the paper. In particular, I still found the details of sections 4.1 and 4.2 very difficult to follow. I've read carefully through the sections three times now, and I still have only a rough idea of why the propositions are required and how the proofs are constructed. Perhaps the authors should consider moving the proofs to an appendix and make clearer the logical progression between the propositions themselves.

"(4) The writing in parts could be improved."

The writing has improved, though I list minor comments below.

MINOR COMMENTS:

* algoritms (No excuse for not running an automatic spell-check before submitting!)

* "they can be trivially rendered into a Datalog program" This is not true for many rules (e.g., rules pertaining to lists and datatypes) and this claim contradicts later discussion on list rules.

* In the definition of substitutions, you state that the domain of the function is a set of variables. But you often apply it over terms that may be constants. Obviously the intention is for the substitution to serve as the identity function over constants, but this should be clarified in the notation. Also, dom\theta should be explained when first used.

* "is an atom for each i \in {0,\ldots,n}" -> "... i \in {1,\ldots,n}"

* "A Datalog rule has the form ..." A Datalog rule has a more general form than that (e.g., negation). It sounds like you're defining the general notion of a Datalog rule here when you're only defining a specific form useful for the purposes of this paper.

* "T_P(\mathfrak{J}) is \mathfrak{J} extended by all facts that could be inferred from facts in \mathfrak{J} under P." Should clarify here that inferences must be direct and not recursive.

* "G := (V,E) is called [a directed] graph."

* "to be the Graph G'" -> "to be the graph G'"

* Your construction of a Datalog proof-tree looks a little strange to me. I don't know if this is a standard construction or not, but when talking about a labelling function for directed labelled graphs, the function is usually applied for edges, not vertices. You could easily define the proof-tree more directly by just defining the vertices to be atoms in this case. Dropping the unnecessary indirection of this labelling function would, for example, already would save the reader considerable headaches in later proof sections. Perhaps there is a reason why the labelling function is required, but if not, please remove it from the notation of the paper.

* "In the pseudocode of Algorithm 1 ... and Q' \sqsubseteq Q." Why is this paragraph presented here? I have no way to put this text in context until I see the algorithm itself introduced/discussed. It's poor flow.

* "all its variants is" -> "all its variants are"

* Figure 1: Overloading the 'x' variable makes the Figure difficult to understand.

* "dept-first" -> "depth-first"

* "In this way, it is able [to] cache..."

* "QSQ require" -> "QSQ requires"

" "which queries where" -> "which queries were"

* "We report the algorithm ... " + Lemma 1. These are again out of context. The algorithm hasn't been introduced yet. How could I be reasonably expected to put this discussion/Lemma in context?

* Algorithm 1: Mat stores results of the previous materialization round(s). Also, why are '\' seen in the code?

* Example 5: Explain that x,y,z are variables and a,c,d are constants. Fonts are too subtle to make this important distinction.

* "the correctness of our algorithm, which are" What's 'which' referring to?

* "$Tmp$'" -> $Tmp'$ (multiple)

* "Assingmnet" x2

* harmeless

* Algorithm 4: the formatting of variables is not consistent (Derivation and Dataset). "NewRuleset" -> "NewRuleSet"

* Proposition 3: "it is the rule where" -> "it is a rule where"

* Section 5: After discussing rules that are not supported, mention early on how many of the 78 rules WebPIE supports. Instead of saying "25 rules out of 78", replace 78 with the given figure.

* "Singling out exactly those triple patterns from Table 2 is motivated by the empirical observation that:" What empirical observation is that? Please be specific. Otherwise I would rather call the latter two points "assumptions" not observations.

* "rdf:next" -> rdf:rest

* The duplicates discussion is, as you acknowledge much later in Section 6.3, preliminary. It reads as very hand-wavy and is very much an incomplete discussion. I pointed this out in my first review and I still feel the same way. Why pick those duplicative features you do? At the least, please mention right at the start of Section 5.1 that the discussion is practical/preliminary.

* "the pre-materialized ... is not symmetric." It's worth clarifying that it should be symmetric, but the relevant rule is not included in OWL 2 RL/RDF.

* "the execution time drops by about one order of magnitude on average" It doesn't look like it does. Instead say something more specific, like "drops by a factor between X--Y".

* Tables: right align numeric values. Captions above?

* "is retrieved ."

* "The measurements that we report have shown that large scale OWL RL reasoning" You only do a subset of OWL RL.

* "none of these approaches supports the OWL RL rules" ... but again, nor does WebPIE.