RDFRules: Making RDF Rule Mining Easier and Even More Efficient

Tracking #: 2398-3612

Authors: 
Vaclav Zeman
Tomas Kliegr
Vojtěch Svátek

Responsible editor: 
Agnieszka Lawrynowicz

Submission type: 
Full Paper
Abstract: 
AMIE+ (Galárraga et al., 2015) is a state-of-the-art algorithm for learning rules from RDF knowledge graphs (KGs). Based on association rule learning, AMIE+ constituted a breakthrough in terms of speed on large data compared to the previous generation of ILP-based systems. In this paper we present several algorithmic extensions to AMIE+ which make it faster and more practical to use. The main contributions are related to performance improvement: the top-k approach, which addresses the problem of combinatorial explosion often resulting from a hand-set minimum support threshold; a grammar that allows defining fine-grained patterns reducing the size of the search space; and faster projection binding, which reduces the number of repetitive calculations. Other enhancements include the possibility to mine across multiple graphs, lift as a new rule interest measure adapted to RDF KGs, support for discretization of continuous values, and the selection of the most representative rules using proven rule pruning and clustering algorithms. Benchmarks show considerable improvements over AMIE+ on some problems. An open-source reference implementation is available under the name RDFRules.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Emir Muñoz submitted on 25/Jan/2020
Suggestion:
Major Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

# Summary

This paper proposes several optimisations to improve the performance of the AMIE+ algorithm for rule mining in knowledge graphs. The authors implemented these optimisations and open-sourced the code. The proposed optimisations are as follows:
- Reduction of the search space for rules
- Pre-processing to support numerical attributes (object literals)
- Post-processing with rule clustering, pruning, filtering, and sorting

Experimental results show that the optimisations yield a speedup compared with AMIE+. Given the increasing size of knowledge graphs, the paper is relevant to the journal and the communities interested in knowledge graphs.

Although the work is well written and presents a good narrative with examples, I have major concerns because the paper does not fully deliver what is stated by the authors over the first two sections. This requires some significant clarification in the text and likely extra experiments.

I have put the paper with all my comments in the following link for the authors: https://drive.google.com/file/d/1O1z2XWC2S8Ba_KBL7UQLvtMmclci2ExU/view?u...

# Comments

I have three major comments about this paper. I leave out comments on the open-source tool.

1. There are several weak claims that require clarification in the text. For example:
- Abstract: “In this paper we present several algorithmic extensions to AMIE+ which make it faster and more practical to use” How much faster? How do you measure “practical to use”?
- Page 2: “Experiments evaluating the proposed algorithms, showing considerable improvements over AMIE+.” What is considerable?
- The discussion of AMIE+ is not really a review of RDF rule mining as stated in the contributions.
- Something that puzzles me is that the authors claim that RDFRules provides the same results as AMIE+, but this is not evaluated. Are the numbers of rules reported in Table 3 the same for AMIE+ and RDFRules?

2. The notation, algorithms, and equations are not clear or sound in many cases. For example:
- Angle brackets are used for different purposes (e.g., notation of atoms), which are not clearly stated. In page 13, parentheses are used to denote an atom.
- Coverage is used with a non-standard definition. Coverage already has a clear definition in rule mining --- the fraction of records that satisfy the antecedent of a rule. It is a percentage.
- The ground substitution is not clearly explained and leaves unanswered questions. Would it be possible to have $\theta = {?a=, ?b=}$? How do you handle owl:sameAs equality?
- Page 6: In the bsize equation, it is not clear how $B\theta$ can be compared to a conjunction of triples. A similar situation arises in the $bsize_{pca}$ definition.
- Algorithm 1, line 4: It is not clear if you process unique predicates or all instances of predicates for each triple.
- Algorithm 2, line 9: $map[r]$ is treated as a set and as a counter.
- Algorithm 2, line 13: another use of angle brackets not clearly described.
- Similar issues with Algorithm 3.
- Algorithm 5 has a few issues with set operators and it’s not clear when the loop should be broken.
- Similarity equations on page 16 have issues in their definitions. (See attached PDF for more details.)
- The definitions of head coverage (hc) and confidence (conf) differ from the ones introduced in Galárraga et al. 2015. Could you explain why the difference?
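
For context, the standard measures from Galárraga et al. (2015) referenced in the last point can be sketched as follows. The counts are toy values, and the lift variant shown is only one common adaptation, offered as an assumption, since the paper defines its own variant for RDF KGs:

```python
from fractions import Fraction as F

def head_coverage(support, head_size):
    # hc(rule) = supp(rule) / number of facts with the head's predicate
    return F(support, head_size)

def std_confidence(support, body_size):
    # conf(rule) = supp(rule) / number of instantiations satisfying the body
    return F(support, body_size)

def lift(confidence, head_size, n_pairs):
    # One common adaptation of lift to KGs: confidence divided by the prior
    # probability of a head fact over all candidate pairs (assumption only;
    # the paper's exact definition may differ).
    return confidence / F(head_size, n_pairs)

supp, head, body, pairs = 30, 100, 50, 1000
print(head_coverage(supp, head))                       # 3/10
print(std_confidence(supp, body))                      # 3/5
print(lift(std_confidence(supp, body), head, pairs))   # 6
```

Spelling out the baseline definitions like this would let the authors state precisely where and why their hc and conf diverge.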

3. A few of the proposed optimisations were not tested in the experiments section, or were tested insufficiently.
- The graph-aware rule mining is not evaluated at all. Furthermore, the only dataset, YAGO core, is only a part of YAGO that does not contain the YAGO taxonomy with the entity types. Therefore, the rules shown in the examples are not feasible to obtain. Finally, the whole point of improving speedup and scalability should be tested with datasets larger than 948K triples, e.g., DBpedia and Wikidata. This should improve the comparison with Galárraga et al. 2015.
- The binning or discretisation approach for numerical attributes introduced in Section 5.2 is not evaluated.
- The proposed lift metric in Section 5.6 is not measured at all in the experiments.
- The proposed rule clustering in Section 5.7 is not used beyond mentioning that RDFRules supports it with the DBscan algorithm.
- The proposed rule filtering in Section 5.8 is not used.

Review #2
Anonymous submitted on 15/Feb/2020
Suggestion:
Major Revision
Review Comment:

Essentially the paper analyzes some weak points of the reference RDF rule induction system AMIE+ and proposes a number of improvements which are implemented in a new system dubbed RDFRules.

* Most of the improvements concern efficiency issues, and their value is supported by an empirical evaluation to some extent.

* Hence the contribution is interesting but still incremental with respect to the base AMIE+. There is some advance in the state of the art, but new research directions are not opened.

* Some solutions are borrowed from ILP/Relational Learning and transposed to this setting; they can be considered clever adaptations of existing approaches (top-k outcomes, lift measure).

* In my opinion the most interesting improvement is the possibility of defining specific constraints on the desired rules to be mined, which also allows dramatic speedups of the search process.

* Results seem to be technically sound.

The main weak points seem to be:
1. The lack of an in-depth critical discussion of the motivations for mining rules:
* what is the intended use? what are they meant for? (exploration only? logic reasoning?)
* what is the specificity w.r.t. clausal logic theories that can be induced using ILP systems?
* how are they supposed to be integrated with Semantic Web knowledge bases?
* what are the advantages / disadvantages w.r.t. rules definable through rule languages for the Semantic Web?

2. The empirical evaluation ought to be extended and strengthened to support the claims on the improvements more convincingly, also in the light of the motivations required by the previous point.

Overall, the paper is well-written and endowed with useful examples and figures.

# Specific Comments

1. The authors motivate the inadequacy of current AR learning systems in the context of ILP by the fact that they make the CWA by default instead of the OWA, which is required by the intended semantics of Semantic Web KBs.
* Yet the semantics of the rules mined by the proposed system seems not to take this issue into account, following the solutions already proposed for AMIE+ based on further assumptions.
* Having weakened the specificity of the intended semantics would justify a comparison with the aforementioned ILP systems.

2. A discussion of what is missed w.r.t. the original semantics of the data when making the mentioned simplifying assumptions may help better motivate the paper.
* what is lost in terms of effectiveness?
* what is the use of the rules if logical strength is not enforced? non-standard inference services?
* how are these rules to be integrated if they risk conflicting with the original semantics as formalized in the Web ontologies where classes and properties are defined?

3. The related work section should focus specifically on association rule representations that can be (semantically) integrated with existing RDF KBs hence touching possible issues related to reasoning (not just querying).

4. In \S 3.2.2: "the rule atoms must not be reflexive" doesn't that depend on the semantics of the underling relation/property?

5. In \S 3.3, when discussing measures for association rules in AMIE+, some form of counting (e.g. bsize) is sometimes required that would presuppose the UNA to hold.
This can be a critical point in a scenario where the KG/KB is distributed on various sources.
Could you please discuss this issue?

6. The issue of noisy triples or inconsistent assertions due e.g. to cardinality restrictions may be worth a discussion: how do the two systems react to such cases?

7. In Sect. 4 standard good practices from the data mining area are proposed as solutions to the limitations highlighted for AMIE+:
1. the goal of "extracting an exhaustive set of rules", as undertaken in the proposed system, may be worth a more detailed specification (in terms of the presented performance indices?)
2. p.8, col. left, l.10: "An integrated approach is even more necessary in the linked data context as general algorithms for data pre-processing are difficult to apply to linked datasets due to the different structure of inputs as well as outputs" could you be more specific on this point?
3. On numerical data (\S 4.1): "Since numeric attributes have typically many values"; actually, they often range over infinite domains even when reduced to intervals. The solution based on discretization techniques finding emerging sub-ranges is quite straightforward, yet this issue ought to be discussed also with respect to the underlying representation.
4. also the issue of the "uniqueness" of the property values in the dataset is questionable in the underlying scenario, considering also the meta-properties (such as functionality)
5. The extension to top-k outcomes (\S 4.2) seems to be quite straightforward and incremental given the exploratory nature of the mining task. Of course taming the combinatorial explosion is the actual difficulty, as recognized by the authors. Dually, problems of incompleteness may arise.
6. Algo 4: please shortly discuss the termination of the recursive function [especially if the actual implementation is recursive].
7. Discretization technique: have other approaches been tried? if so what motivates the choice of the _equal-frequency_ approach
8. In Rule Clustering (\S 5.7) a similarity measure for rules is defined in terms of atom similarity. The latter seems very simplistic and does not seem to exploit the semantic features, e.g. property subsumption. This choice would deserve some motivation (efficiency?).
9. The section on experiments features strong and weak aspects: overall it does not seem to target in depth all the various extensions proposed for the original reference system.
1. a single large dataset (in terms of triples) is considered, hinting that efficiency is the main objective of the empirical evaluation. However, the number of classes and properties is also important since, together with the number of constants, they combinatorially determine the total number of candidate triples that can be inserted into each new rule
2. including one/more dataset(s) with different stats on these numbers may strengthen the claims (see papers on AMIE)
3. the setup of parameters like minimum head size threshold and maximum rule length should be motivated: have these values been found via cross-validation? what objective-function was used to select the best values?
4. Also the number of replications of the experiment would deserve a comment.
5. elapsed time and number of found rules are considered; it would be interesting to measure also other aspects, e.g. related to the semantics or utility of the mined rules (in other passages "meaningful rules" are mentioned)
6. availability of code and reproducibility of the experiment are among the strong points
7. Tab.3: the possible causes for the case showing a very large difference (hours) would deserve some discussion
8. Results regarding the Top-k Approach and Rule Patterns seem to be quite expected.
9. As ILP systems share similar CW assumptions (and no particular form of reasoning involving the underlying ontology seems to be considered), it would be interesting to consider them for a comparison: please provide solid motivations for their exclusion
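
To make point 7 above concrete, a generic equal-frequency discretisation can be sketched as follows. This is an illustration of the standard technique, not RDFRules' actual implementation:

```python
def equal_frequency_bins(values, k):
    """Split numeric values into k bins with (roughly) equal counts.
    Returns (lo, hi) boundary pairs per bin. Generic sketch only."""
    vals = sorted(values)
    n = len(vals)
    bins = []
    for i in range(k):
        # each bin takes a contiguous slice of the sorted values
        start, end = i * n // k, (i + 1) * n // k
        bins.append((vals[start], vals[end - 1]))
    return bins

# Nine ages split into three equally populated ranges:
print(equal_frequency_bins([23, 31, 25, 40, 52, 38, 61, 45, 29], 3))
# [(23, 29), (31, 40), (45, 61)]
```

The natural alternatives to compare against would be equal-width binning or supervised methods such as entropy-based (MDLP) discretisation, which is what the question about alternative approaches asks for.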

# Minor Issues
(numbers per pages and lines in the manuscript)

## General comments:
* Readability may be improved by using a different font for the examples (e.g. via \tt, \mathtt, or \sf/\mathsf etc.) and by using a single pair of delimiters for triples (\langle \rangle or parentheses).
* In the Algorithms, please consider using boldface keywords, as colors are sometimes not preserved

(page - column - line)

1.
Abstract.
16. References in the abstract should be avoided
left col.
37. "apriori" (also in other occurrences) -> "Apriori" ? Is it related to "arules" [9] in the following page?
right col.
30. "potentially incomplete" -> "inherently incomplete"

2.
left col.
28.-36. A bit repetitive w.r.t. the introduction of the contribution made in the right column
right col.
8. "We provided benchmarks..." consider using present or future tenses for parts to be presented in following sections.
49. The indication in the main text of the preliminary paper title is superfluous given the citation ([10]) and can be removed.

4.
left col.
28. Is it really necessary to have a citation in the section title?

5.
right col.
28. "from the KG covered" -> "in the KG that are covered"

6.
right col.
26. "If some predicate p is completely absent for some subject s" please rephrase this sentence

8.
left col.
35. "unusably": has this word been checked against a dictionary?

11.
left col.
Algo 3, line 7: swap the conditions?
17.
left col.
12. "RDF-style data" -> "style?" why not just "RDF data"
20.
right col.
34. (and ff?) "logical rules only with variables", please rephrase

Review #3
By Jedrzej Potoniec submitted on 21/Feb/2020
Suggestion:
Minor Revision
Review Comment:

The paper presents a set of modifications to AMIE+ making mining rules from RDF knowledge graphs more efficient both in terms of runtime and user experience. Summarizing along the reviewing dimensions:
* Originality: the paper presents an original contribution by extending AMIE+ with faster projection counting during the mining; on-line construction of numerical ranges in rules; explicit support for named graphs; more complex rule patterns; top-k approach to dynamically adjust minimal support and minimal confidence thresholds; introducing the lift measure; introducing clustering of rules with DBscan to decrease cognitive overload of the user; rule pruning based on ranking and coverage of the rules. I find this contribution original.
* Significance: the presented contribution is incremental, in the sense that it adds small bits here and there to a preexisting algorithm. I am not sure whether the Semantic Web Journal is the right venue, but I acknowledge that the presented approach pushes the state of the art.
* Quality of writing: the paper is well written, but somewhat lacking in the discussion of the related work. Some parts are not as precise and polished as they should be; they are listed in detail below.

Detailed remarks:
* Numerous related papers were omitted from discussion, all related to frequent pattern mining and, in turn, to various forms of association rule mining, e.g.,
** Thomas Pellissier Tanon, Daria Stepanova, Simon Razniewski, Paramita Mirza, Gerhard Weikum: Completeness-Aware Rule Learning from Knowledge Graphs. International Semantic Web Conference (1) 2017: 507-525
** Claudia d'Amato, Steffen Staab, Andrea G. B. Tettamanzi, Duc Minh Tran, Fabien L. Gandon: Ontology enrichment by discovering multi-relational association rules from ontological knowledge bases. SAC 2016: 333-338
** Claudia d'Amato, Andrea G. B. Tettamanzi, Duc Minh Tran: Evolutionary Discovery of Multi-relational Association Rules from Ontological Knowledge Bases. EKAW 2016: 113-128
** Mikolaj Morzy, Agnieszka Lawrynowicz, Mateusz Zozulinski: Using Substitutive Itemset Mining Framework for Finding Synonymous Properties in Linked Data. RuleML 2015: 422-430
** Agnieszka Lawrynowicz, Jedrzej Potoniec: Pattern Based Feature Construction in Semantic Data Mining. Int. J. Semantic Web Inf. Syst. 10(1): 27-65 (2014)
** Agnieszka Lawrynowicz, Jedrzej Potoniec: Fr-ONT: An Algorithm for Frequent Concept Mining with Formal Ontologies. ISMIS 2011: 428-437
** Joanna Józefowska, Agnieszka Lawrynowicz, Tomasz Lukaszewski: The role of semantics in mining frequent patterns from knowledge bases in description logics with rules. Theory Pract. Log. Program. 10(3): 251-289 (2010)
** Jedrzej Potoniec, Piotr Jakubowski, Agnieszka Lawrynowicz: Swift Linked Data Miner: Mining OWL 2 EL class expressions directly from online RDF datasets. J. Web Semant. 46-47: 31-50 (2017)

* Sec. 3.1, line 41: "In the Semantic Web, any subject or predicate ..." -> This is a bit vague.
* Sec. 3.2.1, lines 8-9: "An AMIE+ atom is a statement which is further indivisible" -> What does it mean for a statement to be divisible?
* Sec. 3.2.1 -> Why are there commas in some of the triples (e.g. in lines 32-33) but not in others (e.g. in lines 25-26)?
* Sec. 3.3.1 -> I am confused why the name "atom size" was introduced. To me it feels like support. A discussion why it is not support would be very welcome.
* Sec. 3.3.1, lines 27-28 -> Why angle brackets?
* Sec. 3.3.3 -> The definition of bsize_{pca} is very complex and I am not sure I got it right. Additional description would be welcome. Also, I don't think "o'" can be called a "dangling variable" given that you prefix variables with a question mark. Possibly a rewording or a discussion is necessary.
* Sec. 3.4 "the maximum rule length" -> I think at this point the term "rule length" is not defined.
* Sec. 3.4 "indexed KG" -> I think this term is not defined.
* Sec. 3.5: The description is clear, but the formula in lines 33-34 is confusing. Maybe replace it with a picture giving an example, e.g., like Figure 2 in https://arxiv.org/pdf/1710.07114.pdf
* Sec. 4.1 and Sec.5.2: I feel that here you are changing the hypothesis space of the algorithm. So far the possible values of the variables were directly related to the nodes of the considered KGs. Now, you demand introduction of some ranges/bins/etc. I believe this requires additional discussion and formalization of the considered hypothesis space.
* Sec. 4.7: I find the example unconvincing, because to infer that "nyt:N8209 a dbo:PopulatedPlace" you just put everything in a single graph and you're done.
W.r.t. support for owl:sameAs, this can be easily addressed by a preprocessing step (select a single representative for every equivalence class of IRIs)
* Algorithm 4: What does operator ++= do?
* Sec. 5.4: The presented grammar looks very impressive, but is it really necessary?
* Sec. 5.7, page 16, "two atom items": What are atom items?
* Sec. 5.7, page 16, sim_t: What if s1=s2, but p1!=p2? From the definition it follows that sim_t=1, but it seems counterintuitive. Also, given two variables, say ?c and ?d, does ?c=?d hold? Both answers are justifiable and this should be explained in the paper.
* Sec. 5.7, sim_r: I find this definition a bit surprising. I imagine the following situation:
r1 = A1 ^ A2
r2 = B1 ^ B2
and the following similarities: sim_a(A1, B1) = 1, sim_a(A2, B1) = 2/3, sim_a(A1, B2) = 0, sim_a(A2, B2) = 0.
Now sim_r(r1, r2) = 1/2 * [max(1, 0) + max(2/3, 0)] = 5/6, which is pretty strange given that half of r2 is completely different from anything in r1.
* Figure 3: Possibly a set of rules for ranking is not a figure?
* Sec. 6.5, paragraph Constraints: I don't get it. Patterns are described in details as a contribution to the algorithm, whereas constraints are only mentioned in the description of implementation. Why is that?
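
For the record, the sim_r computation in the example above can be reproduced with a short sketch. It assumes one plausible reading of the definition, namely that sim_r averages, over the atoms of r1, the best atom-level match found in r2:

```python
from fractions import Fraction

def sim_r(r1, r2, sim_a):
    # Assumed reading: average over r1's atoms of the best match in r2.
    return sum(max(sim_a[(a, b)] for b in r2) for a in r1) / len(r1)

# Reviewer's example: half of r2 (B2) matches nothing in r1, yet sim_r is high.
sim_a = {("A1", "B1"): Fraction(1), ("A2", "B1"): Fraction(2, 3),
         ("A1", "B2"): Fraction(0), ("A2", "B2"): Fraction(0)}
print(sim_r(["A1", "A2"], ["B1", "B2"], sim_a))  # 5/6
```

Under this reading the measure is indeed asymmetric and ignores unmatched atoms of r2, which confirms the counterintuitive 5/6 result; a symmetrized variant (e.g. averaging sim_r(r1, r2) and sim_r(r2, r1)) would avoid it.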