Review Comment:
Essentially the paper analyzes some weak points of the reference RDF rule induction system AMIE+ and proposes a number of improvements which are implemented in a new system dubbed RDFRules.
* Most of the improvements concern efficiency issues, and their value is supported to some extent by an empirical evaluation.
* Hence the contribution is interesting but still incremental with respect to the base AMIE+. There is a small advance in the state of the art, but no new research directions are opened.
* Some solutions are borrowed from ILP/Relational Learning and transposed to this setting; they can be considered clever adaptations of existing approaches (top-k outcomes, lift measure).
* In my opinion the most interesting improvement is the possibility of defining specific constraints on the rules to be mined, which also allows dramatic speedups of the search process.
* Results seem to be technically sound.
The main weak points seem to be:
1. The lack of an in-depth critical discussion of the motivations for mining rules:
* what is the intended use? what are they meant for? (exploration only? logic reasoning?)
* what is the specificity w.r.t. clausal logic theories that can be induced using ILP systems?
* how are they supposed to be integrated with Semantic Web knowledge bases?
* what are the advantages / disadvantages w.r.t. rules definable through rule languages for the Semantic Web?
2. The empirical evaluation ought to be extended and strengthened to support the claims on the improvements more convincingly, also in the light of the motivations required by the previous point.
Overall, the paper is well written and supported by useful examples and figures.
# Specific Comments
1. The authors argue that current AR learning systems in the context of ILP are inadequate because they make the CWA by default, instead of the OWA required by the intended semantics of Semantic Web KBs.
* However, the semantics of the rules mined by the proposed system does not seem to take this issue into account; it follows instead the solutions already proposed for AMIE+, which rest on further assumptions.
* Since the specificity of the intended semantics has thus been weakened, a comparison with the aforementioned ILP systems would be legitimate.
2. A discussion of what is lost w.r.t. the original semantics of the data when making the mentioned simplifying assumptions may help better motivate the paper.
* what is lost in terms of effectiveness?
* what is the use of the rules if logical strength is not enforced? non-standard inference services?
* how are these rules to be integrated if they risk conflicting with the original semantics as formalized in the Web ontologies where classes and properties are defined?
3. The related work section should focus specifically on association rule representations that can be (semantically) integrated with existing RDF KBs, hence touching on possible issues related to reasoning (not just querying).
4. In \S 3.2.2: "the rule atoms must not be reflexive": doesn't that depend on the semantics of the underlying relation/property?
5. In \S 3.3, when discussing measures for association rules in AMIE+, some form of counting is sometimes required (e.g. bsize), which would require the UNA to hold.
This can be a critical point in a scenario where the KG/KB is distributed on various sources.
Could you please discuss this issue?
6. The issue of noisy triples or inconsistent assertions due e.g. to cardinality restrictions may be worth a discussion: how do the two systems react to such cases?
7. In Sect. 4 standard good practices from the data mining area are proposed as solutions to the limitations highlighted for AMIE+:
1. the goal of "extracting an exhaustive set of rules", as undertaken in the proposed system, may be worth a more detailed specification (in terms of the presented performance indices?)
2. p.8, col. left, l.10: "An integrated approach is even more necessary in the linked data context as general algorithms for data pre-processing are difficult to apply to linked datasets due to the different structure of inputs as well as outputs" could you be more specific on this point?
3. On numerical data (\S 4.1): "Since numeric attributes have typically many values": actually they often range over infinite domains, even when restricted to intervals; the solution based on discretization techniques that find emerging sub-ranges is quite straightforward, yet this issue ought to be discussed also with respect to the underlying representation.
4. also the issue of the "uniqueness" of the property values in the dataset is questionable in the underlying scenario, considering also the meta-properties (such as functionality)
5. The extension to top-k outcomes (\S 4.2) seems to be quite straightforward and incremental given the exploratory nature of the mining task. Of course taming the combinatorial explosion is the actual difficulty, as recognized by the authors. Dually, problems of incompleteness may arise.
6. Algo 4: please briefly discuss the termination of the recursive function [especially if the actual implementation is recursive].
7. Discretization technique: have other approaches been tried? If so, what motivates the choice of the _equal-frequency_ approach?
8. In Rule Clustering (\S 5.7) a similarity measure for rules is defined in terms of atom similarity. The latter seems very simplistic and does not seem to exploit the semantic features, e.g. property subsumption. This choice would deserve some motivation (efficiency?).
9. The section on experiments features strong and weak aspects: overall it does not seem to target in depth all the various extensions proposed for the original reference system.
1. a single large dataset (in terms of triples) is considered hinting that efficiency is the main objective of the empirical evaluation. However, also the number of classes and properties is important as, together with the number of constants, they determine combinatorially the total number of candidate triples that can be inserted in each new rule
2. including one/more dataset(s) with different stats on these numbers may strengthen the claims (see papers on AMIE)
3. the setup of parameters like the minimum head size threshold and the maximum rule length should be motivated: have these values been found via cross-validation? what objective function was used to select the best values?
4. Also the number of replications of the experiment would deserve a comment.
5. elapsed time and number of found rules are considered; it would be interesting to measure also other aspects, e.g. related to the semantics or utility of the mined rules (in other passages "meaningful rules" are mentioned)
6. availability of code and reproducibility of the experiment are among the strong points
7. Tab.3: the possible causes for the case showing a very large difference (hours) would deserve some discussion
8. Results regarding the Top-k Approach and Rule Patterns seem to be quite expected.
9. As ILP systems share similar CW assumptions (and no particular form of reasoning involving the underlying ontology seems to be considered), it would be interesting to consider them for a comparison: please provide solid motivations for their exclusion
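To make Specific Comment 5 (counting when the UNA does not hold) more concrete, the following is a minimal sketch of the concern, not code from AMIE+ or RDFRules. The IRIs, the `ex:capitalOf` predicate, and the helpers `canonicalise` and `predicate_size` are hypothetical and introduced only for illustration; the count is a simplified stand-in for size-based measures such as bsize. It assumes that co-reference between sources is declared through sameAs-style links.

```python
def canonicalise(triples, same_as):
    """Rewrite subjects and objects to one representative per sameAs class.

    Hypothetical helper: a small union-find over the declared alias pairs.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in same_as:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the lexicographically smaller IRI as the representative.
            parent[max(ra, rb)] = min(ra, rb)

    # A set is returned, so triples that become identical collapse into one.
    return {(find(s), p, find(o)) for s, p, o in triples}


def predicate_size(triples, predicate):
    """Count the triples with the given predicate (simplified size measure)."""
    return sum(1 for _, p, _ in triples if p == predicate)


# Two sources describe the same fact with different identifiers (made-up data).
triples = {
    ("dbr:Prague", "ex:capitalOf", "dbr:Czech_Republic"),
    ("wd:Q1085", "ex:capitalOf", "wd:Q213"),
    ("dbr:Vienna", "ex:capitalOf", "dbr:Austria"),
}
same_as = [("dbr:Prague", "wd:Q1085"), ("dbr:Czech_Republic", "wd:Q213")]

naive = predicate_size(triples, "ex:capitalOf")
merged = predicate_size(canonicalise(triples, same_as), "ex:capitalOf")
print(naive, merged)
```

In this toy case the naive count treats the two descriptions of the same fact as distinct instantiations, while the count after merging aliases does not. Any measure derived from such counts would shift accordingly, which is why a discussion of the issue for distributed KGs/KBs would be welcome.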
# Minor Issues
(numbers per pages and lines in the manuscript)
## General comments:
* Readability may be improved by using a different font for the examples (e.g. via \tt, \mathtt, or \sf-\mathsf etc.) and a single pair of delimiters for triples (\langle \rangle or parentheses).
* In the Algorithms please consider using boldface keywords, as colors are sometimes not preserved
(page - column - line)
1.
Abstract.
16. References in the abstract should be avoided
left col.
37. "apriori" (also in other occurrences) -> "Apriori" ? Is it related to "arules" [9] in the following page?
right col.
30. "potentially incomplete" -> "inherently incomplete"
2.
left col.
28.-36. A bit repetitive w.r.t. the introduction of the contribution made in the right column
right col.
8. "We provided benchmarks..." consider using present or future tenses for parts to be presented in following sections.
49. The indication in the main text of the preliminary paper title is superfluous given the citation ([10]) and can be removed.
4.
left col.
28. Is it really necessary to have a citation in the section title?
5.
right col.
28. "from the KG covered" -> "in the KG that are covered"
6.
right col.
26. "If some predicate p is completely absent for some subject s" please rephrase this sentence
8.
left col.
35. "unusably": has this word been checked in a dictionary?
11.
left col.
Algo 3, line 7: swap the conditions?
17.
left col.
12. "RDF-style data" -> why "style"? Why not just "RDF data"?
20.
right col.
34. (and ff?) "logical rules only with variables", please rephrase