Review Comment:
The paper proposes a rule-based approach to validate RDF graphs. As far as I know, this is an original approach that combines the use of rules for inferencing as well as for validation.
The results are interesting: the approach can improve the performance of techniques that separate inferencing from validation, it can increase expressiveness by representing constraints that other approaches cannot, and it can offer better explanations of the violations.
The paper is well written and the approach is sound. Nevertheless, I think the authors are exceedingly optimistic in their assessment of the benefits of their approach, and I suggest that they rewrite several parts of the paper to point not only to the pros but also to the cons. Given that this is a research paper, it must offer a more objective comparison with alternative approaches and avoid a style that sometimes reads like a marketing paper.
As an example, the sentence that opens Section 3 (Comparative analysis) says: "In this section, we show the shortcomings of existing approaches, and …". I think the authors need to present the existing approaches objectively, and not just their shortcomings.
In the same way, the conclusions show only the benefits of the rule-based approach, ignoring its trade-offs. For example, combining inference and validation removes the separation of concerns between the two tasks, which in some contexts is preferable. Inference is often tackled by ontology engineers with a focus on domain entities like people, while validation may be tackled by data engineers who are more focused on integrity constraints and data representation. Having different technologies for both can be important in contexts where domain ontologies are reused. In fact, in ShEx it is possible to validate the RDF graph before inference with some shapes and the RDF graph after inference with other shapes. This technique can be used to debug the inference process, and it seems the rule-based approach could not be applied to this use case.
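As a sketch of that ShEx debugging pattern (a hypothetical ShExC fragment; the prefixes and shape names are invented for illustration): one shape is checked against the raw graph, and a second shape against the graph produced by the reasoner.

```shex
PREFIX ex:   <http://example.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>

# Checked against the RDF graph BEFORE inference:
# only explicitly asserted triples are expected.
ex:RawPerson {
  a [ foaf:Person ] ;
  ex:birthdate xsd:date
}

# Checked against the RDF graph AFTER inference:
# a supertype derived via rdfs:subClassOf must now be present,
# so a failure here points to a problem in the inference step.
ex:InferredPerson {
  a [ foaf:Agent ] ;
  ex:birthdate xsd:date
}
```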
There is no treatment in the paper of recursion and negation, a topic that was one of the main differences between ShEx and SHACL. Although the paper mentions the use of Scoped Negation as Failure, it is not clear whether this approach could be extended to handle recursion and negation as in ShEx (see [1]) or as in a recent proposal for SHACL (see [2]).
The State of the Art and the comparison with other technologies need to be updated to take into account recent work proposed for SHACL, such as SHACL rules [3], and to consider which of the constraint types mentioned in Hartmann's paper can in fact be expressed in SHACL-SPARQL.
In the same way, Hartmann's paper did not take into account that ShEx can also handle advanced constraints using semantic actions. In fact, many of those constraint types could be expressed using ShEx with semantic actions.
The related-work section differentiates between hard-coded systems and grammar-based approaches like Description Set Profiles or ShEx. However, it says that ShEx does not rely on SPARQL and concludes that it is a hard-coded system, which I think is misleading: ShEx is based on a well-founded semantics (see [4]), which is a different approach from a hard-coded system. This mistake is repeated in Section 3 (comparative analysis), where ShEx seems to have been placed in the "hard-coded" systems column, which I think is wrong. In the case of ShEx, I would also set the "Explanation" row to yes, because shape maps in ShEx can explain which nodes conform or do not conform to some shape.
In the following, I enumerate some minor comments:
Page 2- Example line (1):
:birthdate "01-01-1970"^^xsd:date
should be (xsd:date requires the ISO 8601 lexical form YYYY-MM-DD):
:birthdate "1970-01-01"^^xsd:date
Page 2. Problem P1. "however, current approaches only report which resource violate which constraints, not why the violation occurs". It is not clear to me which current approaches the authors are referring to. Do they also include ShEx? If so, and if a system tracks which triples come from the original RDF graph and which triples have been inferred, isn't it possible to show which triples raise the error?
Page 2, 2nd column. "and find implicit violations". This is the first appearance of the "implicit violation" concept, which I think refers to violations caused by inconsistencies. I would ask the authors to at least define that concept, although I think it would also require some more justification of why an inconsistency is a violation: in ShEx/SHACL, an RDF node could conform to some shape while the RDF graph is inconsistent.
Page 3. Problem P4. "it is not clear whether a piece of RDF data came from the original dataset or was inferred (P1)". There are systems that can differentiate between triples that are from the original RDF graph and triples that have been inferred.
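As a sketch of how such provenance could be kept (the names are invented; TriG-style named graphs are just one possible mechanism):

```trig
@prefix ex: <http://example.org/> .

# Triples asserted in the original data.
ex:asserted {
  ex:alice ex:knows ex:bob .
}

# Triples produced by the reasoner are stored separately, so a
# validation report can state whether a violating triple was
# asserted or inferred.
ex:inferred {
  ex:alice a ex:Person .
}
```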
Page 6, end of first column. The sentence "ShEx does not rely on an underlying technology such as SPARQL to perform validation, a hard-coded system is used instead" is wrong: the fact that ShEx does not rely on an underlying technology does not imply that it must be implemented as a "hard-coded" system. ShEx is defined as a domain-specific language with a denotational semantics that admits different implementation strategies; subsets of ShEx, or even ShEx itself, could be implemented with other strategies such as a rule-based engine.
Section 2.3 talks about validation reports. In this context, it would be worth mentioning that ShEx defines result shape maps as the result of the validation process.
Page 7, first column. I don't agree with the sentence "Supporting inferencing rules is thus an important requirement for validation approaches": although I understand the motivation for supporting inferencing rules, I don't think it should be a validation requirement. This is in fact a controversial statement based on a single study (Hartmann's PhD thesis); further research could be done on what the best validation requirements are. From a different perspective, adding inferencing rules to a validation system can be seen as extending the expressiveness of the validation language too much, in an uncontrolled way, which may not be desirable. If the task is to describe and validate the structure of RDF graphs, some people could consider it better to have a well-defined language with a clear semantics rather than a more expressive language whose rules are difficult to define and debug.
Page 7, Section 3 (first sentence). The sentence "we show the shortcomings of existing approaches" sounds too strong to me. I would prefer the authors to include a more objective comparative analysis, rather than criticizing the other approaches without also discussing the shortcomings of their own.
Page 7, "…via translation of the SPARQL queries using property paths [21][23]" Why do the authors include reference [23] here?
Page 7, 2nd column. "…without inspection the code…"
Page 7, "…Customization of hard-coded systems is limited without requiring a development effort [50]"…why do you include the reference [50] here?
Page 8, Table 2. I am not sure in which of the columns ShEx or SHACL could be included; maybe add a specific column for each of them?
Page 8, Figure 1. In both diagrams a box titled "Background knowledge" is presented, but it is not described in the paper, and in fact I have doubts about whether it is really necessary. ShEx and SHACL don't have a separate input for "background knowledge".
Page 8. "…has the following disadvantages: (1) multiple systems need to be combined and maintained, e.g. a reasoner and a querying endpoint". I understand the need for a reasoner, but why is a querying endpoint necessary?
"…(ii) different languages need to be learned and combined for the inferencing rules and constraints (e.g. OWL and SPARQL)". Why is SPARQL necessary? And why is it a disadvantage to have two different languages for two different tasks? I consider it good practice because it promotes a better separation of concerns, much like HTML and CSS, which are different languages because they tackle different concerns.
Page 8, end of 2nd column. "Moreover, a rule-based reasoner natively supports custom inferencing rules, and thus, custom entailment regimes". Is this really an advantage? It can also be seen as a challenge if the rule-based reasoner infers triples with a custom semantics that differs from OWL.
In this respect, maybe the authors should also mention the problem that SHACL offers an entailment that is like a subset of RDFS but different from RDFS: it supports rdfs:subClassOf, but does not support, for example, rdfs:domain or rdfs:range.
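A small Turtle sketch of this mismatch (the property and class names are invented for illustration):

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ex:   <http://example.org/> .

ex:knows rdfs:domain foaf:Person .
ex:alice ex:knows ex:bob .

# Under full RDFS entailment, "ex:alice a foaf:Person" is inferred.
# SHACL's built-in handling only follows rdfs:subClassOf, so this
# triple is not derived, and a shape requiring foaf:Person for
# ex:alice would fail even though the data is RDFS-consistent.
```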
Section 4.1. The authors talk about SNAF; should they also discuss answer set programming?
Page 11, Figure 2 represents the component view of the rule-based approach. The title says "rule-based reasoner"; should it be "rule-based validator"?
That figure again contains an input titled "Background knowledge"; is it necessary?
The components of the figure, as represented, are very similar to the left part of Figure 1 (pre-processing approach), substituting "validator" by "constraint translation". I wonder whether both approaches are really as different as the authors claim. I understand that one possibility is that the "entailment regime" and "constraint translation" phases can be run in parallel, but that possibility could also be tried in the other approaches, where the validator could run at the same time as the reasoner, which would infer triples about the neighbourhood of a node on demand.
Page 11. "N3Logic supports at least OWL-RL inferencing…" so the system does not support OWL DL. Maybe the authors should mention this as another shortcoming of their approach: it is only viable when the reasoner can itself be implemented with rules.
Page 16, "regimes included"
Section 6.3. I think the sentence "Validatrr can support more constraint types than existing approaches RDFUnit, SHACL and ShEx" is wrong: in the case of ShEx, most of those constraint types could also be represented using semantic actions.
Page 17. "Without inferencing, our implementation is already faster for small RDF graphs. We perform about an order of magnitude faster until 10,000 triples, namely 1-2s per RDF graph compared to 30s per RDF graph…". Later, the authors talk about the set-up time required by RDFUnit: is it possible that, if the comparison removed that set-up time, both implementations would be equally fast?
Could the authors give some explanation about why after 10,000 triples, the times of their implementation increase considerably?
Page 18. "To make the results comparable, we used the EYE reasoner with the same RDFS rules to execute the reasoning preprocessing step…" What part of the time in RDFUnit is consumed by the reasoner compared to the validator? Could the authors use a different reasoner in RDFUnit?
Page 18. I could not understand the sentence "execution time drops from 120s to 80s for Validatrr whereas it rises from 25s to 185s for RDFUnit". Looking at the figure, I did not see a point where the execution time drops.
Page 18. The four paragraphs that start with "RDF graph size" seem to be a justification that most RDF graphs are not very big. Although I understand the argument, and it may be true that the current RDF graphs in LOD Laundromat are not very big, I would not take that as a justification for not having well-performing validators. On the one hand, it may be that current RDF graphs are small because current technologies do not support very big RDF graphs well; on the other hand, with better tooling, RDF adoption could improve and bigger RDF graphs would appear. Finally, I think the size of RDF graphs will keep increasing as better tools and computational resources to manage them become available. So the whole argument that current RDF graphs have fewer than 100,000 triples is not significant, and I would suggest removing those four paragraphs, as they do not contribute to the validation approach at all.
Page 19. Conclusions. I think the authors should try to offer not only the benefits of their approach, but also to point to some drawbacks.
[1] I. Boneva, J. E. Labra-Gayo, E. Prud'hommeaux: Semantics and Validation of Shapes Schemas for RDF. 16th International Semantic Web Conference (ISWC 2017), 2017.
[2] J. Corman, J. L. Reutter, O. Savkovic: Semantics and Validation of Recursive SHACL. 17th International Semantic Web Conference (ISWC 2018), 2018.
[3] https://lists.w3.org/Archives/Public/public-shacl/2018Sep/0003.html
[4] http://shex.io/shex-semantics/