Review Comment:
In this paper, the authors present the querying component of the Knora platform, which is dedicated to the management of datasets in the humanities. The platform is open source and comes with rich documentation. The system offers several general features, such as versioning and user permissions, as well as features of particular interest for the humanities, mainly searching by date and by markup in texts.
Their technical solution for providing these features is based on (i) a modified version of SPARQL, (ii) a Web API offered as an alternative to the SPARQL HTTP protocol implemented by SPARQL endpoints, (iii) a query reformulation mechanism that translates the input query into one or several standard SPARQL queries executed over a triplestore, and (iv) a core data model suitable for managing versioning and permissions.
This technical solution is motivated by the following claims:
1. There is an assumption in the design of triplestores that "everything in the triplestore should be accessible to the client, and thus offer no way to restrict query results according to the client’s permissions" (page 2, second column, line 47)
2. Triplestores "provide no way to work with data that has a version history" (page 3, first column, line 1)
3. "A SPARQL endpoint accepts queries that are processed directly by an RDF triplestore" (page 1, second column, line 33)
4. A SPARQL endpoint does not enforce pagination but "allows a client to request all the data" (page 2, second column, line 21). This seriously limits its scalability.
5. Their system provides the "scalability and [the] efficiency of a web API" (page 7, first column, line 33).
6. "SPARQL does not provide type checking; if a SPARQL query uses a property with an object that is not compatible with the property definition, the query will simply return no results" (page 4, second column, line 43).
Claim #1 is in my view excessive, as most triplestores (e.g. Virtuoso, GraphDB, Stardog) actually offer some solutions for access control. However, the key difference between these systems and what the authors propose is that in the mentioned triplestores, access control policies are not part of the RDF data model but are specified in an out-of-band manner. Instead, the authors propose an in-band approach, where access control rules are part of their RDF data model. This difference deserves to be discussed as such, and the first claim should be moderated.
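To make the contrast concrete, an in-band approach might attach permission statements directly to resources and filter on them during query rewriting; a minimal sketch (the property names and the injected pattern are invented for illustration and do not reflect Knora's actual model):

```sparql
# Hypothetical in-band permission filtering: the rewriting layer injects
# a permission pattern so that only resources visible to the requesting
# user can match.
PREFIX ex: <http://example.org/>

SELECT ?title
WHERE {
  ?resource ex:title ?title ;
            ex:allowedUser ex:alice .   # pattern injected for user "alice"
}
```

With an out-of-band approach, the same SELECT would be issued unchanged and the store's access-control layer, configured outside the RDF data, would filter the results instead.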
Regarding the second claim, several works address versioning in the RDF landscape, but none of them is cited (see for instance https://rdfostrich.github.io/article-versioned-reasoning/). Basic mechanisms for attaching context to triples, such as RDF reification and named graphs, are not mentioned either. I would suggest that the authors have a look at the recent RDF*/SPARQL* proposal so as to enrich the references of the paper.
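Named graphs alone already provide a basic versioning substrate on top of a standard triplestore; a minimal TriG sketch (graph names and the validity vocabulary are invented for illustration):

```trig
# Each version of a resource lives in its own named graph; validity
# metadata about the graphs is recorded in the default graph.
@prefix ex:  <http://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:v1 ex:validFrom "2019-01-01T00:00:00Z"^^xsd:dateTime .
ex:v2 ex:validFrom "2019-06-01T00:00:00Z"^^xsd:dateTime .

ex:v1 { ex:doc ex:title "Draft" . }
ex:v2 { ex:doc ex:title "Final" . }
```

A client can then retrieve a given version with a plain SPARQL GRAPH pattern, which weakens the claim that triplestores "provide no way" to work with versioned data.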
Claim #3 introduces a confusion between the SPARQL endpoint, which is fundamentally an interface, and one of its possible implementations. Like any interface, a SPARQL endpoint can be implemented in many different ways. To take an example, consider an Ontology-Based Data Access (OBDA) system such as D2RQ or Ontop, where SPARQL queries received by the endpoint are translated into SQL queries executed over a legacy relational database. Here there is no triplestore on the back end, and the RDF and database data models might be completely different. Said differently, it is perfectly possible to set up a SPARQL endpoint that performs query reformulation.
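In such OBDA systems, the mapping between the relational and RDF views is expressed declaratively; a minimal R2RML fragment (the table and template names are invented for illustration):

```turtle
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix ex: <http://example.org/> .

# Maps rows of the relational table BOOKS to RDF resources: the endpoint
# answers SPARQL queries by translating them into SQL over this table.
<#BookMap>
  rr:logicalTable [ rr:tableName "BOOKS" ] ;
  rr:subjectMap   [ rr:template "http://example.org/book/{ID}" ] ;
  rr:predicateObjectMap [
    rr:predicate ex:title ;
    rr:objectMap [ rr:column "TITLE" ]
  ] .
```

The endpoint remains a standard SPARQL interface throughout, even though no triplestore is involved.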
Claim #4 is valid. Enforcing a LIMIT that is kept implicit in the user query, as proposed by the authors, breaks the current version of the SPARQL HTTP protocol, as there is no standard mechanism to inform the client that a LIMIT has been applied. The client could therefore wrongly assume that all the results have been retrieved, which may not be the case. It could also be interesting to compare this with another solution in which the user is required to include a LIMIT in the query, with a limit value less than or equal to a server-side threshold. Both solutions have pros and cons that would be worth discussing.
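One way the implicit-LIMIT variant could at least detect truncation is to over-fetch by one row; a sketch, assuming a server-side threshold of 1000 (the value and the signalling mechanism are invented for illustration):

```sparql
# Client query, submitted without any LIMIT:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?s ?label WHERE { ?s rdfs:label ?label }

# Server-side rewriting, over-fetching one row past the threshold:
SELECT ?s ?label WHERE { ?s rdfs:label ?label } LIMIT 1001
# If 1001 rows come back, the server knows the result set was truncated
# and could signal this to the client, e.g. via a response header --
# though no such mechanism exists in the standard protocol.
```

The explicit-LIMIT alternative avoids the signalling problem entirely, at the cost of rejecting otherwise valid queries.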
Regarding the scalability issues of SPARQL endpoints, the authors refer to a well-known blog post by Dave Rogers explaining that the possibility to retrieve the entire content of a SPARQL endpoint with a single query causes a severe scalability issue. However, the authors seem not to have paid enough attention to a comment on this blog post (https://daverog.wordpress.com/2013/06/04/the-enduring-myth-of-the-sparql...) in which Dave Rogers says that "pagination does not solve the problem of query complexity. With enough data, it is straightforward to write inefficient or complex SPARQL that returns only a few results". Even if the proposed Web API distinguishes itself from SPARQL endpoints by enforcing pagination, it still preserves the query complexity of the SPARQL query language. Such a Web API appears to be much closer to a regular SPARQL endpoint than to the ordinary Web APIs Dave Rogers was referring to (as scalable solutions). In the absence of any experimental results, it remains unclear what supports claim #5.
I agree that returning an empty result set for ill-formed queries instead of returning an error, as SPARQL does, is very counter-intuitive to many users. However, I found the description of the "type checking" mechanism unclear and not detailed enough. It also gives the impression that type inference is not always needed, as in the case of the "?linkProp" variable (page 5, first column, line 25), whose values might perhaps be retrieved by standard SPARQL mechanisms. To improve this description, I would suggest that the authors evaluate the possibility of aligning their proposal with existing well-defined Semantic Web standards such as SHACL.
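The empty-result behaviour the authors criticise is easy to reproduce; a sketch, assuming a datatype property whose range is declared as xsd:integer (names invented for illustration):

```sparql
# ex:pageCount is declared with range xsd:integer, but the query uses a
# plain string literal as its object. Under SPARQL's graph-matching
# semantics this simply yields zero solutions rather than a type error.
PREFIX ex: <http://example.org/>

SELECT ?book WHERE {
  ?book ex:pageCount "three hundred" .
}
```

Validating the types inferred from such a query against declarative constraints, as SHACL allows, could turn this silent failure into an explicit error.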
In general, it would be highly appreciated to provide a clear presentation of the full query reformulation process by reusing standards and highlighting the similarities with existing mechanisms. Query reformulation rules introduced by GeoSPARQL and the notion of mapping between two data models proposed in the R2RML standard could be two sources of inspiration.
The weakest point of this paper is currently the overly preliminary comparison with related work, which makes it hard to assess the relevance of the technical choices made. The bibliography includes only 7 citations, only one of which is a peer-reviewed paper, and therefore needs to be significantly enriched. I am confident that a good alignment with existing work from the Semantic Web field could overcome the "ad hoc" impression the paper currently gives.
As this paper has been submitted as a 'Tools and Systems Report', one would expect to find more about the impact of the tool and its limitations.
Other remarks
-------------
It would be interesting to mention which security model has been adopted (e.g. RBAC, ACL).
The use of a novel datatype for historical dates instead of xsd:date is a very interesting point which in my view deserves further discussion, in particular when it comes to different levels of precision.
Contrary to what is stated on page 4, second column, line 14, solution modifiers such as ORDER BY and OFFSET actually do play a role in CONSTRUCT queries: https://www.w3.org/TR/sparql11-query/#SolModandCONSTRUCT.
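Per the section of the specification linked above, solution modifiers are applied to the solution sequence before the template is instantiated; for instance (vocabulary invented for illustration):

```sparql
# Constructs a graph containing only the triple for the second-lowest
# ?price: ORDER BY sorts the solution sequence, then OFFSET/LIMIT select
# one solution, and only that solution instantiates the template.
PREFIX ex: <http://example.org/>

CONSTRUCT { ?item ex:price ?price }
WHERE { ?item ex:price ?price }
ORDER BY ?price
OFFSET 1
LIMIT 1
```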
If we consider HyperGraphQL as an interface and not as an implementation, which limitations does it share with SPARQL endpoints?
To what extent does the function knora-api:match(...) differ from the standard function CONTAINS(...)?
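The comparison matters because CONTAINS performs plain substring matching on the lexical form; for instance (data invented for illustration):

```sparql
# CONTAINS matches substrings, not words: the filter below matches the
# literal "a category of things" because "cat" is a substring of
# "category".
PREFIX ex: <http://example.org/>

SELECT ?text WHERE {
  ?s ex:text ?text .
  FILTER(CONTAINS(?text, "cat"))
}
```

If knora-api:match(...) is token-based (matching whole words via a full-text index), it would behave differently on this example; making that contrast explicit in the paper would help.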