Gravsearch: transpiling SPARQL to query humanities data

Tracking #: 2290-3503

Benjamin Geer
Tobias Schweizer

Special Issue Cultural Heritage 2019

Submission type: Tool/System Report

Abstract: It has become common for humanities data to be stored as RDF, but available technologies for querying RDF data, such as SPARQL endpoints, have drawbacks that make them unsuitable for many applications. Gravsearch (Virtual Graph Search), a SPARQL transpiler developed as part of a web-based API, is designed to support complex searches that are desirable in humanities research, while avoiding these disadvantages. It does this by introducing server software that mediates between the client and the triplestore, transforming an input SPARQL query into one or more queries executed by the triplestore. This design suggests a practical way to go beyond some limitations of the ways that RDF data has generally been made available.

Major Revision

Solicited Reviews:
Review #1
By Pietro Liuzzo submitted on 26/Sep/2019
Minor Revision
Review Comment:

The Gravsearch tool is presented in this paper with a focus on some of its features that are relevant for humanities data. Although it is not entirely clear why such problems should be classified as humanities-related, the solutions are interesting, especially from the point of view of a developer of an application serving or using data in RDF. The longevity and interoperability of the system are questionable, and more references would be welcome on several points.

## Quality, importance, and impact of the described tool or system (convincing evidence must be provided).
The tool is presented with its qualities and limits, with examples that are rather convincing for the developer of a resource based on RDF data who wants to add a layer of control over access to the data. The relevance for users of the Gravsearch API is instead limited, which I think should be stressed, because, from what I understand, such a user would have benefits but would also have to learn another layer of syntax and know the Knora ontologies to an extent, thus losing some of the freedoms granted by SPARQL, as well as control over the actual data structure. It is also not clear whether and how Gravsearch could be used with resources and triplestores outside Knora. If this is stated in the documentation, I would suggest that the authors consider adding a sentence and a reference to clarify it. The specific relevance to humanities is in my opinion limited: the service would be useful for other types of data concerned with similar issues, like versioning and permission management.

## Clarity, illustration, and readability of the describing paper, which shall convey to the reader both the capabilities and the limitations of the tool.
The paper lacks references for many of the statements made, as well as clear definitions of terms like 'transpiling' or the name of the system itself. For example, focusing on the very important first paragraph of the introduction:
- By which measure is RDF data publication common in the humanities? My feeling would be the opposite, and I would really like to know what information this assertion is based on.
- The following statement about the cumbersomeness of querying humanities data with SPARQL would also need some supporting evidence or an example. In my opinion it is no more or less cumbersome than any other query language, provided the data is not cumbersome in itself, which would not be a fault of the query language as such.
- I would also suggest clarifying why versioning should be supported at this level, as it is not obvious, at least to me, why this concern should not be left to the triplestore alone, for example, rather than to the query language.
- It is also not self-evident that text markup and calendar-independent historical dates are humanities-focused data structure features.

The structure of the paper needs some minor, but important, revision in the organisation of the argument for it to be fully effective, in my opinion. Although the readability is fine, the clarity of the presentation is diminished by two main issues, which can however be easily overcome.

The first issue is the relevance to humanities of the highlighted issues of a SPARQL endpoint, namely a) no support for versioning; b) no support for permissions; c) no limits on query result size. I would recommend highlighting why these are so specific to humanities data.

The second issue is the organisation of the information in the paper, which mixes problems and solutions in the argument. In particular, the organisation of the information in the two subsections of the introduction does not give the reader a clear view of the problems and the proposed solutions, which are anticipated immediately. I think this crucial part of the paper would instead benefit from real-life examples of the issues, which could then be recalled later in the paper when discussing the features of the software that solve them, instead of simply stating: this solves that. Some examples of problems, perhaps this time problems of the development, might usefully be anticipated and referred to also when discussing the features of the hybrid SPARQL/Web API approach. At the moment the problem definition actually occurs under the section "Related work", while the section called "Problem definition" offers a mixture of solutions and issues.

Some other, more specific notes for review follow.

The interesting method of comparing dates via JDNs is neither explained in full nor supported by references to other sources for further clarification. How would a date in a calendar whose correspondence to JDNs is not known or computable be handled by Gravsearch? What about other projects dealing with such issues, like GODOT, ChronOntology and PeriodO?
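For context, the JDN mechanism can be sketched in a few lines. This is my own illustration using the standard calendar-to-JDN conversion formulas, not the authors' implementation:

```python
def gregorian_to_jdn(year: int, month: int, day: int) -> int:
    """Julian Day Number of a (proleptic) Gregorian calendar date."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - y // 100 + y // 400 - 32045

def julian_to_jdn(year: int, month: int, day: int) -> int:
    """Julian Day Number of a Julian calendar date."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083

# The same historical day expressed in two calendars yields the same JDN,
# which is what makes calendar-independent comparison of dates possible.
print(gregorian_to_jdn(1582, 10, 15))  # 2299161
print(julian_to_jdn(1582, 10, 5))      # 2299161
```

A calendar with no known or computable correspondence to JDNs would fall outside such a scheme entirely, which is exactly why the question above matters.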

I have also used Europeana as an example SPARQL endpoint for obvious reasons, but I have been told that, having been developed in a third-party project, it is not actively maintained. I may be wrong, but it is worth considering whether the Wikidata SPARQL endpoint might instead be more exemplary.

The limitations of SPARQL are often repeated, but when the paper says that Gravsearch combines the advantages, these are not listed, nor is the relation of these advantages to the issues faced clarified. I would suggest adding a short list of the gathered advantages, to then discuss point by point. This would also help to frame the paragraphs about the use of CONSTRUCT, which at the moment follow directly on this statement. For example, it looks like one of the main advantages is that the produced optimised SPARQL queries are much more complex than the ones passed to Gravsearch. This is only said in 2.5. Perhaps an example of this and of the kinds of optimisation made would help, or a reference to one of the examples provided which shows this. The example provided in section 3.1 is not in my opinion very telling in this respect, because the complexity of the resulting SPARQL query is in fact due to the actual Knora data structure.

The sentence starting in line 41 of page 3, first column, should in my opinion be placed entirely in a footnote.

The Ontology Schemas section 2.1 is probably the most crucial, yet it is very short. I would have liked a fuller example to be discussed here, or a reference to further relevant examples.

In section 2.3, Versioning, it is said that Gravsearch avoids the possibility of accidentally querying a past version. However, the previous sentence makes me wonder whether this is not already guaranteed by the absence of the property, and thus by the data, rather than by the software.

The last sentence of section 2.5 would, in my opinion, also need to be unpacked.

In relation to the first main issue, the first example in section 3.1 makes me wonder whether these issues are actually inherent in the data structure, because the example query would be a one-liner when querying a simple XML database or using standards like CorrespSearch.

Example 3 is also a bit unclear. A text in a p element is not very telling. What about a text marked up in TEI with something simple like Richtiegkeit, or alternative dates marked up in this way, or alternative interpretations by different authors or publications, as would typically be the case for humanities data?

In the conclusions, I would suggest clarifying a bit more why longevity is supported by this system; I really fail to see the connection. One more piece of software to access the data structure, be it web-API-like, does nothing self-evident to support longevity. Also, putting a layer in front of the SPARQL endpoint would seem to me to limit interoperability and ease of access, even though it certainly improves usability for application developers.

Review #2
Anonymous submitted on 13/Oct/2019
Minor Revision
Review Comment:

This paper presents Gravsearch, a query answering system that rewrites SPARQL CONSTRUCT queries over its own ontology into a number of SPARQL queries over the data provider's triplestore/ontology. Results are returned in JSON-LD. The actual query rewriting engine (apparently also in charge of reasoning) is called Knora.

This is an interesting tool for easing scholars' access to semantic data. However, I do have some observations about the paper and some concerns about the approach.

Some of the examples are split between the appendix and the main text. I suggest bringing them together.
There are several concepts in the paper that would benefit from examples: for instance, the ontologies in Section 2.1, the mappings between those ontologies and the ontologies in the triplestores, the set of queries resulting from the Knora translation, etc. A running example illustrating all of these concepts would be great. In particular, the Knora ontologies seem quite complex, so the authors should spend more time describing them. The date example illustrates the capability to virtually homogenize data, but the paper does not explain how this homogenization is done by Knora. Via mappings?

Section 1.2, related work, should mention other systems that have tried to ease the access to historical semantic data, such as "Ontology-based data integration in EPNet: Production and distribution of food during the Roman Empire".

There are some claims that I do not think are correct.
- "...there is no support for permissions or for versioning..."
Versioning and user permissions are supported in Stardog (although versioning is being deprecated). In Ontop you can, in principle, inherit the permissions from the underlying RDBMS.
- "...A SPARQL endpoint allows clients to request all the data"
In Virtuoso you can prevent users from querying all the data simply by setting ResultSetMaxRows=N.
I suggest going over those (and similar) claims and double-checking them against the triplestores' documentation.
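For reference, the Virtuoso setting mentioned above lives in the [SPARQL] section of virtuoso.ini; the value below is illustrative:

```ini
[SPARQL]
; Maximum number of rows a single query may return
ResultSetMaxRows = 10000
```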

The authors also claim that the SPARQL query is translated into several queries for performance reasons (p4). It is not clear from the paper how this can improve performance. It is not clear either why the system needs to infer types to build the query, or which OWL fragment is supported for reasoning.

Overall, I think that the paper is interesting and the tool can be useful to scholars. Once the authors tackle the issues mentioned above the paper will be ready for publication. 


Review #3
By Benjamin Cogrel submitted on 11/Nov/2019
Major Revision
Review Comment:

In this paper, the authors present the querying component of the Knora platform, which is dedicated to the management of datasets in the humanities sector. The platform is open source and comes with rich documentation. The system proposes several general features, like versioning and user permissions, but also features of particular interest for the humanities, mainly searching by dates and by markup in texts.

Their technical solution for providing these features is based on (i) a modified version of SPARQL, (ii) an alternative Web API to the SPARQL HTTP protocol followed by SPARQL endpoints, (iii) a query reformulation mechanism for translating the input query into one or multiple standard SPARQL queries executed over a triplestore and (iv) a core data model suitable for managing versioning and permissions.

This technical solution is motivated by the following claims:
1. There is an assumption in the design of triplestores that "everything in the triplestore should be accessible to the client, and thus offer no way to restrict query results according to the client’s permissions" (page 2, second column, line 47)
2. Triplestores "provide no way to work with data that has a version history" (page 3, first column, line 1)
3. "A SPARQL endpoint accepts queries that are processed directly by an RDF triplestore" (page 1, second column, line 33)
4. A SPARQL endpoint does not enforce pagination but "allows a client to request all the data" (page 2, second column, line 21). This seriously limits its scalability.
5. Their system provides the "scalability and [the] efficiency of a web API" (page 7, first column, line 33).
6. "SPARQL does not provide type checking; if a SPARQL query uses a property with an object that is not compatible with the property definition, the query will simply return no results" (page 4, second column, line 43).

Claim #1 is in my view excessive, as most triplestores (e.g. Virtuoso, GraphDB, Stardog) actually offer some solutions for access control. However, the key difference between these systems and what the authors have proposed is that, in the mentioned triplestores, access control policies are not part of the RDF data model but are specified in an out-of-band manner. The authors instead propose an in-band approach, where access control rules are part of their RDF data model. This difference deserves to be discussed as such, and this first claim to be moderated.

Regarding the second claim, several works have been done for modelling versioning in the RDF landscape, but none of them have been cited. Basic mechanisms for attaching context to triples, such as RDF reification and named graphs, have not been mentioned either. I would suggest that the authors have a look at the recent RDF*/SPARQL* proposal so as to enrich the references of the paper.
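To illustrate the named-graph mechanism referred to above, here is a minimal pure-Python sketch (the ex: identifiers are hypothetical, and this deliberately abstracts away from any concrete triplestore):

```python
# Quads (subject, predicate, object, graph): each version of a resource
# lives in its own named graph, so past versions stay queryable but
# do not leak into queries scoped to the current graph.
quads = {
    ("ex:letter1", "ex:title", "Draft title", "ex:versions/1"),
    ("ex:letter1", "ex:title", "Final title", "ex:versions/2"),
}

def titles_in_graph(graph):
    """Return title values asserted in the given named graph only."""
    return [o for (s, p, o, g) in quads if g == graph and p == "ex:title"]

print(titles_in_graph("ex:versions/2"))  # ['Final title']
```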

Claim #3 confuses the SPARQL endpoint, which is fundamentally an interface, with one of its possible implementations. Like any interface, a SPARQL endpoint can be implemented in many different ways. To take an example, consider an Ontology-Based Data Access (OBDA) system such as D2RQ or Ontop, where SPARQL queries received by the endpoint are translated into SQL queries executed over a legacy relational database. There is no triplestore on the back end, and the RDF and DB data models might be completely different. Said differently, it is perfectly possible to set up a SPARQL endpoint that performs query reformulation.

Claim #4 is valid. Enforcing a LIMIT that is kept implicit in the user query, as proposed by the authors, breaks the current version of the SPARQL HTTP protocol, as there is no standard mechanism to communicate to the client that a LIMIT has been applied. The client could therefore wrongly assume that all the results have been retrieved, which may not be the case. It could also be interesting to compare this with another solution where the user is required to include a LIMIT in the query and the limit value has to be less than or equal to a threshold. Both solutions have pros and cons which might be interesting enough to discuss.
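The threshold-based alternative could be sketched as follows; the function name and the 500-row threshold are hypothetical, and a real implementation would operate on the parsed query rather than on the query string:

```python
import re

MAX_LIMIT = 500  # hypothetical server-side threshold

def enforce_limit(query: str) -> str:
    """Require an explicit LIMIT <= MAX_LIMIT, or add one if absent."""
    m = re.search(r"\bLIMIT\s+(\d+)\s*$", query, re.IGNORECASE)
    if m is None:
        # Alternatively, reject the query outright instead of rewriting it.
        return f"{query}\nLIMIT {MAX_LIMIT}"
    if int(m.group(1)) > MAX_LIMIT:
        raise ValueError(f"LIMIT must be <= {MAX_LIMIT}")
    return query

print(enforce_limit("SELECT * WHERE { ?s ?p ?o } LIMIT 10"))
```

Unlike an implicit server-side LIMIT, this keeps the applied limit visible to the client, so no results can silently go missing.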

Regarding the scalability issues of SPARQL endpoints, the authors referred to a well-known blog post by Dave Rogers, which explains that the possibility of retrieving all the content of a SPARQL endpoint with a single query causes a severe scalability issue. However, the authors seem not to have paid enough attention to a comment on this blog post where Dave Rogers says that "pagination does not solve the problem of query complexity. With enough data, it is straightforward to write inefficient or complex SPARQL that returns only a few results". If the proposed Web API distinguishes itself from SPARQL endpoints by enforcing pagination, it still preserves the query complexity of the SPARQL query language. Such a Web API appears to be much closer to a regular SPARQL endpoint than to the ordinary Web APIs Dave Rogers was referring to (as scalable solutions). In the absence of any experimental results, it remains unclear what supports claim #5.

I agree that returning an empty result set for ill-formed queries instead of returning an error, as SPARQL does, is very counter-intuitive to many users. However, I found the description of the "type checking" mechanism unclear and not detailed enough. It also gives the impression that type inference is not always needed, as in the case of the "?linkProp" variable (page 5, first column, line 25), whose values might perhaps be retrieved by standard SPARQL mechanisms. To improve this description, I would suggest that the authors evaluate the possibility of aligning their proposal with existing well-defined Semantic Web standards such as SHACL.
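The contrast drawn above can be made concrete with a small sketch; the property names and the shape of the checker are hypothetical illustrations, not Knora's actual mechanism:

```python
# Expected object type per property (a stand-in for ontology definitions).
PROPERTY_RANGES = {
    "ex:creationDate": "Date",
    "ex:pageCount": "Integer",
}

def check_types(prop, value_type):
    """Raise a type error instead of silently matching nothing,
    as a plain SPARQL engine would."""
    expected = PROPERTY_RANGES.get(prop)
    if expected is not None and expected != value_type:
        raise TypeError(f"{prop} expects {expected}, got {value_type}")

check_types("ex:pageCount", "Integer")  # consistent: no error
try:
    check_types("ex:creationDate", "Integer")
except TypeError as e:
    print(e)  # ex:creationDate expects Date, got Integer
```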

In general, it would be highly appreciated to provide a clear presentation of the full query reformulation process by reusing standards and highlighting the similarities with existing mechanisms. Query reformulation rules introduced by GeoSPARQL and the notion of mapping between two data models proposed in the R2RML standard could be two sources of inspiration.

The current weakest point of this paper is the too preliminary comparison with the related work, which makes it hard to assess the relevance of the technical choices made. The bibliography currently includes only 7 citations, only one of which is a peer-reviewed paper, and therefore needs to be significantly enriched. I am confident that a good alignment with existing works from the Semantic Web field could overcome the "ad hoc" impression the paper currently gives.

As this paper has been submitted as a 'Tools and Systems Report', one would expect to find more about the impact of the tool and its limitations.

Other remarks

It would be interesting to mention which security model has been adopted (e.g. RBAC, ACL).

The use of a novel datatype for historical dates instead of xsd:date is a very interesting point which in my view would deserve to be further discussed, in particular when it comes to different levels of precision.

Contrary to what is stated on page 4, second column, line 14, solution modifiers such as ORDER BY and OFFSET actually do play a role for CONSTRUCT queries.

If we consider HyperGraphQL as an interface and not as an implementation, which limitations does it share with SPARQL endpoints?

To what extent does the function knora-api:match(...) differ from the standard function CONTAINS(...)?
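As I understand it (an assumption based on the Knora documentation, not on the paper), knora-api:match does word-level full-text matching, whereas the standard CONTAINS() is a raw substring test. The difference can be sketched in a few lines:

```python
import re

text = "the categories of the understanding"

def contains(t, s):
    """Substring test, like the standard SPARQL CONTAINS()."""
    return s in t

def match_word(t, s):
    """Whole-word matching (a rough stand-in for full-text matching)."""
    return re.search(rf"\b{re.escape(s)}\b", t) is not None

print(contains(text, "cat"))    # True: substring hit inside "categories"
print(match_word(text, "cat"))  # False: no whole word "cat"
```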