Gravsearch: transforming SPARQL to query humanities data

Tracking #: 2412-3626

Authors: 
Tobias Schweizer
Benjamin Geer

Responsible editor: 
Special Issue Cultural Heritage 2019

Submission type: 
Tool/System Report
Abstract: 
RDF triplestores have become an appealing option for storing and publishing humanities data, but available technologies for querying this data have drawbacks that make them unsuitable for many applications. Gravsearch (Virtual Graph Search), a SPARQL transformer developed as part of a web-based API, is designed to support complex searches that are desirable in humanities research, while avoiding these disadvantages. It does this by introducing server software that mediates between the client and the triplestore, transforming an input SPARQL query into one or more queries executed by the triplestore. This design suggests a practical way to go beyond some limitations of the ways that RDF data has generally been made available.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Pietro Liuzzo submitted on 29/Jan/2020
Suggestion:
Minor Revision
Review Comment:

This is to complement my previous review, all of whose main points have been addressed by the authors.
- headers still have the old title
- references 1 and 2, about the Virtuoso SPARQL Query Editor, do not support the statement made in the second sentence; either the references or the statement should be adjusted, I think. If these Virtuoso editors are what is meant, then why not also add the others to the reference list?
- in section 1.1, "It not feasible": perhaps a word is missing
- in section 1.2, imprecise dates in the humanities are often not just limited to a year instead of being precise to the day. There are dates that are only relative (the 23rd year of the Tribunicia Potestas of Emperor X), or in calendar systems that have nothing Julian about them, e.g. the year of the Priestess of Hera, or the third year of the 56th Olympiad; the list of cases could be much longer, including many instances where the month and day, and sometimes even the hour, are known but the year is not. The Julian Day Number proposed as a solution certainly helps, but it requires that a formal date be aligned with the date expressed in a different calendar, eventually building unwanted spurious exactitude into the data. That is, in my opinion, exactly where period definitions are fundamental. The example provided is just an example, but perhaps the authors could consider using a bit more caution with regard to the complexity of this issue in humanities data (a minimal sketch of the Julian Day Number mechanics follows this list).
- in section 2.1, paragraph 2, "SPAQRL" is used twice instead of "SPARQL"
- in section 2.2, paragraph 2, dcterms is set in normal font; there are inconsistencies of this kind throughout the paper (see also CONSTRUCT, which appears sometimes in one font and sometimes in another)
- reference 22 in note 17 would perhaps be better complemented or replaced by a direct reference to ONTOP, which is in my opinion more relevant to the discussion
- note 23 would benefit from a link to the exact page and a full reference.
- unfortunately the conclusion is a bit circular, as I understand it: Gravsearch stores data in a way suitable for long-term preservation, and it consequently serves to ensure longevity. Permissions and interoperability also enter the conclusion, but in a disjointed way, and each could do with one concluding sentence. The presentation in the article is full and complex, and the current formulation of the conclusion diminishes its value; I would therefore recommend that the authors invest some time in finding a strong concluding formulation in which:
- - the features of the approach which sustain long-term preservation are briefly listed,
- - the features which make the approach more appropriate for web applications are briefly spelled out.
- The last sentence can simply and safely be omitted, as it does not add anything to the previous statements but merely attempts to link this discussion to a more generic and unreferenced discussion within "digital humanities".
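
To make the date point above more concrete, here is a minimal sketch of the Julian Day Number mechanics. The ontology terms (ex:hasDate, ex:dateHasStartJDN, ex:dateHasEndJDN) are hypothetical and are not taken from the paper or from its ontologies; the sketch only shows how an imprecise date, stored as a JDN range, can be compared against a cut-off.

PREFIX ex: <http://example.org/ontology#>

# Find letters whose (possibly imprecise) date lies entirely before
# 1 January 1800 (Gregorian), i.e. before Julian Day Number 2378497.
# An imprecise date such as "sometime in 1755" would be stored as the
# JDN range covering the whole year, so it still matches this filter.
SELECT ?letter ?startJDN ?endJDN
WHERE {
  ?letter a ex:Letter ;
          ex:hasDate ?date .
  ?date ex:dateHasStartJDN ?startJDN ;
        ex:dateHasEndJDN ?endJDN .
  FILTER(?endJDN < 2378497)
}

As noted above, forcing every relative or non-Julian date into such a range can create spurious exactitude; the sketch is only meant to show the mechanics, not to settle that issue.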

Review #2
By Martin Rezk submitted on 03/Mar/2020
Suggestion:
Accept
Review Comment:

As I mentioned in my previous review, this is an interesting tool for easing scholars' access to semantic data. I still think that the description of the query translation process is obscure, but the value of the paper in providing a tool that uses semantic technologies to share humanities data outweighs that issue, so I will accept the paper. In addition, the documentation at https://docs.knora.org/paradox/index.html seems to be quite complete.
If the authors want to introduce some improvements in the final version, I would suggest:
1) Simplify the examples: the reader does not need the real URLs in the queries, just an intuition of what the query looks like.
2) Improve Section 2.2: this section is supposed to describe the ontologies, but it mixes them with the queries, the results, and the standards. I would suggest keeping it simple and to the point: "this is ontology A, and this is ontology B; this is how they are similar, this is how they differ, and these are the possible mappings."
3) Add some diagrams to illustrate the workflow of the system and what is used where.

Review #3
By Benjamin Cogrel submitted on 08/Mar/2020
Suggestion:
Accept
Review Comment:

The paper has been significantly improved in this revision. In particular, the new section 1.1, "Institutional and technical context", provides a clear motivation for the proposed system, namely the long-term preservation of datasets in the humanities sector. From this long-term perspective, vendor-specific solutions are avoided to prevent vendor lock-in; instead, the authors rely on standards or on self-descriptive, in-band solutions in which user permissions and versioning are included directly in the RDF graphs.
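
As a purely illustrative sketch of what such an in-band approach can look like (the terms ex:hasValue, ex:hasPermission, ex:publicViewer and ex:isLatestVersion are invented for this example and are not the paper's actual data model), permissions and version status become ordinary triples that a mediating server can filter on with standard SPARQL rather than with a vendor-specific access-control or versioning layer:

PREFIX ex: <http://example.org/ontology#>

# Return only values that the public may see and that are current;
# both facts are stated as data in the graph itself.
SELECT ?resource ?value
WHERE {
  ?resource ex:hasValue ?value .
  ?value ex:hasPermission ex:publicViewer ;
         ex:isLatestVersion true .
}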

The positioning with respect to existing works and standards has been improved and has, in my view, reached an acceptable state. Most of my comments have been addressed, in particular in the new section 2.1, "Scope of Gravsearch".

Some too-broad usages of the term "SPARQL endpoint" are still present in the text (they were already there in the first version). I think it would be better to align them with the new positioning of the paper.
* Page 4, column 2, line 43: "This extra layer of processing enables Gravsearch to avoid the disadvantages of SPARQL endpoints and to provide additional features."
* Page 6, second column, row 50: "With a SPARQL endpoint, there would be no way to prevent other users from querying the value".

Similarly, the comparison with HyperGraphQL has not evolved and remains unconvincing to me. Offering a GraphQL/HyperGraphQL interface could actually be an interesting perspective for the presented work, so as to further improve the system's scalability. This would probably be compatible with most of the proposals made in this paper, except the current pagination mechanism, which could in any case be replaced by a GraphQL-specific one. Consequently, I would suggest that the authors remove HyperGraphQL from the related work section and instead mention it in section 2.1 as a direction for future investigation of the delicate balance between expressivity and scalability.

While some small improvements can still be made, I think they can easily be addressed in a camera-ready version. I am therefore in favor of having this paper accepted.

Minor comments:

* Page 12, column 2, line 16: "but to an API server rather than to the triplestore". Perhaps "virtual endpoint" would be more appropriate than "API server", as a triplestore can also be viewed as an API server.
* Instead of changing the semantics of the SPARQL OFFSET, which can be very confusing, one could have introduced a novel construct like PAGE_OFFSET (see the sketch after these comments).
* How does the client know that it has retrieved all the results? By asking for the following page and receiving an empty answer?
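
To illustrate the last two points, here is a hypothetical sketch. PAGE_OFFSET is not part of SPARQL, nor, as far as I know, of Gravsearch, and the ex: vocabulary is invented; the two queries below are separate examples, not a single request.

PREFIX ex: <http://example.org/ontology#>

# Standard SPARQL semantics: OFFSET skips a number of solution rows,
# here the first 50.
SELECT ?book WHERE { ?book a ex:Book } LIMIT 25 OFFSET 50

# Hypothetical page-based construct: request the third page of results,
# whatever the server-defined page size is, leaving OFFSET's standard
# meaning untouched. Under the assumption raised in the last question,
# a client would ask for pages 0, 1, 2, ... and stop when a page comes
# back empty.
SELECT ?book WHERE { ?book a ex:Book } PAGE_OFFSET 2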