A Stitch in Time Saves Nine -- SPARQL Querying of Property Graphs using Gremlin Traversals

Tracking #: 1821-3034

Authors: 
Harsh Vrajesh Thakkar
Dharmen Punjani
Yashwant Keswani
Jens Lehmann
Soeren Auer

Responsible editor: 
Claudia d'Amato

Submission type: 
Full Paper
Abstract: 
Knowledge graphs have become popular over the past years and frequently rely on the Resource Description Framework (RDF) or Property Graphs (PG) as underlying data models. However, the query languages for these two data models -- SPARQL for RDF and Gremlin for property graph traversal -- lack interoperability. We present Gremlinator, a novel SPARQL to Gremlin translator. Gremlinator translates SPARQL queries to Gremlin traversals for executing graph pattern matching queries over graph databases. This allows a wide variety of Graph Data Management Systems (DMS) to be accessed and queried using the W3C-standardized SPARQL query language and avoids the learning curve of a new graph query language. Gremlin is a system-agnostic traversal language covering both OLTP graph databases and OLAP graph processors, making it a desirable choice for supporting interoperability with respect to querying Graph DMSs. We present a comprehensive empirical evaluation of Gremlinator and demonstrate its validity and applicability by executing SPARQL queries on top of the leading graph stores Neo4J, Sparksee, and Apache TinkerGraph and comparing the performance with the RDF stores Virtuoso, 4Store, and JenaTDB. Our evaluation demonstrates the substantial performance gain obtained by the Gremlin counterparts of the SPARQL queries, especially for star-shaped and complex queries.
Full PDF Version: 
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
Anonymous submitted on 22/Feb/2018
Suggestion:
Reject
Review Comment:

This paper describes a translation from (a subset of) SPARQL to the Gremlin query language. The motivation behind this work is to bridge the gap (with respect to querying) between the RDF data model and the property graph data model. The authors give an informal description of the translation, also providing examples, and report on an experimental evaluation involving different RDF stores and Graph Data Management Systems.

Pros:
(i) the paper touches an interesting problem; it is very relevant to offer a unified way to query graph data stored in different formats;

(ii) the experimental evaluation is comprehensive, although not particularly informative about the core topic of the paper, that is, the translation of SPARQL into Gremlin.

Cons:
(i) the paper lacks any formal treatment of the translation;

(ii) the authors do not properly position their work with respect to previous work;

(iii) one can use an already available translation chain SPARQL -> SQL -> Gremlin without the need for another translation.

Detailed Review:
I have to admit that I was very excited when reading the second part of the title. In this respect, I do not really understand what "A Stitch in Time Saves Nine" adds to that second part. However, as soon as I started to read the paper, my enthusiasm vanished.

The authors make a series of statements that need to be either removed or properly backed up. Examples are: "corresponding data management techniques have distinct and complementary characteristics" or "still largely disjoint semantic and graph data technology ecosystems". In this latter respect, the authors state that Property Graphs "support extremely scalable storage"; at this point one can wonder: does that mean that SPARQL cannot? If so, then provide a reference to a comparative study!

The authors state that "this is the first work addressing the query interoperability (translation) problem." Even here I do not fully agree; there are existing translators that make it possible to go from SPARQL to Gremlin. One can simply use the translation chain SPARQL -> SQL -> Gremlin. The first part of the translation is available in almost any SPARQL processor (e.g., Virtuoso), while the second one is available, e.g., at http://sql2gremlin.com. What one needs to add is the SPARQL -> SQL translation step, which in general takes less than 0.5 seconds (as the authors also show for their own case)! Hence, the question is: do we actually need another translator?
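
For illustration only (this example is not taken from the submission; the schema and prefixes are made up), the chain mentioned above would look roughly as follows for a simple query:

    SPARQL:  SELECT ?name WHERE { ?p a :Person ; :name ?name }
    SQL:     SELECT name FROM Person
    Gremlin: g.V().hasLabel('person').values('name')

The SPARQL -> SQL step is what a relational-backed SPARQL processor performs internally, and the SQL -> Gremlin step corresponds to the kind of recipes documented at http://sql2gremlin.com.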

My answer to this question is yes! We would need a formally backed-up translator showing (i) how one can translate SPARQL to Gremlin, and (ii) that the results of the original query and of the translated query are the same.

What could have made a difference would have been a formal treatment of the SPARQL to Gremlin translation problem. However, this aspect has been completely overlooked in this paper.

I also strongly disagree with the approach followed by the authors: comparing result counts is not a valid indicator of the correctness of the translation. It may be the case that the counts are the same but the actual results are different.
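
To make this concrete with a hypothetical example: suppose the original SPARQL query returns the bindings {?name = "Alice", ?name = "Bob"} while the translated traversal returns {?name = "Alice", ?name = "Carol"}. Both result sets have a count of 2, so a count-based check would report success, yet only an element-wise comparison of the actual solution mappings would reveal the mismatch.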

The authors should consider tackling the problem from a more formal perspective. It is not about "re-inventing the wheel by re-defining formal concepts and proofs (which already have been addressed in the works"; it is about the soundness of the proposal.

In terms of related/previous work, it is not at all clear what the differences of the present submission are with respect to ref. [12], where the translation is discussed (with a slightly different set of authors). Usually, in a journal submission one should explicitly mention the improvement with respect to previous work. This has not been done in the present submission.

Ditto for another existing translator, SPARQL2Gremlin (https://github.com/dkuppitz/); the website just mentions that the translation presented here is a fork of that one. Finally, the authors should also consider comparing their proposal with the SQL2Gremlin translator and the (longer) chain SPARQL -> SQL -> Gremlin.

As for the evaluation, I do not really see what the authors are showing. It seems that the goal is to compare the performance of RDF stores with that of graph databases. However, the paper is not about this. Moreover, the authors are probably aware that the recently announced Amazon Neptune (https://aws.amazon.com/neptune/) already integrates RDF and property graphs and the respective query languages in a "fast, reliable, fully-managed graph database service".

Finally, I do not really get what the sentence "but this would imply performing a linear search through the entire dataset (set of triples) for each query that is executed" means.


Some typos:

"in brief with respect to. context “ page 5

Fig.4 overlaps with the text

Review #2
Anonymous submitted on 19/Mar/2018
Suggestion:
Major Revision
Review Comment:

The paper approaches the problem of translating queries between graph database systems, specifically how to translate SPARQL SELECT queries to Gremlin traversals. The paper introduces an openly available implementation and gives an extensive evaluation of the approach. The manuscript introduces an original contribution.

However, I see several major issues with the manuscript:

1. Motivation/Evaluation: The contribution aims mainly at a performance gain when evaluating SPARQL queries on property graph stores. This can be seen from the introduction and is also consistent with the evaluation findings. Surprisingly, querying (legacy) property graph stores with SPARQL is not part of the motivation.
It is therefore confusing to find property graph data as the basis for the example (Figures 2 and 3) as well as for the Northwind evaluation (a relational benchmark that is first converted to PGs and then to RDF) when the motivation is (apparently) to query RDF data with SPARQL.

2. Mapping RDF <-> PG: A mapping from the PG data model to RDF is missing completely. There already exist several standard approaches to express PG data in RDF (e.g., [2], [61], or Hernandez, Hogan, Kroetzsch: Reifying RDF: What Works Well With Wikidata?). At least a description and a justification are necessary; a minimal sketch of one such rendering is given after this list.
For data originally available as RDF (e.g., BSBM), a mapping from RDF to PG is also necessary.
Of course, a mapping that works in both directions is also sufficient.

3. Presentation: The manuscript shows several weaknesses regarding the presentation.
- order of paragraphs (e.g., in the introduction the approach is presented before the research problems are listed)
- confusing definitions/preliminaries; examples: the term "elements" is never defined. Section 3.4.1 and especially Equation (1) (what do the arrows mean? what does the line mean?) are too compact and cannot be understood without consulting the references. Since a journal article must be self-contained, a more extensive preliminaries section is necessary. At the same time, introducing terms that are not needed in the main part of the paper should be avoided.
Another example is Definition 3.5: what is L_e?
A reader not used to PG notions and terms can hardly understand the manuscript in this form.
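
To make point 2 above concrete, here is a minimal sketch (not taken from the submission; the vocabulary is illustrative) of one standard way to express a property graph edge carrying an edge property in RDF, namely via RDF reification:

    Property graph edge:  (alice) -[knows {since: 2010}]-> (bob)

    One possible RDF rendering (Turtle):
      :alice :knows :bob .
      _:e  a rdf:Statement ;
           rdf:subject   :alice ;
           rdf:predicate :knows ;
           rdf:object    :bob ;
           :since        2010 .

Other renderings (e.g., named graphs or singleton properties, as discussed in the cited Wikidata paper) are possible; the point is that the manuscript should state which one is assumed.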

Minor issues:

Evaluation: it is not clear why new queries had to be devised, for example for BSBM, which already comes with sets of queries.

The definition in Equation (2) is repeated from the preliminaries and is actually wrong: a triple pattern consists of exactly one pattern; it can *not* be empty (footnote 16). Refer to the SPARQL standard and the literature cited in the preliminaries.

Table 2 does not contain the aggregate operators (count, sum,...).

The "three-dot-operator" (\because in LaTeX) used in Equation (6) is uncommon in the SWJ and should be avoided or explained.

The definition of "path" is unneccessarily complicated.

Ideally, correctness would be shown formally.

Review #3
By Olaf Hartig submitted on 30/Jun/2018
Suggestion:
Major Revision
Review Comment:

This manuscript introduces an approach to query Property Graphs (PGs) using SPARQL by translating SPARQL queries into Gremlin traversal expressions. Additionally, the manuscript presents an empirical evaluation in which the authors have used their approach to compare query execution times of a) SPARQL queries over PGs in different Gremlin-aware graph database systems versus b) SPARQL queries over RDF data in different RDF triple stores.

ORIGINALITY

Achieving interoperability of different types of graph-based data management systems is a timely topic--in particular for the two data models that are most prevalent in this context, namely: RDF and the PG model. The idea to use SPARQL as a common query language is reasonable. To the best of my knowledge, this manuscript would be the first journal article related to this idea.

SIGNIFICANCE OF THE RESULTS

The authors have implemented their approach in a well-documented and easy-to-use prototype. This prototype may become a useful tool for various projects. However, in terms of scientific contributions the presented work is quite limited. That is, the description of the translation approach lacks scientific preciseness, remains vague in terms of almost all aspects of the approach, and even contains various formal incorrectnesses. Furthermore, properties and characteristics of the approach have not been studied in the presented work. Regarding the experimental evaluation, while some of the observations may be interesting, the manuscript provides very little to explain or to understand the observations. For more details regarding these points, see below.

Moreover, the authors continuously present system-specific observations and characteristics as general properties of the respectively supported data models and query languages. This issue already occurs in the introduction where, for example, the authors claim that "PGs [...] support extremely scalable storage and querying." Scalability of data storage and of query processing cannot be properties of an abstract data model per se; instead, these may be properties of specific data storage or query processing techniques and, thus, may be offered by specific systems that implement such techniques. Other techniques that are similarly efficient may be developed for other data models (such as RDF). The same confusion recurs throughout the manuscript and culminates in the description of the experimental observations, where the authors repeatedly speak about SPARQL versus Gremlin when referring to the behavior of the selected RDF triple stores and the selected property graph systems--just three of many examples, all quoted from Sec.6.5: "SPARQL performs moderately faster" and "SPARQL queries reap the most benefits of warm caching" and "Gremlin traversals perform considerately better [...] by leveraging the advantage of underlying property graph data model (locality)." The latter example is a very typical instance of the issue: locality may be achieved by a physical storage schema but not as a result of using a logical data model (such as PGs); there also exist physical storage schemas that achieve locality when storing RDF data (for instance, take a look at the papers about GStore by Tamer Özsu et al.).

QUALITY OF WRITING

The organization of the manuscript is reasonable and, on a high level, it is possible to follow what is going on. However, when looking at the details, the text is often handwavy and many parts are not sufficiently clear. In particular, the authors use terminology without introducing it; similarly, most formal notions are not introduced properly. Perhaps the most significant issue, however, is the many small conceptual mistakes that can be found all over the manuscript. A few examples of each of these issues:

1/ terminology that has not been introduced
* what is meant by "graph databases" in the second paragraph of Sec.1?
* what is meant by "path queries" in the second paragraph of Sec.3.4.2 (and several other places)?
* what is a "simple path traversal" (second paragraph, Sec.4.1)?
* what is "currying" (third paragraph, Sec.4.1)?

2/ formal notions that are not introduced properly
* the notion of a "graph pattern" as used in Def.3.3 has not been introduced
* what are all the expressions that are typically denoted by \psi or \Psi in the manuscript? What is the grammar of this language?
* What does the superscript + in formula (5) mean for "traversal" steps?

3/ conceptual mistakes
* In most cases in which Sec.4 mentions BGPs, the actual concept referred to are triple patterns. For instance, Table 1 is only about triple patterns.
* Last paragraph of Sec.4.3: Creating "[sub]graphs based on specific graph patterns" has nothing to do with the concept of named graphs (instead, I would say it is more similar to views).
* Sec.5.1: rdfs:label is *not* a prefix
* Sec.6.5: "...for star-shaped queries (queries with bushy plans..." -- star-shaped queries do not necessarily need to be compiled into bushy plans
* Sec.6.6: "...perform expensive joins [...] on top of executing aggregation operations..." -- the given test queries do not result in execution plans that have joins after aggregation operations

DETAILS REGARDING ISSUES WITH THE PRESENTATION OF THE APPROACH

General issues:

* The approach seems to assume an RDF view of the PG to be queried and this view seems to be based on a specific RDF vocabulary (containing URIs such as v:label) and a specific way to create URIs (e.g., v:name, e:knows). However, none of this is mentioned in the manuscript.

* The whole introduction of Gremlin is unclear and confusing (in particular, the notions of Gremlin machine and Gremlin traversals, which are key to the presented work).

* The introduction of the translation of "BGPs" (Sec.4.2) actually focuses only on triple patterns (Table 1 is only about triple patterns). There is no discussion anywhere in the manuscript of how an actual BGP (i.e., a set of triple patterns) is translated.

* The translation of triple patterns (Table 1) is presented only for specific example triple patterns. There is no precise (formal) definition for the general case of triple patterns of the various possible forms. In other words, given an arbitrary triple pattern, how do I know which of the examples in Table 1 this triple pattern corresponds to? (An illustration of the kind of form-driven rule needed is sketched after this list.)

* The description of the translation of the full fragment of SPARQL 1.0 is incomplete and inconsistent. More precisely, the grammar in formula (3) does not correspond to SPARQL 1.0 and, furthermore, is incorrect (according to the given grammar, every query has to contain at least *two* BGPs). Moreover, the translations of some of the operators mentioned in this grammar are not captured in Table 2 (OPT, DIFF), and one (DIFF) is not mentioned at all in the text. Similarly, Algorithm 1 does not cover all cases (OPT, DIFF, SELECT) and, generally, has a lot of issues (e.g., the meaning of lines 7, 12, 20, etc., is unclear; lines 21, 24, 28, 30 should use GT instead of T; it is not clear why it is sufficient for the output of the algorithm to be a set of expressions--shouldn't there be an order?).
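
To illustrate the kind of form-driven rule the triple-pattern bullet above asks for (this is an assumption about what such a rule could look like, not a description of the authors' translation): the shape of a triple pattern is determined by which of its positions are variables, and each shape would need its own traversal template, for instance:

    ?s :knows :bob   ->  g.V().as('s').out('knows').hasId('bob').select('s')
    ?s :knows ?o     ->  g.V().as('s').out('knows').as('o').select('s','o')
    ?s ?p ?o         ->  g.V().as('s').outE().as('p').inV().as('o').select('s','p','o')

A precise definition would enumerate all such cases and state how the resulting traversals are composed for a full BGP.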

In addition to these general issues, there are a great number of small formal incorrectnesses and missing details. In the following, I list examples from Sec.3. Similar examples can be found in Sec.4.

* Def.3.1 - How exactly is this related to the RDF data model?
* Def.3.2 - How is the set Lab related to the function \lambda?
* Def.3.2 - What is R and S?
* Def.3.3 - Formally, it is not correct that the mappings mentioned in point 1 are already given as part of the "graph pattern" P.
* Def.3.4 - What are the symbols s, ?s, p, ?p, o, ?o ???
* Def.3.4 - It is not clear how the two sentences in this definition are related. What is the relationship between pattern matching and the notion of a BGP?
* Sec.3.3.1 - It must be made clear that the notion of a "graph pattern P" here is different from the notion with the same name as mentioned in Def.3.3. Ideally, different terminology and symbols should not be overloaded.
* Sec.3.4.1 - What is T? What is \Psi? What are "the instructions in \Psi"?
* Sec.3.4.1, bullet point 1 - *where* are "no more existing traversers"?
* Sec.3.4.1, below formula (1) - the symbol \mu should not be used with different meanings (it has been used before in Def.3.2 with a different meaning)
* Sec.3.4.1, formula (1) - even with the explanation of the symbols, I do not understand the formula since it is not a typical algebraic formula. What do the arrows mean? What is the purpose of the fraction line? etc?
* Sec.3.4.2 - What are f, g, and h?
* Sec.3.4.2 - Here, the symbol \Psi seems to denote some type of a set. Later, in Sec.4.3 it seems to denote some kind of expression instead (of some unspecified grammar). This is highly confusing.
* Sec.3.4.2 - What are A and B?
* Sec.3.5, bullet point 2 - What is meant by "variables" here?

DETAILS REGARDING ISSUES WITH THE EXPERIMENTAL EVALUATION

* It is not clear what is the purpose of the evaluation.

* How has the RDF version of the Northwind dataset been created? Similarly, how has the PG version of the BSBM dataset been generated?

* According to Sec.6.3, the authors "compared the results returned by each of these queries for correctness." I would like to know how exactly this was done (in particular, given that the datasets contained 33K and 1M triples).

* The conclusion at the end of Sec.6.3 should not claim generality. Instead, "the proposed SPARQL -> Gremlin translation approach is correct" for the 30 queries considered by the authors, which does not mean it is correct in general!

* Regarding the notion of warm cache versus cold cache, it should be made clear that the approach used to clear the cache focuses on the operating system cache rather than a cache that is built into the respective systems. Consequently, the observations regarding warm versus cold cache may be of limited relevance.

* Given that the authors report averages of 10 runs, the charts must also contain error bars. Any point in a chart that represents an average should have an error bar that indicates the error of the average (e.g., in terms of one standard deviation); a minimal sketch of this is given after this list. If the standard deviations for different points are high and overlapping, it is not possible to draw any conclusions from the averages!

* The first two sentences at the top left of page 18 do not sufficiently explain why UNION should be implemented less efficiently in a Gremlin-based system than in an RDF system.

* As mentioned before, the given test queries do not result in execution plans that have joins after aggregation operations and, thus, it cannot be the case that RDF triple stores "perform expensive joins [...] on top of executing aggregation operations."

* I do not see a reason why a "greater number of projection variables" should be disadvantageous for RDF triple stores. Why is that?

* BSBM does *not* offer "the flexibility of generating graphs of custom [...] density." (Sec.6.1)
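
As a sketch of what the error-bar point above would amount to (this concerns only the reporting, not the experiments themselves): for a data point backed by runtimes t_1, ..., t_10, report the mean t_avg = (1/10) * (t_1 + ... + t_10) together with the sample standard deviation s = sqrt( ((t_1 - t_avg)^2 + ... + (t_10 - t_avg)^2) / 9 ), and draw the bar as t_avg +/- s (or +/- s/sqrt(10) if the standard error of the mean is preferred).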


Comments

Dear Editor, Reviewers, and Readers,

Due to a very large IT infrastructure changes at the University of Bonn, the link of our system demonstration (http://gremlinator.iai.uni-bonn.de:8080/Demo) is currently not accessible. This issue will be resolved soon. Meanwhile, may we request you to visit a mirror hosted at http://195.201.31.31:8080/Demo/ for the demo.