Star Pattern Fragments: Accessing Knowledge Graphs through Star Patterns

Tracking #: 2472-3686

Authors: 
Christian Aebeloe
Ilkcan Keles
Gabriela Montoya
Katja Hose

Responsible editor: 
Guest Editors Web of Data 2020

Submission type: 
Full Paper
Abstract: 
SPARQL endpoints offer access to a vast amount of interlinked information. While they offer a well-defined interface for efficiently retrieving results for complex SPARQL queries, complex query loads can easily overload or crash endpoints as all the computational load of answering the queries resides entirely with the server hosting the endpoint. Recently proposed interfaces, such as Triple Pattern Fragments, have therefore shifted some of the query processing load from the server to the client at the expense of increased network traffic in the case of non-selective triple patterns. This paper therefore proposes Star Pattern Fragments (SPF), an RDF interface enabling a better load balancing between server and client by decomposing SPARQL queries into star-shaped subqueries, evaluating them on the server side. Experiments using synthetic data (WatDiv), as well as real data (DBpedia), show that SPF does not only significantly reduce network traffic, it is also up to two orders of magnitude faster than the state-of-the-art interfaces under high query load.
Full PDF Version: 
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Click to Expand/Collapse
Review #1
By Olaf Hartig submitted on 05/Jul/2020
Suggestion:
Minor Revision
Review Comment:

This manuscript studies a new type of Web interface to access and to query RDF data. By doing so, it provides a timely and interesting contribution to a line of research in the community with the goal to identify alternatives to the SPARQL endpoint interface. The particular new type of interface that this manuscript introduces is called the Star Pattern Fragments (SPF) interface.

Conceptually, this type of interface is an extension of the Bindings-Restricted Triple Pattern Fragments (brTPF) interface. Instead of enabling clients to combine only a single triple pattern with a set of solution mappings, the innovative extension of the SPF interface is to allow for multiple triple patterns that form a star pattern. By a comprehensive set of experiments, the authors show that client-server query execution processes with their new interface outperform corresponding brTPF-based processes in terms of reduced network traffic and increased query throughput. Additionally, the experiments nicely show the various properties of the authors' approach (including a good explanation of the observations). Hence, in summary, the idea of the SPF interface is original and novel, and the manuscript shows that this interface presents a significant advancement of the state of the art.

In addition to the originality and the significance of the results in the manuscript, another strength of the manuscript is that it is very well written. More specifically, the work is well motivated in Section 1; Section 2 gives a good overview of the state of the art and of the relationship to the work presented in the manuscript. Sections 4 and 5 provide an excellent description of the approach; that is, the high-level ideas, formal definitions, and all necessary details are presented in an easy-to-understand manner. Similarly, the description of the experiments is excellent and should serve as an example for many other papers in the community.

The only issue that should be addressed is the following problem related to the notion of a "subject-based star decomposition" (Def.12): There are BGPs for which multiple such decompositions exist, which has several consequences. First of all, Def.12 cannot be formulated as if there is always only one such decomposition (i.e., do not write "the ...decomposition" but "a ...decomposition"). Second, the manuscript has to detail how exactly the authors decompose a given BGP into a subject-based star decomposition (I assume some form of a greedy algorithm that aims to maximize the size of the star patterns).

Another issue is that the experiments do not consider SaGe, which is another state-of-the-art approach that has also been shown to outperform brTPF-based client-server systems. While I think that the presented work is already sufficiently comprehensive and a comparison to SaGe may be deferred to future work, I would like the manuscript to be more explicit about this limitation.

A few additional minor issues that are easy to fix:
* Algorithm 1, line 10: instead of $tp_j \in S_B$, it should be $S_j \in S_B$
* Algorithm 1, line 12: the subscript of the big union operator needs to include: \in \phi^\epsilon
* Algorithm 1, lines 15+16 should be removed because the execution of lines 2+3 in the recursion makes them redundant
* line 28 in the right column of page 10: It should be StarPatternIterator (rather than TriplePatternIterator)
* Please extend the "Hardware Setup" paragraph in Sec.6.1 with a sentence about the network connection between the VM and the server.
* Footnote 12: be explicit about what exactly the "large number" was

Review #2
Anonymous submitted on 20/Jul/2020
Suggestion:
Reject
Review Comment:

This paper proposes a technique, called Star Pattern Fragments (SPF), to offload query processing from a SPARQL endpoint to a client. The authors claim that complex queries can overload or crash endpoints. Thus, SPF decomposes a SPARQL query into star-shaped subqueries and evaluates them on the server-side. The intermediate results of these subqueries have to be joined by the client. The SPF technique is similar to star-shaped subqueries techniques developed by federated RDF engines cited by the authors [24–29]. These techniques rely solely on schema information. In a recent study, Lusail [R1] demonstrated several limitations of these techniques. Overall, the novelty of SPF is very limited and rely on outdated techniques. Moreover, the motivation is very week as clients usually have less computing power, i.e., cannot handle substantial intermediate results. This motivation also contradicts the findings of distributed RDF engines [R2]. In addition to these limitations, SPF is an extension of brTPF [10].

Strong points are:
S1- The paper is easy to follow.
S2- The evaluation uses real and synthetic data sets.

Weak points are:
W1- The motivation and applicability of the proposed solution are not justified.
W2- The novelty is very limited.
W2- The conducted experiments need improvements.

Detailed feedback:

D1- motivation and applicability

The authors mentioned that "complex query loads can easily overload or crash endpoints as all the computational load of answering the queries resides entirely with the server hosting the endpoint." The authors did not support this claim, which contradicts the findings of distributed RDF engines, as reported by [R2]. Moreover, there are simple solutions to avoid complex queries, such as using an adjustable timeout based on the current workload on a server.

The applicability of the proposed solution is not clear. The authors said, "While the server evaluates individual triple patterns, the client handles the remaining query processing tasks. This increases the availability of the server and ensures more efficient query processing during periods with high load." These individual triple patterns may lead to huge intermediate results, as reported by [R1]. Moreover, the computing power of a client is usually less than the computing power of a server. In SPF, handling a complex query is almost like a "hot potato" problem.

Can SPF detect the complexity of a given query while taking into consideration the computing power of the sever?

Will SPF handle all queries equally even if the server can easily answer the query?

D2- The novelty is very limited

The authors said, "This section contains a formal definition of Star Pattern Fragments (SPF), as an extension of brTPF [10]". So, the idea is not original.

Moreover, the authors said the federated "techniques are similar to the approach presented in this paper since they also utilize star-shaped decomposition. However, such approaches execute star-shaped subqueries over SPARQL end-points." The similarity is in the decomposition technique, which is the core of the claimed contribution. Moreover, the federated systems deal with multiple SPARQL end-points, i.e., a more challenging situation.

Overall, these decomposition techniques are based on schema information. Thus, these techniques suffer several scalability limitations, as demonstrated by Lusail [R1].

D3- Experimental evaluation

The conducted experiments need significant improvements in terms of:
A) Dataset size should be large, i.e, in order of billions of triples,
B) the existing methods, such as FedX [25] and Lusail [R1] should be used as baselines for comparison, and
C) Evaluate using more realistic queries in benchmarks, such as all queries in LUBM and LargeRDFBench.

D3.A- Nowadays, the available RDF datasets are in order of billions of triples. Consider the following examples:
H2RDF+ reported up to 13.8 billion triples
DREAM reported up to 7 billion triples
AdPart reported up to ~5 billion triples
Trinity.RDF reported up to 9 billion triples, the one used in Table 4.
TriAD reported also 1.3 billion triples using LUBM.

The authors used only two datasets of one billion triples, namely watdiv1000M and DBpedia. The evaluation would be stronger if the authors used large datasets, i.e., more than a billion triples, of synthetic and real data.

D3.B- (Baseline)

The authors confirmed that the federated "techniques are similar to the approach presented in this paper since they also utilize star-shaped decomposition. However, such approaches execute star-shaped subqueries over SPARQL end-points." The evaluation would be stronger if the authors compared their technique with FedX [25] and Lusail [R1]. Dealing with one SPARQL endpoint is a less challenging setting for both FedX [25] and Lusail [R1].

D3.B- (Queries)

The queries should be of different complexities, such as the queries in LUBM and LargeRDFBench. For example, LargeRDFBench has ten queries of a "complex category", i.e., a high number of triple patterns and advanced SPARQL clauses, and eight queries of a "large category", i.e., with massive intermediate results. It is interesting for the reader to see how a client can perform with substantial intermediate results. Keep in mind the client in the setting of SPF is the server in the settings of a federated system. Thus, the applicability of SPF is not clear.

[R1] Ibrahim Abdelaziz, Essam Mansour, Mourad Ouzzani, Ashraf Aboulnaga, Panos Kalnis:
Lusail: A System for Querying Linked Data at Scale. Proc. VLDB Endow. 11(4): 485-498 (2017)

[R2] Ibrahim Abdelaziz, Razen Harbi, Zuhair Khayyat, Panos Kalnis:
A Survey and Experimental Comparison of Distributed SPARQL Engines for Very Large RDF Data. Proc. VLDB Endow. 10(13): 2049-2060 (2017)

Review #3
By Pascal Molli submitted on 27/Sep/2020
Suggestion:
Major Revision
Review Comment:

Star Pattern Fragments: Accessing Knowledge Graphs through Star Patterns
Christian Aebeloe , Ilkcan Keles, Gabriela Montoya and Katja Hose

Public SPARQL endpoints can be easily overloaded with a high load of complex queries. To protect themselves, endpoints enforce fair usage policies that stop queries that take too many resources to terminate. Reducing the interface of the server as proposed by TPF [6] increases the availability of the server under high load, but has severe counterparts in terms of data transfer, number of calls to the server, and finally deliver very poor performances under light load compared to traditional SPARQL endpoints.

The paper proposes to reduce the interface of the server to star-shaped queries to provide a better trade-off than TPF. As star-shape queries handle a significant number of joins on the server-side, data transfer is significantly reduced and the number of calls to the server to terminate query is also reduced. The general approach relies on a smart client that decomposes SPARQL queries into star-pattern queries, process them on the server, and recombines results to deliver full SPARQL on the client-side. The paper has an extensive experimental section to validate the proposal.

Strong points:
* The paper is well written and sound
* The motivations are clear.
Weak points:
* We dont know why the proposal may outperform [12] and [13],
* Experiments do not include [12] and [13] that address the same issue and already outperform TPF.

Related work.

The related work section is quite complete. I found the P2P part and federated system subsection a little bit “off-topic”. Section 2.3 is directly related to the issue/approach/contribution and highlights quite well problems of data transfer and the number of calls of TPF-like approaches.

I regret that section 2.3 does not mention SPF with a clear positioning vs TPF. SPF is mentioned in section 2.1,2.2, 2.4, but not in section 2.3 that is the more related section for the contribution.

Concerning the last subsection 2.4, the subtitle “other systems” looks like a bag where two different approaches are put together. This is problematic as the proposal really competes with Sage and Smart-kg that already outperform TPF approaches. So, In this part, I expect to see why the proposal could outperform Sage and Smart KG. The sentence “This, however, often leads to a large amount of requests to the server, since long running queries will have to be resumed several times.” is nearly meaningless as the quantum for a Sage Server is set by the administrator. You can set a quantum of 60s with a page of 100 results.

3 Preliminaries

Section 3 is well written and clear to me. Section 3.1 is less formal, but i understand the idea.

4 Star pattern Fragment

I understand the main idea of the approach ie. extending to the capability of the server from triple pattern to star pattern. I agree that it could deliver better performances than TPF because some joins are performed on the server. But, I don’t understand how this approach may outperform [12] and [13].

Fundamentally, Virtuoso poorly handles heavy loads because all Virtuoso processes may be busy with long-running queries. For example, on a Virtuoso server configured with one process, the select * where {?s ?p ?o} takes time to terminate and a concurrent query have to wait for this query to terminate. TPF avoids this problem by paginating results ie. after 200 results produced, the TPF server returns a page of results and a concurrent query can start execution. It works because producing a page of results for a triple pattern query is likely to terminate in short time thanks to SPO, POS, OSP indexes, and getting the next page of results is in O(log(data size)) thanks to the index. To ensure data availability, you have to ensure that you can produce any page of results for any query supported by the server in fixed time. There is no evidence that SPF can do that, and even if it can do that, why it can be better than [12] that provides such property for almost Core SPARQL?

Pagination is handled as a detail in the paper, but it is critical for dealing with data availability.

Section 4.1 is ok for me.

Section 5 Query processing.

This section detail how to decompose a SPARQL query into SPF fragment and recombines results. Well, this section gives many details on iterators but does not detail the decomposition algorithm (the one that implements definition 12). I prefer to see this algorithm than iterators. Is there only one decomposition possible? If there are several star decompositions how to choose? What is the complexity of the decomposition?

Section 6 experimental evaluation

The experiment compares TPF, brTPF, a SPARQL endpoint, and SPF the contribution. Why [12] and [13] are not part of experiments? [12] and [13] address the same issue, run nearly the same setup, and already outperforms TPF. This is a critical issue for me, we don’t know if SPF is better than [12] and [13].

The description of the setup is quite clear, the workload of queries is well crafted.

It is not clear to me how the SPF/LDF/TPF server has been configured with workers/Threads ?? Is it fair compared to the virtuoso setup?

The discussion around the CPU load is unclear to me. What is the objective and how to read it? Reduced energy consumption? Running with cheaper machines? Reduced hosting-cost on a cloud? Faster execution because the load on the machine is decreased? If the CPU load is 2 times higher, but the workload is executed in half-time, is it good or not?

Experiments with timeout/throughput are always difficult to interpret. How throughput is computed in presence of timeout? Maybe it can be better to wait for the workload to terminate to deliver a better final comparison.

Overall, the paper is well written and address an important topic. However, I don’t think that the proposed approach can outperform the state of art and especially [12,13]. As the experimentations do not include [12,13], there is no evidence of an improvement vs the state of art.