An Empirical Evaluation of Cost-based Federated SPARQL Query Processing Engines

Tracking #: 2514-3728

Authors: 
Umair Qudus
Muhammad Saleem
Axel-Cyrille Ngonga Ngomo
Young-Koo Lee

Responsible editor: 
Ruben Verborgh

Submission type: 
Full Paper
Abstract: 
Finding a good query plan is key to the optimization of query runtime. This holds in particular for cost-based federation engines, which make use of cardinality estimations to achieve this goal. A number of studies compare SPARQL federation engines across different performance metrics, including query runtime, result set completeness and correctness, number of sources selected, and number of requests sent. Albeit informative, these metrics are generic and unable to quantify and evaluate the accuracy of the cardinality estimators of cost-based federation engines. To thoroughly evaluate cost-based federation engines, the effect of estimated cardinality errors on the overall query runtime performance must be measured. In this paper, we address this challenge by presenting novel evaluation metrics targeted at a fine-grained benchmarking of cost-based federated SPARQL query engines. We evaluate five cost-based federated SPARQL query engines using existing as well as novel evaluation metrics on LargeRDFBench queries. Our detailed analysis of the experimental results reveals novel insights that are useful for the development of future cost-based federated SPARQL query processing engines.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
Anonymous submitted on 13/Jul/2020
Suggestion:
Major Revision
Review Comment:

I thank the authors for their reply letter and updated paper.
Most of my earlier comments have been addressed,
except for my main comment regarding the significance of the metrics.

I will reiterate this earlier remark on the weakness of the similarity and q-error metrics:
According to the current experiments, the correlation with query execution time has been shown to be quite small (low R values).
Furthermore, the results for q-error are statistically insignificant (high p values) in most cases,
as are those for similarity in 50% of the cases under linear regression and 25% under robust regression.
The authors correctly explain that this shows that query execution time is determined by a multitude of factors, of which cardinality estimation is only one.
However, given the outcomes of these experiments,
there is no proof that these metrics are significant enough to provide information about cardinality estimation,
which makes the value of these metrics questionable.
I find it unfortunate that the authors did not find an alternative method of proving the effectiveness of these metrics in a statistically significant manner,
as I do think that these metrics have their value in this domain.
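
To make the statistics at issue concrete, the sketch below shows the kind of test the discussion refers to: regressing per-query runtimes on a cardinality-error metric and reading off the R and p values. The variable names and numbers are purely illustrative, not the authors' benchmark data.

# Minimal sketch (illustrative data only, not the authors' results):
# regress per-query runtimes on a cardinality-error metric and report
# the correlation coefficient (R) and significance (p value).
from scipy.stats import linregress

# hypothetical per-query values: metric score vs. measured runtime (s)
metric_scores = [1.2, 3.5, 2.1, 8.0, 4.4, 1.1, 6.3]
runtimes_s    = [0.8, 2.9, 1.5, 9.1, 3.0, 0.7, 5.2]

result = linregress(metric_scores, runtimes_s)
print(f"R = {result.rvalue:.3f}, p = {result.pvalue:.3f}")
# A small R, or a p value above the chosen threshold (e.g. 0.05), means the
# metric on its own does not significantly explain runtime variation.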

Even after repositioning the paper, the introduction, evaluation, and discussion of these metrics make up the majority of this submission.
Therefore, I find this work too weak in its current state to accept for a journal.

Typo:
* Page 2: "sate-of-the-art cost-based"

Review #2
By Stasinos Konstantopoulos submitted on 29/Jul/2020
Suggestion:
Minor Revision
Review Comment:

The submission proposes a similarity metric that evaluates the accuracy of the predictions that a query optimizer makes about the cardinalities of triples and joins. As these predictions are the primary input for estimating cost, erroneous cardinality predictions can lead the optimizer to sub-optimal query plans.

The proposed similarity metric is compared against q-error, a metric for evaluating cardinality predictions from the relational database literature. The basis of comparison is how strongly the evaluation metric correlates with actual execution times on a benchmark suite. That is to say, the proposed similarity metric is found superior to q-error in the sense that the new metric is a better predictor of overall query execution performance than q-error. Such a metric can be used (in combination with other metrics evaluating other steps of the query processing pipeline) to pinpoint the step that causes poor performance.
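
For reference, the q-error of a single cardinality estimate is conventionally defined in the relational database literature as the factor by which the estimate deviates from the true cardinality, irrespective of direction. The sketch below illustrates that convention only; the clamping of zero cardinalities to 1 is an assumption and not necessarily the authors' implementation.

# Minimal sketch of the standard q-error: the factor by which an estimate
# deviates from the true cardinality, regardless of direction.
def q_error(estimated: float, actual: float) -> float:
    est = max(estimated, 1.0)  # clamp zeros to 1 (one common convention)
    act = max(actual, 1.0)
    return max(est / act, act / est)

print(q_error(estimated=500, actual=50))  # 10.0: tenfold over-estimation
print(q_error(estimated=50, actual=500))  # 10.0: tenfold under-estimation
print(q_error(estimated=50, actual=50))   # 1.0: perfect estimate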

The article is well-written, the work described can potentially be a significant contribution to the federated query processing literature, and the submitted manuscript is a substantial step towards materializing this potential.

My comments regarding the analysis of the experiments' outcomes have been addressed to the degree that I only have minor editorial complaints. Specifically, I recommend that in Section 6.3 (or thereabouts) the authors provide a concrete example demonstrating how the new metric can be used in a way for which q-error alone would be insufficient or inadequate. This would support the first contribution claimed by the authors: the new metric can give a fine-grained evaluation of query engines. I note here that in the previous round, the authors reacted to this recommendation with a rough sketch of such an example in their response, but nothing relevant was added to the manuscript.