Systematic Performance Analysis of Distributed SPARQL Query Execution Using Spark-SQL

Tracking #: 2455-3669

Authors: 
Mohamed Ragab
Sadiq Eyvazov
Riccardo Tommasini

Responsible editor: 
Guest Editors Web of Data 2020

Submission type: 
Full Paper
Abstract: 
Recently, a wide range of Web applications has been built on top of vast RDF knowledge bases (e.g., DBpedia, Uniprot, and Probase) using the SPARQL query language. The continuous growth of these knowledge bases has led to the investigation of new paradigms and technologies for storing, accessing, and querying RDF data. In practice, modern big data systems like Apache Spark can handle large data repositories. However, their application in the Semantic Web context is still limited. One possible reason is that such frameworks are not tailored for dealing with graph data models like RDF. In this paper, we present a systematic evaluation of the performance of the Spark-SQL engine for processing SPARQL queries. We configured the experiments using three relevant RDF relational schemas and two different storage backends, namely Hive and HDFS. In addition, we show the impact of using three different RDF-based partitioning techniques in our relational scenario. Moreover, we discuss the results of our experiments, showing interesting insights into the impact of the different configuration combinations.
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
Anonymous submitted on 24/Apr/2020
Suggestion:
Reject
Review Comment:

This article presents a performance analysis of Apache Spark for the evaluation of SPARQL queries. It compares different configurations built by combining three relational storage schemas for RDF (Single Table, Property Table, Vertically Partitioned Table), three RDF data partitioning strategies (horizontal, subject-based and predicate-based) and five Spark storage layers (Avro, CSV, ORC, Parquet, Hive). The query workload is composed of 11 queries taken from the SP2Bench benchmark and evaluated over two datasets (100M and 500M triples). Performance is measured by the total execution time of each query in each configuration, and the 3*3*5*11*2 = 990 results are compared using different analysis methods.
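To make the compared relational encodings concrete, here is a minimal Spark (Scala) sketch of how a few triples could be laid out under each schema. The toy IRIs, column names and view names are illustrative assumptions, not taken from the paper under review.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.first

val spark = SparkSession.builder().appName("rdf-relational-schemas").getOrCreate()
import spark.implicits._

// Toy triple set; IRIs abbreviated as plain strings for readability.
val triples = Seq(
  ("ex:article1", "rdf:type",   "bench:Article"),
  ("ex:article1", "dc:creator", "ex:paul"),
  ("ex:article1", "dc:title",   "A Title")
).toDF("s", "p", "o")

// Single (statement) Table: one wide three-column table holding all triples.
triples.createOrReplaceTempView("st")

// Vertically Partitioned Tables: one two-column (s, o) table per predicate.
triples.filter($"p" === "dc:creator").select($"s", $"o").createOrReplaceTempView("vt_creator")

// Property Table: one row per subject, one column per predicate
// (built here with a pivot; actual implementations differ in the details).
triples.groupBy($"s").pivot("p").agg(first($"o")).createOrReplaceTempView("pt")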

Originality: The main contribution of this article is an experimental evaluation of Spark for evaluating SPARQL queries. The experimentation protocol is very complete and includes a systematic analysis (descriptive, diagnostic, prescriptive, combined).

Significance of results: The experimental settings and workloads (queries/data) are insufficient (see detailed comments) to produce results suitable for a deeper analysis (e.g., on scalability), to understand what happens in each configuration, and to obtain interesting scientific or technical conclusions.

Quality of writing: The article is well written in general.

Detailed comments:

You should better explain why you consider, in the introduction, that big data frameworks (and in particular MapReduce) are not tailored to perform native RDF processing. There exists a substantial body of work that studies the problem of SPARQL query optimization and distributed join processing with MapReduce (see below). You should position your argument with respect to this work.

Configurations and data locality: I do not understand how you combine certain storage schemas with certain partitioning schemas. For example, you should explain in more detail how you apply predicate-based partitioning to Property Tables. It is quite obvious that horizontal partitioning of randomly ordered triple sets is not efficient. More generally, you should analyze each configuration with respect to data locality. What can you tell about the data exchange/shuffling cost, which seems to be the most important performance factor (and is mainly independent of the data storage layer)? Couldn't you measure this cost more precisely?
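To make the data-locality question concrete, a possible DataFrame-level reading of the three partitioning strategies is sketched below in Scala; how the paper actually materialises the partitions may differ.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rdf-partitioning").getOrCreate()
import spark.implicits._

val triples = Seq(
  ("ex:article1", "rdf:type",   "bench:Article"),
  ("ex:article1", "dc:creator", "ex:paul")
).toDF("s", "p", "o")

// Horizontal partitioning: a fixed number of partitions, content-agnostic,
// so triples sharing a join key end up scattered across the cluster.
val horizontal = triples.repartition(16)

// Subject-based partitioning: triples with the same subject are co-located,
// which lets subject-subject (star) joins avoid a cross-node shuffle.
val bySubject = triples.repartition($"s")

// Predicate-based partitioning: triples with the same predicate are co-located,
// which helps predicate-selective scans but not subject-subject joins.
val byPredicate = triples.repartition($"p")

Under this reading, the shuffle cost of a query is largely determined by whether its join keys align with the partitioning column, which is exactly the locality analysis the review asks for.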

Experimental setup: 500M triples on 4 machines with 128 GB main memory? How does the performance compare with a single machine (which might have been sufficient for this workload)? More generally, the experiments do not illustrate the scalability of each configuration. How does each configuration behave with an increase in resources (nodes) and workload (data size, queries)? You could generate much larger workloads, with billions of triples, which might reveal new issues for certain configurations.

Experimental result analysis: I got lost in the explanations using your ranking function (prescriptive analysis). How did you choose this function/ranking score? Whereas it seems to separate the performance results visually/numerically more clearly than the simple execution time (Fig. 10), I could not get convinced of its practical use. Why isn't a simple sum of execution times over a practical workload sufficient? More generally, I would expect a more detailed performance measure, one which also takes account of space requirements (disk, memory).

Queries: You should discuss the relationship between Table 1 and the obtained performance results. Certain queries include very selective filters that can be executed locally and reduce the intermediate result size before applying the joins. For example, query Q2+ST-SQL has more joins and fewer filters than query Q4, but has better performance for all partitioning schemes. What is the reason for this? What is the influence/role of the Catalyst optimizer? You should compare the execution plans.
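One way to follow up on this suggestion would be to print the Catalyst plans of the translated queries and check where the selective predicate/object filters sit relative to the joins. The sketch below assumes the single statement table encoding and uses made-up IRIs.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("plan-inspection").getOrCreate()
import spark.implicits._

Seq(
  ("ex:article1", "rdf:type",   "bench:Article"),
  ("ex:article1", "dc:creator", "ex:paul")
).toDF("s", "p", "o").createOrReplaceTempView("st")

// A star-shaped pattern translated to a self-join over the statement table.
// explain(true) prints the parsed, analysed, optimised and physical plans,
// which shows whether Catalyst pushes the selective filters below the join.
spark.sql("""
  SELECT t1.s AS article, t2.o AS creator
  FROM st t1 JOIN st t2 ON t1.s = t2.s
  WHERE t1.p = 'rdf:type' AND t1.o = 'bench:Article'
    AND t2.p = 'dc:creator'
""").explain(true)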

Bibliography:

Graux, Damien, et al. "SPARQLGX: Efficient distributed evaluation of SPARQL with Apache Spark." International Semantic Web Conference. Springer, Cham, 2016.

Schätzle, Alexander, et al. "Sempala: Interactive SPARQL query processing on Hadoop." International Semantic Web Conference. Springer, Cham, 2014.

Kim, HyeongSik, et al. "From SPARQL to MapReduce: The journey using a nested triple group algebra." Proceedings of the VLDB Endowment 4.12 (2011): 1426-1429.

Naacke, Hubert, Bernd Amann, and Olivier Curé. "SPARQL graph pattern processing with Apache Spark." Proceedings of the Fifth International Workshop on Graph Data-management Experiences & Systems. 2017.

Review #2
Anonymous submitted on 24/May/2020
Suggestion:
Major Revision
Review Comment:

This paper presents a systematic evaluation of the performance of the Spark-SQL engine for processing SPARQL queries over different relational encodings of RDF datasets in a distributed setup. The paper uses SP2Bench to generate the RDF datasets and translates the benchmark queries into SQL, storing the RDF data using Spark's DataFrame abstraction. It also evaluates three different approaches for relational RDF storage.
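For readers unfamiliar with the setup being reviewed, the following Scala sketch shows how a triples DataFrame could be persisted in the five storage backends compared in the paper; the paths and table names are placeholders, and the exact options used by the authors may differ.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("storage-backends")
  .enableHiveSupport()   // needed for the Hive-backed variant below
  .getOrCreate()
import spark.implicits._

val triples = Seq(("ex:article1", "dc:creator", "ex:paul")).toDF("s", "p", "o")

// HDFS-backed file formats (Avro additionally requires the spark-avro package).
triples.write.mode("overwrite").format("parquet").save("/data/st_parquet")
triples.write.mode("overwrite").format("orc").save("/data/st_orc")
triples.write.mode("overwrite").format("avro").save("/data/st_avro")
triples.write.mode("overwrite").option("header", "true").format("csv").save("/data/st_csv")

// Hive backend: persist the same data as a managed Hive table instead of raw files.
triples.write.mode("overwrite").saveAsTable("st_hive")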

Originality:
The paper does not contain sufficient novelty to be published as a journal article in its current form. However, it gives insights into another dimension of analysis methodology, and therefore I see the value of this work: it provides an analysis of Spark-SQL queries under different configuration settings.
While I see the value of this work, I am unsure whether it is sufficient for a journal publication.
I would consider this work a first step towards a broader analysis of Spark-SQL query performance. As such, my feeling is that this work deserves to be shared with the research community.
A suggestion to improve this paper would be to include representative systems in the experiments and perform an extensive evaluation based on different configuration settings. Adding the latest benchmark queries, with their translation and a comparison between systems, would also be a nice insight.
One major concern is that the whole methodology is based on a big data decision-making framework, yet no provenance information for using such a methodology is provided.
Some recent benchmarks, such as LargeRDFBench [a], are missing, with no explanation of why they are not used in the experiments. Including LargeRDFBench queries with their translation would increase the scope of the paper. Also, SP2Bench is composed only of star- or snowflake-shaped queries, as mentioned, so subject-based partitioning will be biased and outperform the others.

The related work section is relatively weak; the difference with respect to important work such as [7] is not explained.
A good explanation of how this work differs from or extends the previous work [9] is required.

I find the current conclusion section rather weak.
Currently, it is mainly a summary of the paper, while the expected perspectives are missing.
Some aspects I would like to see discussed:
Which storage backend is good for Spark-SQL, and to what extent?
Which relational schema and partitioning technique is best in which type of setting in Spark-SQL?
What is the impact of this work?
What are the optimal configurations for using Spark-SQL?

Significance of the results:
A positive point is that the authors uploaded the source code and experimental data to GitHub.
However, the results need a better explanation.
The analysis focuses more on what the results are than on why such results are produced.
E.g., on page 15 the authors claim:
"For the 500M with the VT schema, and for PBP technique, ORC and Parquet respectively are still in the best ranks, but Avro, interestingly, outperforms Hive and CSV."
The authors should investigate why Avro outperforms Hive and CSV, and there are many more such examples in the paper.

Quality of writing: Overall, the paper is well written, with few writing mistakes.

Minor comments:
Page 8, left column, line 28: storage backend => do you mean relational schemas here?
Page 8, right column, line 1: RSD => RS_D
It is better to define the abbreviations used, such as PT, ST or VT, or at least put them in brackets after the headings.
Section 3.1: projected variables => projection variables
I don't understand the juxtaposition of section 5; it should be in the background section of the paper.
The keywords ST, VT and PT are used without definition. Did you mean single statement table and vertically partitioned table? Acronyms need to be defined properly.
Some references are incomplete; for example, in [9] and [20] the venue is not provided.

[a] M. Saleem, A. Hasnain, and A.-C. N. Ngomo. LargeRDFBench: A billion triples benchmark for SPARQL endpoint federation. Journal of Web Semantics, 2018.

Review #3
By Oscar Corcho submitted on 01/Aug/2020
Suggestion:
Reject
Review Comment:

First, I would like to acknowledge the large amount of work that has been done by the authors in preparing the experimental setup that has allowed running the experiments that are reported in this paper. I really think that this type of work is needed in order to understand better the different choices for supporting semantic technologies using different types of underlying systems.

However, I have many concerns about the level of maturity of the work presented and the extent to which the experiments that have been done are sufficient to obtain conclusions. This is the main reason why I propose reject as a decision for this paper: I think that the amount of work needed to make the paper ready for publication is still too large. It is true that the experimental setup is already prepared and the authors may be able to run the missing configurations and benchmarks (as I discuss below), but I have the impression that most of the paper would have to change, since the conclusions may be completely different as a result of those experiments.

I sincerely hope that my comments can help the authors prepare a stronger submission in the near future, one which provides clearer insights into the best configurations of partitioning, schemas and storage backends to be used with big data systems like Apache Spark.

First of all, in terms of originality, I am not aware of other similar works focused on evaluating the performance of these different alternatives/dimensions in the context of big data frameworks such as Spark. I am aware of works that try to use Spark, Flink and other similar frameworks to implement SPARQL query evaluation engines on top of them, and they have their own evaluations, but I am not aware of such a systematic study.

In terms of the title of the paper, I am not sure that it is sufficiently accurate. When I first read it, before going into the content of the paper, I thought that the paper was about SPARQL query evaluation over federated SPARQL endpoints, rather than about partitioning RDF datasets and evaluating different storage options.

My main concerns are the following:
- Why do you only work with the SP2Bench benchmark? It is true that in section 3 you try to provide a convincing argument for why it follows some of Jim Gray's recommendations for benchmarks, but you do not discuss why other existing benchmarks lack these properties, or why this is the best one among those available. Indeed, in the future work section you claim that you will extend this to other benchmarks. The main problem that I see here is that you may strongly depend on the characteristics of this benchmark, and hence some of the conclusions that you are obtaining may be completely useless when applied to other benchmarks/types of data. Indeed, at the end of section 6.1 you provide a comment on this, based on the number of projections in the benchmark.
- Why did you decide on 100M, 250M and 500M triples, and not other scales? What has been used in the state of the art? Can you convince the reader that most triple stores are analysed at these scales and that this is why you do the same? Are you able to scale even further? Indeed, something that I somewhat miss in your work is a comparison of the times that you need to run queries against what centralised triple stores or other options need. Is there a competitive advantage in using something like Spark? For instance, are you able to handle scales that others cannot? Are you more efficient once the scale exceeds some specific threshold? This is quite relevant to understanding the usefulness of these architectures and of your analysis.
- Why have you discarded some schemas and partitioning techniques? For instance, in the related work you mention that one work has demonstrated that WPT is really useful, but you have discarded it, and you do not sufficiently justify why. This is very relevant, since you are only looking at a more reduced set of combinations than what you could actually check and evaluate.
- Most of the discussions that you make refer to the different configurations that you have set up, but I am missing discussions of issues that are not strictly related to Spark. For instance, partitioning can still be done without this framework. Would the results be similar? Is there anything that has to do with the processing style of Spark or with the actual implementation of Spark SQL? I think that this analysis is missing in the paper, and it is extremely relevant. The same holds for the schemas used. Indeed, in relation to those who are trying to create SPARQL query engines on top of Spark, Flink, etc. (I have myself tried to do it on Flink, for instance), your paper falls short of providing relevant recommendations. You just analyse the behaviour of Spark SQL, but there are additional discussions that are missing.

The above mentioned points are the most relevant in my recommendation, and I hope that you can address them in future work. Now I move into more detailed, and less relevant comments:

I really like the methodology that you propose for the analysis of the results, especially since in many papers I see the usual "X performs better than Y" with no diagnosis or prescription. However, I think that you may want to simplify/reduce the text of section 5, and especially describe the content of section 5.3 better. It has been really hard for me to understand how you generate this ranking, why the formulas that you apply are used, and whether they are correct/adequate.

I cannot understand well what you mean by accuracy, and why this is relevant. You should try to explain it better.

As a minor comment on the evaluation setup, I think that it would be nice to have studies of cold runs, and not only of warm runs as you do. This is not too important, but it would be easy for you to do with the setup you have, and it may provide some interesting insights as well.

I have some concerns about the results:
- Why can't you use Q9 with the PT relational schema?
- Why does Fig. 9 not include Q7 and Q9? In fact, you describe why on the related website, but not in the paper, I think.

Other final comments:
- You refer to the work in [9] as a first phase of your work. However, the reference provided in the bibliography section does not say where it was published. By going to the GitHub-based website that you refer to, I have been able to identify that there are two previous works published at two workshops. Fix the reference, and pay attention to also identifying the second reference/workshop if needed.

- You have repeated the GitHub-based website as a footnote twice.

- Quality of writing: While going through the paper I noted many typos (no major problem, but a second reading by the authors would have been useful to spot them). I also don't really like some of the initial sentences in the abstract and in the introduction that justify why this work was done. It would be as easy as commenting that there are large RDF data sources available that need to be queried, that many techniques have been used in centralised and distributed settings, and that there is an opportunity to check whether existing big data systems can be used. Furthermore, when you indicate in the introduction that some native triple stores have scalability problems, you mention only a few of them, while there are plenty of triple stores (open source and commercial) that are not mentioned until we move into the related work, or not mentioned at all, and it is not clear why you do not mention them.

- Some unclear statements. For instance, in the introduction you claim that an additional contribution of your work is a deeper and prescriptive analysis of Spark SQL performance. It is very unclear what you mean by this, since you are not really evaluating Spark SQL itself, but how it reacts under different dimensions for queries that are commonly used in a SPARQL benchmark.