Review Comment:
First, I would like to acknowledge the large amount of work the authors have put into preparing the experimental setup that made the experiments reported in this paper possible. I really think this type of work is needed to better understand the different choices for supporting semantic technologies on different types of underlying systems.
However, I have many concerns about the maturity of the work presented and about whether the experiments that have been run are sufficient to support the conclusions drawn. This is the main reason why I propose rejection: the amount of work still needed to make the paper ready for publication is too large. It is true that the experimental setup is already in place and the authors may be able to run the missing configurations and benchmarks (as I discuss below), but I have the impression that most of the paper would have to change, since the conclusions may be completely different as a result of those experiments.
I sincerely hope that my comments can help the authors prepare a stronger submission in the near future, one that provides clearer insights into the best combinations of partitioning techniques, schemas and storage backends to use with big data systems like Apache Spark.
First of all, in terms of originality, I am not aware of other works that evaluate the performance of these different alternatives/dimensions in the context of Big Data frameworks such as Spark. I am aware of works that use Spark, Flink and similar frameworks to implement SPARQL query evaluation engines on top of them, each with its own evaluation, but I am not aware of a systematic study like this one.
Regarding the title, I am not sure it is sufficiently accurate. When I first read it, before going into the content of the paper, I thought the paper was about SPARQL query evaluation over federated SPARQL endpoints, rather than about partitioning RDF datasets and evaluating different storage options.
My main concerns are the following:
- Why do you work only with the SP2Bench benchmark? It is true that in section 3 you try to provide a convincing argument for why it follows some of Jim Gray's recommendations for benchmarks, but you do not discuss why other existing benchmarks lack these properties, or why this is the best one among those available. Indeed, in the future work section you state that you will extend the study to other benchmarks. The main problem I see here is that your results may depend strongly on the characteristics of this benchmark, so some of the conclusions you obtain may be useless when applied to other benchmarks/types of data. Indeed, at the end of section 6.1 you comment on this yourselves, based on the number of projections in the benchmark queries.
- Why did you decide on the 100M, 250M and 500M scales, and not others? What has been used in the state of the art? Can you convince the reader that most triple stores are analysed at these scales, and that is why you chose them? Are you able to scale even further? Indeed, something I miss in your work is a comparison of your query times against those of centralised triple stores or other options. Is there a competitive advantage to using something like Spark? For instance, can you handle scales that others cannot? Are you more efficient once the scale crosses some threshold? This is quite relevant for understanding the usefulness of these architectures and of your analysis.
- Why did you discard some schemas and partitioning techniques? For instance, in the related work you mention that one work has demonstrated that WPT is really useful, yet you discard it without sufficient justification. This is very relevant, since you evaluate a much smaller set of combinations than you could actually check.
- Most of your discussion refers to the configurations you have set up, but I miss discussion of situations that are not strictly tied to Spark. For instance, partitioning can still be done without this framework: would the results be similar? Is there anything specific to Spark's processing style, or to the actual implementation of Spark SQL? I think this analysis is missing from the paper, and it is extremely relevant. The same goes for the schemas used; indeed, the schema and partitioning dimensions are relational design choices that any SQL engine could exercise (see the sketch after this list). In relation to those who are trying to build SPARQL query engines on top of Spark, Flink, etc. (I have tried to do it on Flink myself, for instance), your paper falls short of providing relevant recommendations: you only analyse the behaviour of Spark SQL, and additional discussion is missing.
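To make this point concrete, here is a minimal sketch of what I mean (the PySpark code and table names are my own illustration, not taken from the paper): under a vertically partitioned schema with one table per predicate, a SPARQL basic graph pattern such as { ?doc dc:creator ?p . ?p foaf:name ?n } (using SP2Bench's vocabulary) reduces to an ordinary relational join that any SQL engine, not only Spark SQL, could evaluate. Any Spark-specific effect would then stem from the execution engine rather than from the schema or partitioning choice.

```python
# Illustrative sketch only: `creator` and `name` are hypothetical
# per-predicate tables (columns s, o) of a vertically partitioned
# schema, assumed to be already registered in the Spark catalog.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("vp-join-sketch").getOrCreate()

# { ?doc dc:creator ?p . ?p foaf:name ?n } as a plain relational join:
result = spark.sql("""
    SELECT c.s AS doc, n.o AS name
    FROM creator AS c
    JOIN name AS n ON c.o = n.s
""")
result.show()
```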
The points above are the most relevant to my recommendation, and I hope you can address them in future work. Now I move on to more detailed, less critical comments:
I really like the methodology you propose for the analysis of the results, especially since many papers stop at the usual "X performs better than Y" with no diagnosis or prescription. However, I think you should simplify/shorten the text of section 5, and in particular describe the content of section 5.3 better. It was really hard for me to understand how you generate this ranking, why the formulas you apply were chosen, and whether they are correct/adequate.
I cannot understand well what you mean by accuracy, or why it is relevant. You should explain it better.
As a minor comment on the evaluation setup, it would be nice to also study cold runs, not only warm runs as you do. This is not too important, but it should be easy with the setup you have, and it may provide some interesting insights; a rough sketch of what I have in mind follows.
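For instance (the function below is purely illustrative, not the authors' code), each timed query could run in a fresh SparkSession with caches cleared, so that no executor-side state survives between measurements; a truly cold run would additionally require dropping the OS page cache, e.g. by restarting the cluster nodes between runs.

```python
# Illustrative cold-run measurement sketch; all names are hypothetical.
import time
from pyspark.sql import SparkSession

def run_cold(query: str) -> float:
    """Time one query in a fresh session with no cached state."""
    spark = SparkSession.builder.appName("cold-run-sketch").getOrCreate()
    spark.catalog.clearCache()      # drop any cached/persisted tables
    start = time.time()
    spark.sql(query).collect()      # force full evaluation of the query
    elapsed = time.time() - start
    spark.stop()                    # tear down so the next run starts cold
    return elapsed
```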
I have some concerns about the results:
- Why can't you use Q9 for the PT relational schema?
- Why does Fig. 9 not include Q7 and Q9? You explain why on the accompanying website, but not in the paper, I think.
Other final comments:
- You refer to the work in [9] as a first phase of your work. However, the bibliography entry does not say where it was published. By visiting the GitHub-based website you refer to, I was able to identify two previous works published at two workshops. Fix the reference, and consider also citing the second workshop paper if needed.
- The footnote with the GitHub-based website appears twice.
- Quality of writing. While going through the paper I noted many typos (no major problem, but a second reading by the authors would have helped to spot them). I also do not really like some of the initial sentences in the abstract and the introduction that justify this work: it would be enough to say that there are large RDF data sources available that need to be queried, that many techniques have been applied in centralised and distributed settings, and that there is an opportunity to check whether existing Big Data systems can be used. Furthermore, when you state in the introduction that some native triple stores have scalability problems, you mention only a few of them; there are plenty of triple stores (open source and commercial) that are not mentioned until the related work, or not mentioned at all, and it is not clear why.
- Some unclear statements. For instance, in the introduction you claim that an additional contribution of your work is a deeper and prescriptive analysis of Spark SQL performance. It is very unclear what you mean by this, since you are not really evaluating Spark SQL itself, but how it behaves along different dimensions for queries commonly used in a SPARQL benchmark.