A Scalable RDF Data Processing Framework based on Pig and Hadoop

Tracking #: 392-1487

Authors: 
Yusuke Tanimura
Steven Lynden
Akiyoshi Matono
Isao Kojima

Responsible editor: 
Philippe Cudre-Mauroux

Submission type: 
Full Paper
Abstract: 
In order to effectively handle the growing amount of available RDF data, scalable and flexible RDF data processing frameworks are needed. While emerging technologies for Big Data have become available, such as Hadoop-based systems that take advantage of scalable and fault-tolerant distributed processing based on Google's distributed file system and MapReduce parallel model, there are still many issues when applying these technologies to RDF data processing. In this paper, we propose an RDF data processing framework using Pig and Hadoop, with several extensions to solve these issues. We integrate an efficient RDF storage schema into our framework and show the performance improvement over Pig's standard bulk load and store operations, including the cost of schema conversion from conventional RDF file formats. We also compare the performance of our framework to existing single-node RDF databases. Furthermore, as reasoning is an important requirement for most RDF data processing systems, we introduce a user operation for inferring new triples using entailment rules and evaluate the performance of the transitive closure operation on our framework as an example of such inference.
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
By Jiewen Huang submitted on 25/Dec/2012
Suggestion:
Major Revision
Review Comment:

This paper proposes a scalable RDF data management system based on Hadoop and Pig. The major technical contributions of the paper include the implementation of vertical partitioning with MapFile and of transitive closure in Pig. The motivation of the paper is very relevant, and the authors seem to have done a lot of work to integrate their techniques into Hadoop and Pig.

However, I have several concerns regarding the paper. They should be addressed before the paper is accepted.

1. Presentation. I find the paper difficult to read and sometimes extremely confusing.
(a) Section 4. It is not common practice to put implementation and evaluation into one big section. It is understandable why the authors did this, but it would be preferable to split them into two sections.
(b) It is not clear to the reader what a MapFile is. Both pages 4 and 5 explain MapFile, but I find the explanations insufficient.
(c) Algorithms 1 and 2 for transitive closure are hard to follow. I think I understand them, but the authors should explain them better. For example, what is "SELF-JOIN BaseTriples" in Algo 1? There are at least 3 ways to do a self join on RDF data (i.e. subject-subject join, subject-object join, object-object join). Please specify which one is meant. Regarding "Extract NewTriples ....." in Algo 2, what exactly is "Extract" here? Figure 6 does not help readers understand. An example with concrete triples would be of great help.
(d) The authors put brackets around units, such as MB and byte. I'm not aware of this practice.
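
To make the ambiguity raised in point 1(c) concrete, the following is a minimal Python sketch of the three self-join variants on a small set of triples, followed by a naive transitive closure built from repeated subject-object joins. It is not the authors' Algorithm 1 or 2; the triples, the function names, and the single-machine fixpoint loop are all assumptions made for illustration.

```python
from itertools import product

# Hypothetical example triples (subject, predicate, object); not taken from the paper.
triples = [
    ("A", "subClassOf", "B"),
    ("B", "subClassOf", "C"),
    ("C", "subClassOf", "D"),
]

def self_join(ts, left_pos, right_pos):
    """Pair every two triples whose terms at the given positions match.

    Positions: 0 = subject, 2 = object.
    """
    return [(l, r) for l, r in product(ts, ts) if l[left_pos] == r[right_pos]]

subject_subject = self_join(triples, 0, 0)
object_subject = self_join(triples, 2, 0)  # object of left = subject of right
object_object = self_join(triples, 2, 2)

def transitive_closure(ts, predicate):
    """Naive fixpoint: repeat the object-subject join until no new triple appears."""
    closure = {t for t in ts if t[1] == predicate}
    while True:
        new = {
            (l[0], predicate, r[2])
            for l in closure
            for r in closure
            if l[2] == r[0]
        } - closure
        if not new:
            return closure
        closure |= new

closed = transitive_closure(triples, "subClassOf")
```

Only the join that matches the object of one triple with the subject of another produces new transitive triples such as (A, subClassOf, C), which is why the choice of join needs to be stated explicitly.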

2. Query execution is undoubtedly one of the most important aspects of any RDF system. However, only a small section, i.e. Section 4.3, is dedicated to this. This is my major complaint. Please elaborate on the query execution engine of the system.

3. It is common for RDF systems to map Uniform Resource Identifiers (URIs) and strings to integers during data loading. This generally improves performance. Did the authors do this? If not, please explain why not.
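
As an illustration of the mapping asked about in point 3, here is a minimal Python sketch of dictionary encoding, assuming a single in-memory dictionary and IDs assigned in first-seen order. The function names and the example URIs are hypothetical; a distributed loader would need a different ID assignment strategy.

```python
def encode(triples):
    """Assign each distinct term a dense integer ID in first-seen order."""
    term_to_id = {}
    encoded = []
    for triple in triples:
        # setdefault stores the next free ID only if the term is not yet known.
        ids = tuple(term_to_id.setdefault(term, len(term_to_id)) for term in triple)
        encoded.append(ids)
    return term_to_id, encoded

def decode(term_to_id, encoded):
    """Translate integer triples back to their original terms."""
    id_to_term = {i: term for term, i in term_to_id.items()}
    return [tuple(id_to_term[i] for i in ids) for ids in encoded]

# Hypothetical example data.
example = [
    ("http://example.org/a", "http://example.org/knows", "http://example.org/b"),
    ("http://example.org/b", "http://example.org/knows", "http://example.org/a"),
]
dictionary, encoded = encode(example)
```

Joins and comparisons then operate on small fixed-width integers instead of long strings, which is where the speed-up usually comes from.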

4. Experiments.
(a) I don't see why Section 4.4.4 is needed. The fact that two independent jobs can be performed at the same time comes naturally from Pig/Hadoop. The authors should not claim credit for that.
(b) Figure 5 is not self-explanatory. Please add explanations for abbreviations such as vp-txt and vp-mf.
(c) It is not clear why Table 4 is needed. Does it give us new insights beyond those in Figure 5?
(d) Why did the authors perform experiments on two clusters (Tables 1 and 2)? Isn't one cluster enough?
(e) In Table 3 and Figure 5, for a fair comparison, you should add an entry "VP - Text with LZO compression".

5. Comparison with RDF-3X. RDF-3X is one of the best single-machine RDF management systems. The experimental results would be more convincing if the proposed system fared better than RDF-3X.

Review #2
By Marcin Wylot submitted on 05/Jan/2013
Suggestion:
Reject
Review Comment:

In this paper the authors tackle RDF data processing using the MapReduce paradigm. The paper extends two previous works, referred to as [35] and [36]. The authors define a storage schema based on vertical partitioning and then propose algorithms to load data and execute queries on that schema. The proposed schema is an adaptation of a well-known approach in which triples are grouped by predicate and additionally sorted by subject. The proposed framework is built on top of Pig and Hadoop.
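
The storage layout summarized above (triples grouped by predicate and sorted by subject) can be sketched in a few lines of Python. This is only an in-memory illustration under assumed names and data; it does not use MapFile, Pig, or Hadoop and is not the authors' implementation.

```python
from collections import defaultdict

def vertical_partition(triples):
    """Group triples by predicate; keep (subject, object) rows sorted by subject."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    # Sorting the (subject, object) pairs orders rows by subject first.
    return {p: sorted(rows) for p, rows in tables.items()}

# Hypothetical example data.
example = [
    ("s2", "type", "Article"),
    ("s1", "type", "Person"),
    ("s1", "name", "Alice"),
]
tables = vertical_partition(example)
```

With this layout, a query that touches a single predicate reads only that predicate's table, and the subject ordering allows lookups and merge joins on the subject column.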

As a first summary of the journal article, we have to say that the proposed solution lacks the required novelty and that its description is rather vague. One shortcoming of the paper is that it lacks a clear description of the implementation and does not separate out what is already provided by the platform and thus cannot be counted as the authors' own contribution. This is especially true for the evaluation, which references implementation features that are never described. The same holds for the inference part. In addition, the authors mention previous works of theirs that this paper extends. Even though those works were already published, a journal article has no space constraint, so the authors could have summarized them concisely enough for the reader to understand what they are about without having to read the two other papers.

Section 3.3.2 is supposed to describe query execution and a series of techniques used to optimize it. Unfortunately, it says little about query execution, and some optimization techniques are mentioned without any explanation of how they were used in the framework. The authors also mention the need for multiple joins in RDF data processing and the fact that they proposed a multi-join method in one of their previous works [24], but they state that it was not integrated into the solution. Integrating it would have a huge impact on performance if the framework were benchmarked against complex queries, such as queries 7 and 9 of the Lehigh University Benchmark.
The authors deal with reasoning support in Section 3.3.3 in the same way. The RDF-INFER command and the RDFRuleBase() function are introduced, but the relation between the two is not clearly explained. Moreover, they are only mentioned, and after reading the paragraph it is hard to tell how they really work.
Section 4 is entitled “Implementation and evaluation”, which sounds odd. Why combine implementation and evaluation in one section? They should be separated. The implementation part is limited to ¼ page, and it is puzzling that the authors employed a general-purpose compression algorithm (LZO) instead of something more specific to the problem. The evaluation uses only one benchmark (SP2Bench) against other existing database systems (that’s what is stated at the beginning of Section 4). It was run in two environments, one containing 9 physical nodes and the other 13 nodes. The second one is virtualized, however, with one virtual machine deployed on each physical machine. Unfortunately the authors do not explain why they employed that kind of virtualization; they state only that it is necessary to consider the virtualization overhead. Furthermore, the performance of the framework was compared with single-node systems, although at least a few multi-node systems are available (e.g. 4store, SHARD).
The related work section is missing several essential works, the most important of which are:
- "High-Performance, Massively Scalable Distributed Systems using the MapReduce Software Framework: The SHARD Triple-Store" Kurt Rohloff, Richard E. Schantz
- "Scalable SPARQL Querying of Large RDF Graphs" Jiewen Huang, Daniel J. Abadi, Kun Ren
- "Scalable Semantic Web Data Management Using Vertical Partitioning" Daniel J. Abadi, Adam Marcus, Samuel R. Madden, Kate Hollenbach

For all those reasons we recommend to reject the paper.

The review was written by Martin Grund and Marcin Wylot.

Review #3
By Haofen Wang submitted on 31/Jan/2013
Suggestion:
Reject
Review Comment:

In this paper, the authors address an interesting problem: handling RDF graphs at scale using Pig and Hadoop. Standard Pig and the distributed file system for Hadoop are extended to support storing RDF data, answering RDF queries and performing rule-based inference. The work is promising but not yet ready for publication, for the following reasons:

The most critical problem is its novelty. Several works [1-3] consider how to handle large-scale RDF graphs in a distributed environment. The proposed storage scheme is vertical partitioning, which was originally proposed in [4] and was criticised and improved on by [5,6]. [1] employs [6] as the underlying storage system, and I doubt whether the system presented in this paper can outperform [1]. I suggest that the authors conduct a comparison experiment.

Optimization should be another important issue in this paper, but its presentation is very vague. For example, the second paragraph of Section 3.3.2 says "any other extended optimization for RDF data processing can be applied to the original Pig query engine". However, no citation is given for these optimizations, and the statement is questionable. Join ordering optimization is also discussed in a vague way. I also suggest comparing the join ordering optimization employed in this work with [7].

This paper claims to support rule-based reasoning, but only the transitive closure operation is presented. Existing works [8-11] have shown that MapReduce is a powerful tool for reasoning tasks. A discussion of the relation between this work and the existing work is recommended.

Finally, the experiments are unconvincing. For example, in the query performance evaluation, the authors compare their prototype with several single-node RDF databases. However, the prototype is run on a cluster of 13 nodes, each with 8 GB of memory, while the single-node RDF databases are run on a machine with only 3 GB of memory. This is not fair. Furthermore, the proposed method should be compared to other distributed RDF stores such as AllegroGraph, [1], and [2].

Minor:

The distributed file system for Hadoop is HDFS. Even though similar, HDFS and GFS are different.

References:

[1] Spyros Kotoulas, Jacopo Urbani, Peter Boncz and Peter Mika. Robust Runtime Optimization and Skew-Resistant Execution of Analytical SPARQL Queries on Pig. In Proc. of ISWC, 2012

[2] Jiewen Huang, Daniel J. Abadi, and Kun Ren. Scalable SPARQL Querying of Large RDF Graphs. In PVLDB, 2011

[3] Alexander Schätzle, Martin Przyjaciel-Zablocki, and Georg Lausen. PigSPARQL: mapping SPARQL to Pig Latin. In Proc. of SWIM, 2011

[4] Daniel J. Abadi, Adam Marcus, Samuel R. Madden, and Kate Hollenbach. Scalable semantic web data management using vertical partitioning. In Proc. of VLDB, 2007

[5] Cathrin Weiss, Panagiotis Karras, and Abraham Bernstein. Hexastore: sextuple indexing for semantic web data management. In PVLDB, 2008

[6] Thomas Neumann, Gerhard Weikum. RDF-3X: a RISC-style Engine for RDF. In PVLDB, 2008

[7] Foto N. Afrati, Jeffrey D. Ullman. Optimizing Multiway Joins in a Map-Reduce Environment. IEEE Transactions on Knowledge and Data Engineering. 23(9): 1282-1298 (2011)

[8] Jacopo Urbani, Spyros Kotoulas, Jason Maassen, Frank van Harmelen, Henri E. Bal. WebPIE: A Web-scale Parallel Inference Engine using MapReduce. Journal of Web Semantics. 10: 59-75 (2012)

[9] Jacopo Urbani, Frank van Harmelen, Stefan Schlobach, Henri E. Bal. QueryPIE: Backward Reasoning for OWL Horst over Very Large Knowledge Bases. In Proc. of ISWC, 2011

[10] Jacopo Urbani, Spyros Kotoulas, Eyal Oren, Frank van Harmelen. Scalable Distributed Reasoning Using MapReduce. In Proc. of ISWC, 2009

[11] Chang Liu, Guilin Qi, Haofen Wang, Yong Yu. Reasoning with Large Scale Ontologies in Fuzzy pD* Using MapReduce. IEEE Computational Intelligence Magazine. 7(2):54-66 (2012)