VIG: Data Scaling for OBDA Benchmarks

Tracking #: 1796-3009

Davide Lanti
Guohui Xiao
Diego Calvanese

Responsible editor: Guest Editors Benchmarking Linked Data 2017

Submission type: Full Paper

Abstract:
In this paper we describe VIG, a data scaler for Ontology-Based Data Access (OBDA) benchmarks. Data scaling is a relatively recent approach, proposed in the database community, that allows for quickly scaling an input data instance to s times its size, while preserving certain application-specific characteristics. The advantages of the scaling approach are that the same generator is general, in the sense that it can be re-used on different database schemas, and that users are not required to manually input the data characteristics. In the VIG system, we lift the scaling approach from the pure database level to the OBDA level, where the domain information of ontologies and mappings has to be taken into account as well. VIG is efficient; notably, each tuple is generated in constant time. To evaluate VIG, we have carried out an extensive set of experiments with three datasets (BSBM, DBLP, and NPD), using two OBDA systems (Ontop and D2RQ), backed by two relational database engines (MySQL and PostgreSQL), and compared with real-world data, ad-hoc data generators, and random data generators. The encouraging results show that the data scaling performed by VIG is efficient and that the scaled data are suitable for benchmarking OBDA systems.
Decision: Minor Revision

Solicited Reviews:
Review #1
By Panagiotis Papadakos submitted on 13/Feb/2018
Major Revision
Review Comment:

This paper describes VIG, a data scaler for Ontology-Based Data Access (OBDA) that takes into account the domain information of ontologies and mappings. It is claimed to be very efficient at generating huge amounts of data, producing each tuple in constant time, since it does not have to retrieve previously generated tuples. Additionally, according to the authors, the scaling process can be delegated to different machines, scaling up to the number of columns in the schema without communication overhead.

Generally, the paper is well written, and the significance of scaling datasets for Big Data makes the topic important. Some observations follow:

a) This work is an extension of the original work presented in the BLINK 2016 Workshop on Benchmarking. Although I generally agree with the authors about the new content added in this revised work, I believe that some of the currently reported future work should be included in this paper. For example, for completeness, the authors should consider supporting distributions other than the uniform one (e.g., normal, power-law, etc.) when generating values in columns, since this could help support more real-life datasets (as shown in the DBLP one). Regarding multi-attribute foreign keys, I think that the discussion provided at the end of the article suffices to understand the complexity of the problem.

b) I would like to see some discussion of the complexity of the different stages of the VIG algorithm, especially the last steps of the analysis phase, i.e., the column-cluster analysis and the satisfaction of foreign keys, a problem that is encoded as a constraint satisfaction problem (CSP).

c) In the same manner, I am not convinced about the linear complexity of the VIG algorithm. I would like to see more fine-grained experiments on generation times (currently provided only for the BSBM experiment) and with more and bigger scale factors (e.g., 1, 10, 100, 1000, 10,000, 100,000). I suspect that aiming for a dataset of 10,000,000 products in 5.2.1, or for the memory limits of the bigger HP server, would provide useful insights (from Fig. 4 it seems that the space complexity of the VIG algorithm is rather low).

d) I would also like to see some experiments supporting the authors' argument that parallelization can scale up to the number of columns.

e) I think the paper would greatly benefit from a table illustrating the different stages of the analysis phase of the algorithm on the running example. The current text is difficult to follow, due to the dense notation and the fact that the reader has to constantly check things on previous pages.

f) In 5.2.2 the authors comment on dependencies between binary tuples stored in different tables. This is a limitation of the VIG approach that I would like to see somewhere in the presentation of the algorithm (perhaps together with other limitations), rather than in the evaluation section.

In the same section, in the discussion of the results, the authors claim that VIG performs substantially better than RAND, which runs the tests twice as fast as VIG or NTV. I would like to see some discussion supporting this argument (why is slower query performance better?).

Further, I am not sure about the importance of reporting results for two different databases (PostgreSQL/MySQL). I feel I am missing some discussion here.

Although I am not an expert in benchmarking, I believe that this work can be a nice addition to the relevant bibliography and that my comments can help the authors improve its quality.

Review #2
Anonymous submitted on 19/Feb/2018
Review Comment:

The authors have incorporated my feedback in the revision. In particular, I appreciate the restructuring of the evaluation section.

Review #3
Anonymous submitted on 23/Mar/2018
Review Comment:

The paper previously received three reviews and has been reworked accordingly. The comments from the reviews have been addressed. Most notably, the following modifications, which address comments from the previous reviews, have been noted by the reviewer:
- The paper extends a previous workshop paper. In comparison with the workshop paper [15], the evaluation in particular has been significantly extended, and a discussion of multi-attribute foreign keys has been included.
- Additional related work has been included.
- The evaluation has been better described (and extended).
- The introduction of Section 4 is adequate.
- ... and some further works.
Overall the paper is in an acceptable state.