An efficient SPARQL engine for unindexed binary-formatted IFC building models

Tracking #: 1705-2917

Thomas Krijnen
Jakob Beetz

Responsible editor: 
Guest Editors ST Built Environment 2017

Submission type: 
Full Paper
To date, widely implemented and full-featured query languages for building models in their native formats do not exist. While interesting proposals have been formulated to query Industry Foundation Classes (IFC) models, their functionality is often incomplete and their semantics not precisely defined. With the introduction of the ifcOWL ontology as an equivalent to the IFC schema defined in the EXPRESS modelling language, a representation of native architectural and engineering building models in RDF is provided, and such models can be queried using SPARQL. The requirements stemming from the size of data sets handled in complex building projects, however, make the use of clear-text encoded RDF infeasible in many use cases. The IFC serialization in RDF is not as succinct as the STEP Physical File Format (IFC-SPF), in which IFC building models are predominantly encoded. This introduces a relative overhead in query processing and file size. In this paper we propose a SPARQL implementation, compatible with ifcOWL, built directly on top of a standardized binary serialization format for IFC building models, which is a direct binary equivalent of IFC-SPF with less overhead than the graph serialization in RDF. A prototypical implementation of the query engine is provided in the Python programming language. This binary serialization format, presented in earlier work by the authors, is based on an open file format for organizing large amounts of data, the Hierarchical Data Format (HDF5). It has several properties suitable for querying. Due to the hierarchical partitioning and fixed-length records, known entity instances can be retrieved in constant time, rather than in logarithmic time as in a sorted or indexed dataset. Statistics, such as the prevalence of instances of a certain type, can be derived in constant time from the dataset metadata. With instances partitioned by their type, querying typically operates on only a small subset of the data.
We compare the processing times for eight queries on five building models, running the Apache Jena ARQ query engine (using N-Triples, Turtle, TDB and HDT), RDF-3X and the system proposed in this paper. We show that in several realistic use cases the interpreted Python code performs equivalently to, or better than, the state-of-the-art implementations, including optimized C++ executables. In other cases, due to the linear nature of the unindexed storage format, query times fall behind, but do not exceed several seconds and are, as such, still orders of magnitude better than the time needed to parse N-Triples and Turtle files. Due to the absence of indexes, the proposed binary IFC format can be updated without overhead. For large models the proposed storage format results in files that are 2-3 times smaller than the currently most concise alternative.
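The constant-time retrieval claimed in the abstract can be illustrated with a minimal sketch. The record layout and the toy "IfcWindow" table below are hypothetical, not the authors' actual HDF5 schema; the point is only the access pattern: with fixed-width records, a known instance index translates directly into a byte offset, with no search or index lookup.

```python
import struct

# Hypothetical fixed-width record layout: (overall_height: float64,
# overall_width: float64) -- an illustration, not the paper's schema.
RECORD = struct.Struct("<dd")

def write_records(buf, rows):
    """Append fixed-width records to a bytes buffer."""
    for row in rows:
        buf += RECORD.pack(*row)
    return buf

def read_record(data, index):
    """Constant-time lookup: jump straight to index * record_size."""
    offset = index * RECORD.size
    return RECORD.unpack_from(data, offset)

# Build a toy "IfcWindow" table with 100 instances.
data = write_records(bytes(), [(1.0 + i, 0.5 + i) for i in range(100)])

# Retrieving instance #37 needs no index or binary search:
print(read_record(data, 37))  # (38.0, 37.5)
```

The same arithmetic underlies the paper's claim that known entity instances are retrieved in O(1) rather than the O(log n) of a sorted or B-tree-indexed store.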
Reject (Two Strikes)

Solicited Reviews:
Review #1
Anonymous submitted on 27/Aug/2017
Major Revision
Review Comment:

This paper presents a practical SPARQL engine tailored to a specific binary HDF5-based format for representing building models in RDF. The authors assume that instances strictly conform to a given ontology, so that triples can be organized by property tables and class types. They are then encoded as fixed-length records (sorted but “unindexed”), over which chunk-based gzip compression is applied. This scheme allows the authors to resolve subject-based queries in constant time by simply jumping to the given entity, but it hinders efficient object-based queries (as well as queries where the predicate is unbounded). The authors provide a prototype implementation that serves a partial subset of SPARQL 1.1 features. The evaluation on 5 datasets and some selected queries shows i) a space reduction similar to that of HDT, a general-purpose binary RDF format, ii) performance similar to well-known query engines such as RDF-3X for subject-based queries, and iii) reasonable performance for the worst-case object-based queries.

The work is interesting and it tackles the underexploited area of RDF compression and efficient RDF storage for specific purposes such as RDF-based building models. Nonetheless, I have some doubts about the approach, the evaluation and the paper presentation itself, which lead me to recommend revising the manuscript.

First of all, although the paper is, in general, well written, I honestly think that it is not self-contained and is hard to follow. Acknowledging that the binary IFC serialization has already been presented by the authors and is cited in the text, I missed a preliminary section with further details on HDF5 (which is not a well-known technology) as well as on the authors' proposed storage for RDF graphs. While the authors provide some preconditions at the beginning of Section 3, the description of the storage model in 3.1 is too shallow, and the example provided in Listing 3 is not explained in detail (i.e. what is the meaning of each keyword, and how is it organized?). Given that the authors build on a strict ifcOWL ontology, a minimal introduction and a descriptive example would also be needed (the underlying meaning of Listings 1 and 2 is likewise not detailed, and readers are not familiar with IFC).

Then, I have some doubts/questions, some of them maybe related to this lack of preliminaries:

- In 3.1.1, the authors state that the records are lexicographically sorted and that one can assume that the record of inst:IfcWindow_37 can be obtained by jumping over a number of fixed-width records. Does this mean that all URIs follow the same scheme and no gaps are allowed (e.g. if IfcWindow_37 is present, IfcWindow_1..36 are also present)? This could be clarified in the text. Otherwise, is there a dictionary of terms?
- The authors don't provide a formal algorithm for SPARQL resolution (or at least for the subset they implement), hence the resolution is unclear and difficult to reproduce. I would recommend formalizing the data structures and the resolution.
- The evaluation is focused on 5 datasets and some custom queries. Given that the system is quite restrictive on projections, predicate-bounded queries, object-based access, etc., I would suggest first running random triple patterns over the supported functionality (SPO, SP?, ?PO, etc.) to evaluate the performance of these atomic queries, which are the building blocks of more complex SPARQL queries. Then, I would use a synthetic dataset generator to build models of increasing size, in order to test the scalability of the approach.
- Contributions b) and c) on page 3 are only marginally elaborated in the text. I would recommend clarifying them or adding them to the discussion.
- Unless it is better justified, I think the term “unindexed” is a bit misleading given that, if I understand correctly, the authors need to sort the information lexicographically and organize the triples by property tables and class types.
- The state-of-the-art review of query execution optimization and, in particular, of RDF storage models is too narrow. For instance, there is a huge corpus of RDF storage models in databases, native storages and NoSQL systems that is hardly considered by the authors.
- The term “normalization” in 2.1.2 is not standard. I believe it rather refers to “RDF dictionaries”, for which a reference could be added too.
- In 2.2, a reference to ontology-based data access could be provided. Also in 2.2, when explaining 3store, the authors re-explain the use of RDF dictionaries, which can be omitted as it was already detailed in 2.1.2.
- The title of Section 3 (Implementation) could be substituted with a more descriptive one that reflects the purpose of the method.
- The titles of 3.1.1 and 3.2.2 could be replaced with “subject-based access” and “object-based access” (or anything more formal).
- Figure 2 could be omitted and substituted with a more formal example.
- The authors should clearly state which SPARQL subset is implemented. Although Table 3 provides the supported keywords, they also state that “Some of these keywords are not implemented in every context, for example (...)”. It is unclear whether the provided list of examples is complete or whether other restrictions apply. A formal description of the supported SPARQL algebra could help in this regard.
- The authors provide the code of the implementation, but they don't provide a minimal README or manual on how to use the approach.
- “The times presented in this section are the best of five” → Average times are more standard.
- An initial description of the corpus (size, statistics, etc.) could be provided. This is partially done in Table 6, but it should be located and described earlier, at the beginning of the section.
- In Figure 3, does the hdf5 size include the shuffle+compression as in Table 5? Please detail.
- In Figure 3, the meaning of hdt.index is not explained.
- Please unify the names of the techniques across sections and figures. For example, in Figure 3, TDB should be Jena-TDB; in Figure 4, I think HDT should be Jena ARQ HDT.
- Also, please clarify whether the techniques work on disk or in memory. For example, HDT has two modes, in-memory and on-disk. I had a look at your published code and I believe you are using it on disk, which should be slower.
In order to test in memory, you can modify L42 of the following file ( ) with:
HDT hdt = HDTManager.loadIndexedHDT(model_name, null);
- In 4.4, some comments on the comparison between the proposed approach and Jena-TDB and Jena HDT could be provided, in particular since they seem to perform the same on all queries, which is a bit strange if a varied corpus of queries is selected (as it should be).
- The meaning of “parsing” is a bit unclear. Is it the time to load the system, leaving it ready to query, or does it also include the time to parse the provided query? Note that some systems, such as Jena, can resolve full SPARQL on general-purpose graphs, so the comparison against a tailored approach that does not parse/resolve full SPARQL can be unfair.
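The atomic-pattern microbenchmark suggested in the evaluation comment above could be sketched roughly as follows. Everything here is a hypothetical stand-in: the toy in-memory store and the `resolve_pattern` function are placeholders for whichever engine is under test, and the generated "IfcWindow" triples are synthetic.

```python
import random
import timeit

# Synthetic toy dataset standing in for a real building model.
triples = {("inst:IfcWindow_%d" % i, "ifc:overallHeight", str(i))
           for i in range(1000)}

def resolve_pattern(s=None, p=None, o=None):
    """Placeholder lookup: replace with the engine under test.
    None means an unbound position (the '?' in SP?, ?PO, etc.)."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

def bench(pattern_kind, samples=100):
    """Time random instances of one atomic pattern shape."""
    rng = random.Random(42)
    picks = rng.sample(sorted(triples), samples)
    def run():
        for s, p, o in picks:
            if pattern_kind == "SPO":
                resolve_pattern(s=s, p=p, o=o)
            elif pattern_kind == "SP?":
                resolve_pattern(s=s, p=p)
            elif pattern_kind == "?PO":
                resolve_pattern(p=p, o=o)
    return timeit.timeit(run, number=1)

for kind in ("SPO", "SP?", "?PO"):
    print(kind, "%.4fs" % bench(kind))
```

Comparing the per-shape timings (subject-bound SPO/SP? versus object-bound ?PO) would expose exactly the asymmetry the review describes: constant-time subject access against linear scans for object-based patterns.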

Review #2
By Freddy Priyatna (OEG) submitted on 04/Sep/2017
Minor Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

I am happy that the authors have addressed most of the concerns that I raised in the previous round and that they have improved the expressivity of their system.
I see that the authors have changed the organization of the paper as suggested by the other reviewer and I think the current state is easier to read.

However, I maintain my decision of Minor Revision. The reason is that I insist that the authors drop the sentence "To date, widely implemented and full-featured query languages for building models in their native formats do not exist." from the Abstract and the sentence "Their functionality is not as complete and their semantics not as precisely defined as query languages that went through extensive development and standardisation processes." from the Introduction. Those sentences mislead/give the impression that the authors are going to solve this issue and provide complete support for the query language they use (SPARQL), which is not true, as the system has several limitations that are described in the Implementation section.

Except for that minor revision, I am fine with the rest of the paper.

Review #3
By Carlos Buil Aranda submitted on 05/Oct/2017
Major Revision
Review Comment:

I acknowledge the work done by the authors. The article is now easier to read and everything is in place. However, I still think the contribution is too low. While I agree that converting to HDF5 may have some advantages, I still doubt that it is the best option available. The two examples (less and more favorable) are the typical examples of creating indexes within any database system. As one of the reviewers said, "Can the authors convince me that I should use their proposed engine instead of using RDF3X?" RDF3X is not the newest/most optimized database, nor are any of the Jena implementations. To accept the paper, I need proof that this is really an improvement over existing storage systems. Showing that SPARQL over HDF5 is a better option for these models than Virtuoso 7, Blazegraph or some other state-of-the-art system is a must for acceptance.

Regarding data integrity, this is something that every database system (including Virtuoso) offers through the ACID properties.