An efficient SPARQL engine for unindexed binary-formatted IFC building models

Tracking #: 1625-2837

Authors: 
Thomas Krijnen
Jakob Beetz

Responsible editor: 
Guest Editors ST Built Environment 2017

Submission type: 
Full Paper
Abstract: 
To date, widely implemented and full-featured query languages for building models in their native formats do not exist. While interesting proposals have been formulated, their functionality is often incomplete and their semantics are not precisely defined. With the introduction of the ifcOWL Linked Data ontology as an internationally recognized modelling standard for building models, a representation of native architectural and engineering building models in RDF is provided and such models can be queried using SPARQL. The requirements stemming from the size of the data sets handled in complex building projects, however, make the use of clear-text encoded Linked Data infeasible in many use cases. The IFC serialization in RDF is not as succinct as the STEP Physical File Format (IFC-SPF), in which IFC building models are predominantly encoded. This introduces a relative overhead in query processing and file size. This concern is aggravated when coupled with heterogeneous large-volume datasets, such as point clouds and sensor data. In this paper we propose a SPARQL implementation, compatible with ifcOWL, directly on top of a standardized binary serialization format for IFC building models, which is a direct binary equivalent of IFC-SPF, with less overhead than the graph serialization in RDF. A prototypical implementation of the query engine is provided in the Python programming language. This novel binary serialization format, which is based on HDF5, has several properties suitable for querying. Due to the hierarchical partitioning and fixed-length records, known entity instances can be retrieved in constant time, rather than logarithmic time in a sorted or indexed dataset, or linear time in a traditional IFC-SPF model. Statistics, such as the prevalence of instances of a certain type, can be derived in constant time from the dataset metadata. With instances partitioned by their type, querying typically only operates on a small subset of the data.
To validate our approach and its performance, we compare the processing times for six queries on five building models. The Apache Jena ARQ query engine (using N-triples, Turtle, TDB and HDT), RDF-3X and the system proposed in this paper are compared. We show that in many realistic use cases the interpreted Python code performs on par with or better than state-of-the-art implementations, including optimized C++ executables. In other cases, due to the linear nature of the unindexed storage format, query times fall behind, but do not exceed several seconds, and as such are still orders of magnitude better than the time to parse N-triples and Turtle files. Due to the absence of indexes, the proposed binary IFC format can be updated without overhead. For large models the proposed storage format results in files that are 2-3 times smaller than the currently most concise alternative.
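The constant-time retrieval described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, which builds on HDF5; it only shows the underlying idea with Python's standard `struct` module. The record layout, the `model` dictionary and the helper names are assumptions made for illustration.

```python
import struct

# Hypothetical fixed-length record layout: (instance id, height in mm).
# "<" selects standard sizes without padding: 4-byte unsigned int + 8-byte float.
RECORD = struct.Struct("<Id")

def build_partition(rows):
    """Pack the rows of one entity type into a contiguous byte buffer."""
    return b"".join(RECORD.pack(*row) for row in rows)

def get_record(partition, index):
    """Fetch the record at `index` by offset arithmetic, without scanning."""
    return RECORD.unpack_from(partition, index * RECORD.size)

def count_instances(partition):
    """Derive the number of instances from the buffer size alone."""
    return len(partition) // RECORD.size

# One partition per entity type, mirroring the partitioning by type.
model = {
    "IfcWall": build_partition([(1, 2700.0), (2, 3000.0), (3, 2850.0)]),
    "IfcDoor": build_partition([(4, 2100.0)]),
}
```

Because every record has the same size, the position of the n-th instance is a single multiplication, and a query restricted to one type only touches that type's partition.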
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Freddy Priyatna (OEG) submitted on 12/Jun/2017
Suggestion:
Minor Revision
Review Comment:

This paper presents a SPARQL engine implementation on top of an HDF5-based binary serialization format. The authors argue that such a format has several suitable properties, such as a smaller size and constant-time access to known instances. They implement a SPARQL engine that works with the serialization format and evaluate their approach with six queries on five building models.

Overall, I found that most of the paper is well written, with the exception of Section 3, which I found hard to read.

I have found several issues that I’d like to be addressed:
- The first claimed contribution is an efficient binary storage format for linked building data models. This is earlier work by the same authors and was presented at ICCCBE 2016.
- Given the argument that the proposed serialization format is smaller in size, I’d like more explanation of why this characteristic is important, considering that storage is really cheap nowadays.
- Can the authors convince me, if storage-size is not a deciding factor, that I should use their proposed engine instead of using RDF3X? I see from Fig 4 that RDF3X performs better in most cases than HDF5.
- At the beginning of the paper the authors argue that full-featured query languages for building models in their native formats do not exist. However, many limitations explained in Section 3 prevent the proposed engine from supporting full-featured SPARQL. Either the authors have to relax their claims in the abstract and introduction sections, or they have to increase the expressiveness of the engine.
- A major part of the evaluation section presents results that are related to the serialization format, such as conversion times, file sizes, compression ratio, etc. I’d appreciate it if the focus of this section could be directed to the query-related results. For example, some of the content of 4.3 can be moved to either the introduction or the state of the art section.
- Sections 4.6 - 4.7 are not part of the results. Perhaps they can be moved to the discussion section?
- The queries used for the evaluation purpose have relatively simple patterns. Not one of those queries features OPTIONAL patterns, or even nested OPTIONAL patterns.
- As mentioned in Section 5.5, one advantage of using index-less storage is ease of update. Thus, it would be nice if the authors could also add an update mechanism to their engine.
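
To make the remark about OPTIONAL patterns in the list above concrete: in SPARQL, OPTIONAL behaves like a left outer join over solution mappings. The following is a minimal sketch of that semantics only, with hypothetical bindings and a hypothetical helper `left_join`; it is not taken from the paper or from any query engine.

```python
def left_join(required, optional, key):
    """Keep every required solution; extend it with optional bindings that match."""
    results = []
    for row in required:
        matches = [opt for opt in optional if opt[key] == row[key]]
        if matches:
            # One output row per compatible optional solution.
            results.extend({**row, **opt} for opt in matches)
        else:
            # No match: the required solution survives with the variable unbound.
            results.append(dict(row))
    return results

# Hypothetical solution mappings: two walls, only one of which has a name.
walls = [{"wall": "w1"}, {"wall": "w2"}]
names = [{"wall": "w1", "name": "North wall"}]
solutions = left_join(walls, names, "wall")
```

A query with an OPTIONAL pattern therefore has to return rows with unbound variables, which is the behaviour an evaluation query of this kind would exercise.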

Review #2
Anonymous submitted on 13/Jun/2017
Suggestion:
Accept
Review Comment:

The article at hand addresses the well-known issue of querying building data from IFC files. Several approaches addressing similar issues have been presented in the literature, but none addressed extracting data from such an HDF5-based binary serialization of IFC files. Thus the originality of this paper is very good.
Validated through several benchmarks, this approach has two main advantages compared to those based on RDF serializations of ifcOWL: a) the file size is reduced (with RDF-based approaches the file size is usually 5 to 7 times bigger), b) query execution times are considerably reduced. The authors should perhaps emphasize that the expressivity of ifcOWL hinders such building data processing with Semantic Web techniques. Moreover, the authors are well aware of existing work in the domain; they have presented the advantages of their approach well, along with its drawbacks, and have also proposed ways of improvement. The significance of the achieved results is also quite high.
The quality of writing is also very good. Still, to improve the overall quality of the article, the following remarks could be addressed.

Remarks:
- Abstract: "ifcOWL Linked Data ontology" - the actual standard ifcOWL ontology is not Linked Data
- Table 3 - define units for measured times
- 4.2 conversion times - please specify whether you use open source conversion algorithms such as https://github.com/IDLabResearch/IFC-to-RDF-converter
- Figure 4 - parse and query times should be clearly defined / why is Query 4 missing?
- it could be interesting to check whether the execution times presented here can be compared to those presented in "Querying and reasoning over large scale building data sets: an outline of a performance benchmark" (https://dl.acm.org/citation.cfm?id=2928303)
- how do the authors make sure that all relevant results have been retrieved?
- mention R2RML or DM [1,2] which are W3C standards, instead of D2RQ
- mention articles such as those of Kostis [3], those of Ontop [4] or SSN [5]
- perhaps think of a more solid formalization of the proposed approach - use [6,7] as reference
- add some additional references:
[1] https://www.w3.org/TR/r2rml/
[2] https://www.w3.org/TR/rdb-direct-mapping/
[3] Koubarakis, M., & Kyzirakos, K. (2010). Modeling and querying metadata in the semantic sensor web: The model stRDF and the query language stSPARQL. The semantic web: research and applications, 425-439.
[4] Bereta, K., & Koubarakis, M. (2016, October). Ontop of Geospatial Databases. In International Semantic Web Conference (pp. 37-52). Springer International Publishing.
[5] Compton, M., Barnaghi, P., Bermudez, L., García-Castro, R., Corcho, O., Cox, S., ... & Huang, V. (2012). The SSN ontology of the W3C semantic sensor network incubator group. Web semantics: science, services and agents on the World Wide Web, 17, 25-32.
[6] Pérez, J., Arenas, M., & Gutierrez, C. (2006, November). Semantics and Complexity of SPARQL. In International semantic web conference (pp. 30-43). Springer Berlin Heidelberg.
[7] Chebotko, A., Lu, S., & Fotouhi, F. (2009). Semantics preserving SPARQL-to-SQL translation. Data & Knowledge Engineering, 68(10), 973-1000.
[8] h5py (http://www.h5py.org)
[?] reference to the original building models used (if published online)

Review #3
By Carlos Buil Aranda submitted on 16/Jun/2017
Suggestion:
Reject
Review Comment:

In this paper the authors propose a mechanism to serialize RDF triples in the HDF5 file format. The authors also propose a way to execute SPARQL queries over this RDF serialization, which is compatible with the modeling standard ifcOWL. The authors provide an evaluation of the system, comparing it to other serialization formats.

Section comments:
Abstract and Introduction
In the Introduction section the authors describe what the BIM paradigm is (a data model for describing/storing 3D models) and what ifcOWL is (an ontology for building models), and present the need to access IFC models directly through SPARQL.
Comments: In the abstract, the terms IFC, ifcOWL and HDF5 are used without being introduced, so I was a bit lost at the beginning. The authors describe the problem in the abstract, but do not give a hint of why one should use that serialization instead of the existing ones, or use some translation à la RDB2RDF. Thus, the sentence “To validate our approach” is misleading, since the authors have not described their approach at that point. In the Introduction section these details are resolved; however, as a new reader looking at the abstract, I found it a bit confusing.

State of the art
In the State of the Art section the authors introduce several approaches to storing data and describe these approaches flawlessly, especially when it comes to storing certain RDF properties such as “rdf:list”, which are important for storing geometry data. The authors describe some approaches to query execution optimization (Section 2.1), SPARQL translation (i.e. RDB2RDF), and Binary IFC and HDF5.
The section is interesting and easy to read, with just a minor flaw: in Section 2.1 the authors write “Most relevant to the work presented in this paper are (a) static query optimization by a selectivity estimate for triple patterns, and (b) efficient storage and indexing methods to quickly retrieve data”; however, I see 3 subsections afterwards. I was expecting only two, matching the introduction of that subsection.

Implementation
In the implementation section the authors describe their actual work, which is a “reasonably subset of SPARQL 1.1” implementation on top of HDF5. That subset of SPARQL 1.1 limits the query language to bound predicates in the ifcOWL ontology, “intermediate nodes”, SELECT only, and path expressions only for rdf:list predicates. In this section the authors also present the storage model, i.e. how data is stored on disk.
Comments:
Regarding the SPARQL 1.1 subset implemented, I’d like to know whether other operators like OPTIONAL, UNION, SERVICE, subqueries, aggregates, etc. are implemented. From my point of view (if these operators are actually implemented) the subset of the language provided is quite small and I find it not reasonable at all. Have you studied the complexity of such a subset of the language?
In that section there is a reference on page 5 to Listing 2; however, Listing 2 is placed on page 10. Personally, I got lost trying to find the listing. I found Section 3.1 difficult to understand. If this section explains how a set of RDF triples can be stored in HDF5, it would be best to add a diagram instead of a screenshot like the one in Figure 2. Regarding the two query examples (less and more favorable), are these two query types the only ones that can be solved? In the query execution section, the authors explain how rdf:list is translated; however, I do not understand what steps 1 to 4 mean, e.g. in 1) move back, to where?

Overall I found this section quite difficult to understand; I think a running example would help.

Experiments
In this section the authors present their experiments. First they describe their setup, then conversion times (I think time units would be useful so the reader can get an idea of how complex the process is), file sizes and query execution times. My first comment is about Figure 3: it is very hard to differentiate the columns when the article is printed in black and white. I would recommend using lines, cross lines, dots, etc. instead of similar colors. In this section it is possible to see that the authors run 5 queries against their system and that they provide the most succinct RDF representation of the evaluated systems. Again, pointing from page 10 to a figure on page 6 is misleading, since the reader expects the content related to what s/he is reading to be next to the text. Besides, it is not really descriptive to use a screenshot from an application to show data on disk; from my point of view a diagram would be more useful. The authors also describe the query parser/interpreter and some details about the file storage. I think these descriptions should go one section earlier, since this section is about evaluation.

Overall I find the experiments very limited. Regarding the number of queries evaluated, I think that not many conclusions about using HDF5 to store the RDF data can be drawn from only 5 queries accessing these data. There is a variety of query shapes, with different combinations of bound subjects or objects, that can help in understanding when it is useful to use HDF5. Regarding the succinctness of the approach, HDF5 is the most succinct, a bit better than RDF-HDT. However, RDF-HDT can execute SPARQL 1.1 queries with unbound predicates. Thus I consider comparing both RDF serializations quite unfair, since the SPARQL expressivity is different.

Conclusions
In this section the authors conclude the paper and provide some future directions. The authors give some hints on how to use HDF5 with larger models, optimizations to further reduce execution times on “realistic queries”, etc.

Overall comments:
I think the paper provides an interesting solution for storing geometrical models; however, the implementation of the solution is quite limited. The authors only implement a minimal subset of the language, and they did not explain in detail how HDF5 is used to store the RDF data. Or at least I did not quite understand it. Another problem I see in the paper is that it is very badly organized. The Experiments section contains information about the implementation, figures and listings are very badly placed and mislead the reader, figures could be better, and diagrams or running examples are missing. I think another problem is in the evaluation section. I do not think it is possible to draw many useful conclusions from executing just 5 queries on the data. There is a variety of SPARQL queries that the authors did not take into account.

