An efficient SPARQL engine for unindexed binary-formatted IFC building models

Tracking #: 1625-2837

This paper is currently under review
Thomas Krijnen
Jakob Beetz

Responsible editor: 
Guest Editors ST Built Environment 2017

Submission type: 
Full Paper
To date, widely implemented and full-featured query languages for building models in their native formats do not exist. While interesting proposals have been formulated, their functionality is often incomplete and their semantics are not precisely defined. With the introduction of the ifcOWL Linked Data ontology as an internationally recognized modelling standard for building models, a representation of native architectural and engineering building models in RDF is provided, and such models can be queried using SPARQL. However, the size of the data sets handled in complex building projects makes the use of clear-text encoded Linked Data infeasible in many use cases. The IFC serialization in RDF is not as succinct as the STEP Physical File Format (IFC-SPF), in which IFC building models are predominantly encoded. This introduces a relative overhead in query processing and file size. The concern is aggravated when building models are coupled with heterogeneous, large-volume datasets, such as point clouds and sensor data. In this paper we propose a SPARQL implementation, compatible with ifcOWL, directly on top of a standardized binary serialization format for IFC building models, which is a direct binary equivalent of IFC-SPF, with less overhead than the graph serialization in RDF. A prototypical implementation of the query engine is provided in the Python programming language. The binary serialization format, which is based on HDF5, has several properties that make it suitable for querying. Due to the hierarchical partitioning and fixed-length records, known entity instances can be retrieved in constant time, rather than logarithmic time in a sorted or indexed dataset, or linear time in a traditional IFC-SPF model. Statistics, such as the prevalence of instances of a certain type, can be derived in constant time from the dataset metadata. With instances partitioned by their type, querying typically operates on only a small subset of the data.
To validate our approach and its performance, we compare the processing times for six queries on five building models. The Apache Jena ARQ query engine (using N-triples, Turtle, TDB and HDT), RDF-3X and the system proposed in this paper are compared. We show that in many realistic use cases the interpreted Python code performs on par with or better than state-of-the-art implementations, including optimized C++ executables. In other cases, due to the linear nature of the unindexed storage format, query times fall behind, but do not exceed several seconds, and as such are still orders of magnitude better than the time needed to parse N-triples and Turtle files. Due to the absence of indexes, the proposed binary IFC format can be updated without overhead. For large models, the proposed storage format results in files that are 2-3 times smaller than the currently most concise alternative.
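The storage properties named in the abstract (partitioning by type, fixed-length records, constant-time retrieval of known instances, linear scans otherwise) can be illustrated with a minimal sketch. It is not the authors' implementation and does not use HDF5: the record layout, the helper names `pack_partition`, `count`, `get` and `scan`, and the sample values are all assumptions made for illustration, using only the Python standard library.

```python
import struct

# Hypothetical fixed-length record: (instance id, height in mm).
# "<" disables padding, so every record is exactly 4 + 8 = 12 bytes.
RECORD = struct.Struct("<id")


def pack_partition(rows):
    """Serialize the rows of one entity type into a contiguous buffer."""
    return b"".join(RECORD.pack(*row) for row in rows)


def count(partition):
    """Number of instances, derived from the buffer size alone."""
    return len(partition) // RECORD.size


def get(partition, index):
    """Fetch a known record by computing its byte offset directly."""
    return RECORD.unpack_from(partition, index * RECORD.size)


def scan(partition, predicate):
    """Without an index, filtering on a value requires a linear pass."""
    return [row for row in RECORD.iter_unpack(partition) if predicate(row)]


# Instances are grouped by entity type, so a query about walls
# never reads the door records (sample data is invented).
model = {
    "IfcWall": pack_partition([(1, 2700.0), (2, 3000.0), (5, 2700.0)]),
    "IfcDoor": pack_partition([(3, 2100.0), (4, 2100.0)]),
}

second_wall = get(model["IfcWall"], 1)
door_count = count(model["IfcDoor"])
tall_walls = scan(model["IfcWall"], lambda row: row[1] > 2800.0)
```

Here `get` touches a single record regardless of partition size, while `scan` visits every record of one type only, which mirrors the trade-off the abstract describes between fast lookups of known instances and slower value-based queries.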