Generating Public Transport Data for the Web based on Population Distributions

Tracking #: 1797-3010

This paper is currently under review
Ruben Taelman
Pieter Colpaert
Ruben Verborgh
Erik Mannens

Responsible editor: Guest Editors Benchmarking Linked Data 2017

Submission type: Full Paper
Applying Linked Data technologies to geospatial and temporal data introduces many new challenges, such as Web-scale storage, management, and transmission of potentially large amounts of data. Several benchmarks have been introduced to evaluate the efficiency of systems that aim to solve such problems. Unfortunately, the synthetic data that many of these benchmarks work with have only limited realism, raising questions about the generalizability of benchmark results to real-world scenarios. Real-world datasets, on the other hand, cannot be configured as freely and often cover only certain aspects. To benchmark geospatial and temporal RDF data management systems with sufficient external validity and depth, we designed PoDiGG, a highly configurable generation algorithm for synthetic datasets with realistic geospatial and temporal characteristics comparable to those of their real-world variants. The algorithm is inspired by real-world public transit network design and scheduling methodologies. This article discusses the design and implementation of PoDiGG and validates the properties of its generated datasets. Our findings show that the generator achieves a sufficient level of realism, based on the existing coherence metric and on new metrics that we introduce specifically for the public transport domain. PoDiGG thus provides a flexible foundation for benchmarking RDF data management systems with geospatial and temporal data.