Publishing planned, live and historical public transport data on the Web with the Linked Connections framework

Tracking #: 2854-4068

Authors: 
Julian Rojas
Harm Delva
Pieter Colpaert
Ruben Verborgh

Responsible editor: 
Axel Polleres

Submission type: 
Full Paper

Abstract: 
Publishing transport data on the Web for consumption by others poses several challenges for data publishers. In addition to planned schedules, access to live schedule updates (e.g. delays or cancellations) and historical data is fundamental to enable reliable applications and to support machine learning use cases. However, publishing such dynamic data further increases the computational burden for data publishers, resulting in often unavailable historical data and live schedule updates for most public transport networks. In this paper we apply and extend the current Linked Connections approach for static data to also support cost-efficient live and historical public transport data publishing on the Web. Our contributions include (i) a reference specification and system architecture to support cost-efficient publishing of dynamic public transport schedules and historical data; (ii) empirical evaluations on route planning query performance based on data fragmentation size, publishing costs and a comparison with a traditional route planning engine such as OpenTripPlanner; (iii) an analysis of potential correlations of query performance with particular public transport network characteristics such as size, average degree, density, clustering coefficient and average connection duration. Results confirm that fragmentation size influences route planning query performance and converges on an optimal fragment size per network, in function of its size, density and connection duration. Our approach proves to be more cost-efficient and in some cases outperforms OpenTripPlanner when supporting the earliest arrival time route planning use case. Moreover, the cost of publishing live and historical schedules remains in the same order of magnitude for server-side resources compared to publishing planned schedules only. Yet, further optimizations are needed for larger networks (> 1000 stops) to be useful in practice. Additional dataset fragmentation strategies (e.g. geospatial) may be studied for designing more scalable and performant Web APIs that adapt to particular use cases, not only limited to the public transport domain.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
Anonymous submitted on 17/Nov/2021
Suggestion:
Accept
Review Comment:

In my initial review I have recommended the paper to be accepted under minor revision. All the comments I had raised have been addressed in the resubmitted version. I suggest to accept the paper in its present form.

Review #2
Anonymous submitted on 25/Jan/2022
Suggestion:
Minor Revision
Review Comment:

Comments regarding the reply to the previous review:

> "The main issue of the paper ..."

sufficiently addressed

> "For the evaluation the sentence ..."

The authors acknowledged the influence of network latencies on the contribution's results. They are right in abstaining from varying network latencies for reproducibility reasons. Still, the (low) latencies of a local network are not representative.

Realistically higher latencies would increase the query response times, even if only by an additional static delay per request. This increase could affect queries with a lower fragmentation size more than those with a greater fragmentation size. Because of that, the optimal fragmentation size could increase.
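The reviewer's argument can be made concrete with a back-of-the-envelope sketch: a query that scans a fixed number of connections needs more HTTP round trips when fragments are small, so a per-request latency penalty grows as fragment size shrinks. The function name, the cost model and all numbers below are illustrative assumptions, not values from the paper.

```python
import math

def query_time_ms(connections_scanned, fragment_size,
                  latency_ms, per_connection_ms):
    """Rough model: total time = (number of fragment requests x latency)
    plus a fixed processing cost per scanned connection."""
    requests = math.ceil(connections_scanned / fragment_size)
    return requests * latency_ms + connections_scanned * per_connection_ms

# Same query (10 000 scanned connections) under local vs. realistic latency,
# for a small (100 connections) and a large (1000 connections) fragment size:
for latency in (1, 50):  # ms per HTTP round trip
    small = query_time_ms(10_000, 100, latency, 0.01)
    large = query_time_ms(10_000, 1000, latency, 0.01)
    print(f"latency={latency} ms: small fragments {small} ms, "
          f"large fragments {large} ms")
```

Under this model the small-fragment query degrades far more when latency rises (100 requests pay the round-trip cost instead of 10), which is exactly why a realistic latency could shift the optimal fragmentation size upwards.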

> "Hypothesis H2 is not ..."

Hypothesis H2 is (still) hard to reject. The "specific set of topological characteristics" could be anything and is (still) not concrete enough. Which characteristics?

> "The evaluation does not attempt ..."

sufficiently addressed

> "In Section 3 it could be ..."

sufficiently addressed

> "Section 5.3: instead of ..."

sufficiently addressed

Additional (new) comments:

The definition of "network" (and its topology) is unintuitive. To my understanding a PT network (and its topology) changes very slowly; such changes could be new tracks being installed or a new railway station being built. Yet, the paper seems to assume the PT network (topology) is something like "the connections where a train is currently running".
See for example the caption (and the purpose) of Figure 5: it shows the "topological structure of the network [which] varies throughout the day". It does not look like a "topological structure" (which would change very slowly) but visualises the *load of the network*. "Graph snapshot", which is also used in the paper, makes more sense to me.

Table 2 is inconsistent with the list of definitions on page 12:
- stops vs. Size vs. |V|
- k vs. K (in the definition)

The y-axis of Figure 6 is easy to misinterpret: I have to keep reminding myself that the response time means the response time of a route planning query and not the response time of the individual HTTP requests. Similarly, the "connections" on the x-axis could be mistaken for HTTP connections instead of train connections.

Minor Remarks

- The table which is supposedly Table 4 is missing its "Table 4" caption heading.
- page 12 "we selected a represenantative set OF heterogeneous"
- page 24 "In Figure 9 we present the server side CPU use" is repeated.

Review #3
By Luis-Daniel Ibáñez submitted on 27/Feb/2022
Suggestion:
Minor Revision
Review Comment:

This paper is about publishing public transport data in a cost-efficient and flexible manner using Semantic Web technologies (the Linked Connections approach). The paper studies the performance tradeoffs of route planning queries under this publishing approach and compares it with a "traditional" implementation that uses a "fat" server instead of pushing some computation to intelligent clients.

DISCLAIMER: This is the first time I review this paper, which I see has been previously submitted. In line with what editors of this journal have asked me in similar situations, I read the paper with fresh eyes, without evaluating or assessing changes from previous versions.

(1) originality: The approach is original; there is enough difference with previous work from the same authors.

(2) significance of the results: Not groundbreaking, but in my opinion just sufficient. Indeed, the paper concedes that in its current form the approach is still far from practical for larger networks. Plus, Linked Connections' inherent bandwidth overhead means that mobile apps, which I think are the most common client applications, would be slower and more expensive. That is relatively bad news for the SemWeb community, but still a valid scientific result vis-a-vis the methodology followed.

(3) quality of writing: Can be improved, further comments below.

(4) Resources: A GitHub repository; the repo also includes an external link to an institutional repository that I assume complies with the requirements of research data deposit.

Detailed comments:

Abstract:

You say "fragmentation size influences route planning query performance and converges on an optimal fragment size per network, in function of its size, density and connection". From the results shown in section 6.1 I can't see where that function is. I was expecting you to derive something like F(S,D,C) -> Fragment Size.

Introduction:

The paper motivates well with respect to "Open Data", but jumps to "Public Transport Data" without explaining why the client-server cost tradeoffs are important for that domain. What is wrong with current PT open data publishing?

The contribution "Shows how Semantic Web technologies can be applied not only to describe domain specific data, but also interfaces that enable applications to consume it, whose principles could be re-used towards more generic, domain-independent and autonomous data applications" is quite fuzzy. I'm not certain what is meant by "generic application" or "autonomous application" and how what is presented here helps towards that. What is presented here is for the PT domain (as stated in the immediately previous sentence), therefore I don't see how your contribution creates "domain-independent" applications.

Related Work:

In section 2.2 it is mentioned that the approach ultimately lowers the cost for data publishers. Please provide references for this; has it been quantified anywhere? I believe previous work from some of the authors has shown the load balancing part, but not the cost for publishers. I also note that in the context of PT, the publisher is usually the transport agency or provider, which has a mandate (or a business interest) to develop a client application too; how does the cost balance work there?

The same remark appears in section 3, where you mention the tradeoff of "increased implementation complexity on the client". You mention a mitigation strategy at that point, but this should be expanded in the discussion section.

In terms of contributions, I can see what a "general" architecture is, but the adjective "integrated" does not add anything.

"An study of the factors that influence route planning query performance": it should be specified that this is query performance under the data publication conditions imposed by your approach.

Section 3:

The AVL tree is nice, but it is part of the implementation; I don't understand why it is considered part of the "LC architecture". If your architecture is "general", then the Live Data Manager is a component that does something, and the AVL tree is just your implementation of it.

Section 4:

Minor: "set heterogeneous" -> set of heterogeneous

The word "observed" does not sit right with me; "measured" seems more appropriate.

To me, this section should be about the datasets and metrics used for the experimentation. You include here the choice of modeling as a TVG, which I believe is part of your approach, and specifically of your reference implementation of the general architecture.

There is no explanation of why the 22 PT networks were chosen; were they the only ones available? You mention they are "representative in terms of modes of transport and geographical coverage". Did you consider a larger set and then discard some? Did you choose them to have a variety of sizes/degrees/densities? (It does not seem to be the case.) What timeframes were considered and why? In the caption of Table 2 you mention the "number of active stops during their busiest day"; what does "busiest" mean and how was it established?

Section 5:

The formulation of the hypotheses is not consistent with the questions. RQ1 is formulated as "What is..." but H1 is "There is...". For RQ2, the question is "What is", but the hypothesis is not concrete enough: what is the "specific set of topological characteristics"? The one that you hypothesise? It seems both questions need to be rewritten as "is there" questions. Another thing that bothers me about writing these as hypotheses is that if you formulate them as statistically testable, then you need an actual statistical test, which you only have for RQ2.

It is unclear what the assumption "PT route planning queries will be normally evaluated within the span of one day" means. After reading other parts of the paper, it seems to mean that queries are made for travel on the same day.

I'm confused about the relationship between the paragraph on "Smallest fragment possible", where you talk about the "number of connections allowed per document" and state that "with this lower bound we were able to fragment the rest of the collection in fragments containing similar number of connections and hence a similar size", and the "fragmentation sets", where you talk about "connections per fragment". If you use that lower bound as a guide for the size of the fragments, how does it then make sense to vary them in fixed sizes? It seems to me the last sentence of "Smallest fragment possible" may be poorly written.

In Table 3, the term "query length" has not been used before; what does it mean?

Overall, this section has a lower writing quality than the others, and would benefit from additional proof-reading.

Discussion and conclusion:

MINOR: neglible -> negligible

Can you elaborate why Spain-Renfe has the optimal point?

With respect to historical data, on p6 you mention that the main issue is that this data is not currently being published. Let's assume a publisher is willing to publish it (instead of hiding it for business reasons): what is the advantage of using your approach over a data dump? You mention machine learning algorithms as beneficiaries; wouldn't those require a full dump? A statistical analysis would need the same. If I got it right, in your experimentation you use the same queries for historical data as for the Live setting, but are queries for historical use cases the same as for live use cases? It is not clear to me, and I would say they aren't.

You mention that the optimal fragmentation size is "related to the average scanned connections of the query set". There is no mention in section 6.1 or Figure 6 of E(SCQ), just some references to Table 2 for values of K and D (which are quite heavy for a reader to go check). What is the support for this statement? I think you need some visualisation of this in section 6.1.

You stress a lot the "cost-efficiency" of your approach for publishers (presumably small agencies on a budget). You even mention on p27 that "...more expensive servers will be needed with OpenTripPlanner than with LC Server", but I'm missing at least an estimation of how much more (in money), based perhaps on current average cloud or web server costs.

Following on my remark on statistical tests in section 5, I don't think you can write "accept the hypothesis" for RQ1 and RQ3 in the conclusion.