Review Comment:
GENERAL COMMENTS
The paper is very well written and straightforward to understand. The main contributions lie in the design of the dataset generation algorithm following the public transit planning methodology (as well as the population distribution) and a set of evaluation metrics for comparison with other (real) datasets. The generated datasets are of potential value to the research communities to benchmark different storage, query and search techniques.
Having said that, I find it hard to identify notable novelty in the paper, although there are some interesting ideas in it. This is my major (and only) concern, and the reason that I would not recommend immediate acceptance. The work in my opinion represents a significant amount of engineering effort and has achieved reasonably good results with respect to the evaluation metrics. Some parts of the paper are written in a style very much like a project deliverable (details can be found below); I suggest that the authors revise them. Below are some other comments that might help the authors further improve the presentation of the paper.
OTHER COMMENTS
- in the introduction, the authors claim that the main contribution of the paper is a mimicking algorithm for generating realistic public transport data. However, in the 1st paragraph, they also say that “different information architectures with varying trade-offs exist” and then quickly discuss the architectural choice. How does the architectural choice influence, or relate to, the focus of the paper?
- in the introduction, “Our second main contribution is an implementation of this algorithm…”. This could be combined with the 1st contribution as they are both about the algorithm.
- in the introduction, “semantic data models. [9,4]”, -> “semantic data models [9,4].”
- in section 2.1 Spatiotemporal Dataset Generation, “…over which objects can move at a certain speed based.”, remove “based”.
- in section 2.1, “…are represented as two-dimensional areas that exist for a period in time over the network, which apply a decreasing factor on the maximum speed of the objects in that area.”. This sentence does not read well and needs to be rephrased. How can one represent weather conditions and events as two-dimensional areas?
- in section 2.1, for the three methods to select the starting node, it would be better to present them in bullet points or a table. It would also be better to explicitly mention what the two/one dimensional distribution functions are.
- in section 2.1, “…generate new instances based on a set of dependency rules (called “rules” [20]).”, what is an instance here? It would be better to briefly mention what the instance represents in [20]. “(called “rules” [20])” can be removed. It would be better to present these dependencies in bullet points or a table.
- section 2.2 also discusses methods for synthetic dataset generation, but for RDF; it could be combined with section 2.1.
- in section 2.2, “…since this level of structuredness will have an impact on how certain data is stored in RDF data management systems”. I am not able to understand what is meant by this sentence here.
- I feel that section 2.3 is a specification or requirement for public transit design. It doesn’t seem appropriate to be under the “related work”. Also, the section could be shortened in my opinion. It would be better to present those items for objectives, metrics and so on, in either bullet points or tables.
- section 2.4 on transit feed format talks about the specification and knowledge representation in the domain. It does not seem appropriate under related work. Section 2.4 and section 2.3 are both about the background of the public transit domain. Separating them from the literature review might be a better idea.
- section 3, “The main objective of a mimicking algorithm is its ability to create realistic data.”, remove “its ability”.
- section 3, “by first comparing the level of structuredness of real-world datasets compared to their synthetic variants”. This sentence does not read well and needs to be rephrased.
- section 3, what is meant by “macroscopic coherence metric and domain-specific microscopic metrics”?
- section 4, Connection is an important entity in the design. It is worth explaining a bit more, e.g., how is it instantiated in the dataset, and how are the two ends of a connection defined?
- section 4.2, page 7, Euclidean distance: is it calculated using the geographical coordinates or the indices of the two-dimensional area for the region (matrix)?
- section 4.3, under “Short-distance”, “…maintains its center point that represents the average location of all stops…”, I would suggest using the term “centroid” in this case, especially since you have explicitly named the clustering algorithm. “…make sure that nearby clusters are merged before more distant clusters…”. This seems unnecessary as the standard agglomerative clustering will do that. What you need to do is choose an appropriate cutoff value on the dendrogram.
- section 4.3, under “Cleanup”, “…that the amount of stops…”, “…significant amount of loose stops…”, “number of stops” seems better than “amount of stops” here.
- Figures 3, 4, and 5, add some extra space between all the sub-figures.
- I can somehow understand Algorithm 1, but there are things to be clarified. Line 8, why do you use multiplication of the difference between coordinates and Radius? Line 10, what does the index i mean? What is “I”? Line 12, it does not show how the random station s’ is calculated.
- Equations 2, 3, and 4, the symbols should be explained in text.
- section 4.5, under “Delay”, it is worth explaining how the delay factors from the transport disruption ontology affect the trip, for example, how the trip time is updated in the presence of different delay types.
- section 5.1, first, these requirements are better suited to a project report than to a research paper; second, if they are to be kept in this paper, it would be better to make them short and to list them in a table.
- section 5.2, compossable -> composable. “Node module on NPM and as a Docker image on Docker Hub”, it would be better to use one sentence or two to explain what they are. I think many people, including me, don’t really know.
- Figure 6, “Each route has a different color, and darker route colors indicating more frequent trips over them…”, I am afraid that one cannot tell which routes are “darker” when each route has a different color.
- Table 2, would it be better to provide the table as a footnote with a link? Similar to the requirement specification in section 5.1, it is of less interest to the readers of the paper, who are primarily looking for novel ideas on design and implementation.
- section 6.1, under “Metric”, “…the coherence metric [14] measures the structuredness of a dataset.” -> “the coherence metric [14] measuring the structuredness of a dataset is used.”
- section 6.1, under “Results”, “That is because of … and the fact that they originate from GTFS datasets that have the characteristics of relational databases.”. Is there a correlation between high structuredness and the fact that the data is from a relational database?
- section 6.1, under “Results”, “the same amount of stops, routes and connections”, -> same number of
- section 6.2, “…and the other way around given a distance function…”, what is meant by “the other way around” here? It is obvious that the function in (6) is symmetric.
- section 6.2, under “Edges Distance”, “…weighed by the length of the edges”, -> “weighed by the difference of length of the edges”. Why is there a “+1” in equation (9)?
- section 6.2, “the realism of stops and routes is lower, but still sufficiently high to consider them as realistic”. But from Table 4 the data for routes is not sufficiently high. Also, it doesn’t seem reasonable to average the values across stops, edges, routes and connections. They would be better discussed separately.
- section 6.3, under “Results”, “…but now the increase for routes and connections is higher than for the connections parameter.”, do you mean higher than the stops parameter?