RML Mapper: a tool for uniform Linked Data generation from heterogeneous data

Tracking #: 1730-2942

Anastasia Dimou
Ben De Meester
Pieter Heyvaert
Ruben Verborgh
Steven Latré
Erik Mannens

Responsible editor: 
Aidan Hogan

Submission type: 
Tool/System Report
Linked Data is often considered a means of integration among data residing in different data sources. In real-world situations though, data sources of various formats are accessed using different protocols and contain data in various structures, formats and serializations. Generating Linked Data from these data sources remains complicated, despite the significant number of existing tools, because the latter provide their own – thus not interoperable – format- and source-specific approaches. Linked Data generation is facilitated by mapping languages which detach the rules from the implementation that executes them, rendering them interoperable among different solutions, whilst systems that process those rules are use-case independent. Nevertheless, different factors influence the Linked Data generation process. Thus, diverse systems may be implemented to efficiently execute those mapping rules. In this paper, we present the RMLMapper, a tool for Linked Data generation from data with heterogeneous structure, format and serialization, which is retrieved from different data sources with various access interfaces. We (i) introduce the RMLMapper's design choices and architecture, and (ii) demonstrate evaluation results and use cases where the RMLMapper was adopted, showing that the RMLMapper is well-adopted and capable of generating Linked Data in competitive time.


Solicited Reviews:
Review #1
By Peter Haase submitted on 12/Nov/2017
Major Revision
Review Comment:

The paper presents RML Mapper, a software tool for mapping data from a variety of data formats to RDF using a uniform approach.
The mappings are based on the RML mapping language, which - unlike e.g. R2RML - supports heterogeneous data formats as input.
The RML Mapper software is available as open source under a permissive license on GitHub. It is fairly widely used and mature software.

The paper describes the design choices, architecture, performance characteristics, real-life use cases as well as related work very comprehensively. The paper is well structured and written. The use cases demonstrate the impact of the tool.

At the same time, the paper is very lengthy and long-winded in parts, providing details that are not really relevant or can be presented in a much more compact and concise manner. Tool papers are expected to be short papers.
Examples of parts that can be significantly shortened or removed are:
- The previous phases of development of RML Mapper (2.1)
- Algorithm 1 is so straightforward that it can be explained in one sentence and does not benefit from an artificial formalisation in an algorithm environment.
- Similarly, the formalisation of the number of triples generated as well as the subsequent experimental results seem very long-winded considering the relatively simple facts and connections.

Consequently, I suggest a revision to significantly shorten the paper to the expected length of a tool/system paper as per the submission guidelines of the journal.
(“Reports on tools and systems – short papers describing mature Semantic Web related tools and systems. These reports should be brief and pointed, indicating clearly the capabilities of the described tool or system”)

Minor comments:
- pg 13: rectriced -> restricted
- pg 13: DBpedia stem originally -> DBpedia stems originally

Review #2
Anonymous submitted on 12/Dec/2017
Major Revision
Review Comment:

This manuscript was submitted as 'Tools and Systems Report' and should be reviewed along the following dimensions: (1) Quality, importance, and impact of the described tool or system (convincing evidence must be provided). (2) Clarity, illustration, and readability of the describing paper, which shall convey to the reader both the capabilities and the limitations of the tool.

(1) Quality, importance, and impact of the described tool or system (convincing evidence must be provided).

The authors present RML Mapper, a tool for mapping heterogeneous data to Linked Data. The title and the paper use the term "Linked Data", but the paper addresses RDF generation rather than link generation, so RDF generation would be a more proper term. The word heterogeneous should be defined precisely, as the paper sometimes uses the term as "heterogeneous data" and sometimes as "heterogeneous formats". The paper addresses heterogeneous formats and, like most rule-based mapping tools, assumes mostly homogeneous data within each source.

Overall, the paper is interesting but needs significant work. The topic is important, but the system is not analyzed thoroughly so it is hard to understand the boundaries of the situations it addresses.

Section 1: Introduction

The intro is fine and makes the point that data exists in multiple formats and people want to map it to RDF. The argument against "ad-hoc" solutions is unclear, as it is unclear what the authors mean by reuse. I would think the benefit of a multi-format tool is that users only need to learn one tool to map a variety of data. Currently they may need to learn one tool for SQL databases (R2RML), a different tool for XML, and yet another for RDF (e.g., SPARQL CONSTRUCT queries). It is unclear how rules defined for one source in one format could be reused for the same data in a different format. Does this use case arise in practice?

Section 2: RML Mapper

Section 2.1: unclear why this section is here; why should we care about the development phases? The paper reports on the state of the system now.

Section 2.2: a recap of RML; I recommend combining the third-level subsections as there is no need for the subsubsections. This section is fine.

Section 2.3: Architecture: should be combined with 2.4, which is the detailed description of the architecture.

Section 2.4: Modules

I got very lost in this section, buried in details, having never grasped the high-level idea or flow. The basic idea, as I understand it, is to have an R2RML processor with a more general API to access data, without assuming that each item being processed is a row from a table. This never comes across. Algorithm 1 is the R2RML algorithm, as far as I can tell; the authors should explain where it is being generalized to handle multiple formats.

A key issue is the handling of joins, as a general tool must do joins on data from multiple formats. This is a key challenge as it would require doing the joins in memory or in a database built into the tool, which still means ingesting all external data into an internal database for the benefit of joining.
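The cross-format join the review describes usually reduces to an in-memory hash join once both sources are ingested into a common representation. A minimal sketch in Python, with hypothetical sample data and field names (the key "dept_id" and the record shapes are illustrative assumptions, not taken from the paper):

```python
import csv
import io
import json

# Hypothetical sample data: one CSV source and one JSON source
# sharing a join key ("dept_id"); all names here are illustrative.
csv_data = "dept_id,employee\n1,Alice\n2,Bob\n1,Carol\n"
json_data = '[{"dept_id": "1", "name": "Research"}, {"dept_id": "2", "name": "Sales"}]'

# Ingest both heterogeneous sources into plain in-memory records.
employees = list(csv.DictReader(io.StringIO(csv_data)))
departments = json.loads(json_data)

# Build a hash table on the join key from one source,
# then probe it with the records of the other source.
by_dept = {d["dept_id"]: d for d in departments}
joined = [
    {"employee": e["employee"], "department": by_dept[e["dept_id"]]["name"]}
    for e in employees
    if e["dept_id"] in by_dept
]

print(joined)
```

The point of the sketch is the cost model the review hints at: both sources must be fully materialized in memory (or in an embedded store) before the join can run, regardless of their original format.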

I recommend significantly revising the structure of this section. Please look at the detailed feedback in the manuscript.

Section 3: Evaluation

Section 3.1 Formal: this section needs motivation. I was expecting a discussion of metrics to evaluate RML Mapper. The section focuses on number of triples, a useful metric, but not the only relevant metric. I can think of many metrics to evaluate a multi-format mapping tool:

1) Scalability (seems to be the primary focus of the authors): how many items are being consumed, or how many triples are being produced. I would recommend measuring the size of inputs in terms of some generalization of "cell", which makes a nice tie with R2RML. Perhaps only count the number of cells that influence the output.

2) Source complexity. There seems to be a progression from tables to trees to graphs, can RML handle all three, any general graph?

3) Rule complexity. Here is where joins would be addressed, as one notion of complexity is the kinds of joins that need to be performed. Are there other kinds of rule complexity? E.g., XPath expressions can be complex to evaluate as they may produce a large number of matches; is there a way to measure this as some kind of source complexity?

4) User understandability and user population: how well can users use RML Mapper? How much training do they need? What level of complexity were they able to solve? Who are the users besides the RML developers?

5) Use cases: how many sources in each use case, what was the format of each source, and how are the sources related? I.e., what joins are required? Were there data preparation tasks? Were these done using the functional language?

The mathematical formalism in the evaluation section seems to me out of place. A mathematical formalism would have been more relevant in the approach section. The focus here is on measuring the number of triples, which is not that important in my view. At the end of the day, all datasets used are rather small. Extrapolating to 1M and 10M records would be useful as these are the sizes of DBpedia and many databases. For the readers it is important to know how far the system will go.

3.2.2 DBpedia: this is one of the most interesting parts of the paper as this is a really nice use case. However, the authors do not explain why RML Mapper does not create all the triples. Please see the notes on the paper about this specific issue. Many more details need to be provided to make the DBpedia use case more compelling.

4. Use Cases

This section is very important and could be easily improved to significantly strengthen the paper.

4.1 iLastic: interesting use case; seems very similar to VIVO, which is addressed using semantic technologies, so an interesting comparison would be possible.

The section provides many details about the application, including a linking problem, which is out of scope for RML Mapper. The section should focus on the role of RML Mapper, which is not clear. See manuscript for additional comments.

4.2 Combust: this use case is interesting because it looks like it was done by people other than the tool developers. Is this the case? How did they learn the tool? What were they able to do? How complex are the mappings they did? How did they request help when they got stuck (I am sure they did, as RML Mapper is not yet a production system)? Even anecdotal material would be interesting here.

4.3 DBpedia: see comments in the manuscript

5. Related Work

The related work section reads like an annotated bibliography. It would be better to organize the field according to dimensions. One dimension is source format (with multiple values: SQL, CSV, JSON, XML, RDF, ...), another dimension could be size (in memory, larger than memory), and another would be expressiveness of the mapping language (R2RML, joins, data cleaning, ...).

(2) Clarity, illustration, and readability of the describing paper, which shall convey to the reader both the capabilities and the limitations of the tool.

As mentioned in the previous section and in the detailed comments in the manuscript, the paper needs significant revisions to enhance readability. Many sections delve into low-level details without first presenting a big picture or motivation. It would be important to distinguish algorithmic details from programming details. It is important in the paper to highlight the algorithmic details, but not the programming details; e.g., GitHub repos etc. can all be packed into a succinct implementation section.

The current paper does not cover the limitations of the tool. To address this, the experimental section needs to be expanded.

Review #3
Anonymous submitted on 13/Dec/2017
Review Comment:

This paper presents a system description of the RML Mapper tool. I understand that the authors have developed a system and want to share details about their system. However, this paper is not ready for publication due to two main reasons: readability and evaluation. Therefore my recommendation is to reject. I believe the authors need to do a complete rewrite of the paper.

Issue 1: Readability

Unfortunately this paper is very hard to follow because it seems to have been written for an audience that is already aware of the details of the RML Mapper system. It is not accessible to a more general audience. The flow of the article is lacking. It immediately jumps into the architecture of the system without starting with some high-level context. There is no description of the RML language. Terminology has not been defined.

These are examples of what I struggled with initially:

pg 2:

“By these means, we aim to ensure that agents are able to generate Linked Data ..” —> what are agents in this case? Software? People? Be specific
Section 2.1. Specifying the journey of how the system was designed is superfluous, unless there are interesting, key insights.
“bellow” —> below

pg 3:
- From what I understand, RML is a slight extension to R2RML. There is no mention on how these two languages are related.

pg 4:
- "The Logical Source Handler transforms any input data into an iterator of a set of data items” —> What are “data items”?
- "It actually returns a certain iteration or a value for a certain reference of a certain iteration.” —> What does this actually do? One or the other? Both? When is what done?
- The Mapping Document Handler (Figure 1, 2) —> it should be (Figure 1, 1)
- “mapping driven paradigm”
- “data driving paradigm”
- The description of the RML Processor in Section 2.4.5, which is a core part of the system is just one sentence. Absolutely no details. What does this actually do?
- Section 2.4.6. I don’t see 9 in Figure 1.

pg 5:
- “The function handle (figure 1, 8)” —> 8 is the Metadata Handler
- “… connects the RML Processor with the independent Function Processor” —> I don’t see Function Processor in Figure 1.
- Algorithm 1 appears. Nowhere in the text is it referenced. Additionally, there are functions in Algorithm 1 that are not defined.

As you can see, by now I’m confused and struggling to understand. I could keep going but this is enough. The paper should be rewritten completely so there is a natural flow. For example, the paper should start with a high-level intro to the system. There should be a running example. Sample data in CSV/JSON/XML should be given. A sample RML mapping should be presented and explained. Then each part of the system can be explained.

Issue 2: Evaluation

The evaluation section continues to be puzzling.

First of all, why is there a Section 3.1 Formal in an evaluation section? This should come up front as a Preliminaries or Definitions section. Furthermore, the definitions presented in this section are never used anywhere else. What is the goal of this section? Finally, this is not a true formalization. For example, there is no initial definition of what R, I, t, V are. R is the set of references to the input source. But what is a reference? And R should be the universe of all references. Same for I. What is an iteration? V is a function apparently. What are its domain and co-domain?

What is the purpose of the evaluation? What is the hypothesis? What are the claims? What are you trying to prove? What evidence are you looking for to support which hypothesis/claims?

I don’t believe that you are trying to provide evidence that your system supports multiple data structures (btw, it’s not data structures but data formats). That is a given.

Table 2 has size. But that should be number of records. If I understand that correctly, a CSV file with 100K rows took 1493 seconds (almost 25 minutes)? There has to be something fatally flawed. This cannot be true.
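For context, the throughput implied by those figures (simple arithmetic on the numbers quoted above):

```latex
\frac{1493\ \text{s}}{100{,}000\ \text{rows}} \approx 14.9\ \text{ms per row},
\qquad
\frac{100{,}000\ \text{rows}}{1493\ \text{s}} \approx 67\ \text{rows per second}.
```

Roughly 67 rows per second for flat CSV input would indeed be orders of magnitude below what a streaming mapper should achieve, which is what makes the reported number suspect.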

"For instance, the RMLMapper generates a dataset by integrating two data sources of CSV format with 100 and 1,000 records respectively, in 9 seconds which is only 2 seconds more, mainly due to the join execution, …” —> 100 and 1,000 records can’t even be considered toy examples. Additionally, you are mentioning joins. This is the first I read about this. Where did this come from? The only mention of joins is in Section 2.5, Iteration Processing. What join algorithm is being used? And why? Joins are a big deal and this should be thoroughly fleshed out.

Section 6 Conclusion is not a conclusion section because nothing is being concluded. It is a simple summary.