A Linked Data extraction architecture for engineering software tools

Tracking #: 1665-2877

Jad El-khoury
Andrii Berezovskyi
Mattias Nyberg

Responsible editor: 
Jens Lehmann

Submission type: 
Full Paper
New industrial initiatives such as Industrie 4.0 rely on digital end-to-end engineering across the entire product lifecycle, which in turn depends on the ability of the supporting software tools to interoperate. A tool interoperability approach based on Linked Data has the potential to provide such integration, where each software tool can expose its information and services to other tools using the web as a common technology base. In this paper, we report on our negative findings when attempting to use existing Linked Data extraction techniques to expose the structured content managed in engineering tools. Such techniques typically target the storage facility to extract and transform its content, with the assumption that sufficient information is available within the facility to automate the process. Based on a case study with the truck manufacturer Scania CV AB, our study finds that an engineering tool manages its artefacts using business logic that is not necessarily reflected in the storage facility. This renders existing extraction techniques inadequate. Instead we propose an alternative Linked Data extraction architecture that can mitigate the identified shortcomings. While less automated compared to the existing solutions, the proposed architecture is realised as a standalone library that can still facilitate the extraction process.

Solicited Reviews:
Review #1
By Benjamin Cogrel submitted on 08/Sep/2017
Review Comment:

In this work, the authors are interested in designing a read-write Linked Data interface for a legacy information system that relies on a relational database. More precisely, their goal is to set up a read-write OSLC web service together with a read-only SPARQL endpoint, where the latter brings rich querying capabilities missing in the former interface.

In the first part of this paper, they performed a two-step experiment with existing technologies to achieve this objective. First, they proposed to use a triplestore as a SPARQL endpoint and to set up an ETL process, based on D2RQ, for populating it with the database content. This process relies on an RDB-to-RDF mapping which is first bootstrapped from the database schema and then manually edited. Next, they reused the mapping to bootstrap an OSLC model with ORM (Hibernate) annotations, and they manually enriched this model before using it to generate a ready-to-run OSLC service. As experienced by the authors, this solution has an important limitation: by interacting directly with the database, it bypasses the part of the business logic that resides not inside the database but inside the application controller of the legacy system.
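
To make the ETL step concrete, the following is a minimal sketch of materialising relational rows as RDF triples from a declarative mapping. It is an illustration only and does not use D2RQ or its mapping language: the `requirement` table, the `MAPPING` structure, the `http://example.org/` base IRI and the function names are all assumptions introduced here.

```python
import sqlite3

BASE = "http://example.org/"  # hypothetical base IRI
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical mapping: table name -> class IRI, key column, column -> predicate IRI.
MAPPING = {
    "requirement": {
        "class": BASE + "vocab#Requirement",
        "key": "id",
        "columns": {"title": "http://purl.org/dc/terms/title"},
    }
}


def extract_triples(conn, mapping):
    """Read every mapped row and return it as a list of N-Triples lines."""
    triples = []
    for table, spec in mapping.items():
        cols = [spec["key"]] + list(spec["columns"])
        cursor = conn.execute(f"SELECT {', '.join(cols)} FROM {table}")
        for row in cursor:
            subject = f"<{BASE}{table}/{row[0]}>"
            triples.append(f"{subject} <{RDF_TYPE}> <{spec['class']}> .")
            for value, predicate in zip(row[1:], spec["columns"].values()):
                if value is not None:
                    # Escape backslashes and quotes for the literal.
                    literal = str(value).replace("\\", "\\\\").replace('"', '\\"')
                    triples.append(f'{subject} <{predicate}> "{literal}" .')
    return triples


def demo():
    """Build a one-row in-memory database and extract its triples."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE requirement (id INTEGER PRIMARY KEY, title TEXT)")
    conn.execute("INSERT INTO requirement VALUES (1, 'Brake response time')")
    return extract_triples(conn, MAPPING)
```

Because the sketch reads the tables directly, any rule enforced only in the application controller (for example, hiding artefacts in a particular state) is invisible to it, which is the limitation described above.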

To address this issue, they made the radical choice of ignoring the database layer by focusing instead on the application controller. This choice led the authors to propose an architecture where the OSLC web service is designed manually (that is, using the standard procedure), and where a novel component, called the Lyo store, is introduced between the application controller and the triplestore. This component is in charge of keeping the triplestore in sync with the changes communicated by the application controller. In this architecture, developers are required to implement several application-specific classes both at the OSLC web service and the Lyo store levels. This architecture has been tested on three legacy systems.
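
The synchronisation role described here can be sketched as follows. This is not the Lyo store API: `TripleStore`, `SyncStore`, the `on_change` method and the change kinds are hypothetical names chosen for illustration, and the triplestore is replaced by an in-memory dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class TripleStore:
    """In-memory stand-in for a triplestore, keyed by resource URI."""
    graphs: dict = field(default_factory=dict)

    def put(self, uri, triples):
        # Replace all triples held for this resource.
        self.graphs[uri] = set(triples)

    def delete(self, uri):
        self.graphs.pop(uri, None)


class SyncStore:
    """Hypothetical intermediate component: mirrors controller changes into the store."""

    def __init__(self, store, fetch_resource):
        self.store = store
        # Callback into the application controller, so that its business
        # logic decides what a resource currently looks like.
        self.fetch_resource = fetch_resource

    def on_change(self, uri, kind):
        if kind == "deleted":
            self.store.delete(uri)
        elif kind in ("created", "modified"):
            self.store.put(uri, self.fetch_resource(uri))
        else:
            raise ValueError(f"unknown change kind: {kind}")
```

The design point is that the store is populated only through the controller callback, never from the database tables, so the triplestore content reflects whatever the controller chooses to expose.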

Main comments:

* In my view, the central question of this work is how to deal with the business logic that resides in the application controller but not in the database layer. I found the choice of the authors to discard the database layer too radical and disappointing, since it leads to an architecture less interesting than the one proposed in the experimentation section. Indeed, having an RDB-to-RDF mapping can be very valuable, since, as suggested by the authors, it could be used for partially generating a web service model and for other reasons I will later elaborate on. Therefore, it would have been interesting to study how to model the controller-specific business logic and to understand how it complements the business logic that can be extracted by analyzing the database schema. Said differently, I would have expected more modelling in a declarative fashion (what the Semantic Web is, in my view, actually all about) than less (what is achieved by discarding the mapping). Instead, the description of this controller-specific business logic remains shallow.

* During the experimentation, the authors mentioned that, equipped with an RDB-to-RDF mapping, D2RQ supports virtual RDF graphs (i.e. rewriting SPARQL queries into SQL queries), but this approach was discarded in favor of the ETL approach (where the RDF graph is materialized and stored in a triplestore). This is an important design choice, yet it is not justified. The virtual graph option would allow both the web service and the SPARQL endpoint to query the same database without having to deal with the maintenance burden induced by the ETL approach. Also, since the performance characteristics may differ significantly between a triplestore and a SPARQL-to-SQL system (see e.g. [1]), it could have been interesting to evaluate these two alternatives so as to understand which one performs better in their setting.

* The literature related to RDB-to-RDF management mentioned in this paper is outdated: it predates R2RML (published in 2012) whereas the domain is still active and papers are regularly accepted in top Semantic Web conferences and journals (see, e.g., [1,2,3]). Important trends in this domain, such as the Ontology-Based Data Access (OBDA) approach, seem to be largely ignored. It would have been better to consider the R2RML and RDB direct mapping W3C standards (which have both existed for five years) rather than prior proposals, the native D2RQ mapping language and the D2RQ bootstrapping feature. Note that the last version of D2RQ is already five years old and does not support R2RML while all the recent systems do.

* Designing an RDB-to-RDF mapping that produces a meaningful and valuable RDF graph indeed remains a challenging task which requires a significant amount of human curation, even after using semi-automatic tools. It would be interesting to provide more details about this step, such as showing how far from each other the bootstrapped and the final mappings are.

* Evidence about the performance of the incremental ETL component would have been appreciated. Too little information is provided about this component. In particular, it would be important to understand what precise properties it offers, and how it relates to industrial frameworks such as Kafka.

* Has the proposed architecture been applied to the first legacy system, SesammTool? How does this architecture compare, in terms of integration effort, with the previous architecture?
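
The virtual-graph alternative raised in the comments above can be illustrated with a minimal sketch in which a single triple pattern `?s <predicate> ?o` is rewritten into SQL at query time instead of being answered from a materialised copy. This is a toy under stated assumptions, not how D2RQ or any OBDA system is implemented: `PREDICATE_MAP`, the `requirement` table, the base IRI and the function names are hypothetical.

```python
import sqlite3

# Hypothetical mapping: predicate IRI -> (table, key column, value column).
PREDICATE_MAP = {
    "http://purl.org/dc/terms/title": ("requirement", "id", "title"),
}


def rewrite_pattern(predicate):
    """Rewrite the triple pattern `?s <predicate> ?o` into a SQL query string."""
    table, key, column = PREDICATE_MAP[predicate]
    return (
        f"SELECT {key}, {column} FROM {table} "
        f"WHERE {column} IS NOT NULL ORDER BY {key}"
    )


def answer_pattern(conn, predicate, base="http://example.org/"):
    """Answer the pattern directly against the live database."""
    table = PREDICATE_MAP[predicate][0]
    sql = rewrite_pattern(predicate)
    return [(f"{base}{table}/{key}", value) for key, value in conn.execute(sql)]


def make_demo_db():
    """Create a one-row in-memory database for the example."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE requirement (id INTEGER PRIMARY KEY, title TEXT)")
    conn.execute("INSERT INTO requirement VALUES (1, 'Brake response time')")
    return conn
```

Since every query is answered from the live tables, an update to the database is visible immediately and no separate synchronisation step is needed. The cost is that query performance then depends on the generated SQL rather than on a triplestore.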

To conclude, I do think that combining a read-write OSLC web service and a read-only SPARQL endpoint above a legacy information system is a good idea that deserves being studied, but this should be achieved with significantly more precision than in this paper. For instance, by looking carefully at what has been reported by the authors as a negative result, it would probably be possible to identify more promising outcomes. A detailed study on the interaction between the RDB-to-RDF mapping and the OSLC service model would also be very interesting. I also encourage the authors to evaluate the OBDA approach as a possible SPARQL endpoint solution. In terms of contribution, I agree with the authors in considering the experimentation as the main contribution of this paper, since it touched on several interesting questions. However, in its current form, I do not consider this paper to be suitable for publication.

Other minor comments:

- The update service box is present in Figure 7 but not in Figure 6 while it appears outside of the Lyo store.
- The two last paragraphs of Section 4.1 are unclear.

[1] Calvanese, Diego, et al. "Ontop: Answering SPARQL queries over relational databases." Semantic Web Journal (2017): 471-487.
[2] Jiménez-Ruiz, Ernesto, et al. "BootOX: Practical mapping of RDBs to OWL 2." International Semantic Web Conference, 2015.
[3] Sequeda, Juan F., Marcelo Arenas, and Daniel P. Miranker. "OBDA: query rewriting or materialization? In practice, both!" International Semantic Web Conference, 2014.

Review #2
By Andrea Giovanni Nuzzolese submitted on 12/Dec/2017
Review Comment:

The paper presents an architectural solution for generating and maintaining Linked Data from non-RDF legacy sources. The proposed architecture is designed for enabling (i) user-based access; (ii) REST API; (iii) SPARQL-based access; (iv) smooth integration with legacy databases and associated business logics; and (v) life-cycle management of Linked Data.
The architectural proposal emerges from the analysis of a set of requirements that are identified by experimenting with the integration of D2RQ with SesammTool, which is a tool used by the truck manufacturer Scania for managing engineering artefacts. The authors distinguish between two different scenarios: (i) one aimed at evaluating D2RQ and its limitations and (ii) the other aimed at complementing D2RQ with the OASIS OSLC standard for enabling READ/WRITE access to RDF resources via REST services.

==== Overall comments ====
The paper is well written and structured in all its parts.
The problem of extracting Linked Data from legacy data sources is relevant to SWJ. It is also challenging as it combines knowledge engineering, business logics, data access along with a variety of problems such as the life-cycle management of Linked Data and their synchronisation with the legacy source when updates (e.g. READ/WRITE operations) are performed.
As a matter of fact, in recent years a number of solutions and initiatives have been proposed, e.g. D2R Server, R2RML, Linked Data Platform, etc.

=== Strengths ===
The related work section provides an exhaustive summary of most of the related work, including the pros and cons of each solution.
The design of the architecture is driven by an industrial use case that, in principle, can provide valuable hints about the effectiveness of the solution proposed in real world scenarios.

=== Weaknesses ===
Nevertheless, the paper shows significant weaknesses that, in my opinion, prevent its publication in its current form.
Those weaknesses are:

+++ Lack of proper graphical notation +++
The authors present figures that show the architectural solution at different granularity levels. However, those figures are not presented using a rigorous graphical notation.
The authors never introduce the semantics associated with the notation used.
In some cases (e.g. Figure 6) the authors combine different diagrams with different semantics (i.e. a use case diagram and a sort of deployment diagram) and this is misleading.
The architecture needs to be described with a more proper and formal graphical notation that can capture the different characteristics of the solution proposed by the authors.

+++ Requirements elicitation +++
The requirements are elicited by analysing the lessons learnt from experimenting with D2R to extract Linked Data from SesammTool. However, D2R is only one of the state-of-the-art tools for enabling Linked Data extraction from relational databases. Moreover, D2R is not a standard, and it sounds strange that the requirements come from the analysis of an existing tool instead of being gathered from scenarios. It is fair to present to what extent D2R covers the requirements identified, but this analysis should take into account a broader spectrum of state-of-the-art tools.

+++ Evaluation +++
The authors assess the effectiveness of the architecture in terms of requirements addressed.
This kind of evaluation is weak and naïve. On the contrary, the evaluation of a software architecture is methodologically complex. Some methods have been proposed so far for supporting the design of software architectures. Please refer to [1] for an exhaustive comparison of evaluation methods. Some of those methods are: (i) Architecture Trade-Off Analysis Method (ATAM); (ii) Software Architecture Evaluation Model (SAEM); and (iii) Scenario-Based Architecture Reengineering (SBAR).
Additionally, the authors claim that their architecture has been successfully applied in the development of three tool adaptors (i.e. Bugzilla, JIRA and ProR), but only a few details are provided. For those adaptors the authors should carry out a more rigorous evaluation based on clearly defined metrics and/or behaviours.

[1] Dobrica, Liliana, and Eila Niemela. "A survey on software architecture analysis methods." IEEE Transactions on Software Engineering 28.7 (2002): 638-653.

Review #3
Anonymous submitted on 10/Jan/2018
Minor Revision
Review Comment:

The paper reports on the challenges faced when attempting to expose the relational data of an engineering tool as RDF while preserving its business logic. Based on these findings, the authors propose their own architecture including a reference implementation to demonstrate that such an implementation is feasible.

The findings are very interesting and important. The paper is easy to read and the results are well presented.

Considering that these are the results of the single tool being evaluated, one can imagine that there is a long path to go towards the "tool interoperability" utopia.
However, as interesting as the deduced requirements in Section 5 are, it would be even more interesting to see which of the established business solutions would actually support them. It seems that in your experiment with SesammTool, you could access, for example, all the business logic. But is that the norm with other tools?

Section 4:
Shouldn't Ontop [1] be included in the extraction technologies mentioned? It is probably also worth mentioning that existing SQL-to-RDF solutions usually lack SPARQL 1.1 support.

Minor issues:
- Section 4 title: Lesson Learnt -> Lesson Learned
- Some text of Figure 4 & Figure 7 is overlapping on the printed document
- page 13: "to graphical define" -> "to graphically define"
- page 13: broken citations: [softwareX]
- page 14: typo "technilogy"

[1] https://github.com/ontop/ontop