High-Level ETL for Semantic Data Warehouses

Tracking #: 2512-3726

Authors: 
Rudra Pratap Deb Nath
Oscar Romero
Torben Bach Pedersen
Katja Hose

Responsible editor: 
Philippe Cudre-Mauroux

Submission type: 
Full Paper
Abstract: 
The popularity of the Semantic Web (SW) encourages organizations to organize and publish semantic data using the RDF model. This growth poses new requirements to Business Intelligence (BI) technologies to enable On-Line Analytical Processing (OLAP)-like analysis over semantic data. The incorporation of semantic data into a Data Warehouse (DW) is not supported by the traditional Extract-Transform-Load (ETL) tools because they do not consider semantic issues in the integration process. In this paper, we propose a layer-based integration process and a set of high-level RDF-based ETL constructs required to define, map, extract, process, transform, integrate, update, and load (multidimensional) semantic data. Different to other ETL tools, we automate the ETL data flows by creating metadata at the schema level. Therefore, it relieves ETL developers from the burden of manual mapping at the ETL operation level. We create a prototype, named Semantic ETL Construct (SETLCONSTRUCT), based on the innovative ETL constructs proposed here. To evaluate SETLCONSTRUCT, we create a multidimensional semantic DW by integrating a Danish Business dataset and an EU Subsidy dataset using it and compare it with the previous programmable framework SETLPROG in terms of productivity, development time and performance. The evaluation shows that 1) SETLCONSTRUCT uses 92% fewer Number of Typed Characters (NOTC) than SETLPROG, and SETLAUTO (the extension of SETLCONSTRUCT for generating ETL execution flow automatically) further reduces the Number of Used Concepts (NOUC) by another 25%; 2) using SETLCONSTRUCT, the development time is almost cut in half compared to SETLPROG, and is cut by another 27% using SETLAUTO; 3) SETLCONSTRUCT is scalable and has similar performance compared to SETLPROG.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
Anonymous submitted on 07/Sep/2020
Suggestion:
Minor Revision
Review Comment:

This paper proposes a methodology for the integration of semantic repositories (Data Warehouses) into an OLAP-compliant format. This is a case of the well-known ETL process. The paper is well-written, original, mature and interesting, and I welcome acceptance. Some comments (mostly minor) that would hopefully improve the paper's quality follow.

I welcome the fact that the paper provides all the formal definitions and semantics of their operation (which is usually left implicit in many works), and also the fact that the paper provides a complete account of all the details used in the authors' approach.

Definition 1 and Example 2: I believe there are some technical glitches there. First, I don't see the difference between F_R and \rightarrow. Why are both needed? Second, bullet 3 in Example 2 does not follow the definitions. It probably defines \rightarrow, not (L,\rightarrow). Third, bullet 5 in Example 2 is also incompatible with the formal definition. Is RUP a function? So, F_R is what? According to Definition 1, F_R is a set of pairs, i.e., a relation.
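To make the distinction concrete, here is a minimal sketch (with hypothetical level names, not taken from the paper) of how I read the definition:

```latex
% \rightarrow is a relation on L, and (L, \rightarrow) is the resulting poset;
% if F_R is likewise a set of pairs over L, the two appear to coincide.
\[
L = \{\mathit{Day}, \mathit{Month}, \mathit{Year}\}, \qquad
\rightarrow \;=\; \{(\mathit{Day}, \mathit{Month}),\ (\mathit{Month}, \mathit{Year})\} \subseteq L \times L
\]
```

Clarifying which of these objects Definition 1 intends, and which one Example 2 instantiates, would resolve the confusion.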

Definition 3: I don't believe that the Tbox extraction operation is sound, in the sense that it mixes intensional and extensional representations. What exists in the instances of the semantic repository is the extensional part, and may be incomplete; deriving the intensional part from it may lead to mistakes. For example, one may infer that all birds are flying animals (subclass relationship), just because penguins don't exist in the database. Furthermore, the process requires that redundant information exists in the database (e.g., eagles are classified both as birds and as flying animals), which is often not the case in practice. Same comments apply for the other detected relationships as well.
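To illustrate the concern, here is a small sketch in Python/rdflib (the class and instance names are invented; this is not the authors' extraction algorithm) of the kind of purely extensional inference I have in mind:

```python
# pip install rdflib -- a small, self-contained illustration of why deriving
# subclass axioms from instance data alone can be unsound.
from rdflib import Graph, Namespace, RDF, RDFS

EX = Namespace("http://example.org/")
g = Graph()

# Extensional data: every bird in the ABox happens to also be typed as a
# flying animal (penguins are simply absent from the dataset).
g.add((EX.eagle, RDF.type, EX.Bird))
g.add((EX.eagle, RDF.type, EX.FlyingAnimal))
g.add((EX.sparrow, RDF.type, EX.Bird))
g.add((EX.sparrow, RDF.type, EX.FlyingAnimal))

# Naive extensional extraction: if every instance of A is also an instance
# of B, emit (A rdfs:subClassOf B).
classes = set(g.objects(None, RDF.type))
for a in classes:
    insts_a = set(g.subjects(RDF.type, a))
    for b in classes - {a}:
        if insts_a and insts_a <= set(g.subjects(RDF.type, b)):
            g.add((a, RDFS.subClassOf, b))

# This wrongly derives (ex:Bird rdfs:subClassOf ex:FlyingAnimal) -- and it only
# works at all because each instance is redundantly typed with both classes.
print((EX.Bird, RDFS.subClassOf, EX.FlyingAnimal) in g)  # True
```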

Definition 4: left-outer and right-outer joins do not appear in the set "relation"; they should be included. Also, the definition of p_com has some problems with the scope of parentheses. Finally, op refers to "ETL operations"; are these the ones mentioned in Table 1? Please say so.

With regards to the S2TMAP vocabulary (and Figure 6), I'm concerned by the naming of certain elements. For one, some relationships (properties) have the same name as their range. This can create confusion. Similarly confusing is the fact that some overused terms (such as "rdfs:subClassOf") are used as instances. This will create triples with such terms in instance positions which, although technically correct, may have undesired effects on most RDF repositories, query engines or reasoners. I suggest using another name, e.g., "map:subClassOf". Same for the others.
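A tiny sketch of what I mean (the map: namespace and the mapping resource are purely illustrative, not the actual vocabulary):

```python
from rdflib import Graph, Namespace, RDFS, URIRef

MAP = Namespace("http://example.org/map#")        # illustrative namespace only
cm = URIRef("http://example.org/mappings#cm1")    # a hypothetical concept-mapping resource
g = Graph()

# Current style: the RDFS vocabulary term itself is used as an instance value.
g.add((cm, MAP.relation, RDFS.subClassOf))

# Suggested style: a dedicated term in the mapping namespace avoids overloading
# rdfs:subClassOf, which reasoners and query engines treat specially.
g.add((cm, MAP.relation, MAP.subClassOf))
```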

Section 8: the evaluation is very interesting and provides a very detailed comparison of the authors' approaches. However, there is no "competing" system to compare with, which is strange. Also, it is claimed that the performance of all 3 tools developed by the authors is identical, as they use the same process, so it is not evaluated. While this is true, it would still be good to have an idea of the time needed for the ETL process to complete, ideally with a comparison to alternative (competing) ETL systems.

With regards to the auxiliary function "ExecuteQuery" in Appendix A, I don't understand why the authors provide a detailed semantics, rather than saying that the semantics is equivalent to the semantics of executing the corresponding SPARQL query. I don't see the need for reinventing the SPARQL semantics here.

Finally, I would like to propose paper [1] to the authors, which is a language for information integration specification, and quite relevant to the reviewed paper.

Typos and minor comments:

- The paper is quite verbose, going to great lengths to ensure that everything is well-defined and appropriately explained. Although this is a good practice in general, the paper is wordy at places, and its size could be reduced.

- There are lots of acronyms, which makes reading difficult at times. Could some of them be skipped?

- There are some pairs of terms which are used "interchangeably" (e.g., target and MD SDW, concept and class etc). It is usually better to stick to one term.

- Section 2.2: "owl:equivalentProeprty"

- Section 4: "may also prevents"

- Section 4: "create by Tbox-Extraction" -> "created by Tbox-Extraction"

- Section 5: "required to implements"

- Section 8: "We shows"

- Section 8: "they creates"

- Section 9: "do not support to generate"

- The Appendix is usually after the bibliography section

- Appendix A: "in any positions" -> "in any position"

- Appendix A: "QB4OLAP-complaint" (several positions)

- Appendix A: "will produced"

- Appendix A: "the operation reflect"

- Appendix A: "updateddTriples"

- Appendix A: "are corresponds"

[1] Y. Marketakis, N. Minadakis, H. Kondylakis, K. Konsolaki, G. Samaritakis, M. Theodoridou, G. Flouris, M. Doerr. X3ML Mapping Framework for Information Integration in Cultural Heritage and Beyond. IJDL, 2016.

Review #2
Anonymous submitted on 24/Sep/2020
Suggestion:
Major Revision
Review Comment:

In general, the paper is very well written. It is a good mixture of summarizing the main concepts of knowledge graphs and data warehousing and additional new functionality. Moreover, the paper uses a good running example of integrating two datasets to demonstrate the functionality of the system. The experimental evaluation shows the new solution performs better than the authors' previous approach. In summary, I very much enjoyed reading the paper.

• However, there is no comparison with related work - only with the authors' own previous work. A good demonstration would be to perform the same data integration task with Pentaho Data Integrator (PDI) and measure the performance in the same way. Even though PDI does not create RDF mappings, the concepts are very similar. Hence, the authors would have a direct comparison with an existing open source data integration tool. This would also give the authors the possibility to demonstrate whether the RDF representation is in fact better than the relational representation – which is not always the case, given that in the presented evaluation there do not seem to be any schema changes.

• It is not clear how many source tables the two datasets contain that need to be integrated. Integrating the company table of source 1 with the recipient table of source 2 seems very simple and could be done by an experienced PDI or PowerCenter developer in a few minutes. Please extend the evaluation to a more realistic data integration task where more tables need to be integrated and the mapping is not a simple one-to-one relationship. Using the presented approach on a realistic data integration task and demonstrating that the authors' approach has a clear advantage over e.g. PDI would be a great result.

• The authors state: Mappings should be done at the TBox level. However, what if the property names of the SourceTBox and TargetTBox differ? This is what typically happens in real-world data integration tasks. How do you define the mappings (semi-) automatically?

• The authors state: "The ETL data flows can be generated automatically based on the source-to-target mapping file. " However, once the source to target mapping is defined, the rest is quite simple. Automatically generating ETL data flows has been done in industry at scale for quite some time. See for instance:

Jossen, C., Blunschi, L., Mori, M., Kossmann, D., & Stockinger, K. (2012, April). The Credit Suisse Meta-Data Warehouse. In 2012 IEEE 28th International Conference on Data Engineering (pp. 1382-1393). IEEE.

The hard part is to generate source to target mappings automatically. See also:

Stonebraker, M., Bruckner, D., Ilyas, I. F., Beskales, G., Cherniack, M., Zdonik, S. B., ... & Xu, S. (2013, January). Data Curation at Scale: The Data Tamer System. In CIDR 2013.

How can this automatic source-to-target mapping be achieved with the presented approach? This would solve the real pain points of data integration.
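As a rough illustration of where such (semi-)automatic support could plug in, consider a minimal sketch (the property names and the similarity heuristic are invented; this is not the authors' method, and real systems such as Data Tamer combine many more signals):

```python
# A naive name-based matcher between source and target TBox properties,
# only to illustrate where (semi-)automatic mapping support could be added.
from difflib import SequenceMatcher

source_properties = ["cvrNumber", "companyName", "municipalityCode"]   # hypothetical
target_properties = ["recipientId", "name", "municipality"]            # hypothetical

def best_match(prop, candidates, threshold=0.5):
    scored = [(SequenceMatcher(None, prop.lower(), c.lower()).ratio(), c)
              for c in candidates]
    score, cand = max(scored)
    return cand if score >= threshold else None

suggested = {p: best_match(p, target_properties) for p in source_properties}
print(suggested)
# Suggested pairs would still have to be confirmed (or corrected) by the ETL
# developer before being written into the source-to-target mapping file.
```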

Some smaller issues:
• qb4o:Type1, qb4o:Type1 -> qb4o:Type1, qb4o:Type2
• It may also prevents the loss -> It may also prevent the loss

Review #3
Anonymous submitted on 27/Oct/2020
Suggestion:
Minor Revision
Review Comment:

SUMMARY:

This paper is part of a larger research effort, where the authors work on combining Semantic Web (SW) and Business Intelligence (BI) technologies. BI technologies consolidate related technologies such as the Extract-Transfer-Load (ETL) process, Data Warehouses (DW), and Online Analytical Processing (OLAP) technologies.
The motivation of this work is given by the shortcoming that, besides the OLAP and DW technologies, the ETL process itself also needs to be "semantified". This should overcome the shortcomings of traditional ETL approaches, such as schema dependency, as well as the fact that the ETL process is based on an inherent table-based "canonical" model taken from RDBMSs (which makes sense, since the extract/load is often from/into an RDBMS); hence the process falls short in handling graph-based data (such as RDF).

First, the authors outline some general concepts regarding the notion of a semantic DW (SDW) based on the RDF Data Cube (QB) vocabulary (the W3C standard) and the QB4OLAP vocabulary. They extend the QB4OLAP vocabulary by introducing an "UpdateType" to the "LevelAttribute". Then, they present an interesting use case, which has the aim of integrating the "Danish Business dataset" with the "EU subsidy dataset". However, these datasets are rather simple, with a limited number of dimensions; mainly there seem to be a location-based and a time-based dimension with a deeper hierarchy.
More importantly, they describe the integration process based on the ETL process and introduce a two-layered approach with a "Definition Layer" (with source-to-target mappings and target TBox definition tasks) and an "Execution Layer" (with ETL operations and data staging), where the process has four steps: defining the target schema (TBox) using the QB/QB4OLAP vocabulary; extracting source TBoxes; creating mappings between the schema and the sources with the S2TMAP vocabulary; and executing the process.

Second is one of the main contributions, the Definition Layer (DL); it is based on the target TBox definition (TargetTBoxDefinition), which consists of a "dimension schema" and a "cube schema", where the former introduces the schema for each dimension, including the levels (with their attributes), the relation between levels (defining granularity), and the roll-up relation. The latter introduces the schema for cubes, which includes the bottom-level dimensions, measures, and aggregate functions applied to the measures. Another element of the DL is the extraction of the source TBox (TBoxExtraction), where the authors recognize that the TBox and ABox are often provided, but they also suggest a rather simple extraction method. The central task of the DL is the source-to-target mapping (S2TMapping), which is at the heart of an ETL process. The basic idea of S2TMapping is the generation of a workflow of operations, where mappings between a source and a target schema are defined, which then allows transformations of source instances into target instances. The authors give a formalization of a S2TMapping, which includes, besides the obvious elements (source, target), a mapping relation (super-/subsumption, equivalence, join), filter/join conditions, and an operator. Different mappings can be connected to form sequences of transformations, which seem restricted to acyclic sequences. They further introduce a formalization of the mapping language in OWL, hence introducing a meta-level over the standard TBox's OWL axioms.

Third is the second main contribution, the Execution Layer (EL), which consists of ETL operations that take the mappings (or user input) into account for the parameterization of each operation. The extraction operations include a graph and a TBox extractor, where the graph extractor covers the functionality of a SPARQL CONSTRUCT query, which leaves open the question of why (restricted) SPARQL CONSTRUCT queries are not allowed directly. The transformation operations include (a) a literal-based and a join-based transformation, (b) level/observation generators, a (tuple-based) data change detector, and (c) a level updater, inference-based materialization, and a URI linker. The literal transformator evaluates property mappings that include arithmetic/aggregate functions. The join transformator evaluates joins that are defined in the join mappings by creating SPARQL queries. Level and observation generators do as expected and are designed to create level members and observations, additionally coining IRIs for new instances. The data change detector acts as an observer of the input ABox and updates level members and observations according to the given mappings, where either updating or the creation of new instances can be parameterized. The level updater can be triggered by the change detector and updates levels according to an "Update Type" (T1: overwriting old values; T2: creating a new version of the values with validity; T3: creating a new value and annotating the old one). The operation "inference-based materialization" allows inferring new triples according to the underlying TBox. However, this operator might create very large ABoxes.
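To give an impression of what the literal transformator seems to do, here is a sketch under our reading (the property names, the currency conversion, and the generated query are invented and not taken from the paper):

```python
# Sketch: a property mapping with an arithmetic function compiled into a
# SPARQL CONSTRUCT query over the source ABox (rdflib; names are illustrative).
from rdflib import Graph, Namespace, Literal

EX = Namespace("http://example.org/")
source = Graph()
source.add((EX.subsidy1, EX.amountEur, Literal(1000.0)))

# Hypothetical mapping: target ex:amountDkk = source ex:amountEur * 7.45
QUERY = """
PREFIX ex: <http://example.org/>
CONSTRUCT { ?s ex:amountDkk ?dkk }
WHERE {
  ?s ex:amountEur ?eur .
  BIND (?eur * 7.45 AS ?dkk)
}
"""
target = Graph()
for triple in source.query(QUERY):   # CONSTRUCT results iterate as triples
    target.add(triple)
print(list(target))                  # one triple: (ex:subsidy1, ex:amountDkk, 7450.0)
```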

Fourth, they provide two algorithms that automatically generate ETL data flows. The algorithm "CreateETL" collects all the top-level mappings and acts as an outer loop that calls the "CreateFlow" algorithm for each mapping. CreateFlow recursively finds the operations for a mapping, parameterizes them, and checks the sub-mappings that are defined in the source mapping. The algorithms result in a set (or rather a sequence?) of ETL operations.
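The recursion pattern, as we understand it, is roughly the following (a much simplified sketch; the Mapping structure and the way operations are parameterized are invented and do not reflect the authors' data structures):

```python
# Simplified sketch of the CreateETL / CreateFlow recursion: walk from each
# top-level mapping down to the mapping that feeds it, and emit a
# parameterized operation per mapping on the way back up.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Mapping:                                   # hypothetical stand-in for a concept mapping
    name: str
    operation: str                               # e.g. "GraphExtractor", "JoinTransformator"
    source_mapping: Optional["Mapping"] = None   # the mapping producing this one's input

def create_flow(mapping: Mapping, flow: List[str]) -> None:
    if mapping.source_mapping is not None:       # first handle the input-producing mapping
        create_flow(mapping.source_mapping, flow)
    flow.append(f"{mapping.operation}({mapping.name})")

def create_etl(top_level_mappings: List[Mapping]) -> List[str]:
    flow: List[str] = []
    for m in top_level_mappings:
        create_flow(m, flow)
    return flow

extract = Mapping("company_extract", "GraphExtractor")
join = Mapping("company_recipient_join", "JoinTransformator", source_mapping=extract)
level = Mapping("recipient_level", "LevelMemberGenerator", source_mapping=join)
print(create_etl([level]))
# ['GraphExtractor(company_extract)', 'JoinTransformator(company_recipient_join)',
#  'LevelMemberGenerator(recipient_level)']
```

Note that in such a sketch the result is naturally an ordered sequence rather than a set, which is what our question above alludes to.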

Fifth, the authors describe their implementation called SETL_Construct, which uses a Jena triple store. SETL_Construct is extended with SETL_Auto for the automatic ETL flow generation. Nicely, they make the implementation available to the community on GitHub. The experiments focus on the comparison of their approach regarding productivity, development time, and (runtime) performance with their earlier Python-based ETL approach (called SETL_Prog). The evaluation of productivity is based on the comparison of user interactions such as clicks or selections with SETL_Prog. The information gain of this evaluation is rather limited, since comparing a Python-based programming approach with an automated UI-based approach depends on the user's skill level and the tools used (for instance, the use of Eclipse to reduce the number of interactions). The evaluation of development time seems more appropriate for comparing both approaches, since SETL_Construct should give an advantage in creating and improving ETL flows due to the simple reuse of operations, which the authors could indicate in their experiments. The last evaluation showed the performance (regarding runtime) and the (disk) space used in the different steps. SETL_Prog is more efficient regarding runtime and space usage, which comes as no surprise, since ad-hoc implementations give more freedom to optimize the code. We believe, though, that a weakness of the approach is the use of the Jena triple store, which is not known to be the most efficient tool on the market. For the storage of intermediate results, a compressed RDF storage format such as HDT could perhaps be used.

Finally, the authors conclude their work with related work (which misses some discussion, see below) and future work. The authors also provide the semantics of their operators, which is based on the work on SPARQL (by Perez et al.). Future work aims at providing correctness/completeness of the ETL constructs, as well as addressing the important topic of virtual data integration.

EVALUATION:

The main contributions of this paper are (summarizing the above):

- A general two-layered approach to a semantified ETL process
- A definition layer that includes target TBox definitions and source-to-target mappings, based on a formalization and an OWL-based language to express the mappings
- An execution layer that provides different operations such as GraphExtractor, LiteralTransformator, JoinTransformator, and Load, which are combined to build a full ETL flow. Furthermore, a mathematical definition and an RDF- and SPARQL-based semantics are provided for each operation.
- An algorithm that is used to automatically generate the ETL operations based on the mappings and (user-given) parameters
- An implementation of their approach, which is compared to a Python-based approach in an extensive evaluation.

The topic of the paper and the main ideas are appealing and the paper should be published in the SWJ. The paper is in our opinion very well written, and the motivating example is a recurrent theme throughout the paper and helps to understand the sometimes large definitions. We believe that in particular the separation between mappings and operations that are automatically generated and parameterized based on the mappings is a novel idea and an appealing result. The authors also provide a more formal definition of the operators, which is important for a well-defined syntax/semantics. Further, a focus is placed on DW-specific operators such as LevelMemberGenerator or UpdateLevel, hence taking requirements such as the "Update Type" into account. The work is in our opinion well-rounded and covers all aspects of their approach.

However, we also believe that this paper has certain unclear points, which should be addressed in the answer and (if reasonable) also added to the document (ordered by significance):

1.) A central element of the approach is the source-to-target mappings, which allow different relations (e.g., equivalence, subsumption, join), filter/join conditions, and property mappings. What is the reason for not allowing direct SPARQL queries to the source ABox?

In our opinion this would make the rather cumbersome definition of join relationships obsolete, and it seems that their mappings are anyway compiled into SPARQL (on which the semantics is also based). We agree that the full SPARQL standard might be too extensive; however, a restricted syntax could be used. This would bring another advantage: SPARQL operators (such as join, projection, etc.) could be rearranged for optimizing the evaluation, which is not possible in the current approach.
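To make the suggestion concrete, a join mapping could look roughly like the following restricted CONSTRUCT query (the graphs, properties, and join condition are invented for illustration):

```python
# A (hypothetical) join mapping between the two sources, written directly as a
# restricted SPARQL CONSTRUCT instead of a dedicated join relation in the
# mapping vocabulary.
JOIN_MAPPING_QUERY = """
PREFIX sub: <http://example.org/subsidy#>
PREFIX bus: <http://example.org/business#>
PREFIX sdw: <http://example.org/sdw#>
CONSTRUCT { ?recipient sdw:locatedInMunicipality ?municipality }
WHERE {
  ?recipient    sub:municipalityCode ?code .   # EU Subsidy source
  ?municipality bus:municipalityCode ?code .   # Danish Business source
}
"""
# A query engine is then free to reorder joins and projections when evaluating
# the mapping, which a fixed sequence of mapping constructs does not allow.
```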

Another question regards the mapping relations subsumption and supersumption. Could the authors give an example of the usage of this relation? It seems that this relation should be defined in the TBox and should not affect a transformation from a source to a target ABox.

2.) This work is based on the assumption that all the sources are already ABoxes and encoded in RDF. However, a main point of an ETL process is to extract and transform unstructured (or differently structured) data. Hence the authors should at least add one sub-section that discusses how they would address different data models: if the source data, for instance, is stored in an RDBMS, it seems natural to also allow an R2RML-based operator in the ETL load step. Would it also be possible to parameterize the R2RML based on the mappings?
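For illustration, such an operator could be parameterized with a mapping of roughly the following shape (a minimal, hypothetical R2RML mapping for a relational company table; the table and column names are invented):

```python
# A minimal, hypothetical R2RML mapping for a relational "COMPANY" table.
# An R2RML-based extraction operator could take mappings of this shape as
# parameters instead of assuming the sources are already RDF.
from rdflib import Graph

R2RML_MAPPING = """
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix ex: <http://example.org/business#> .

<#CompanyMap>
    rr:logicalTable [ rr:tableName "COMPANY" ] ;
    rr:subjectMap [
        rr:template "http://example.org/business/company/{CVR_NUMBER}" ;
        rr:class ex:Company
    ] ;
    rr:predicateObjectMap [
        rr:predicate ex:companyName ;
        rr:objectMap [ rr:column "NAME" ]
    ] .
"""
g = Graph()
g.parse(data=R2RML_MAPPING, format="turtle")   # only checks the mapping is well-formed Turtle
print(len(g))
```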

3.) In the evaluation the authors show the applicability of their approach. However, the measurement of productivity based on interactions seems subsumed by the second aspect, development time (which also measures productivity). We rather suggest that the authors provide, instead of (or in addition to) the first aspect, a qualitative evaluation of their work. One possible choice would be guideline-based expert interviews (e.g., as described in "Das Experteninterview" by Bogner et al.) with a small selection of experts or users, which should give insight into the usability of the presented approach.

4.) In the conclusion or related work section, there should be a more in-depth discussion on the advantages of the semantification of "conventional" OLAP, versus Relational OLAP (ROLAP), and versus Hybrid OLAP (HOLAP). ROLAP seems orthogonal to their approach, since the results are (always) materialized in the triple store. Each type of OLAP has its advantages/disadvantages, which might (or might not) affect a SW-based approach.

5.) The mapping language itself and its vocabulary are defined in OWL, hence adding some kind of mapping meta-level. Besides convenience, is there another motivation for having this in standard OWL? Rule-based mappings (as used in R2RML, SHACL or dl-programs) might be easier to understand.

6.) The assignment of operators to the stages of the ETL flow seems a bit blurry. For instance, LevelMemberGenerator, ObsGenerator, and LevelUpdater could be part of the Load phase. Is there an argument why these operators are part of the Transformation phase?

MINOR ISSUES:

P11, left: (D_{name}, L, \rightarrow, H, F_R) should be (D_{name}, (L, \rightarrow), H, F_R)
P15: Listing 3 is quite long and hard to read, maybe a simplified version could be presented
P21, right: Is "MaterializeInference" needed, given that the triple store should provide the inference out of the box? (See the sketch after this list.)
P22: The algorithm CreateFlow does not handle cyclic relationships between operators; are there restrictions checked a priori?
P33, left: The syntax/semantics is based on the work of Perez et al.; since this is the foundation of the presented semantics, it could be moved into the preliminaries and introduced in more detail.
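Regarding the remark on "MaterializeInference" above, a small sketch of the kind of out-of-the-box materialization we have in mind, here with rdflib and the owlrl library (the class names are invented; a triple store with reasoning enabled would achieve the same effect):

```python
# Sketch: materializing RDFS entailments with rdflib + owlrl instead of a
# dedicated MaterializeInference operator (names are illustrative only).
from rdflib import Graph, Namespace, RDF, RDFS
from owlrl import DeductiveClosure, RDFS_Semantics

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.Company, RDFS.subClassOf, EX.Organization))
g.add((EX.acme, RDF.type, EX.Company))

DeductiveClosure(RDFS_Semantics).expand(g)   # materializes the RDFS closure in place

print((EX.acme, RDF.type, EX.Organization) in g)  # True after materialization
```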