Survey on complex ontology matching

Tracking #: 1879-3092

Authors: 
Elodie Thieblin
Ollivier Haemmerlé
Nathalie Hernandez
Cassia Trojahn dos Santos

Responsible editor: 
Marta Sabou

Submission type: 
Survey Article
Abstract: 
Simple ontology alignments, largely studied in the literature, link a single entity of a source ontology to a single entity of a target ontology. One of the limitations of these alignments is, however, their lack of expressiveness, which can be overcome by complex alignments. While diverse state-of-the-art surveys mainly review the matching approaches in general, to the best of our knowledge, there is no study taking into account the specificities of the complex matching problem. In this paper, an overview of the different complex matching approaches is provided. This survey proposes a classification of the complex matching approaches based on their specificities (i.e. type of correspondences, guiding structure). The evaluation aspects and the limitations of these approaches are also discussed. Insights for future work in the field are provided.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 24/Jun/2018
Suggestion:
Major Revision
Review Comment:

This paper presents an extensive survey on automatic complex ontology matching. The authors define the scope of their survey (ontology and schema matching), then present a classification of complex ontology matchers. They then present an extensive analysis of complex ontology matchers along the dimensions they presented. A brief discussion on evaluation matters precedes a more general discussion on the state of the art.

This paper presents a very good coverage of the state of the art, and I believe it should eventually be published in a journal. I found myself capable of understanding a lot of what is presented, even if I am not familiar with most of the complex ontology matchers presented.
There are, however, a couple of recommendations I would like the authors to examine and hopefully to act upon, so that their paper is at the level of (very high) quality expected from SWJ papers.

One first suggestion is to give more motivation for the problem: who uses complex ontology matching now? The references given to justify the importance of complex OM (e.g. [2] in section 6) are from academic papers, from researchers that present their own complex OM system. For example, on p1, presenting the Alignment API as an application that consumes complex matchings is clearly not fitting. A ‘real-world’ application using the Alignment API complex matching features would be more convincing. On p2 and following, presenting complex matchings in the context of a ‘toy ontology’ is not enough to show “why they are necessary”, as claimed. The question is whether automatic complex matching can be easy enough to apply and efficient enough to be used in the context of data conversion or query translation, instead of relying only on custom-made scripts that apply manually defined (but reliable) conversion rules. It is probably not a good sign that no approaches present a framework for visualization and editing (p23). These are present but in the data transformation tools that real users use...
This aspect alone is not a reason to reject an otherwise very convincing survey. But it seems that some of the claims should be really better substantiated - or downplayed.

Accordingly, the importance of complex OM in the field of OM in general is probably over-stated. This reviewer will not argue that complex OM is not important - it is. But presenting it as a ‘paradigm’ that organises the field seems a small exaggeration. The techniques used for generating matches (lexical, machine learning, etc) play a more structuring role, in my opinion.

In the same vein, the survey would probably benefit from presenting more information about how complex alignments are expressed for tools that could use them. It is quite a sign that there is not even a bibliographic reference for R2RML.

Another puzzling point is the position of the survey with respect to ontology evolution. It is listed as a possible case of application of complex alignments on p1, but then is presented as being out-of-scope for the survey, without much explanation.

The categories of analysis are very complete. Some are not very crisp (cf the authors’ comment on using tree structures on p9) but it is a hard exercise and the authors’ attempt deserves praise in my opinion. One remark on the classification though: the first categories of “guiding structures” are presented as those “on which the process relies”. Aren’t they also (and rather) the structures of the output, rather than the process?
Actually I think that some of the alternative classifications (for example on table 3) are more intuitive and could give a better rationale for structuring section 4.
There is a remark on p23 that left me puzzled: are the layers really not “fully independent of each other”? Of course some of the tools fall in several categories of different layers at once, but I am wondering whether it is because there is an intrinsic dependence between layers, or if it is because the technique used by the tool just makes it happen this way. Examples of intrinsic dependences would help here, if there are any.

Regarding the completeness of the state of the art, as said I am not an expert in the complex ontology matching field. But as I see database or XML Schema mapping tools being mentioned, I wonder whether the work on the Karma data transformation tool (http://usc-isi-i2.github.io/karma/) could be in scope.
Also I am intrigued by the authors’ claim that “there is no benchmark for evaluating complex correspondences” (p23) while they are themselves working on one (“Towards a complex alignment evaluation dataset”, Ontology Matching workshop at ISWC17). This is certainly work in progress, but it does not seem a reason to completely hide it if it has been published in the Ontology Matching workshop.

Finally, as already hinted, the paper is in general quite well written, especially considering the diversity of approaches to explain. There are however some (expected) glitches.

First a couple of general comments:
- the footnotes seem to have disappeared!
- the order of the tools analyzed should be more consistent across tables. The current (semi-)random ordering does not help the reader.

Minor comments on specific parts of the paper:
- section 3.2 (or others) could give more examples and explanations on what block correspondences are, and why they cannot be expressed in terms of correspondences in other categories (or sets of simple correspondences)
- in 4.1 I am struggling to understand what the expressions “Class by Attribute Type” and “Class by Inverse Attribute type” cover. Examples may help, or a re-wording by the authors. And if other (clearer) categories like “Class Attribute Value Restrictions” are important, shouldn’t they be presented as a categorization of its own?
- is figure 4 really useful? It seems that for most respects it could be replaced by table 3, which is more efficient with space.
- p8-9: are iMap’s searchers really ‘combined’ or should the text refer to ’a set of’ rather than ‘a combination of’? And is there really a ‘mismatch’ searcher? What is its role then?
- p9: “As for some of iMAP’s searchers, a Kullback-Leibler divergence measure on the data values is used to define the coefficient a of the linear transformation”. But for iMap the divergence measure is said to be used to compute a confidence measure.
- p9: “The juxtaposition of the two alignments”. It seems that there is only one alignment mentioned (the LogMap one)
- p9: holistic could be defined (as it is, later)
- p10. I am not sure why [26] is not brought closer to [27]. This would make the flow better. And how do they know that some properties cannot contribute to the alignment?
- p11: same comment as above regarding “a combination of searchers”
- p11: is Xu and Embley about DB or XML schemas? And why not try to have this one just after iMap to make the flow of the section better?
- p12 the first paragraph of the second column is not clear to me. It’s unexpected that a clustering technique would compute simple correspondences, holistic is still not yet defined.
- p12: could [50] be brought closer to [53]?
- p13: “A holistic approach” -> same remark as for [53]
- p13: please try to clarify how automatic the Clio process is.
- p14: “so that the groups have the same semantics”. What does this mean?
- p14: (for [32]) is mapping conceptual models really different from mapping ontologies? And it could be made clearer in the text that the method seems to depend on pre-computed (or pre-existing) cross-model relationships.
- p15: what is FOIL?
- p18: the ‘online’ column is quite different from the rest. It’s not bad per se, but one can wonder why it is presented in this specific table.
- p19: could fig 5 be merged with table 3 or table 5? This would be more efficient.
- p20: the categories from [13] could be re-explained to make the paper a bit self-contained - some of these categories are not very clear.
- p21 and 22: what are CAT and CAV?
- p22: “cf same problem as Example 5” is not very elegant nor clear.
- p23: the transition at “Few approaches” seems quite random. The paragraph “One can also observe”
- p24: what does ‘absolute’ mean precisely?
- p23-24: I am not sure the paper needs an independent conclusion as it is currently written. It could perhaps be merged with section 6.

Typos:
- p1: “Despite this fact” is not the right logical connector here.
- p3: “transitivity” is not a constructor. Perhaps the authors meant “transitive closure”?
- p3: the plural of ‘axis’ is ‘axes’ (twice in the page). “so far focus” -> “so far focused”
- p11: remove comma after “some of the searchers”
- p16: Lagramge -> Lagrange?
- p17: “deCarvalho” -> “de Carvalho”

Review #2
By Catia Pesquita submitted on 31/Jul/2018
Suggestion:
Minor Revision
Review Comment:

The paper presents a survey of the state of the art in complex ontology matching that encompasses different kinds of knowledge representation models such as ontologies, XML schemata, database schemata, etc. The paper also provides definitions for complex matching, and a classification scheme for complex matchers along two different axes: type of correspondence and guiding structures. The state of the art is then presented and analysed according to these categories. The paper also examines the issue of evaluating complex ontology matching, both on reference building and the establishment of performance metrics.

The paper is a suitable introductory text to complex matching, if the reader has previous knowledge of the overarching field of ontology matching.

Although several important works have been dedicated in the last years to reviewing the field of ontology matching, the topic of complex matching has never been to the best of my knowledge thoroughly reviewed or analysed. This is a topic that has been recently gaining relevance, with recent publications in top venues and the addition of a new track to the Ontology Alignment Evaluation Initiative that addresses it, so this paper is very timely.

The paper provides a thorough and encompassing review of works in the complex matching domain. Although few works are dedicated to complex ontology matching per se, there are several works that intersect the area, but use different knowledge representation formats. The authors included these works in their review, cleverly joining what are clearly related strategies and techniques under a common framework of categorization.

The paper provides a valuable formalization of complex ontology matching, however in section 1 complex matching approaches are textually defined in an imprecise or even misleading way: “complex matching approaches are able to generate correspondences which express the relationships between entities from different ontologies better.” This definition presumes that equivalence simple mappings between entities from different ontologies are not suitable. This may not be the case at all. In many cases, “simple equivalences” actually make up the bulk of the relationships between two ontologies, and are exactly correct in capturing the relationship between entities. I would rephrase this to: “complex matching approaches are able to generate correspondences which express more complex relationships between entities from different ontologies”

I find that the two proposed axes of classification are adequate to provide a systematization of the field. However, I believe the paper would strongly benefit from a clearer explanation of why the types of information explored by the matchers (node, structure, instance, semantic…) were not chosen to categorize these matchers. This should be discussed in section 3, since it becomes even more confusing to see it applied in Table 5.

Moreover, it is unclear if the “members expressions pre-definition” is a category on its own or how it relates to the other two, since it is not given the same relevance in section 3, but it is shown in Figure 3 as part of the Process axis, and does come up at what looks like the same level of relevance in section 4.

In section 3.2, I find that the “Type of correspondence” section should have specific examples of the types of correspondences described. Particularly the “blocks” type would benefit from this. Likewise for “guiding structures”.

Also in 3.2, The category “Trees” is confusing, it mixes together two very different uses of tree structures. For instance, I would have no problem in categorizing the GP approaches under No Structure, since they do not take “trees” as input, rather generate them as part of the GP method.

I don’t exactly understand this sentence: “Despite the fact that many approaches have been automatically evaluated, supposing the existence of a reference alignment, with respect to ontology matching, few reference alignment sets are publicly available”.

I think the discussion is lacking in terms of relating the expressiveness of the representation models and the types of correspondences and guiding structures. Intuition would lead us to believe more expressive representations support more complex techniques, but the results of the survey do not appear to point in this direction. I find it curious that this seems to stem from the fact that many systems that match schemata also need a set of rules or a domain ontology as input, whereas systems that match ontologies do not.

Many portions of the discussion read more as a summary of the paper. Particularly the paragraphs devoted to evaluation, which sound very repetitive from section 5.

The paper is very well written and organized. I think that some extra concrete examples (along the lines of the ones given in section 2) would make the paper a more enjoyable read. I understand figure and table placement in LaTeX are not always friendly, but some figures and tables are mentioned a couple of pages before they appear, forcing the reader to jump around a lot.

Figures and tables

Table 6: where are the footnotes?

Review #3
By Antoine Zimmermann submitted on 03/Sep/2018
Suggestion:
Major Revision
Review Comment:

My apologies to the authors for the long delay for reviewing.

The paper is a survey on automated techniques for discovering complex correspondences between ontologies. A complex correspondence is one where at least one of the entities that are related is a compound entity, such as a concept union or intersection.
The paper tries to classify and compare the systems based on characteristics that are specific to complex alignments, rather than reusing the typical features of all ontology matching systems.
These characteristics are: type of correspondence, guiding structures, members expressions pre-definition.

The main contributions could be summarised as follows:
- it provides a single entry point to existing work in complex ontology matching
- the systems of complex matching are classified along the dimensions of comparison listed above
- it provides a number of challenges that could serve as the basis for the research of someone who would like to investigate this area (such as a doctoral student).

The survey is pretty good in terms of coverage, investigating the notion of ontology matching in a broad sense (it includes references to data base schema matching, as well as other forms of "ontologies"). There are references to algorithms, implementations, matchers evaluation metrics and datasets.

One missing part could be the representation aspect: how are complex alignments represented, stored or saved, exchanged, etc.? On this aspect, there exists at least EDOAL, an Expressive and Declarative Ontology Alignment Language (http://alignapi.gforge.inria.fr/edoal.html), which can be used with Inria's alignment API. In case a more formal bibliographic reference is needed, the language is directly inspired by Knowledge web deliverable 2.2.10:

Jérôme Euzenat, François Scharffe and Antoine Zimmermann. D2.2.10: Expressive alignment language and implementation. FP6 Knowledge Web deliverable, 2007.

It could be interesting to investigate how complex alignments are represented, especially in systems that do not deal with Web ontologies.

With that said, I find two things to argue against the paper, not in a rejecting manner but in a "needs improvement" way:
1) There are a number of technical issues that should be fixed before the paper is accepted. Most importantly, there are important problems in the formalisation. There are also several inaccuracies throughout the paper. I give thorough details about this below.
2) The choices of dimensions for classifying the complex matchers should be better justified. The paper apparently just asserts that those will be the ones used for the survey, without a clear motivation for rejecting other possibilities. The discussion even says that there could be other dimensions to study the approach, yet it does not explain why they have not been chosen after all.

For these reasons, I request that a strongly revised version of the paper be resubmitted.

1. Introduction:
- in the motivating example, after "complex correspondences are needed". The formulas are strange. They look like meta statements of first order logic. The symbol \equiv is used in FOL literature to mean "the FOL formula on the left of the symbol is logically equivalent to the FOL formula on the right".
If we take a look at Item 1., the left-hand side of the \equiv symbol is \forall x,y o1:priceInDollars(x,y). Every arrangement of the universe that makes this true must have every pair of things be related by the predicate priceInDollars. On the right-hand side, we have o2:priceInEuro(x,coversionFunction(y)). The truth of this statement depends on what assignment we make for x and y, which are free variables in this formula. Clearly, if the left-hand side of \equiv is true, there is no reason that the right is true as well. So the equivalence is clearly wrong.
It is quite probable that what the authors mean in fact is \leftrightarrow instead of \equiv. In this case, Item 1 becomes a single FOL formula which indeed expresses the fact that the price in dollars of something can be converted into the price in euros of the same thing.
Assuming this is the case, then Item 3 is problematic: whether the formula is true or false depends on the assignment we make for y (which is a free variable). In this case, the formula should start with \forall x \exists y.
- The meaning of Fig.2 is unclear. Why "Data Models" is here? Are all data models equally expressive? Are data models even knowledge representation models? What is "General Logic"? Is XML a knowledge representation model? etc.
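
The point made above about Item 1 can be sketched in LaTeX as follows. This is only an illustration of the two readings discussed in the review: the abbreviation conv for the conversion function and the explicit bracketing are assumptions, and the paper's own formula may be typeset differently.

```latex
% Reading criticised in the review: the quantifier covers only the
% left-hand side, and \equiv relates two separate formulas, so x and y
% are free variables on the right-hand side.
\[
  \bigl(\forall x, y\;\; \mathit{o1{:}priceInDollars}(x, y)\bigr)
  \;\equiv\;
  \mathit{o2{:}priceInEuro}\bigl(x, \mathit{conv}(y)\bigr)
\]

% Reading suggested in the review: a single closed formula in which
% both variables are bound and \leftrightarrow is the connective.
\[
  \forall x\, \forall y\;
  \bigl(
    \mathit{o1{:}priceInDollars}(x, y)
    \;\leftrightarrow\;
    \mathit{o2{:}priceInEuro}\bigl(x, \mathit{conv}(y)\bigr)
  \bigr)
\]
```

In the second form no variable is free, so the truth of the formula no longer depends on an assignment for x and y.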

2. Background:
- "these appraoches are out of the scope of this study" -> why can't the work here be applied to them?
- Sec.2.2 "those are out of the scope of this survey" -> why can't the results be applied to them as well?
- Sec.2.3:
* the definition of correspondence is never used anywhere. There are only pseudo FOL formulas as the ones discussed above
* if a correspondence includes a value $n$, then it's not a triple (e1,e2,r) but a quadruple (e1,e2,r,n). If it is a triple, then don't mention this n. You do not use, or need, this n, anyway.
* in the item list just following the definition of correspondence, there are so-called correspondences that are not following the definition. The first one could be expressed like this "(o1:Person,o2:Person,\equiv)", but the next one is less clear. With a language like EDOAL, the 5 examples can be expressed as triples following Def.2
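
The triple versus quadruple distinction raised for Sec.2.3 can be written out in LaTeX as below. This is a sketch under assumptions: the symbols c and c' are introduced here only for illustration, and the example triple is the one suggested in the review.

```latex
% Requires \usepackage{amsmath} for \text.
% A correspondence as a triple, with no value attached:
\[
  c = (e_1, e_2, r),
  \qquad \text{e.g. }
  (\mathit{o1{:}Person},\; \mathit{o2{:}Person},\; \equiv)
\]

% A correspondence that also carries a value n is a quadruple:
\[
  c' = (e_1, e_2, r, n)
\]
```

Whichever form is chosen, the definition and the examples that follow it should use the same arity.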

3. Classification:
- In Sec.3.2, in addition to the formulas using \equiv, there is one that has \sqsubseteq. It seems that this is used to express \rightarrow instead (implication)

4. Complex alignment approaches:
- The "type of knowledge representation model" is sometimes strange. First, since this is a survey on complex ontology matching, all approaches are matching ontologies (in a broad sense). So to say that an approach [for complex ontology matching] is for "ontology to ontology" is a bit strange. It seems that, by "ontology" here, you mean something more specific, like OWL ontology or DL-based ontologies?
- Sometimes, there is "relational database schema", sometimes "database schema", sometimes just "schema". What does each of these mean?
- What is "conceptual model"?
- Svab-Zamazal and Svatek is not easy to follow. There should be an example, like in most other descriptions
- on p10, there are strange notations:
* Table 2 starts with pattern forms that reuse the pseudo FOL notation used so far. Then it uses a different notation with "contact", "union", "substr", then it uses curly brackets. There are strange equalities (maybe they are supposed to be equivalences in correspondences?)
* Is "union" different from disjunction?
* what are "v", "v1", "v2", etc.? constants? free variables? existentially quantified variables?
* the pattern forms do not have the provenance of the terms as in other examples (they use "p(x)" instead of "o1:p(x)")
* in Ex.3, there is a mixture of FOL-like notation, DL symbol \sqsubseteq, and RDF term "rdf:type"! Please use a single representation for all correspondences
- in general, the examples used in the whole section are not very illustrative. They look more like the general case (with generic names like A, B, p1, p2) rather than actual examples
- in "Wu et al.", the notation "{passengers}={adults,children,seniors}" is disturbing. With the common interpretation of curly brackets, equal sign, and commas, we have that a singleton is equal to a 3-element set!
- In Sec.4.3 "An et al." there is one more new notation "u \approxequiv s" which is not explained
- In Sec.4.6: "Table 3 ... the needed input" and later "with respect to the kind of input they exploit". It seems that the input mentioned in Table 3 is of a different nature from the one mentioned later. In fact, it seems that Table 3 does not really mention the input of the matching process (which should at least take 2 ontologies) but some other extra input. This is not really explained

5. Evaluation:
- on p22, end of Sec.5: yet another notation (DL-like this time) is used for expressing correspondences

6. Discussion:
- there is a clear distinction between the approaches based ... -> the distinction may be clear at this point for the authors, but it would be good to make explicit what distinguishes them clearly
- in p23, first column, other characteristics not used in the survey are mentioned as possible ways of classifying the approaches. But we would like to know why they have not been retained. As a result, the classification dimensions chosen in the paper seem a bit arbitrary (or at least, not too well justified).

Here are smaller issues (typos, grammar, etc.)
1. Introduction:
- "Largely speaking" -> "Broadly speaking"
- "e.g. ... etc." -> use either "e.g." or "etc.", not both
- "Two 'paradigms' organise the field" -> what's described is hardly a paradigm. Moreover, why use single quotes?
- "to fully overcome ontology conceptual heterogeneity" -> "to fully overcome conceptual heterogeneity"?
- "a survey on ontology matching resaerchers" -> "research"?
- "for different tasks [4], data translation [5]" -> ref [5] is clearly not about data translation. It seems ref 4 and 5 should be inverted
- the outline of the paper should be the last thing to present in the introduction. If a motivating example occurs after the outline, it should be in a separate section. If the motivating example is part of the introduction, then present it before explaining how the paper is organised
- "Consider three toy ontologies" -> why qualify them as "toys"? The figures could equally depict portions of large, complex ontologies.
- "can help automatising the task" -> "can help automatise the task"
- "will lead to a loss in information" -> "loss of information"
- The motivating example ends abruptly, with no transition to the following.

2. Background:
- The quotation at the beginning of the section is not useful at all.
- Sec.2.2: "for the o2:accepted property ." -> deleted extra space before dot
- Sec.2.3: it would be good to define an ontology alignment after correspondence.

3. Classification:
- Sec.3.1 "the first one includes the matching process is guided"
- Sec.3.2
* "traducing" -> translating
* "different matching strategy" -> strategies

4. Complex alignment approaches:
- In Sec.4.1 "the labels of the ontologies entities" -> of the ontology entities
- In 4.2:
* "CGLUE, also presented in [30] is" -> missing comma before "is"
* "Some of the searchers, use" -> delete comma
* "from the target, schema." -> delete comma
* "are ciloared with help of" -> with the help of
* "It alignes" -> aligns
- In Sec.4.4
* "a XML schema" -> an XML schema
* "between the schema's attributes" -> the schema attributes
* "the ontology's data-properties" -> "ontology data properties" or "the data properties of the ontology"
* in "Nunes et al.": """Each "individual" of""" -> why quotes? Moreover, they should be opening quotes and closing quote, not straight quotes
* in "De Carvalho et al." """its "individual". Each "individual"""" -> idem
- In Sec.4.5:
* "the highest FOIL gain" -> what's this?
* In BMO """into a "document"""" -> why quotes? use opening and closing quotes
* "an Apriori algorithm" -> a priori
- In Sec.4.6:
* "only to Semantic Web" -> to the
* "Very few approaches are available online" -> Very few implementations (the approaches themselves are all accessible online)
* "on a guiding structures" -> structure
* "with respect the kind" -> with respect to
* in Table 3, the 3rd approach has lower case "onto"

5. Evaluation
- First sentence: it is not very interesting to know that some surveys did not address the complex matching perspective. It would be good to know if there is a survey that addressed it. If it's not the case, then the sentence should be that no survey addresses the problem
- In Table 4, there are footnote marks, but the corresponding footnotes are not there. In LaTeX, you can't directly put footnotes in tables, you need a little trick with \footnotemark and \footnotetext
- Fig.5 "Clio" is in the ontology-based systems rather than instance-based systems, but in Table 3, it has "matched instances" as its input
- Table 6 has missing footnotes
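
The \footnotemark and \footnotetext trick mentioned above for Table 4 can be sketched as follows. The table content, caption and footnote texts are placeholders invented for illustration, not taken from the paper. Note that a floating table may end up on a different page from the footnote text, which may need manual adjustment.

```latex
\documentclass{article}
\begin{document}

\begin{table}
  \centering
  \begin{tabular}{ll}
    Approach & Note \\
    \hline
    % \footnotemark prints the mark and steps the footnote counter,
    % but it does not typeset any footnote text.
    System A\footnotemark & placeholder \\
    System B\footnotemark & placeholder \\
  \end{tabular}
  \caption{Placeholder table with two footnote marks.}
\end{table}

% Two marks were issued, so the counter is one ahead of the first mark.
% Step back once, emit the first text, then step forward for the second.
\addtocounter{footnote}{-1}
\footnotetext{Placeholder text for the first footnote.}
\addtocounter{footnote}{1}
\footnotetext{Placeholder text for the second footnote.}

\end{document}
```

With a single footnote in the table, the counter adjustments are unnecessary and one \footnotetext placed after the table environment is enough.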

6. Discussion
- "into two 'classes'" -> why quotes? why are they single and not double?
- "e.g., ..., etc." -> choose between e.g. and etc.
- Ex. 5 could easily be inlined in the text rather than in an Example environment
- "an input resources" -> resource
- "Another aspect refers to the kind of relations of a correspondence generated" -> of a generated correspondence(?)
- "hybrid" in quotes, why?
- "function type For example" -> missing full stop
- Ref.70 is the same as ref.13 with a missing author.
- regarding the discussion on tickets, children, etc. Yes, tickets and children are not comparable, but numbers are numbers. I can say that I have as many tickets as I have children.
- There is a part of a sentence repeated "comple domains where several etc..."
- "the decidability of the merged ontology" -> it is not the ontology that is decidable or not. It is the ontology language or formalism.

References:
- ref 13: this is the second edition
- ref 15: "1: n" -> "1{:}n" to avoid extra space
- ref 31: "owl" should be in capital letters
- remove ref 70 and use 13 instead