Background Knowledge in Schema Matching: A Survey

Tracking #: 2683-3897

Authors: 
Jan Portisch
Michael Hladik
Heiko Paulheim

Responsible editor: 
Jérôme Euzenat

Submission type: 
Survey Article
Abstract: 
Schema matching is an integral part of the data integration process. One of the main challenges in schema matching is semantic heterogeneity, i.e., modeling differences between the two schemas that are to be integrated. The semantics of most schemas are, however, typically incomplete because schemas are designed within a certain context that is not explicitly modeled. Therefore, external background knowledge plays a major role in the task of (semi-)automated schema matching. In this survey, we introduce the reader to the schema matching problem and its abstraction, the ontology matching task. We review the background knowledge sources as well as the approaches applied to make use of external knowledge. Our survey covers all schema matching systems that were presented between 2004 and 2020 at a well-known ontology matching competition, together with significant publications in the research field. We present classification systems for external background knowledge, concept linking strategies, and background knowledge exploitation approaches. We provide extensive examples and classify all schema matching systems under review in a resource/strategy matrix obtained by coalescing the two classification systems. Lastly, we outline interesting and as yet underexplored research directions for applying external knowledge within the schema matching process.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 01/Mar/2021
Suggestion:
Major Revision
Review Comment:

This manuscript was submitted as 'Survey Article' and should be reviewed along the following dimensions: (1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic. (2) How comprehensive and how balanced is the presentation and coverage. (3) Readability and clarity of the presentation. (4) Importance of the covered material to the broader Semantic Web community.

I would first like to thank the authors for their great effort in producing this good survey. The paper addresses a key topic in the Semantic Web domain, and such a survey will help researchers in the field stay up to date. Overall, it is well written, and an effort is made to reference as many key works in the ontology matching field as possible.

I highlight some comments below to help improve the quality of the paper and arrive at the best possible version of the survey.
1/ I would prefer to change the title from “Background Knowledge in Schema Matching: A Survey” to “Background Knowledge in Ontology Matching: A Survey”. It seems important to be precise, as the survey rather focuses on ontology matching.
2/ Overall, it is not clear why the authors want to involve the notion of schema matching in this survey, which is a much wider notion, even though the survey is purely about ontologies. The title and the content (including all captions of figures and tables) contain the term “schema matching”.
It could be read:
• “We introduce the reader to the schema matching problem and its abstraction, the ontology matching task”. (Abstract, page 1, line 22)
• “an ontology can be seen as a universal representation of a schema”. (page 2, lines 20-21)
• “schema matching problem can be formalized as ontology matching problem [42-45]”. (page 3, lines 45-46)
I do not agree with these statements, and I find the cited papers (in the last statement) [42-45] less convincing. These papers are not appropriate for supporting these statements.
I appreciate the way you have introduced ontology matching by first describing its generalization, schema matching, since ontology matching is a special case of schema matching but not the converse.
Most importantly, all the reviewed papers in the survey are from the ontology matching area, except for the several dated papers from the schema matching area that you have included.
If this were a schema matching survey, you would have provided a complete list of XML matching works, relational database matching works, etc., categorized them, and even compiled statistics for every type of schema. However, here you have included only some dated works, namely:
• XClust (2002) [33] and xMatcher (2020) [34] for XML matching;
• DIKE (2003) [36] and Xu and Embley (2003) [162] for relational database matching;
• WISE-Integrator (2004) [164], Stroulia and Wang (2005) [35], and BeMatch (2008) [177] for WSDL matching, and
• MOMIS (2011) [185], which is a very generic data integration work.
What criteria led to this selection? I would rather recommend first elaborating some objective criteria and then making the effort to be more exhaustive. For instance, among many others, it is worth mentioning COMA++ (2005) (https://dbs.uni-leipzig.de/Research/coma.html), which is a tool for XML and ontology matching.
I would advise the authors to see:
Rahm, E. (2011). Towards large-scale schema and ontology matching. In Schema matching and mapping (pp. 3-27). Springer, Berlin, Heidelberg. https://dbs.uni-leipzig.de/file/ch1-lssm.pdf
3/ In Subsection 3.5 (Background Knowledge in OM), on pages 5 and 6, WordNet is the only knowledge source that you describe in detail. Since this survey is about background knowledge sources, it would be much appreciated if at least the five most-used knowledge sources were described in detail, because the descriptions of knowledge sources in Tables 2 and 3 are very brief. Links (as footnotes) for the most-used sources would also be appreciated; you only provide links for WordNet, KGvec2go, and the Google Universal Sentence Encoder.
4/- There are some duplicated self-references; please remove one of each pair:
• Reference [64] is the same as reference [209] (KG OAEI Track)
• Reference [97] is the same as reference [153] (WebIsALOD)
• Reference [99] is the same as reference [133] but with different authors (ALOD2Vec)
• Reference [125] is the same as reference [156] (WeSeE-Match)
- Some references are not cited in the paper, such as your previous work [211]. Indeed, references from [210] to [214] appear in the References section but are not referenced in the text. Please do not use the \nocite instruction of LaTeX.
- There is a considerable amount of self-citations (28 in total). Please try to reduce this number. For example, choose a single paper for an approach that has multiple versions: for DBkWik ==> [63], [64], [86] or [209]; for WebIsALOD ==> [97], [152], [153] or [154]; for MELT ==> [60], [61] or [62]; for WeSeE-Match ==> [111], [125] or [156]; for ALOD2Vec ==> [98], [99] or [133]; for Wiktionary Matcher ==> [87] or [122]. Alternatively, you can keep more than one citation, but make sure to mention all citations together in one place rather than in separate places in the text, e.g., WeSeE-Match [111, 125], Wiktionary Matcher [87, 122], etc.
5/ When Tables 2 and 3 are first mentioned, the meaning of the codes in the “Source Classification” column is not clear, and even when the reader reaches the categorization section (Section 4), it is still not clear unless one looks at Figure 8 in parallel. It would be better to remove this column and instead create a new table, similar to Tables 5 and 6, in which the Strategy column (of Tables 5 and 6) is replaced by a Knowledge Source Name column. In other words, this proposed new table would contain two columns: the first is the Background Knowledge Type (as in Tables 5 and 6), and the second is the Background Knowledge (Name). This way, the categorization becomes clear, and the reader can directly see the knowledge sources of a certain type. The “Source Classification” column of Tables 2 and 3 can then be replaced by an “Availability” column, in which you tick knowledge sources that are still available, so that the reader can directly see which knowledge sources are no longer available. Otherwise, you can keep the “Source Classification” column as it is and add notes to Tables 2 and 3, e.g., G: General, D: Domain-specific, S: Structured, SW: Semantic Web, LT: Lexical and Taxonomical… It would also be good to mention the languages of the mentioned APIs (Java, …).
6/ In Table 4, page 16, it should be noted that HCONE (2004) [163], MoA (2005) [165], and ILIADS (2007) [174] are approaches that perform the ontology matching task as a preliminary step for merging/integrating the input ontologies; the final goal of these works is ontology merging/integration. However, there are many other ontology merging/integration works that include an automatic matching step, such as FCA-Merge (2001), SAMBO (2006), ContentMap (2009), DKP-AOM (2012), Babylon Health (2018), etc. Therefore, the reviewed research works are not exhaustive if the aim is to take ontology merging/integration works into account! I would suggest including a remark indicating that matching in the context of these works is part of the ontology integration process. This recent review seems relevant in that context (Osman et al., https://doi.org/10.1016/j.inffus.2021.01.007).

Minor Comments:
• In Equation (2), it should rather be “Recall”, not “Precision” (page 4, line 9).
• The ontology matching tool is called LDOA (2011) [96], not LODA (page 11, line 39; Table 2, page 13; and Table 6, page 22). Please correct it.
• The ontology matching tool is called ILIADS (2007) [174], not ILIDAS (Table 4, page 16).
• The ontology matching tool is called SPHeRe (2013) [190], not SHeRe (Table 4, page 16).
• In Table 2 on page 13, DBnary [121] is an RDF knowledge graph, so it should normally have the code SW-S, not LT-MU, no?
• In Table 3 on page 14, the source classification of “SAP Term” should be D-U-NT. Add -NT.
• In Table 3 on page 14, BLOOMS (2010) [158] should not be written in italics because it participated in the OAEI, as shown in Figure 3.
• In Table 3 on page 14, please indicate that Swoogle [77] is no longer available, as you did for Lanes API and OpenCyc (in the Source Description column).
• Are lexicons, controlled vocabularies, and thesauri considered as LT or SW? For example, in Tables 2 and 3, you assigned MeSH the type LT-MO and RadLex the type LT-MO, whereas you assigned UMLS the type SW-L. Is MeSH not of type SW?
• In Table 3, in the WebIsALOD row, please write “ALOD2Vec Matcher (2018) [98, 154]” instead of “Portisch (2018) [154]” and “ALOD2Vec Matcher (2018) [98]” separately, because it looks like two different approaches using the knowledge source, whereas they refer to the same work, and this can skew the statistics afterwards.
• Please make sure that figures and tables are placed on the page nearest to where they are first mentioned. As it stands, the reader has to flip to a later page to see a cited figure or table, which is a bit bothersome.
• On page 15, line 44, you indicate “similarity approaches are typically used that can handle multiple senses”. Please correct it to: “similarity approaches, that are typically used, can handle multiple senses”.
• On page 17, line 50, please correct the following sentence: “we differentiate reasoning from the factual queries in that a reasoning operation is applied that goes beyond querying a graph with an ASK query for equivalence or any other relation between two concepts”. (Maybe “is applied that” should be removed.)
• On page 19, line 38, you say “when we compare the matrix table 5 and 6, we quickly see that there are…”. Please correct it to: “when we compare the matrix in tables 5 and 6”.
• Try to rotate Table 6 so that Tables 5 and 6 have the same orientation.
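As a toy illustration of the factual-query vs. reasoning distinction quoted in the comments above: a factual (ASK-style) query only checks whether an equivalence edge is explicitly stated, whereas a reasoning operation derives links that are merely entailed, e.g., via the symmetric-transitive closure of equivalence edges. All concept names and the tiny graph below are made up for illustration; this is a sketch, not code from the survey:

```python
# Stated equivalences in a hypothetical background knowledge source.
EQUIV = {("Car", "Automobile"), ("Automobile", "Motorcar")}

def factual_query(a: str, b: str) -> bool:
    """ASK-style factual query: is the edge stated in either direction?"""
    return (a, b) in EQUIV or (b, a) in EQUIV

def entailed(a: str, b: str) -> bool:
    """Reasoning: follow stated edges transitively (symmetric closure)."""
    seen, frontier = {a}, [a]
    while frontier:
        node = frontier.pop()
        for x, y in EQUIV:
            # An edge touching the current node yields its other endpoint.
            for nxt in ((y,) if x == node else (x,) if y == node else ()):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
    return b in seen

# "Car" and "Motorcar" are never stated as equivalent, but the reasoning
# step derives the link through the intermediate concept "Automobile".
```

Here `factual_query("Car", "Motorcar")` fails while `entailed("Car", "Motorcar")` succeeds, which is exactly the gap between querying a graph and applying reasoning on it.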

Review #2
Anonymous submitted on 10/Mar/2021
Suggestion:
Minor Revision
Review Comment:

The survey is well structured, well written, and very comprehensive. It provides a well-thought-out introduction to the use of background knowledge (BK) in OM, and it organizes the different approaches into well-defined categories that make it easier to understand the state of the art in this area.

I find that there are three aspects that merit improvement.

1. Background Knowledge Selection
I find the subsection dedicated to “Background Knowledge Selection in Ontology Matching” extremely short, given the relevance of the topic. This relevance is also acknowledged by the authors, who name it as an area for future work (in Section 7.5).

Consequently, I am unsure of its placement inside Section 3.3. It is not exactly a step in the overall OM process described in the rest of the section, and it would merit further detail.
There are a few missing references to relevant work in this area:

Annane et al.'s works (10.1007/978-3-319-49004-5_2, 10.1016/j.websem.2018.04.001), which focus on finding the appropriate concepts from BK ontologies to support mapping.

Jiménez-Ruiz et al.'s work on using a large-scale portal of biomedical ontologies to select sources from (http://ceur-ws.org/Vol-1272/paper_67.pdf). This work actually tackles the challenge presented in 7.5, since LogMapBio uses BioPortal, which has over 800 ontologies to choose from.

2. Linking strategies and exploitations approaches
In section 6, a clear explanation of how linking strategies intersect with exploitation approaches is missing. It appears that not every linking strategy would work with every approach.

3. BK impact evaluation
The survey also misses an overview of how BK sources contribute to improving matching performance. I understand that not many works address this topic, but there are a few:
LogMap's performance can be compared to LogMapBio's in the OAEI to see the impact of using the external BioPortal ontologies.
AML was evaluated on the alignment of biomedical ontologies with and without BK sources (https://doi.org/10.1186/s13326-017-0170-9). There was also an older version of AML for OAEI 2013 that ran with and without BK (http://dit.unitn.it/~p2p/OM-2013/oaei13_paper1.pdf).
YAM++ was also used as the base matcher in (10.1016/j.websem.2018.04.001), so it affords a comparison between using and not using BK sources.

A section on this would provide a strong motivation for future research on the topic.

Additionally, the authors can also take a look at
https://doi.org/10.1186/s13326-017-0166-5 and https://doi.org/10.1186/s13326-017-0162-9 where the BK sources used in the anatomy and large biomed tracks of the OAEI are discussed.

Review #3
Anonymous submitted on 18/Apr/2021
Suggestion:
Major Revision
Review Comment:

This paper surveys the use of background knowledge in schema matching, in terms of types of background knowledge sources that have been used; the strategies for linking schema entities to background knowledge; and strategies for exploiting background knowledge within the schema matching process. For each of these dimensions, a classification is proposed: (i) type of background knowledge sources (domain-specific, general-purpose, structured -- lexical and taxonomical, factual database, semantic web dataset, pre-trained neural models, etc. and unstructured -- textual, non-textual); (ii) strategies for schema and background knowledge linking (given links, direct linking, fuzzy linking and WSD); and (iii) exploitation strategies (factual query, structure-based, statistical/neural, logic-based).

While the paper presents a comprehensive overview in the topic with an extensive literature, I have some main concerns.

First, with respect to the scope and positioning. The authors tried to define the scope in several passages: "We introduce the reader to the schema matching problem and its abstraction, the ontology matching task"; "This includes papers that focus on schema matching in a different technological area such as DTD matching (e.g. [33]), XML Schema matching (e.g. [34]), WSDL matching (e.g. [35]), or relational database matching (e.g. [36])"; "Nonetheless, most of the papers of this survey are from the ontology matching domain as an ontology can be seen as a universal representation of a schema (see Subsection 3.3)." Despite these efforts, however, the scope is still not clear: the paper mostly describes works on schema matching (TBox) of ontologies, not schema matching in the large sense (at least, the paper contains no explicit description of works addressing relational schema matching: works [33-36] appear only in Table 4, not in Tables 2 and 3, and are not described in the text). An ontology can indeed be seen as a universal representation of a schema, but these different schemas have very different levels of expressiveness, which has not been taken into account at all in the paper. In that sense, I do not agree with the statement "even though the term ontology is used in this paper -- the presented methods can be equally applied to other matching problems such as database schema matching or XML schema matching [46]." Does the expressiveness of the schemas to be matched really have no impact on the use of the external resource? These points have to be clarified in the paper.

Second, particular attention is given to the OAEI (track descriptions, evaluation strategies, participating systems; Figure 2, Figure 3, Figure 7, Table 1). It is true that the OAEI is a reference in the field, but, in the same sense as the comment above, the survey should go beyond the OAEI in terms of schema matching (matching XML, relational schemas, etc., with the specificities of these different "schemas"). Again, there is no information about the specific kind of schema the non-OAEI systems are able to deal with, nor about the datasets on which they have been evaluated (in particular for the 14 systems in Table 4 that use WordNet). It reads more like a review of background knowledge in the OAEI.

Third, the critical aspect of background knowledge selection has been mostly neglected in the paper. This, however, is an interesting point, and some guidelines on choosing "good" background knowledge should be provided, in particular in the discussion. Furthermore, some words are missing on the quality of background knowledge resources and how they have been constructed: manually (WordNet), semi-automatically, or automatically (BabelNet); this could also be included as a category in the classification. The quality can have an impact on the matching results. This is also the case for multilingual resources with different language coverage (for instance, the French lexicon in BabelNet contains a lot of noise that does not appear in the English lexicon).

Fourth, the discussion has a clear OAEI bias, mostly covering the drawbacks and open challenges related to the OAEI tracks. The discussion should also address the challenges of re-using the solutions in real cases and industrial scenarios, and how the different levels of expressiveness of the different types of schemas impact the matching process and the selection of the background knowledge.

For those reasons, the recommendation is major revision.

---------------------------------------------
Minor comments:

1. Introduction

- A more explicit link between the 4th and 5th paragraphs of the introduction (surveys on matching and background knowledge, and context-based approaches) is missing.

- "The matching techniques further studied in this survey can be broadly categorized as context-based approaches according to Euzenat and Shvaiko" vs. "Logic-based approaches apply reasoning on or together with the external resources. This class of approach is also referred to as context-based matching [11]"

2. About this survey

- "In this survey, we cover all matching systems that participated in the schema matching tracks of the OAEI from its inception in 2004 until 2020 [13–28]." ==> missing reference to OM 2020

3. Schema Matching and Ontology Matching

- Evaluation of Automated Schema Matching Systems => this subsection seems not to be required

- Background Knowledge in Ontology Matching => in Schema Matching

- Background Knowledge in OM ==> Background Knowledge in OAEI ?

4. Categorization of Background Knowledge in Schema Matching

- Put the tables closer to their citations in the text (Table 2 indicates the kind of schema that is used)

5. Categorization of Linking Approaches

- "Our analysis on how concepts are linked into the background knowledge source revealed that most matching systems do not perform elaborated linking approaches but use a direct string lookup".
This statement is quite surprising, given the number of matching systems exploiting WordNet, for which disambiguation is required. "We did not find matching systems that try to actually disambiguate the sense of a label through Word Sense Disambiguation – despite the heavy usage of WordNet (which is built around senses)" => which similarities? What is "real" WSD?

- Lastly, (iv) logic based approaches => Lastly, logic based approaches

- Table 4: should be interesting to indicate the kind of matched schema for the systems not participating at OAEI

- It is important to note that reasoning can also be applied across multiple ontologies: Locoro et al. [11] ==> . Locoro

- According to Figure 11, logic-based approaches can also be considered a form of indirect matching

- "However, we did not find broad usage of logic-based exploitation approaches in past and current (OAEI and non-OAEI) schema matching systems that go beyond singled out experiments". => Does LogMap not apply any reasoning involving UMLS?

- Pre-trained embedding-models and architectures, for instance, are so far rarely used but may be very promising given breakthroughs in other scientific communities. ==> These resources have been made fully available quite recently.

- Structural approaches are almost completely limited to WordNet and their exploration on multilingual datasets and in Semantic Web datasets may yield interesting results given good results on WordNet and given that this class of approaches is typically intuitive to understand and can be comprehended by humans (unlike neural models). ==> The multilingual aspect here should be clarified (do the authors refer to the different versions of WordNet in different languages?)

- If we take a closer look at the domain-specific knowledge sources used, it is striking that almost all datasets are from the biomedical domain. => OAEI bias ?

- Enterprise schema matching and integration challenges in the business world, for example, are not reflected at all in OAEI tracks. => what about the Process Model Matching at OAEI?

- While multiple automatic background knowledge selection approaches have been proposed (see Section 3.3) => very short section

References

Missing ones in background knowledge selection and other surveys.

@inproceedings{tigrine:lirmm-01407888,
TITLE = {{Selecting Optimal Background Knowledge Sources for the Ontology Matching Task}},
AUTHOR = {Tigrine, Abdel Nasser and Bellahsene, Zohra and Todorov, Konstantin},
URL = {https://hal-lirmm.ccsd.cnrs.fr/lirmm-01407888},
BOOKTITLE = {{EKAW: Knowledge Engineering and Knowledge Management}},
ADDRESS = {Bologna, Italy},
SERIES = {Knowledge Engineering and Knowledge Management},
VOLUME = {LNCS},
NUMBER = {10024},
PAGES = {651-665},
YEAR = {2016},
MONTH = Nov,
DOI = {10.1007/978-3-319-49004-5\_42},
PDF = {https://hal-lirmm.ccsd.cnrs.fr/lirmm-01407888/file/Main.pdf},
HAL_ID = {lirmm-01407888},
HAL_VERSION = {v1},
}

@article{DBLP:journals/semweb/ThieblinHHT20,
author = {{\'{E}}lodie Thi{\'{e}}blin and
Ollivier Haemmerl{\'{e}} and
Nathalie Hernandez and
C{\'{a}}ssia Trojahn},
title = {Survey on complex ontology matching},
journal = {Semantic Web},
volume = {11},
number = {4},
pages = {689--727},
year = {2020},
url = {https://doi.org/10.3233/SW-190366},
doi = {10.3233/SW-190366},
timestamp = {Fri, 28 Aug 2020 15:32:46 +0200},
biburl = {https://dblp.org/rec/journals/semweb/ThieblinHHT20.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}

Some references are incomplete or outdated:

F.J.Q. Real, G. Bella, F. McNeill and A. Bundy, Using Domain Lexicon and Grammar for Ontology Matching, 2020, to appear ==> where?

E. Thiéblin, O. Haemmerlé and C. Trojahn, Automatic evaluation of complex alignments: an instance-based approach (2020). ==> where?

S. Hertling, J. Portisch and H. Paulheim, Supervised ONtology and Instance matching with MELT, in: OM@ISWC 2020, 2020, to appear.

D. Faria, C. Pesquita, T. Tervo, F.M. Couto and I.F. Cruz, AML and AMLC Results for OAEI 2020., OM@ISWC 2020 (2019), to appear.

(and all other OAEI 2020 papers)


Comments

Also check "Experiences from the anatomy track in the ontology alignment evaluation initiative" (https://doi.org/10.1186/s13326-017-0166-5), which has a section on the use of background knowledge in 10 years of the OAEI Anatomy track.