Review Comment:
The paper tests different supervised and unsupervised rank aggregation methods on a cross-lingual ontology matching (OM) task. The results show that rank aggregation methods are promising for improving the quality of the results returned by the system.
The paper provides some contributions that are potentially interesting for supporting OM, and perhaps cross-lingual OM in particular. I particularly appreciated the following:
• A systematic evaluation of several state-of-the-art supervised learning-to-rank methods to support OM; many of these approaches have not been tested enough in the field of ontology matching, and it is good to bring in techniques that may have been overlooked.
• The idea of using learning to rank is interesting because it can also be used with rankings produced by similarity measures whose scores do not fall within a common [0,1] interval (e.g., Lucene Conceptual Scoring, which is very handy in several practical problems).
• In the experiments, only a small part of the alignment has been used for training the supervised approaches (unlike some other works that have tested machine learning methods); this makes learning to rank, applied to rank aggregation, a good candidate to support interactive matching, which is a very important task and much under the attention of the community today.
• The experimental results provide hints that the proposed techniques may bring benefits in terms of performance. In particular, it should be noticed that the performance discussed in the paper is compared against much more complex OM systems.
All the above observations make me think that there are some nice ideas in this paper. However, I think that these ideas must be developed much further. In its current state, the paper has too many weaknesses, which keep its contributions quite far from the kind of clear-cut and robust contributions that are expected in a Semantic Web Journal paper. The paper as such fails to convince that its contributions deliver significant results. The presentation should also be much improved. Finally, the lack of comparison with related work makes it difficult to assess its actual novelty.
For these reasons, I have to suggest that the submission be rejected; at the same time, I encourage the authors to further develop their ideas and re-submit the paper in the future.
1) Claims of the paper, overall approach, and significance of results
The paper discusses a problem framed as “ranking aggregation” and evaluates it on cross-lingual ontology matching. However, this rank aggregation problem (see more comments on the terminology below) is general enough to apply to any OM task, not only to cross-lingual OM tasks. So it is not clear at all why the authors have focused on cross-lingual OM. What is so peculiar about cross-lingual OM that it requires this special attention to rank aggregation? The question is relevant because:
• the same problem has been addressed in OM as the problem of combining different matchers, which calls for a stronger comparison with related work (see the Related Work section);
• far fewer resources are available to evaluate cross-lingual OM: one OAEI track with few ontologies in a specific domain; blind evaluation; few systems that have participated in the track; other OM tasks have been used to evaluate cross-lingual OM approaches, but in quite different settings (e.g., linguistic ontologies).
The authors should take one of these options: further motivate and substantiate the peculiarity of cross-lingual OM that makes it the best application field for the proposed rank aggregation techniques; or evaluate their work in a broader experimental setting, which also considers mono-lingual OM tasks, and compare their work with approaches proposed for similar (but I would rather say “equivalent”) problems.
Here I would like to proactively share some insights with the authors, hoping they could be useful for a future submission.
• There are some peculiar issues emerging in the cross-lingual domain, which IMHO are relevant for your work and have not been adequately considered (see the sketch after this list):
--- Machine translation introduces a branching factor when more than one translation is available for a word; multiple translations are particularly useful for handling polysemous words; ambiguous words are translated differently depending on the context in which they appear, but for concept labels you frequently do not have the context; it is not clear how this issue is considered in the proposed approach. If you query a machine translation service, you can trust the top result (which is very naïve) or collect several possible translations; when concepts from distinct ontologies are translated, different translations can be returned even for similar words, so it is usually necessary to deal with multiple translations. This problem may not emerge with the NASARI-based similarity, but it definitely emerges with the syntactic and lexical similarity methods that are run after the translations are collected. The paper does not explain at all how the results of machine translation are used, and it does not seem that multiple translations are collected and used (and, if they are, how this is done is not explained).
--- If, on top of that, you use a reference lexical knowledge base like WordNet, a second branching factor may occur (each translation may hit more than one concept). The paper does not explain how WordNet is used to compute the similarity.
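To make the first point concrete, here is a minimal sketch of what I mean by dealing with multiple translations; `translate` and `string_sim` are hypothetical stand-ins, not functions from the paper:

def translate(label: str, target_lang: str) -> list[str]:
    """Return several candidate translations, not just the top one."""
    ...  # e.g., collect the n-best list from an MT service

def cross_lingual_sim(label1: str, label2: str, string_sim) -> float:
    en1 = translate(label1, "en")
    en2 = translate(label2, "en")
    # The branching factor: |en1| x |en2| pairs must be compared;
    # here the maximum similarity over all translation pairs is taken.
    return max(string_sim(t1, t2) for t1 in en1 for t2 in en2)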
• There are some known aspects of OM that are relevant for your work and, I think, have not been adequately considered:
--- Some OM systems compute, for each similarity function, the similarity between every pair of elements; this is equivalent to storing a matrix per similarity function. Other systems, e.g., AML and, I think, LogMap, do not do this, in order to scale to very large ontologies. If a system uses similarity matrices, rank aggregation is IMHO equivalent to combining these matrices, that is, combining different similarity scores (both supervised and unsupervised approaches exist).
--- The methods proposed so far for combining similarity scores may have problems when full similarity matrices are not stored (because of data sparsity); in this case, an approach specifically designed to merge rankings instead of similarity scores may be useful, e.g., for very large ontologies.
--- The latter may be a nice motivating scenario for your method.
--- Approaches that combine similarity scores, via known functions (avg, max, min, etc.) or more sophisticated methods (e.g., weighted linear combinations), have trouble when scores other than similarity measures are used (e.g., Lucene Conceptual Scoring, PageRank, etc.); your method could fill this gap, but in the definition of the problem you assume that all the scores are normalized in a [0,1] range. I suggest considering other measures and discussing this as an advantage of ranking-based aggregation vs. score-based aggregation (see the sketch after this list).
--- Matching each source concept with its most similar concept is a very naïve mapping selection strategy. Most systems use at least a threshold; others may learn when to match using machine learning or optimization methods (also applied to cross-lingual matching). Otherwise, every source concept is always matched, which is unrealistic because some concepts of the source ontology may not be covered by the target ontology. This problem does not surface in your method because, in this specific task, the same ontology has been translated into different languages; but OM systems are not meant to work only on benchmarks, they must work on real problems. Observe that this adds a bias to the significance of your results, because the hidden assumption is that a “best match always exists among the candidate concepts”, which is only true in this particular evaluation setting.
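As an illustration of rank-based aggregation over heterogeneous, unnormalized scores, here is a minimal Borda-count sketch (my suggestion, not the paper's method; the scorer names and candidate labels are invented):

from collections import defaultdict

def borda(rankings):
    """Aggregate partial top-k rankings; each ranking lists candidates, best first."""
    points = defaultdict(float)
    for ranking in rankings:
        k = len(ranking)
        for pos, cand in enumerate(ranking):
            points[cand] += k - pos  # higher position -> more points
    return sorted(points.items(), key=lambda kv: -kv[1])

# Scores on incomparable scales (Lucene score, PageRank, edit similarity)
# are reduced to ranks before aggregation, so no [0,1] normalization is needed:
lucene_top5   = ["person", "human", "agent", "author", "party"]
pagerank_top5 = ["agent", "person", "party", "human", "entity"]
editsim_top5  = ["person", "party", "human", "entity", "agent"]
print(borda([lucene_top5, pagerank_top5, editsim_top5])[:3])

Note that this also works when only partial rankings (top-k candidates) are available, i.e., without full similarity matrices.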
In summary, I think there is room for a scenario, a problem definition, and an evaluation setting that better support the valuable idea of your paper (using ranking-based aggregation instead of score-based aggregation). Suggestions are: combining matchers that use scores other than [0,1]-constrained similarities; dealing with large ontologies where partial rankings are generated for each concept (e.g., the top-15 most similar concepts) without computing entire similarity matrices; considering the scenario where user inputs are used to customize the rank aggregation function.
2) State-of-the-art
The authors cover the state of the art in Information Retrieval-based methods. However, important papers in two subfields relevant to your work are missing:
A - Cross-lingual ontology matching
Chen, J., Xue, X., Huang, Y., & Zhang, X. (2019). Interactive Cross-Lingual Ontology Matching. IEEE Access.
Bella, G., Giunchiglia, F., & McNeill, F. (2017). Language and domain aware lightweight ontology matching. Journal of Web Semantics, 43, 1-17.
Helou, M. A., Palmonari, M., & Jarrar, M. (2016). Effectiveness of automatic translations for cross-lingual ontology mapping. Journal of Artificial Intelligence Research, 55, 165-208.
Helou, M. A., & Palmonari, M. (2015, September). Cross-lingual lexical matching with word translation and local similarity optimization. In Proceedings of the 11th International Conference on Semantic Systems (pp. 97-104). ACM.
(I suggest checking also publications on problems related to cross-lingual OM, like cross-lingual ontology enrichment:)
Ercan, G., & Haziyev, F. (2019). Synset expansion on translation graph for automatic wordnet construction. Information Processing & Management, 56(1), 130-150.
Ali, M., Fathalla, S., Ibrahim, S., Kholief, M., & Hassan, Y. (2018). Cross-Lingual Ontology Enrichment Based on Multi-Agent Architecture. Procedia Computer Science, 137, 127-138.
B - Matcher combination
Cruz, I. F., Antonelli, F. P., & Stroe, C. (2009). Efficient Selection of Mappings and Automatic Quality-driven Combination of Matching Methods. In Proceedings of the Workshop on Ontology Matching (OM 2009).
Eckert, K., Meilicke, C., & Stuckenschmidt, H. (2009, May). Improving ontology matching using meta-level learning. In European Semantic Web Conference (pp. 158-172). Springer, Berlin, Heidelberg.
Xue, X., Wang, Y., & Hao, W. (2015). Optimizing Ontology Alignments by using NSGA-II. International Arab Journal of Information Technology (IAJIT), 12(2).
(Related to combination and worth a look:)
Duan, S., Fokoue, A., & Srinivas, K. (2010, November). One size does not fit all: Customizing ontology alignment using user feedback. In International Semantic Web Conference (pp. 177-192). Springer, Berlin, Heidelberg.
3) Experimental evaluation
See point 1) for the main arguments against the experimental evaluation. I am aware of the small number of resources for cross-lingual ontology matching. However, the proposed evaluation is insufficient because 1) it does not consider relevant related work, and 2) it exploits assumptions that hold in the evaluation dataset but are unlikely to hold in real-world scenarios (for every source concept there exists exactly one correct match).
In addition, the proposed evaluation, which compares against the average results of other systems, is not very convincing, in particular given the effect of the language pair on performance. I suggest downloading the tools (I think that the set of three selected OM systems is ok) and conducting experiments with these systems. It is also possible to get in touch with the developers to make sure the correct settings are used. If this process is not successful, the authors may report on the attempt and go for a second-best option.
See detailed comments for more remarks.
4) Presentation
Overall, the presentation needs to be improved significantly. The paper is written in good English but has a list of problems:
• Some intuitive and well-understood concepts are stressed too much (with some repetitions), while not enough details are given about the most important sections; see the detailed comments for examples. Here I mention the lack of details about how translations are managed and about cryptic similarity measures (e.g., the one based on WordNet). Also, as a major remark, CombANZ is not defined; I think it is important to define at least the best-performing combination methods (much more relevant than defining Levenshtein or Jaro-Winkler, which are well known and not the object of the present work).
• The formalization of the problem is not helpful; there is no need to define the problem in the most generic way and then use a very restricted version of it (only equivalence mappings are considered, a naïve mapping selection method is applied, and thesauri are mentioned as possible ontologies to map while no comparison with approaches to matching lexical ontologies is given). A general introduction is fine, but in the problem formalization I suggest focusing on the specific setting that is supported by your work. See detailed comments.
5) Detailed comments
Page 1.
“We use ontology in a general sense, including taxonomies and thesauri [1].” → Then I expect you to compare with work in this field, including cross-lingual ontology enrichment, which requires matching against a thesaurus / semantic net.
Page 2
“While single ranking techniques are used in ontology matching [7], rank aggregation is yet under explored.” → This is really not true; see the suggested related work.
“with a set of attributes “ → I do not understand what an attribute is; metadata? Synonyms? Again, I think there is a mismatch between the definitions here and the particular case considered in the evaluation.
“Each relation r(c_1,c_2)\in R […]” → I find this definition extremely confusing. First r(c_1,c_2) is defined as a member of the set R, then it is defined as a function. Wouldn't it be simpler to just say that there is a set of relation symbols r_1,...,r_n and that a mapping is a triple with r \in R? A sketch of what I have in mind follows.
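In LaTeX, for instance (the specific relation symbols are just examples, not taken from the paper):

\[
  R = \{\equiv,\ \sqsubseteq,\ \sqsupseteq,\ \bot\}, \qquad
  m = \langle c_1, c_2, r \rangle \quad \text{with } c_1 \in O_1,\ c_2 \in O_2,\ r \in R .
\]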
Page 3
The definition becomes overcomplicated from “Ontology Matching” on. I suggest using a definition from previous work and/or modifying it only to the extent that the change is significant for your work (for example, you consider only one kind of relation, equivalence; different relations often require very different methods).
When considering only equivalence relations, do you also assume that the cardinality of the mapping is 1:1? Or do you consider the case where you return an alignment of cardinality m:n? I think that by using a greedy approach to mapping selection you can possibly output an alignment with a cardinality different from 1:1 (see the sketch below). This may somehow be ok, but it is a relevant aspect of the problem formalization to clarify.
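A minimal, purely illustrative sketch (the data is invented): greedy best-match selection can output an m:1 alignment and, without a threshold, never leaves a source concept unmatched:

def select_mappings(sim, threshold):
    """sim maps (source, target) pairs to scores; returns the selected pairs."""
    selected = []
    sources = {s for s, _ in sim}
    for s in sorted(sources):
        best_t, best_score = max(
            ((t, v) for (s2, t), v in sim.items() if s2 == s),
            key=lambda tv: tv[1],
        )
        if best_score >= threshold:  # drop this test and every source is matched
            selected.append((s, best_t))
    return selected

sim = {("Person", "Pessoa"): 0.90, ("Person", "Agente"): 0.40,
       ("Human", "Pessoa"): 0.85, ("Human", "Agente"): 0.30,
       ("Invoice", "Agente"): 0.20}  # "Invoice" has no real counterpart
print(select_mappings(sim, threshold=0.5))
# [('Human', 'Pessoa'), ('Person', 'Pessoa')] -> not 1:1; 'Invoice' unmatched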
Page 4
I was quite surprised that RankSVM was not considered among the possible approaches. It seems quite natural to use it over vectors that represent the different similarity scores, and there are quite efficient implementations of the algorithm, in particular for short vectors. A sketch of what I have in mind is given below.
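A minimal sketch (my suggestion, not part of the paper): RankSVM via the standard pairwise transform, training a linear SVM on differences of score vectors; the toy data is invented:

import numpy as np
from sklearn.svm import LinearSVC

def pairwise_transform(X, y):
    """Build difference vectors for candidate pairs with different relevance labels."""
    Xp, yp = [], []
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] > y[j]:
                Xp.append(X[i] - X[j]); yp.append(1)
                Xp.append(X[j] - X[i]); yp.append(-1)
    return np.array(Xp), np.array(yp)

# Toy data: 3 candidates x 3 matcher scores; candidate 0 is the correct match.
X = np.array([[0.9, 0.7, 0.8], [0.4, 0.5, 0.3], [0.2, 0.6, 0.1]])
y = np.array([1, 0, 0])
Xp, yp = pairwise_transform(X, y)
model = LinearSVC().fit(Xp, yp)
ranking_scores = X @ model.coef_.ravel()  # higher = better candidate
print(ranking_scores.argsort()[::-1])     # candidate 0 should rank first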
Page 5
“For similarity […]” → Do you only consider the top result returned by the service? What about ambiguous words? What if the service returns different English words for two equivalent concepts labeled in different languages?
“First, the similarity values […]” → I was really confused to find pessoa here among the concepts. Figure 8 is confusing, as Author and Author of contributions seem to be properties. So what is this graph representing?
Considering the figures referred to here: I suggest using real examples from the ontologies in Figure 4; otherwise it is not very informative (the idea described in this figure is very trivial for those familiar with OM). Figures 5, 6 and 7 are not very useful as they are now. I suggest making one example with data from the ontologies used in the experiments and showing the whole data processing flow in one figure.
“Levenshtein and Jaro, chosen by their performance reported by Christen study [39]” → This study was about instance matching. Jaro-Winkler, for example, should be ok for concepts too, but it is particularly good for names. For concepts, a measure that better handles multi-token labels could be useful (e.g., token-level Jaccard, n-grams; see the sketch below). I do not object much to the choice of similarity measures, because your goal is to improve the results by rank aggregation rather than to find the best measures to combine, but be careful with the reference.
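For instance, a minimal token-level Jaccard (purely illustrative):

def token_jaccard(label1, label2):
    """Jaccard similarity over lowercased tokens; better for multi-token labels."""
    t1, t2 = set(label1.lower().split()), set(label2.lower().split())
    return len(t1 & t2) / len(t1 | t2) if t1 | t2 else 0.0

print(token_jaccard("Author of contribution", "Contribution author"))  # ~0.67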
Page 9
Table 3 is not informative; I suggest deleting it.
“An evaluation protocol […]” → I appreciate the small set used for training, but more details are needed. It is not clear whether you build a different model for each language pair (selecting 15% from each language pair) or one model for all.
“Although the unsupervised method” → It is ok to report results with this setting, but you could also add a table for the unsupervised methods that reports on the whole dataset.
Table 4. Explain CombANZ in the paper. Also, I think the evaluation should include a more detailed analysis of the best-performing learning methods and compare their performance on a different task. The main question is: what conclusion should a user draw from these results? Which method should be used? Remember that in a real-world setting the gold standard is not available.
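For reference, in the Fox and Shaw family of combination rules, CombSUM is the sum of the scores a candidate receives, CombMNZ multiplies CombSUM by the number of non-zero scores, and CombANZ divides CombSUM by that number (i.e., the average of the non-zero scores). A minimal sketch:

def comb_sum(scores):
    # scores: one score per system for a single candidate; 0 = not retrieved
    return sum(scores)

def comb_mnz(scores):
    return comb_sum(scores) * sum(1 for s in scores if s > 0)

def comb_anz(scores):
    nz = sum(1 for s in scores if s > 0)
    return comb_sum(scores) / nz if nz else 0.0

print(comb_anz([0.8, 0.0, 0.4]))  # (0.8 + 0.4) / 2 = 0.6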