Computing Entity Semantic Similarity using Ranked Lists of Features

Tracking #: 1737-2949

Livia Ruback
Marco Antonio Casanova
Chiara Renso
Claudio Lucchese
Alexander Mera
Grettel Monteagudo García

Responsible editor: 
Claudia d'Amato

Submission type: 
Full Paper
This article presents a novel approach to estimate semantic entity similarity using entity features available as Linked Data. The key idea is to exploit ranked lists of features, extracted from Linked Data sources, as a representation of the entities to be compared. The similarity between two entities is then estimated by comparing their ranked lists of features. The article describes experiments with museum data from DBpedia, with datasets from a LOD catalogue, and with computer science conferences from the DBLP repository. The experiments demonstrate that entity similarity, computed using ranked lists of features, achieves better accuracy than state-of-the-art measures.


Solicited Reviews:
Review #1
Anonymous submitted on 29/Nov/2017
Major Revision
Review Comment:

The paper uses a framework, "SELEcTor", to measure the similarity between two entities based on their ranked lists of features. The method is evaluated on subsets of linked open datasets: DBpedia, the LOD catalogue, and DBLP. The paper is well written and easy to follow. Here are some comments for improving the paper:

* Related work should be improved by adding more details and other relevant literature.

* Between sections 3 (preliminaries) and 4 (evaluation), there should be a section "Proposed Method" which should explain the proposed method in detail.

* It should be made clear how the proposed method differs from the previous work of the authors [1] (other than the new experiments on DBLP and the LOD catalogue).

* The results of the proposed method should be compared with similar approaches, like Mirizzi et al. [2], and more recent ones (if any).

* Check the column names in tables. For example, AO in Table 2 should be RBO.

* Try to avoid ambiguous terms like "some" and "may"; be specific.

[1] Ruback, Lívia, et al. "SELEcTor: discovering similar entities on LinkEd DaTa by ranking their features." In 2017 IEEE 11th International Conference on Semantic Computing (ICSC). IEEE, 2017.

[2] Roberto Mirizzi, Azzurra Ragone, Tommaso Di Noia, and Eugenio Di Sciascio. Ranking the Linked Data: The Case of DBpedia, pages 337–354. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010.

Review #2
Anonymous submitted on 18/Dec/2017
Review Comment:

The paper proposes an implemented method that applies a rank comparison measure to the problem of assessing the similarity between individual resources in the context of a Linked Dataset.
Overall this work appears to be more appropriate for a conference paper than for a journal paper, which would require more in-depth insight and a solid comparison with the state-of-the-art methods related to the problem tackled. Even more so, I'm afraid it does not convince me of the originality of the method. Applying an existing method to a related context is a good idea but does not make a substantial contribution to advancing the state of the art.

Originality: The paper builds upon a well-known measure for comparing rankings; the only real novelty of the proposal may reside in its application to an emerging field (although it does not make a big effort to exploit the knowledge that is formally encoded in a linked dataset).
The resulting framework (SELEcTor) was already presented in [30]. The paper seems to contribute only an experiment on specific datasets.

Technical Quality: The results appear to be technically sound. However, conceptually, the choices made in the empirical evaluation are questionable. Hence the claims made do not seem to be well supported by the experimental results (there is little theoretical analysis).
The evaluation seems to be appropriate. Yet, the experiments could be made more convincing by adopting a baseline of valid competitors and also an alternative method that would exploit the proposed measure, e.g. one aimed at supervised learning (instead of clustering, as the validation index may be very dependent on the measure employed). The results may be replicated by other researchers, provided that further information is made available (e.g. number of repetitions, variability).
The authors should make an effort to clearly assess the strengths and weaknesses of their approach.

Significance: the paper addresses a central problem. I'm afraid the paper does not make a significant advance in the state of the art, and I doubt that it would be widely cited.
It does not open new research directions; instead it somehow overlooks much of the research in the field of the Semantic Web (at least that of the latest decade).

Relevance: as mentioned above, the paper is not well placed in the perspective of the Semantic Web literature (i.e. the scope of the journal). As such it would require a substantial improvement to be of interest to researchers in the field.
The related work section in the paper discusses various works of other fields (retrieval, recommender systems, clustering) but it shows a surprising unawareness of highly relevant prior works conducted in the reference area for this journal.
No explicit discussion, or at least a simple reference, seems to be made to the many approaches to the problem of assessing the similarity of resources in KBs expressed in (DL and/or) OWL/RDF, which would make the best gold standard for a comparative evaluation. I suggest having a look at the papers that appeared in this Journal as well as in the proceedings of the International and the Extended Semantic Web Conferences (ISWC/ESWC), and also those of EKAW, in the last decade.

The paper organization needs a substantial revision. The main problem lies in the fact that the proposed method is not formally presented in a proper section (between Sects. 3 and 4). Part of it is in the section on the Preliminaries; the rest is in the following section concerning the empirical evaluation, hinting that the real novel contribution is limited to the fine-tuning (elicitation of the features) required to apply existing techniques for assessing similarity to specific Linked Data contexts.
The quality of writing is generally fine (a few minor issues will be indicated below).

Detailed comments

• Overall the section is quite short for a journal paper: I'd suggest adding at least a convincing motivation and discussing the weaknesses of the existing approaches.
• It should be clarified from the very beginning whether the features extracted via the method to be presented are able to elicit (part of) the formal semantics published through the linked datasets (at least by being more specific on the relevance criterion adopted for ranking the features).
• The impression is that the application to the context of Linked Data is rather casual: e.g. the underlying open-world semantics of the targeted datasets is neither mentioned nor taken into account (at least in a discussion). The similarity estimate is likely subject to change as more facts are collected in the KB; this problem does not seem to be specifically addressed.
• The idea of assessing similarity through the alignment of ranked features is quite similar to proposed approaches for weighting features in metrics (e.g. via simple entropic weights), also used with various kernel functions, which essentially are similarity functions. In addition, feature construction methods have been proposed to elicit features (through evolutionary algorithms) that are subsequently used for clustering.

• Overall: a categorized list of (some) related methods, with only a few coming from the research area of interest for the Journal. An effort should be made to discuss the weaknesses and strengths of the method from a technical perspective.
• The discussion on the related ranking approaches for Linked Data should be extended from a technical viewpoint.
• Some mention is made of methods based on Wikipedia: one would expect a comparison with methods that are able to exploit the formal semantics encoded in DBpedia, the central source of linked data in the Web of Data cloud.
• The mentioned scores based on the traversal of a graph may be compared to the various kernel functions for individuals in RDF (or DL) KBs. More recent representation learning approaches based on embeddings in low-rank spaces have been effectively used as well. No mention is made of such related works.

Sect. 3
• from the beginning, it would be advisable to explain how the features are selected/constructed, as this has a decisive influence on the quality of the derived metric.
• AO's presentation seems a bit too sketchy. Are the weights mandatorily uniform?
• The measures based on incomplete ranking lists may offer a cue to discuss the problems related to the OWA (open-world assumption).
• The choice of WLM as a baseline is debatable, given the many alternative measures available in the literature that may directly exploit the semantics encoded in DBpedia. Even more debatable is the use of general-purpose metrics: the features themselves should depend on the available formal semantics of the Linked Datasets involved.
• The clustering method used for the empirical evaluation of the measure should be better presented. Only very basic and well-known details are provided, while one would expect a clear indication of the method and of the index used for assessing the performance of the clustering algorithm coupled with the proposed measure.
• The choice of the method and of the index should be justified. Probabilistic vs. deterministic? Medoid-based? Internal vs. external (supervised/unsupervised) performance indices?
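To make the comment on AO concrete: in its standard formulation, Average Overlap weights every depth-wise agreement between the two ranked lists equally, which is exactly the uniform weighting the question asks about. A minimal illustrative sketch (function and variable names are mine, not the paper's):

```python
# Average Overlap (AO) at evaluation depth k: the unweighted mean of the
# depth-wise agreements |S_{:d} ∩ T_{:d}| / d between two ranked lists.
def average_overlap(s, t, k):
    return sum(len(set(s[:d]) & set(t[:d])) / d for d in range(1, k + 1)) / k

# Toy example: lists agree on the top item and on the full set,
# but disagree on the order of the last two items.
print(average_overlap(["a", "b", "c"], ["a", "c", "b"], k=3))  # ≈ 0.833
```

Under this reading, the weights are indeed uniform (1/k per depth); RBO replaces them with a geometric discount.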

• One big section mixes theoretical and empirical issues. This is one of the main issues with the paper organization: the theoretical part, e.g. the definitions and the details of the approach, should be anticipated in a new dedicated section. Inserting it in 4.1 (Basic Definitions) goes to show that, methodologically, the contribution is quite limited if it can be merged with the setup of the empirical evaluation. My suggestion is to keep the presentation of the framework in a clearly separate section including the current Subsects. 4.1 and 4.2 (as well as an extension of some of the preliminaries).
• The presentation of the framework in 4.2 is quite coarse-grained: especially the "ranked features extractor" should be discussed (in more detail) in comparison with other feature selection (construction) approaches that have been proposed in the area of machine learning.
• Underlying Web ontologies, and the reasoning services that can exploit them, do not seem to be involved (only SPARQL query answering): is that because the method is really general-purpose and not especially targeted at the semantics of the KBs in the Semantic Web?
• Regarding the applicability of the method, it seems that fine-tuning is required to apply it to specific domains/linked datasets: how much human intervention is needed?
• Most of the empirical assessment of the method seems to be aimed at a qualitative rather than a quantitative evaluation. Technical details would be required to judge the significance of the few quantitative results: number of repetitions, changes of input conditions over the various trials, averages and standard deviations, etc.
• Even if they are the results of single-shot applications of the algorithms, the results do not seem to be significantly better than the selected gold standard. The fact that this varies from domain to domain hints that the performance is not stable.

Minor comments/typos:
• "In each of these domains, the experiments show that high quality features are, respectively: the art movements of the artworks in a museum; the Wikipedia top-level categories that describe a dataset; and (iii) keywords extracted from computer science conference publications" → Not much for a lesson learned
• "relevancies" ? plural of an uncountable noun?
• "(i.e. they do not cover all elements in the domain)"
the notion of domain seems a bit vague here: somewhat different from the previous usage of the term.
• WLM: a citation to a reference may be added here
• "emphasis retrieving" ?? did you mean "emphasizes retrieving"
• lg → log (better)
• "in a relatively few items" → "in relatively (a) few items"
• "Error! Reference source not found"

Review #3
Anonymous submitted on 15/Apr/2018
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

(1) originality
The proposed framework was introduced in the authors' previous work; however, this manuscript presents novel empirical results.

(2) significance of the results
In the empirical study, the proposed framework and experimental settings (baselines, metrics) have been carefully configured to address the particularities of each studied domain. Therefore, the scope of the results is limited to the studied settings, i.e., the results cannot be generalized.

(3) quality of writing
Overall, the paper is easy to follow. The structure of the paper can be improved by avoiding redundant information presented in both the “Related Work” and “Preliminaries” sections.

This work presents an empirical evaluation of SELEcTor, a framework to estimate the similarity of entities in a dataset. SELEcTor relies on ranked lists of entity features and similarity metrics. The authors study the performance of SELEcTor by constructing ground truths with entities from three different domains: i) 12 museums represented in DBpedia, ii) 125 datasets from the Linking Open Data (LOD) cloud, and iii) 248 Computer Science conferences represented in DBLP. The empirical results obtained in the three domains indicate that the proposed solution outperforms the baseline approaches.

This manuscript tackles a relevant research problem in the Semantic Web area. However, my major concern is that the proposed solution is not generalizable, i.e., the instantiations of SELEcTor are crafted for the specific characteristics of the ground truth. In addition, the performance of the proposed solution is not compared to the state of the art, and the results are not reproducible. More details about these major issues follow:

1) In the described approach, the entity features and metrics that better predict the similarities between entities (according to the constructed gold standards) are manually selected. This diminishes the research contributions of this work and prevents one from deriving significant conclusions about the performance of the approach.

2) The proposed solution is not properly positioned with respect to the state of the art. This is reflected in the empirical evaluation, where the authors compared the performance of three instances of SELEcTor against generic baselines, i.e., general purpose similarity measures. This work does not compare the proposed approach with state-of-the-art solutions for computing the semantic similarity of entities in RDF graphs, in particular for the “LOD Dataset” and the “Conference” domains. For instance, the works by Morales et al. [Mo17], Traverso et al. [Tr16], and Zhu & Iglesias [Zh17] are tailored for RDF graphs and should be included in the related work section as well as in the experimental study.

3) The proposed approach is not specific to Linked Data or RDF data. Besides using LD datasets as input data, the manuscript does not explain the role of the semantics encoded in RDF datasets in the proposed approach. This significantly reduces the relevance of this work to the Semantic Web Journal.

4) The reported results are unfortunately not reproducible. The versions of the datasets used in the empirical study are not specified and the constructed gold standards are not available. As a good practice, the authors should measure the performance of the proposed approach using well-known benchmarks. For instance, in the Bioinformatics domain, the CESSM online tool [Pe09] allows for comparing semantic similarity approaches using entities from the Gene Ontology.


Q1. In the Introduction, the authors mention that the “key idea is to represent each entity by a list of features (...) and ranked according to some relevance criterion”. How is the relevance criterion modeled/represented in the proposed approach?

Q2. In Section 3, what is the difference between “complete linkage” and “average linkage” clustering? The difference is not clear from the description provided in the paper.
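For reference, the standard distinction Q2 alludes to is in how the distance between two clusters is computed during agglomerative clustering: complete linkage takes the maximum pairwise distance, average linkage the mean. A minimal sketch (function names are illustrative, not taken from the paper):

```python
def complete_linkage(c1, c2, dist):
    """Inter-cluster distance = maximum pairwise distance (farthest pair)."""
    return max(dist(a, b) for a in c1 for b in c2)

def average_linkage(c1, c2, dist):
    """Inter-cluster distance = mean over all pairwise distances."""
    return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

# Toy example on the real line with absolute difference as the distance.
d = lambda a, b: abs(a - b)
c1, c2 = [0.0, 1.0], [2.0, 5.0]
print(complete_linkage(c1, c2, d))  # 5.0 (farthest pair: 0 and 5)
print(average_linkage(c1, c2, d))   # 3.0 (mean of 2, 5, 1, 4)
```

Complete linkage is sensitive to a single outlying pair, while average linkage smooths over all pairs; a paper using both should indeed state the criterion explicitly, as the reviewer asks.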

Q3. In Definition 1, what is the domain of sj?

Q4. In Section 4.3.1, what is the impact of the traversal strategy and the path length on the performance of the feature extraction process of DBpedia entities?

Q5. Regarding the feature extraction strategies in the "Museums" domain, how was the comparison between the two strategies conducted? How many entities were analyzed in this comparison?

Q6. Regarding the feature extraction in the "LOD Datasets" domain, how is the linking between dataset entities and Wikipedia articles performed? What is the quality of this linking process?

Q7. In the feature extraction in the "LOD Datasets" domain, how is a top-category defined in this context? The category hierarchy in Wikipedia presents cycles and may even lead to a single top category (Philosophy).
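To illustrate the concern in Q7: any definition of "top category" must handle cycles, e.g. via a visited set and a depth bound during upward traversal. A hypothetical sketch (not the paper's algorithm; the toy graph and all names are made up) showing how a cycle can still force an arbitrary choice of top:

```python
from collections import deque

def top_categories(start, parents, max_depth):
    """Cycle-safe, depth-bounded upward traversal of a category graph.
    parents maps a category to the list of its parent categories."""
    seen, frontier, tops = {start}, deque([(start, 0)]), set()
    while frontier:
        node, depth = frontier.popleft()
        new = [p for p in parents.get(node, []) if p not in seen]
        if not new or depth >= max_depth:  # no unvisited parents, or bound hit
            tops.add(node)
            continue
        for p in new:                      # the visited set breaks cycles
            seen.add(p)
            frontier.append((p, depth + 1))
    return tops

graph = {"Jazz": ["Music"], "Music": ["Arts"], "Arts": ["Culture"],
         "Culture": ["Arts"]}              # note the Arts <-> Culture cycle
print(top_categories("Jazz", graph, max_depth=5))
```

In the toy cycle above, whichever of the two mutually linked categories is reached first becomes the "top", which is exactly the kind of arbitrariness the question points at.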

Q8. How was the minimum number of categories per dataset determined for selecting entities in the "LOD Datasets" domain?

Q9. In all the experiments, what are the criteria to choose the different values for the p parameter of the RBO metric?
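For context on Q9: in rank-biased overlap, the parameter p (0 < p < 1) geometrically discounts the agreement at each depth, so smaller values of p make the measure more top-weighted. A minimal sketch of the truncated prefix form (an illustrative simplification, not the paper's implementation, which may use an extrapolated variant):

```python
def rbo(s, t, p):
    """Truncated RBO: (1-p) * sum over depths d of p^(d-1) * agreement(d),
    where agreement(d) is the overlap of the two d-prefixes divided by d."""
    depth = min(len(s), len(t))
    score = 0.0
    for d in range(1, depth + 1):
        overlap = len(set(s[:d]) & set(t[:d]))
        score += (p ** (d - 1)) * (overlap / d)
    return (1 - p) * score

same = ["a", "b", "c"]
print(rbo(same, same, p=0.9))                    # ≈ 0.271 (< 1: truncated sum)
print(rbo(same, ["c", "b", "a"], p=0.9))         # lower: top ranks disagree
```

Because p governs how quickly the weight on deeper ranks decays, the chosen values directly determine how much the top of each ranked feature list dominates the similarity score, hence the reviewer's request for a justification.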


[Mo17] Morales, C., Collarana, D., Vidal, M. E., & Auer, S. (2017, June). MateTee: a semantic similarity metric based on translation embeddings for knowledge graphs. In International Conference on Web Engineering (pp. 246-263). Springer, Cham.

[Pe09] Pesquita, C., Pessoa, D., Faria, D., & Couto, F. (2009). CESSM: Collaborative evaluation of semantic similarity measures. JB2009: Challenges in Bioinformatics, 157, 190.

[Tr16] Traverso, I., Vidal, M. E., Kämpgen, B., & Sure-Vetter, Y. (2016, September). GADES: A graph-based semantic similarity measure. In Proceedings of the 12th International Conference on Semantic Systems (pp. 101-104). ACM.

[Zh17] Zhu, G., & Iglesias, C. A. (2017). Computing semantic similarity of concepts in knowledge graphs. IEEE Transactions on Knowledge and Data Engineering, 29(1), 72-85.

Further comments:

- Page 5: Table 2 has no RBO values. What is the definition of A_{S, T, d}?
- Page 11, 14: The authors refer to Section 3.2 but there is no such section.
- Page 20: Formatting issue = “Error! Reference source not found.”
- Several typos throughout the paper.