Network Metrics for Assessing the Quality of Entity Resolution Between Multiple Datasets

Tracking #: 2175-3388

Authors: 
Al Idrissou
Frank van Harmelen
Peter van den Besselaar

Responsible editor: 
Guest Editors EKAW 2018

Submission type: 
Full Paper
Abstract: 
Matching entities between datasets is a crucial step for combining multiple datasets on the semantic web. A rich literature exists on different approaches to this entity resolution problem. However, much less work has been done on how to assess the quality of such entity links once they have been generated. Evaluation methods for link quality are typically limited to either comparison with a ground truth dataset (which is often not available), manual work (which is cumbersome and prone to error), or crowd sourcing (which is not always feasible, especially if expert knowledge is required). Furthermore, the problem of link evaluation is greatly exacerbated for links between more than two datasets, because the number of possible links grows rapidly with the number of datasets. In this paper, we propose a method to estimate the quality of entity links between multiple datasets. We exploit the fact that the links between entities from multiple datasets form a network, and we show how simple metrics on this network can reliably predict their quality. We verify our results in a large experimental study using six datasets from the domain of science, technology and innovation studies, for which we created a gold standard. This gold standard, available online, is an additional contribution of this paper. In addition, we evaluate our metric on a recently published gold standard to confirm our findings.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Francesco Corcoglioniti submitted on 24/May/2019
Suggestion:
Major Revision
Review Comment:

The paper proposes an Estimated Quality metric "eq" for the automated prediction of the quality of an Identity Link Network (ILN) connecting 3 or more entities with identity links generated by some Entity Resolution (ER) system. Four variants of the metric are proposed: "eq" for non-weighted ILNs and "eq_min", "eq_avg", "eq_w" for weighted ILNs. All variants take values on a [0, 1] scale, discretizable to good (all links correct) / bad (some link wrong) / undecided ILN labels. The variants are all defined in terms of a weighted average of three network-based metrics (bridge, diameter, closure metrics), which assess, in different and complementary ways, how much an ILN is similar to a fully connected network. Three empirical evaluations of the metric in its four variants are conducted on the ILNs obtained by: (i) linking research institutions in 6 datasets via simple name matching methods; (ii) combining proximity and name matching methods in 3 of the 6 datasets; and (iii) reusing data from [16]. Evaluations (i) and (ii) compare the good/bad output of "eq" variants against a majority class baseline, based on ground truth human assessment of ILNs' correctness. Evaluation (iii) compares the F1 scores for 6 entity resolution systems computed manually (in [16]) and automatically via the good/bad outputs of "eq".

The paper extends an EKAW 2018 publication [1] by the same authors. By dealing with ER, a relevant Semantic Web topic, and investigating novel techniques for its evaluation, the paper falls within the scope of the Journal and meets the criteria required for a full paper submission. Since this is a full paper submission, this review will focus on the dimensions of originality, significance of results, and quality of writing.

== Originality ==

Up to section 8, the paper is basically the same as [1], with only a few very minor additions: clustering Algorithm 1 (not very useful, actually), confusion matrices of tables 5 and 6 (useful), plot of F1 deviations in figure 7 (useful). The novel contributions mainly reside in sections 9 and 10, where the metric extensions "eq_min", "eq_avg", "eq_w" for weighted ILNs are proposed and then evaluated in the same settings as metric "eq" in [1]. While "eq_min" and "eq_avg" are trivial extensions, "eq_w" is more interesting, although the definitions appear in some cases rather arbitrary, handle weights in a debatable way (see comments later), and the evaluation results are inconclusive, not showing appreciable benefits of using weighted metrics in place of the simpler "eq".

As a result, I believe that the submission in its current state does not appreciably advance the state of the art w.r.t. what was previously done in [1], and I'm not sure the novel contributions here qualify as a sufficient extension for acceptance as a full paper in this Journal. However, I believe these shortcomings can be addressed in a revision of the paper, at least through further analysis of the proposed metrics and/or by providing further details and discussion of aspects previously not covered (see review comments), which would shed further light on the behavior of the metric originally proposed in [1].

== Significance of results ==

In the reported experiments, the proposed metrics (at least eq and eq_w) correlate well with human judgments, demonstrating the potential for applying them to quickly assess the quality of ER identity links. However, these are largely results already shown in [1], and overall I have the following major concerns regarding the design and evaluation of the metrics that negatively affect the significance of the presented work (C1, C2, C5 also apply to [1] and are not addressed here; C3, C4 are specific to the novel contributions of this paper):

C1. Unclear hyper-parameter estimation and generalization. The defined metrics depend on a few hyper-parameters (bridge metric parameter 1.6, thresholds 0.75 and 0.90, all introduced in section 4) that the authors claim to have determined empirically without providing further details (both here and in [1]). These parameters appear to be crucial for the accuracy of the metrics (in particular the thresholds), so I would like the authors to detail their estimation in this paper, also to make clear that they were not estimated on the same datasets used in the evaluation (which would amount to overfitting). Also, testing the impact of the hyper-parameters on metric performance and their generalization to multiple datasets (only two datasets are used in the paper) is something that I would like to see covered in this paper and not left as future work (see section 11.2), as this aspect is strictly tied to the robustness and practical usability of the proposed metrics.
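As a purely illustrative sketch of the analysis I have in mind (not a description of the authors' procedure), assuming per-ILN good/bad gold labels on a held-out development set and a hypothetical function eq_score(iln, bridge_param) computing the metric, the hyper-parameters could be selected by a simple grid search, and their sensitivity could be reported alongside the chosen values:

```python
from itertools import product

def tune_eq_hyperparams(dev_ilns, dev_labels, eq_score):
    """Grid-search the bridge parameter and the good/bad thresholds on a
    held-out development set. dev_labels[i] is True iff all links of
    dev_ilns[i] are correct. All names here are hypothetical."""
    bridge_params   = [1.2, 1.4, 1.6, 1.8, 2.0]
    good_thresholds = [0.80, 0.85, 0.90, 0.95]   # eq >= t_good -> "good"
    bad_thresholds  = [0.65, 0.70, 0.75, 0.80]   # eq <  t_bad  -> "bad"
    best, best_acc = None, -1.0
    for bp, t_good, t_bad in product(bridge_params, good_thresholds, bad_thresholds):
        if t_bad >= t_good:
            continue  # keep a non-empty "undecided" band between the thresholds
        correct, undecided = 0, 0
        for iln, label in zip(dev_ilns, dev_labels):
            score = eq_score(iln, bridge_param=bp)
            if score >= t_good:
                correct += int(label)        # predicted good
            elif score < t_bad:
                correct += int(not label)    # predicted bad
            else:
                undecided += 1
        decided = len(dev_ilns) - undecided
        acc = correct / decided if decided else 0.0
        if acc > best_acc:
            best, best_acc = (bp, t_good, t_bad), acc
    return best, best_acc
```

The point is not this specific search procedure, but that the selection should be done on data disjoint from the evaluation datasets and that the sensitivity of the results to these values should be reported.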

C2. Low metrics performance on "imbalanced" data. The evaluation results show that the metrics perform poorly when applied to ILNs produced by ER systems tuned for precision (e.g., via higher similarity thresholds), whose links are fewer but more accurate, to the point of being outperformed by a majority class approach. See, specifically: (i) the expert evaluation for sizes 3, 4, 5 in table 2; (ii) the negative predictive value of 0.238 (precision of the negative class) in table 6, which suggests that any "bad" label coming from the metric in this setting is most likely wrong; (iii) the geo+names evaluation in table 7; (iv) the increasing ranking errors for larger thresholds in figure 7. I understand these are "imbalanced" settings, but in my opinion this kind of setting is also desired and frequent, as it will occur any time the metric is applied to high-quality identity links, for which the poor accuracy of the metric will limit its practical utility (i.e., the better the links, the less useful the metrics). This "imbalanced" setting is precisely the setting where using weight information may likely help, as it provides the metrics with the indication that the links forming an ILN are more accurate, and this increased accuracy may balance the negative evidence coming from "bad" network metrics due to missing links (which are likely to increase in number in a setting tuned for precision).

C3. Inconclusive evaluation of weighted metrics. Weighted metrics are the novel contribution of this paper, but their evaluation (section 10) fails to show a concrete benefit of using them w.r.t. the eq metric. The authors suggest that the analysis in table 8 "shyly helps breaking the tie between the two metrics" (eq vs weighted eq variants). This analysis is based on computing the average of the 4 differences between F1 scores coming from eq metrics and corresponding F1 scores coming from human judgments, over 4 threshold settings. There are two problems that make the analysis inaccurate: (i) the authors compute the average of **signed** differences, so that a large positive error may cancel out a correspondingly large negative error, whereas it would have been appropriate to consider unsigned differences; and (ii) too few significant figures are considered, so that a difference between 0.00325 and 0.0035 is expanded to the difference between 0.003 and 0.004 after rounding (table 8, eq_w vs eq for MCENTER).
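To make point (i) concrete with made-up numbers: deviations of +0.04 and -0.04 cancel out under the signed scheme, suggesting perfect agreement, whereas the mean absolute deviation correctly reports the actual error magnitude:

```python
diffs = [0.04, -0.04, 0.01, -0.01]                     # made-up signed F1 deviations
signed_mean = sum(diffs) / len(diffs)                  # 0.0   -- errors cancel out
mean_abs    = sum(abs(d) for d in diffs) / len(diffs)  # 0.025 -- actual error magnitude
```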

C4. Reliance on non-normalized, non-comparable weights. The weights used to derive eq_min, eq_avg, eq_w (see section 9) are not normalized, so neither these weights nor the obtained weighted metrics are comparable among different usage scenarios. Concrete example: let's assume we have an edge e1 in the "geo" setting and an edge e2 in the "geo+names" setting that both get the same weight w1 = w2 = 1. In the "geo" setting, that weight w1 = 1 reflects evidence coming only from the comparison of geographical information, whereas in the "geo+names" setting that weight w2 = 1 would be backed by stronger evidence also including perfect name similarity, evidence that is not reflected in the weight (w.r.t. the "geo" case). Unless there is some hyper-parameter tuning (not the case), the metric "eq" has no way to treat "geo+names" weights differently from "geo" weights, and the good/bad labels resulting from its application would likely tend to treat the "geo+names" ILNs as having lower quality than corresponding ILNs in the "geo" setting. This is a major issue that affects what is the novel contribution of the paper, and it might have led to the inconclusive evaluation results of section 10. I strongly suggest the authors address this issue. For instance, they may try to normalize weights so that they assume a precise meaning, e.g., that of "calibrated probabilities" of link correctness. This can be achieved based on some ground truth good/bad link annotations, using, e.g., the Platt method or a similar one (see, e.g., https://scikit-learn.org/stable/modules/calibration.html for concrete solutions).
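A minimal sketch of the calibration I have in mind, assuming a small sample of links with gold correct/incorrect annotations is available (all variable names below are hypothetical). Platt scaling amounts to fitting a logistic regression on the raw matcher scores, so that a "geo" weight and a "geo+names" weight can be compared on the same scale, namely as probabilities of link correctness:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Raw matcher scores for a sample of links from one setting (e.g., "geo"),
# together with gold annotations: 1 = link correct, 0 = link wrong.
raw_weights = np.array([0.55, 0.80, 0.95, 1.00, 0.60, 0.90]).reshape(-1, 1)
gold        = np.array([0,    1,    1,    1,    0,    1])

# Platt scaling: a logistic regression fitted on the raw scores.
platt = LogisticRegression()
platt.fit(raw_weights, gold)

# Calibrated probability that a link with raw weight 1.0 is correct.
p_correct = platt.predict_proba([[1.0]])[0, 1]
```

Fitting one such calibrator per matching configuration would give the weights a uniform meaning across the "geo" and "geo+names" settings before they are fed into the weighted metrics.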

C5. Use of disagreeing expert vs non-expert ground truth data. I don't understand the utility of evaluating the approach with both "low" and "high" quality ground truths, respectively from non-expert and expert annotators (see sections 6.3, 8). I see two explanations for the expert vs non-expert differences here: (i) the annotation task is inherently difficult (e.g., one entity does not correspond exactly to another - e.g., it represents a branch of a bigger organization - so annotators may disagree), a case that deserves further investigation with an assessment of inter-annotator agreement based on precise annotation guidelines; or (ii) reliable human annotation is feasible and the differences are to be ascribed only to errors by the non-expert annotator, which means that the evaluation numbers reported for the non-expert case are of little value, and a merged ground truth dataset (expert annotations + checked/revised non-expert annotations) would better be used.

Of the above major comments, I think all of them except C2 can be (at least partially) addressed in a relatively short time by the authors, and that's the main reason behind my major revision recommendation. Besides, I also report below some minor comments related to specific passages of the paper, which the authors may find useful and/or may consider to improve the paper.

== Quality of writing ==

The paper is overall well written, with the intuitions behind the proposed metrics nicely presented. The authors decided to first introduce and evaluate the base "eq" metric for unweighted graphs (basically, the contribution of the EKAW paper) and later introduce and evaluate its weighted extensions (the main new contribution), and I find the resulting paper structure acceptable. There are some typos, and the definitions of the weighted metrics can be improved, as well as a few figures, but the required changes are limited: I list all of these issues later, for the authors' convenience.

== Minor comments ==

M1. [section 1] I suggest clearly listing in the text the additional contributions w.r.t. prior work [1] by the same authors.

M2. [section 2] I agree with the authors that the aim is not clustering (nor proposing another method for entity resolution) and honestly I don't feel the need for the paragraph "Simple Clustering Algorithm", which reports the very obvious clustering algorithm used by the authors to detect ILNs in a weighted graph. If the authors decide to keep it, then: (i) please check the addition of multiple strengths to the same edge (~ lines 37-38), as I don't see it as necessary and, if it is, then it means metric "eq" can work with multigraphs and that should be emphasized and motivated in the paper; and (ii) please note that the worst-case complexity is not O(m) but likely O(m log(m)), as merging clusters in line "C_b.add(C_s.items())" is not O(1) - that said, I don't think the complexity analysis of Algorithm 1 is of much interest, and, if complexity is considered, then the complexity of evaluating the "eq" metrics themselves should also be discussed.
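For reference, this is the kind of near-linear alternative I have in mind: a disjoint-set (union-find) sketch of ILN detection over weighted identity links. It is only an illustration, not the authors' Algorithm 1, and it ignores the weights during clustering:

```python
def detect_ilns(links):
    """links: iterable of (node_a, node_b, weight) identity links.
    Returns the ILNs (connected components) as sets of nodes,
    in near-linear time in the number of links."""
    parent, size = {}, {}

    def find(x):
        parent.setdefault(x, x)
        size.setdefault(x, 1)
        while parent[x] != x:                  # path compression
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(x, y):
        rx, ry = find(x), find(y)
        if rx == ry:
            return
        if size[rx] < size[ry]:                # union by size
            rx, ry = ry, rx
        parent[ry] = rx
        size[rx] += size[ry]

    for a, b, _w in links:
        union(a, b)

    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())
```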

M3. [section 3] References [14, 15] refer to Coreference Resolution and Entity Linking (EL), two NLP tasks dealing with entity mentions in text. It's OK to mention them, but in that case I would explicitly name the tasks they are addressing. Also, these tasks can be seen as building identity graphs whose nodes are entity mentions and (for EL only) KB entities, so I don't see them as scenarios incompatible with the proposed metrics, although I can understand if those scenarios are out of the scope of this work.

M4. [section 4] Based on how it is defined, the closure metric n_c is strictly < 1, whereas the bridge metric and the diameter metric can reach 1. As a consequence, the eq metric may never reach 0. This is not a problem, of course, but the authors might want to consider slightly revising the definition so as to guarantee that eq can cover the whole [0, 1] range.

M5. [figure 3, caption] I would move the statement "to evaluate eq, all possible links are evaluated" to section 4 to make it more apparent to readers, as this is a requirement for the proper application of the metric (due to how the bridge, diameter and closure metrics are normalized).

M6. [section 6.1] While the authors assume here that datasets may contain duplicates, please note that in multi-dataset ER the opposite is often assumed (and leveraged) (see, e.g., [A]). I don't see problems in applying "eq" in those settings, however, as the knowledge that a dataset is duplicate-free or, more generally, that two entities are distinct can be used before applying eq to immediately mark as "bad" an ILN containing a link between those necessarily distinct entities.

M7. [section 6.2, Table 2] How are the "Positive" and "Negative" ground truth samples defined? I infer "positive = all ILN links are correct" and "negative = some ILN link is wrong", but I suggest that the paper specify this explicitly.

M8. [section 6.2] Why were ILNs of size < 5 not considered in the non-expert evaluation? I understand there are a lot of them and considering all these ILNs is infeasible, but if I had to sample ILNs, I would try to get a representative sample for each size to better investigate dependencies on size and avoid possible biases. Besides, in the expert evaluation (section 3), all sizes >= 3 were used.

M9. [section 7.3] I would find it interesting to see the distribution of ILNs by size also in this case, similarly to what is reported in Figure 4. In particular, I'm curious whether the non-considered ILNs are mainly of size 2 (and thus out of scope for the proposed metric) or whether there is a relevant number of ILNs of size > 3, a sample of which could have been assessed to avoid possible evaluation biases (see comment C7).

M10. [section 8, figures 6, 7] I like the outcomes of this evaluation, as it shows that metric "eq" can be used to get an approximate indication of the performance (F1) of an ER system - at least when the system is not tuned for maximum precision (i.e., a high threshold). What I find confusing is the talk of "ranking test", "ranking algorithms" and "ranking error". To me, "ranking" here would mean establishing an order of algorithms, from best to worst performing (for a certain threshold). Based on that, a "ranking error" occurs if the ranking induced by applying metric "eq" is different from the ranking computed using human annotations, and to quantify that error I would use for instance some rank correlation measure, e.g., Kendall's Tau. Instead, what seems to be evaluated as "ranking deviation" in Figure 7 is the difference in F1 scores computed via "eq" and via human annotations. Small F1 differences are good, but they don't imply that a similar algorithm ranking would be obtained by using "eq" (as claimed in section 11.1). Also, the text mentions the "potential to rank clustering algorithms whenever they show \emph{significant performance differences}". Should I take "significant" as "statistically significant"? In that case, I don't see how the claim is supported by Figures 6 and 7. If that is the intended meaning, perhaps the authors may check for statistically significant differences in F1 (e.g., using the approximate randomization test [B]) when using both human assessment and "eq", and check whether the same differences "algorithm X significantly different from algorithm Y" are detected.
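As an illustration of the rank-correlation evaluation I am suggesting (the system names and F1 values below are made up, not taken from the paper), Kendall's Tau directly measures whether the ordering of ER systems induced by "eq" matches the ordering induced by human annotations:

```python
from scipy.stats import kendalltau

systems  = ["sys1", "sys2", "sys3", "sys4", "sys5", "sys6"]  # placeholder names
f1_human = [0.91, 0.87, 0.84, 0.80, 0.78, 0.75]              # F1 from human assessment
f1_eq    = [0.90, 0.85, 0.86, 0.79, 0.77, 0.74]              # F1 estimated via "eq"

tau, p_value = kendalltau(f1_human, f1_eq)
# tau = 1.0 would mean the two rankings of the six systems are identical;
# here the single swap between sys2 and sys3 lowers tau below 1.
```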

M11. [section 10] I suggest providing some quantitative measures of the differences in performance of different weighted metrics.

M12. [section 11.2] I don't see how to apply the intuitions behind the eq metrics to networks of size 2. These networks consist of exactly one edge, for which there is little to compute in terms of network metrics.

== List of typos and other presentation issues ==

T1. [section 1] "the proposed metrics indeed reliably estimates" -> either "metric" or "estimate"
T2. [section 1] "our contributions is a method" -> "contribution"
T3. [section 2] "Fig. ??" -> "Fig. 1".
T4. [section 2] "they belongs to different clusters" -> "belong"
T5. [section 4] "For example, n_c and n'_c treat a Tree, Star..." -> "and n_b"
T6. [section 5] "OpenAire: 2018.08.16" -> I suspect the date is wrong, as it is in the future w.r.t. Jan 2018
T7. [section 6] check special characters (tm, (c)) in footnote 13.
T8. [section 6.2] "Negative Predicted Value (NPC)" -> "NPV"
T9. [section 6.3] "to our results. \footnote{...}" -> drop space between "." and "\footnote{...}"
T10. [section 7.2] "c1 = {{a_1}, {b_3}}..." -> why nested sets? This suggests that something more than a subset of nodes within a graph is needed to identify an ILN, and I think the distinction between datasets is apparent also when using a regular "flat" set.
T11. [figure 5] Is there a meaning to the line dashing used for different edges?
T12. [table 5] "IDLINEs" -> "ILNs"
T13. [section 8] "between the baseline and the four eq metrics" -> up to this point in the paper, there is only one eq metric
T14. [section 8] "and display it in Figure 7" -> "displayed"
T15. [section 8] "Figure 7 shows a deviation of +-0.97" -> looking at table 8, I think the correct number here is 0.096
T16. [section 9] "e_i = (v_{i-1}, v_i} \in L where v_i \in V for in \in [1,k]" -> what is k? why not just say that the two vertices are \in V? what is v_{i-1}? (I later understand it is an arbitrary vertex in a path, but here it is unclear)
T17. [section 9] definition 2 of "dist(a,b)" uses a strange notation; I would write "dist(a,b) = min_{\pi \in \Pi(a,b)} |\pi|", similarly to how "dist_w(a,b)" has been defined
T18. [section 9] I would also revise the notation in definitions 3, 4 of "diam(G)" and "diam_w(G)", e.g., "diam(G) = max_{a,b \in V} dist(a,b)"
T19. [section 9.2] in definition 6 of "eq_avg", replace "we" with "w"
T20. [section 9.3] "\frac{2.2}{2} = 1" -> I get the point that the branch for "eDiam(G) > n - 2" is applied, but written like this it looks weird
T21. [figure 8] the use of overlapping boxes makes the figure very difficult to read
T22. [section 11.1] "it estimates the quality of links" -> "of ILNs"
T23. [table 8] what's the meaning of bold vs. underlined average values in the table?

== References ==

[A] M. Nentwig, E. Rahm. Incremental Clustering on Linked Data. ICDM Workshops. 2018.
[B] E. W. Noreen. Computer-Intensive Methods for Testing Hypotheses: An Introduction. John Wiley & Sons. 1989.

Review #2
Anonymous submitted on 30/May/2019
Suggestion:
Minor Revision
Review Comment:

In this article, the authors propose a method for evaluating the quality of entity matching between multiple datasets. The method requires neither manual annotation nor ground truth datasets. In particular, the authors describe and evaluate several metrics that, to some extent, provide a useful indication of the dataset quality. These metrics are derived from the consistent patterns usually observed in high-quality datasets.

This article is a revised and extended version of the paper published at EKAW 2018 (doi:10.1007/978-3-030-03667-6_10). Sections 9 and 10 present new content, namely a substantial refinement and evaluation of the proposed metric for the case of weighted links.

# Strengths

1. The proposed metrics allow evaluation of datasets for which ground truth is not available.

2. Experiments show that the high values of the aggregated metric, e_Q, correlate with the identity link network (ILN) quality.

# Weaknesses

1. As far as I understand, the metrics are computed for each ILN. Thus, the use of these metrics requires additional aggregation to obtain a single-number metric for the whole dataset. It is not clear, however, how to perform such aggregation (one possible option is sketched after this list).

2. There is no running time analysis of the proposed method, which is important for processing large-scale datasets. Although on page 6 the authors mention the computation time ('1:40 minutes': is that 100 seconds or 100 minutes?), I recommend that the authors discuss this aspect more thoroughly.

3. Essential details of the experimental setup description are missing.
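Regarding weakness 1, purely as an illustration of the kind of aggregation that would need to be specified (none of this is defined in the paper), a dataset-level score could for instance be a link-weighted average of the per-ILN metric:

```python
def dataset_quality(ilns, eq_score):
    """ilns: list of ILN objects, each with a .links collection;
    eq_score(iln) returns the per-ILN metric in [0, 1].
    Both names are hypothetical. ILNs are weighted by their number
    of links so that large networks are not under-represented."""
    total_links = sum(len(iln.links) for iln in ilns)
    if total_links == 0:
        return 0.0
    return sum(eq_score(iln) * len(iln.links) for iln in ilns) / total_links
```

Whether such a weighting (or a plain average, or the fraction of "good" ILNs) is appropriate is exactly what should be discussed in the paper.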

# Comments

Page 2, line 38. Please indicate whether Algorithm 1 is used to produce ILNs.

Page 4, line 4. How is the value of 1.6 estimated? Did the authors use a held-out development dataset, or did they tune this hyper-parameter on the evaluation dataset?

Page 6, line 15. The authors do not describe how the evaluation scores are calculated and how the confusion matrices are built in the reported experiments. Without this, it is not possible to understand the meaning of the presented numbers. Please carefully explain the quantitative evaluation details.

Page 6, line 44. How is the similarity threshold of 0.8 chosen?

Page 7, line 22. How is the Majority Class Classifier applied?

Page 7, line 31. The authors add simulated noise to the input data. How often is the chosen kind of noise observed in the literature?

Page 8, lines 13, 18, and 21. How are the parameters chosen?

Page 10, line 23. Please provide an example of the ranking task in Section 8.

Page 10, lines 16 and 31. The claim of significance requires performing a statistical test. Did the authors conduct a statistical test to assess the significance? (This comment also applies to other results reported in the article.)
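For instance, a paired approximate randomization test could be applied to the per-ILN decisions of two configurations against the gold labels; a sketch under that assumption follows (all names hypothetical, the same scheme extends to F1):

```python
import random

def randomization_test(gold, pred_a, pred_b, trials=10_000, seed=0):
    """Paired approximate randomization test for the difference in accuracy
    between two systems evaluated on the same items. Returns an approximate
    p-value for the null hypothesis that the two systems perform equally well."""
    rng = random.Random(seed)

    def accuracy(pred):
        return sum(p == g for p, g in zip(pred, gold)) / len(gold)

    observed = abs(accuracy(pred_a) - accuracy(pred_b))
    at_least_as_extreme = 0
    for _ in range(trials):
        a, b = [], []
        for pa, pb in zip(pred_a, pred_b):
            if rng.random() < 0.5:       # randomly swap the two systems' outputs
                pa, pb = pb, pa
            a.append(pa)
            b.append(pb)
        if abs(accuracy(a) - accuracy(b)) >= observed:
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (trials + 1)
```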

Page 11, line 28. Please move graph (and other) definitions to a dedicated section after Introduction to provide the consistent notation for the whole article.

Page 11, lines 10 and 23. What do the labels in the legend mean? Please provide the references to CLIP/.../STAR2 or define them before they are used.

Page 12, lines 6 and 26. Please introduce the terms before using them, e.g., n_b_w(G) should be defined before n'_b_w(G).

Page 12. Please provide illustrations similar to Fig. 2 to give the reader an intuition of how the new metrics are designed. Verbal descriptions such as the one in lines 32-44 are not enough.

Page 13, line 34. Please provide proper evidence whether the difference is significant or not.

Page 14, line 25. Please provide a reference confirming the information gain increase.

Page 14, line 44. I think it is crucial to provide the reader with advice on the metric choice. Otherwise, the experiments performed in Section 10 seem unfinished and insufficient.

Although the authors have published the evaluation datasets at https://github.com/alkoudouss/Identity-Link-Network-Metric, I believe that making the implementation available would also be useful for the community.

# Typography Issues

Page 1, author block. There is a possible typo in the author name: 'Frankvan Harmelen' should be 'Frank van Harmelen'.

Page 2, line 46. Fig. ??

Page 6, footnote 12. The link is already mentioned in footnote 2.

Page 6, footnote 13. Encoding issue.

Page 7, line 17. Please expand the MCC abbreviation as 'Majority Class Classifier'.

Page 8, line 10. Please put Fig. 5 at the top of the page.

Page 10, line 21. 'deviation of ±0.96' should be 'standard deviation of ±0.96'.

Page 12, line 31. Should not 'we(e_i)' be 'w(e)'?

Tables. I think that the notation \frac{MajorityClassClassifier}{NetworkMetrics} is highly confusing because it looks like a fraction, but this is not a fraction. Since there is enough horizontal space, the authors should try separating these two numbers by '/' or '|' to simplify reading.

Generally, I recommend that the authors proofread the manuscript, especially focusing on the new content: 'seams' -> 'seems', 'each-other' -> 'each other', etc.

Review #3
Anonymous submitted on 10/Jun/2019
Suggestion:
Minor Revision
Review Comment:

The paper presents a method for estimating the quality of generated entity links between multiple datasets.
The proposed strategy exploits the fact that the entity links generated from multiple datasets form a network, which can be analyzed through metrics that reliably predict their quality.
Results have been verified on several datasets.
This work is an extension of a previous work published at EKAW 2018.
Overall the paper is very interesting and of high quality.
Actually the Reviewer does not have any concern about it.

The presentation of the problem is well done, and it is easy to follow and understand even for people who are not familiar with the topic.
Also the challenges targeted by the authors are clear and well placed with respect to the literature.
The contribution is definitely novel since, to the best of the Reviewer's knowledge, this is the first work combining theories from ontology matching and network analysis.
The Semantic Web community will definitely benefit from this work since it can foster further research in the field in order to apply it to bigger knowledge bases.
Indeed, the availability of all the data used for validating the proposed strategy is a very valuable resource for the community and also allows the reproducibility of the methods described by the authors.
Finally, the extension with respect to the EKAW paper is appropriate and also the comments provided by the EKAW Reviewers have been properly addressed.

In order to complete the work, the Reviewer invites the authors to include a further subsection discussing the scalability of the algorithm for generating the network as a function of the size of the datasets.
This information would be of interest for understanding the suitability of the system for real-time usage.

Minor issues:
- page 2, col. 1, line 46: the reference to the figure is missing