Person Record Linking for Digital Libraries using Authority Data

Tracking #: 753-1963

Authors: 
Cornelia Hedeler
Bijan Parsia
Brigitte Mathiak

Responsible editor: 
Guest Editors EKAW 2014 (Schlobach, Janowicz)

Submission type: 
Conference Style
Abstract: 
The explicit purpose of Linked Open Data is to link diverse data, or, put differently, to use the web to lower the barriers to linking data that is currently linked by other means. Yet there exist many objects in the Linked Data cloud that refer to the same real-world entity but are not yet explicitly linked. One special case of this is persons, and in particular authors, who may appear in a variety of contexts. While they often carry many identifiers, the most prominent attempts to link them use auxiliary information such as co-authors, affiliations, research interests and so on. In this paper, we investigate the possibility of identifying the same person in different, previously unconnected digital library and person-centred authority data sets. We use digital library data sets from different domains together with authority data sets, test the suitability of auxiliary information for person record linkage, and evaluate how difficult it is to re-find the same person.
Tags: Reviewed

Decision/Status: 
[EKAW] reject

Solicited Reviews:
Review #1
Anonymous submitted on 25/Aug/2014
Suggestion:
[EKAW] conference only accept
Review Comment:

Overall evaluation: 0 (borderline paper)
Reviewer's confidence: 3 (medium)
Interest to the Knowledge Engineering and Knowledge Management Community: 4 (good)
Novelty: 3 (fair)
Technical quality: 3 (fair)
Evaluation: 3 (fair)
Clarity and presentation: 4 (good)

Review

Person Record Linking for Digital Libraries using Authority Data

The authors propose a method to identify the same person in different libraries using one person-centred authority library.
They illustrate the approach on GND, Sowiport, DBpedia (en/de).

I consider this a real problem, and one that is also relevant for EKAW.

This is very interesting work, but at some points rather premature.

From the paper it is not clear to me whether the method is still useful if you do not have a person-centred authority library but just a number of different libraries.
The assumption of the method is that one data source is more important (the authority) than the others; what if there is none?
Another question is how much of your method can be used in a domain other than linking persons. How generalizable are the results?

Good intro; one small detail: if the information is partially overlapping, then you should also take inconsistency into account.
Add a final paragraph with the structure of the paper to the first section.

There is a large body of literature on record linkage, including several overview papers, which would be useful to refer to, but the paper lacks a description of how it differs from those existing methods. It would be very useful to know how your method differs from other record linkage methods.

The approach (Section 4):
Make more explicit what is original in your approach. I think it is the fact that you use a person-centred authority? The indexing step seems not very surprising to me.
Concerning the record pair comparison, I was wondering how dependent this step is on the slots (name, keywords, affiliation, etc.).
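
For context, the "indexing step" referred to here is the standard blocking idea from the record linkage literature. A minimal sketch, assuming records are dicts with a 'name' field (an illustrative simplification, not the authors' implementation):

```python
from collections import defaultdict

def block_by_name_key(records):
    """Group person records by a coarse key (first initial + surname)
    so that only records sharing a key are compared pairwise later."""
    blocks = defaultdict(list)
    for rec in records:
        parts = rec["name"].split()            # assumes 'First ... Last'
        key = (parts[0][0].lower(), parts[-1].lower())
        blocks[key].append(rec)
    return blocks

# Candidate pairs are formed only within a block, cutting the quadratic
# all-pairs comparison down to small per-key groups.
```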

The value of the paper is in applying the person linking method to GND, Sowiport, and DBpedia.
For the contribution of the method, it should be made clearer how it differs from existing methods and how generalizable it is (or what its main assumptions are). Furthermore, a comparison against a baseline would be interesting for evaluating the results: how does your method behave with respect to other person linking methods?

Review #2
Anonymous submitted on 25/Aug/2014
Suggestion:
[EKAW] reject
Review Comment:

Overall evaluation: -2 (reject)
Reviewer's confidence: 4 (high)
Interest to the Knowledge Engineering and Knowledge Management Community: 3 (fair)
Novelty: 2 (poor)
Technical quality: 2 (poor)
Evaluation: 3 (fair)
Clarity and presentation: 2 (poor)

Review
This paper describes an effort to link person records for digital libraries. The authors link person records from two different datasets (DBLP and the social science publication dataset Sowiport) to two different 'authority data sets': the GND, which is a (German) subset of VIAF, and DBpedia.

The authors' hypothesis seems to be that this linking improves when more structured information is available for the linking process. To test this, they describe a person matching approach which 1) finds string matches, 2) compares records, including related keywords, co-authors, etc., and 3) performs some domain-specific filtering.
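
To make the summarized pipeline concrete, here is a minimal sketch of steps 2 and 3 (the field names, the equal weighting, and the 0.5 threshold are all hypothetical, not taken from the paper):

```python
def jaccard(a, b):
    """Set overlap of two attribute lists; 0.0 if either side is empty."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a and b else 0.0

def pick_match(query, candidates, threshold=0.5):
    """Rank string-matched candidates (step 1 output) by slot overlap
    (step 2), after a domain-specific filter (step 3)."""
    best, best_score = None, threshold
    for cand in candidates:
        if cand.get("domain") not in (None, query.get("domain")):
            continue                           # step 3: drop other domains
        score = (jaccard(query.get("keywords", []), cand.get("keywords", []))
                 + jaccard(query.get("coauthors", []), cand.get("coauthors", []))) / 2
        if score > best_score:
            best, best_score = cand, score
    return best
```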

Moreover, the authors investigate the amount of (overlapping) structured information (used in step 2) for the various sources. This results in a number of interesting tables reporting on the type of information available in those sources.

My main concern with this paper is that it is unclear what the contribution is. As a description of an approach or algorithm to link persons, it lacks much-needed detail on the specifics of the algorithm. From Section 3, I gather that there is not much more to it than standard record linkage techniques, which include string matching and record comparison. It is unclear what the extension beyond the state of the art is here.

On the other hand, it seems that the contribution could be a description of the amount of structured metadata in the various data sources, which could help matching algorithms. Here the authors find that there is 'currently very limited information beyond the author name'. But at the same time, they conclude that this is actually not that crucial. As the authors state: [This seems to] "suggest that the lack of information does not have a too negative effect on the performance of the person record linkage". I don't understand, then, what the contribution of the paper is.

In Section 5, the authors want to investigate "how much the name of a person and how much of the additional information (if available) on GND and DBpedia contributes to the correct matching of authors to their corresponding person records". The methodologically correct way of doing this would be to test two versions of the algorithm, one with and one without the structured information, and measure the effect on the evaluation. The way the authors do it now does not give a clear evaluation of the effects.
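
The two-version test asked for here could be as simple as the following sketch (`linker` stands in for any implementation exposing a hypothetical `use_structured_info` switch; nothing here is from the paper):

```python
def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def ablation(linker, test_pairs):
    """Run the identical pipeline twice, toggling only the structured-info
    flag, so any F1 difference is attributable to the extra metadata."""
    scores = {}
    for flag in (False, True):
        tp = fp = fn = 0
        for query, gold in test_pairs:
            pred = linker(query, use_structured_info=flag)
            if pred is None:
                fn += 1
            elif pred == gold:
                tp += 1
            else:
                fp += 1
                fn += 1
        scores[flag] = f1(tp, fp, fn)
    return scores   # {False: name-only F1, True: name+metadata F1}
```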

Also, how generalizable are the algorithm and the findings as a whole? Do the observed effects hold for scientific authors, for authors in general, or for all types of persons?

Some other issues
- In many cases, overly long sentences are used, which can make it hard to understand their intended meaning. For example, in the 2nd paragraph of Section 6, the first two sentences cover 10 lines.

- p2:"Not all links are of equal value..." -> This paragraph is confusing. I would suggest a rewriting that clarifies a) how the authors came to this conclusion (references or original research) and b) what they actually do with this conclusion. Did it influence the algorithm? the evaluation?

- In table 2, what is the difference between a "0" value and "NA"?

- p6: The algorithm description is not very detailed. For the preprocessing: what is the success rate of the conversion of name ordering? What about names in other languages (Chinese, ...)? See the sketch after this list.
- Sec 5.3: why is one test set manually created and the other random? Why are they of different sizes, and how do these variations influence the evaluation?

- Table 2 comes before Table 1 (very minor issue)
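
On the name-ordering point above: the conversion is a heuristic whose failure mode is easy to see in a sketch (illustrative code, not the authors'):

```python
def normalize_name(raw):
    """Convert 'Lastname, Firstname' to 'Firstname Lastname'.
    This silently assumes Western name order -- exactly the concern
    raised above for e.g. Chinese names, where the family name comes
    first and no comma marks the split."""
    if "," in raw:
        last, first = (p.strip() for p in raw.split(",", 1))
        return f"{first} {last}"
    return raw.strip()

assert normalize_name("Smith, John") == "John Smith"
# 'Li Wei' passes through unchanged -- but which token is the surname?
```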

Review #3
Anonymous submitted on 26/Aug/2014
Suggestion:
[EKAW] reject
Review Comment:

Overall evaluation: -2 (reject)
Reviewer's confidence: 4 (high)
Interest to the Knowledge Engineering and Knowledge Management Community: 3 (fair)
Novelty: 2 (poor)
Technical quality: 2 (poor)
Evaluation: 2 (poor)
Clarity and presentation: 3 (fair)

Review

This paper presents a study of link discovery problems for author data. The authors use predefined specifications to link persons across different knowledge bases and report on the quality of the linking. Thereafter, they identify the problems that occur when trying to link author data. While the aims of the study are clear, the implementation is rather poor.

1- Definition of the approach for linkage
The authors seem to pick a predefined approach to determine matching authors and apply no fitting of any kind (at least, I was not able to detect any in the paper). For example, by using Lucene, they rely on the Levenshtein distance to compare author names. Yet it is well known that Levenshtein is actually a poor measure for record linkage (see (Cheatham and Hitzler, 2013) and even (Cohen et al., 2003)). Moreover, they do not report exactly which measures they use to compare the other attributes of authors. The linkage rules used should have been made explicit to enable the reader to understand exactly how the scores come about. The F-measures reported by the authors seem to be merely the scores achieved by a particular linkage rule and are thus not representative of the scores that could have been achieved if some machine learning (even unsupervised, see (Nikolov et al., 2012; Ngonga Ngomo and Lyko, 2013)) had been used.
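
To illustrate the weakness pointed out here: plain edit distance punishes a mere reordering of name tokens heavily, while a token-based comparison does not (self-contained illustration, not tied to the paper's implementation):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

a, b = "hedeler, cornelia", "cornelia hedeler"
print(levenshtein(a, b))   # large, although both refer to the same person
print(set(a.replace(",", " ").split()) == set(b.split()))  # True
```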

2- Evaluation
The results of the evaluation cannot really be generalized due to the reasons mentioned above. Thus, I am rather not inclined to assume that the conclusions of the authors are pertinent.

3- Scientific contribution
I really do miss a scientific contribution here. The authors take existing data and apply a linkage rule to it. To be honest, this would make a nice workshop contribution, but I do not think it is sufficient for a main conference or a journal.

Some minor comments:

same real world => same real-world
One special case ... and so on => split into two sentences
Person names are often unsuitable as identifiers => quantification?
real price => do you mean prize?
person centred => person-centred
Levenshtein similarity is a poor metric when used alone => http://secondstring.sourceforge.net/doc/iiweb03.pdf
How were the thresholds defined?
two data set => two datasets
run the => ran the
97% on 27 resources means a huge possible deviation (see the quick calculation below)
a person sufficiently unambiguous => a person sufficiently unambiguously
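
On the sample-size remark above ("97% on 27 resources"): a back-of-the-envelope confidence interval shows how wide the uncertainty is. This uses the normal approximation, which is crude this close to the boundary, but it makes the point:

```python
import math

p, n = 0.97, 27                      # reported accuracy, sample size
se = math.sqrt(p * (1 - p) / n)      # standard error, roughly 0.033
lo, hi = p - 1.96 * se, min(1.0, p + 1.96 * se)
print(f"95% CI: {lo:.2f} .. {hi:.2f}")   # roughly 0.91 .. 1.00
```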