Editorial Board

Editors-in-Chief
Krzysztof Janowicz

Managing Editors
Cogan Shimizu
Eva Blomqvist

Editorial Board
Mehwish Alam
Claudia d’Amato
Stefano Borgo
Boyan Brodaric
Philipp Cimiano
Oscar Corcho
Bernardo Cuenca-Grau
Elena Demidova
Jerome Euzenat
Mark Gahegan
Aldo Gangemi
Anna Lisa Gentile
Rafael Goncalves
Dagmar Gromann
Armin Haller
Aidan Hogan
Katja Hose
Eero Hyvönen
Sabrina Kirrane
Agnieszka Lawrynowicz
Freddy Lecue
Maria Maleshkova
Raghava Mutharaju
Axel Polleres
Guilin Qi
Marta Sabou
Harald Sack
Christoph Schlieder
Stefan Schlobach
Oshani Seneviratne
Cogan Shimizu
Ruben Verborgh
GQ Zhang

Former Editors-in-Chief
Pascal Hitzler

Editorial Assistants
Sanaz Saki Norouzi

Syndicate

Learning Expressive Linkage Rules from Sparse Data

Submitted by Christian Bizer on 01/02/2019 - 11:53

Tracking #: 2107-3320

Authors:

Petar Petrovski

Christian Bizer

Responsible editor:

Jérôme Euzenat

Submission type:

Full Paper

Abstract:

A central problem in the context of the Web of Data, as well as in data integration in general is to identify entities in different data sources that describe the same real-world object. There exists a large body of research on entity resolution. Interestingly, most of the existing research focuses on entity resolution on dense data, meaning data that does not contain too many missing values. This paper sets a different focus and explores learning expressive linkage rules from as well as applying these rules to sparse web data, i.e. data exhibiting a large amount of missing values. Such data is a common challenge in various application domains including e-commerce, online hotel booking, or online recruiting. We propose and compare three entity resolution methods that employ genetic programming to learn expressive linkage rules from sparse data. First, we introduce the GenLinkGL algorithm which learns groups of matching rules and applies specific rules out of these groups depending on which values are missing from a pair of records. Next, we propose GenLinkSA, which employs selective aggregation operators within rules. These operators exclude misleading similarity scores (which result from missing values) from the aggregations, but on the other hand also penalize the uncertainty that results from missing values. Finally, we introduce GenLinkComb, a method which combines the central ideas of the previous two into one integrated method. We evaluate all methods using six benchmark datasets: three of them are e-commerce product datasets, the other datasets describe restaurants, movies, and drugs. We show improvements of up to 16% F-measure compared to handwritten rules, on average 12% F-measure improvement compared to the original GenLink algorithm, 15% compared to EAGLE, 8% compared to FEBRL, and 5% compared to CoSum-P.

Full PDF Version:

swj2107.pdf

Previous Version:

Learning Expressive Linkage Rules from Sparse Data

Tags:

Reviewed

Decision/Status:

Solicited Reviews:

Click to Expand/Collapse

Review #1

Anonymous submitted on 21/Feb/2019

Suggestion:
Accept

Review Comment:

The paper has come a long way and I thank the authors for addressing my reviews. I hence suggest accepting the paper under the condition that the authors address the following major comments:

Punctuation: The punctuation of the paper is not very sharp. Please check it again thoroughly. Examples include
p6, col2:
- 'In such a case,' or 'In such cases,'
- 'Where,' -> 'Where'
- 'constant,' -> 'constant'
- 'n -- th' -> $n^{th}$
- 'At the first iteration' -> 'At the first iteration,'
p7, col2:
- [0,.., 1] -> [0,\ 1]
Also consider the punctuation after your equations and do keep it consistent.

Wrong reference: I do have to insist w.r.t. [34]. I just checked the paper again and the version of LIMES (you say yourself that you refer to the algorithm in the last version of your answers) described in that paper is what you compare against. The 2012 reference given in my last comment is definitely the more appropriate. You can also check the code of the framework to convince yourselves thereof. I'm afraid citing a paper because it is well known does not hold as a valid reason for choosing references.

Minor comments
Equations
(4) does not fit well in the column
(6) and (7) would be better aligned on the =
(6) and (7) feature a ',' but not a '.' (but these equations are not part of a sentence)
(8) and (9) should be aligned on the arrows and the 'with' part put somewhere else (use \noindent after the equation or \intertext{} within it).
(9) may be using ; instead of the middle | like in Equation (1) would improve readability.

Log in or register to post comments
7324 reads

Main menu

Editorial Board

Syndicate

Learning Expressive Linkage Rules from Sparse Data

Tracking #: 2107-3320

Reviewed Articles

Authors & Reviewers

Links

Recent blog posts

Accepted Articles

Search form

Main menu

Login

Editorial Board

Syndicate

Learning Expressive Linkage Rules from Sparse Data

Tracking #: 2107-3320

Reviewed Articles

Authors & Reviewers

Links

Recent blog posts

Accepted Articles