# KnowMore - Knowledge Base Augmentation with Structured Web Markup

### Tracking #: 1552-2764

Authors:
Ran Yu
Besnik Fetahu
Oliver Lehmberg
Dominique Ritze
Stefan Dietze

Responsible editor:
Guest Editors ML4KBG 2016

Submission type:
Full Paper
Abstract:
Knowledge bases are in widespread use for aiding tasks such as information extraction and information retrieval, where Web search is a prominent example. However, knowledge bases are inherently incomplete, particularly with respect to tail entities and properties. On the other hand, embedded entity markup based on Microdata, RDFa, and Microformats has become prevalent on the Web and constitutes an unprecedented source of data with significant potential to aid the task of knowledge base augmentation (KBA). However, RDF statements extracted from markup are fundamentally different from traditional knowledge graphs: entity descriptions are flat, facts are highly redundant and of varied quality, and explicit links are missing despite a vast amount of coreferences. Therefore, data fusion is required in order to facilitate the use of markup data for KBA. We present a novel data fusion approach which addresses these issues through a combination of entity matching and fusion techniques geared towards the specific challenges associated with Web markup. To ensure precise and diverse results, we follow a supervised learning approach based on a novel set of features considering aspects such as quality and relevance of entities, facts and their sources. We perform a thorough evaluation on a subset of the Web Data Commons dataset and show significant potential for augmenting existing KBs. A comparison with existing data fusion baselines demonstrates superior performance of our approach when applied to Web markup data.
Tags:
Reviewed

Decision/Status:
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 21/Mar/2017
Suggestion: Minor Revision

Review Comment: This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions, which include (1) originality, (2) significance of the results, and (3) quality of writing.

This paper addresses an important and timely problem in Linked Data: how to enrich the incomplete description of entities within a dataset using external sources. It is a significant extension of a preliminary work of the authors. The paper is globally well written and easy to read. The part on the evaluation is clearly explained and discussed. The main contribution is a pipeline to enrich an RDF dataset based on RDF statements extracted from Web markup, which integrates clustering and supervised classification techniques based on crowdsourcing.

My main concern is the lack of formal definitions of several important notions that are only introduced by sentences and illustrated by examples. In particular, "(long) tail entities (types, properties)" is mentioned several times without being defined precisely, and without explanation of the impact of this feature on the problem addressed in this paper. Also, the notion of co-references is central but not formally defined, as is what an entity is in the setting of this paper. The notions of "real-world entity", "entity identifier", and "entity description" should be more clearly defined and distinguished.

In Section 3.2, there is a mismatch of notations between $e_q$, denoting in Definition 1 "a description of a (real-world or identified?) entity", and $e_s$, mentioned in the preceding paragraph, denoting "the entity description of e corresponding to subject s". Concerning the last point (*the* entity description of e corresponding to subject s): how are multivalued properties handled in the proposed approach? In Definition 1, the set F' is not defined: it would be better to say "we aim at selecting a subset F' of M .... Each fact $f_i$ in F' represents a valid ...". Besides, it is not clear what a fact is: an RDF triple? In that case, it is not consistent with the statement "property-value pair describing the entity q" in the definition (why is the subject not mentioned? Is it q, or more precisely the identifier of the real-world entity q?). To be called a definition, Definition 1 should formally define the notion of "correct" fact: how can it be checked that a fact "is consistent with the real world"? In the entity matching paragraph, the set E is not defined either; what is the definition of "co-referring entity descriptions"? Pseudo-keys are mentioned but not explained or defined.
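To make the discussed notions concrete, the following is a minimal sketch (with hypothetical names and example data, not the paper's actual notation or code) of entity descriptions as flat sets of property-value facts, where co-referring descriptions carry different subject URIs for the same real-world entity:

```python
from dataclasses import dataclass

@dataclass
class EntityDescription:
    subject: str   # subject URI s of the markup node
    facts: set     # flat set of (property, value) pairs describing s

# Two co-referring descriptions of the same real-world book,
# extracted from markup on two different sites:
d1 = EntityDescription("http://a.example/book1",
                       {("name", "The Hobbit"), ("isbn", "978-0547928227")})
d2 = EntityDescription("http://b.example/item42",
                       {("name", "The Hobbit"), ("author", "J. R. R. Tolkien")})

# Data fusion then selects, out of all facts M of the matched descriptions,
# a subset F' of correct, non-redundant facts for the query entity.
# The naive union below merely illustrates the candidate pool M:
candidate_facts = d1.facts | d2.facts
```

Under this reading, a "fact" is a property-value pair whose implicit subject is the query entity, which is exactly the ambiguity (triple vs. pair) the review asks the authors to resolve.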
Review #2
By Aleksander Smywinski-Pohl submitted on 28/Mar/2017
Suggestion: Major Revision

Review Comment: The paper presents a novel method designed for augmenting KBs with data extracted from Web markup. It concentrates on schema.org markup, and the evaluation is performed on two entity types, books and movies, which are broadly available on the Web. The performance of the presented method (KnowMore) is compared with two other methods, PrecRecCorr and CBFS. The results presented in the paper indicate that the implemented method yields better results than these state-of-the-art data augmentation methods. Thus the system invented by the authors seems to be an important contribution to research in the field of KB augmentation in particular and the Semantic Web in general, and both the originality and the significance of the work should be judged highly. Yet the positive impression of the paper is undermined by several issues, connected primarily with the scientific validity of the results and the preciseness of the description.

The primary missing element of the paper is the description of the measures used for computing the similarity of the matched entities (cf. page 7, first paragraph). The authors only mention that they used cosine similarity for text, but that is only a vague remark. The reader is referred to the source code of the system, but such a reference is not a good idea. First, the published code will probably undergo development, so the measures might be redefined, making the results impossible to reproduce. Second, due to the nature of software, it might be hard to find the particular piece of source code responsible for implementing the measure referenced in the text. Thus all the methods used for computing the similarity of the features should be formally described in the paper.

The second missing element is the description of the parameters of the BM25 algorithm. This is a parametric algorithm, and the selected parameters can strongly influence its results.
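For context on the parameters in question, a generic Okapi BM25 scorer (a standard textbook sketch, not the paper's implementation) makes the two free parameters explicit:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms` with
    Okapi BM25. k1 controls term-frequency saturation; b controls document
    length normalization. Different (k1, b) settings change the ranking, which
    is why an evaluation should report them."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter()                       # document frequency per term
    for d in docs:
        for term in set(d):
            df[term] += 1
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for q in query_terms:
            if q not in tf:
                continue
            idf = math.log((N - df[q] + 0.5) / (df[q] + 0.5) + 1)
            score += idf * tf[q] * (k1 + 1) / (
                tf[q] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores
```

With b > 0, a shorter document containing the query term the same number of times scores higher than a longer one; tuning b toward 0 removes that effect, illustrating why untuned "vanilla" parameters are a meaningful caveat in a comparison.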
Moreover, when comparing the results of KnowMore_match with BM25, it should be indicated whether the parameters of vanilla BM25 were tuned for the dataset. A reference to the name of the implementation of the algorithm (Indri, Lucene + Solr/ElasticSearch, an independent implementation) would also be valuable.

Another important detail missing from the paper is the inclusion/omission of a held-out corpus used for tuning the parameters of the algorithms. For instance, there is a tau parameter set to 0.5 as an optimal value. If that parameter was determined using the same data that were later used for testing, then the obtained precision/recall/F1 values are invalid. The authors only write that the values were obtained using 10-fold cross-validation, without referring to a held-out corpus, so it is possible that they used the same data for testing and tuning, which would be a major methodological fault.

There is also a problem with the selection of the method used for training the classifiers that are an important part of many of the algorithms. It is not surprising that the Naive Bayes classifier performed the best among the tested methods, since the amount of training data (approx. 200-300 examples) was not enough to train more powerful classifiers, such as SVM. Yet it is surprising that the authors didn't try logistic regression, since it seems to be the most popular first-shot classifier when working with binary classes. What is more concerning regarding the NB classifiers are the remarks in Section 5 on page 8. Namely, the authors indicate that there are predicates having only one valid value for a particular entity (functional or key properties, in DB parlance) as well as properties accepting several valid values. This is obviously a valid statement, but how can this feature be exploited in an NB classifier, whose primary assumption is that the features are independent?
The only way to do so is to train a distinct classifier for each property, but the description lacks any such suggestion. In fact, the set of features (t^p_1 – predicate term) seems to suggest the opposite, namely a "to be learned" dependence of t^p_4 on t^p_1.

The authors selected pairwise percent agreement (PPA) as an indicator of the agreement between the annotators of the data. This is the first paper I have read in the field of computer science that cites this measure. Normally, Cohen's kappa (for two raters) and Fleiss' kappa (for more than two raters) are used. Fleiss and Cohen's paper ("The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability") was written in 1973. There is also a paper by K. Krippendorff ("Reliability in content analysis", 2004) that describes PPA as an invalid method for measuring the agreement between raters. Moreover, some of the datasets are highly skewed (e.g. the data fusion datasets, where the share of positive examples exceeds 80%), so special care should be taken when describing their statistical features.

There are other issues related to the statistical validity of the results and the language used to describe the statistical phenomena. The first is the lack of a margin of error in the reports of precision, recall and F1. Although computer science papers do not often provide that value, it is especially important when the sample sizes are as small as those in the conducted experiments. It might turn out that the compared methods are not significantly different, due to a large margin of error. Regarding statistical significance: the authors report that some results are (in)significantly different from each other (page 11, last paragraph), yet this concept is used without any reference to a statistical test showing that such a claim is valid in the context. The authors also write that some result was 3.8%, 39.4%, etc. higher than another, when the referred quantity is not a relative but an absolute difference; percentage points would be much more appropriate in that case.

All these issues indicate that the paper should undergo a major revision addressing all of them. Regarding the minor issues found in the paper:
1) the paper lacks keywords;
2) high recall is more important than precision at the stage of blocking the entities, but the authors gave equal weights to both when reporting the matching step performance;
3) the objects of the relations are not matched against the extended KBs;
4) using a subscript for indicating the steps of the algorithm is not the best idea, since it is harder for the reader to see the difference between the results of the different steps;
5) "iteratively" on page 6 does not make sense, since there is only one iteration;
6) I believe there should be e_i rather than e_q in the 10th line on page 6;
7) page 8, "fact level": the index i in t^f_i should range over [1,2], not [1,3];
8) only the first words of the titles are capitalized in the References (e.g. dbpedia is not capitalized, etc.).
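The chance-corrected agreement measure the reviewer recommends can be sketched as follows (a generic illustration with made-up labels, not the paper's annotation data). It shows why PPA is misleading on skewed data: raw agreement can be high while kappa reveals the agreement is entirely attributable to chance.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters over the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(labels_a)
    # Observed agreement (this is also the pairwise percent agreement, PPA).
    po = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each rater's marginal label distribution.
    ca, cb = Counter(labels_a), Counter(labels_b)
    pe = sum(ca[c] * cb[c] for c in set(labels_a) | set(labels_b)) / (n * n)
    return (po - pe) / (1 - pe)

# Skewed example: rater B labels everything positive, rater A labels 90%
# positive. PPA is 0.9, but kappa is 0.0 -- no agreement beyond chance.
rater_a = ["pos"] * 9 + ["neg"]
rater_b = ["pos"] * 10
```

On a dataset where over 80% of examples share one class, as in the data fusion datasets described, this gap between PPA and kappa is exactly the concern being raised.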
Review #3
Anonymous submitted on 22/Jun/2017