KnowMore - Knowledge Base Augmentation with Structured Web Markup

Tracking #: 1552-2764

Ran Yu
Ujwal Gadiraju
Besnik Fetahu
Oliver Lehmberg
Dominique Ritze
Stefan Dietze

Responsible editor: 
Guest Editors ML4KBG 2016

Submission type: 
Full Paper
Abstract: 
Knowledge bases are in widespread use for aiding tasks such as information extraction and information retrieval, where Web search is a prominent example. However, knowledge bases are inherently incomplete, particularly with respect to tail entities and properties. On the other hand, embedded entity markup based on Microdata, RDFa, and Microformats has become prevalent on the Web and constitutes an unprecedented source of data with significant potential to aid the task of knowledge base augmentation (KBA). However, RDF statements extracted from markup are fundamentally different from traditional knowledge graphs: entity descriptions are flat, facts are highly redundant and of varied quality, and explicit links are missing despite a vast amount of coreferences. Therefore, data fusion is required in order to facilitate the use of markup data for KBA. We present a novel data fusion approach which addresses these issues through a combination of entity matching and fusion techniques geared towards the specific challenges associated with Web markup. To ensure precise and diverse results, we follow a supervised learning approach based on a novel set of features considering aspects such as quality and relevance of entities, facts and their sources. We perform a thorough evaluation on a subset of the Web Data Commons dataset and show significant potential for augmenting existing KBs. A comparison with existing data fusion baselines demonstrates superior performance of our approach when applied to Web markup data.

Decision: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 21/Mar/2017
Minor Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

This paper addresses an important and timely problem in Linked Data: how to enrich the incomplete description of entities within a dataset using external sources.

It is a significant extension of a preliminary work of the authors.

The paper is globally well written and easy to read. The evaluation part is clearly explained and discussed. The main contribution is a pipeline to enrich an RDF dataset based on RDF statements extracted from Web markup, integrating clustering and supervised classification techniques based on crowdsourcing.

My main concern is the lack of formal definitions of several important notions that are just introduced by sentences and illustrated by examples.
In particular, "(long) tail entities (types, properties)" is mentioned several times without being defined precisely, and without explanation of the impact of this feature on the problem addressed in this paper.
Also, the notion of co-references is central but not formally defined, as well as what an entity is in the setting of this paper. The notions of "real-world entity", "entity identifier", and "entity descriptions" should be more clearly defined and distinguished. In Section 3.2, there is a mismatch of notations between the notation $e_q$, denoting in Definition 1 "a description of a (real-world or identified ?) entity", and the notation $e_s$ mentioned in the above paragraph, denoting "the entity description of e corresponding to subject s".
Concerning the last point (*the* entity description of e corresponding to subject s): how are multivalued properties handled in the proposed approach?

In Definition 1, the set F' is not defined: it would be better to say "we aim at selecting a subset F' of M .... Each fact $f_i$ in F' represents a valid ...".
BTW, it is not clear what a fact is: an RDF triple? In this case, it is not consistent with the statement "property-value pair describing the entity q" in the definition (why is the subject not mentioned? Is it q, or more precisely the identifier of the real-world entity q?).

To be called a definition, Definition 1 should formally define the notion of a "correct" fact: how can it be checked that a fact "is consistent with the real world"?

In the entity matching paragraph: the set E is not defined either; what is the definition of "co-referring entity descriptions"?

Pseudo-keys are mentioned but not explained or defined.

Review #2
By Aleksander Smywinski-Pohl submitted on 28/Mar/2017
Major Revision
Review Comment:

The paper presents a novel method designed for augmenting KBs with data extracted from Web markup. It concentrates on markup, and the evaluation is performed on two entity types, books and movies, which are broadly available on the Web. The performance of the presented method (KnowMore) is compared with two other methods, PrecRecCorr and CBFS. The results presented in the paper indicate that the implemented method yields better results than these state-of-the-art data augmentation methods. The system invented by the authors thus seems to be an important contribution to research in the field of KB augmentation in particular and the Semantic Web in general, so both the originality and the significance of the work should be judged highly.

Yet the positive impression of the paper is undermined by several issues, connected primarily with the scientific validity of the results as well as the precision of the description.

The primary missing element of the paper is the description of the measures used for computing the similarity of the matched entities (cf. page 7, first paragraph). The authors only mention that they used cosine similarity for text, but that is only a vague remark. The reader is referred to the source code of the system, but such a reference is not a good idea. First, the published code will probably undergo further development, so the measures might be redefined, making the results impossible to reproduce. Second, due to the nature of software, it might be hard to find the particular piece of source code responsible for implementing the measure referenced in the text. Thus all the methods used for computing the similarity of the features should be formally described in the paper.
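To illustrate how compact such a formal description could be: the sketch below is a generic token-based cosine similarity, not the authors' actual implementation, which is exactly why the paper needs to spell out its own variant (tokenization, weighting, normalization).

```python
from collections import Counter
from math import sqrt

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity of raw term-frequency vectors built from
    whitespace-tokenized, lowercased text (one of many possible variants)."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = sqrt(sum(c * c for c in va.values()))
    norm_b = sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Even in this tiny sketch, choices such as lowercasing or TF vs. TF-IDF weighting change the scores, so "cosine similarity" alone underspecifies the measure.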

The second missing element is the description of the parameters of the BM25 algorithm. This is a parametric algorithm, and the selected parameters can strongly influence its results. Moreover, when comparing the results of KnowMore_match with BM25, it should be indicated whether the parameters of vanilla BM25 were tuned for the dataset. A reference to the implementation of the algorithm used (Indri, Lucene + Solr/ElasticSearch, an independent implementation) would also be valuable.
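The sensitivity to the free parameters k1 (term-frequency saturation) and b (length normalization) is easy to see in a minimal, self-contained Okapi BM25 scorer; the toy corpus and parameter values below are purely illustrative.

```python
from math import log

def bm25_score(query, doc, corpus, k1=1.2, b=0.75):
    """Okapi BM25 score of a tokenized `doc` for a tokenized `query`.
    k1 and b are the tunable parameters the review asks to be reported."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query:
        n_q = sum(1 for d in corpus if term in d)      # document frequency
        idf = log((N - n_q + 0.5) / (n_q + 0.5) + 1)   # BM25+ style IDF
        tf = doc.count(term)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

# Toy corpus of tokenized entity names (illustrative only).
docs = [
    ["pulp", "fiction"],
    ["forrest", "gump", "film"],
    ["moby", "dick", "novel"],
]
```

Because different k1/b settings reorder results for documents of different lengths, reporting whether (and on what data) they were tuned is essential for a fair comparison.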

Another important detail missing from the paper is whether a held-out corpus was used for tuning the parameters of the algorithms. For instance, there is a tau parameter set to 0.5 as an optimal value. If that parameter was determined using the same data that were later used for testing, then the obtained precision/recall/F1 values are invalid. The authors only write that the values were obtained using 10-fold cross-validation, without referring to a held-out corpus, so it is possible that they used the same data for testing and tuning, which would be a major methodological fault.
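The standard remedy is nested cross-validation: the parameter (here a placeholder threshold tau, standing in for the paper's tau) is tuned only on the training portion of each outer fold and evaluated on the untouched test fold. The following is a generic sketch, not the authors' setup.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous, disjoint folds."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def nested_cv(data, labels, candidate_taus, score_fn, k=10):
    """Outer loop estimates performance; tau is tuned only on inner data.
    score_fn(tau, indices, data, labels) returns a quality score."""
    results = []
    for test_idx in k_fold_indices(len(data), k):
        train_idx = [i for i in range(len(data)) if i not in test_idx]
        # Select tau using only the training portion (no test leakage).
        best_tau = max(candidate_taus,
                       key=lambda t: score_fn(t, train_idx, data, labels))
        # Evaluate the tuned tau on the held-out test fold.
        results.append(score_fn(best_tau, test_idx, data, labels))
    return sum(results) / len(results)
```

If instead tau were picked by maximizing performance over the very folds later reported as test results, the reported precision/recall/F1 would be optimistically biased, which is exactly the concern raised above.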

There is also a problem with the selection of the methods used for training the classifiers that are an important part of many of the algorithms. It is not surprising that the Naive Bayes classifier performed best among the tested methods, since the amount of training data (approx. 200-300 examples) was not enough to train more powerful classifiers such as SVM. Yet it is surprising that the authors didn’t try logistic regression, since it seems to be the most popular first-shot classifier when working with binary classes. What is more concerning regarding the NB classifiers are the remarks in Section 5 on page 8. Namely, the authors indicate that there are predicates having only one valid value for a particular entity (functional or key properties, in DB parlance) as well as properties accepting multiple valid values. This is obviously a valid statement, but how can this feature be exploited in an NB classifier, whose primary assumption is that the features are independent? The only way to do so is to train a distinct classifier for each property, but the description lacks any such suggestion. In fact, the set of features (t^p_1 – predicate term) seems to suggest the opposite, namely a "to be learned" dependence of t^p_4 on t^p_1.

The authors selected pairwise percent agreement (PPA) as an indicator of the agreement between the annotators of the data. This is the first paper I’ve read in the field of computer science that uses this measure. Normally Cohen’s kappa (for two raters) and Fleiss’ kappa (for more than two raters) are used. Fleiss and Cohen’s paper (“The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability”) was written in 1973. There is also a paper by K. Krippendorff (“Reliability in content analysis”, 2004) that describes PPA as an invalid method for measuring agreement between raters. Moreover, some of the datasets are highly skewed (e.g. the data fusion datasets, where the number of positive examples exceeds 80%), so special care should be taken when describing their statistical features.
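As a concrete reference point, Fleiss' kappa is straightforward to compute from its textbook formula; the implementation below is generic (it assumes every subject is rated by the same number of raters) and illustrates why chance-corrected agreement matters on skewed data.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa. ratings[i][j] = number of raters who assigned
    subject i to category j; each subject has the same number of raters."""
    n_subjects = len(ratings)
    n_raters = sum(ratings[0])
    n_total = n_subjects * n_raters
    # Chance agreement from the marginal category proportions.
    p_j = [sum(row[j] for row in ratings) / n_total
           for j in range(len(ratings[0]))]
    p_e = sum(p * p for p in p_j)
    # Observed agreement: mean per-subject pairwise agreement.
    p_bar = sum((sum(c * c for c in row) - n_raters) /
                (n_raters * (n_raters - 1))
                for row in ratings) / n_subjects
    return (p_bar - p_e) / (1 - p_e)
```

On a skewed toy example such as `[[3, 0], [3, 0], [2, 1]]`, the raw pairwise agreement (the quantity PPA reports) is about 0.78, yet kappa is negative, because almost all of that agreement is expected by chance. This is precisely Krippendorff's objection to PPA.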

There are other issues related to the statistical validity of the results and the language used to describe statistical phenomena. The first is the lack of a margin of error in the reports of precision, recall and F1. Although computer science papers do not often provide that value, it is especially important when the sample sizes are as small as those in the conducted experiments. It might turn out that the compared methods are not significantly different, due to a large margin of error. Regarding statistical significance – the authors report that some results are (in)significantly different from each other (page 11, last paragraph), yet this claim is made without reference to any statistical test showing that it is valid in the context. The authors also write that some result was 3.8%, 39.4%, etc. higher than another, when the referred quantity is not a relative but an absolute difference. Percentage points would be much more appropriate in that case.
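The percentage-point distinction is easy to make concrete; the precision values below are hypothetical, chosen only to show that the same gap reads very differently as an absolute versus a relative figure.

```python
# Hypothetical precision values for a baseline and a new method.
p_baseline, p_new = 0.702, 0.740

# Absolute difference, expressed in percentage points (pp).
absolute_diff_pp = (p_new - p_baseline) * 100

# Relative improvement over the baseline, expressed in percent.
relative_diff_pct = (p_new - p_baseline) / p_baseline * 100
```

Here the gap is 3.8 percentage points but roughly a 5.4% relative improvement; writing "3.8% higher" for the former conflates the two.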

All these issues indicate that the paper should undergo a major revision addressing all of them.

Regarding the minor issues found in the paper:
1) the paper lacks keywords
2) high recall is more important than precision at the blocking stage, but the authors gave equal weight to both when reporting the performance of the matching step
3) the objects of the relations are not matched against the extended KBs
4) using a subscript for indicating the steps of the algorithm is not the best idea, since it’s harder for the reader to see the difference between the results of the different steps
5) “iteratively” on page 6 – doesn’t make sense, since there is only one iteration
6) I believe it should be e_i rather than e_q in the 10th line on the 6th page
7) page 8, “fact level”: for t^f_i, i should be in [1,2], not [1,3].
8) only the first words of titles are capitalized in the References (e.g. "dbpedia" is not capitalized, etc.)

Review #3
Anonymous submitted on 22/Jun/2017
Major Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

The paper discusses knowledge base augmentation from structured web markup data. The problem seems to be an instance of data fusion. The idea of using web markup data for KB enrichment is interesting, and, due to its potential complementary nature wrt. sources used for standard KB construction, might have potential for a huge enrichment of existing knowledge bases. The paper focuses on two parts within the pipeline, namely entity resolution and data fusion.

Unfortunately there are several serious weaknesses.
1) Novelty and significance:
- Contribution 1: While the contribution mentions a novel and specifically tailored data fusion pipeline, it remains largely unclear where the novelty lies and what is specifically tailored to the problem. Diversification techniques are claimed to be included, but they look just like deduplication techniques.
- Contribution 2: Similarly, it is claimed that a novel fusion approach is introduced, but I do not understand which specific parts of it are novel. The same holds for the features. The paper claims "We propose and evaluate an original set of features", but I do not see any discussion or evaluation of the individual features.
- Contribution 3: This looks to me like the practically most interesting aspect of the work, but also here the paper does not substantiate the claims it makes, i.e., does not contain a significant analysis of the enrichment potential.
2) The paper is hard to read, as it suffers from a lack of examples and an unclear formalization, and is in many parts not self-contained. It gives the impression of a technical report that mentions what was done. I would expect much more discussion of options, reasons for choices, and consequences of choices.

Detailed Comments

Potential for Enrichment
The contribution of the paper would be considered huge if it was accompanied by the full dataset of all new statements that were derived by the proposed method. If that is practically difficult, I would at least like to see an extrapolation of what can be expected as total number of new facts, based on the small-scale experiments conducted in the paper. At the moment there is only a discussion for movies, but if we assume that all objects behave movie-like, how would that extrapolate?
The claim "On average, KnowMore populates 14.7% of missing statements in Wikidata, 11.9% in Freebase and 23.7% in DBpedia." appears to be wrong, as you do not know how many statements are missing in the first place. E.g., if a movie has no award in a KB that does not mean that the KB is missing an award. Similarly, if a movie already has an award in a KB, you don't know how many more might be missing.

Diversity
I was mildly confused to see this term mentioned as a goal in the abstract, associating it first with diversity as in search result diversification. I understand that something else is meant, and I also understand that this diversity can be measured in the output, but I do not see how Section 5.2 achieves higher diversity. The section seems to talk only about deduplication. In which way can deduplication influence diversity? Possibly by adjusting a confidence threshold, leading to a classical precision-recall tradeoff? This needs complete rewriting.

I found large parts of the paper to be not self-contained or badly explained. A significant amount of information is outsourced to references, or only explained in ways understandable for people involved in building this pipeline. Some examples are below
- Please give an example of markups and explanation of markups early on, do not only mention that there is some dataset
- Terms such as RDF-Quads, BM-25, blocking, pay-level-domain should be explained briefly when they are introduced (e.g., using relative clauses like "BM-25, an entity ranking algorithm, ...")
- "Previous work only considers correctness by measuring the quality of the source. ..." - Explain difference better
- 3.1 First sentence: Explain, not just reference
- 3.1 Last paragraph: Coherence: the investigation before is about movies and books, while the conclusion talks about highly volatile fields being present in markup data but not in KBs. Explain this step.
- Terminology in 3.2 is unclear. Are e_s sets? Is q an entity or a subject? Types also seem mixed up in "a set of facts f_i \in F' from M": f_i is probably not a set of facts but a single fact? F' is the set? Why F', and what is F? Suddenly there is a query q that was not mentioned before; what is its role? Please explain the whole of Definition 1 with an example. The same type error appears for "e_i \in E from M".
- 3.3 "name field" - unclear what that is or where it comes from? Does "attributes" mean "criteria"?
- Section 4.1 is completely unreadable without consulting external literature. The title "Blocking" is not explained beforehand, what the search space is remains unclear, co-references, ...
- 4.2: Explain first what you want to do and why, then reference something
- 4.3: Same problem as 4.1.
- "Features t_r^i \in [1,2,3]" and other features - use names, explain what they are before explaining why they are good
- "Given that our candidate sets contains many near-duplicates, we approch this problem through ..." - Which problem? The whole paragraph hangs in the air; the parts before and after look like an explanation of features, but this paragraph is about clustering to solve an undefined problem.
- "We followed the guidelines laid out by Strauss [31]" - Guidelines for what? Why did you do that?
- Evaluation results should not only describe the obtained numbers, but also give reasons why something is better or worse than something else
- "experimented with different decision making parameters as discussed in [24]" - ?

Writing style, typos
- Abstract: "On the other hand" - fix style
- "facilitate" - "enable"?
- "aid KB augmentation" - "do KB augmentation"?
- Contribution makes excessive use of the term "novel"
- "setup; in that they"
- "Dong et al. ..." - fix grammar
- "classified into two classes" - style
- "refuse" - "fuse again"
- "selected...retrieve" - unify tense
- 3.2 first sentence: Fix grammar (relative clause)
- 4.3: "sim" - fix Latex typesetting
- "that, there"
- terms[17]
- "entities type" - add "of"
- "extracted respectively" - add comma
- "6 USD cents" - maybe "6 ct." or "$0.06"?
- "Baselines. We consider ..." - fix grammar
- "product, shows"

Other factual issues
- KBs are still incomplete - Not only still, but for conceptual reasons, they always will be incomplete. See e.g. the paper "But what do we actually know" (AKBC 2016)
- "usually there exist n \geq 0" - Tautology, thus it should be "always" instead of "usually" (though of course the whole statement is meaningless)
- SVM is known across an overwhelming range of applications to be one of the best classifiers. That Naive Bayes outperforms it makes me wonder whether something was done wrong in the experiments. Please explain this finding.
- "A fact is considered novel, if not a second time in our source markup" - Odd definition. If it is not already in the KB I would consider it novel, no matter whether it is once or 10 times in the markup dataset?