Plausibility Assessment of Triples with Distant Supervision

Tracking #: 1753-2965

Soon Hong
Mun Yong Yi

Responsible editor: Guest Editors ML4KBG 2016

Submission type: Full Paper
This paper reviews the process of triple validation, improving upon the knowledge base building and population process. It conceptualizes triple validation as a two-step procedure: a domain-independent plausibility assessment and a domain-dependent truth validation applied only to plausible triples. It also proposes a new plausible/nonsensical framework overlaid with a true/false framework. The paper focuses on the plausibility assessment of triples by challenging the limitations of existing approaches. It presents an unsupervised approach and consistently builds both positive and negative training data with distant supervision from DBpedia and Wikipedia. It adopts instance-based learning to skip the generation of pre-defined models, which have difficulty dealing with the varied expressions of triples. The experimental results support the proposed approach, which outperformed several unsupervised baselines. The proposed approach can be used to filter out newly extracted nonsensical triples and existing nonsensical triples in knowledge bases, and even to learn semantic relationships. It can be used on its own, or it can complement existing truth-validation processes. Extending background knowledge for better coverage remains for future investigation.
Solicited Reviews:
Review #1
By Diego Esteves submitted on 09/Dec/2017
Minor Revision
Review Comment:

After reading the authors' responses, the idea behind this paper has a clear contribution and motivation IMO, but the story is not clearly laid out from the beginning of the article. It still lacks simpler and more straightforward explanations, especially regarding the fact that "truth validation" is out of the scope of the paper.

"Triple validation" (or "truth validation of triples") is a complementary step which validates whether a claim represented by a triple (s, p, o) is true or false. Why not call this pre-processing step in a fact-checking pipeline "triple plausibility estimation", or simply "triple plausibility", instead of referring to it as "triple validation"? It would leave less margin for misinterpretation IMO. In this respect, the title of the manuscript ("Plausibility Assessment of Triples...") is far clearer and more objective. For instance, the abstract/introduction are not clear, mixing different concepts, which gives the impression that the contribution is more comprehensive than it actually is (e.g., "triple validation" as domain-range verification vs. "triple validation" as claim/truth validation, which is the far more natural interpretation).

* from this part on I refer to "triple-validation" as a synonym of "truth-validation" and not "domain-range" validation. In this case, for triple-validation in RDF knowledge bases.

The unique contribution of this work is that, although complementary, the tasks are not mutually exclusive. For instance, "plausibility" can be derived from "triple validation" for true examples only (= sensical, always) but cannot be derived for false examples (since a false example might be plausible or not). Likewise, "triple validation" can be derived from "plausibility" for nonsensical examples (= false, always), but cannot be derived for sensical examples (since a sensical example might be false or true).
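The derivation rules above can be sketched as two partial functions (my own illustration of the argument, not the authors' code; all names are hypothetical):

```python
# Sketch of when one task's label can be derived from the other's,
# following the argument above. None means "cannot be derived".

def derive_plausibility(truth_label):
    """Derive a plausibility label from a truth-validation label, if possible."""
    if truth_label == "true":
        return "plausible"   # a true triple is always sensical
    return None              # a false triple may or may not be plausible

def derive_truth(plausibility_label):
    """Derive a truth label from a plausibility label, if possible."""
    if plausibility_label == "nonsensical":
        return "false"       # a nonsensical triple is always false
    return None              # a plausible triple may still be true or false
```

This makes the asymmetry explicit: each direction is defined on exactly one of the two input labels.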

Thus, there are still two major issues I do not agree:

(1) the impression, when reading the paper, that it also performs triple validation (truth validation), which is not true. I always have the impression you will evaluate the joint task of "plausibility" and verification/validation ("fact validation") at some point. The intention of the paper should be clearer IMO, avoiding passages such as "This paper conceptualizes triple validation as a two-step procedure: a domain-independent plausibility assessment and a domain-dependent truth validation only for plausible triples". That is still confusing:
- according to the paper, YES (Pages 1 and 2: "A plausible/nonsensical framework overlaid with a true/false framework...").
- according to other passages of the paper, NO (Page 2: "The present research discusses both plausibility assessment and truth validation. However, it implements plausibility assessment only").
- according to your last answer, NO.

(2) the lack of benchmarks comparing "truth validation" versus "plausibility + truth validation". The authors claim that the benefits of having "triple plausibility" as a pre-processing function are clear (P2: "The benefits of plausibility assessment are obvious").

Does it minimize error propagation or not? Can it improve the performance of fact-checking algorithms? Although this argument may be theoretically valid, why not show it in a real scenario? It is possible that nonsensical examples would be correctly labelled as outliers by machine-learning-based classifiers from the beginning (i.e., this could be learned during their training phase). Such a benchmark analysis would better support your arguments associated with the fact-checking task, for instance.

Review #2
Anonymous submitted on 14/Mar/2018
Review Comment:

This paper introduces an approach for classifying triples into plausible and implausible cases using distant supervision. The authors present their approach and evaluate it using data from DBpedia and Freebase.

I found this an incredibly hard paper to review, as I am rather convinced that it contains good and relevant work, but I failed to really understand what is going on. So, I really did my best to understand the technical contribution. There are two reasons for me to suggest rejecting this paper in the end: first, I believe that it is mostly the writing of the paper that was responsible for me not understanding it, and second, once I understood it, I have significant doubts about the validity of the approach.

The writing itself is not bad, and there are examples to explain the technical decisions. But the paper completely fails to properly define the problem it addresses and to give an overview of the research method applied.

While the general idea of addressing implausible cases is interesting, I failed to find a specific definition of what this actually means. The claim is that triple (c) is implausible, as it violates the range restriction. While this is one indication of implausibility, I am sure it (together with the domain) is not the only one, and even this definition is highly dependent on the specific model in use. In my view, (c) is only implausible if we know that cities and ingredients are incompatible. Otherwise, (a) and (b) are as plausible as (c). This is not a detail, as later in the paper the entire evaluation is based on an evaluation set that is built entirely on the assumption that the difference between plausible and implausible is that a node violates either domain or range. There is a discussion in Section 2, but this is about the solution, and it is only technical w.r.t. the choice of positive or negative examples.

Given that the paper does not clearly specify its goal and scope, it is very difficult to later assess the validity of the evaluation, or the conclusions drawn from it (more on this later).

The fact that there is no overview of the research approach taken in the paper is mostly annoying, as it makes the paper difficult to read and leaves the reader regularly guessing. So, as I said, I tried to understand the paper, and this is what I made out of it: the task is to classify a previously unseen triple into one of two classes, plausible or not plausible. The method for classifying a triple (s, p, o) is described in Section 3, where (1) a number of triples are automatically created by substituting s, p, or o (or several of them), then (2) these triples are labelled, and (3) a kNN classifier is applied to decide whether (s, p, o) is plausible or not. Sections 4 and 5 then evaluate this method with data from DBpedia and Freebase. If this is indeed correct, then it would have been easy to state, but it is never stated explicitly, and for a reader who is not that familiar with the problem, task, and approach, it is not trivial to work this out, particularly as several of the technical notions are never introduced (take, e.g., test and training triples). A lot of clarification is needed here.
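To make my reading explicit, the pipeline as I reconstruct it can be sketched as follows (my own reconstruction, not the authors' implementation; all names, the distance function, and the feature choices are hypothetical):

```python
# Sketch of the three-step pipeline reconstructed above:
# (1) generate training triples by substitution, (2) label them
# (labelling omitted here), (3) classify the test triple by kNN vote.

from collections import Counter

def substitute(triple, candidates_s, candidates_p, candidates_o):
    """Step 1: generate candidate training triples by substituting
    the subject, predicate, or object of the test triple."""
    s, p, o = triple
    return ([(x, p, o) for x in candidates_s]
            + [(s, x, o) for x in candidates_p]
            + [(s, p, x) for x in candidates_o])

def knn_plausible(test_triple, labelled, distance, k=3):
    """Step 3: majority vote over the k labelled training triples
    nearest to the test triple under the given distance function."""
    nearest = sorted(labelled, key=lambda t: distance(test_triple, t[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

If this sketch matches the authors' intent, stating it this compactly in the paper would already resolve much of the confusion.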

The main reason, though, for suggesting to reject the paper is not the writing, which could probably be fixed, but my doubts about the validity of the approach, and more specifically, the choice of evaluation. As far as I understand, the authors produce a test set of 4945 triples, about half of which they consider plausible and the other half implausible. Again, the description of the construction of these datasets is vague. On page 9, the paper says: "triples were obtained applying the following rules", and then rules R1-R10 are presented. But these rules do not produce or pick any triples; they are constraints. Where do the candidates come from in the first place? Is it a random sample from DBpedia to which the rules are applied? I cannot find in the paper how this is done.

But even if this were clear, the main problem is that two automatically generated sets are compared with each other, both using rather ad hoc methods to decide on plausibility, and both being based on the same or similar datasets (DBpedia/Wikipedia). This implies that the new approach is not shown to work well as a classifier for plausibility in general, but only for examples that adhere to rules R1-R10. Here, the only real difference is R5 and R8. This restricts the validity (and insightfulness) of the results enormously. Basically, the experiments show that the approach can reasonably check for violations of domains and ranges. It is unclear to me whether only those cases are studied where there is a formal domain or range violation, i.e., where the domain of a relation is formally specified. Unfortunately, this is not described in the paper, while it is a crucial assumption the entire evaluation is based on.

There are two cases: either explicitly modelled domain/range restrictions are used, but then there will be an enormous amount of incompleteness; moreover, should the data really be available, it would require a reasoner, rather than two PostDocs, to do the job. So, I assume the PostDocs did this rather informally, which comes down to a check for "plausibility" in the disguise of a check for domain. But then the annotation becomes a crucial element of the experiments. Which criteria were applied?
As I fail to see what is compared with what, and why, the entire evaluation in Section 4 appears rather meaningless to me.

Interestingly enough, my intuition is that the approach is interesting, and even the experiments are interesting (though, in my view, not a proof of the validity of the approach w.r.t. plausibility). Unless I am very mistaken, the two ways of establishing *plausibility* (Sections 3 and 4) are very different in nature, as the main information in Section 4 is based on a failure to comply with domain and range constraints, whereas in Section 3 background knowledge is used to label training examples. Unfortunately, I fail to understand this section (3.3). No intuition is provided for the idea of using co-occurrences as the basis for the main features for the labelling. The definition, on page 8, of when a triple is labelled plausible is that the distance score is higher than some critical value, where the distance score is the p-value of the test statistic, which indicates whether the triple is a chance event or not. First, this remains vague, but even if it did not, why would this work? This seems to be the core idea of the approach (that background knowledge helps to assess the plausibility of the training examples), but no explanation is provided.

If I were to conjecture what is going on, the idea is that the training triples are of similar types as the test triple. By checking the co-occurrence of those triples, one can check how realistic such a triple is w.r.t. Wikipedia. But why not test this directly with the test triple? Why would it be more reliable to first produce training examples and "guess" their plausibility? By construction, they should be less likely to be true than the test triple. I am not saying that the construction is incorrect, but it should be explained better, and perhaps the approach should be compared with one that applies the labelling of Section 3.3 directly to the test example. From the description in the paper, it remains intuitively unclear to me what the added value of the learning is.

To summarise: I believe that the main problem of the paper is that both the research question and the approach remain vague. It is difficult to disagree with the authors that a plausibility check would be useful before checking the correctness of triples. But to agree with the statement, we would need to know exactly what this means. Similarly, in order to say whether the approach is correct, it needs to be described rigorously and more systematically (not at the level of detail, which is okay). And finally, the experiments have to be described and analysed more carefully as well. What assumptions are made in the construction of the experimental evaluations, and why are they valid?

Some minor comments:
Fig. 1: In the table for the labelled training triples, the p-value is given, but the table should contain, as far as I understand, the label, which is plausible or not plausible.
Page 8: Generalization triples are defined without an argument, but after triple (e) they are used with a triple as an argument.
R10: Why do you want a testing set with the same number of plausible and nonsensical triples? In the real world there are far more nonsensical triples. This choice might be a good one, but it should be argued for.

Review #3
Anonymous submitted on 17/May/2018
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.