The novel scalable parallel denoising for Chinese online encyclopedia knowledge base based on the semantic distance of entry tags and Spark cluster

Tracking #: 2516-3730

Ting Wang
Jie Li
Jiale Guo

Responsible editor: 
Guilin Qi

Submission type: 
Full Paper
Because of the open, collaborative nature of online encyclopedias, a large number of knowledge triples are improperly classified in online encyclopedia systems, so it is necessary to denoise and refine open-domain encyclopedia knowledge bases (KBs) to improve their quality and precision. However, missing or inaccurate semantic features of triples lead to a poor refining effect. Besides, for large-scale encyclopedia KBs, processing massive amounts of knowledge results in excessive computing time and poor scalability of the algorithm. To solve the problem of knowledge denoising in the Chinese encyclopedia system, this paper firstly proposes, based on data field theory, a new Cartesian product mapping-based method for quantifying the quality of entry tags, on which the semantic quantification of the encyclopedia KB is carried out. Secondly, this paper proposes a new method based on multi-feature fusion to calculate the semantic distance between "out-of-vocabulary" entry tags and embed it into the potential function, so as to further improve the potential function and the denoising effect on KBs. Thirdly, in order to give the algorithm good scalability, the proposed denoising algorithm is implemented and optimized in parallel on the Spark cluster computing framework. Finally, a comprehensive comparative analysis of denoising effectiveness and time efficiency is conducted against the state-of-the-art distributed Chinese encyclopedia knowledge denoising algorithm. Experimental results on real-world datasets show that the proposed parallel denoising algorithm improves both the efficiency of knowledge denoising and the accuracy of KBs, and outperforms state-of-the-art methods.

Solicited Reviews:
Review #1
Anonymous submitted on 21/Aug/2020
Major Revision
Review Comment:

This paper proposes a method for scalable parallel denoising of a Chinese online encyclopedia knowledge base. Specifically, it first uses a new Cartesian product mapping-based method to quantify the quality of entry tags. Then it uses a multi-feature fusion method to calculate the semantic distance. Finally, it uses the Spark cluster computing framework to accelerate the computation. I have some comments on the work:

1. In Section 1 (Introduction), the motivation for this paper is not very clear. The authors do not tell the readers what kind of noise is being addressed in the knowledge base, how serious this noise is, and what the significance of removing it is.

2. In Section 1 (Introduction), the authors do not mention what the current mainstream denoising methods are and what challenges they still face.

3. In Section 1 (Introduction), the authors claim that the Spark-based parallel denoising algorithm is better than the SOTA methods in terms of precision, recall, and time efficiency, but the approach and the experimental part do not prove this point.

4. In Section 2 (Related Work), much of the cited work has nothing to do with the content of this paper. The authors need to refine this section and explain the relevance of each work to this paper.

5. In Section 3 (Problem Description), the research problem of this paper is not clearly stated, and the authors do not explain what the input and output are.

6. In Section 3.3, it is confusing what the improper classification of knowledge is. Why should each knowledge triple belong to a classification tag? And why does the triple belong to Painting? The example is not helpful.

7. Section 4 (System Design) needs to be rewritten. It lacks an overall framework, and the content of each subsection needs reworking. The authors should state how many components are included and what function each part has.

8. In Section 4.5, the parallel processing part is very similar to the previous paper [1]. The authors should explain the difference between them and state the contribution of this article.

9. In Section 6.1 (Data Set and Design), the dataset is confusing. First, the authors emphasize that they use BaiduBaike as the knowledge base, but they use data from Hudong Baike in the experiment. Second, BaiduBaike contains many categories; why did they choose only 19 of them, and for what reasons?

10. In Section 6.2 (Experiment 1), how is it determined whether the features of a triple match the related sub-categories, and what are the 'features' of a triple? Moreover, the experiment does not prove the effectiveness of the proposed denoising method, and comparative experiments are lacking.

Minor Comments

1. "springs up like a tide": too Chinglish

2. Line 12, Page 2, "classification tree" --> "taxonomy"

3. Subsection titles "2.3 Knowledge Graph" and "2.4 When the Semantic Web meets Big Data" are inappropriate.

4. In Figure 1, "Property Tag" --> "Attribute Tag"

5. Line 30, Page 17, "KB Hudong" --> "Hudong KB"

6. Line 51, Page 18, the abbreviations of Precision (P-value) and Recall (R-value) are inappropriate.

7. In Figure 15, "2 processes" and "4 processes" each appear twice.

Review #2
Anonymous submitted on 20/Sep/2020
Review Comment:

The paper proposes a scalable parallel denoising algorithm. This topic is relevant to this journal and very important in the knowledge graph research domain. Experiments show the new algorithm can increase efficiency and improve the quality of a given knowledge graph. However, this paper has some problems.

1. The problem is not clearly defined. The paper aims to denoise improper category classifications. However, because of the lack of a clear problem definition, some basic questions are not answered in this paper. For example, what kind of data is noise? Why is this kind of data produced by open collaborative editing? If we do not denoise this data, is there any impact on downstream tasks?

2. There are many papers in the related work section, but most of them are not related to this task. I am not sure this is the first work on denoising knowledge graph triples. Thus, this work cannot convince me that traditional approaches were inefficient.

3. In Section 4, the definitions of symbols are very confusing. For example, there is a definition in this section, "P is the data field produced by data in ...". What does the data field mean in this paper? From my understanding, P with subscript X has the same meaning as Formula 1, but I do not understand why X is missing from Formula 1. Another example: Omega is an n-dimensional space and all (X, Y) belong to Omega, but the paper only says X and Y are data objects, so it is very hard to understand how X and Y are used. Beyond these two examples, many symbol definitions are hard to understand.

4. There is no comparison with other algorithms in the Experiment and Analysis section, and the evaluation is based on manual annotation. Thus, I am not convinced this work is more effective.

Based on these four reasons, the quality of the paper is very poor, and the experimental results are not convincing. Thus, I reject this paper for publication.

Review #3
Anonymous submitted on 05/Oct/2020
Major Revision
Review Comment:

This paper aims to address the problem of knowledge denoising in the Chinese encyclopedia system. The authors propose a Cartesian product mapping-based method, together with parallel optimization using Spark, for the semantic quantification of the encyclopedia KB.

The structure of the paper needs to be improved: after reading the first few sections, with their many concept definitions and background descriptions, it is still not clear to me what concrete technical task the work aims to solve.

Many concepts and formulas are formally defined in the paper without examples to explain them, which are definitely necessary to aid understanding. For instance, in Sec. 4.1, an example is needed to better explain the definition of the potential function and its purpose.

This paper appears to overlap with the existing work "A Novel Large-scale Chinese Encyclopedia Knowledge Parallel Refining Method Based on MapReduce" published by the same authors. Therefore, beyond the difference in infrastructure (Spark in this work rather than MapReduce in the previous one), the authors have to describe the major differences between this work and the previous one from an algorithmic perspective.

Overall, this paper seems to be more of an engineering effort. Therefore, the authors need to state the research contributions more clearly.