Mapping Process for the Task: Wikidata Statements to Text as Wikipedia Sentences

Tracking #: 2860-4074

Authors: 
Thang Ta Hoang
Alexander Gelbukh
Grigori Sidorov

Responsible editor: 
Aidan Hogan

Submission type: 
Full Paper
Abstract: 
Acknowledged as one of the most successful online cooperative projects in human society, Wikipedia has grown rapidly in recent years and continuously seeks to expand its content and disseminate knowledge to everyone around the globe. The shortage of volunteers brings many issues to Wikipedia, including the development of content for over 300 languages at present. Therefore, the benefit of machines automatically generating content to reduce human effort on Wikipedia language projects could be considerable. In this paper, we propose a mapping process for the task of converting Wikidata statements to natural language text (WS2T) for Wikipedia projects at the sentence level. The main step is to organize statements, represented as a group of quadruples and triples, and then to map them to corresponding sentences in English Wikipedia. We evaluate the output corpus in various aspects: sentence structure analysis, noise filtering, and relationships between sentence components based on word embedding models. The results are helpful not only for the data-to-text generation task but also for other relevant works in the field.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 09/Sep/2021
Suggestion:
Minor Revision
Review Comment:

The work demonstrates the alignment between Wikidata statements and sentences within the Wikipedia project. The authors coin the term WS2T (Wikidata statements to natural language text), which is closely related to NLG. The authors split the methodology into sub-processes, i.e. aligning subjects, predicates or objects, as well as qualifiers, with Wikipedia sentences. For the evaluation, entity linking methods/tools such as AIDA or Wikifier are used to assess the proposed approach independently.

The work is well written and covers an interesting research challenge on data alignment and semantic similarity, but it is very hard to follow for a non-expert in this field. Specifically, Section 4 should be discussed with some non-experts to make it clearer for future readers. Another major criticism concerns the somewhat cryptic evaluation section, not in terms of length, but in terms of the information on the evaluation set and the lack of a comparison with other approaches. Tables 9/10 and onwards are interesting, but no analysis is done beyond presenting them. The authors also talk about the usage of different datasets, but their methodology/analysis only splits the Wikipedia sentences into sub-items, which does not make a new dataset in my view.

Further comments to the authors:
- is it possible that the figures are wrongly set? In Sec 3, "Figure 2, CODE refers to an ... Simone Laria ..." - looking at Fig 2, this is not clear, but it makes more sense when looking at Fig 1; the same in Sec 3.3, "In Figure 1, item CODE has three statements corresponding to three ..." - here it would make more sense to refer to Fig 2. Also, why is Fig 1 mentioned after Fig 2?
- in current Fig 2, would it make sense to extend the item Q36222633 with its value, to have complete information?
- in Sec 3 "organized in two multilingual objects", why multilingual? - is this the right term as we focus here only on English?
- in Sec 6.2, what is the difference between "The average of the number of words" vs "tokens per sentence"? Occurrences of words are tokens (from a CL point of view); did you mean types (unique word forms)?

Some typos or unclear phrases:
- 3.4.2 "Like almost of WST-1 ..." - maybe "Similarly to WST-1 ..."
- 4.5 "... that satisfy conditions ..." >> "that satisfies conditions"
- 4.5 "The reason is there may have a case" - something missing? or unclear
- 4.7 "Except for deviring"? this is not clear?

Review #2
By Pavlos Vougiouklis submitted on 01/Oct/2021
Suggestion:
Major Revision
Review Comment:

The paper proposes a methodology for aligning Wikidata statements with natural language sentences from Wikipedia. In general, the paper is well-written, and most parts of the proposed pipeline are clearly described. I also appreciate the authors' effort to offer a mathematical formulation for the entire process of building the resulting dataset. The authors have already made their data openly available at a persistent URL on GitHub (i.e. at https://github.com/tahoangthang/Wikidata2Text). The repository looks to be in good shape, and sufficient documentation about the resources is provided.

My concern regarding the paper is mostly associated with: (i) how it relates to other existing works that seek to generate similarly-minded datasets, and (ii) how the quality of the resulting corpus is evaluated. The work introduces a methodology for obtaining high-quality alignments between English Wikipedia sentences and Wikidata quads (i.e. subject-property-object triples along with their corresponding qualifier information). However, I believe the provided Corpus Evaluation section does not offer sufficient evidence about the closeness of the resulting quads to their corresponding sentences and annotations.
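
For concreteness, a quad of the kind described (subject-property-object plus qualifiers) could be represented roughly as in the sketch below; this is my own illustration, not the authors' actual schema, and the example values are hypothetical.

    # Illustrative sketch only (not the authors' schema): a Wikidata "quad" as a
    # subject-property-object triple extended with qualifier information.
    from dataclasses import dataclass, field
    from typing import Dict

    @dataclass
    class WikidataQuad:
        subject: str                 # Q-id of the statement's subject, e.g. "Q137280"
        predicate: str               # P-id, e.g. "P39" (position held)
        obj: str                     # object Q-id or literal
        qualifiers: Dict[str, str] = field(default_factory=dict)  # e.g. {"P580": "2010"} (start time)

    # Hypothetical example: "<subject> held the position <obj> starting in 2010"
    quad = WikidataQuad(subject="Q137280", predicate="P39", obj="Q30185",
                        qualifiers={"P580": "2010"})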

Furthermore, the data collection involved quads built from triples spanning only six different properties (P26, P39, P54, P69, P108 and P166). To a substantial extent, the work in the paper is motivated by its usefulness for training Natural Language Generation systems. Nonetheless, I believe that developing a system focused on the realisation of only six properties would vastly hinder its lexical variability. Consequently, the authors should elaborate further on their decision to narrow their experiments to only these properties, and on how feasible it would be for their data collection process to be extended to less popular properties in the knowledge graph.
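
To make the feasibility question concrete: the sketch below shows how a property-parameterized collection query might look against the public Wikidata Query Service. This is my own sketch under assumptions about the pipeline, not the authors' actual collection code.

    # Sketch: retrieve statements that carry at least one qualifier, for a
    # configurable property list, via the public WDQS endpoint. Extending the
    # collection would amount to growing PROPERTIES with rarer P-ids.
    from SPARQLWrapper import SPARQLWrapper, JSON

    PROPERTIES = ["P26", "P39", "P54", "P69", "P108", "P166"]

    def fetch_qualified_statements(prop, limit=50):
        query = f"""
        SELECT ?item ?object ?qualProp ?qualValue WHERE {{
          ?item p:{prop} ?stmt .
          ?stmt ps:{prop} ?object .
          ?stmt ?qualProp ?qualValue .
          FILTER(STRSTARTS(STR(?qualProp), "http://www.wikidata.org/prop/qualifier/"))
        }} LIMIT {limit}
        """
        sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
        sparql.setQuery(query)
        sparql.setReturnFormat(JSON)
        return sparql.query().convert()["results"]["bindings"]

    for prop in PROPERTIES:
        print(prop, len(fetch_qualified_statements(prop)))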

The literature review focuses mostly on research articles that propose Natural Language Generation systems, and briefly describes their proposed solutions. However, there has been an extensive line of work that seeks to build datasets aligning knowledge base triples with texts. The works by Toutanova et al. (2015), Gardent et al. (2017), Elsahar et al. (2018) and Vougiouklis et al. (2020) are only some representative examples, which I believe should be included in the literature review. Furthermore, while the work by Mrabet et al. (2016) is included in this section, there is not enough information about how the proposed methodology relates to it. I believe the paper would greatly benefit from a paragraph clarifying further how the proposed dataset creation process differs from other works that provide alignments of DBpedia or Wikidata triples with Wikipedia sentences, in particular, the ones that are the most relevant to it (Mrabet et al., 2016; Gardent et al., 2017; Elsahar et al., 2018). Including a table with comparative statistics w.r.t. various dataset characteristics, such as the number of dataset instances, and entity and property coverage, would be very useful as well.

While I appreciate the technical details provided in Section 5.3, I believe the paper would also benefit from a clearer description of the input and output of each algorithm used during the data collection process. A figure or an example that is extended across the entire section would be very helpful in this direction. I also believe some of the provided algorithms focus too much on development details. I would recommend including some of these details in a separate Appendix, allowing this section to highlight the aspects of greater scientific merit in the work.

More details should be provided about the setup used for the evaluation of the mapping in Section 6.1. My understanding is that we want to explore how well a system for Named Entity Recognition (NER) can match the subjects and objects from the quad in the corresponding sentence. It is not clear from the manuscript how the five different NER systems are leveraged for Type Matching, and which part of the dataset is used for training and testing (i.e. in the case of the tr-******-sc* setups). I believe $N_u$ is defined as the set of matches that were made incorrectly by the Entity Linking algorithm. If that is the case, this should be explained more clearly in the second paragraph of Section 6.1.
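
To make the setup I have in mind explicit, the sketch below checks whether an off-the-shelf NER system recovers the quad's subject and object mentions in the aligned sentence; spaCy is used here purely as a stand-in for the five systems in the paper, and the example values are my own.

    # Sketch of the type/entity matching check as I understand it. spaCy's small
    # English model stands in for the NER/entity-linking systems used in the paper.
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def subject_object_recovered(sentence, subject_label, object_label):
        ents = {ent.text.lower() for ent in nlp(sentence).ents}
        return subject_label.lower() in ents, object_label.lower() in ents

    # Hypothetical aligned pair (illustrative values, not taken from the corpus):
    sent = "Simone Loria played for A.C. Siena from 2006."
    print(subject_object_recovered(sent, "Simone Loria", "A.C. Siena"))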

I would also recommend a re-structuring of Section 6.3. The authors should first introduce the purpose of this evaluation, which should go beyond which clustering algorithm works better for the dataset, and explain how the quality of the resulting dataset is determined by the provided clustering metrics. Afterwards, the baselines for noise filtering should be introduced (i.e. with clustering algorithms and by setting a threshold w.r.t. the maximum number of redundant words). I believe a discussion of the results of these experiments should be placed in a separate subsection, which would highlight the important findings and how these relate to the quality of the resulting dataset. The paper would also benefit from a qualitative evaluation section focusing on potential problematic alignment cases in the corpus, and on when these are likely to occur.
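
For instance, the clustering-based noise filtering baseline could be framed as in the minimal sketch below, assuming sentences have already been embedded as vectors; this is an illustration of the kind of baseline I mean, not the authors' implementation.

    # Sketch: treat the minority cluster of a 2-way k-means over sentence vectors
    # as noise, and report the silhouette score as a quality signal.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    def filter_noise(vectors):
        km = KMeans(n_clusters=2, random_state=0, n_init=10).fit(vectors)
        quality = silhouette_score(vectors, km.labels_)
        minority = np.argmin(np.bincount(km.labels_))
        keep_mask = km.labels_ != minority
        return keep_mask, quality

    vectors = np.random.rand(200, 300)   # placeholder sentence embeddings
    keep, sil = filter_noise(vectors)
    print(f"kept {keep.sum()} of {len(keep)} sentences, silhouette={sil:.3f}")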

Some further notes on the provided manuscript:

Minor error in the first sentence of Introduction; a trivial change could be as follows: "... will go further than what has been done in the past in order to turn Wikimedia into a fundamental ecosystem of free knowledge..."

In the fourth paragraph of Section 4.1, there is a reference to "Mario Kirev" who is not mentioned in Figure 1--maybe the reference should have been to Simone Loria in Figure 3 (i.e. Simone Loria is mentioned with the "He" pronoun in the figure), but this should become clearer in the manuscript.

There is a set of minor typos across the entire manuscript, and some further proof-reading is advised.

Some notes on the accompanying repository:

Please include a link to the GitHub repository in the provided manuscript.

To facilitate easier processing of the output data in the repository, I would urge the authors to provide token position and offset values (i.e. position of the tokens in the input) for each annotation included in the label_sentence_1 and label_sentence_2 fields of the resulting CSV files.
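
For example, character offsets could be derived along the lines of the sketch below (my own illustration; only the label_sentence_1 and label_sentence_2 field names come from the repository, everything else is hypothetical).

    # Sketch: compute character offsets for each annotated token so that users can
    # locate annotations in the source sentence without re-tokenizing.
    def token_offsets(sentence, tokens):
        """Return (token, start, end) triples by scanning the sentence left to right."""
        offsets, cursor = [], 0
        for tok in tokens:
            start = sentence.find(tok, cursor)
            if start == -1:              # token not found verbatim (e.g. after normalization)
                offsets.append((tok, None, None))
                continue
            end = start + len(tok)
            offsets.append((tok, start, end))
            cursor = end
        return offsets

    sentence = "Simone Loria played for Siena from 2006."
    print(token_offsets(sentence, ["Simone Loria", "Siena", "2006"]))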

References

K. Toutanova, D. Chen, P. Pantel, H. Poon, P. Choudhury, and M. Gamon, "Representing Text for Joint Embedding of Text and Knowledge Bases", in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 2015, pp. 1499–1509, doi: 10.18653/v1/D15-1174.
Y. Mrabet et al., "Aligning Texts and Knowledge Bases with Semantic Sentence Simplification", in Proceedings of the 2nd International Workshop on Natural Language Generation and the Semantic Web (WebNLG 2016), Edinburgh, Scotland, 2016, pp. 29–36.
C. Gardent, A. Shimorina, S. Narayan, and L. Perez-Beltrachini, "Creating Training Corpora for NLG Micro-Planners", in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, 2017, pp. 179–188.
H. Elsahar et al., "T-REx: A Large Scale Alignment of Natural Language with Knowledge Base Triples", in Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 2018.
P. Vougiouklis, E. Maddalena, J. Hare, and E. Simperl, "Point at the Triple: Generation of Text Summaries from Knowledge Base Triples", J. Artif. Intell. Res., vol. 69, pp. 1–31, 2020, doi: 10.1613/jair.1.11694.

Review #3
By Diego Moussallem submitted on 02/Oct/2021
Suggestion:
Major Revision
Review Comment:

The paper presents an approach for mapping Wikidata triples to Wikipedia sentences. The authors point to the difficulty of maintaining knowledge across over 300 languages on Wikipedia and propose a task to this end. However, the problem stated by the authors regarding the maintenance of knowledge across over 300 languages is not fully addressed. From the abstract onwards, one expects a multilingual mapping approach, or at least some machine translation supporting the process. It is well known that Wikipedia/Wikidata contains particular knowledge only in specific languages, and sometimes the knowledge presented in one language differs from that in another language when talking about the same thing, e.g., the content on Cataluña in the Spanish and Catalan Wikipedias. However, the authors work only on English, and for this language we already have many related works. The work has its merits and is valuable, but the paper becomes confusing several times due to the number of repetitive and unnecessary explanations about the Wikidata structure and the extraction process. I suggest that the authors shorten the Wikidata section (Section 3); the paper is submitted to the Semantic Web community, hence it is not necessary to explain basic concepts repeatedly or to remind the reader of what has been said previously.

The literature review contains works related to the language generation task involving Wikidata and is indeed well-investigated. However, the authors do not refer to works that investigate the relation extraction problem or the Text2RDF challenge presented, for example, at WebNLG.

Regarding the methodology, I suggest creating an architecture overview to clarify the mapping process and the components used for the readers. The approach sounds reasonable, but the queries only consider the most frequent properties and entities. Also, the authors rely on specific tools or approaches such as Word2vec. Thus, I wonder how this approach would generalize to other properties/entities and to domain-specific text.

Considering the evaluation section, I appreciated the code, the data, and the experiments on different entity linkers and clustering algorithms, but the authors did not complete their ablation study with other components such as word embeddings. They state:

"We mainly depend on Word2vec models to assess the word relatedness (semantic similarity) between labeled sentences and their corresponding quads by statement context to see some relationships"

Why did the authors rely only on Word2vec? They should also have investigated other approaches, such as GloVe; a drop-in comparison would be cheap to run (see the sketch below). Moreover, a human evaluation is missing to assess the approach completely.
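
A sketch of such a comparison with gensim's pretrained vectors is given below; it assumes a token-level set similarity, which may differ from the authors' exact computation, and the token lists are illustrative.

    # Sketch: run the same relatedness computation with Word2vec and GloVe vectors
    # loaded via gensim-data. The similarity measure here is an assumption.
    import gensim.downloader as api

    MODELS = ["word2vec-google-news-300", "glove-wiki-gigaword-300"]
    sentence_tokens = ["he", "played", "for", "siena", "from", "2006"]
    quad_tokens = ["member", "of", "sports", "team", "siena"]

    for name in MODELS:
        kv = api.load(name)
        s = [t for t in sentence_tokens if t in kv]
        q = [t for t in quad_tokens if t in kv]
        if s and q:
            print(name, kv.n_similarity(s, q))
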
"we realize that the very rare chance to have a mapping successfully when we receive 18510 sentences over 113913 articles scanned. The reason may come from the language diversity or our narrow working scope on single sentences. Therefore, we may face the problem of low resources in the translation task."

Why did the authors not investigate this point further?

Moreover, I expected an evaluation involving other languages; the authors could reuse the datasets from Kaffee and Vougiouklis. It would be a great addition.

Apart from the content issues, the authors should spend some time revising their English; some sentences are lengthy, and the future tense is used in the wrong place several times.

Minor:

Update reference 23

especially isa relations supposed to attain better outcome performance -> especially is a

Fig. 1 ("Data structure of item Q137280 in Wikidata") is unnecessarily huge