Review Comment:
The paper proposes a methodology for aligning Wikidata statements with natural language sentences from Wikipedia. In general, the paper is well-written, and most parts of the proposed pipeline are clearly described. I also appreciate the authors' effort to offer a mathematical formulation for the entire process of building the resulting dataset. The authors have already made their data openly available at a persistent URL on GitHub (i.e. at https://github.com/tahoangthang/Wikidata2Text). The repository is in good shape, and sufficient documentation about the resources is provided.
My concerns regarding the paper mostly relate to: (i) how the work compares to other existing works that seek to build similarly-minded datasets, and (ii) how the quality of the resulting corpus is evaluated. The work introduces a methodology for obtaining high-quality alignments between English Wikipedia sentences and Wikidata quads (i.e. subject-property-object triples along with their corresponding qualifier information). However, I believe the provided Corpus Evaluation section does not offer sufficient evidence about the "closeness" of the resulting quads to their corresponding sentences and their annotations.
Furthermore, the data collection involved quads derived from triples with six different properties (P26, P39, P54, P69, P108 and P166). To a substantial extent, the work in the paper is motivated by its usefulness for training Natural Language Generation systems. Nonetheless, I believe that a system focused on the realisation of only six properties would have vastly limited lexical variability. Consequently, the authors should elaborate further on their decision to narrow their experiments to these six properties, and on how feasible it would be to extend their data collection process to less popular properties in the knowledge graph.
The literature review focuses mostly on research articles that propose Natural Language Generation systems, and briefly describes their proposed solutions. However, there has been an extensive line of work on building datasets that align knowledge base triples with texts. The works by Toutanova et al. (2015), Gardent et al. (2017), Elsahar et al. (2018) and Vougiouklis et al. (2020) are only some representative examples, which I believe should be included in the literature review. Furthermore, while the work by Mrabet et al. (2016) is included in this section, there is not enough information about how the proposed methodology relates to it. I believe the paper would greatly benefit from a paragraph clarifying how the proposed dataset creation process differs from other works that provide alignments of DBpedia or Wikidata triples with Wikipedia sentences, in particular the ones most relevant to it (Mrabet et al., 2016; Gardent et al., 2017; Elsahar et al., 2018). Including a table with comparative statistics with respect to various dataset characteristics, such as the number of dataset instances, and entity and property coverage, would also be very useful.
While I appreciate the technical details provided in Section 5.3, I believe the paper would also benefit from a clearer description of the input and output of each algorithm used during the data collection process. A figure, or an example that is extended across the entire section, would be very helpful in this direction. I also believe some of the provided algorithms focus too much on development details. I would recommend moving some of these details to a separate Appendix, allowing this section to highlight the aspects of greater scientific merit in the work.
More details should be provided about the setup used for the evaluation of the mapping in Section 6.1. My understanding is that the goal is to explore how well a Named Entity Recognition (NER) system can match the subjects and objects of a quad in the corresponding sentence. It is not clear from the manuscript how the five different NER systems are leveraged for Type Matching, or which part of the dataset is used for training and testing (i.e. in the case of the tr-******-sc* setups). I believe $N_u$ is defined as the set of matches that were made incorrectly by the Entity Linking algorithm; if that is the case, this should be explained more clearly in the second paragraph of Section 6.1.
I would also recommend restructuring Section 6.3. The authors should first introduce the purpose of this evaluation, which should go beyond determining which clustering algorithm works better for the dataset, and explain how the quality of the resulting dataset is determined by the provided clustering metrics. Afterwards, the baselines for noise-filtering should be introduced (i.e. using clustering algorithms and setting a threshold on the maximum number of redundant words). I believe a discussion of the results of these experiments should be placed in a separate subsection, which would highlight the important findings and how they relate to the quality of the resulting dataset. The paper would also benefit from a qualitative evaluation section focusing on potentially problematic alignment cases in the corpus, and on when these are likely to occur.
Some further notes on the provided manuscript:
Minor error in the first sentence of Introduction; a trivial change could be as follows: "... will go further than what has been done in the past in order to turn Wikimedia into a fundamental ecosystem of free knowledge..."
In the fourth paragraph of Section 4.1, there is a reference to "Mario Kirev", who is not mentioned in Figure 1--perhaps the reference should have been to Simone Loria in Figure 3 (i.e. Simone Loria is referred to with the "He" pronoun in the figure), but this should be made clearer in the manuscript.
There are a number of minor typos across the entire manuscript, and some further proof-reading is advised.
Some notes on the accompanying repository:
Please include a link to the GitHub repository in the provided manuscript.
To facilitate easier processing of the output data in the repository, I would urge the authors to provide token position and offset values (i.e. position of the tokens in the input) for each annotation included in the label_sentence_1 and label_sentence_2 fields of the resulting CSV files.
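To illustrate the kind of offset information suggested above, a minimal sketch follows. It assumes simple whitespace tokenisation; the function name and output format are hypothetical and would need to be adapted to the actual annotation scheme used in the CSV files:

```python
def token_offsets(sentence):
    """Return (token, token_index, char_start, char_end) tuples,
    assuming whitespace tokenisation. A hypothetical sketch, not
    the authors' format."""
    offsets = []
    pos = 0
    for i, token in enumerate(sentence.split()):
        # Locate the token starting from the current search position,
        # so repeated tokens are assigned distinct offsets.
        start = sentence.index(token, pos)
        end = start + len(token)
        offsets.append((token, i, start, end))
        pos = end
    return offsets

# Example usage on an arbitrary sentence:
offsets = token_offsets("Simone Loria plays for Bologna")
```

Storing such (index, start, end) values alongside each annotation would let downstream users map labels back to the raw sentence without re-tokenising.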
References
K. Toutanova, D. Chen, P. Pantel, H. Poon, P. Choudhury, and M. Gamon, “Representing Text for Joint Embedding of Text and Knowledge Bases,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 2015, pp. 1499–1509, doi: 10.18653/v1/D15-1174.
Y. Mrabet et al., “Aligning Texts and Knowledge Bases with Semantic Sentence Simplification,” in Proceedings of the 2nd International Workshop on Natural Language Generation and the Semantic Web (WebNLG 2016), Edinburgh, Scotland, 2016, pp. 29–36.
C. Gardent, A. Shimorina, S. Narayan, and L. Perez-Beltrachini, “Creating Training Corpora for NLG Micro-Planners,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, 2017, pp. 179–188.
H. Elsahar et al., “T-REx: A Large Scale Alignment of Natural Language with Knowledge Base Triples,” in Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 2018.
P. Vougiouklis, E. Maddalena, J. Hare, and E. Simperl, “Point at the Triple: Generation of Text Summaries from Knowledge Base Triples,” J. Artif. Int. Res., vol. 69, pp. 1–31, 2020, doi: 10.1613/jair.1.11694.