Towards a Question Answering System over the Semantic Web

Tracking #: 1811-3024

Authors: 
Dennis Diefenbach
Andreas Both
Kamal Singh
Pierre Maret

Responsible editor: 
Axel Polleres

Submission type: 
Full Paper
Abstract: 
Thanks to the development of the Semantic Web, a lot of new structured data has become available on the Web in the form of knowledge bases (KBs). Making this valuable data accessible and usable for end-users is one of the main goals of Question Answering (QA) over KBs. Most current QA systems query one KB, in one language (namely English). The existing approaches are not designed to be easily adaptable to new KBs and languages. We first introduce a new approach for translating natural language questions to SPARQL queries. It is able to query several KBs simultaneously, in different languages, and can easily be ported to other KBs and languages. In our evaluation, the impact of our approach is proven using 5 different well-known and large KBs: Wikidata, DBpedia, MusicBrainz, DBLP and Freebase as well as 5 different languages namely English, German, French, Italian and Spanish. Second, we show how we integrated our approach, to make it easily accessible by the research community and by end-users. To summarize, we provided a conceptional solution for multilingual, KB-agnostic Question Answering over the Semantic Web. The provided first approximation validates this concept.

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By John McCrae submitted on 23/Feb/2018
Suggestion:
Minor Revision
Review Comment:

This paper presents a new system for question answering that is thoroughly evaluated on a large number of benchmarks and shows solid improvement over most of the existing systems. As such, this work makes a solid contribution and should be published. The main claim of innovation appears to lie in the query generation methodology and the pattern-based approach combined with limitations extracted from the data. The authors claim that this comes from interpreting the question "without considering syntax", although the results show that this assumption accounts for 9% of the errors made by the proposed system, so perhaps it should be reconsidered in further work. The major weakness of this paper is that the methodology is quite poorly explained, especially in Section 4, where much of the notation is unclear and the logic behind the algorithms is not well explained. I would imagine that adding diagrams to more clearly show the conditions b_1, ..., b_4 used in Algorithm 2 would greatly help.

The introduction is also quite short and does not explain the contribution of this paper.

p1.
"question answering" should not be capitalized in general
"Section X" is normally capitalized

p2.
"reefied" => "reified"

p4.
", i.e." is odd

p5.
"SPARQL query is bigger": Not sure what is meant with big here
"is not reflecting" => "does not reflect"

p6.
"dbo:deathPlace Germany" would not necessarily indicate a "German mathematician"... do these queries overgenerate?
"as is shown" (no it)

p8.
In Algorithm 2 the following are never defined x_{k,1}, L^(k)
The statement L <- L u (T u (s_1, s_2, s_3)) looks odd... maybe it is incorrect?
I would guess that if (k != K) then, should be if (k >= K) then

p9.
"previews" => "previous"

p10.
"allow *us* to decide"
Be careful with thousand separators: 108,442 75,810 80,000

p11.
I think it would make more sense to group results by language... comparing F-Measures on different languages is problematic as the questions may be quite different

p12.
"moderate hardware": I am not really sure what this means
"reeification" => "reification" (twice)

p13.
"was supporting" => "supported"
"now it *also* supports"
"support *for* a new"

p15
Is [45] really "2016 (to appear)"?

Review #2
By Svitlana Vakulenko submitted on 18/Mar/2018
Suggestion:
Minor Revision
Review Comment:

The paper describes the WDAqua approach to question answering over multiple knowledge graphs and summarises the evaluation results across several benchmarks for this task. It covers the details of the proposed approach, references competing implementations and provides their evaluation results, and briefly discusses frequent sources of errors. The proposed approach is unsupervised. It is based on a set of reasonable heuristics that rely on string matching and restricted breadth-first search. The main advantage of the proposed approach over the related work is a comprehensive evaluation across several benchmarks, adjusted to work with several languages and multiple knowledge graphs.
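
A purely illustrative sketch of the two heuristics mentioned in this summary (n-gram string matching of the question against KB lexicalizations, followed by a depth-restricted breadth-first search to connect the matched resources). The toy lexicalization index, toy KB, Wikidata-style identifiers, and the function names below are invented for illustration and do not reflect the paper's actual implementation, which uses a Lucene index over labels.

from collections import deque

# Toy lexicalization index: surface form -> KB resource
# (in the paper this role is played by a Lucene index over labels/aliases).
LEXICALIZATIONS = {
    "angela merkel": "wd:Q567",
    "born": "wdt:P19",
    "place of birth": "wdt:P19",
}

# Toy KB graph as an adjacency list of (predicate, object) pairs.
KB = {
    "wd:Q567": [("wdt:P19", "wd:Q1715")],   # Angela Merkel -> born in Hamburg
    "wd:Q1715": [("wdt:P17", "wd:Q183")],   # Hamburg -> country Germany
}

def match_ngrams(question, max_len=4):
    """Return KB resources whose lexicalization matches some question n-gram."""
    tokens = question.lower().rstrip("?").split()
    matches = set()
    for n in range(max_len, 0, -1):
        for i in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[i:i + n])
            if ngram in LEXICALIZATIONS:
                matches.add(LEXICALIZATIONS[ngram])
    return matches

def candidate_triples(question, max_hops=2):
    """Depth-restricted BFS from matched entities, keeping paths that use a matched property."""
    matches = match_ngrams(question)
    properties = {r for r in matches if r.startswith("wdt:")}
    entities = matches - properties
    candidates = []
    for start in entities:
        queue = deque([(start, [])])
        while queue:
            node, path = queue.popleft()
            if len(path) >= max_hops:
                continue
            for pred, obj in KB.get(node, []):
                new_path = path + [(node, pred, obj)]
                if any(p in properties for _, p, _ in new_path):
                    candidates.append(new_path)
                queue.append((obj, new_path))
    return candidates

if __name__ == "__main__":
    question = "Where was Angela Merkel born?"
    print("matched resources:", match_ngrams(question))
    for path in candidate_triples(question):
        print("candidate:", path)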

The paper describes an original contribution in terms of development and comprehensive evaluation of a question answering system over several RDF knowledge graphs. The system provides a strong baseline evaluated across several popular benchmarks covering a variety of languages and knowledge graphs. It is also publicly available via a user-friendly interface.

There are a few directions in which the quality of the manuscript can be further improved:

3.1.
* Spell out the IRI abbreviation.
* What is N?
* Describe the tokenisation approach. Are the n-grams word- or character-based?
* Apache Lucene is a search engine, not an index technology.

3.2.
* SPARQL queries take too much space on the page

3.3.
* The first query does not cover the word ‘born’, as suggested by the example in the text.

3.4.
* It is not clear either how the logistic regression is used or how the threshold is chosen.

3.6.
* “There are no NLP tools used in the approach”. Stemming and stopword removal are NLP techniques. The authors might instead refer to parsing techniques in this case, such as part-of-speech taggers and dependency parsers.

4
* Typo in Algorithm 1 line 9: d_r22
* Algorithm 2 is not intuitive. What is the role of the “if not” conditions?

5.1.
* Typo: “despite previews works”

5.2.1
* Was the error analysis performed on the whole QALD-6 dataset? How many errors were there in total? A table summarising the error analysis (classes of errors, number and percentage of errors) would be beneficial.
* SimpleQuestions: how were the 100 questions selected? Was the selection biased by the frequency of properties in the dataset? Why was this step necessary?
* Discussion referencing the evaluation results and limitations of the approach is missing. In particular, performance on complex questions and the new benchmarks: LC-QuAD and WDAquaCore0Questions.

Review #3
By Gerhard Wohlgenannt submitted on 27/May/2018
Suggestion:
Minor Revision
Review Comment:

The authors present a method for QA over Linked Data, which -- in contrast to existing work --
is mostly independent of the underlying knowledge base and natural language used for querying.
The presentation of the method is quite clear, and a plus of the article is the extensive evaluation
using different QA datasets like QALD (various datasets), SimpleQuestions, etc.

The method for query generation is not too complex, but original in its ambition to be KG- and (natural) language-agnostic.
So, in summary, the biggest advantages are that the approach is agnostic to the natural language used and to the knowledge base,
and also that it can be used to query multiple KBs at the same time (although it currently does not support (runtime-)parallel querying of KBs).
The main downside is that the approach still needs some manual configuration / training in some of its components, so
for someone external trying to use the method on a new dataset this might be an issue, especially given the lack of specific knowledge
on how to set up / train this system.
Some manual work is still necessary in steps like providing lexicalizations for entities/properties, tuning to new languages
(with stopwords/stemming), training the ranking (step 3.4 in the paper), etc.
Furthermore, at the moment the system only supports queries of low complexity and, for example, does not yet handle expressions
like superlatives in queries.
Finally, the approach is more general and does not always keep up with other state-of-the-art methods trained on the specific QA dataset, but
that is to be expected. Hopefully, future work (e.g. on errors coming from the lexical gap or missing support for superlatives)
will improve the performance metrics.

But, in total, I appreciate the effort to move in the direction of providing a platform for potentially querying any dataset in the LOD cloud
in (theoretically) any natural language, and I see it as a significant contribution to QA on LD, and therefore
recommend acceptance of the paper, given that a few minor issues (see below) are addressed.

Minor revision action points:
-----------------------------

Most of the issues here reflect a wish to see some clearer ideas or at least discussion on how to address the current downsides of the presented approach.

- Manual effort needed. Please provide some discussion:
In your approach, where do you see the possibility to further reduce manual effort (without harming QA performance), and how?
For areas where it is not possible to remove manual work, does your system provide a clear description
(e.g. on GitHub) for someone interested in adopting the system of where and how to provide this manual work -- so that it is not a hindrance
to adoption of the system? Or is it at least planned to provide such a description at some point?

It might be good to add a paragraph to the paper that summarizes in which components manual tuning/training is necessary,
and how much effort is to be expected...

- According to your evaluations, issues with the lexical gap between queries and the KB (labels) are the biggest source of errors (around 40%).
I'd like to see some ideas (e.g. in future work) on how to address this issue, as it seems very important for performance.
Furthermore, there has been a lot of work in recent years based on distributional semantics (with word embeddings),
which might be useful to better align terms in queries with terms in the datasets. Just as an example, fastText embeddings are currently available
for 294 languages, trained on Wikipedia (https://github.com/facebookresearch/fastText/blob/master/pretrained-vect...).
So this might be something to add to the system, although I am aware that embeddings can make the system much "heavier".
--> So I am not advocating embeddings necessarily; I'd just like to hear some ideas on how to address this (biggest) source of errors in your system in future work.
(And, for example, the manual alignment of SimpleQuestions properties with lexicalizations from the SimpleQuestions questions is not elegant,
so some ideas are needed; one possible direction is sketched below.)
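
A purely illustrative sketch of this embedding idea: pre-trained word vectors, such as the fastText .vec files mentioned above, could be used to rank KB property labels by similarity to a question phrase and thus narrow the lexical gap. The file name, labels, and simple vector-averaging strategy are assumptions for illustration only, not the authors' (or the reviewer's) implementation.

import numpy as np

def load_vectors(path, limit=50000):
    """Load a fastText .vec file (plain text: a header line, then 'word v1 ... vd')."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        next(f)  # skip the "count dimension" header line
        for i, line in enumerate(f):
            if i >= limit:  # cap the vocabulary to keep memory bounded
                break
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.array(parts[1:], dtype=np.float32)
    return vectors

def phrase_vector(phrase, vectors):
    """Average the word vectors of a phrase, ignoring out-of-vocabulary words."""
    words = [vectors[w] for w in phrase.lower().split() if w in vectors]
    return np.mean(words, axis=0) if words else None

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_property_labels(question_phrase, property_labels, vectors):
    """Rank KB property labels by embedding similarity to a question phrase."""
    q = phrase_vector(question_phrase, vectors)
    scored = []
    for label in property_labels:
        v = phrase_vector(label, vectors)
        if q is not None and v is not None:
            scored.append((cosine(q, v), label))
    return sorted(scored, reverse=True)

if __name__ == "__main__":
    vectors = load_vectors("wiki.en.vec")  # hypothetical local copy of fastText vectors
    labels = ["place of birth", "date of birth", "place of death", "spouse"]
    print(rank_property_labels("where was she born", labels, vectors))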

- You state that language/KB-agnostic QA has been "poorly" addressed so far.
But as this is important to the paper, you should more clearly describe how exactly other systems have tried to address language/KB-agnostic QA
(and contrast it with your system where necessary and useful) -- if I missed it while reading, please point me to it.

- On p2, you state "our approach can be directly used by end-users." How? Does "end-user" refer to people asking QA questions, or to someone
applying your approach to their LD dataset? And please add, for example, the GitHub URL of WDAqua.

- Style: The writing style is in general sometimes a bit too casual for my taste. For example the Abstract starts with
"*Thanks* to the development of the Semantic Web ..."
or the Introduction section starts with:
"Question answering (QA) is an *old* research field ... "
IMHO parts like these should be formulated in a more formal way -- but I leave this to the other reviewers/the editor to decide.

Also I think in cases like in the last sentence of the abstract,
"To summarize, we provided a conceptional solution for multilingual, KB-agnostic Question Answering over the Semantic Web."
the use of present tense ("we provide") would be more appropriate.

- [Optional:] Discuss, if possible, how much effort is to be expected when integrating this search functionality, e.g. into someone's local
DBpedia/Wikidata endpoint or a tool that works with these datasets ... if you think this question can be answered and is helpful to the reader.

Typos:
p6 "create new training dataset" -> "create new training datasets"
p10 "multiple-languages" --> "multiple languages"
p10 "perform worst" --> "perform worse" ?
p12 "4 core of Intel .." --> "4 cores of Intel .."
p13 "an unified interface" --> "a unified interface"