Review Comment:
The paper presents Fact Extractor, a NLP pipeline that allows to generate machine-readable statements based on the extraction of n-ary relations from a textual corpus.
The pipeline shows its potentialities when applied to KB enrichment by exploiting textual corpora (e.g., Wikipedia articles for DBpedia).
The overall quality of writing is good. However, some issues may need clarification, furthermore, not too much, but some minor issues like typos exist that require some changes (see below). The structure of the paper is very clear and consistent with the hypotheses and contributions listed in the abstract/introduction. Nevertheless, some sections could be merged together (e.g., Section 2 -> 1, Section 8 -> 7, etc.).
Some claims should be toned down. For example, in Section 3 (Use case) when the authors motivate the reason behind their choice of adopting the Soccer domain for the use case, they argue that 5% of the whole English Wikipedia is a significant portion of the whole chapter. However, 5% can be hardly considered a significant portion of a dataset/sample. Moreover, this ratio is provided with respect to the English Wikipedia, while in the rest of the paper and in the evaluation the authors use the Italian Wikipedia. Hence, the ratio should be provided with respect to the Italian Wikipedia or at least the authors should motivate this incoherence.
In Section 5 (Corpus Analysis) the authors "argue that the loss of information is not significant and can be neglected despite the recall costs". Again, the term significant should be used with more accuracy or ir should be better justified by providing supporting data and proper analysis.
The authors should provide more details for the following parts:
* Section 6: provide ranks and more context about the selection of the LUs esordire, giocare, perdere, rimanere and vincere.
* Section 8: how many rules were defined for the normalisation of numerical expressions? Are these rules available for consultation?
* Section 11.1: no information about the inter-rater reliability is provided. This information is needed in order to demonstrate the value of the
goldstandard built for the evaluation.
* Section 11.2: can the data about property statistics made be available?
The representation of generated statements should be better clarified. In fact, in Section 2 they authors provide an example that does not reflect the semantics provided by the example in Section 9 (page 10). Namely, in the first example the authors state that, from the sentence
"In Euro 1992, Germany reached the final, but lost 0-2 to Denmark", the pipeline generates:
* a new statement "Germany defeat Denmark" where defeat is the frame
* a set of facts about the new property by using FEs.
Instead, in the second example, for the same sentence, the pipeline generates
* a new statement ":Germany defeat :Defeat01"
* a set of facts about the object :Defeat01
The semantics of the first example leds to extensional issues when adding new facts to a property/frame. This issues seem to be resolved by the second example. However, the statement is completely different as no explicit binary relation is generated between Germany and Denmark. Rather, a more coherent (with respect to first example) and correct formalisation could be the following:
:Germany :defeat_01 :Denmark .
:defeat_01 rdfs:subPropertyOf :defeat .
:defeat_01 :winner :Denmark ;
:loser :Germany ;
:competition :Euro_1992 ;
:score "0-2" .
The state of the art is focused on (open) information extraction and semantification. Nevertheless, it misses the comparison with a large slice of relevant works in the areas of machine reading, n-ary relations extraction, existing frame related KBs and theories in the Semantic Web. These works include FRED [1] (see also the paper submitted to the SWJ [2]), FrameBase (see also the paper submitted to the SWJ [4]), [5], [6] and [7].
The comparison with Legalo is unfair (cf. [8] for more details) In fact, Legalo:
* relies on FRED (and its frame-based representation of natural language sentences) for generating binary relations;
* works either on Wikipedia articles or generic free text inputs;
* can be used for KB enrichment (cf. property matcher in the architecture of Legalo).
-- MINOR COMMENTS
* page 1: "DECIPHERING its meaning". If you do not use any cryptographic techniques the
term understanding would be more appropriate.
* page 2: to "to CREDIBLE (thus high-quality)" -> "to RELIABLE (thus high-quality)"
* page 2: "from raw text and produces e.g, " -> "from raw text and produces, e.g, "
* page 6: "The selected LUs comply to" -> "The selected LUs comply WITH"
* page 7: "We alleviate this through EL techniques". What does EL stand for?
* page 9: "which we call reification". I would say "called reification"
* page 13/14: "In avevage, most raw properties" -> "ON avevage, most raw properties"
[1] "Knowledge Extraction Based on Discourse Representation Theory and Linguistic Frames". Valentina Presutti, Francesco Draicchio, Aldo Gangemi. EKAW 2012: 114-129
[2] http://semantic-web-journal.org/system/files/swj1297.pdf
[3] "FrameBase: Representing N-Ary Relations Using Semantic Frames". Jacobo Rouces, Gerard de Melo, Katja Hose. ESWC 2015: 505-521
[4] http://semantic-web-journal.org/system/files/swj1239.pdf
[5] "Gathering lexical linked data and knowledge patterns from FrameNet". Andrea Giovanni Nuzzolese, Aldo Gangemi, Valentina Presutti. K-Cap 2011: 41-48
[6] "Frame Detection over the Semantic Web". Bonaventura Coppola, Aldo Gangemi, Alfio Gliozzo, Davide Picca, Valentina Presutti. ESWC 2009: 126-142
[7] "Towards a Pattern Science for the Semantic Web". Aldo Gangemi and Valentina Presutti. Semantic Web Journal 1.1, 2 (2010): 61-68.
[8] "From hyper- links to Semantic Web properties using Open Knowledge Extraction". Valentina Presutti, Andrea Giovanni Nuzzolese, Sergio Consoli, Aldo Gangemi, and Diego Reforgiato. Accepted for publication on Semantic Web Journal (http://semantic-web-journal.org/system/files/swj1195.pdf).
|
Comments
Part of this submission mentioned in a WWW 2016 paper
I would like to inform the readers that the paper From Freebase to Wikidata: The Great Migration [1] - accepted at the WWW 2016 industry track [2] - mentions the FBK StrepHit soccer dataset in Section 4.4.
The dataset is generated upon the baseline classifier output described in this SWJ submission (cf. Sections 1, 10 and 14).
[1] http://static.googleusercontent.com/media/research.google.com/en//pubs/a...
[2] http://www2016.ca/program/industry-track.html#from-freebase-to-wikidata-...