Review Comment:
Boosting Document Retrieval with Knowledge Extraction and Linked Data
The paper investigates the benefits of using Linked Open Data as well as Knowledge Extraction techniques for real-world document retrieval tasks. It presents and evaluates an approach based on semantic-based expansion of queries and documents that has been implemented in the KE4IR system. The study is conducted in accordance with scientific standards, using well-recognized evaluation metrics and benchmarks; the source code of the system implementation is also made publicly available for further reuse and analysis. In addition to the provided source code, the paper is well written and gives enough detail to both reproduce and fully understand the presented results.
Interestingly, the authors strengthen several findings of one of their previous studies (introducing KE4IR), and show on many large-scale datasets that document retrieval can be improved by defining systems that integrate indexing and querying approaches exploiting Linked Open Data and Knowledge Extraction techniques. I would like to stress the large and very much appreciated engineering and evaluation effort made by the authors in developing, testing, and evaluating their system. Based on my understanding and analysis of this work, and even if the discussion part could have been extended to cover important aspects of document retrieval that are not discussed (mentioned hereafter), I recommend accepting the proposed work, which I consider very interesting.
Comments are provided hereafter – note that most of the following remarks are comments and not modifications that have to be made:
• To complete your state of the art, note that other works have explored ontology-based information retrieval using semantic similarity measures and aggregation operators, e.g. User centered and ontology based information retrieval system for life sciences, Ranwez et al. This approach differs from traditional VSM extensions since it relies on a direct assessment of semantic similarity and is based on Yager's operators. Still regarding the state of the art, works on question answering using Knowledge Representations could also have been mentioned.
• The use of the dot product instead of the cosine similarity could be discussed further (p. 6), since the cosine similarity was originally chosen in order to take only vector orientation into account and to avoid distinguishing vectors by their ‘magnitude’. Dropping ||q||_2 is indeed not a problem for your use case, since it is constant for a given query and does not affect the ranking. However, the mentioned side effects of incorporating ||d||_2 lead to a deeper discussion. Indeed, by using a similarity expressed only through the dot product, you implicitly assume that vectors are expressed in an orthonormal basis, which is far from being the case in your modeling (since the symbolic features associated with so-called semantic terms, themselves associated with vector dimensions, are even linked by logical implications). This discussion is also related to the way the normalized frequency is then computed. An extended discussion of that aspect could be provided. I also recommend that the authors study works related to cosine extensions (e.g. dw-cosine); you can for instance refer to Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model by Sidorov et al. 2014.
• In my view, w(l(t_i)) should not be incorporated into q_i but rather included in the definition of d . q, i.e., sum_{i=1}^n ( d_i . q_i . w(l(t_i)) ); if not, following your argument we could define sim(d,q) = sum_{i=1}^n ( tf_d(t_i,d) . tf_q(t_i,q) . idf(t_i,D) . w(l(t_i)) ).
• From a practical point of view, using sum_{l \in L} w(l) = 1 is somewhat misleading, since it suggests that the weight expresses the importance given to each layer, which is not the case given that the number of dimensions associated with each layer is not the same (it could also be interesting to give their respective sizes). You should consider this remark when the results are discussed (e.g., p. 10). “a w(SEMANTICS) value of 0.0 means that only textual information is used (and no semantic content), while a value of 1.0 means that only semantic information is used (and no textual content).” Yes, but w(SEMANTICS) = 0.5 does not necessarily mean equal importance.
• When considering the TYPE semantic layer, a modeling based on the Information Content of types could also be interesting; see, among works and references proposed by other authors, Semantic similarity from natural language and ontology analysis, Harispe et al. (preprint on arXiv). This could be used to modify the way the normalized frequency is computed. In addition, implicit and explicit mentions of a topic, e.g. a TYPE, could also have been discussed, as it could be interesting to distinguish both cases, e.g. talking about Mathematicians (in general) and mentioning Mathematicians are two different things. Similarly, explicit use of semantic relatedness (and not only indirect semantic similarity, as you do) could also be considered. For example, when talking about vector spaces, matrices, eigenvalues, linear systems and Gaussian elimination, I indirectly refer to important concepts related to Linear Algebra; however, given the a priori knowledge your model considers, none of those URIs would implicitly refer to Linear Algebra (e.g. in the TYPE semantic layer). An interesting way of incorporating this would be to integrate a ‘weak semantic layer’ that could, for instance, consider similarities of word embeddings (à la Word2Vec/GloVe…).
• 5.3.3: for future work, it would be very interesting to provide the same results in a setting that does not only exploit topic titles.
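To make the remark on the dot product more concrete, here is a minimal Python sketch; it is not taken from KE4IR. The three toy dimensions, the vectors and the feature-similarity matrix S (with a value of 0.5 between two related types) are assumptions chosen purely for illustration. It contrasts the plain dot product, the cosine similarity, and the soft cosine of Sidorov et al., which drops the orthonormal-basis assumption by weighting pairs of features with their similarity.

```python
import math

def dot(d, q):
    """Plain dot product: implicitly treats dimensions as orthonormal."""
    return sum(di * qi for di, qi in zip(d, q))

def cosine(d, q):
    """Cosine similarity: dot product normalized by both vector norms."""
    norm = math.sqrt(dot(d, d)) * math.sqrt(dot(q, q))
    return dot(d, q) / norm if norm else 0.0

def soft_cosine(d, q, s):
    """Soft cosine: s[i][j] is the similarity between features i and j."""
    def bilinear(a, b):
        return sum(s[i][j] * a[i] * b[j]
                   for i in range(len(a)) for j in range(len(b)))
    norm = math.sqrt(bilinear(d, d)) * math.sqrt(bilinear(q, q))
    return bilinear(d, q) / norm if norm else 0.0

# Hypothetical dimensions: [textual term, type A, type B],
# where type A and type B are related (e.g. one implies the other).
q_text = [1, 0, 0]
d_short = [1, 0, 0]   # short document, only the query term
d_long = [2, 2, 2]    # longer document, many other features

# The dot product rewards the larger vector; the cosine does not.
print(dot(d_short, q_text), dot(d_long, q_text))
print(cosine(d_short, q_text), cosine(d_long, q_text))

# Query mentions type A only, document mentions type B only.
q_type = [0, 1, 0]
d_type = [0, 0, 1]
S = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.5],   # assumed similarity between type A and type B
     [0.0, 0.5, 1.0]]

print(cosine(d_type, q_type))          # orthogonal under the standard basis
print(soft_cosine(d_type, q_type, S))  # non-zero once feature similarity is used
```

With the identity matrix as S, the soft cosine reduces to the ordinary cosine, so the choice of S is exactly where the dependencies between semantic dimensions would enter.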
Discussion could also be improved by mentioning:
• Aspects related to multilinguality (this can be a strength of your system),
• Management of multiple LOD resources (do we align resources first?),
• Use of uncertainty metrics related to disambiguation/NERC (why not incorporate this information into the model, since it is of major importance?),
• Extensions to more refined state of the art IR models,
• And objectively, from a practical point of view, based on the comparison made with state of the art IR systems, considering the improvement we observe in Table 9 as well as the processing overhead mentioned on p. 8, is it really worth it? Is the general IR problem, by definition, not suited to the use of refined ‘contemporary’ semantic-based approaches?
Minor comments:
• Even if |v| is sometimes used to refer to the Euclidean norm, ||v|| or even ||v||_2 makes the reference to the L_2 norm unambiguous.
• log(0) is undefined in eq. 5, and eq. 7 is undefined when the set in the denominator is empty.
• Mentions of PIKES and KE4IR (but not FRED) are set in a specific font; this should be changed if it is not intentional.
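Regarding the undefined cases of eq. 5 and eq. 7, a minimal Python sketch of the kind of guard I have in mind is given below. The two equations are not reproduced in this review, so the functions are hypothetical stand-ins (a log-scaled frequency and a frequency normalized by a sum over a set), not the paper's actual definitions.

```python
import math

def log_scaled_frequency(count):
    """Hypothetical stand-in for a log-based weight: 1 + log(count).

    Returns 0.0 for an absent term instead of evaluating log(0).
    """
    return 1.0 + math.log(count) if count > 0 else 0.0

def normalized_frequency(count, all_counts):
    """Hypothetical stand-in for a ratio whose denominator sums over a set.

    Returns 0.0 when the set is empty (or sums to zero) instead of dividing by zero.
    """
    total = sum(all_counts)
    return count / total if total > 0 else 0.0

print(log_scaled_frequency(0))       # guarded case, no math domain error
print(normalized_frequency(2, []))   # guarded case, no ZeroDivisionError
```

Stating the convention for these boundary cases next to the equations would be enough; the code only shows one possible choice.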