Assessing deep learning for query expansion in domain-specific Arabic information retrieval

Tracking #: 2070-3283

Wiem Lahbib
Ibrahim Bounhas
Yahya Slimani

Responsible editor: 
Guest Editors Semantic Deep Learning 2018

Submission type: 
Full Paper
Abstract. Term mismatch influences negatively the performance of Information Retrieval (IR). User queries are generally imprecise and incomplete, thus lacking important terms useful to understand the user's need. Therefore, detecting semantically similar words in the matching process becomes more challenging, especially for complex languages including Arabic. Employing classic models based on exact matching between documents and queries in order to compute the required relevance scores cannot resolve such problem. In this article, we propose to integrate domain terminologies into the Query Expansion process (QE) in order to ameliorate the Arabic IR results. Thus, we investigate different experimental parameters such the corpus size, the query length, the expansion method and the word representation models, namely (i) word embedding; and (ii) graph-based representation. In the first category, we use neural deep learning-based model (i.e. word2vec and GloVe). In the second one, we build a cooccurrence-based probabilistic graph and compute similarities with BM25. We compare Latent Semantic Analysis (LSA) with both of them. To evaluate our approaches, we conduct multiple experimental scenarios. All experiments are performed on a test collection called Kunuz which provides documents in several domains. This allows us to assess the impact of domain knowledge on QE. According to multiple state-of-the-art evaluation metrics, results show that incorporating domain terminologies in the QE process outperforms the same process without using terminologies. Results also show that deep learning-based QE enhances recall.
Minor Revision

Solicited Reviews:
Review #1
By Dagmar Gromann submitted on 21/Jan/2019
Minor Revision
Review Comment:

I would like to thank the authors for carefully taking the reviewers' comments into consideration and for providing a detailed explanation of the applied changes. However, the document still requires proper editing and proofreading. Several comments from previous proofreading rounds are still in the text without having been applied, e.g. in Section 4.4.3, and a substantial number of serious mistakes, as well as language problems and careless editing, drastically lower the quality of the article. A high-quality manuscript that additionally corresponds to the style guide is a prerequisite for this paper to be accepted.

General comments:
- the abstract is still dense, very hard to read, and too long (see comments on style guide below).
- do not talk about "deep neural network-based methods" when referring to embeddings, since most of those architectures are anything but deep and in fact usually have only one hidden layer; they are neural network-based but not deep. While they might still be described as "deep learning models", they cannot be referenced as "deep neural networks"
- Table 2: isn't the POS of the FARASA tool in the first row also wrong? If not, Ayed's tool's result has to be wrong, unless both POS results are correct here
- nominal entities? Only nouns or noun phrases as well?
- graph mining: what else would be a context if not the sentence? "We chose to consider each sentence as a context."
- IDF(q_i) needs to be ITF(q_i) after equation 3; what is the difference between "n" in (3) and "N" in (4) - align if the same
- 4.4.3 Neural deep learning => deep learning or neural networks; this section requires serious attention (as do all sections following 5)
- provide references for word2vec and GloVe instead of the link to the online repository (4.4.3)
- where do your embeddings come from? Pretrained or trained on which corpus?
- "vector of semantics" does not mean anything - it is dense vectors in distributed semantic space
- 5.2 "the morphological tool suite" => which one? Also, you mention the IR platform in the comments - include what you used for your implementation in the paper
- Section 6: do not claim that you apply deep learning models when all you use are cosine similarities of trained word embeddings - call them word embeddings or embedding models
- "Skip-gram seems to be the less efficient because it deals better with very large amount of data" is a contradiction
- describe the results in the tables, not only provide the tables
- one expert evaluated terminologies extracted in 97 domains?
- CBOW etc. do not per se use cosine or correlation metrics at all - you use trained embeddings and calculate distances using the above metrics - rephrase in the section "Impact of the expansion method"
- there are two tables numbered Table 9 in this paper, and what are those numbers in the second Table 9? First of all, English uses a period instead of a comma for 0.88 (not 0,88); what do the values (0,9; 40) mean in the first column?
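To illustrate the point raised above about the embedding models: CBOW and Skip-gram are training objectives that only produce the word vectors; the similarity used for query expansion is computed afterwards, e.g. as cosine similarity between those vectors. A minimal sketch (the toy three-dimensional vectors below are made up for illustration, not the authors' trained embeddings):

```python
import math

# Hypothetical toy embeddings; in practice these come from a trained
# word2vec/GloVe model, not from the CBOW/Skip-gram objective itself.
embeddings = {
    "book":    [0.9, 0.1, 0.3],
    "library": [0.8, 0.2, 0.4],
    "river":   [0.1, 0.9, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def expansion_candidates(query_term, k=2):
    """Rank the remaining vocabulary by cosine similarity to the query term."""
    q = embeddings[query_term]
    scored = [(t, cosine(q, v)) for t, v in embeddings.items() if t != query_term]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:k]
```

Here the distance metric (cosine) is entirely separate from the model that produced the vectors, which is the distinction the review asks the authors to make explicit.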

Comments regarding previous reviews:
- by your description of P@n, P@0 would measure how many relevant documents there are in the first 0 returned ones, which does not make any sense at all; please clarify and explain this, especially in the article. Also, which IR platform are you talking about? Mention it in the article
- there is no Section 3.1 as referenced in the reply to comments by Reviewer 2
- the paper clearly was not proof read by an English native speaker
- change Section 6 from "interpretation" to "Discussion"
- Tables and figures still separate the flow of the text into two columns at the top and two columns at the bottom - this issue has not been solved
- it is not enough to describe the embedding training process in the reply to reviews - this needs to be described in detail in the paper, including all hyperparameters chosen for the training process (window size, minimum frequency, etc.) - also: are you going to make the code and embeddings publicly available?
- the highlighting in Table 8 needs to be also explained in the paper
- the whole idea of the reference terminology is still unclear to me - who is that one expert on 97 domains? They must be a genius. In addition, evaluation by one expert does not make the resources a reference terminology.
- the building of the terminologies could require some further clarification
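Regarding the P@n comment above: precision at n is only defined for n >= 1, which is why "P@0" is meaningless under the paper's own definition. A minimal sketch of the standard metric:

```python
def precision_at_n(ranked_doc_ids, relevant_ids, n):
    """P@n: fraction of the top-n retrieved documents that are relevant.

    Only defined for n >= 1; a cutoff of 0 leaves nothing to evaluate,
    which is the problem with the paper's 'P@0'.
    """
    if n < 1:
        raise ValueError("P@n requires n >= 1")
    top = ranked_doc_ids[:n]
    return sum(1 for d in top if d in relevant_ids) / n
```

For example, if the top two returned documents contain one relevant document, P@2 = 0.5.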

Style guide:
Please ensure that your final submission corresponds to the SWJ manuscript preparation guidelines; your paper currently does not entirely. For instance, the abstract should be no longer than 200 words, the authors should be listed with their complete addresses, subheaders should not be indented, etc. Please check each single point of the guidelines against your final manuscript. For the sake of political correctness and adherence to modern publishing standards, I again strongly encourage you to stop using the male form only to represent users, that is, use his/her instead. I saw that some instances have been changed to s/he, but this is not consistent with using "his", which then needs to be changed to his/her, as do all other instances of gendered references. Please make sure that your tables fit on one page (if they themselves are not longer than one page) and do not extend to the next page, as is currently the case with Table 8.

Minor comments:
For proofreading, please consider seeking the help of an English native speaker proficient in the language to the point of being able to edit a scientific journal paper. Someone with such skills should be able to spot mistakes such as the following examples (in order of appearance and only examples, not a complete list):

"Term mismatch influences negatively..."" => "Term mismatch negatively influences" or negatively needs to go to the end of the sentence
"lacking important terms useful to understand" => "lacking important terms useful to understanding" or "lacking important terms that are useful to understand"
"such the corpus size" => "such as corpus size" - omit "the" for all items of this list (omission also where you copy the exact same sentence to the introduction)
"word embedding;" => "word embeddings," (with "s" missing throughout the paper)
"graph-based representation" => "graph-based representations"
"neural deep learning-based model" => "neural network-based models" or "deep learning models" both with "s"
explain what BM25 is in abstract, e.g. "ranking function"
"We compare Latent..." => "We compare the results of Latent..."
"relevance which represents" => "revelance, which" but a little bit unclear
"problematic" => is an adjective, did you mean "problem"?
"what is exactly the user is looking for" => "what exactly the user is looking for"
"gather a user feedback" => "gather user feedback"
"in two levels" => "on two levels"
"That is, one way" => "In other words" ("that is" at the beginning of sentence is highly unusual - please replace everywhere)
"ontologies and dictionaries building" => "ontology and dictionary building" => the former is called ontology learning and not building
"are shareing same specific area" => "share the same specific area"
"are homographic and are written in the same way" => homographic means written in the same way
"agglutination ambiguity" => "agglutinative ambiguity"
"have about 30 interpretation" => "has approximately 30 interpretations"
"Arabic language is also" => "The Arabic language" (throughout the paper)
"manifests through" => "manifests itself through"
"it will be distinguish to carefully choose" => meaning?
"Abderrahim [2] build" => builds
"improve searching results" => "improve search results"
"using a relational database search engine" => does not fit this sentence - check grammar and sentence structure
"On the other side" => this linker needs to be "On the other hand"
"ayed's tool" => "Ayed's tool"
"falsify the calculus" => ???
"To do so" => after "terminologies must exist" does not make any sense
"minimal terminology of the for each domain" => something missing
if Target Domain Detection is supposed to be the acronym TDD, then it should be capitalized correctly throughout the document
200 word-vector => 200 word vectors
"Precision at N" => "Precision at n" and P@n
... and many more
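On the request to explain BM25 in the abstract: it is the Okapi BM25 ranking function, which scores a document against a query from term frequency, document length, and inverse document frequency. A minimal sketch with the usual default parameters (illustrative only, not the authors' implementation):

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Okapi BM25 score of a document for a query.

    'doc' and the documents in 'corpus' are lists of tokens;
    k1 and b are the customary default parameters.
    """
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N  # average document length
    score = 0.0
    for q in query_terms:
        df = sum(1 for d in corpus if q in d)        # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        tf = doc.count(q)                            # term frequency in doc
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score
```

A document containing a query term always scores above one that does not, and longer documents are penalized via the length-normalization term controlled by b.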