Efficient Exploration of Scientific Articles using Topic-based Hashing Algorithms

Tracking #: 2123-3336

Carlos Badenes-Olmedo
José Luís Redondo-García
Oscar Corcho

Responsible editor: 
Guest Editors Semantic E-Science 2018

Submission type: 
Full Paper
Abstract:
Searching for similar documents and exploring major themes covered by groups of documents are common actions when browsing collections of scientific articles. This manual knowledge-intensive task may become less tedious and may even lead to unexpected findings if algorithms are applied to help researchers. Most text mining algorithms represent documents in a common feature space that abstracts away from the specific sequence of words used in them. Probabilistic Topic Models reduce that feature space by annotating documents with thematic information. On this low-dimensional latent space, some locality-sensitive hashing algorithms have been proposed to perform document similarity search. However, thematic information is hidden behind binary hash codes, preventing thematic exploration and limiting the explanatory capability of topics to justify content-based similarities. This paper presents a novel hashing algorithm based on approximate nearest-neighbor techniques that uses hierarchical sets of topics as hash codes. It not only performs efficient similarity searches, but also allows those queries to be extended with thematic restrictions, explaining the similarity score from the most relevant topics. Extensive evaluations on both scientific and industrial text datasets validate the proposed algorithm in terms of accuracy and efficiency.

Major Revision

Solicited Reviews:
Review #1
By Daniel Garijo submitted on 03/Mar/2019
Major Revision
Review Comment:

This paper presents a hashing algorithm for topic modeling techniques designed to improve their efficiency. In addition, the authors claim that their approach improves the explainability of topic modeling results by grouping topics in a hierarchical manner.

The paper is well written and easy to follow. The research topic of the paper is not new, but the proposed extensions demonstrate that the authors have done their homework on the state of the art and propose a valuable contribution. I think the paper is a good fit for this special issue, but I am a little divided regarding the relevance of the approach. The special issue focuses on Semantic e-Science, and while the authors' work clearly deals with the detection of similar scientific work in an explainable manner, the "semantic" aspect of the contribution is unclear to me. Are semantics or knowledge representation used in the approach? If so, how? The authors sometimes refer to the "data type" of the papers, but that is not further elaborated. I suggest the authors clarify this in the next revision of the manuscript. Below I describe other comments and suggestions that I think should be addressed as well:

- The experiments define distance metrics and compare their performance against one another. However, they are not compared against the state of the art. Why? I think that even if the hierarchy level is the same, a comparison is needed to understand how the current approach performs. In addition, why is precision selected as a metric and not the F-measure? Is recall not considered important in this case?

- How does the current approach improve efficiency? Tables 4-7 show the ratio of data consumed, but there is no indication of how this affects the overall efficiency of the topic-based models. How does this translate into time improvement? Is this improvement worth the loss in precision? The first technique does not seem to yield good results. Is it there just as a baseline?

- Human validation is not present. This seems critical for similarity based techniques, specifically if there is not a large ground truth. Are human-based evaluations going to be part of the future work?

- I am a little confused by the claim of the approach being appropriate for topic detection on unseen texts. An illustrative example would be helpful.

- Figures 1-4 show small variation when changing the number of topics. In particular, Fig 4 seems to be very consistent. Are the variances in the graphs significant? Is it better to have a lower number of topics?

- The data type of the papers is claimed to be important for the topic algorithms. However, it's not mentioned in Section 3. Why is this?

Presentation comments/small issues below:

- The problem statement is not clear until after the second page of the paper. I believe it should be made clear to readers before that point.

- The authors describe the source code as a contribution of their work. The source code supports an implementation of your approach, and that should be the contribution (instead of the code), right? In addition, the development of corpora to validate and test your approach is a separate valuable contribution in my opinion.

- Why is not addressing the triangle inequality problem an issue?

- Could different distance techniques yield different results in Section 4.2?

Review #2
By Anita de Waard submitted on 04/Mar/2019
Minor Revision
Review Comment:


This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions, which include
(1) originality,
(2) significance of the results, and
(3) quality of writing.

This paper addresses the issue of document similarity, which is used to determine which papers are similar to a specific paper, e.g. on a (publisher’s) website or search engine. The authors argue that their technique using a probabilistic topic model (PTM) is better to use than other techniques, and propose a novel algorithm for ‘semantic hashing’, where word vectors are mapped to bit codes to speed up the time it takes to compare document vectors. Their specific approach uses hierarchical sets of topics which are evaluated using the Jensen-Shannon Divergence metric.
• Overall, the paper is well-written and well laid-out; the reader is clearly guided through the various stages of the research.
• The problem and the authors’ contributions are well-identified and although I am not an IR expert and cannot gauge whether the right references have been used, the related work seems well-researched. It is certainly well-argued that this paper adds something to existing systems.
• The paper has a well-described set of software and data at a GitHub site: https://github.com/cbadenes/Large-scale-Topic-based-Search
Questions and comments:
• Though mostly clear, the language is not always grammatically correct, and the paper would benefit from a close reading by a native speaker, mostly to sort out particles (e.g., “This is just an example about the data structure that will support the different hashing strategies.”); the occasional odd phrase (e.g., “For example, searching for articles in Biomedical domain similar to an article about Semantic Web.”); and use of the ‘to-infinitive’ where it is not grammatically correct (e.g., “However, the algorithms proposed in this work allow to add new restrictions to the initial query…”).
• Can you explain the following statement: “Searching for similar documents in a domain described by a set of topics cannot be performed using binary hash codes.”? This is not immediately apparent to nonspecialists. Overall, a brief introduction to semantic hashing would be appreciated.
• Please give a reference and a brief description of “JSD” (probably Jensen-Shannon Divergence, but it would help to spell it out), as used in: “The similarity metric used in experiments is JSD, due to is used in literature [42][1][31] …”
• I am afraid I do not understand why the numbers for the three methods are so vastly different (and perhaps, therefore, what they really mean). Can you please provide an example (e.g., for one document, there are 100 similar ones) and walk us through what an ‘ideal’ Table 1/2/3 or 4/5/6 would look like?
• I do not understand Figure 11. Please clarify what we are looking at: what is a lot, what is a little? What is ideal, what are we looking for?
• I appreciate you have not done any user studies, but what are your preliminary thoughts, as humans looking at the different results: are the recommended similar document sets significantly different for a given document?
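
For reference on the JSD comment above, the following is a minimal sketch of the Jensen-Shannon Divergence between two topic distributions, using its standard definition with base-2 logarithms. The function names and the example distributions are illustrative assumptions and are not taken from the paper or its repository.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.

    Terms with p_i == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """JSD(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), where m = (p + q) / 2."""
    # m_i is positive whenever p_i or q_i is positive, so no division by zero occurs.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical topic distributions of two documents over three topics.
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.1, 0.2, 0.7]
print(jensen_shannon_divergence(doc_a, doc_b))
```

With base-2 logarithms the value lies between 0 (identical distributions) and 1 (distributions with no topic in common), which may help when reading the similarity scores.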

Review #3
By Anonymous submitted on 21/Apr/2019
Minor Revision
Review Comment:

This paper presents a hashing method based on topic models for text data. The authors address the hashing step used in the approximate nearest-neighbor problem. The proposed method uses topics as hash codes to improve the interpretability of similarity comparisons between documents, while allowing users to restrict their document queries by topic. Basically, the proposed algorithm first computes the topic distribution of each document using a standard topic model algorithm, and then tries to cluster those topics to obtain the binary hash code of each document.
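
To make the summary above concrete, here is a minimal sketch of using hierarchical sets of topics as hash codes, as the abstract describes. It is not the authors' implementation: grouping topics by rank into fixed-size levels, and the names `topic_hash`, `build_index`, and `search`, are simplifying assumptions chosen only for illustration.

```python
from collections import defaultdict


def topic_hash(distribution, levels=2, per_level=1):
    """Return a tuple of topic-id sets: level 0 holds the most relevant topics,
    level 1 the next most relevant ones, and so on (assumed rank-based grouping)."""
    ranked = sorted(range(len(distribution)), key=lambda t: distribution[t], reverse=True)
    return tuple(
        frozenset(ranked[i * per_level:(i + 1) * per_level])
        for i in range(levels)
    )


def build_index(corpus, levels=2):
    """Hash every document and bucket documents by their level-0 topic set."""
    hashes = {doc_id: topic_hash(dist, levels) for doc_id, dist in corpus.items()}
    buckets = defaultdict(list)
    for doc_id, code in hashes.items():
        buckets[code[0]].append(doc_id)
    return hashes, buckets


def search(hashes, buckets, query_dist, levels=2, must_include=None):
    """Return documents sharing the query's level-0 topics, optionally restricted
    to documents whose hash code contains the topic `must_include` at any level."""
    code = topic_hash(query_dist, levels)
    found = buckets.get(code[0], [])
    if must_include is not None:
        found = [d for d in found if any(must_include in level for level in hashes[d])]
    return sorted(found)


# Hypothetical corpus: document id -> topic distribution over three topics.
corpus = {
    "d1": [0.6, 0.3, 0.1],
    "d2": [0.5, 0.1, 0.4],
    "d3": [0.1, 0.2, 0.7],
}
hashes, buckets = build_index(corpus)
query = [0.55, 0.25, 0.20]
print(search(hashes, buckets, query))                  # candidates sharing the top topic
print(search(hashes, buckets, query, must_include=2))  # same, restricted to topic 2
```

Because the hash code keeps topic identifiers rather than opaque bits, the thematic restriction is a plain membership test, which is the kind of query the abstract says binary codes do not support.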

The paper is very well written and organized. Objective and methodology are presented clearly. Also, source code and data are provided which makes the work more reproducible.

However, there are some issues with the experiments and qualitative results:

The proposed hashing method is based on a topic model, but the paper does not mention which topic model algorithm is used for the evaluation. (LDA with Gibbs sampling, or another variant?)

The paper does not experimentally compare the proposed methods with other hashing methods. For example, it would be better to include the precision or complexity of other hashing methods to see whether the proposed methods are comparable to the state of the art.

For each run of the topic model algorithm, the learned topic distribution would be different, and the paper mentions that the experimental results are averaged. It would be better to include the variance of precision in the tables so that we can see whether the differences in precision are really significant.

In addition, the paper claims that the number of topics does not influence the performance of the algorithm significantly, based on a comparison of results for different numbers of topics. But for each dataset, I see only two values being used (say, for the CORDIS dataset, the paper tests only 70 and 150 topics). The authors may consider plotting a line chart of precision against the number of topics to make the argument more convincing.

Finally, why not show the recall results, so that we have a better understanding of the algorithm?
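
The requests above for recall and for the variance of precision could be met along the following lines. This is a minimal sketch using the standard definitions of precision, recall, and F1; the document ids and per-run precision values are made-up toy numbers, not results from the paper.

```python
from statistics import mean, stdev


def precision_recall_f1(retrieved, relevant):
    """Compute precision, recall, and F1 for one query from two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # F1 is the harmonic mean of precision and recall; it is 0 when there are no hits.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def summarize(values):
    """Mean and sample standard deviation of a metric over repeated runs."""
    return mean(values), stdev(values)


# Toy query: four documents retrieved, three truly similar documents.
p, r, f1 = precision_recall_f1(["a", "b", "c", "d"], ["a", "b", "e"])
print(p, r, f1)

# Toy precision values from three hypothetical runs of the topic model.
avg, sd = summarize([0.80, 0.84, 0.82])
print(avg, sd)
```

Reporting the standard deviation next to each averaged precision value would let readers judge whether the differences between methods exceed the run-to-run variation.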

Overall, the idea of the paper is nice, but the experimental part needs to be strengthened.