Efficient Exploration of Scientific Articles using Topic-based Hashing Algorithms

Tracking #: 2123-3336

This paper is currently under review
Carlos Badenes-Olmedo
José Luís Redondo-García
Oscar Corcho

Responsible editor: Guest Editors Semantic E-Science 2018

Submission type: Full Paper
Searching for similar documents and exploring the major themes covered by groups of documents are common tasks when browsing collections of scientific articles. This manual, knowledge-intensive work can become less tedious, and may even lead to unexpected findings, when algorithms assist researchers. Most text mining algorithms represent documents in a common feature space that abstracts away from the specific sequence of words they contain. Probabilistic Topic Models reduce that feature space by annotating documents with thematic information. Several locality-sensitive hashing algorithms have been proposed to perform document similarity search on this low-dimensional latent space. However, they hide the thematic information behind binary hash codes, which prevents thematic exploration and limits the ability of topics to explain content-based similarities. This paper presents a novel hashing algorithm based on approximate nearest-neighbor techniques that uses hierarchical sets of topics as hash codes. It not only performs efficient similarity searches but also allows those queries to be extended with thematic restrictions, and it explains the similarity score in terms of the most relevant topics. Extensive evaluations on both scientific and industrial text datasets validate the proposed algorithm in terms of accuracy and efficiency.
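To make the idea of topic sets as hash codes more concrete, the following is a minimal illustrative sketch in Python. It is not the algorithm evaluated in the paper: the grouping rule (ranking topics by weight and cutting the ranking into fixed-size levels), the bucket lookup on the top level only, and all names (`topic_hash`, `build_index`, `similar`, `required_topic`) are assumptions made purely for illustration.

```python
from collections import defaultdict


def topic_hash(dist, levels=2, per_level=1):
    """Hypothetical hash: rank topics by weight, then cut the ranking
    into `levels` groups of `per_level` topics each (most relevant first)."""
    ranked = sorted(range(len(dist)), key=lambda t: -dist[t])
    return tuple(
        frozenset(ranked[i * per_level:(i + 1) * per_level])
        for i in range(levels)
    )


def build_index(docs):
    """Bucket documents by their top-level topic set and keep every full code."""
    buckets = defaultdict(set)
    codes = {}
    for doc_id, dist in docs.items():
        code = topic_hash(dist)
        codes[doc_id] = code
        buckets[code[0]].add(doc_id)
    return buckets, codes


def similar(query_dist, buckets, codes, required_topic=None):
    """Return documents sharing the query's top-level topic set, optionally
    restricted to those whose code mentions `required_topic` at any level."""
    candidates = buckets.get(topic_hash(query_dist)[0], set())
    if required_topic is None:
        return set(candidates)
    return {
        doc_id
        for doc_id in candidates
        if any(required_topic in level for level in codes[doc_id])
    }


# Toy topic distributions over three topics (made-up values).
docs = {
    "a": [0.70, 0.20, 0.10],
    "b": [0.60, 0.10, 0.30],
    "c": [0.10, 0.80, 0.10],
}
buckets, codes = build_index(docs)
query = [0.65, 0.25, 0.10]
print(similar(query, buckets, codes))
print(similar(query, buckets, codes, required_topic=2))
```

Because each code is a readable set of topic identifiers rather than a string of bits, the same structure that retrieves candidates can also filter them by theme and indicate which topics two documents share. That is the property the abstract contrasts with binary hash codes.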