Multi-label categorization for large web-based raw biography texts

Tracking #: 1870-3083

This paper is currently under review
Julien Lacombe
Rémy Chaput
Feras Al Kassar
Marc Bertin
Frédéric Armetta

Responsible editor: Guest Editors Semantic Deep Learning 2018

Submission type: Full Paper
Deep Learning approaches have recently proved effective on Natural Language Processing problems. In this paper, we select the End-to-End Memory Networks approach for its ability to manipulate meaning efficiently, and study its behavior on large semantic problems (long texts, large vocabularies) when trained only on a realistic, modest-sized, and unbalanced web-based dataset. Throughout the study, results and parameters are discussed in relation to the scarcity of data. We show that the resulting system is robust and captures the correct tags, that it reaches very good accuracy on larger datasets, and that its capacity for generalization and semantic extraction makes it an efficient and advantageous complement to other approaches.