Multi-label categorization for large web-based raw biography texts

Tracking #: 1870-3083

Julien Lacombe
Rémy Chaput
Feras Al Kassar
Marc Bertin
Frédéric Armetta

Responsible editor: 
Guest Editors Semantic Deep Learning 2018

Submission type: 
Full Paper
Abstract:
Recently, new approaches based on Deep Learning have demonstrated good capacities for tackling Natural Language Processing problems. In this paper, after selecting the End-to-End Memory Networks approach for its ability to efficiently manipulate meaning, we study its behavior on large semantic problems (large texts, large vocabularies) while training only on a realistic, modest-sized, and unbalanced web-based dataset. Throughout the study, results and parameters are discussed in relation to data scarcity. We show that the resulting system is robust and captures the correct tags, reaches very good accuracy on larger datasets, and can efficiently and advantageously complement other approaches thanks to its capacity for generalization and semantic extraction.

Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 14/May/2018
Review Comment:

The paper evaluates an end-to-end memory network on a task of multi-label categorisation over a dataset of biographies. While the modifications to the end-to-end memory model are quite limited (Section 3.2), the idea behind the experiment is interesting, and the proposed solution does indeed seem better than the ones mentioned in the state of the art. The application domain remains narrow, however: the method is tested on a single dataset, and a domain-independence experiment would have been appropriate for a journal paper.

The choices made by the authors are always well justified, and the comprehensibility of the overall writing is one of the strong points of this paper. While the text tries to remain concise, some parts could have been more detailed (conclusion and perspectives).

The presented results are significant compared to the state of the art. Where the authors' model loses its added value compared to another model (Section 5.3.2), the explanation presented by the authors is satisfying.

Review #2
Anonymous submitted on 15/May/2018
Major Revision
Review Comment:

The paper describes an empirical study of end-to-end memory networks for the task of multi-label classification of documents. The authors compare this method to BM-25 as a baseline method, and focus on the degradation of results as an effect of the size and the unbalanced nature of the dataset.

The main argument of the paper is plausible: when facing small and unbalanced datasets, the multi-label classification problem becomes harder, and models with richer semantic representations may be at an advantage over more "shallow" approaches such as BM-25. The authors intend to reinforce this argument with their experimental results, but unfortunately a severe lack of care in the presentation makes it hard for the reader to be convinced. In my experience, unclear wording and missing explanations accumulate over the course of the paper and require a large effort from the reader to make sense of the presented results. Below I highlight the main shortcomings I identified in the presentation of the paper, but before that I present a point of critique on the approach.

The authors adapt a query-answering method to perform a task they refer to as multi-label categorization. The model itself needs to be given a document and a query, and returns an answer to that query. Apparently (although this is not very clearly described in the paper; I apologize if I misrepresent the method) the work described in the paper uses a single query throughout the experiments, namely "what is the occupation of X?", where X is the subject of the document under consideration. To make the model return multiple labels rather than a single one, the authors replace the final softmax layer by a sigmoid layer. It can be expected that the resulting model will give multiple answers of varying relevance to the occupation query, but the text does not explain what the ground-truth labels are. I assume there is a single occupation per document; so where do the multiple labels come from? Apart from that, I think that having a model give multiple answers to the occupation question does not make it a multi-label categorization model in the typical sense: it is still a model that answers a specific query about the document, rather than one that produces a set of labels concisely describing the document.
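To illustrate the output-layer change in question: a softmax head forces labels to compete for a single answer, while a sigmoid head scores each label independently and a threshold turns the scores into a label set. The sketch below uses a hypothetical four-label vocabulary and a 0.5 threshold, neither of which is specified in the paper.

```python
import numpy as np

def softmax(logits):
    """Standard softmax: scores compete, outputs sum to 1 (single answer)."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

def sigmoid(logits):
    """Element-wise sigmoid: each label is scored independently (multi-label)."""
    return 1.0 / (1.0 + np.exp(-logits))

# Hypothetical final-layer logits over a 4-label vocabulary,
# e.g. ["writer", "poet", "politician", "chemist"].
logits = np.array([2.0, 1.5, 0.2, -3.0])

single = np.argmax(softmax(logits))   # softmax head -> one answer (index 0)
multi = sigmoid(logits) > 0.5         # sigmoid head + threshold -> a set of answers
```

This makes the reviewer's question concrete: with a single gold occupation per document, it is unclear what supervision signal the extra sigmoid outputs are trained against.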

The paper makes some interesting comments about the reasons why the memory model is less affected by incomplete documents and unbalanced datasets (Sections 5.3.1 and 5.3.2), but the offered hypotheses are not validated in any way, so at this point they must be considered speculation.

The following issues mainly concern the presentation:

The abstract does not mention the concrete task being addressed, apart from natural language processing.

From the introduction (third paragraph), it appears as if the paper wants to propose a solution to the shortcomings of deep learning approaches to real-world problems in general ("Proposing a solution to these problems..."). This objective, however, is much too general, since the paper focuses on natural language processing. Please rewrite to clarify.

Since the task addressed in the paper is multi-label categorization of texts, it would be good to focus the introduction a bit more on this task, rather than talk about machine learning for NLP in general.

2.1.2 The introduction/explanation of memory networks is rather vague. A quick look at the literature helped to understand memory networks, but the description in 2.1.2 did not make much more sense after that.

2.1.4, second paragraph: "..., we show that the approach proposed in this paper ..." This sentence appears to say that you are proposing and using an alternative word-embedding method to Word2Vec and GloVe, which is misleading.

3.1.1/3.1.2: The introduction of the matrix C and of c_i. First of all, it is not explained what c_i stands for: it is defined in terms of C in Section 3.1.1, but C is only introduced in Section 3.1.2. Where does C come from? Is it learned?

4.1 According to what method are the questions generated? The questions are supposedly generated from the tags, but if a biography has tags "Victor Hugo", and "Writer", how do you form the question? I mean, how do you infer that "Victor Hugo" is a person and "Writer" is an occupation? ...ah, I see on the website that the tags are not just a bunch of labels, but they are structured (name=..., occupation=...) Please clarify this in the paper.

Equation 6. What does t stand for?

Section 5.1.1, "catch rate": this is an important metric for evaluating the model, but unfortunately it is neither formalized nor described in unambiguous language. It stands for the "mean of label retrieving, considering as much predicted label as the set size of relevant labels." This sounds like the standard IR notion of "recall". If not, please clarify what you mean by "label retrieving".

In Tables 5 and 6 it is not clear what quantity "Result" represents. Consequently, it is also not clear whether higher values are better than lower values or vice versa. Moreover, since the results presented in the tables are not discussed in any way, the reader has no clue as to the optimal memory and embedding sizes. As such, the tables are completely useless. Also, what does "ME" stand for in the table?

Tables 1/2: What are the percentages being shown? Errors? Accuracies? How are these errors/accuracies measured? Please elaborate in the captions.

Figures 3 and 4 are not referred to anywhere in the text.

The authors use both proportions and percentages to represent the catch rate. It would be better to stick to one or the other. In particular, the authors appear to be confused themselves: the numbers "0.49%", "0.52%", and "0.37%" (Section 5.3.3, end of first paragraph) seem to represent proportions, not percentages.

All in all, I think the problems described above make the paper unfit for publication in its current form, and I recommend a major rewrite. The paper would benefit from a formalized description of the task and of the most important metrics, and from a thorough revision of the text to make it more accessible. In particular, the authors should make sure that all tables in the paper are actually discussed in the text, and that tables and figures have informative captions. Finally, a revision by a native English speaker would be welcome to improve unclear wordings.

Review #3
Anonymous submitted on 05/Jun/2018
Review Comment:

The study presented in the paper applies a memory network to the task of multi-label classification of biographies with respect to the person's occupation. The authors' purpose seems to be to evaluate a neural network's ability to address this task. However, it is not clear how, and in which sense, this particular task actually differs from other classification tasks, for which we know that neural networks provide very good results. So the added value of such a paper seems quite limited, or at least not sufficiently spelled out.

What is more, the authors do not provide substantial comparisons with other existing models that have addressed a similar task. They use a probabilistic model as a baseline, but nothing is said about the state-of-the-art results on this biography classification task. The "comparative results" section instead focuses on memory networks addressing question answering or word prediction tasks and is only marginally related to the core problem of the paper.

The first part of the paper contains a long discussion of deep learning and its possible problems, rule-based models, etc., which could be drastically reduced, since these topics are not addressed at all in the rest of the work.

Last but not least, the paper is written in terrible English, which often impedes reading and full understanding of the text. Moreover, it is full of typos and misspelled names (e.g., "word2vect" on page 3, "wordsens" on page 2, among many others), which are quite annoying.

In the evaluation part (Section 5.1.1), it is not clear why the multi-label task prevents the authors from using per-class precision, recall, and F1, and then reporting micro-averaged figures for a comprehensive model evaluation. It would be useful to see which classes are recognized better than others; a discussion on this point might help to get a closer look at the pros and cons of memory networks.
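Such an evaluation is entirely standard for multi-label classification. As a sketch, per-class and micro-averaged scores can be computed directly from binary indicator matrices (the documents and labels below are hypothetical):

```python
import numpy as np

# Hypothetical binary indicator matrices (rows = documents, columns = labels).
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 1],
                   [1, 0, 0]])

tp = (y_true & y_pred).sum(axis=0)        # per-class true positives
fp = ((1 - y_true) & y_pred).sum(axis=0)  # per-class false positives
fn = (y_true & (1 - y_pred)).sum(axis=0)  # per-class false negatives

per_class_recall = tp / (tp + fn)         # shows which classes are caught

# Micro-averaging pools the counts over all classes before computing the ratio.
micro_precision = tp.sum() / (tp.sum() + fp.sum())
micro_recall = tp.sum() / (tp.sum() + fn.sum())
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)
```

Reporting `per_class_recall` alongside the micro-averaged figures would directly answer the question of which occupation classes the memory network handles well.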