SRNA: Semantics-aware Recurrent Neural Architecture for classification of document fragments

Tracking #: 1858-3071

Blaz Skrlj
Jan Kralj
Nada Lavrač
Senja Pollak

Responsible editor: 
Guest Editors Semantic Deep Learning 2018

Submission type: 
Full Paper
Deep neural networks are becoming ubiquitous for natural language processing tasks, such as document classification and translation. Semantic resources, such as word taxonomies and ontologies, are yet to be fully exploited in a deep learning setting. This paper presents an efficient semantic data mining approach, which converts semantic information - related to a given set of documents - into a set of novel features that are used for learning. A recurrent deep neural network architecture is also proposed, enabling the system to learn in parallel from the semantic vectors and from the vectorized documents. The experiments show that the proposed approach outperforms the approach without semantic knowledge, where the main gain in accuracy is observed on the documents of reduced length. We showcase the effectiveness of the proposed approach on the topic categorization, sentiment analysis and gender profiling tasks.


Solicited Reviews:
Review #1
Anonymous submitted on 09/Apr/2018
Review Comment:

This paper proposes to leverage WordNet's synsets to boost performance of deep learning methods on text classification. The work is motivated around classification of document fragments instead of entire documents. The paper shows that the introduced approach outperforms a random forest classifier as baseline.

There are a number of shortcomings in this paper, which, in my opinion, needs more work to be considered for publication:

* The motivation is somewhat vague and unclear. It is clear that text classification is an important task; however, I'm not sure what the main contribution and novelty of this work are. The fact that WordNet is used to semantically enhance features for a deep learning approach is only vaguely discussed, and there is not enough literature review to support this.

* The related work is structured in two disjoint parts, 2.1 and 2.2. However, there is work combining the ideas of both sections, which should be discussed.

* With respect to the novelty, it is said that "this is one of the first approaches" to leverage taxonomies in deep learning. This is again vague, and there are important references missing, which use WordNet in a similar scenario, and should be discussed and compared, e.g.:

Rothe, S., & Schütze, H. (2015). Autoextend: Extending word embeddings to embeddings for synsets and lexemes. arXiv preprint arXiv:1507.01127.

* The related work doesn't show the existing gaps and how the present work contributes to them.

* The classification experiments could be performed on a cross-validation setting, rather than a fixed train/test set.

* I'm not sure about the utility of reporting both accuracy and micro-F1, when they're technically similar metrics. I would suggest to use macro-F1 along with accuracy instead.
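The point above can be made concrete with a small sketch (using scikit-learn; the toy labels are made up for illustration and are not from the paper): for single-label multi-class predictions, micro-averaged F1 coincides with plain accuracy, while macro-F1 additionally reflects per-class balance.

```python
# Toy labels, made up for illustration: in single-label multi-class
# classification, micro-averaged F1 equals plain accuracy, while
# macro-F1 also reflects how evenly the classifier treats each class.
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 2, 2, 2, 1, 0, 1, 1]

acc = accuracy_score(y_true, y_pred)
micro = f1_score(y_true, y_pred, average="micro")
macro = f1_score(y_true, y_pred, average="macro")

assert abs(acc - micro) < 1e-12  # micro-F1 coincides with accuracy
```

Since accuracy and micro-F1 carry the same information here, reporting macro-F1 alongside accuracy would be more informative.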

* More baseline approaches could be used, beyond just Random Forests.

* It is said that experiments took 54 hours to run overall. However this isn't very useful if it's reported for the whole experimentation. How long did it take to run your classifier vs the random forests?

* I'm unsure about the motivation around classifying document fragments, and this should be further fleshed out to make it clearer for the reader. It is said that one may want to classify document fragments in a real-time scenario when results are needed quickly; however, the long runtime (54 hours) seems to suggest that this may not be the case here. The example of tweets combined into a single document is also not clear.

Minor comments:
* In the introduction, "author profiling, gender profiling...", I would recommend to have references for each of the tasks discussed.
* Likewise for association rule learning in the following paragraph.
* Section 2, "representation in (see Section 2.1)", "in" should be removed.
* Section 2, "[13] extensively evaluated ... significantly impact the classifier's performance" --> positive or negative impact?
* "hypernims" -> "hypernyms"
* Figure 6 is hard to read if printed.

Review #2
Anonymous submitted on 25/Apr/2018
Review Comment:

This paper presents a method to include information from a taxonomy, e.g., WordNet, in order to inform a neural text classifier with background knowledge to complement empirical evidence from document text. Given an input corpus, this is achieved by finding so-called "representative hypernyms" for words occurring within documents, and using these to provide semantic vectors alongside document vectors.

This paper is not an easy read: it is rather lengthy (I can imagine no problem reducing it to a short conference paper manuscript). The paper includes no example; adding one would certainly improve the presentation.

Overall, the proposed method is rather ad hoc and it is hard to find any truly semantic methodology here: document processing bypasses the problem of word sense identification (cf. the definition of representative hypernym) and no effort is made to capture the structure of the knowledge base in the vector propositionalization (cf., in contrast, the plethora of works from past years on knowledge base embeddings).

The experimental setting is also rather weak: there is no comparison with the state of the art, only against a bunch of baselines. Consequently, in its current form, it is not possible to assess how well the proposed method fares against competitive approaches.

Minor comments

- Please proofread the manuscript again. Typos include "of-the-shelf" and "hypernim"

Review #3
By Francesco Ronzano submitted on 30/Apr/2018
Minor Revision
Review Comment:

This paper proposes a neural architecture for text classification (SRNA) that relies on both: (i) unsupervisedly learned word embeddings and (ii) semantic feature vectors derived from taxonomical knowledge structures (like the hypernym / hyponym relations of WordNet). After an overview of unsupervised and knowledge-based approaches to represent a document and of neural architectures for text classification, the SRNA text classification approach is presented. In SRNA, each document to classify is modeled by means of the set of embeddings of its words as well as by a semantic feature representation derived from a set of 'representative document hypernyms'. In particular, given a document, such semantic feature representation is derived by considering, for each word w of that document, the set of hypernym lexicalizations shared by all the WordNet-derived synonyms of the word w. Both word embeddings (by means of a Convolutional Neural Network) and semantic features (by means of an LSTM network) are exploited jointly to represent a document to be classified: in the SRNA architecture, the outputs of the Convolutional Neural Network and the LSTM network are concatenated and a sigmoid activation layer is used to identify the most likely class to be associated with each document. The SRNA classification approach is compared with baseline Random Forest and Convolutional Network approaches (based exclusively on word embeddings and / or on semantic features). To this purpose, three widespread document classification datasets are considered. SRNA significantly outperforms the baselines when short document fragments are considered as input to the classifiers (up to the first 100 words of each document).
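The two-branch design summarized above could be sketched in Keras roughly as follows (all dimensions, filter sizes, and layer widths are assumptions made for illustration, not the authors' actual configuration):

```python
from tensorflow.keras import layers, Model

# Hypothetical dimensions -- not taken from the paper.
seq_len, sem_len, vocab_size, emb_dim, n_classes = 100, 50, 10000, 128, 5

# Branch 1: CNN over word-embedding sequences.
doc_in = layers.Input(shape=(seq_len,), dtype="int32")
x = layers.Embedding(vocab_size, emb_dim)(doc_in)
x = layers.Conv1D(filters=64, kernel_size=5, activation="relu")(x)
x = layers.GlobalMaxPooling1D()(x)

# Branch 2: LSTM over semantic feature sequences.
sem_in = layers.Input(shape=(sem_len, 1))
y = layers.LSTM(32)(sem_in)

# Concatenate both branches; sigmoid output layer as described above.
merged = layers.concatenate([x, y])
out = layers.Dense(n_classes, activation="sigmoid")(merged)
model = Model(inputs=[doc_in, sem_in], outputs=out)
```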


The paper describes an interesting approach to exploiting features derived from structured knowledge resources to complement word embeddings in a neural architecture for text classification. The paper is globally well-written. As detailed in the following comments, it would be great if some aspects could be discussed in more depth.

- In Section 3.1 (Propositionalization of the semantic space):
> It would be great if you could provide some more details on the process of selecting the representative hypernyms of a word w: when you gather the synonyms of a word w occurring in a document, if such word w is polysemic, do you collect all the lexicalizations (synonyms) of each sense (synset) associated with that word in WordNet?
> The hypernym relations connect WordNet senses (synsets): given a word w that is polysemic / has more than one sense (synset) associated, do you collect, as the related set of hypernym words, all the lexicalizations of all the hypernyms of each sense (synset) associated with the word w?
> In general, it would be great to provide in Figure 4 a real example of construction of the representative hypernym set.
> When a word of a document does not appear in WordNet, it will not contribute to the semantic vectorial representation of that document. It would be great if you could provide some statistics on the coverage of WordNet with respect to the words / vocabulary of the three datasets you consider (e.g. how many nouns on average match at least one WordNet lexicalization?)
> Could you specify whether you perform stemming / lemmatization before matching a word occurring in a document against WordNet?
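The ambiguity raised in the questions above can be illustrated with a toy sketch. The tiny word-to-senses map below is a made-up stand-in for WordNet (synset names follow WordNet conventions but the data is fabricated), and the pooling over all senses is one possible reading of the 'representative hypernym' construction, not the authors' actual procedure:

```python
# Toy stand-in for WordNet: word -> senses, sense -> hypernym senses,
# sense -> lemma names. All entries fabricated for illustration.
SENSES = {
    "bank": ["bank.n.01", "bank.n.02"],  # financial vs. river sense
}
HYPERNYMS = {
    "bank.n.01": ["financial_institution.n.01"],
    "bank.n.02": ["slope.n.01"],
}
LEMMAS = {
    "financial_institution.n.01": {"financial_institution", "financial_organization"},
    "slope.n.01": {"slope", "incline"},
}

def hypernym_lemmas(word):
    """Pool hypernym lemma names over every sense of `word`, with no
    word-sense disambiguation -- one reading of the construction."""
    out = set()
    for synset in SENSES.get(word, []):
        for hyper in HYPERNYMS.get(synset, []):
            out |= LEMMAS[hyper]
    return out

# Without disambiguation, financial and geographic hypernyms end up
# mixed in the same representative set for the polysemous word 'bank'.
print(sorted(hypernym_lemmas("bank")))
```

Disambiguating first would instead yield only the hypernyms of the chosen sense, so which reading the authors intend matters for the resulting features.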

- In Section 3.2 (Learning from the semantic space):
> Could you provide some more info on which word embeddings you use in your experiments? What is their dimension? How are they trained? Are they pre-trained and fine-tuned on each evaluation dataset?
> Could you specify which loss function you use for the multi-class / multi-label dataset (Reuters)?

- Results:
> When the IMDB dataset is considered, the addition of semantic features seems not to improve the classifier performance with respect to the baseline Convolutional Network (only word embeddings). Could this be related to the fact that the enhanced modeling of the semantics of a text (provided by semantic features) is less useful in sentiment analysis (IMDB dataset) than in the case of more content-related text classification tasks (Reuters dataset)?
> Since SRNA significantly outperforms the baselines with short document fragments (up to the first 100 words of each document), it would be great to provide, when describing the evaluation datasets, some statistics on the distribution of document lengths (avg, std. deviation).
> Could you provide some more details on the "simple forest construction" (Random Forest) approach that you used, that you state could have penalized the performance of Random Forest classifiers (Section 5.3)?

- In general, for reproducibility (as well as to know in more detail the parameters of your neural network - embedding settings, convolutional filters, etc.) it would be great to share the code (Python / Keras) that you used in your experiments.

- 1.Introduction: textual or other unstructured form. --> forms
- 2. Background and related work: word-embedding representation in (see Section 2.1) --> DELETE in
- 3. Proposed SRNA approach: ...for every synonim of the word w. --> synonYm
... {h|h is a hypernim of s} --> hypernYm
...if this hypernim is connected to at least one... --> hypernYm
... FORMULA OF THE CROSS-ENTROPY LOSS; log(p1) --> p1 should not be subscript
- 4.2 Semantic feature construction: ...(rarest hypernyms above a certain frequency threshold)... --> SHOULD IT BE 'most common hypernyms'?