Studying the Impact of the Full-Network Embedding on Multimodal Pipelines

Tracking #: 1859-3072

Armand Vilalta
Dario Garcia-Gasulla
Ferran Parés
Eduard Ayguade
Jesus Labarta
Ulises Cortés

Responsible editor: 
Guest Editors Semantic Deep Learning 2018

Submission type: 
Full Paper
The current state-of-the-art for image annotation and image retrieval tasks is obtained through deep neural networks, which combine an image representation and a text representation into a shared embedding space. In this paper we evaluate the impact of using the Full-Network embedding in this setting, replacing the original image representation in four competitive multimodal embedding generation schemes. Unlike the one-layer image embeddings typically used by most approaches, the Full-Network embedding provides a multi-scale representation of images, which results in richer characterizations. To measure the influence of the Full-Network embedding, we evaluate its performance on three different datasets, and compare the results with the original embedding scheme and with the rest of the state-of-the-art. Results for image annotation and image retrieval tasks indicate that the Full-Network embedding is consistently superior to the one-layer embedding. These results motivate the integration of the Full-Network embedding into any multimodal embedding generation scheme, something feasible thanks to the flexibility of the approach.
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 02/Apr/2018
Review Comment:

This manuscript focuses on the parallel tasks of image annotation and retrieval, which are typically addressed using multimodal embeddings. The manuscript is an extension of previous work published at the SemDeep2 workshop at IWCS 2017 by Vilalta et al. (2017). In that published work, the authors replaced a typical image representation (the last layer of a CNN) with the Full-Network embedding (FNE) suggested by Garcia-Gasulla et al. (2017), a representation which offers a richer visual embedding space by deriving features from multiple layers while also applying discretization. While the previous work is based on the approach of Kiros et al. (2014), this work uses the improved version of Vendrov et al. (2015), and shows consistently improved performance across the three datasets on which the methods are evaluated. In addition, more exhaustive experiments are conducted.


* The evaluation is extensive and fair: the model is compared with the state-of-the-art methods for each task, as well as against the original model by Vendrov et al. (2015) without the FNE component. The authors have controlled for various factors such as hyper-parameter values, dataset splits, and training time. The models are evaluated on three datasets.

* The implementation details are very informative and detailed.

Major Concerns:

* This work is only marginally related to the topic of the special issue: while it is related to deep learning, it is not related to the semantic web. The introduction suggests that the image annotation and retrieval tasks may help semantic image indexing, but this is pretty much the only connection to the semantic web.

* The contribution of this work is rather small: as in the work of Vilalta et al. (2017), while the FNE-based approach is superior to the original CNN-based approach, the proposed methods are consistently outperformed by the state-of-the-art methods in both tasks and on all datasets. The conclusion section suggests that incorporating the FNE representation into these SOTA models may improve their performance as well, but this is left for future work. The additions of this manuscript over the work of Vilalta et al. (2017) are sufficient, but not much more.

* The manuscript needs proof-reading. Specifically:

- The related work section reads like a long list of approach names which are not elaborated or explained. It is quite confusing; it would be better to elaborate on the important approaches, and to omit, or describe more generally, the others.

- Section 3 needs better structure, e.g. start the section by detailing what each subsection will discuss.

- The FNE approach should be described as part of the related work section, as it is previous work and not part of the contributions of the current work. In addition, it is referenced several times before it is introduced in Section 3.1.

- Many grammatical errors and typos (see below).

Minor comments / needs clarification:

* The VGG and UVS models are mentioned several times but never described.
* It seems that the evaluation metrics only capture recall. Isn't there a standard evaluation metric for these tasks that captures precision?
* Section 3.1: what is a pre-trained CNN? What is the training objective and data?
* Section 3.5: define curriculum learning.
* Section 4.2: refer to the equations in the model descriptions.
* Section 4.3: which DL framework did you use? Will the code be made available?
* Table 1: which values were tested for each hyper-parameter?
* Section 5: "results are now very close to the ones obtained by other methods" - this is simply not true.
* When you say something is "significant", did you perform significance tests? If so, please elaborate, otherwise either perform them or change to "substantial" or another modifier.
* Section 5: the observation that dataset size correlates with performance makes sense, but the comparison is not entirely valid because these are different datasets. A better experiment would be to vary the training set size while keeping the same test set.

* References:

- Change references from Ryan Kiros to Jamie Ryan Kiros.
- Change arXiv references of papers that were published in conferences or journals, e.g. [9] was published at ICLR 2016.
- Reference [39] seems unrelated and is never referenced.

Typos and grammar errors:

* Section 2:

- ...of the GRUs and the last fully-connected *layer* of the CNN
- examples focus only *on* the hardest of them
- A different group of methods is based *on* the Canonical...
- a neural architecture that project*s* image*s* and sentence*s*
- best results on *the* Flickr30K dataset
- DANs exploit two... (remove s)

* Section 3.1:

- "value of the each feature" - remove "the"

* Section 3.2:

- all the words in *the* train
- with the GRUs and the word embedding*s*
- the pipeline training procedure consist*s of* the optimization
- Equation 1: small i and c in the sum subscript
- dot product of the vectors as *a* similarity *measure*

* Section 4.1:

- is an extension of Flickr8K *which* includes it

* Section 4.2:

- We will experiment => We investigate

* Section 4.3.2:

- that higher *dimensionality helps obtaining*

* Section 5:

- The second part summarize*s*
- Each of these blocks contain*s*
- Tables 2-3: Flickr8*K* and Flickr30*K*
- "The modifications are We can see" - ungrammatical
- as it introduces => as it incorporates
- this is *e*specially significant
- experimenting on *the* MSCOCO dataset
- SOTA contributions => SOTA methods
- explicit => specify
- It*'*s important to keep in mind
- Weighting the results => Comparing the results
- we see that *their* relative performance
- In *the* experiments on MSCOCO
- add*s* very little improvement

* Section 6:

- exactly same model => exact same model
- these experiments show that the instability *of the* training

* Section 7:

- exactly same model => exact same model
- alleviate*s* this *problem*

Review #2
By Jindrich Helcl submitted on 17/May/2018
Minor Revision
Review Comment:

The paper presents a method for improving multimodal models by integrating the Full-Network embedding (FNE) on the tasks of image annotation and image retrieval. The FNE of an image is computed from the activations of all layers of a CNN, unlike conventional embeddings, which use only the output of the last convolutional (or fully-connected) layer.
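For readers unfamiliar with the technique, the multi-layer extraction the reviewer describes can be sketched as follows. This is a minimal illustration, not the authors' implementation: the pooling step, the dataset-level standardization, and the discretization thresholds are assumptions based on the description above (per-layer spatial pooling, feature standardization, and discretization into three values).

```python
import numpy as np

def full_network_embedding(layer_activations, mean, std,
                           t_low=-0.25, t_high=0.15):
    """Sketch of a Full-Network-style embedding for one image.

    layer_activations: list of per-layer activation tensors from a CNN,
        each of shape (channels, height, width).
    mean, std: per-feature statistics precomputed over a reference set.
    t_low, t_high: discretization thresholds (illustrative values).
    """
    # 1) Spatially average-pool each layer, then concatenate all channels,
    #    so features from every layer contribute to the representation.
    pooled = np.concatenate([a.mean(axis=(1, 2)) for a in layer_activations])
    # 2) Standardize each feature using dataset-level statistics.
    z = (pooled - mean) / std
    # 3) Discretize into {-1, 0, 1} with the two thresholds.
    return np.where(z < t_low, -1, np.where(z > t_high, 1, 0))
```

In contrast, a conventional one-layer embedding would keep only the pooled activations of the final layer, discarding the lower- and mid-level features.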
The experiments were conducted on three datasets of different sizes.

The paper is an original integration piece where the novel contribution is in using a different component created by the authors in a previous work.

The results show that the proposed method beats the baseline consistently. However, while the results are comparable to those of the selected baseline methods, the results of the state-of-the-art methods shown in Tables 3, 4, and 5 are considerably higher in almost all cases. The authors comment that incorporating FNE into the state-of-the-art methods would likely help as well. It is unclear whether conducting these experiments is intended as future work.

The style of the writing is mediocre and could be much better. The reader is often forced to re-read passages of text to understand what is going on. This may partially be because of very long paragraphs, notably at the beginning of the paper. I also recommend adding a paragraph about the paper's structure to the introduction.

Overall, the presented paper is an original contribution, and the results are credible and solid. The writing style should be simplified and the text cleaned of typos.

Some details

section 2, paragraph 4:
based in -> based on
FV -> FVs (used in plural), also FVs are build -> FVs are built
missing comma after "image" in the last sentence

par 5:
exploits -> exploit

section 3.2, par 1
simplify "words (embeddings) in the sentence" -> "embeddings"
I am not sure if you use "affine" and "linear" terms correctly.
If omitting the bias term from an affine transformation makes it a linear transformation, then the affine transformation was linear all along.
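For reference, the standard definitions at issue are:

```latex
f(x) = W x + b \quad \text{(affine)}, \qquad
f(x) = W x \quad \text{(linear, i.e. } b = 0\text{)}
```

A map is linear only if $f(\alpha x + \beta y) = \alpha f(x) + \beta f(y)$, which fails whenever $b \neq 0$; so a transformation with a bias term is affine but not linear.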

par 2:
consist on -> consists of

At the end of Section 3.2, the reader first encounters the fact that there is a normalization going on; this should be clarified and stated earlier.

Section 3.4
The first mention of Hinge loss is in the second paragraph, after all of the equations. It should be said earlier that the equations are actually computing a thing called "hinge loss".
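For reference, the pairwise ranking objective these image-text models typically minimize (the "hinge loss" referred to above) has the following general form, where $s$ is the similarity measure, $\alpha$ the margin, and $c'$, $i'$ contrastive (non-matching) captions and images; the exact formulation in the paper may differ in details:

```latex
\mathcal{L} = \sum_{(i,c)} \Big[ \sum_{c'} \max\big(0,\; \alpha - s(i,c) + s(i,c')\big)
            + \sum_{i'} \max\big(0,\; \alpha - s(i,c) + s(i',c)\big) \Big]
```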

I'd put the last paragraph in a footnote, since it is not really a part of the experiments.

Section 4.1, paragraph Flickr30k
"conform" is not the word you are looking for. Try form, comprise, construct, ...

Section 4.3.1
missing comma after "epoch"

- The situation in which the correct pair could be considered incorrect can happen even if only one caption per image is used. There can be many similar images in the dataset, and these images can have the same captions regardless of the actual reference.

- To avoid two identical captions in a batch, you do not need to throw them out of the dataset. I'd just shuffle each of the five datasets (one for each caption) and then concatenate them. This way, it is very unlikely that two captions of the same image end up in the same batch.
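The shuffle-and-concatenate scheme suggested here can be sketched as follows; the function name and data layout are illustrative, not from the paper:

```python
import random

def build_epoch(images, captions_per_image, seed=0):
    """Build one epoch of (image, caption) pairs per the suggested scheme.

    captions_per_image: list of lists; captions_per_image[i][k] is the
    k-th caption of image i (all images assumed to have the same number
    of captions, e.g. five for MSCOCO/Flickr).
    """
    rng = random.Random(seed)
    n = len(images)
    epoch = []
    # One independently shuffled pass over the images per caption slot,
    # concatenated: two captions of the same image rarely share a batch.
    for k in range(len(captions_per_image[0])):
        order = list(range(n))
        rng.shuffle(order)
        epoch.extend((images[i], captions_per_image[i][k]) for i in order)
    return epoch
```

Every (image, caption) pair still appears exactly once per epoch; only the ordering changes, so no data is discarded.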

- Note that "early stopping" means stopping early. If you train for N epochs and then take the best model as determined on the validation set, it is a very similar principle, but there is no actual "early stopping" involved.

Section 4.3.2
300D -> 300 (in the similar cases as well)

Section 4.3.3
missing comma after "(fc7)" in the first paragraph.

Section 5
Tables 4.4 and 4 -> Tables 4 and 5

Review #3
By Luis Espinosa Anke submitted on 24/May/2018
Minor Revision
Review Comment:

This paper presents a method for multimodal modeling of text and images. It is claimed that, for image representation, typically only a single CNN layer is used, and therefore an improved representation is proposed.

I think the paper would benefit a lot from rephrasing its actual contribution. It would be good to elaborate on the actual modifications to the model's design and implementation, and perhaps to clarify a bit more where it is that the system is intuitively better, what the rationale for the modification is, etc. Then, in addition to the experimental results (which are extensive), include an analysis/qualitative exploration section showing in which cases the proposed method works better, perhaps identifying a pattern that can be exploited in future models by researchers working in the same field.

A major drawback is that the presented method does not seem to work better than existing technologies, and while there is the promise (or hope) that these systems may improve after being enhanced with the proposed architecture, this is not empirically shown, so it is difficult to see a contribution.

Also, I think it would be very important to put this work in the context of semantic web / knowledge graph technologies. For example, it would be interesting to see the performance of this model on the task of hypernymy modeling, which has direct applications in knowledge graph construction and taxonomy learning. In fact, there is prior work that combines both images and text [1].

On the flip side, the experiments are extensive and show a deep understanding of the current state of the art in caption/image retrieval systems, which is highly appreciated.

Finally, the presentation of the paper could be improved. I suggest that the authors proofread the paper, especially the first sections, addressing a recurrent problem of overly long, nested subordinate sentences.

[1] Kiela, D., Rimell, L., Vulić, I., & Clark, S. (2015). Exploiting image generality for lexical entailment detection. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers) (pp. 119-124).