Explainable Zero-shot Learning via Attentive Graph Convolutional Network and Knowledge Graphs

Tracking #: 2465-3679

Authors: 
Yuxia Geng
Jiaoyan Chen
Zhiquan Ye
Wei Zhang
Huajun Chen

Responsible editor: 
Dagmar Gromann

Submission type: 
Full Paper
Abstract: 
Zero-shot learning (ZSL), which aims to deal with new classes that have never appeared in the training data (i.e., unseen classes), has attracted massive research interest recently. Transferring deep features learned from training classes (i.e., seen classes) is often used, but most current methods are black-box models without any explanations, especially textual explanations that are more acceptable not only to machine learning specialists but also to common people without artificial intelligence expertise. In this paper, we focus on explainable ZSL and present a knowledge graph (KG) based framework that can explain the feature transfer in ZSL in a human-understandable manner. The framework has two modules: an attentive ZSL learner and an explanation generator. The former utilizes an Attentive Graph Convolutional Network (AGCN) to match inter-class relationships with deep features (i.e., to map class knowledge from WordNet into classifiers) and to learn unseen classifiers so as to predict the samples of unseen classes, with impressive (important) seen classes detected, while the latter generates human-understandable explanations of the feature transferability with class knowledge that is enriched by external KGs, including a domain-specific Attribute Graph and DBpedia. We evaluate our method on two benchmarks for animal recognition. Augmented by class knowledge from KGs, our framework produces high-quality explanations for the feature transferability in ZSL and at the same time improves the recognition accuracy.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
Anonymous submitted on 01/May/2020
Suggestion:
Accept
Review Comment:

The article is about explainable ZSL and presents a knowledge graph (KG) based framework that can explain the feature transfer in ZSL in a human-understandable manner. The framework has two modules: an attentive ZSL learner and an explanation generator. The former utilizes an Attentive Graph Convolutional Network (AGCN) to match inter-class relationships with deep features (i.e., to map class knowledge from WordNet into classifiers) and to learn unseen classifiers so as to predict the samples of unseen classes, with impressive (important) seen classes detected, while the latter generates human-understandable explanations of the feature transferability with class knowledge that is enriched by external KGs, including a domain-specific Attribute Graph and DBpedia.

I have carefully reviewed the changes that the authors have made in blue highlight. Most of my concerns have been addressed; hence, I recommend accept.

Review #2
By Dagmar Gromann submitted on 31/May/2020
Suggestion:
Minor Revision
Review Comment:

Thank you very much for your careful and substantial revision of the first submission and for your detailed explanations of all changes made. In my view, the paper has substantially improved: many previously unclear points have been clarified and many of my concerns have been resolved. However, there are still some points that need to be addressed.

Specifically, the selection of the classifier in the test phase is still unclear to me. First, v_i and its updated version with attention weights are described as feature vectors. But then in Section 4.2.3. the same variable seems to become a classifier instead of a feature vector. How do you multiply a classifier with image features and how does this "multiplication" result in classification scores? Is this a terminological problem between feature representations of classes and the term classifier? Even if it is a terminological problem, I would still like to understand how this vector multiplication (between image and class vectors) results in a classification score? What is the classifier here?
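For reference, the reading I would have expected, which may or may not be what the authors intend, is the common formulation in GCN-based ZSL where the per-class vector produced by the graph network is treated directly as the weight vector of a linear classifier. A minimal illustrative sketch of that reading (all dimensions and arrays here are hypothetical, not taken from the paper):

import numpy as np

# Hypothetical sizes: d = CNN feature dimension, C = number of candidate classes.
d, C = 2048, 50
image_feature = np.random.rand(d)        # CNN feature of one test image
class_vectors = np.random.rand(C, d)     # per-class vectors output by the graph network

# Each class vector is treated as the weight vector of a linear classifier:
# its dot product with the image feature is that class's classification score.
scores = class_vectors @ image_feature   # shape (C,)
predicted_class = int(scores.argmax())

If this is indeed the intended mechanism, stating it explicitly would resolve the terminological confusion between "feature vector of a class" and "classifier".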

Several terminological and definition inconsistencies still make the paper partially hard to understand and also to reproduce, which should be fixed carefully. For instance, the calculation of the support and confidence values in association rule mining are described almost identically ("c% of attributes in D owned by X are also owned by Y" (confidence), "ratio of attributes that are owned by both X and Y" (support)). It is also stated that "each class involves a classifier", which presumably should be that for each class a classifier is trained? Given this sentence and the problem described in the previous paragraph, maybe a definition of what the authors understand as classifier is in order?
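For comparison, the standard association-rule definitions, read over attribute sets as the paper appears to intend (my reading, with made-up attribute sets), would distinguish the two values as follows:

# Made-up attribute sets; D is the full attribute universe of the dataset.
D = {"stripes", "hooves", "tail", "furry", "fast", "quadrupedal", "spots"}
attrs_X = {"stripes", "tail", "furry", "fast"}        # attributes owned by class X
attrs_Y = {"tail", "furry", "fast", "quadrupedal"}    # attributes owned by class Y

common = attrs_X & attrs_Y
support = len(common) / len(D)            # share of ALL attributes in D owned by both X and Y
confidence = len(common) / len(attrs_X)   # share of X's attributes that Y also owns
print(support, confidence)

Making the denominators explicit in this way in the paper would remove the near-identical wording of the two definitions.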

The creation of a hierarchical structure based on WordNet is not entirely clear to me. It is only stated that each class is aligned with a synset in WordNet. Maybe there are some steps missing in this description? I would also encourage the authors to define what they mean by "visual knowledge" and "visual classifier" as these might be quite ambiguous.

In terms of language, some of the text of the previous submission has been improved, but there is still a considerable number of issues. For instance, the difference between transfer and transferability has only partially been fixed and there are lots of language problems. The newly added parts seem to not even have been subjected to a simple spell checker, e.g. "considerring", "accessable" and "bacause", let alone a proper revision.

Given the described points, I recommend another minor revision of this paper.

Minor comments:
- After two careful readings, I could not find any explanation why the most closely related work, that is, Selvaraju et al. [9] has not been included in the baseline models. Shouldn't this (at least an explanation) be included?
- The following sentence is hard to understand: "With learned feature vectors of classes, we use the CNN classifiers of classes as the training supervision to map inter-class relationship into deep CNN features so that predicting a visual classifier for each class node." What is a visual classifier? What is the difference between the two CNNs in this sentence?
- What does the following sentence on p. 10 mean (esp. the very last part): "...get the support sets of rules (i.e., common attributes) as evidence to output"?
- "seen and unseen entities" in the General KG section should be entities for seen and unseen classes, right? If not, what are seen and unseen entities?
- If you mention someone's work, including software, there really has to be a reference. Please include one for SpaCy.
- Nouns with special meanings is an inadequate definition of named entities (Section 4.3.3.).
- Experiment: I am quite lost with the following sentence: "During validation, the testing samples are predicted on seen classes to evaluate the prediction ability of learned seen classifiers so as to obtain well-trained unseen classifiers." Why would you even use test set samples during validation? And if the seen and unseen classes are mutually exclusive, how could the seen classifier classify test samples?
- "49.2% of unseen classes learn their classifiers": I do not believe that unseen classes learn their classifiers.
- I am not sure that the human evaluation results of the explanations are consistent with the introductory claim about their high quality, since they rather seem to be of medium readability and good to medium rationality.
- "when the number of attributes more than 10, in which the representative characteristics may miss" what does this sentence mean?
- I am not sure "own" can be used as it is used everywhere in the paper

Review #3
By Michael Cochez submitted on 24/Jun/2020
Suggestion:
Minor Revision
Review Comment:

Overall, the authors have addressed most of the comments from the previous iteration of the paper. However, from my perspective there still is a major issue preventing acceptance of the paper.

The issue is that the evaluation is only done on unseen classes. I had raised that issue in my previous review. The authors answered:

“ It is known that there are usually two testing settings in ZSL. One is standard ZSL, which predicts the testing samples of unseen classes on unseen classes. The other is generalized ZSL, where the testing samples of seen and unseen classes are classified with candidate labels from both seen and unseen classes. In this paper, for investigating the feature transferability from seen classes to unseen classes, we focus on the standard ZSL setting to evaluate the prediction ability of unseen classifiers and generate explanations for these unseen classes. It is also worth considering how to deal with explainable ZSL in generalized ZSL setting in real-world applications. Maybe we can adopt a two-phase framework -- a coarse-grained phase to judge if a testing sample comes from seen classes or unseen classes, and a fine-grained phase to make final predictions, where traditional classifiers (e.g., softmax classifiers) are used to predict its label with candidates from seen class set if the sample is from seen classes predicted by coarse-grained phase, and ZSL classifiers are used to predict its label with candidates from unseen class set if it belongs to unseen classes. We can make further attempts for this in the future”

My view is that this is not something to just look at in the future. It is perfectly justified, and even essential, to have an experiment where you want to show transferability. However, I also see a strict need to evaluate with the known classes in place. As far as I currently understand your work, there is also no need for a two-stage process. Just treat the known classes in the same fashion as your unknown ones. This task will of course be harder, and that is exactly the point. I expect the results to be much worse than what you currently obtain. This, however, would still be an interesting outcome, because it would show that 1) you can transfer learn, but 2) when having both known and unknown classes, things do not work as well. Besides, it would be very exciting if you could provide deeper insight into how the class confusions occur most often. Either the confusion is more or less uniform (the less interesting case), or the confusion happens most often between seen and unseen classes, which would give us further insights.
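Concretely, what I have in mind is nothing more than scoring each test image against the union of seen and unseen class vectors and then checking where the confusions fall. A rough sketch, assuming my reading that class vectors are scored against image features by a dot product, and using purely hypothetical data:

import numpy as np

# Hypothetical data: 40 seen classes, 10 unseen classes, 2048-dimensional features.
seen_vectors = np.random.rand(40, 2048)     # class vectors for seen classes
unseen_vectors = np.random.rand(10, 2048)   # class vectors predicted for unseen classes
all_vectors = np.vstack([seen_vectors, unseen_vectors])  # candidate labels = seen + unseen

test_features = np.random.rand(100, 2048)   # CNN features of test images
true_labels = np.random.randint(0, 50, size=100)

# Generalized setting: every test image is scored against all 50 classes.
pred = (test_features @ all_vectors.T).argmax(axis=1)

# Where do the confusions fall? Unseen predicted as seen vs. seen predicted as unseen.
unseen_mask = true_labels >= 40
unseen_predicted_as_seen = (pred[unseen_mask] < 40).mean()
seen_predicted_as_unseen = (pred[~unseen_mask] >= 40).mean()
print(unseen_predicted_as_seen, seen_predicted_as_unseen)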

I do have some more minor issues below, but I see having this experiment as a major missing piece in this paper. I was considering a major revision to make sure this issue was addressed, but that would lead to an immediate reject. Hence, I decided to go for a minor revision, and ask the authors to perform such an experiment for the next version of the paper.

A second issue that still needs more attention is describing exactly how the features flow between the models. I am still not getting the whole picture. It might have something to do with the phrasing. For example, I do not get the sentence “With learned feature vectors of classes, we use the CNN classifiers of classes as the training supervision to map inter-class relationship into deep CNN features so that predicting a visual classifier for each class node”. Is it correct that the features coming out of your AGCN are never really put into the CNN, but only used in the end to compute a dot product which is then interpreted as the score?
The same confusion might be resolved if I understood what $f_i$ in formula 4 exactly is. Is it the outcome of a pre-trained CNN? If so, why do you call it “the classifier of seen class i”?

Minor issues

In equation 3, I am surprised to see that \hat{v}_i is computed using attention on the neighbors, but not using the state of the node v_i itself at all. Is that intentional? Why?
Now that it is mentioned, it caught my attention that you have an extremely large state in the nodes (2048). What is the reason for that choice?
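To make the first question concrete, the two aggregation variants I am contrasting are roughly the following (my own sketch, which may not match equation 3 exactly):

import numpy as np

def attentive_aggregate(v_i, neighbors, alphas, include_self=False):
    # neighbors: (k, d) states of the k neighboring nodes
    # alphas: (k,) attention weights over the neighbors, summing to 1
    aggregated = (alphas[:, None] * neighbors).sum(axis=0)
    # As I read equation 3: only the neighbors contribute to \hat{v}_i.
    # With include_self=True, the node's own state v_i is added as well (a self-loop).
    return aggregated + v_i if include_self else aggregated

v_i = np.random.rand(4)
neighbors = np.random.rand(3, 4)
alphas = np.array([0.5, 0.3, 0.2])
print(attentive_aggregate(v_i, neighbors, alphas))
print(attentive_aggregate(v_i, neighbors, alphas, include_self=True))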

You write “our model is a regression model rather than a classification model, which usually works better.” Which of the two works better? For which case?

There are a couple of issues to which you gave more attention in your cover letter than in the paper. Perhaps you can also expand your explanation in the paper further.