Is Neuro-Symbolic AI Meeting its Promise in Natural Language Processing? A Structured Review

Tracking #: 3172-4386

Authors: 
Kyle Hamilton
Aparna Nayak
Bojan Božić
Luca Longo

Responsible editor: 
Guest Editors NeSy 2022

Submission type: 
Survey Article

Abstract: 
Advocates for Neuro-Symbolic Artificial Intelligence (NeSy) assert that combining deep learning with symbolic reasoning will lead to stronger AI than either paradigm on its own. As successful as deep learning has been, it is generally accepted that even our best deep learning systems are not very good at abstract reasoning. And since reasoning is inextricably linked to language, it makes intuitive sense that Natural Language Processing (NLP) would be a particularly well-suited candidate for NeSy. We conduct a structured review of studies implementing NeSy for NLP, with the aim of answering the question of whether NeSy is indeed meeting its promises: reasoning, out-of-distribution generalization, interpretability, learning and reasoning from small data, and transferability to new domains. We examine the impact of knowledge representation, such as rules and semantic networks, of language structure and relational structure, and of whether implicit or explicit reasoning contributes to higher promise scores. We find that systems in which logic is compiled into the neural network satisfy the most NeSy goals, while other factors, such as knowledge representation or type of neural architecture, do not exhibit a clear correlation with goals being met. We find many discrepancies in how reasoning is defined, specifically in relation to human-level reasoning, which impact decisions about model architectures and drive conclusions that are not always consistent across studies. Hence we advocate for a more methodical approach to the application of theories of human reasoning, as well as the development of appropriate benchmarks, which we hope can lead to a better understanding of progress in the field. We make our data and code available on GitHub for further analysis.
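To make the abstract's phrase "logic is compiled into the neural network" concrete for readers new to NeSy, here is a minimal, generic sketch, our illustration rather than the method of any particular surveyed system: a symbolic rule is relaxed into a differentiable penalty (here via the product t-norm) and folded into the training loss, so that gradient descent itself enforces the logic.

```python
import torch

# Fuzzy relaxation of the rule "penguin(x) -> bird(x)": under the product
# t-norm, the implication is violated to degree p_penguin * (1 - p_bird),
# a quantity that is differentiable and zero exactly when the rule holds.
def rule_penalty(p_penguin: torch.Tensor, p_bird: torch.Tensor) -> torch.Tensor:
    return p_penguin * (1.0 - p_bird)

logits = torch.randn(8, 2, requires_grad=True)  # a toy classifier's outputs for 8 inputs
probs = torch.sigmoid(logits)                   # column 0: p(penguin), column 1: p(bird)

# Added to the usual task loss, the penalty lets training enforce the rule:
loss = rule_penalty(probs[:, 0], probs[:, 1]).mean()
loss.backward()  # gradients push p(bird) up wherever p(penguin) is high
```

Because the penalty is differentiable, the rule shapes every gradient update rather than being checked post hoc, which is one way logic can be "compiled into" a network.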

Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Vered Shwartz submitted on 06/Jul/2022
Suggestion:
Accept
Review Comment:

I’m happy with the revisions, which address both my concern about coverage (more papers were added to the analysis, from the ACL Anthology) and my concern about suitability as an introductory text (two introductory sections were added).

Minor comments:

2. Regarding the distinction between pre-training data and fine-tuning data, thanks for adding this sentence. I think that catastrophic forgetting of the pre-training task is not the biggest concern when the pre-training task is a (masked or standard) LM objective. After all, most downstream NLP tasks don’t need to guess a word in its context. I would add additional concerns, such as:

(a) some of these LMs are so big that only rich organizations can afford GPUs with enough memory to use them.

(b) despite performance improvements on many NLP tasks, these models are still limited in their reasoning abilities. For example, GPT-3, with its massive training, gets only 80% accuracy on 3-digit addition problems, a task that a symbolic model would clearly get 100% correct (https://arxiv.org/abs/2202.07785); see the sketch after this list. Such models also tend to overfit to their training corpus, to the point of revealing private information and encoding biases.

(c) for this reason, instead of having one pre-trained LM used by everyone in the community and fine-tuned with small training sets on specific tasks, the community has started an LM arms race. Bigger LMs are trained on more data, obviating the environmental benefit of re-using LMs in the first place.

You can cite “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” by Bender et al. for some of these claims: https://dl.acm.org/doi/pdf/10.1145/3442188.3445922.
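A minimal sketch of the symbolic baseline alluded to in (b), assuming prompts of the form "123 + 456 =" (the function name and evaluation harness here are our own illustration, not taken from the cited paper):

```python
import random
import re

def symbolic_add(question: str) -> int:
    """Answer prompts like '123 + 456 =' by parsing the operands and computing exactly."""
    a, b = map(int, re.findall(r"\d+", question))
    return a + b

# Sample 3-digit addition problems: the symbolic solver is correct by
# construction (100% accuracy), whereas GPT-3 reportedly reaches only ~80%.
problems = [(random.randint(100, 999), random.randint(100, 999)) for _ in range(1000)]
correct = sum(symbolic_add(f"{a} + {b} =") == a + b for a, b in problems)
print(f"symbolic accuracy: {correct / len(problems):.0%}")  # -> 100%
```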

3. Thanks for including the list of venues in the appendix, and for using the ACL Anthology. Is it true that despite using the ACL Anthology, the only venue where you found papers that satisfy the requirements was ACL itself? What about TACL, EMNLP, NAACL, EACL, COLING, etc.? The NLP community submits papers to any of these venues pretty much interchangeably.

Review #2
Anonymous submitted on 21/Jul/2022
Suggestion:
Accept
Review Comment:

I thank the authors for their hard work on this manuscript. I see the paper has improved a lot. I really like the extended sections (e.g., the one on language) and the new figures. I believe this is now a very nice survey paper that can be useful for the community.

A few minor things the authors might want to fix:

* Figure 8. The entire paper is filled with beautifully designed figures, but the two elements in Figure 8 - which come from the original papers - are of very low quality. I'd suggest the authors redraw these two (the same comment might apply to Figure 9, but I see the authors have a similar picture in Figure 26).

* Figure 25 shows the transformers drawn with RNN components. Shouldn't these be transformer layers?

* I still think that more references to the symbolic NLP side could have been added, and (https://ojs.aaai.org//index.php/AAAI/article/view/5962) is a valuable paper to include, but I will not force the authors to do so if they do not think these additions would add value to the paper.

Review #3
By Filip Ilievski submitted on 03/Aug/2022
Suggestion:
Major Revision
Review Comment:

I thank the authors for the extensive revisions of their original submission, and for their effort to address the reviewer comments.
The paper contains quite a lot of interesting information and has improved since the initial version; yet there are several key issues that prevent me from voting for acceptance:

1) The paper lacks a clear scope. Writing a review paper about how to achieve general AI (which is basically covered in Section 1), how to integrate the neural and the symbolic, then what NLP/NLU/NLG/NLI are, and what the use cases and applications are, is just way too much. This results in a paper that is extremely long and unfocused, i.e., one that stays at a shallow level and presents some information but not the rest.

2) I still do not understand the main research question, and how it (and its subquestions) is answered in the results and discussion sections. Perhaps this is because of terminology. The authors ask whether the five ambitious goals are met, but then provide evidence from a small number of papers on whether those papers attempt to achieve these goals. In essence, there is a big gap between attempting and succeeding. And succeeding is not obvious: does it mean improving over purely neural and purely symbolic methods? Does it mean achieving 100% on some task? On many tasks?

3) The third point is that the paper structure in the longer sections is unclear to me. This holds especially for the results section, but it is also noticeable in the related work section. Given that the results are about answering the research questions, I was surprised to see the section organized along the lines of methods. The discussion section then follows yet another organization, which again does not map back directly to the research question(s).