Review Comment:
Overview
The authors present an architecture designed to take instructions as natural language input (e.g., "take the cup to the table") and act on the provided instructions. The architecture, or pipeline, uses a combination of the Socio-physical Model of Activities (SOMA) and a Deep Language Understanding (DLU) system to parse perceptual and linguistic information and ground it to the objects and actions required for carrying out the instruction.
General Comments & Criticisms
The authors are attempting to tackle a set of problems critical to the increased deployment of robotics within the home, including, but not limited to, 1) natural language understanding, 2) context sensitivity, 3) grounding/referencing, and 4) action selection and execution. Each of these is a significant challenge and a distinct research focus. One of the contributions of this work is showing how to pull all of these pieces together by leveraging ontologies, which speaks to the originality of the reported research. Indeed, the DLU+SOMA approach is one of several possible for achieving (re)taskable robots. Another end-to-end system/framework/architecture/approach/solution that was not mentioned in the text is Interactive Task Learning (Gluck & Laird, 2018; Kirk, Mininger, & Laird, 2016). It provides an alternative to the DLU+SOMA solution presented in the paper, and it would be worth discussing the differences between the approaches, their respective strengths and weaknesses, etc. The writing was clear and concise, and the various supplemental materials appear to be complete.
Another general criticism I have is that the entirety of the paper provides a description of the system without any real evaluation of the system's performance as a whole, issues with individual components, etc. Such an evaluation would give readers good indicators of where to focus future research (knowledge gaps, semantic map acquisition, etc.) to make the proposed architecture more robust, and would add weight to the points made in the discussion and limitations section. I would recommend being as quantitative as possible (failure/success rates, processing speeds, etc., over multiple runs). Along these lines, only one example instruction is provided: "put the cup on the table". It would be very informative to report the variety of instructions that are handled by the system and the variety that are not.
Specific Comments & Criticisms
• Page 1: a point was made that "Most commonly, any robot activity starts with the robot receiving instructions for that specific activity…verbally from a human or through written texts such as recipes or procedures found in online repositories." This description reads more like a 'could be possible' or a 'potential future' than a 'currently available method' for providing instructions to household robots to complete tasks (especially wikiHow as an instruction-set source), but I could be wrong about this.
• The question posed at the end of the introduction sets the stage: "How can we use ontological knowledge to extract and evaluate parameters from a natural language instruction in order to simulate it formally?" and the authors specify their approach as a potential solution. I just want to note that research reported by Eberhart et al. (2020) is attempting to answer the same question using a generalizable ontology of instruction.
• A game engine is used for simulating the agent's/robot's behavior. This is a common approach for demonstrating the capabilities and gaps of robots and intelligent systems in general. However, the original context describing the instruction-taking robot was helping out in a kitchen. Are the simulated instantiation and the physical context close enough that the demonstration generalizes between the two?
• In the sentence beginning on line 39, page 4, what happens if the underlying grammar cannot map the semantics onto the SOMA concepts?
• In the sentence beginning on line 39, page 6, it would be worthwhile to briefly describe how the terminal scene is constructed within the system.
• The paragraph beginning on line 28 of page 8 describes the DLU's use of 'human computation' to solve the problem of relative spatial closeness. Is this a requirement of the system for any task that requires spatial approximation and guidance, or is it a side effect of the intelligence residing within a virtual environment?
References
Eberhart, A., Shimizu, C., Stevens, C., Hitzler, P., Myers, C. W., & Maruyama, B. (2020). A domain ontology for task instructions. In B. Villazón-Terrazas, F. Ortiz-Rodríguez, S. M. Tiwari, & S. K. Shandilya (Eds.), Knowledge Graphs and Semantic Web: Second Iberoamerican Conference and First Indo-American Conference, KGSWC 2020 (pp. 1–13). Mérida, Mexico: Communications in Computer and Information Science, vol. 1232.
Gluck, K. A., & Laird, J. E. (Eds.). (2018). Interactive task learning: Humans, robots, and agents acquiring new tasks through natural interactions. Cambridge, MA: The MIT Press.
Kirk, J., Mininger, A., & Laird, J. (2016). A demonstration of interactive task learning. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI 2016) (pp. 4248–4249).