Accepted Papers

PixL2R: Guiding Reinforcement Learning using Natural Language by Mapping Pixels to Rewards

Authors: Prasoon Goyal, Scott Niekum, Ray Mooney

Reinforcement learning (RL), particularly in sparse reward settings, often requires prohibitively large numbers of interactions with the environment, thereby limiting its applicability to complex problems. To address this, several prior approaches have used natural language to guide the agent's exploration. However, these approaches typically operate on structured representations of the environment, and/or assume some structure in the natural language commands. In this work, we propose a model that directly maps pixels to rewards, given a free-form natural language description of the task, which can then be used for policy training. Our experiments on the Meta-World robot manipulation domain show that language-based rewards significantly improve learning. Further, we analyze the resulting framework using multiple ablation experiments to better understand the nature of these improvements.

Language-Conditioned Goal Generation: a New Approach to Language Grounding in RL

Authors: Cédric Colas, Ahmed Akakzia, Pierre-Yves Oudeyer, Mohamed Chetouani, Olivier Sigaud

Paper Video

In the real world, linguistic agents are also embodied agents: they perceive and act in the physical world. The notion of Language Grounding questions the interactions between language and embodiment: how do learning agents connect or ground linguistic representations to the physical world ? This question has recently been approached by the Reinforcement Learning community under the framework of instruction-following agents. In these agents, behavioral policies or reward functions are conditioned on the embedding of an instruction expressed in natural language. This paper proposes another approach: using language to condition goal generators. Given any goal-conditioned policy, one could train a language-conditioned goal generator to generate language-agnostic goals for the agent. This method allows to decouple sensorimotor learning from language acquisition and enable agents to demonstrate a diversity of behaviors for any given instruction. We propose a particular instantiation of this approach and demonstrate its benefits.

Language-Goal Imagination to Foster Creative Exploration in Deep RL

Authors: Tristan Karch, Nicolas Lair, Cédric Colas, Jean-Michel Dussoux, Clément Moulin-Frier, Peter Ford Dominey, Pierre-Yves Oudeyer

Paper Video

Developmental machine learning studies how artificial agents can model the way children learn open-ended repertoires of skills. Children are known to use language and its compositionality as a tool to imagine descriptions of outcomes they never experienced before and target them as goals during play. We introduce IMAGINE, an intrinsically motivated deep RL architecture that models this ability. Such imaginative agents, like children, benefit from the guidance of a social peer who provides language descriptions. To take advantage of goal imagination, agents must be able to leverage these descriptions to interpret their imagined goals. This generalization is made possible by modularity: a decomposition between learned goal-achievement reward function and policy relying on deep sets, gated attention and object-centered representations. We introduce the Playground environment and study how this form of goal imagination improves generalization and exploration over agents lacking this capacity.

Extended Abstract: Improving Vision-and-Language Navigation with Image-Text Pairs from the Web

Authors: Arjun Majumdar, Ayush Shrivastava, Stefan Lee, Peter Anderson, Devi Parikh, Dhruv Batra

Paper Video

We ask the following question -- can we leverage abundant 'disembodied' web-scraped vision-and-language corpora (e.g. Conceptual Captions (Sharma et al., 2018)) to learn visual groundings (what do 'stairs' look like?) that improve performance on a relatively data-starved embodied perception task (Vision-and-Language Navigation)? Specifically, we develop VLN-BERT, a visiolinguistic transformer model that scores the compatibility between an instruction ('...stop near the sofa') and a sequence of panoramic images. We demonstrate that pretraining VLN-BERT on image-text pairs from the web significantly improves performance on VLN -- outperforming the prior state-of-the-art in the fully-observed setting by 4 absolute percentage points on success rate. Ablations of our pretraining curriculum show each stage to be impactful -- with their combination resulting in further synergistic effects.

On the Relationship Between Structure in Natural Language and Models of Sequential Decision Processes

Authors: Roma Patel, Rafael Rodriguez-Sanchez, George Konidaris

Paper Video

Human language is distinguished by powerful semantics, rich structure, and incredible flexibility. It enables us to communicate with each other, thereby affecting the decisions we make and actions we take. While Artificial Intelligence (AI) has made great advances both in sequential decision-making using Markov Decision Processes (MDPs) and in Natural Language Processing (NLP), the potential of language to inform sequential decision-making is still unrealized. We explore how the different functional elements of natural language---such as verbs, nouns and adjectives---relate to decision process formalisms of varying complexity and structure. We attempt to determine which elements of language can be usefully grounded to a particular class of decision process and how partial observability changes the usability of language information. Our work show that more complex, structured models can capture linguistic concepts that simple MDPs cannot. We argue that the rich structure of natural language indicates that reinforcement learning should focus on richer, more highly structured models of decision-making.

An Overview of Natural Language State Representation for Reinforcement Learning

Authors: Brielen Madureira, David Schlangen

Paper Video

A suitable state representation is a fundamental part of the learning process in Reinforcement Learning. In various tasks, the state can either be described by natural language or be natural language itself. This survey outlines the strategies used in the literature to build natural language state representations. We appeal for more linguistically interpretable and grounded representations, careful justification of design decisions and evaluation of the effectiveness of different approaches.

Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments – Extended Abstract

Authors: Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, Stefan Lee

Paper Video

We develop a language-guided navigation task set in a continuous 3D environment where agents must execute low-level actions to follow natural language navigation directions. By being situated in continuous environments, this setting lifts a number of assumptions implicit in prior work that represents environments as a sparse graph of panoramas with edges corresponding to navigability. Specifically, our setting drops the presumptions of known environment topologies, short-range oracle navigation, and perfect agent localization. To contextualize this new task, we develop models that mirror many of the advances made in prior settings. We find significantly lower performance in the continuous setting -- suggesting that performance in topological settings may be inflated by the strong implicit assumptions.

Reinforcement Communication Learning in Different Social Network Structures

Authors: Marina Dubova, Arsenii Kirillovich Moskvichev, Robert L. Goldstone

Paper Video

Social network structure is one of the key determinants of human language evolution. Previous work has shown that the network of social interactions shapes decentralized learning in human groups, leading to the emergence of different kinds of communicative conventions. We examined the effects of social network organization on the properties of communication systems emerging in decentralized, multi-agent reinforcement learning communities. We found that the global connectivity of a social network drives the convergence of populations on shared and symmetric communication systems, preventing the agents from forming many local 'dialects'. Moreover, the agent's degree is inversely related to the consistency of its use of communicative conventions. These results show the importance of the basic properties of social network structure on reinforcement communication learning and suggest a new interpretation of findings on human convergence on word conventions.

Pow-Wow: A Dataset and Study on Collaborative Communication in Pommerman

Authors: Takuma Yoneda, Matthew Walter, Jason Naradowsky

Paper Video

In multi-agent learning, agents must coordinate with each other in order to succeed. For humans, this coordination is typically accomplished through the use of language. In this work we perform a controlled study of human language use in a competitive team-based game, and search for useful lessons for structuring communication protocol between autonomous agents. We construct Pow-Wow, a new dataset for studying situated goal-directed human communication. Using the Pommerman game environment, we enlisted teams of humans to play against teams of AI agents, recording their observations, actions, and communications. We analyze the types of communications which result in effective game strategies, annotate them accordingly, and present corpus-level statistical analysis of how trends in communications affect game outcomes. Based on this analysis, we design a communication policy for learning agents, and show that agents which utilize communication achieve higher win-rates against baseline systems than those which do not.

Pre-trained Word Embeddings for Goal-conditional Transfer Learning in Reinforcement Learning

Authors: Matthias Hutsebaut-Buysse, Kevin Mets, Steven Latré

Paper Video

Reinforcement learning (RL) algorithms typically start tabula rasa, without any prior knowledge of the environment, and without any prior skills. This however often leads to low sample efficiency, requiring a large amount of interaction with the environment. This is especially true in a lifelong learning setting, in which the agent needs to continually extend its capabilities. In this paper, we examine how a pre-trained task-independent language model can make a goal-conditional RL agent more sample efficient. We do this by facilitating transfer learning between different related tasks. We experimentally demonstrate our approach on a set of object navigation tasks.

WordCraft: An Environment for Benchmarking Commonsense Agents

Authors: Minqi Jiang, Jelena Luketina, Nantas Nardelli, Pasquale Minervini, Philip Torr, Shimon Whiteson, Tim Rocktäschel

Paper Video

The ability to quickly solve a wide range of real-world tasks requires a commonsense understanding of the world. Yet, how to best extract such knowledge from natural language corpora and integrate it with reinforcement learning (RL) agents remains an open challenge. This is partly due to the lack of lightweight simulation environments that sufficiently reflect the semantics of the real world and provide knowledge sources grounded with respect to observations in an RL environment. To enable research on benchmarking agents with commonsense knowledge, we propose WordCraft, an RL environment based on LittleAlchemy2. This environment is small and fast to run, but built upon entities and relations inspired by real-world semantics. We evaluate several representation learning methods on this benchmarks and propose a new method for integrating knowledge graphs within an RL agent.

Emergence of Multilingualism in Population based Referential Games

Authors: Shresth Verma

Paper Video

The ability of agents to learn to communicate by interaction has been studied through emergent communication tasks. Inspired by previous work in this domain, we extend the referential game setup to a population of spatially distributed agents. In such a setting, our experiments reveal that multiple languages can emerge in the population and some agents develop multilingual traits. Further, an action-advising framework is proposed for improving sample efficiency in the learning process.

Emergence of compositional language in communication through noisy channel

Authors: Łukasz Kuciński, Paweł Kołodziej, Piotr Miłoś

Paper Video

In this paper, we investigate how communication through a noisy channel can lead to the emergence of compositional language. Our approach is , allows for different inductive biases on the agents’ architecture, and trains without periodical resets of the networks’ weights. This relaxes some of the assumptions in recently developed methods. The impact on the structure of the resulting language is shown in the context of signaling games. We also develop a new metric for measuring degree of compositionality.