In a unique experiment, a team of researchers from the Facebook Artificial Intelligence Research (FAIR) group has developed a new research task called ‘Talk the Walk’ that explores the use of “embodied” artificial intelligence (AI) in daily life.
For AI systems to become truly useful in daily life, they will need to achieve a full comprehension of human language.
Embodied AI is a move in that direction. The Facebook team said one strategy for eventually building AI with “human-level language understanding” was to train those systems “in a more natural way, by tying language to specific environments.”
This is similar to how babies first learn to name what they can see and touch. The approach, sometimes referred to as embodied AI, favors learning in the context of a system’s surroundings rather than training on large data sets of text.
Facebook said in a blog post that, as part of this first experiment, a pair of AI agents, a “tourist” and a “guide,” had to communicate with each other to accomplish the shared goal of navigating to a specific location in New York. Rather than being presented with a simplified, game-like setting, the tourist bot had to navigate its way through 360-degree images of actual New York City neighborhoods. It did so with the help of the guide bot, which saw nothing but a map of the neighborhood.
Using an in-house mechanism called MASC (Masked Attention for Spatial Convolution), the researchers helped the guide bot focus on the right place on the map. This, in turn, produced results that were, in some cases, more than twice as accurate on the test set.
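The blog post does not spell out how MASC works internally. As a rough illustration only of what masking a small spatial kernel can do, the Python sketch below slides a 3x3 mask over a grid of location beliefs so that the belief moves one cell in a chosen direction. The grid size, the hand-written mask, and the names `masked_shift` and `MOVE_RIGHT` are all assumptions made for this example; in a real system the mask would presumably be learned rather than fixed, and this is not FAIR's implementation.

```python
from typing import List

Grid = List[List[float]]

# Hypothetical hand-written mask: all weight sits on the cell to the left,
# so each output cell copies its left neighbor and the belief moves right.
MOVE_RIGHT: Grid = [
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]


def masked_shift(belief: Grid, mask: Grid) -> Grid:
    """Apply a 3x3 mask to a 2D belief grid, treating out-of-bounds cells as zero."""
    height, width = len(belief), len(belief[0])
    shifted = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    sy, sx = y + dy, x + dx
                    if 0 <= sy < height and 0 <= sx < width:
                        total += mask[dy + 1][dx + 1] * belief[sy][sx]
            shifted[y][x] = total
    return shifted


# The guide starts out certain that the tourist is at the center of a 3x3 map.
belief = [
    [0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0],
]
updated = masked_shift(belief, MOVE_RIGHT)
```

After the call, all of the belief sits one cell to the right of the center, which is one simple way to picture a guide updating where it thinks the tourist is on the map.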
The goal of this work, said the AI research team members, was to improve the research community’s understanding of how communication, perception, and action can lead to grounded language learning, and to provide a stress test for natural language as a method of interaction.
As part of FAIR’s contribution to the broader pursuit of embodied AI, the research team has released the baselines and data set for Talk the Walk. Sharing this work will provide other researchers with a framework to test their own embodied AI systems, particularly with respect to dialogue, they said.
To provide an environment in which their systems could learn and demonstrate grounded language, the FAIR researchers used a 360-degree camera to capture portions of five New York City neighborhoods. The selected areas featured uniform, grid-based layouts with typical four-cornered street intersections, and served as the first-person view of the environment for one half of each pair of AI agents, the “tourist.”
The AI “guide,” on the other hand, had access only to a 2D overhead map with generic landmarks, such as “restaurant” and “hotel.” Neither bot could share its view with the other, so navigating to a specific location required communication, the team said. Each episode in this task concluded when the guide predicted that the tourist had arrived at the goal location. If the prediction was correct, the episode was marked as successful; otherwise, it was marked as a failure.
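The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the released Talk the Walk code: the `Episode` type, its field names, and the use of grid coordinates for locations are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical representation: a location is an (x, y) intersection on the map.
Cell = Tuple[int, int]


@dataclass
class Episode:
    goal: Cell              # target location the pair is trying to reach
    tourist_position: Cell  # where the tourist is when the guide predicts arrival


def episode_successful(episode: Episode) -> bool:
    """The guide's arrival prediction is correct only if the tourist is at the goal."""
    return episode.tourist_position == episode.goal


def success_rate(episodes: List[Episode]) -> float:
    """Fraction of episodes in which the guide's prediction was correct."""
    if not episodes:
        return 0.0
    return sum(episode_successful(e) for e in episodes) / len(episodes)


episodes = [
    Episode(goal=(1, 1), tourist_position=(1, 1)),  # correct prediction
    Episode(goal=(1, 1), tourist_position=(0, 1)),  # tourist one block away
]
rate = success_rate(episodes)
```

Here the guide is right in one of the two episodes, so the success rate is 0.5.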
The reliance on realism is new for this field, where entirely simulated environments are the norm. The FAIR researchers also took a realistic approach to the natural language interaction between the agents. Rather than generating carefully worded messages for the bots to use, such as “Go to the next block, then turn right to get to the restaurant,” the team collected real interactions from human players. These participants were assigned the same guide and tourist roles as the bots, with the same shared-navigation goals and information constraints (either first-person views or overhead maps).
Among the many findings, the results showed that humans using natural language were worse at localizing themselves than AI agents using synthetic communication. Like Talk the Walk’s other comparisons between human and machine performance, the team said, this result helps establish a baseline for further study of the challenges, as well as the possible opportunities, of developing AI systems that rely on natural language.