In The Jetsons, a futuristic cartoon that premiered in 1962, the titular family employed Rosie the Robot to do laundry, clean house, cook meals, and help take care of the children. More than 60 years later, we still have no Rosie-like robot to meet our daily needs, despite significant effort in that direction, says Jennifer Grannen, a graduate student in computer science at Stanford University.
“Everyone is really interested in pushing the capabilities of autonomous robots, but there will always be some situation outside the training data where a fully autonomous robot is going to fail,” Grannen says.
So, Grannen and fellow computer science graduate students Siddharth Karamcheti and Suvir Mirchandani, with associate professors Dorsa Sadigh and Percy Liang, set a different goal: designing a non-autonomous domestic robot that can collaborate with people and learn on the job.
“We were interested in exploring what it would take to deploy robots with people,” Karamcheti says. If a robot is helping you cook, maybe it’s holding the pot while you’re stirring. If you’re making lunch for the family one morning, it’s helping you pack each of the lunch bags. And more: You’re verbally telling it what to do, and if it doesn’t know how, it will ask for instructions.
As Grannen puts it, “We should be leveraging the fact that we have humans in the home who know what the robot should be doing and can provide feedback to help the robot learn.”
The team presented a prototype of just such a robot system, called Vocal Sandbox, at the 8th Annual Conference on Robot Learning in September 2024. Vocal Sandbox, funded in part by the Stanford Institute for Human-Centered AI, uses a large language model (LLM) to understand spoken commands and, in real time, ask for help identifying new objects and learning new movements and more complex behaviors.
Karamcheti hopes Vocal Sandbox will be the first of many systems that will gradually help us integrate robots into our lives. “The easier we make it for people to teach robots simple things, the sooner we can get these robots into people’s lives to start providing some bit of utility,” he says.
Vocal Sandbox at Work
The physical setup for Vocal Sandbox consists of a robot arm, a couple of cameras, a laptop, and a graphical user interface (GUI) that shows the human user what the robot is planning to do. There’s also a lot going on behind the scenes, including automated speech recognition; a text-to-speech system that enables real-time voice communication by the robot; a learned keypoint model that helps the robot identify objects by name and location; and a dynamic movement model that can learn movements from kinesthetic demonstrations.
But what’s really new here is the Vocal Sandbox learning framework, which allows the system to learn both low- and high-level skills on the fly using an LLM (GPT-3.5 Turbo). Low-level skills can be thought of as understanding particular verbs and nouns that the LLM translates into code that yields specific robot actions. High-level behaviors chain those known verbs and nouns together into a new, named skill.
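To make that concrete, here is a minimal sketch in Python, with hypothetical function names rather than the team’s actual code, of how an LLM-based planner might be prompted to map a spoken command onto a small API of low-level primitives:

```python
# Minimal sketch (hypothetical names): the LLM is constrained to answer with
# calls to a small set of robot primitives, which the system then executes.

# Low-level primitives the planner is allowed to call.
ROBOT_API = """
def pick_up(obj: str): ...      # grasp the named object
def go_to(location: str): ...   # move the gripper to a named location
def release(): ...              # open the gripper
"""

def build_prompt(command: str) -> str:
    """Shape of the request a GPT-3.5-Turbo-style model would see.

    In the real system the prompt would also carry the objects and behaviors
    taught so far; this sketch only shows the basic structure.
    """
    return (
        "You control a robot arm. Respond only with calls to this API:\n"
        f"{ROBOT_API}\n"
        f"User command: {command}\n"
        "Code:"
    )

print(build_prompt("pick up the toy car"))
# A well-behaved model would answer with something like: pick_up("toy car")
```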
The team demonstrated the Vocal Sandbox framework’s usefulness in two situations: helping a human pack a gift bag, and controlling the camera while a human set up Lego pieces to create a stop-motion animation.
In the gift bag scenario, a low-level skill might be the ability to execute a specific action such as “pick up the toy car.” If the robot already knows the meaning of “pick up” but not how to identify a toy car, it might say, “I don’t know what you mean by the toy car. Can you show me?” The human would then click on the image of the toy car in the GUI, teaching the system the new low-level skill of what a toy car looks like.
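That interaction can be pictured as registering a name against a clicked image location. Below is a minimal sketch with hypothetical names; the real system relies on a learned keypoint model rather than a plain lookup table:

```python
# Minimal sketch (hypothetical names): when the robot does not recognize an
# object, it asks the user to click it in the GUI, then remembers the name.

known_objects: dict[str, tuple[int, int]] = {}   # name -> (pixel_x, pixel_y)

def get_gui_click() -> tuple[int, int]:
    # Stand-in for the GUI; a real implementation would return the pixel
    # the user clicked in the camera view.
    return (320, 240)

def resolve_object(name: str) -> tuple[int, int]:
    if name not in known_objects:
        # The real system speaks this prompt aloud via text-to-speech.
        print(f"I don't know what you mean by '{name}'. Can you show me?")
        known_objects[name] = get_gui_click()   # one click teaches the new name
    return known_objects[name]

print(resolve_object("toy car"))   # asks once, then remembers
print(resolve_object("toy car"))   # answered from memory
```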
If the user assembling gift bags wants the robot to use a higher-level skill, such as “pack the toy car in the gift bag,” they would explain: “To pack means to pick up the toy car, go to the gift bag, and release.” And because the robot already knows the actions “pick up,” “go to,” and “release,” it can perform that new higher-level behavior of “packing” items from then on, Grannen says.
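In code, teaching such a composite behavior amounts to storing a short recipe over primitives the robot already knows. A minimal sketch, again with hypothetical names rather than the team’s implementation:

```python
# Minimal sketch (hypothetical names): a high-level behavior is stored as a
# recipe over primitives the robot already knows, then reused from then on.

def pick_up(obj): print(f"picking up {obj}")
def go_to(place): print(f"moving to {place}")
def release(): print("releasing")

PRIMITIVES = {"pick up": pick_up, "go to": go_to, "release": release}
BEHAVIORS = {}   # behavior name -> list of (primitive, argument role) steps

def teach_behavior(name, steps):
    """Record a behavior such as 'pack' as a sequence of known primitives."""
    BEHAVIORS[name] = steps

def run_behavior(name, obj, place):
    for primitive, role in BEHAVIORS[name]:
        if role == "object":
            PRIMITIVES[primitive](obj)
        elif role == "place":
            PRIMITIVES[primitive](place)
        else:
            PRIMITIVES[primitive]()

# "To pack means to pick up the object, go to the gift bag, and release."
teach_behavior("pack", [("pick up", "object"), ("go to", "place"), ("release", None)])
run_behavior("pack", "toy car", "gift bag")   # packing now works for any object
```

Once defined, a composite behavior like this could be exposed to the LLM alongside the original primitives, which is consistent with the team’s description of the robot “packing” items from then on.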
Vocal Sandbox also allows the robot to learn kinesthetic movements. For example, in the stop-motion animation scenario, the user asked the robot to “zoom in” on Iron Man, and the robot asked what it means to “zoom.” The user then manually moved the camera from one place to another to demonstrate the action of zooming.
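A kinesthetic demonstration like that can be sketched as recording the arm’s poses while the user guides it, then saving the trajectory under the new word. The snippet below is an illustration with hypothetical names, not the dynamic movement model the team actually uses:

```python
# Minimal sketch (hypothetical names): record arm poses while the user
# physically guides the robot, then store the trajectory under the new word.

import time

def read_arm_pose():
    # Stand-in for the robot's pose sensors; a real reading would change as
    # the user moves the arm.
    return (0.0, 0.0, 0.0)

def record_demonstration(name, duration_s=2.0, rate_hz=10):
    waypoints = []
    t_end = time.time() + duration_s
    while time.time() < t_end:
        waypoints.append(read_arm_pose())   # sample the pose during the demo
        time.sleep(1.0 / rate_hz)
    return {name: waypoints}                # replayable on later requests

motions = record_demonstration("zoom")
print(f"learned 'zoom' from {len(motions['zoom'])} waypoints")
```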
With each instruction to the system, the robot displays its plan of action on the GUI and asks the user if the plan is correct before executing it – an important safety feature. “The GUI shows you the recipe that the system is going to follow, step by step,” Karamcheti says. It identifies which object it’s going to interact with and shows the curve the robot arm will follow to perform the requested action. And the user has to say “OK” before the robot does any of it. “This modularity is the key mechanism we have right now for trust and reliability,” he says.
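The confirmation step itself is simple to picture: show the plan, wait for an “OK,” and only then execute. A minimal sketch with hypothetical names:

```python
# Minimal sketch (hypothetical names): the plan is displayed step by step and
# executed only after the user explicitly approves it.

def confirm_and_execute(plan, execute_step):
    print("Planned steps:")
    for i, step in enumerate(plan, 1):
        print(f"  {i}. {step}")
    if input("OK to proceed? ").strip().lower() in {"ok", "yes", "y"}:
        for step in plan:
            execute_step(step)
    else:
        print("Plan discarded; waiting for a new instruction.")

confirm_and_execute(
    ["pick up 'toy car'", "go to 'gift bag'", "release"],
    execute_step=lambda step: print(f"executing: {step}"),
)
```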
Both experiments demonstrated the efficiency of the Vocal Sandbox framework. But using the robot as the camera person for the stop-motion film was particularly efficient. It can be tedious to create stop-motion animation because you have to arrange the cameras very precisely while also arranging the Lego pieces and directing the story, Grannen says. “By using a robot as your partner, you’re able to focus on the creative parts of the task and also unlock a massive speedup.”
Moving to Realistic Settings
Soon, Grannen hopes to test Vocal Sandbox in a real-life setting. She’s been talking to a local bakery about how they might benefit from an extra hand. The idea would be to take the robot to the bakery and, without programming it for a specific bakery task, see if the employees can teach it to do useful things in a short time. Might they teach it to chop vegetables, take hot pans out of the oven, or put croissants into a basket on the display shelf?
“We really want to see what users choose to do when they’re working with the robot,” Grannen says. “What creativity does it unlock, and what types of things can they accomplish?”
Grannen is also interested in using Vocal Sandbox to assist older adults and people with disabilities. “To be an effective caregiver, you have to be able to customize and take feedback from the patient,” Grannen says. “In my opinion, this work takes a big step in that direction.”
For example, she says, Vocal Sandbox would be a boost to her prior work with assistive feeding by giving users the ability to teach the system new foods or new higher-level behaviors such as “feed me the soup” rather than “pick up the spoon, scoop some soup, bring it to my mouth.” Vocal Sandbox could even enable assistive bathing when combined with an appropriate gripper, which Grannen has been helping develop with the Stanford Robotics Center. “With a soft gripper and Vocal Sandbox, a person could say, ‘Can you scrub harder here’ or ‘That’s too much,’ giving users the ability to verbally control how the robot follows the contours of their body without hurting them.”
Karamcheti has seen robots do more and more cool things in lab settings or in siloed warehouse environments without people in the loop, but those efforts have not brought us any closer to seeing a robot deployed to collaborate with us in a seamless, easy way, he says.
“I want the Vocal Sandbox system to be the first of many systems where we can just put a robot with a person and they can figure out how to get some amount of utility from it, and just grow its skillset from there,” he says. Such robot collaboration may get us closer to a Rosie-the-Robot future.