An Open-Source AI Agent for Doing Tasks on the Web

NNetNav learns how to navigate websites by mimicking childhood learning through exploration.
The arrival of large language models (LLMs) has pushed artificial intelligence toward new, useful heights: Our computers help us write emails, essays, and computer code.
Now, developers are attempting to turn chatbots into action bots that can book flights for us, find information hidden deep in a website, pull data from multiple sources to create shipping or sales reports, or create a new repository on GitHub, says Shikhar Murty, a computer science graduate student in his final year at Stanford University. “AI agents that can take actions on our behalf online could potentially reduce much of the burden of computer use – especially for repetitive tasks,” he says.
The recent release of several commercial AI agents (such as OpenAI’s Operator, ByteDance’s UI-TARS, and Anthropic’s “Computer Use” feature) suggests he may be right. But these agents also raise concerns about proprietary systems that are trained with unknown data, have poorly understood capabilities, use untold amounts of energy, and watch our every move.
“Having closed [proprietary] models that you can chat with and that can fix your email and paper drafts is one thing, but having a private model do tasks collaboratively with you on browsers while watching your browsing history and your computer use really goes a step further,” Murty says.
To address these concerns, Murty and his colleagues, including advisor and Stanford professor Chris Manning, have developed NNetNav (published on the preprint server arXiv). This AI agent learns through its interactions with websites and can accomplish tasks online as well as or better than GPT-4 and several other agents, all while using fewer parameters and remaining fully open source.
According to Manning, “NNetNav could become a lighter weight, faster, privacy-preserving alternative to OpenAI’s recently released Operator for using an AI agent to do things on the web.” Manning is the Thomas M. Siebel Professor in Machine Learning and professor of computer science in the School of Engineering, professor of linguistics in the School of Humanities and Sciences, and senior fellow at the Stanford Institute for Human-Centered AI (HAI).
One reason NNetNav is so lightweight: Rather than being trained with examples of how humans behave online, NNetNav gathers synthetic training data by exploring websites much the way a young child might. It clicks all the buttons and types into all the fields to see what will happen. It then prunes out the pathways that don’t lead toward any meaningful goal. It’s an approach Murty thinks could eventually lead to more efficient LLMs.
“We’ve pretty much exhausted the available static data for training large language models, but we still want to keep making progress,” he says. “Learning from interaction is a completely different modality that hasn’t been explored.”
Learning Through Interaction
Because AI chatbots rely on LLMs trained by reading texts, teaching them to interact with websites in useful ways has proven challenging. LLMs can predict the next word in a sentence, but not the next dropdown menu to click or what to do after it appears. The usual solution is to train the model with expert demonstrations. But getting a person to demonstrate every possible useful web interaction is nearly impossible.
Murty and his colleagues wondered if an LLM could be trained using synthetic data obtained through unsupervised interaction with websites. Just as human children learn cause and effect through positive reinforcement as they interact with the world (how to build vertical stacks of blocks that won’t topple; how to crank the jack-in-the-box so it will pop up), might an AI agent learn to control a computer by exploring websites?
To test the idea, the team created an exploration policy where a specific persona (e.g., an investor active on a Reddit channel about investing) is provided with a website and asked to randomly explore. After the persona clicks and types for a while, the model attempts to discover their goals (such as posting a comment or finding information about an investment). It’s similar to giving a quilter a pair of scissors and observing them learn to cut fabric.
More specifically: At fixed time steps, the LLM attempts to guess what instruction a human could have given that would explain the interaction so far, Murty says. If it can’t describe a meaningful subgoal, the steps are pruned out; if it can, the trajectory is rewarded. So if NNetNav’s explorations discover legitimate subgoals that progress in sequence toward a larger goal (e.g., “Search for flights from SFO,” “Search for flights from SFO to BOS,” “Book the cheapest flight”), the model is rewarded and the successful trajectories become the fine-tuning data for NNetNav’s AI agent.
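To make the idea concrete, here is a minimal Python sketch of that explore-relabel-prune loop. It is an illustration under stated assumptions, not NNetNav’s actual code: the objects `website`, `policy_llm`, and `labeler_llm`, along with method names such as `propose_action` and `infer_instruction`, are hypothetical stand-ins for a browser environment, the exploring persona model, and the model that relabels trajectories in hindsight.

```python
# Illustrative sketch of an explore-relabel-prune loop (not NNetNav's actual implementation).
# Assumptions: `policy_llm` proposes the next browser action for a persona given the current
# page, and `labeler_llm` tries to phrase the trajectory so far as a meaningful instruction.

from dataclasses import dataclass, field


@dataclass
class Trajectory:
    persona: str
    actions: list = field(default_factory=list)        # e.g., ("click", "#search"), ("type", "#from", "SFO")
    observations: list = field(default_factory=list)   # page states after each action
    subgoals: list = field(default_factory=list)       # instructions inferred in hindsight


def explore_and_relabel(website, persona, policy_llm, labeler_llm,
                        max_steps=20, relabel_every=4):
    """Let the persona explore, periodically ask what instruction would explain the
    behavior so far, and prune the trajectory when no meaningful answer exists."""
    traj = Trajectory(persona=persona)
    page = website.reset()
    for step in range(1, max_steps + 1):
        action = policy_llm.propose_action(page, persona)   # hypothetical API
        page = website.execute(action)
        traj.actions.append(action)
        traj.observations.append(page)

        if step % relabel_every == 0:
            # Hindsight relabeling: can this prefix be described as a coherent subgoal,
            # e.g., "Search for flights from SFO"?
            subgoal = labeler_llm.infer_instruction(traj.actions, traj.observations)
            if subgoal is None:
                # No meaningful subgoal: drop the aimless steps and stop exploring.
                del traj.actions[-relabel_every:]
                del traj.observations[-relabel_every:]
                break
            traj.subgoals.append(subgoal)
    return traj


def build_finetuning_data(trajectories):
    """Keep only trajectories that earned at least one subgoal, pairing the final inferred
    instruction with its action sequence to form (instruction, demonstration) examples."""
    return [(t.subgoals[-1], t.actions) for t in trajectories if t.subgoals]
```

In this sketch, the relabeled (instruction, demonstration) pairs play the role of the synthetic fine-tuning data described above; the real system’s prompts, pruning criteria, and reward details differ.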
Through pruning, the model is also learning to avoid doing truly stupid things like clicking on a dropdown menu and then re-clicking it so it disappears. “We humans have very strong priors built in about how to use websites, and models typically don’t,” Murty says. “The point of NNetNav is to winnow down the ways the model interacts with websites so they are more similar to the ways humans interact with websites.”
Testing NNetNav
To see how well NNetNav can navigate websites to achieve specific goals, the research team turned to two benchmark website resources: WebArena and WebVoyager. WebArena consists of clones of real websites that AI agents can log in to. WebVoyager uses live websites, but the AI agents are barred from logging in. Evaluating NNetNav in these contexts ensures that it can do no harm. “You don’t want to write comments on BBC News or boost the wrong kind of content,” Murty says.
The team collected 10,000 positive demonstrations of NNetNav on 20 websites. Those successful trajectories were then used to fine-tune the model. When the team looked at NNetNav’s performance before and after fine-tuning, the model compared favorably with GPT-4 and did better than other open-source, unsupervised methods. It also used about one-third fewer parameters than the next-best performing model.
Even though NNetNav was doing more with less, it was only successful in about one of six assigned tasks. It’s a problem of coverage: It’s very difficult to gather enough training data to make sure an AI agent can tackle every possible assignment. And, as the researchers showed, these models don’t generalize well. “Even if you train a model on a large set of websites, it won’t work super well on other websites,” Murty says. “At the end of the day, it’s going to be a game of coverage unless we develop methods for learning on the fly.” Which is what he’s pursuing next.
Interactive Learning on the Fly
Murty thinks computer control has much to learn from robotics with its pipelines for learning on the fly. For example, after training a robot using supervised learning (such as examples of humans doing a task), a robotics researcher might turn to reinforcement learning – allowing the robot to learn on its own whether it has achieved desired goals. But before reinforcement learning is possible, there has to be a good base model.
“For computer control, NNetNav could provide that base model for doing reinforcement learning in academic settings,” Murty says. “That’s kind of the natural next step: How do you build the next part of the pipeline for learning as you go? That’s what we’re working on now.”
Murty’s also curious to see whether learning from unsupervised interactions might ultimately help fine-tune large language models. Right now, when we use different words to express the same thing (“Find me a flight from SFO to JFK” or “I want to fly from the Bay Area to New York City”), large language models view these as very different in vector space. But if language models learn that both of these goals correspond to the same sequence of clicking and typing actions, perhaps they will become more accurate and efficient, Murty says.
As AI agents proliferate, Murty hopes a privacy-protecting option such as NNetNav will become a go-to choice. But society will also need to regulate the risks posed by AI agents generally – risks that such agents could make serious errors; nudge public opinion through the spread of dis- and misinformation; or harm users by falling prey to dark patterns or scams on the web. “There need to be sufficient guardrails and audit requirements so that AI agents don’t do bad things or are caught when they do,” he says.