Saw, Sword, or Shovel: AI Spots Functional Similarities Between Disparate Objects

Date
October 13, 2025
Topics
Robotics
Computer Vision

With a new computer vision model that recognizes the real-world utility of objects in images, researchers at Stanford look to push the boundaries of robotics and AI.

In the field of AI known as computer vision, researchers have successfully trained models that can identify objects in two-dimensional images. It is a skill critical to a future of robots able to navigate the world autonomously. But object recognition is only a first step. AI also must understand the function of the parts of an object—to know a spout from a handle, or the blade of a bread knife from that of a butter knife.

Computer vision experts call such utility overlaps “functional correspondence,” and it is one of the most difficult challenges in the field. Now, in a paper partially funded by the Stanford Institute for Human-Centered AI and slated for presentation at the International Conference on Computer Vision, Stanford scholars debut a new AI model that not only recognizes the various parts of an object and discerns their real-world purposes but also maps those parts, pixel by pixel, between objects. A future robot might be able to distinguish, say, a meat cleaver from a bread knife or a trowel from a shovel and select the right tool for the job. The researchers suggest a robot might one day even transfer the skill of using a trowel to a shovel, or a bottle to a kettle, to complete the same job with a different tool.

“Our model can look at images of a glass bottle and a tea kettle and recognize the spout on each, but also it comprehends that the spout is used to pour,” explains co-first author Stefan Stojanov, a Stanford postdoctoral researcher advised by senior authors Jiajun Wu and Daniel Yamins. “We want to build a vision system that will support that kind of generalization—to analogize, to transfer a skill from one object to another to achieve the same function.”

Establishing correspondence is the art of figuring out which pixels in two images refer to the same point in the world, even if the photographs are taken from different angles or show different objects. That is hard enough when both images show the same object, but, as the bottle-versus-tea-kettle example shows, the real world is rarely so cut-and-dried. Autonomous robots will need to generalize across object categories and decide which object to use for a given task. One day, the researchers hope, a robot in a kitchen will be able to select a tea kettle to make a cup of tea, pick it up by the handle, and pour hot water from its spout.
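At its core, establishing dense correspondence can be framed as matching per-pixel feature descriptors between two images. The article does not describe the model's architecture, so the sketch below is only a minimal illustration of that framing: it assumes a hypothetical pretrained feature extractor (extract_features) and pairs each pixel in one image with its most similar pixel in the other by cosine similarity.

import numpy as np

def extract_features(image):
    """Hypothetical stand-in for a pretrained vision backbone that returns
    one D-dimensional descriptor per pixel, shaped (H, W, D)."""
    raise NotImplementedError("plug in a real feature extractor here")

def dense_correspondence(feats_a, feats_b):
    """Match every pixel in image A to its most similar pixel in image B
    using cosine similarity of their descriptors.

    feats_a: (Ha, Wa, D) array; feats_b: (Hb, Wb, D) array.
    Returns an (Ha, Wa, 2) array of matched (row, col) coordinates in B.
    """
    Ha, Wa, D = feats_a.shape
    Hb, Wb, _ = feats_b.shape

    a = feats_a.reshape(-1, D)
    b = feats_b.reshape(-1, D)
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)  # L2-normalize
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)

    sim = a @ b.T                     # (Ha*Wa, Hb*Wb) cosine similarities
    best = sim.argmax(axis=1)         # best match in B for each pixel in A
    rows, cols = np.divmod(best, Wb)  # flat index -> (row, col) in B
    return np.stack([rows, cols], axis=1).reshape(Ha, Wa, 2)

Raw appearance similarity alone would not capture function, and the full similarity matrix is impractical at full resolution; those gaps are precisely what the Stanford work targets.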

Autonomy Rules

True functional correspondence would make robots far more adaptable than they are currently. A household robot would not need training on every tool at its disposal but could reason by analogy to understand that while a bread knife and a butter knife may both cut, they each serve a specific purpose.

In their work, the researchers say, they have achieved “dense” functional correspondence, whereas earlier efforts managed only sparse correspondence, matching a handful of key points on each object. The main obstacle has been a paucity of data, which typically had to be amassed through human annotation.

“Unlike traditional supervised learning where you have input images and corresponding labels written by humans, it’s not feasible to humanly annotate thousands of pixels individually aligning across two different objects,” says co-first author Linan “Frank” Zhao, who recently earned his master’s in computer science at Stanford. “So, we asked AI to help.”

The team arrived at a solution through what is known as weak supervision: vision-language models generate the labels that identify functional parts, and human experts step in only to quality-control the data pipeline. It is a far more efficient and cost-effective approach to training.
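The article does not detail the team's exact pipeline, but the general weak-supervision pattern Zhao describes looks roughly like the sketch below: a vision-language model proposes functional-part masks from text prompts, and human experts only spot-check a small sample for quality control. The functions vlm_label_parts and human_review are hypothetical placeholders, not a real API.

import random

def vlm_label_parts(image, part_prompts):
    """Hypothetical query to a vision-language model that returns, for each
    functional-part prompt (e.g., 'the part used to pour'), a pixel mask."""
    raise NotImplementedError("query a real vision-language model here")

def human_review(item):
    """Hypothetical hook where an expert accepts or rejects a labeled item."""
    raise NotImplementedError("route to a human annotation tool here")

def build_weakly_supervised_dataset(images, part_prompts, review_fraction=0.05):
    """Label every image with machine-generated functional-part masks;
    humans review only a small random fraction for quality control."""
    dataset = []
    for image in images:
        masks = vlm_label_parts(image, part_prompts)  # machine-generated labels
        dataset.append({"image": image, "part_masks": masks})

    sample = random.sample(dataset, max(1, int(review_fraction * len(dataset))))
    for item in sample:
        item["human_verified"] = human_review(item)   # spot-check, not full annotation
    return dataset

The point of the pattern is the cost structure: the expensive human step scales with the review fraction, not with the number of pixels or images.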

“Something that would have been very hard to learn through supervised learning a few years ago now can be done with much less human effort,” Zhao adds.

In the kettle and bottle example, each pixel in the kettle's spout is aligned with a pixel in the bottle's mouth, providing a dense functional mapping between the two objects. The new vision system can spot function in structure across disparate objects, a valuable fusion of functional definition and spatial consistency.
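One way to read that fusion, continuing the earlier sketch: if both images come with functional-part masks (however obtained), pixel matching can be restricted so that spout pixels in the kettle are paired only with mouth pixels in the bottle. Again, this is an illustrative sketch with hypothetical inputs, not the paper's method.

import numpy as np

def masked_functional_matches(feats_a, feats_b, mask_a, mask_b):
    """Match pixels of one functional part in image A (e.g., the kettle spout)
    only against pixels of the corresponding part in image B (e.g., the bottle mouth).

    feats_*: (H, W, D) per-pixel descriptors; mask_*: (H, W) boolean part masks.
    Returns a list of ((row_a, col_a), (row_b, col_b)) matched coordinate pairs.
    """
    coords_a = np.argwhere(mask_a)      # pixel coordinates inside part A
    coords_b = np.argwhere(mask_b)      # pixel coordinates inside part B
    a = feats_a[mask_a]                 # (Na, D) descriptors for part A
    b = feats_b[mask_b]                 # (Nb, D) descriptors for part B
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    best = (a @ b.T).argmax(axis=1)     # best part-B match for each part-A pixel
    return [(tuple(pa), tuple(coords_b[j])) for pa, j in zip(coords_a, best)]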

Seeing the Future

For now, the system has been tested only on images and not in real-world experiments with robots, but the team believes the model is a promising advance for robotics and computer vision. Dense functional correspondence is part of a larger trend in AI in which models are shifting from mere pattern recognition toward reasoning about objects. Where earlier models saw only patterns of pixels, newer systems can infer intent.

“This is a lesson in form following function,” says Yunzhi Zhang, a Stanford doctoral student in computer science. “Object parts that fulfill a specific function tend to remain consistent across objects, even if other parts vary greatly.”

Looking ahead, the researchers want to integrate their model into embodied agents and build richer datasets.

“If we can come up with a way to get more precise functional correspondences, then this should prove to be an important step forward,” Stojanov says. “Ultimately, teaching machines to see the world through the lens of function could change the trajectory of computer vision—making it less about patterns and more about utility.”

Paper authors: Stefan Stojanov, Linan Zhao, Yunzhi Zhang, Daniel L. K. Yamins, Jiajun Wu

Contributor(s)
Andrew Myers

Related News

AI Can’t Do Physics Well – And That’s a Roadblock to Autonomy
Andrew Myers
Jan 26, 2026
News

QuantiPhy is a new benchmark and training framework that evaluates whether AI can numerically reason about physical properties in video images. QuantiPhy reveals that today’s models struggle with basic estimates of size, speed, and distance but offers a way forward.

Spatial Intelligence Is AI’s Next Frontier
TIME
Dec 11, 2025
Media Mention

"This is AI’s next frontier, and why 2025 was such a pivotal year," writes HAI Co-Director Fei-Fei Li.

The Architects of AI Are TIME’s 2025 Person of the Year
TIME
Dec 11, 2025
Media Mention

HAI founding co-director Fei-Fei Li has been named one of TIME's 2025 Persons of the Year. From ImageNet to her advocacy for human-centered AI, Dr. Li has been a guiding light in the field.
