AI Can’t Do Physics Well – And That’s a Roadblock to Autonomy

Date
January 26, 2026
Topics
Computer Vision
Robotics
Sciences (Social, Health, Biological, Physical)
Image: breaking of pool balls on a pool table (iStock)

QuantiPhy is a new benchmark and training framework that evaluates whether AI can numerically reason about physical properties in video images. QuantiPhy reveals that today’s models struggle with basic estimates of size, speed, and distance but offers a way forward.

In a video of a green-felted pool table, several multicolored balls careen randomly across the screen. Most people could form a fairly accurate estimate of the speed of any one of those balls, but ask AI to do the same, and the results can vary wildly. AI, it turns out, is not great at physics.

AI’s inability to comprehend the physical world is holding back a new age of robotics, autonomous vehicles, and other visually aware fields, say the developers of QuantiPhy, a new test that is charting AI’s lagging-but-improving understanding of the physical world.

QuantiPhy evaluates AI’s ability to numerically estimate an object’s size, velocity, and acceleration when given any one of these properties – like the diameter of a pool ball – and it allows the researchers to compare models to see which is best and which are improving fastest. Most importantly, thanks to QuantiPhy, the authors say they now know how to make AI better.
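
To make the task concrete, here is a minimal sketch in Python of what a QuantiPhy-style item and its scoring could look like: given one known property, the model must return a number, and that number is compared against ground truth. The field names, values, and the relative-error metric are illustrative assumptions, not details from the paper.

```python
# Hypothetical shape of a QuantiPhy-style benchmark item and a simple way to
# score a model's numeric answer against ground truth. Field names, values,
# and the relative-error metric are illustrative assumptions.

item = {
    "video": "pool_break_017.mp4",
    "given": {"property": "diameter", "object": "cue ball", "value_m": 0.057},
    "question": "What is the speed of the red ball between t=1.0s and t=1.1s?",
    "ground_truth_m_per_s": 1.8,
}

def relative_error(predicted: float, truth: float) -> float:
    """Scale-free error so size, speed, and distance items are comparable."""
    return abs(predicted - truth) / abs(truth)

model_answer = 2.4  # a model's numeric estimate for this item
print(f"relative error: {relative_error(model_answer, item['ground_truth_m_per_s']):.2f}")
# 0.33 -> the estimate is off by about a third
```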

“To date, models appear to lean heavily on pretrained world knowledge – on memorized facts – rather than real quantitative reasoning from visual and textual inputs,” explained Ehsan Adeli, director of the Stanford Translational Artificial Intelligence (STAI) Lab, a faculty member of the Stanford Vision and Learning (SVL) Lab and HAI, and the senior author of a new preprint paper introducing QuantiPhy. “It represents a significant leap in our ability to measure AI’s capacity to comprehend and interact with the real world.”

“QuantiPhy is part benchmark test that gives us the ability to fairly evaluate physical comprehension across today’s most popular models, and part model itself that shows how all models can improve,” adds co-first author Tiange Xiang, a PhD student and member of the SVL Lab.

As such, the authors say QuantiPhy could help move models that understand video, images, and text simultaneously – visual language models, or VLMs – past simple linguistic plausibility toward a numerically accurate understanding of the world that will make robots and autonomous vehicles smarter, more useful, and safer.

Quantitative Difference

While generative AI models have impressive qualitative abilities to summarize large quantities of text, write essays and poetry, and generate original images, they continually fall short in their quantitative understanding of the physical world. 

Qualitatively, AI can accurately describe a coconut falling from a palm tree to a beach below, but it cannot accurately estimate the coconut’s speed. On these physics-related questions, “AI produces responses that sound plausible, but on closer analysis prove to be little more than guesswork,” Adeli says.
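
For readers who want the quantitative version of the coconut example, the underlying physics is simple free fall. The tree height below is an assumed value, and the snippet is only meant to show the kind of numeric answer a physically aware model should be able to produce.

```python
import math

# Back-of-the-envelope physics behind the coconut question: free fall from
# rest, ignoring air resistance. The tree height is an assumed value.
g = 9.81          # gravitational acceleration, m/s^2
height_m = 10.0   # assumed height of the palm tree, m

impact_speed = math.sqrt(2 * g * height_m)   # v = sqrt(2gh)
fall_time = math.sqrt(2 * height_m / g)      # t = sqrt(2h/g)

print(f"impact speed ≈ {impact_speed:.1f} m/s after {fall_time:.1f} s")
# ≈ 14.0 m/s after about 1.4 s
```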

“Even the very best models rarely perform better than chance on estimating distances, orientations, and sizes of objects in two-dimensional videos,” Xiang says. “And this is not a trivial shortcoming. In assessing AI’s improving ability to do basic physics and in helping developers hone these skills, QuantiPhy represents a critical step toward physically aware AI.” 

Domestic robots and autonomous vehicles will need to do better. A household robot must understand it needs to apply a gentler force when cracking an egg than when cutting into a butternut squash, or that it should wait until a mixer blade has stopped spinning before removing the bowl. Industrial robots will need similar skills to navigate the factory floor and manipulate objects to assemble products. Autonomous security cameras will need such capabilities to recognize threats to the valuable assets they safeguard. 

'AI Learned Best on its Own'

To develop QuantiPhy, the research team took a multifaceted approach that combined real-world and simulated data. They gathered more than 3,300 videos from the internet and recorded experiments in a lab. “We set up a space with four to five cameras and manually recorded several physical interactions, allowing us to provide QuantiPhy with accurate 3D data,” Xiang recalled.

Then, they turned QuantiPhy loose. In one approach to training, QuantiPhy was tasked with evaluating the videos and making quantitative assessments on its own through a sort of trial-and-error process. In a second, QuantiPhy was pre-fed step-by-step processes applied by humans to make accurate calculations. Surprisingly, the end-to-end learning approach – without explicit, hand-engineered reasoning steps – performed best. The results suggest that forcing models to follow human-designed reasoning steps can sometimes hinder quantitative learning.

“We tried to give the model a head start by prompting it to, first, count the number of pixels in the image frame to estimate the size of various objects in the image, then transform that scale into real-world units,” Xiang explained of the team’s process. “Surprisingly, however, the direct, unprompted approach worked better. AI learned best on its own.”
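
As a rough illustration (not the team’s actual code), the prompted “head start” amounts to a two-step calculation: measure objects in pixels, then convert those measurements to real-world units via a reference object of known size. The object sizes below are made up for the example.

```python
# Rough illustration of the hand-engineered "head start" described above:
# step 1, measure objects in pixels; step 2, convert to real-world units
# via a reference object of known size. All numbers are illustrative.

def pixel_scale(reference_size_m: float, reference_size_px: float) -> float:
    """Meters represented by one pixel, derived from a known reference object."""
    return reference_size_m / reference_size_px

def estimate_size_m(object_size_px: float, scale_m_per_px: float) -> float:
    """Convert a pixel measurement into real-world units."""
    return object_size_px * scale_m_per_px

scale = pixel_scale(reference_size_m=0.057, reference_size_px=38.0)  # pool ball
cue_stick_m = estimate_size_m(object_size_px=960.0, scale_m_per_px=scale)
print(f"estimated cue length ≈ {cue_stick_m:.2f} m")   # ≈ 1.44 m
```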

A major finding of the project is that VLMs rely too much on pretrained world knowledge, points out co-first author Puyin Li, a graduate student in the STAI and SVL labs. That is, they use memorized facts instead of visual inputs. “Their approach is more like guessing than reasoning,” Li says. “The evidence from our tests supports this.”

For instance, Li says that in tests, the VLMs generally performed better with complex scenes, which provide greater opportunity for “guessing” while also making accurate object detection and measurement more difficult. Likewise, VLMs perform “terribly” when presented with counterfactual contexts. In one video, the team told the VLM to assume a car in the scene was 6,000 meters long and asked it to estimate the car’s width. Where a human might adapt and reason according to the proportional shift, VLMs tended to “hallucinate” in such situations. Lastly, VLMs responded reasonably well to QuantiPhy’s questions even when no video was provided. 
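
The proportional reasoning a human might apply to that counterfactual can be sketched in a few lines; the typical car dimensions used here are rough assumptions, not values from the study.

```python
# Sketch of the proportional adjustment a human might make given the
# counterfactual prompt: if the car is claimed to be 6,000 m long, scale
# its width by the same factor. Typical car dimensions are rough assumptions.

TYPICAL_LENGTH_M = 4.5
TYPICAL_WIDTH_M = 1.8

claimed_length_m = 6000.0
scale_factor = claimed_length_m / TYPICAL_LENGTH_M    # ≈ 1333x longer than typical
consistent_width_m = TYPICAL_WIDTH_M * scale_factor   # ≈ 2400 m

print(f"a proportionally consistent width would be ≈ {consistent_width_m:.0f} m")
```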

“VLMs are very successful guessers,” Li explained – producing plausible answers even when those answers are not grounded in visual measurement.

Tomorrow and Beyond

Down the road, better physical reasoning could have profound implications. In healthcare, QuantiPhy could aid in precision robotic surgery. In autonomous diagnostics, it could help analyze medical images and note physical changes. In domestic robotics, physical comprehension could enhance robots’ ability to interact with their environment to become better companions and collaborators. Autonomous vehicles should likewise benefit from improved spatial reasoning to enhance their safety and efficiency. 

The team next hopes to refine QuantiPhy’s reasoning capabilities in three dimensions using multi-camera inputs, enabling QuantiPhy to make ever more accurate spatial calculations and to improve vision language models in more complex settings, such as rotational dynamics (think spinning balls and turbines), deformable objects (in surgery or manufacturing), varied camera perspectives, and complex multibody interactions (from automobiles to spacecraft and advanced robotics).

“We’re excited to pioneer what we believe to be a new field of AI,” Xiang concluded. “We believe that the future of robotics depends on AI with the sort of sophisticated physical reasoning that QuantiPhy is only beginning to reveal.”

Learn more by visiting the QuantiPhy website or reading the paper.

Contributing authors include graduate students Ella Mao, Shirley Wei, and Xinye Chen; Adnan Masood, PhD, of UST; and Fei-Fei Li, co-director of the Stanford Institute for Human-Centered AI (HAI). 

Contributor(s)
Andrew Myers

Related News

AI Reveals How Brain Activity Unfolds Over Time
Andrew Myers, Jan 21, 2026 (News)
Stanford researchers have developed a deep learning model that transforms overwhelming brain data into clear trajectories, opening new possibilities for understanding thought, emotion, and neurological disease.

Spatial Intelligence Is AI’s Next Frontier
TIME, Dec 11, 2025 (Media Mention)
"This is AI’s next frontier, and why 2025 was such a pivotal year," writes HAI Co-Director Fei-Fei Li.

The Architects of AI Are TIME’s 2025 Person of the Year
TIME, Dec 11, 2025 (Media Mention)
HAI founding co-director Fei-Fei Li has been named one of TIME's 2025 Persons of the Year. From ImageNet to her advocacy for human-centered AI, Dr. Li has been a guiding light in the field.