
Video-STaR: A Self-Training Approach to Video

April 18, 2025

The advance could lead to visually aware AI coaches and teachers that can do everything from correcting a golf swing to training better surgeons.

Today’s large language models show impressive capabilities in interpreting and generating text. Many scholars dream of similar tools that can interpret videos and images, but those hopes have remained elusive, largely because there is too little textual information describing videos with which to train models. Hiring humans to provide text descriptions, a process known as “labeling,” would be time-consuming and expensive.

That may change with the release of Video Self-Training with augmented Reasoning, or Video-STaR, which was recently published at the International Conference on Learning Representations by a team of researchers from Google Research and Stanford University. Video-STaR enables models to train themselves to accurately reason about and describe the actions in videos and images given only auxiliary video metadata and labels. It could lead to a massive, GPT-like dataset for videos, the researchers said.

“Video-STaR allows AI to engage with dynamic, real-world actions in a way that previous models simply couldn’t,” said doctoral student Orr Zohar, first author of the paper describing Video-STaR. “It could open up exciting new avenues for how AI learns from video data,” said senior author of the paper, Serena Yeung-Levy, an assistant professor of biomedical data science at Stanford School of Medicine. Her lab is developing tools for AI-assisted surgery and surgical skill evaluation. “We hope that these methods will enable a lot of biomedical applications,” Yeung-Levy said.

The researchers imagine visually aware AI instructors that can analyze videos of human activities, from a golf swing to surgery, and provide real-time technique analysis and corrective feedback.

Reasonable Outcomes

Existing video datasets have proved overly simplistic, at best providing mere descriptions rather than deep reasoning about the content and actions in the videos. The key innovation in Video-STaR is its ability to take advantage of any labeled video dataset, no matter how minimal or extensive the labeling.

Video-STaR uses self-training cycles to improve its comprehension. The model is first prompted to answer questions about video content. Those answers are then filtered to keep only the ones that contain the original video labels; in other words, Video-STaR discards answers that do not match the known labels. Then, Video-STaR re-trains itself on these newly generated answers to improve its analytical skills.

“In effect, Video-STaR utilizes the existing labels as a form of supervision, a way to check its work,” Zohar said. “We found that these models learn to reason as an emergent behavior.”

“The model continuously refines itself,” Zohar added, “and over time generates richer and more accurate responses. This self-training mechanism not only reduces the need for costly human annotations but also makes it possible to train video-language models on a much larger and more diverse set of data.”
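
Conceptually, the cycle Zohar describes can be summarized in a few lines of code. The sketch below is a hypothetical illustration rather than the paper's implementation: it assumes a generic video-language model exposed through two stand-in callables, generate_answer and fine_tune, and a dataset of (video, question, label) triples.

# Minimal sketch (not the paper's code) of a Video-STaR-style self-training cycle.
# generate_answer and fine_tune stand in for a real video-language model's
# inference and training steps; the dataset holds (video, question, label) triples.

def label_verified(candidates):
    """Keep only generated answers that contain the video's known label."""
    return [
        (video, question, answer)
        for video, question, answer, label in candidates
        if label.lower() in answer.lower()
    ]

def video_star_cycle(generate_answer, fine_tune, dataset, num_cycles=3):
    """Repeat: answer questions, filter answers against labels, re-train on survivors."""
    for _ in range(num_cycles):
        candidates = [
            (video, question, generate_answer(video, question), label)
            for video, question, label in dataset
        ]
        verified = label_verified(candidates)
        fine_tune(verified)  # the model is re-trained on its own verified answers

Because only answers that agree with the known labels survive each cycle, those labels serve as the check on the model's work that Zohar describes.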

In one example presented in the paper, Video-STaR analyzed videos of a diving competition, correctly assessing the number of somersaults performed, identifying the diver’s tuck position, and evaluating the entry. It then appraised the dive as “quite tough” and issued a reasonably accurate degree-of-difficulty score of 64.68, for a dive that had been rated at 65.6 by human competition judges.

Future Focus

Video-STaR’s potential applications extend beyond improving AI’s ability to answer questions about videos, opening a world of possibilities in fields such as robotics, sports performance analysis, education, and even surgery. In sports, a tool like this could evaluate a golfer’s swing, a tennis player’s stroke, or a gymnast’s routine and offer insights on how to improve their techniques.

Professor Yeung-Levy imagines similar AI-enabled medical instructors. “For me, one major goal is being able to assess the quality of surgical performance through video analysis,” Yeung-Levy explained. Video-STaR could lead to AI systems that provide constructive feedback on a surgeon’s technique and train more and better surgeons. “Ultimately, it could improve outcomes for patients,” she said.

Future research will likely focus on improving the label filtering process and extending Video-STaR’s capabilities to more complex, longer-form videos, Zohar noted.

"The goal is for AI to be able to engage in real conversations about video content, where the user can ask follow-up questions and the model is able to make deeper connections between actions and events in the video content,” Zohar said. “That’s the next frontier."

Other contributors include Stanford postdoctoral scholar Xiaohan Wang and Google researchers Yonattan Bitton and Idan Szpektor. Video-STaR is a collaboration with Google Research through the Stanford Institute for Human-Centered Artificial Intelligence’s Industry Affiliates Program. The work was also supported in part by the National Science Foundation and the Knight-Hennessy Scholars Foundation.

Contributor: Andrew Myers
