
Video-STaR: A Self-Training Approach to Video

April 18, 2025

The advance could lead to visually aware AI coaches and teachers that can do everything from correcting a golf swing to training better surgeons.

Today’s large language models show impressive capabilities in interpreting and generating text. Many scholars dream of similar tools that can interpret videos and images, but those hopes have remained elusive, largely because there is too little textual information describing videos with which to train models. Hiring humans to provide text descriptions, a process known as “labeling,” would be time-consuming and expensive.

That may change with the release of Video Self-Training with augmented Reasoning, or Video-STaR, which was recently published at the International Conference on Learning Representations by a team of researchers from Google Research and Stanford University. Video-STaR enables models to train themselves to accurately reason about and describe the actions in videos and images given only auxiliary video metadata and labels. It could lead to a massive, GPT-like dataset for videos, the researchers said.

“Video-STaR allows AI to engage with dynamic, real-world actions in a way that previous models simply couldn’t,” said doctoral student Orr Zohar, first author of the paper describing Video-STaR. “It could open up exciting new avenues for how AI learns from video data,” said senior author of the paper, Serena Yeung-Levy, an assistant professor of biomedical data science at Stanford School of Medicine. Her lab is developing tools for AI-assisted surgery and surgical skill evaluation. “We hope that these methods will enable a lot of biomedical applications,” Yeung-Levy said.

The researchers imagine visually aware AI instructors that can analyze videos of human activities, from a golf swing to surgery, and provide real-time technique analysis and corrective feedback.

Reasonable Outcomes

Existing video datasets have proved overly simplistic, at best providing mere descriptions rather than deep reasoning about the content and actions in the videos. The key innovation in Video-STaR is its ability to take advantage of any labeled video dataset, no matter how minimal or extensive the labeling.

Video-STaR uses self-training cycles to improve its comprehension. The model is first prompted to answer questions about video content. Those answers are then filtered to keep only the ones that contain the original video labels; in other words, Video-STaR discards answers that do not match the known labels. Then, Video-STaR re-trains itself on these newly generated answers to improve its analytical skills.

“In effect, Video-STaR utilizes the existing labels as a form of supervision, a way to check its work,” Zohar said. “We found that these models learn to reason as an emergent behavior.”

“The model continuously refines itself,” Zohar added, “and over time generates richer and more accurate responses. This self-training mechanism not only reduces the need for costly human annotations but also makes it possible to train video-language models on a much larger and more diverse set of data.”
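
Conceptually, the cycle Zohar describes can be summarized in a few lines of code. The sketch below is a hypothetical illustration rather than the paper's implementation: it assumes a generic video-language model exposed through two stand-in callables, generate_answer and fine_tune, and a dataset of (video, question, label) triples.

# Minimal sketch (not the paper's code) of a Video-STaR-style self-training cycle.
# generate_answer and fine_tune stand in for a real video-language model's
# inference and training steps; the dataset holds (video, question, label) triples.

def label_verified(candidates):
    """Keep only generated answers that contain the video's known label."""
    return [
        (video, question, answer)
        for video, question, answer, label in candidates
        if label.lower() in answer.lower()
    ]

def video_star_cycle(generate_answer, fine_tune, dataset, num_cycles=3):
    """Repeat: answer questions, filter answers against labels, re-train on survivors."""
    for _ in range(num_cycles):
        candidates = [
            (video, question, generate_answer(video, question), label)
            for video, question, label in dataset
        ]
        verified = label_verified(candidates)
        fine_tune(verified)  # the model is re-trained on its own verified answers

Because only answers that agree with the known labels survive each cycle, those labels serve as the check on the model's work that Zohar describes.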

In one example presented in the paper, Video-STaR analyzed videos of a diving competition, correctly assessing the number of somersaults performed, identifying the diver’s tuck position, and evaluating the entry. It then appraised the dive as “quite tough” and issued a reasonably accurate degree-of-difficulty score of 64.68, for a dive that had been rated at 65.6 by human competition judges.

Future Focus

Video-STaR’s potential applications extend beyond improving AI’s ability to answer questions about videos, opening a world of possibilities in fields such as robotics, sports performance analysis, education, and even surgery. In sports, a tool like this could evaluate a golfer’s swing, a tennis player’s stroke, or a gymnast’s routine and offer insights on how to improve their techniques.

Professor Yeung-Levy imagines similar AI-enabled medical instructors. “For me, one major goal is being able to assess the quality of surgical performance through video analysis,” Yeung-Levy explained. Video-STaR could lead to AI systems that provide constructive feedback on a surgeon’s technique and train more and better surgeons. “Ultimately, it could improve outcomes for patients,” she said.

Future research will likely focus on improving the label filtering process and extending Video-STaR’s capabilities to more complex, longer-form videos, Zohar noted.

"The goal is for AI to be able to engage in real conversations about video content, where the user can ask follow-up questions and the model is able to make deeper connections between actions and events in the video content,” Zohar said. “That’s the next frontier."

Other contributors include Stanford postdoctoral scholar Xiaohan Wang and Google researchers Yonattan Bitton and Idan Szpektor. Video-STaR is a collaboration with Google Research through the Stanford Institute for Human-Centered Artificial Intelligence’s Industry Affiliates Program. The work was also supported in part by the National Science Foundation and the Knight-Hennessy Scholars Foundation.

Contributor: Andrew Myers
