AI Helps Math Teachers Build Better "Scaffolds"

Education researchers have evaluated the ability of large language models (LLMs) to help middle school math teachers structure tiered lessons to reach diverse skill levels — a strategy called scaffolding.
To those outside of education, it may come as a surprise that the hardest aspect of teaching is often not what happens in the classroom, but the preparation that must happen outside it, beyond normal work hours. The toughest work is in the planning and structuring of lessons for classes with students of varying knowledge and skill levels. And, with the learning loss of the pandemic, America’s classrooms — particularly middle school classes — are more than ever filled with students of diverse skill levels.
Against that backdrop, education and computer science researchers at Stanford University have evaluated large language models’ ability to help middle school math teachers create tiered lessons that allow them to nurture those who might have fallen behind while simultaneously holding the interest of more advanced students. Everyone wins, the researchers say, most of all the teachers, for whom the model is a valuable thought partner that surfaces ideas they might not have considered themselves.
“Teachers spend so much time adapting curricula to their students’ needs, but no one is really asking — how can we support them in that process?” says Rizwaan Malik, a Knight-Hennessy Scholar studying education data science at the Stanford Graduate School of Education. Malik is first author on a new study, published in the British Journal of Educational Technology, introducing the task and evaluation framework.
The paper introduces the first evaluation framework for lesson scaffolding grounded in expert teachers’ processes and the first experiments that test and adapt LLMs for this task.
“The idea of scaffolding is trying to put in supports to the curriculum that help all students, regardless of where they are, to access the content in the curriculum,” says Dora Demszky, professor of education data science and senior author of the paper. Demszky’s and Malik’s work was supported by the Stanford Institute for Human-Centered AI (HAI) seed grant program.
Studying Teachers To Train the Model
Before they began experimenting with LLMs, Demszky and Malik analyzed teachers’ lesson planning to understand the fundamentals of scaffolding. This is perhaps the hardest part of the process, says Malik, a former math teacher familiar with the vagaries and time commitment of lesson planning.
“The premise of the project was to see what technology can do to help teachers with that process of taking a curriculum and making it classroom ready,” Malik says. “We’re not just creating a tool, but a framework that helps teachers scaffold curriculum effectively, ensuring AI-generated content aligns with real classroom needs.”
In their analysis, they identified three steps teachers go through in creating lesson plans: observation (evaluating their students’ skill levels), formulation of an instructional strategy, and implementation through a scaffolded lesson plan that meets the needs of all students.
A Better Warmup
The AI model was designed to generate “warmup” exercises that help students activate prior knowledge. In user evaluations, these AI-generated exercises were rated better than human-created ones in terms of accessibility, alignment with learning objectives, and teacher preference.
The highest-rated approach fed the model an additional dataset of original curriculum materials and used complex and nuanced prompts informed by an expert educator (Figure 1).

Figure 1: An example prompt provided in the paper.
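In spirit, the highest-rated setup resembles a retrieval-augmented prompt: an excerpt from the original curriculum is embedded in an expert-informed instruction template before being sent to the model. The template wording and function name below are illustrative assumptions for the sketch, not the study’s actual prompt.

```python
def build_warmup_prompt(curriculum_excerpt: str, learning_objective: str) -> str:
    """Assemble an expert-informed prompt grounded in the original
    curriculum material (illustrative sketch, not the paper's prompt)."""
    return (
        "You are an experienced middle school math teacher.\n"
        "Using ONLY the curriculum excerpt below, write a short warmup "
        "exercise that activates students' prior knowledge.\n"
        "Keep the task accessible to students who have fallen behind, "
        "while preserving mathematical rigor for advanced students.\n\n"
        f"Learning objective: {learning_objective}\n"
        f"Curriculum excerpt:\n{curriculum_excerpt}\n"
    )

# The assembled prompt would then be sent to an LLM API of choice.
prompt = build_warmup_prompt(
    curriculum_excerpt="Lesson 4.2: Solving two-step linear equations...",
    learning_objective="Students can solve equations of the form ax + b = c.",
)
print(prompt)
```

The key design choice the researchers highlight is grounding: constraining the model to the district’s own curriculum materials keeps the generated warmup aligned with the lesson’s learning objectives rather than generic math content.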
“Maintaining rigor while supporting students with different needs is crucial — simplifying too much only increases learning gaps,” says Demszky.
AI is not without limitations, the researchers stress. LLMs are quite good at generating text-based content, such as story problems and written descriptions, but they struggle with visual content like diagrams and graphs, which is an essential component of math instruction. The researchers are working to address these limitations now. Their most recent paper, under review, tackles the specific challenges of diagram generation with the first benchmark for K-12 math diagrams.
Next Steps
In future iterations, Demszky and Malik plan to expand the dataset to include instructional scaffolds beyond warmups. To further hone the tool, they would also like to pilot it in a real classroom. Finally, they are looking into personalized scaffolding strategies tailored to specific classrooms and, perhaps, even individual students.
Despite the promising results, neither researcher imagines a day when AI replaces teachers as lesson planners; instead, they expect AI to serve as a valuable thought partner that helps educators work more efficiently while improving student learning.
“The key thesis underpinning all our work is that nothing can ever replace a teacher,” Malik concludes. “AI should augment, not substitute, their expertise.”
Additional Stanford authors include Dorna Abdi, a graduate of the Education Data Science master’s program, and Rose Wang, a doctoral candidate in computer science, both members of the EduNLP lab.