Developing new classroom curricula is a complex, time-consuming process. Instructors must create lessons and then run experiments with numerous students under different conditions to ensure they work for all learners.
Stanford scholars at the intersection of AI and education posed an interesting question: Could AI improve the process? In a recently published study, they show how large language models (LLMs) can mimic the experts who create and evaluate new materials to assist curriculum designers in getting more high-quality education content to students faster.
“In traditional methods, instructors design every detail, from what topics to cover to example problems for students to solve to supporting videos and other media. Then they test the material on students to see what’s effective,” says Joy He-Yueya, a computer science PhD student who is part of the Stanford AI Lab (SAIL). “It’s a slow process with many logistical challenges. We thought, there might be a better way.”
With support from a multiyear Hoffman-Yee Research Grant, He-Yueya and her co-advisors, Emma Brunskill, associate professor of computer science in the Stanford School of Engineering, and Noah D. Goodman, associate professor of psychology in the Stanford School of Humanities and Sciences, started brainstorming alternative approaches.
Previously, AI researchers had tried to build computational models of student learning that could be used to optimize instructional materials; however, this approach fell short due to the difficulty of modeling the cognitive dynamics of human students. Instead, the trio wondered if a model could be trained to act like a teacher and use its own judgment to evaluate new learning materials.
AI as Instructor
First, the scholars needed to verify whether an LLM could be an effective evaluator of educational materials. In a simulated expert evaluation, the scholars asked GPT-3.5 to consider a student’s prior knowledge of a math concept, along with a specific set of word problems, and predict the student’s performance on test questions administered after the lesson. For this phase of research, the team wanted to understand whether certain learning materials are effective for different student personas, such as eighth graders learning algebra or fifth graders struggling with fractions.
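To make this kind of simulated evaluation concrete, the sketch below shows roughly what a single prediction call might look like. The prompt wording, persona, and problems are illustrative assumptions, not the prompts used in the study; the sketch assumes the openai Python package (version 1.0 or later) and an API key in the environment.

```python
# Hypothetical sketch of an LLM-as-evaluator call (not the study's actual prompt).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative inputs: a student persona, a practice worksheet, and a post-test question.
persona = "an eighth grader who has just been introduced to systems of linear equations"
worksheet = "1) x + y = 10 and x - y = 2. Solve for x and y.\n2) 2a + b = 7 and a - b = 2. Solve for a and b."
post_test_question = (
    "Tickets cost $5 for kids and $8 for adults. Twelve tickets sold for $75. "
    "How many kid and adult tickets were sold?"
)

prompt = (
    f"You are an experienced math teacher. A student, {persona}, "
    f"studies the following worksheet:\n{worksheet}\n\n"
    f"Predict whether the student will correctly answer this post-test question:\n"
    f"{post_test_question}\n"
    "Answer 'correct' or 'incorrect' and give a one-sentence justification."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(response.choices[0].message.content)
```

Repeating a call like this across personas and worksheets yields the kind of predicted-performance data the researchers used to probe the model's judgments.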
To assess the model’s capabilities as a simulated educational expert, the scholars decided to run a small set of basic tests to see if the model’s curriculum evaluations could replicate two well-known phenomena in educational psychology. The first is that instructional strategies need to change as a learner’s skills develop. While beginners benefit from structured guidance in the materials, more proficient students perform better with minimal guidance. The Stanford team reasoned that if the LLM replicated this “Expertise Reversal Effect” in its assessments of learning materials, this would be a good indicator of the AI’s potential for mimicking human teachers. In the team’s tests, the model’s predictions did reproduce this pattern.
According to the second phenomenon, called the “Variability Effect,” introducing a greater variety of practice problems doesn’t always help students master a concept because it can overload their memory capacity. Less is more, in other words.
When the scholars tasked the model with evaluating math word problems involving systems of equations across different groups of simulated students, the results once again echoed this known pattern of outcomes.
The Instruction Optimization Approach
Once they had confirmed the potential for an AI instructor to evaluate new materials, the scholars turned their attention to the question of whether a pair of models could work together to optimize educational content. They proposed a pipeline approach in which one model generates new educational materials and the other evaluates them by predicting students’ learning outcomes, as measured by post-test scores. They applied this Instruction Optimization Approach to develop new math word problem worksheets.
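A rough sketch of such a generate-then-evaluate loop appears below. The function names, prompts, and scoring rule are illustrative assumptions rather than the paper's implementation, and for simplicity the same model plays both the generator and evaluator roles.

```python
# Illustrative generator/evaluator pipeline (prompts and helper names are hypothetical).
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    """Single LLM call; both the generator and the evaluator use the same model here."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return resp.choices[0].message.content

def generate_worksheet(topic: str) -> str:
    """Generator model drafts a candidate worksheet for the given topic."""
    return ask(f"Write a worksheet of four word problems that teach {topic}.")

def predict_post_test_score(worksheet: str, persona: str, post_test: str) -> float:
    """Evaluator model predicts, question by question, whether the simulated
    student would answer correctly after studying the worksheet."""
    answer = ask(
        f"You are a teaching expert. A student ({persona}) studies this worksheet:\n"
        f"{worksheet}\n\nFor each post-test question below, reply on its own line "
        f"with 'correct' or 'incorrect':\n{post_test}"
    )
    predictions = [line.lower() for line in answer.splitlines() if line.strip()]
    correct = sum("correct" in p and "incorrect" not in p for p in predictions)
    return correct / max(len(predictions), 1)

# Generate several candidate worksheets and keep the one with the best predicted outcome.
topic = "systems of linear equations"
persona = "an eighth grader new to algebra"
post_test = "Q1: Solve x + y = 6, x - y = 2.\nQ2: Two numbers sum to 15 and differ by 3. Find them."
candidates = [generate_worksheet(topic) for _ in range(3)]
best_worksheet = max(candidates, key=lambda w: predict_post_test_score(w, persona, post_test))
```

In this toy version the evaluator simply ranks a handful of candidates; the key idea is that the predicted post-test score stands in for the slow, expensive classroom testing that traditional curriculum design requires.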
Overall, the AI approach performed well: In a study involving 95 people with teaching experience, those experts generally concurred with the AI evaluator on which AI-generated worksheets would be more effective. The scholars noted a few exceptions, where teachers did not perceive a significant difference between worksheets that the AI evaluator rated as significantly different. The findings from this research are detailed in a 2024 paper published at the Educational Data Mining Conference: “Evaluating and Optimizing Educational Content with Large Language Model Judgments.”
“While LLMs should not be viewed as a replacement for teaching expertise or real data about what best supports students, our hope is that this approach could help support teachers and instructional designers,” Brunskill said.