


Researchers used artificial intelligence to analyze the brain scans of students solving math problems, offering the first-ever peek into the neuroscience of math disabilities.

As Large Language Models (LLMs) become increasingly integrated into our everyday lives, understanding their ability to comprehend human mental states becomes critical for ensuring effective interactions. However, despite recent attempts to assess the Theory-of-Mind (ToM) reasoning capabilities of LLMs, the degree to which these models align with human ToM remains a nuanced topic of exploration. This is primarily due to two distinct challenges: (1) inconsistent results from previous evaluations, and (2) concerns about the validity of existing evaluation methodologies. To address these challenges, we present a novel framework for procedurally generating evaluations with LLMs by populating causal templates. Using our framework, we create a new social reasoning benchmark (BigToM) for LLMs, which consists of 25 controls and 5,000 model-written evaluations. We find that human participants rate the quality of our benchmark higher than previous crowd-sourced evaluations and comparable to expert-written evaluations. Using BigToM, we evaluate the social reasoning capabilities of a variety of LLMs and compare model performance with human performance. Our results suggest that GPT4 has ToM capabilities that mirror human inference patterns, though less reliably, while other LLMs struggle.


Vanessa Parli, Stanford HAI Director of Research and AI Index Steering Committee member, notes that the 2025 AI Index reports flourishing and higher-quality academic research in AI.
This study aims to understand the impact of instabilities and turbulence arising from canopy mixing layers on wind-driven wildfire spread. Using an experimental water flume setup with a model vegetation canopy and thermally buoyant plumes, we study the influence of canopy-induced shear and turbulence on the behavior of buoyant plume trajectories. Using the length of the canopy upstream of the plume source to vary the strength of the canopy turbulence, we observed the behavior of the plume trajectory under varying turbulence yet constant cross-flow conditions. Results indicate that increasing canopy turbulence corresponds to stronger vertical oscillatory motion and greater variability in the plume trajectory and position. Furthermore, we find that the canopy coherent structures characterized at the plume source set the intensity and frequency at which the plume oscillates. These perturbations then move longitudinally along the length of the plume at the free-stream velocity. However, the buoyancy developed by the plume can resist this impact of the canopy structures. Due to these competing effects, the oscillatory behavior of plumes in canopy systems is more pronounced in systems where the canopy turbulence is dominant. These effects also influence the mixing and entrainment of the plumes. We offer scaling analyses to find flow regimes in which canopy-induced turbulence would be relevant to plume dynamics.