
This brief introduces a quantitative framework that allows policymakers to evaluate the behavior of language models to assess what kinds of opinions they reflect.

As AI-hallucinated case citations flood the courts, judges have increased fines for attorneys who have cited fake cases. HAI Policy Fellow Riana Pfefferkorn hopes this will "make the firm sit up and pay better attention."
This paper presents the first few-shot LLM-based chatbot that almost never hallucinates while maintaining high conversationality and low latency. WikiChat is grounded in the English Wikipedia, the largest curated free-text corpus. WikiChat generates a response from an LLM, retains only the grounded facts, and combines them with additional information it retrieves from the corpus to form factual and engaging responses. We distill the GPT-4-based WikiChat into a 7B-parameter LLaMA model with minimal loss of quality, significantly improving its latency, cost, and privacy, and facilitating research and deployment. Using a novel hybrid human-and-LLM evaluation methodology, we show that our best system achieves 97.3% factual accuracy in simulated conversations. It significantly outperforms all retrieval-based and LLM-based baselines, beating GPT-4 by 3.9%, 38.6%, and 51.0% on head, tail, and recent knowledge, respectively. Compared to previous state-of-the-art retrieval-based chatbots, WikiChat is also significantly more informative and engaging, just like an LLM. WikiChat achieves 97.9% factual accuracy in conversations with human users about recent topics, 55.0% better than GPT-4, while receiving significantly higher user ratings and more favorable comments.
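The generate-then-ground loop the abstract describes can be sketched very loosely in Python. Everything below is a toy stand-in under stated assumptions: retrieval is naive word overlap and the grounding check is a word-subset test, whereas WikiChat's actual components are LLM- and retriever-based; the function names are hypothetical, not from the paper's code.

```python
# Toy sketch of a WikiChat-style pipeline: draft claims come from an LLM,
# only claims supported by retrieved passages are kept, and retrieved
# facts are merged in. Retrieval and grounding here are simplistic
# keyword heuristics, not the paper's actual methods.

def retrieve(query, corpus, k=2):
    """Rank corpus passages by naive word overlap with the query."""
    q = set(query.lower().split())
    ranked = sorted(corpus, key=lambda p: -len(q & set(p.lower().split())))
    return ranked[:k]

def is_grounded(claim, passages):
    """Toy check: a claim is grounded if all its words occur in one passage."""
    words = set(claim.lower().split())
    return any(words <= set(p.lower().split()) for p in passages)

def grounded_response(draft_claims, query, corpus):
    """Keep only grounded draft claims, then append retrieved facts."""
    passages = retrieve(query, corpus)
    kept = [c for c in draft_claims if is_grounded(c, passages)]
    return kept + [p for p in passages if p not in kept]
```

A hallucinated draft claim (one containing words absent from every retrieved passage) is dropped, while grounded claims and retrieved facts survive, which mirrors, in miniature, the "retain only the grounded facts, then combine with retrieved information" step.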
Shana Lynch, HAI Head of Content and Associate Director of Communications, noted that the "era of AI evangelism is giving way to an era of AI evaluation" in her AI predictions piece, for which she interviewed several Stanford AI experts on their insights into AI's impacts in 2026.

A diversity of perspectives from Stanford leaders in medicine, science, engineering, the humanities, and the social sciences on how generative AI might affect their fields and our world.


Readers wanted to know if their therapy chatbot could be trusted, whether their boss was automating the wrong job, and if their private conversations were training tomorrow's models.
