Toward Responsible Development and Evaluation of LLMs in Psychotherapy | Stanford HAI

Policy Brief

Toward Responsible Development and Evaluation of LLMs in Psychotherapy

Date: June 13, 2024
Topics: Healthcare
Abstract

This brief reviews the current landscape of LLMs developed for psychotherapy and proposes a framework for evaluating the readiness of these AI tools for clinical deployment.

Key Takeaways

  • Large language models (LLMs) hold promise for supporting, augmenting, and even automating psychotherapy through tasks ranging from note-taking during interviews to assessment and delivering therapy.

  • However, psychotherapy is a uniquely complex, high-stakes domain. The use of LLMs in this field poses wide-ranging safety, legal, and ethical concerns.

  • We propose a framework for evaluating and reporting on whether AI applications are ready for clinical deployment in behavioral health contexts based on safety, confidentiality/privacy, equity, effectiveness, and implementation concerns.

  • Policymakers and behavioral health practitioners should proceed cautiously when integrating LLMs into psychotherapy. Product developers should integrate evidence-based psychotherapy expertise and conduct comprehensive effectiveness and safety evaluations of clinical LLMs.

Executive Summary

There is growing enthusiasm about the potential of OpenAI’s GPT-4, Google’s Gemini, Anthropic’s Claude, and other large language models (LLMs) to support, augment, and even fully automate psychotherapy. By serving as conversational agents, LLMs could help address the shortage of mental healthcare services, problems with individual access to care, and other challenges. In fact, behavioral healthcare specialists are beginning to use LLMs for tasks such as note-taking, while consumers are already conversing with LLM-powered therapy chatbots.

However, psychotherapy is a uniquely complex, high-stakes domain. Responsible and evidence-based therapy requires nuanced expertise. When an LLM is used for productivity purposes, the worst outcome may be a failure to maximize efficiency; in behavioral healthcare, the stakes may include the improper handling of suicide risk.

Our paper, “Large Language Models Could Change the Future of Behavioral Healthcare,” provides a road map for the responsible application of clinical LLMs in psychotherapy. We provide an overview of the current landscape of clinical LLM applications and analyze the different stages of integration into psychotherapy. We discuss the risks of these LLM applications and offer recommendations for guiding their responsible development.

In a more recent paper, “Readiness for AI Deployment and Implementation (READI): A Proposed Framework for the Evaluation of AI-Mental Health Applications,” we build on our prior work and propose a new framework for evaluating whether AI mental health applications are ready for clinical deployment.

This work underscores the need for policymakers to understand the nuances of how LLMs are already, or could soon be, integrated in psychotherapy environments as researchers and industry race to develop AI mental health applications. Policymakers have the opportunity and responsibility to ensure that the field evaluates these innovations carefully, taking into consideration their potential limitations, ethical considerations, and risks.

Introduction

The use of AI in psychotherapy is not a new phenomenon. Decades before the emergence of mainstream LLMs, researchers and practitioners used AI applications, such as natural language processing models, in behavioral health settings. For instance, various research experiments used machine learning and natural language processing to detect suicide risk, identify homework resulting from psychotherapy sessions, and evaluate patient emotions. More recently, mental health chatbots such as Woebot and Tessa have applied rules-based AI techniques to target depression and eating pathology. Yet they frequently struggle to respond to user inputs and have high dropout rates and low user engagement.

LLMs have the potential to fill some of these gaps and change many aspects of psychotherapy care thanks to their ability to parse human language, generate human-like and context-dependent responses, annotate text, and flexibly adopt different conversational styles.

However, while LLMs show vast promise in performing certain tasks and skills associated with psychotherapy, clinical LLM products and prototypes are not yet sophisticated enough to replace psychotherapy. There is a gap between simulating therapy skills and implementing them to alleviate patient suffering. To achieve the implementation piece, clinical LLMs need to be tailored to psychotherapy contexts using prompt engineering—structuring a set of instructions so they can be understood by an AI model—or fine-tuning techniques that use curated datasets to train the LLM.
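To make the idea of prompt engineering concrete, the following is a minimal sketch of how a set of instructions might be structured before being sent to a model. It is illustrative only and is not drawn from the papers discussed here: the `PromptSpec` class, the `build_system_prompt` function, and the example instructions are all hypothetical, and no particular model or vendor API is assumed.

```python
from dataclasses import dataclass, field


@dataclass
class PromptSpec:
    """Hypothetical container for the pieces of a structured prompt."""
    role: str
    instructions: list[str] = field(default_factory=list)
    safety_rules: list[str] = field(default_factory=list)


def build_system_prompt(spec: PromptSpec) -> str:
    """Assemble the pieces into a single block of instructions."""
    lines = [f"Role: {spec.role}", "", "Instructions:"]
    # Number the task instructions so their order is explicit to the model.
    lines += [f"{i}. {text}" for i, text in enumerate(spec.instructions, start=1)]
    lines += ["", "Safety rules (always take priority):"]
    lines += [f"- {rule}" for rule in spec.safety_rules]
    return "\n".join(lines)


# Illustrative content only; a real clinical prompt would need expert review.
spec = PromptSpec(
    role="Note-taking assistant for a licensed clinician",
    instructions=[
        "Summarize the session transcript in neutral language.",
        "Do not offer diagnoses or treatment advice.",
    ],
    safety_rules=[
        "Flag any mention of suicide or self-harm for clinician review.",
    ],
)
prompt = build_system_prompt(spec)
print(prompt)
```

Keeping the role, task instructions, and safety rules as separate fields is one possible design choice: it lets each part be reviewed and revised on its own, which may matter when clinical experts are asked to vet the wording. Fine-tuning, by contrast, changes the model itself using curated data and is not shown here.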

As LLMs are increasingly used in psychotherapy, it is essential to understand the complexity and stakes at play: In the worst-case scenario, an “LLM co-pilot” functioning poorly could lead to the improper handling of the risk of suicide or homicide. While clinical LLMs are, of course, not the only AI applications that may involve life-or-death decisions—consider self-driving cars, for example—predicting and mitigating risk in psychotherapy is unique. It requires conceptualizing complex cases, considering social and cultural contexts, and addressing unpredictable human behavior. Poor outcomes or ethical transgressions from clinical LLMs could seriously harm individuals and undermine public trust in behavioral healthcare as a field, as has been seen in other domains.

Beginning with an overview of the clinical LLMs in use today, our first paper reviews the current landscape of clinical LLM development. We examine how clinical LLMs progress across different stages of integration and identify specific ethical and other concerns related to their use in different scenarios. We then make recommendations for how to responsibly approach the development of LLMs for use in behavioral health settings. In our second paper, we propose a framework that could be used by developers, researchers, clinicians, and policymakers to evaluate and report on the readiness of generative AI mental health applications for clinical deployment.

Authors
  • Elizabeth C. Stade
  • Shannon Wiltsey Stirman
  • Lyle Ungar
  • Cody L. Boland
  • H. Andrew Schwartz
  • David B. Yaden
  • João Sedoc
  • Robert J. DeRubeis
  • Robb Willer
  • Jane P. Kim
  • Johannes Eichstaedt
Related
  • What is Prompt Engineering?

    Prompt Engineering is the practice of carefully crafting instructions, or "prompts," to guide AI language models toward producing desired outputs. By adjusting the wording, structure, and context provided in a prompt, users can significantly influence the quality, style, and accuracy of the model's responses. This skill has become essential for effectively using large language models across tasks ranging from creative writing to technical problem-solving to data analysis. 

  • What is a Large Language Model (LLM)?

    A Large Language Model is an AI system trained on massive amounts of text data to understand and generate human-like language. It uses deep learning techniques, specifically neural networks with billions of parameters, to predict and produce coherent text, answer questions, translate languages, write code, and perform various other language-based tasks.

Related Publications

Operationalizing Real-Time Monitoring of Clinical AI
Zhongnan Fang, Lina Cheuy, Hye Sun Na, Akshay Chaudhari, David B. Larson
Quick Read, May 14, 2026
Policy Brief

This brief demonstrates how real-time monitoring can address critical gaps in the oversight of radiological AI tools.

Toward Responsible AI in Health Insurance Decision-Making
Michelle Mello, Artem Trotsyuk, Abdoul Jalil Djiberou Mahamadou, Danton Char
Quick Read, Feb 10, 2026
Policy Brief

This brief proposes governance mechanisms for the growing use of AI in health insurance utilization review.

Response to FDA's Request for Comment on AI-Enabled Medical Devices
Desmond C. Ong, Jared Moore, Nicole Martinez-Martin, Caroline Meinhardt, Eric Lin, William Agnew
Quick Read, Dec 02, 2025
Response to Request

Stanford scholars respond to a federal RFC on evaluating AI-enabled medical devices, recommending policy interventions to help mitigate the harms of AI-powered chatbots used as therapists.

Russ Altman’s Testimony Before the U.S. Senate Committee on Health, Education, Labor, and Pensions
Russ Altman
Quick Read, Oct 09, 2025
Testimony

In this testimony presented to the U.S. Senate Committee on Health, Education, Labor, and Pensions hearing titled “AI’s Potential to Support Patients, Workers, Children, and Families,” Russ Altman highlights opportunities for congressional support to make AI applications for patient care and drug discovery stronger, safer, and human-centered.