Policy Brief

Toward Responsible Development and Evaluation of LLMs in Psychotherapy

Date: June 13, 2024
Topics: Healthcare

Abstract

This brief reviews the current landscape of LLMs developed for psychotherapy and proposes a framework for evaluating the readiness of these AI tools for clinical deployment.

Key Takeaways

  • Large language models (LLMs) hold promise for supporting, augmenting, and even automating psychotherapy through tasks ranging from note-taking during interviews to assessment and therapy delivery.

  • However, psychotherapy is a uniquely complex, high-stakes domain. The use of LLMs in this field poses wide-ranging safety, legal, and ethical concerns.

  • We propose a framework for evaluating and reporting on whether AI applications are ready for clinical deployment in behavioral health contexts based on safety, confidentiality/privacy, equity, effectiveness, and implementation concerns.

  • Policymakers and behavioral health practitioners should proceed cautiously when integrating LLMs into psychotherapy. Product developers should integrate evidence-based psychotherapy expertise and conduct comprehensive effectiveness and safety evaluations of clinical LLMs.

Executive Summary

There is growing enthusiasm about the potential of OpenAI’s GPT-4, Google’s Gemini, Anthropic’s Claude, and other large language models (LLMs) to support, augment, and even fully automate psychotherapy. By serving as conversational agents, LLMs could help address the shortage of mental healthcare services, problems with individual access to care, and other challenges. In fact, behavioral healthcare specialists are beginning to use LLMs for tasks such as note-taking, while consumers are already conversing with LLM-powered therapy chatbots.

However, psychotherapy is a uniquely complex, high-stakes domain, and responsible, evidence-based therapy requires nuanced expertise. Whereas the cost of using an LLM for ordinary productivity tasks may be nothing worse than lost efficiency, in behavioral healthcare the stakes may include the mishandling of suicide risk.

Our paper, “Large Language Models Could Change the Future of Behavioral Healthcare,” provides a road map for the responsible application of clinical LLMs in psychotherapy. We provide an overview of the current landscape of clinical LLM applications and analyze the different stages of integration into psychotherapy. We discuss the risks of these LLM applications and offer recommendations for guiding their responsible development.

In a more recent paper, “Readiness for AI Deployment and Implementation (READI): A Proposed Framework for the Evaluation of AI-Mental Health Applications,” we build on our prior work and propose a new framework for evaluating whether AI mental health applications are ready for clinical deployment.
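
The five dimensions named in this brief (safety, confidentiality/privacy, equity, effectiveness, and implementation) lend themselves to structured reporting. The sketch below shows one way a readiness report could be recorded programmatically; the class name, 0-3 rating scale, and deployment rule are hypothetical illustrations written for this brief, not part of the READI framework itself.

    from dataclasses import dataclass, asdict

    # Hypothetical sketch of a structured readiness report built around the
    # five evaluation dimensions this brief names. The 0-3 rating scale and
    # the "ready" rule are illustrative assumptions, not the READI framework.

    @dataclass
    class ReadinessReport:
        application: str
        safety: int                   # e.g., crisis and suicide-risk handling
        confidentiality_privacy: int  # e.g., data handling and privacy review
        equity: int                   # e.g., performance across patient groups
        effectiveness: int            # e.g., evidence of clinical benefit
        implementation: int           # e.g., workflow fit, clinician oversight

        DIMENSIONS = ("safety", "confidentiality_privacy", "equity",
                      "effectiveness", "implementation")

        def ready_for_deployment(self) -> bool:
            # Illustrative rule: every dimension must reach the top rating.
            return all(getattr(self, d) >= 3 for d in self.DIMENSIONS)

    report = ReadinessReport(
        application="hypothetical therapy-support chatbot",
        safety=2, confidentiality_privacy=3, equity=1,
        effectiveness=2, implementation=3,
    )
    print(asdict(report))
    print("Ready for deployment:", report.ready_for_deployment())

Reporting each dimension separately, rather than collapsing readiness into a single score, keeps gaps such as the low equity rating in this example visible to developers, clinicians, and policymakers.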

This work underscores the need for policymakers to understand the nuances of how LLMs are already, or could soon be, integrated into psychotherapy environments as researchers and industry race to develop AI mental health applications. Policymakers have the opportunity and responsibility to ensure that the field evaluates these innovations carefully, taking into consideration their potential limitations, ethical considerations, and risks.

Introduction

The use of AI in psychotherapy is not a new phenomenon. Decades before the emergence of mainstream LLMs, researchers and practitioners used AI applications, such as natural language processing models, in behavioral health settings. For instance, various research experiments used machine learning and natural language processing to detect suicide risk, identify homework resulting from psychotherapy sessions, and evaluate patient emotions. More recently, mental health chatbots such as Woebot and Tessa have applied rules-based AI techniques to target depression and eating pathology. Yet these chatbots frequently struggle to respond appropriately to user inputs, and they suffer from high dropout rates and low user engagement.

LLMs have the potential to fill some of these gaps and change many aspects of psychotherapy care thanks to their ability to parse human language, generate human-like and context-dependent responses, annotate text, and flexibly adopt different conversational styles.

However, while LLMs show vast promise in performing certain tasks and skills associated with psychotherapy, clinical LLM products and prototypes are not yet sophisticated enough to replace psychotherapy. There is a gap between simulating therapy skills and implementing them to alleviate patient suffering. To close that gap, clinical LLMs need to be tailored to psychotherapy contexts through prompt engineering (structuring a set of instructions so an AI model can follow them) or fine-tuning techniques that train the LLM on curated datasets.
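
As a rough illustration of the prompt-engineering side of that tailoring, the sketch below assembles a system prompt that confines a general-purpose chat model to a narrow, clinician-supervised documentation role. The role description, safety rules, and helper function are hypothetical examples written for this brief; they are not validated clinical guidance, and any real deployment would require clinician review and the kind of safety evaluation this brief calls for.

    # Minimal sketch of prompt engineering for a clinical-support task.
    # The instructions are illustrative only; they are not validated clinical
    # guidance and assume a licensed clinician reviews every output.

    SYSTEM_PROMPT = """You are a documentation assistant supporting a licensed
    therapist. You summarize session transcripts in neutral, clinical language.
    Rules:
    - Do not provide diagnoses, treatment recommendations, or crisis counseling.
    - If the transcript mentions self-harm or harm to others, flag it for the
      clinician rather than responding to it yourself.
    - Do not repeat identifying patient details beyond what the summary needs.
    """

    def build_messages(session_transcript: str) -> list[dict]:
        """Package the system prompt and a transcript in the chat-message
        format most hosted LLM APIs accept."""
        return [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",
             "content": "Summarize this session:\n" + session_transcript},
        ]

    # The resulting messages would be passed to whichever chat-completion API
    # the application uses; fine-tuning on curated, de-identified clinical
    # dialogues is the complementary approach mentioned above.
    messages = build_messages("Patient reported improved sleep after ...")
    print(messages[0]["content"][:60])

Keeping such constraints in an auditable prompt is only a starting point: as this brief argues, prompting alone does not guarantee safe handling of high-risk content, which is why deployment-readiness evaluation matters.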

As LLMs are increasingly used in psychotherapy, it is essential to understand the complexity and stakes at play: In the worst-case scenario, an “LLM co-pilot” functioning poorly could lead to the improper handling of the risk of suicide or homicide. While clinical LLMs are, of course, not the only AI applications that may involve life-or-death decisions—consider self-driving cars, for example—predicting and mitigating risk in psychotherapy is unique. It requires conceptualizing complex cases, considering social and cultural contexts, and addressing unpredictable human behavior. Poor outcomes or ethical transgressions from clinical LLMs could seriously harm individuals and undermine public trust in behavioral healthcare as a field, as has been seen in other domains.

Beginning with an overview of the clinical LLMs in use today, our first paper reviews the current landscape of clinical LLM development. We examine how clinical LLMs progress across different stages of integration and identify specific ethical and other concerns related to their use in different scenarios. We then make recommendations for how to responsibly approach the development of LLMs for use in behavioral health settings. In our second paper, we propose a framework that could be used by developers, researchers, clinicians, and policymakers to evaluate and report on the readiness of generative AI mental health applications for clinical deployment.

Authors
  • Elizabeth C. Stade
  • Shannon Wiltsey Stirman
  • Lyle Ungar
  • Cody L. Boland
  • H. Andrew Schwartz
  • David B. Yaden
  • João Sedoc
  • Robert J. DeRubeis
  • Robb Willer
  • Jane P. Kim
  • Johannes Eichstaedt

Related Publications

Toward Responsible AI in Health Insurance Decision-Making
Michelle Mello, Artem Trotsyuk, Abdoul Jalil Djiberou Mahamadou, Danton Char
Policy Brief | Feb 10, 2026 | Healthcare; Regulation, Policy, Governance

This brief proposes governance mechanisms for the growing use of AI in health insurance utilization review.

Response to FDA's Request for Comment on AI-Enabled Medical Devices
Desmond C. Ong, Jared Moore, Nicole Martinez-Martin, Caroline Meinhardt, Eric Lin, William Agnew
Response to Request | Dec 02, 2025 | Healthcare; Regulation, Policy, Governance

Stanford scholars respond to a federal RFC on evaluating AI-enabled medical devices, recommending policy interventions to help mitigate the harms of AI-powered chatbots used as therapists.

Russ Altman’s Testimony Before the U.S. Senate Committee on Health, Education, Labor, and Pensions
Russ Altman
Testimony | Oct 09, 2025 | Healthcare; Regulation, Policy, Governance; Sciences (Social, Health, Biological, Physical)

In this testimony presented to the U.S. Senate Committee on Health, Education, Labor, and Pensions hearing titled “AI’s Potential to Support Patients, Workers, Children, and Families,” Russ Altman highlights opportunities for congressional support to make AI applications for patient care and drug discovery stronger, safer, and human-centered.

Michelle M. Mello's Testimony Before the U.S. House Committee on Energy and Commerce Health Subcommittee
Michelle Mello
Testimony | Sep 02, 2025 | Healthcare; Regulation, Policy, Governance

In this testimony presented to the U.S. House Committee on Energy and Commerce’s Subcommittee on Health hearing titled “Examining Opportunities to Advance American Health Care through the Use of Artificial Intelligence Technologies,” Michelle M. Mello calls for policy changes that will promote effective integration of AI tools into healthcare by strengthening trust.