Brown University Audits AI Therapy Chatbots: 15 Ethical Failures Across 137 Sessions
- Brown University researchers systematically audited AI therapy chatbots across 137 sessions, evaluated by trained peer counselors and licensed psychologists — the most rigorous ethics audit of LLM-based mental health tools to date
- 15 recurring ethical failures identified, grouped into 5 thematic categories — the failures are not rare edge cases but patterned, predictable violations
- "Lack of contextual understanding" emerged as a core failure theme — AI cannot integrate session history, cultural context, and clinical priorities the way a human therapist does
- California SB 903 (2025–26 session) would bar AI from making independent therapeutic decisions or detecting emotions, and would require patient consent before therapy sessions are recorded, a direct legislative response to the ethical gap
For two years, the mental health field has been debating AI therapy chatbots in the abstract. The debates produced position papers and think-pieces but little empirical data on what actually goes wrong when patients interact with these systems. A Brown University team, presenting at the 2025 AAAI/ACM Conference on AI, Ethics, and Society, decided to stop debating and start measuring. They audited 137 AI therapy sessions with structured expert review. The findings should settle the debate, or at least refocus it.
The 15 failures and the 5 themes
The research identified 15 recurring ethical failures grouped into 5 categories:
1. Contextual understanding failures — The AI cannot hold the full clinical picture across a session, let alone across sessions. Each interaction starts from a limited context window; the patient's history, previous disclosures, and treatment goals are not reliably integrated.
2. Scope and competency failures — The AI responds to content it is not equipped to handle safely: active suicidality, psychotic symptoms, complex trauma presentations. It does not recognize its own limitations consistently.
3. Boundary violations — The AI makes interventions that exceed the "supportive companion" role the system was designed for, venturing into diagnostic territory or treatment recommendations without the clinical framework to do so responsibly.
4. Informed consent gaps — Patients do not consistently understand what the AI is, what it can and cannot do, or how their data is being used.
5. Harm mitigation failures — When a patient expresses acute distress, the AI's responses do not reliably match the risk level. Crisis escalation protocols exist on paper but fail in practice.
The California legislative response
California SB 903 (2025–26 session) is one direct response. The bill would prohibit licensed professionals from allowing AI to make independent therapeutic decisions or perform emotion detection, and would require patient informed consent before AI records or transcribes therapy sessions. Illinois enacted similar legislation in 2025. The regulatory wave is following the evidence, and the evidence is not good for AI therapy chatbots operating without human oversight.
For your practice
For clinicians using or considering AI tools, the Brown audit reveals the specific failure modes to watch for:
- Clinical use: if you use AI for note-taking or scheduling, the risk is lower. If you use AI in any patient-facing therapeutic capacity, the audit findings apply directly: require informed consent, maintain human review of all clinical content, and assume AI cannot handle crisis presentations.
- Regulatory context: California SB 903 and similar bills will affect practice even where they are not yet law, because malpractice standards follow regulatory trends.
- Patient advocacy: "AI therapy" tools that patients may be using outside your oversight carry documented ethical risks. This audit is the data to bring into that conversation.
We spent two years debating whether AI therapy chatbots cross ethical lines. Brown University counted the crossings: 15 recurring failures across 5 categories, patterned and predictable.
These findings reflect a single research team's audit methodology. The specific AI systems audited may have been updated since testing. Expert review involves subjective judgment even when structured. Generalizability to newer LLMs (the GPT-5 and Claude 4 generation) is uncertain as models continue to evolve.