Abstract
Large language models like ChatGPT have captured healthcare's imagination. While institutions remain cautious about formal deployment, individual clinicians increasingly turn to these tools informally, copying notes into ChatGPT for summaries or asking diagnostic questions during complex cases. This grassroots adoption reveals both the demand for AI assistance and a real limitation: general-purpose LLMs weren't built for psychiatric care.
During consultations, mental health clinicians assess how their patients feel, sleep, and speak, and compare these observations against prior visits and the broader medical context. These are the subtle, experience-driven cues that clinicians learn to read over years of practice. General-purpose AI models typically miss them. Biopharma recognized that it needs domain-specific biology foundation models rather than retrofitted general AI; psychiatry faces the same reality.
This article explores why general-purpose LLMs fall short in mental healthcare, and why the path forward runs through specialized foundation models trained on psychiatric data.
The Crisis Behind the AI Rush
Studies show that 90% of patients with bipolar disorder relapse in their lifetime, nearly half within two years of recovery. In schizophrenia, the relapse rate can reach 72% within two years of a first psychotic episode. Not all of these relapses are inevitable. Many could be anticipated with earlier detection of warning signs. But the current system only catches what's visible during scheduled appointments, and most of what matters happens in between.
Measurement-based care offers a proven path to better outcomes. We've described this in detail in our manifesto, along with why existing tools, from clinician-administered scales to patient self-reported questionnaires, remain largely unadopted in practice.
AI can help close this gap, but general-purpose LLMs fall short because they weren't designed for the complexity of mental health.
Why General-Purpose LLMs Fall Short
When clinicians use ChatGPT or Claude for psychiatric notes, they're using tools trained on internet text and general conversation, not psychiatric clinical data.
Designed for structured medicine. General-purpose AI performs well where clinical decisions follow codified pathways: oncology staging, drug interaction checks, radiology classification. Psychiatry doesn't work that way. Treatment is iterative, largely trial and error, with efficacy assessed through subjective clinical judgment over sustained observation. General models typically struggle to reconcile contradictory or ambiguous clinical information, the kind psychiatrists navigate in every consultation.
Unable to assess symptoms in context. Psychiatry is fundamentally about assessing symptoms, which are alterations in behavior, within a patient's personal and medical context. When a patient says “I can't sleep,” it might reflect an actual sleep disorder or a subjective impression linked to depression, mania, anxiety, psychosis, or medication effects. Distinguishing between these requires understanding the patient's history, current treatment, and how this complaint fits into a broader clinical picture.
This extends to how patients express themselves. Pressured speech, thought blocking, tangentiality: these patterns carry real diagnostic weight. In psychiatry, the spoken word is the primary therapeutic medium. The most clinically meaningful signal is often simply a patient telling their doctor how they feel. General-purpose LLMs, trained on written text, have no way to capture this. Even speech recognition systems like Whisper, which achieve error rates of ~10% on typical speech, see those rates jump to ~30% on speech from neuropsychiatric patients, rendering transcripts clinically unreliable.
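To make the error-rate figures above concrete, here is a minimal sketch of how a transcription error rate is commonly computed. It assumes the figures refer to word error rate (WER), the usual metric for speech recognition; the article does not name the metric. The function name and the two transcripts are invented for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Invented example: a single dropped word out of ten.
reference = "i have not been able to sleep for three days"
hypothesis = "i have been able to sleep for three days"
wer = word_error_rate(reference, hypothesis)
print(wer)  # 0.1
```

In this invented example a single dropped word gives a 10% error rate, yet it reverses the meaning of the sentence. At around 30%, roughly one word in three is wrong.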
Beyond these specific gaps, psychiatry imposes structural demands that general AI doesn't account for. Mental health requires tracking symptom evolution over weeks and months (depression at week one looks very different from week eight), and two patients with identical diagnoses may present completely different clinical profiles. Models need to handle both temporal complexity and symptom heterogeneity while remaining clinically useful.
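One simple way to picture the point about temporal tracking and heterogeneity is to score each new observation against the patient's own earlier visits rather than against a population norm. The sketch below is a toy illustration under that assumption, not a clinical method; the function name and the sleep figures are invented.

```python
from statistics import mean, stdev


def deviation_from_baseline(history: list[float], current: float) -> float:
    """z-score of the current value relative to the patient's own prior visits."""
    if len(history) < 2:
        raise ValueError("need at least two prior observations")
    spread = stdev(history)
    if spread == 0:
        raise ValueError("history has no variation; z-score is undefined")
    return (current - mean(history)) / spread


# Invented data: average hours of sleep reported at four earlier visits.
patient_a_history = [7.5, 8.0, 7.0, 7.5]
patient_b_history = [5.0, 5.5, 4.5, 5.0]

# Both patients report 5.0 hours at the current visit.
z_a = deviation_from_baseline(patient_a_history, 5.0)
z_b = deviation_from_baseline(patient_b_history, 5.0)
print(round(z_a, 2), round(z_b, 2))
```

The same reported value is a large departure from baseline for the first patient and no change at all for the second, which is why an isolated measurement says little without longitudinal context.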
The Foundation Model Approach
Foundation models, large networks pre-trained on vast datasets and then adapted to specific tasks, are well suited to psychiatry's complexity.
Robust generalization. Training across different pathologies (depression, bipolar disorder, schizophrenia) creates models that handle heterogeneity better than diagnosis-specific systems. Psychiatry's complex, unstructured medical data, long seen as a computational liability, becomes a source of contextualized clinical insights.
Multi-modal integration. Foundation architectures can unify audio features, linguistic content, behavioral patterns, and clinical history, mirroring how psychiatrists actually synthesize information. Unlike conventional models that require separate validation for each modality, foundation models learn shared representations across data types.
Flexible adaptation. A single foundation model can be adapted to multiple clinical applications (symptom monitoring, relapse prediction, treatment response) without retraining from scratch. Training for one task often enhances performance across others and can unlock capabilities the system wasn't explicitly designed for.
These advantages, however, only materialize with appropriate training data.
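The ideas of shared representations and flexible adaptation can be sketched structurally as one encoder that maps several modalities into a common embedding, plus small task-specific heads that reuse it. This is a toy illustration with random, untrained weights. The class names, dimensions, and feature vectors are all hypothetical and do not describe any particular model's architecture.

```python
import math
import random

EMBED_DIM = 4  # size of the shared representation (arbitrary for this sketch)


def make_matrix(rows, cols, rng):
    """Random weight matrix standing in for learned parameters."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class SharedEncoder:
    """Projects each modality into one embedding space and averages the results."""

    def __init__(self, modality_dims, rng):
        self.projections = {
            name: make_matrix(EMBED_DIM, dim, rng)
            for name, dim in modality_dims.items()
        }

    def encode(self, inputs):
        # Only the modalities present in `inputs` contribute to the embedding.
        embeddings = [matvec(self.projections[name], feats)
                      for name, feats in inputs.items()]
        return [sum(vals) / len(vals) for vals in zip(*embeddings)]


class TaskHead:
    """Small task-specific layer on top of the shared embedding."""

    def __init__(self, rng):
        self.weights = [rng.uniform(-1, 1) for _ in range(EMBED_DIM)]

    def predict(self, embedding):
        score = sum(w * e for w, e in zip(self.weights, embedding))
        return 1 / (1 + math.exp(-score))  # squash to the interval (0, 1)


rng = random.Random(0)
encoder = SharedEncoder({"audio": 3, "text": 5, "history": 2}, rng)
tasks = ("symptom_monitoring", "relapse_prediction", "treatment_response")
heads = {task: TaskHead(rng) for task in tasks}

# Hypothetical pre-extracted features for one visit.
visit = {
    "audio": [0.2, 0.1, 0.7],
    "text": [0.0, 1.0, 0.3, 0.5, 0.2],
    "history": [1.0, 0.0],
}
embedding = encoder.encode(visit)  # computed once, shared by every head
outputs = {task: head.predict(embedding) for task, head in heads.items()}
```

In this layout, adding a new clinical application means adding a head rather than a new model, which is the sense in which adaptation avoids retraining from scratch.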
Building What Doesn't Exist Yet
The bottleneck for psychiatric AI isn't algorithms. It's clinical data infrastructure.
Psychiatric datasets can't be assembled the way radiology archives can. While retrospective labelling works in some medical fields, psychiatric diagnosis can't be reliably inferred from isolated signals: neither a one-minute voice sample nor a few nights of sleep data can tell you whether someone is experiencing a depressive episode. Reliable assessment requires structured clinical evaluation conducted alongside data collection, longitudinal follow-up over months or years, and population diversity across diagnoses, severities, and demographics. Each meaningful data point represents hours of coordination between researchers, clinicians, and patients.
Companies in this space face a clear choice: work within the limitations of general-purpose models, or invest years building the specialized datasets that psychiatric foundation models require.
Callyope chose the second path. Through clinical trials and research partnerships with academic hospitals, we have built the world's largest behavioral dataset in neuroscience, spanning the spectrum of brain conditions. Not because it's the fastest route to market, but because it's the only way to build AI that actually works for this field.
The technology is ready. What's been missing is the clinical data, and the willingness to build around it properly.





