
As interest in medical AI grows, a major challenge remains: most large language models (LLMs) are trained and tested using general metrics that do not reflect the complexity of clinical practice. To bridge this gap, CHUV has developed a new clinician‑centric framework that places healthcare professionals at the heart of model evaluation and improvement.
In this initiative, 241 clinicians from 22 specialties created more than 3,700 realistic clinical vignettes and contributed over 12,500 expert evaluations. This large, diverse body of clinical insight was used to align and assess a 70‑billion‑parameter medical LLM, producing Llama‑3.1 Meditron‑3‑CHUV. The aligned model showed significant improvements over its base version across 11 key dimensions, including safety, fairness, clarity, and contextual relevance, and achieved performance comparable to leading proprietary systems.
A central outcome of this work is MOOVE‑CHUV, the largest clinician‑annotated preference dataset ever released in the medical AI field. This resource enables hospitals, researchers, and developers to evaluate and refine medical LLMs using real clinical expectations rather than abstract benchmarks.
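To make the idea of evaluating a model against clinician preferences concrete, here is a minimal sketch in Python. The record fields (`vignette`, `chosen`, `rejected`) and the scoring function are illustrative assumptions, not the actual MOOVE‑CHUV schema or evaluation protocol: each pair holds a clinical vignette with the answer a clinician preferred and the one they rejected, and a model is judged by how often it ranks the preferred answer higher.

```python
# Hypothetical preference-pair layout (field names are assumptions,
# not the published MOOVE-CHUV schema).
records = [
    {
        "vignette": "65-year-old with acute chest pain radiating to the left arm.",
        "chosen": "Assess for acute coronary syndrome; obtain ECG and troponin.",
        "rejected": "Likely indigestion; no follow-up needed.",
    },
]

def win_rate(records, score):
    """Fraction of pairs where the scoring function ranks the
    clinician-preferred answer above the rejected one."""
    wins = sum(
        score(r["vignette"], r["chosen"]) > score(r["vignette"], r["rejected"])
        for r in records
    )
    return wins / len(records)

# Toy scorer standing in for a real model signal such as a
# log-likelihood or reward-model score.
def toy_score(vignette, answer):
    return len(answer)

print(win_rate(records, toy_score))
```

In practice the scoring function would be replaced by the model under evaluation, so the win rate directly measures agreement with clinician judgment rather than with an abstract benchmark.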
This approach marks a major shift in how medical AI can be built: clinicians are not just users of AI tools, but essential collaborators in shaping them. By integrating expert judgment directly into the development cycle, healthcare institutions can create safer, more trustworthy, and more context‑aware AI systems while preserving data protection and supporting local deployment. This work offers a scalable pathway for developing AI that truly meets the needs of both clinicians and patients.