
Science · Solutions · Society · Soul
Models, teams, and a dream — building LLMs, multimodal systems, and agents for health and science.

I'm Yu Gu (also Aiden Gu; Chinese: 顾禹). I build large language models, multimodal systems, and agentic workflows for health and science. Currently a Principal Applied Scientist at Microsoft Research and Health & Life Sciences.
Previously co-founded a Series C AI startup. I have published in Nature, Science, and Cell, and in major AI venues including ICLR, NeurIPS, and CVPR. My work serves millions of users and supports real-world decisions every day.
Selected highlights from recent research.

The Illusion of Readiness: Stress Testing Large Frontier Models on Multimodal Medical Benchmarks
Gu, Yu, Fu, Jingjing, Liu, Xiaodong, et al.
arXiv:2509.18234

Magma: A Foundation Model for Multimodal AI Agents
Yang, Jianwei, Tan, Reuben, Wu, Qianhui, et al.
CVPR 2025

BiomedJourney: Counterfactual Biomedical Image Generation by Instruction-Learning from Multimodal Patient Journeys
Gu, Yu, Yang, Jianwei, Usuyama, Naoto, et al.
arXiv:2310.10765

A whole-slide foundation model for digital pathology from real-world data
Xu, Hanwen, Usuyama, Naoto, Bagga, Jaspreet, et al.
Nature

A foundation model for joint segmentation, detection and recognition of biomedical objects across nine modalities
Zhao, Theodore*, Gu, Yu*, Yang, Jianwei, et al.
Nature Methods

Domain-specific language model pretraining for biomedical natural language processing
Gu, Yu, Tinn, Robert, Cheng, Hao, et al.
ACM Transactions on Computing for Healthcare
Research papers, conference proceedings, and scholarly contributions.
X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains
Liu, Qianchu, Zhang, Sheng, et al.
arXiv:2505.03981 (2025)
Wong, Cliff, Preston, Sam, et al.
arXiv:2502.00943 (2025)
The Illusion of Readiness: Stress Testing Large Frontier Models on Multimodal Medical Benchmarks
Gu, Yu, Fu, Jingjing, et al.
arXiv:2509.18234 (2025)
QRad: Enhancing Radiology Report Generation by Captioning-to-VQA Reframing
Jin, Ying, Codella, Noel C., ..., Hwang, Jenq-Neng
NeurIPS - The Second Workshop on GenAI for Health: Potential, Trust, and Policy Compliance (2025)
Current and past research projects and contributions to the field.
Building and adapting domain-specific LLMs for biomedical NLP and real-world healthcare tasks, including pretraining, fine-tuning, and evaluation.
Designing and stress-testing vision-language foundation models across medical imaging and multimodal benchmarks at scale; one example robustness probe is sketched after this list.
Advancing generalizable reasoning across modalities and domains, and targeted distillation for robust information extraction.
Developing multimodal AI agents and workflows that orchestrate tools and reasoning to act in complex real-world settings.
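
To give a flavor of what "stress testing" a benchmark means here, below is a minimal sketch of one robustness probe for a multimodal multiple-choice benchmark: permute the answer options and re-score the model. This is an illustrative sketch, not the actual evaluation code from the paper, and model_answer is a placeholder callable rather than any specific model's API.

```python
# Minimal sketch of one benchmark "stress test": shuffle the answer
# options of each multiple-choice question and re-score the model.
# A large accuracy drop versus the unshuffled benchmark suggests the
# model is exploiting option position rather than understanding content.
# model_answer is a placeholder callable, not a specific model API.

import random

def shuffle_options(options, answer_idx, seed=0):
    """Return permuted options and the gold answer's new index."""
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    return [options[i] for i in order], order.index(answer_idx)

def shuffled_accuracy(dataset, model_answer):
    """dataset: iterable of (question, options, gold_index) triples."""
    correct = total = 0
    for question, options, gold in dataset:
        opts, new_gold = shuffle_options(options, gold, seed=total)
        correct += int(model_answer(question, opts) == new_gold)
        total += 1
    return correct / max(total, 1)
```
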
Latest updates on publications, presentations, awards, and research activities.
The Illusion of Readiness: Stress Testing Large Frontier Models on Multimodal Medical Benchmarks is out. In just its first week, the paper has sparked meaningful conversation: it was highlighted by Eric Topol, shared across Health AI communities, and prompted outreach from Science and Health AI leaders looking for what comes next.
Related: The Illusion of Readiness: Stress Testing Large Frontier Models on Multimodal Medical Benchmarks →

We're excited to announce QRad, accepted to the NeurIPS Second Workshop on GenAI for Health: Potential, Trust, and Policy Compliance. QRad enhances radiology report generation by reframing report captioning as visual question answering (VQA).
Related: QRad: Enhancing Radiology Report Generation by Captioning-to-VQA Reframing →
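
As a rough illustration of the captioning-to-VQA idea, here is a minimal sketch of turning one (image, structured report) pair into VQA-style training triples. The section names and question templates are illustrative assumptions, not QRad's actual pipeline.

```python
# Sketch of captioning-to-VQA reframing: one (image, report) pair
# becomes several question-answer training examples.
# Section names and question templates below are illustrative
# assumptions, not the actual QRad implementation.

QUESTION_TEMPLATES = {
    "findings": "What findings are present in this chest X-ray?",
    "impression": "What is the overall impression?",
}

def report_to_vqa(image_id, report):
    """Turn a structured report into (image, question, answer) triples."""
    return [
        {"image_id": image_id, "question": question, "answer": report[section]}
        for section, question in QUESTION_TEMPLATES.items()
        if report.get(section)
    ]

print(report_to_vqa("cxr_00042",
                    {"findings": "No focal consolidation.",
                     "impression": "Normal study."}))
```
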
BiomedParse topped the CVPR 2025 3D Biomedical Image Segmentation Challenge! Our model delivered best-in-class performance across 42 tasks spanning CT, MRI, PET, ultrasound, and microscopy. Check out the announcement.
Related: A foundation model for joint segmentation, detection and recognition of biomedical objects across nine modalities →

At Microsoft Build, we introduced the Healthcare Agent Orchestrator, now available in the Azure AI Foundry Agent Catalog. For details, see the HAO science blog.
Our recent publication in Nature Communications: A clinically accessible small multimodal radiology model and evaluation metric for chest X-ray findings.
Related: A clinically accessible small multimodal radiology model and evaluation metric for chest X-ray findings →

3D segmentation made simple: #MedImageParse 3D is live! #MedImageParse is now optimized for 3D imaging. Check out the blog by David Ardman: https://www.microsoft.com/en-us/industry/blog/healthcare/2025/03/03/leading-the-charge-to-transform-healthcare-with-advanced-ai/
We're excited to unveil **Magma**, our flagship multimodal AI project (Multimodal Agentic Model at Microsoft Research). Today, we released Magma on arXiv (2502.13130), along with its project page and GitHub repo. The project has already drawn significant community attention, with top influencers sharing the news.
Related: Magma: A Foundation Model for Multimodal AI Agents →

The LLaVA-Rad training data is out: a multimodal dataset of 400,042 X-ray image-text pairs from MIMIC-CXR, enhanced with GPT-4 for accurate report structuring and clarity.
Related: LLaVA-Rad MIMIC-CXR Annotations →
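
For anyone wiring the annotations up, here is a minimal loading sketch. The annotation file name and record fields ("image_path", "text") are assumptions for illustration; check the released files for the actual schema, and note that MIMIC-CXR images require credentialed access.

```python
# Sketch: pairing LLaVA-Rad annotations with local MIMIC-CXR images.
# The file name and field names are assumptions for illustration;
# consult the released annotation files for the actual schema.

import json
from pathlib import Path

MIMIC_CXR_ROOT = Path("/data/mimic-cxr-jpg")       # local copy (credentialed access)
ANNOTATIONS = Path("llava_rad_annotations.jsonl")  # hypothetical file name

def iter_pairs():
    """Yield (image_path, report_text) pairs from the annotation file."""
    with ANNOTATIONS.open() as f:
        for line in f:
            record = json.loads(line)
            yield MIMIC_CXR_ROOT / record["image_path"], record["text"]

for image_path, text in iter_pairs():
    print(image_path, text[:80])
    break
```
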
Satya Nadella shared our multi-agentic MTB (molecular tumor board) project.
Our **#CXRReportGen** model is featured in Forbes. Delivering state-of-the-art performance at half the model size, it is trained on commercially approved data, ensuring adaptability for specialized applications. Check it out at https://ai.azure.com/catalog/models/CxrReportGen
Models, teams, and a dream — often in that order.