Leveraging open-source large language models for clinical information extraction in resource-constrained settings

L. Builtjes, J. Bosma, M. Prokop, B. van Ginneken and A. Hering

JAMIA Open 2025;8.

Abstract

Objective

We aimed to evaluate the zero-shot performance of open-source generative large language models (LLMs) on clinical information extraction from Dutch medical reports using the Diagnostic Report Analysis: General Optimization of NLP (DRAGON) benchmark.

Methods

We developed and released the llm_extractinator framework, a scalable, open-source tool for automating information extraction from clinical texts using LLMs. We evaluated 9 multilingual open-source LLMs across 28 tasks in the DRAGON benchmark, covering classification, regression, and named entity recognition (NER). All tasks were performed in a zero-shot setting. Model performance was quantified using task-specific metrics and aggregated into a DRAGON utility score. Additionally, we investigated the effect of in-context translation to English.

Results

Llama-3.3-70B achieved the highest utility score (0.760), followed by Phi-4-14B (0.751), Qwen-2.5-14B (0.748), and DeepSeek-R1-14B (0.744). These models outperformed or matched a fine-tuned RoBERTa baseline on 17 of 28 tasks, particularly in regression and structured classification. NER performance was consistently low across all models. Translation to English consistently reduced performance.

Discussion

Generative LLMs demonstrated strong zero-shot capabilities on clinical natural language processing tasks involving structured inference. Models around 14B parameters performed well overall, with Llama-3.3-70B leading but at high computational cost. Generative models excelled in regression tasks, but were hindered by token-level output formats for NER. Translation to English reduced performance, emphasizing the need for native language support.

Conclusion

Open-source generative LLMs provide a viable zero-shot alternative for clinical information extraction from Dutch medical texts, particularly in low-resource and multilingual settings.