Benchmarking lung tumour segmentation models: stratified performance of deep learning models across tumour sizes and cancer stages

A. Cerrato Nieto, E. Scholten, S. Schalekamp, M. Prokop and C. Jacobs

European Congress of Radiology 2026.

Purpose: Accurate lung tumour segmentation in CT scans is crucial for staging, radiotherapy planning, and treatment monitoring. Published deep learning models for lung tumour segmentation perform inconsistently across tumour stages and may fail in complex tumours. This study compared five publicly available models with an in-house model trained to be robust to variation in tumour size and cancer stage.
Methods: A dataset of 588 CT scans from lung cancer patients (2006-2020) was retrospectively collected and annotated at Radboud University Medical Center. A deep learning model was trained using the nnU-Net architecture on subsets of Radboud patients (n=505), the NSCLC-Radiomics dataset (n=362) and the Medical Segmentation Decathlon dataset (n=56). Our model was compared with five publicly available models: the Universal Lesion Segmentation baseline model, the Medical Segmentation Decathlon lung model, DuneAI, TotalSegmentator and nnInteractive. Segmentation accuracy was assessed using volumetric and boundary metrics, including stratified analyses by tumour size (<=30 mm, >30-50 mm, >50-70 mm, >70 mm) on our internal test dataset (n=83).
Results: Our proposed model matched or exceeded the best public models in volumetric Dice score (median >=0.87), with improvements ranging from 0.01 to 0.28 depending on the model and the tumour size group. The model was also substantially more consistent, with interquartile ranges <=0.10 for all tumour sizes. It achieved higher surface Dice together with lower Hausdorff distance, indicating improved tumour border accuracy. Performance remained superior in clinically demanding cases with cavities, local invasion, and large masses.
Conclusion: Our model improves performance and robustness over prior models across tumour sizes, including challenging cases. It represents a promising step towards automated evaluation of lung tumours.
Limitations: Independent validation in larger multicentre datasets is required.
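
Illustrative sketch: the stratified analysis described in the Methods can be illustrated with a short Python example. This is a minimal sketch, not the authors' evaluation code. It assumes binary masks stored as NumPy arrays and one size measurement in millimetres per tumour (the abstract does not state how size was measured). The function names `dice_score`, `size_group` and `summarise_by_size` are hypothetical, and the convention of returning 1.0 when both masks are empty is an assumption.

```python
import numpy as np


def dice_score(pred, ref):
    """Volumetric Dice overlap between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    ref = np.asarray(ref, dtype=bool)
    denom = pred.sum() + ref.sum()
    if denom == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * np.logical_and(pred, ref).sum() / denom


def size_group(size_mm):
    """Map a tumour size in mm to the four size groups named in the abstract."""
    if size_mm <= 30:
        return "<=30 mm"
    if size_mm <= 50:
        return ">30-50 mm"
    if size_mm <= 70:
        return ">50-70 mm"
    return ">70 mm"


def summarise_by_size(sizes_mm, dice_scores):
    """Report count, median Dice and interquartile range per size group."""
    groups = {}
    for size, score in zip(sizes_mm, dice_scores):
        groups.setdefault(size_group(size), []).append(score)
    summary = {}
    for name, scores in groups.items():
        q1, median, q3 = np.percentile(scores, [25, 50, 75])
        summary[name] = {
            "n": len(scores),
            "median": float(median),
            "iqr": float(q3 - q1),
        }
    return summary
```

Calling `summarise_by_size` with per-case sizes and Dice scores yields one entry per size group, which is the kind of per-group median and interquartile range the Results refer to. Boundary metrics such as surface Dice and Hausdorff distance need surface extraction and voxel spacing, and are left out of this sketch.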