Deep learning-based malignancy risk estimation of pulmonary nodules in PET/CT imaging

L. Leijten, E. van der Heijden, E. Aarntzen, R. Verhoeven and C. Jacobs

European Congress of Radiology 2026.

Purpose: The current BTS guidelines recommend evaluating suspicious pulmonary nodules with [18F]FDG-PET/CT imaging followed by risk stratification with the Herder model. However, the Herder model's reliance on a small set of predefined features may limit its diagnostic accuracy. This study aims to develop a deep learning (DL) algorithm for malignancy risk stratification and to compare its performance with that of the Herder risk model and clinician assessment.

Methods: In a single-center retrospective study, we collected 533 indeterminate pulmonary nodules (265 malignant) with a diameter of 18.4 ± 12.1 mm (mean ± SD) in 436 patients. Cases were identified through biopsy records (n=341), by querying FDG-PET/CT reports for mentions of pulmonary nodules (n=50), and by automatic nodule detection in confirmed benign patients (n=142). Histopathological confirmation and 2-year cancer registry follow-up served as the reference standard. A patient-level random split created a development set (n=393) and a test set (n=140). We then developed a PET/CT DL algorithm and benchmarked it against the Herder risk model and against imaging-only assessment by a nuclear physician with 7 years of experience. Model performance was measured using the area under the receiver operating characteristic curve (AUC), and differences between models were compared with the DeLong test.

Results: On the test set, the PET/CT DL algorithm achieved an AUC of 0.85 [95% CI: 0.77–0.92], compared with 0.78 [0.68–0.86] for the Herder model (p=0.106). The DL algorithm outperformed imaging-only assessment by the nuclear physician, who achieved an AUC of 0.77 [0.68–0.85] (p=0.024).

Conclusion: The PET/CT DL algorithm performed similarly to the guideline-recommended Herder risk model, with no statistically significant difference in AUC, and outperformed imaging-only assessment by the nuclear physician. The DL algorithm may help to optimize the clinical decision between invasive diagnosis and CT surveillance follow-up.

Limitations: The single-center design and non-consecutive cohort may limit the generalizability of the results.
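The evaluation design described in the Methods can be illustrated with a minimal sketch. It shows one way to make a patient-level random split (all nodules of a patient end up in the same set) and to compute an AUC from malignancy labels and risk scores. The function names, the `test_fraction` and `seed` parameters, and the toy data are assumptions made for illustration; the abstract does not describe how the split or the AUC were implemented, and the DeLong test and confidence intervals are not covered here.

```python
import random
from collections import defaultdict


def patient_level_split(nodule_patient_ids, test_fraction=0.25, seed=0):
    """Split nodule indices so that no patient appears in both sets.

    nodule_patient_ids: one patient ID per nodule (hypothetical input format).
    Returns (development_indices, test_indices).
    """
    by_patient = defaultdict(list)
    for idx, pid in enumerate(nodule_patient_ids):
        by_patient[pid].append(idx)

    # Shuffle patients, not nodules, to avoid leakage between the sets.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)
    n_test = max(1, round(test_fraction * len(patients)))

    test = [i for p in patients[:n_test] for i in by_patient[p]]
    dev = [i for p in patients[n_test:] for i in by_patient[p]]
    return sorted(dev), sorted(test)


def auc(labels, scores):
    """AUC as the probability that a malignant nodule (label 1) receives a
    higher score than a benign nodule (label 0); ties count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one malignant and one benign case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data only: six nodules from four hypothetical patients.
    patient_ids = ["a", "a", "b", "c", "c", "d"]
    dev_idx, test_idx = patient_level_split(patient_ids, test_fraction=0.25, seed=1)
    print("development nodules:", dev_idx, "test nodules:", test_idx)
    print("toy AUC:", auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

Splitting at the patient level matters here because the cohort contains more nodules (533) than patients (436), so a nodule-level split could place nodules from the same patient in both the development and test sets.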