External validation of an AI algorithm for pulmonary nodule malignancy risk estimation on a dataset of incidentally detected pulmonary nodules

R. Dinnessen, K. Venkadesh, D. Peeters, H. Gietema, E. Scholten, C. Schaefer-Prokop and C. Jacobs

European Congress of Radiology 2024.

Purpose: An AI algorithm for malignancy risk estimation was developed and validated on screen-detected pulmonary nodules. We aimed to test the AI algorithm on clinical data and compare its performance with that of the Brock model.

Methods and materials: A size-matched dataset of solid, incidentally detected pulmonary nodules with a diameter of 5-15 mm was collected, consisting of 53 malignant nodules from CT scans performed at least two months before a lung cancer diagnosis and 53 benign nodules. Differences in patient and nodule characteristics between the malignant and benign groups were assessed. AUCs and 95% confidence intervals were determined and compared using the DeLong method. Sensitivity and specificity of the AI algorithm and the Brock model were determined at a 10% malignancy risk threshold, in line with the British Thoracic Society guidelines.
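
The threshold-based metrics described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the toy risk scores and the choice to count a risk of exactly 10% as positive are all assumptions made for illustration. The AUC is computed with the plain pairwise (Mann-Whitney) estimate, and the DeLong comparison and confidence intervals are not reproduced here.

```python
def sensitivity_specificity(risks, labels, threshold=0.10):
    """Sensitivity and specificity when risk >= threshold counts as positive.

    Assumption: a risk exactly equal to the threshold is treated as positive.
    labels: 1 = malignant, 0 = benign.
    """
    pairs = list(zip(risks, labels))
    tp = sum(1 for r, y in pairs if y == 1 and r >= threshold)
    fn = sum(1 for r, y in pairs if y == 1 and r < threshold)
    tn = sum(1 for r, y in pairs if y == 0 and r < threshold)
    fp = sum(1 for r, y in pairs if y == 0 and r >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auc(risks, labels):
    """Pairwise AUC: the fraction of (malignant, benign) pairs in which the
    malignant nodule receives the higher risk; ties count as one half."""
    pos = [r for r, y in zip(risks, labels) if y == 1]
    neg = [r for r, y in zip(risks, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy scores, NOT data from the study.
toy_risks = [0.02, 0.08, 0.15, 0.05, 0.30, 0.09, 0.45, 0.12]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

sens, spec = sensitivity_specificity(toy_risks, toy_labels)
toy_auc = auc(toy_risks, toy_labels)
print(sens, spec, toy_auc)
```

Applying the same fixed threshold to both models' risk outputs gives directly comparable operating points, whereas the AUC summarises discrimination across all possible thresholds.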

Results: No statistically significant difference in size was detected between the malignant and benign nodules (median [range]: 10.8 [5.8, 15.4] mm and 10.4 [5.8, 15.1] mm, respectively). Cases with malignant nodules had significantly fewer nodules (p=0.001). The AI algorithm significantly outperformed the Brock model (p<0.001): the AUCs [95% CI] of the AI algorithm and the Brock model were 0.87 [0.80-0.94] and 0.59 [0.48-0.69], respectively. At the 10% threshold, the AI algorithm had a higher sensitivity (0.60 [0.46-0.74]) and specificity (0.87 [0.75-0.95]) than the Brock model (0.42 [0.28-0.56] and 0.75 [0.62-0.86], respectively).

Conclusion: The AI algorithm outperformed the Brock model in a clinical dataset drawn from a more heterogeneous population than a screening population. The AI algorithm showed potential for nodule risk stratification in a clinical setting, which can support clinicians in nodule management decisions and thereby potentially reduce unnecessary follow-up.

Limitations: This is a retrospective validation on a single-centre dataset. More research is needed to test the performance on larger, multi-centre datasets.