External test of a deep learning model incorporating prior imaging for risk stratification of persistent pulmonary nodules on follow-up CT

N. Antonissen, R. Dinnessen, D. Peeters, E. Scholten, F. Mohamed Hoesein, R. Vliegenthart, H. Gietema, C. Schaefer-Prokop, M. Prokop and C. Jacobs

Annual Meeting of the Radiological Society of North America 2025.

Purpose: This study externally tested a deep learning model (DL-Dual) developed to estimate the 3-year malignancy risk of persistent pulmonary nodules from both current and prior low-dose CT (LDCT) scans. Its performance was compared with that of a single-time-point model (DL-Single) and the Lung-RADS guideline across short-term and annual follow-up intervals.
Methods and Materials: We analyzed 6,350 paired LDCT scans from the Dutch-Belgian NELSON lung cancer screening trial. Two DL models were evaluated: DL-Dual, trained on annual/biennial CT scan pairs from the NLST cohort, and DL-Single, trained on individual NLST scans. Persistent pulmonary nodules (>=3 mm) were identified at two follow-up intervals: (1) baseline to short-term follow-up (2,541 nodule pairs; 68 malignant; median interval: 91 days), and (2) baseline to first annual incidence round (3,809 nodule pairs; 76 malignant; median interval: 370 days). At the nodule level, malignancy risk scores were computed using DL-Dual (prior + current CT) and DL-Single (current CT only), and Lung-RADS categories were assigned using published growth-based criteria. Performance was measured using the area under the ROC curve (AUC) and the specificity at a sensitivity matched to that of Lung-RADS category 3.
Results: On short-term follow-up, DL-Single (AUC 0.95, 95% CI: 0.94-0.97) significantly outperformed DL-Dual (AUC 0.93, 95% CI: 0.90-0.96; p < 0.05). On annual follow-up, DL-Single (AUC 0.95, 95% CI: 0.93-0.97) and DL-Dual (AUC 0.96, 95% CI: 0.94-0.98) performed similarly (p = 0.16). Both DL models significantly outperformed Lung-RADS at short-term follow-up (Lung-RADS AUC 0.80, 95% CI: 0.74-0.86) and at annual follow-up (Lung-RADS AUC 0.82, 95% CI: 0.77-0.88), all p < 0.05. At matched sensitivity (Lung-RADS category 3), specificities were 95.8% (DL-Dual), 96.2% (DL-Single), and 87.7% (Lung-RADS) on short-term scans, and 97.8%, 96.4%, and 90.6%, respectively, on annual scans.
Conclusions: DL-Dual demonstrated robust overall discriminatory performance; however, DL-Single performed comparably at annual follow-up and outperformed DL-Dual at short-term follow-up. Both models significantly outperformed Lung-RADS and achieved higher specificity at matched sensitivity, highlighting the superior accuracy of AI-based approaches over a growth-based guideline in estimating malignancy risk. Further research is needed to assess DL-Dual's performance in scenarios that may better align with its intended use, such as newly detected nodules, extended follow-up intervals, and non-baseline screening rounds.
Clinical Relevance/Application: DL-based malignancy risk models could improve management of persistent pulmonary nodules and reduce false positives compared to diameter- or volume-based protocols like Lung-RADS.
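The matched-sensitivity comparison described in the Methods can be illustrated with a minimal Python sketch. This is not the authors' code: the function names are hypothetical, the ten-nodule data set is invented purely for illustration, and treating "Lung-RADS category 3 or higher" as the positive cutoff is an assumption about how the reference sensitivity was defined. The sketch computes the sensitivity of the categorical rule, then finds the best specificity a continuous risk score reaches at a threshold with at least that sensitivity, alongside a rank-based AUC.

```python
def auc(scores, labels):
    """AUC as the fraction of malignant/benign pairs ranked correctly (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(predicted_positive, labels):
    """Sensitivity and specificity of a binary call against 0/1 labels."""
    tp = sum(1 for p, y in zip(predicted_positive, labels) if p and y == 1)
    tn = sum(1 for p, y in zip(predicted_positive, labels) if not p and y == 0)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return tp / n_pos, tn / n_neg


def specificity_at_sensitivity(scores, labels, target_sensitivity):
    """Best specificity over all score thresholds whose sensitivity meets the target."""
    best = 0.0
    for threshold in sorted(set(scores)):
        calls = [s >= threshold for s in scores]
        sens, spec = sensitivity_specificity(calls, labels)
        if sens >= target_sensitivity:
            best = max(best, spec)
    return best


# Invented toy data: 6 benign (0) and 4 malignant (1) nodules.
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
lung_rads = [2, 2, 2, 3, 2, 4, 3, 4, 2, 4]  # hypothetical categories
dl_scores = [0.01, 0.02, 0.05, 0.10, 0.03, 0.40, 0.60, 0.80, 0.30, 0.90]

# Assumption: category >= 3 counts as a positive Lung-RADS call.
ref_sens, ref_spec = sensitivity_specificity([c >= 3 for c in lung_rads], labels)

dl_auc = auc(dl_scores, labels)
dl_spec = specificity_at_sensitivity(dl_scores, labels, ref_sens)

print(f"Lung-RADS: sensitivity={ref_sens:.2f}, specificity={ref_spec:.2f}")
print(f"DL score:  AUC={dl_auc:.3f}, specificity at matched sensitivity={dl_spec:.2f}")
```

Matching on sensitivity fixes the number of cancers detected, so any difference in specificity reflects only how many benign nodules each method flags.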