Deep Learning for Lung Nodule Malignancy Prediction: Comparison With Clinicians and the Brock Model on an Independent Dataset From a Large Lung Screening Trial

K. Venkadesh, A. Setio, Z. Saghir, B. van Ginneken and C. Jacobs

Annual Meeting of the Radiological Society of North America 2020.

PURPOSE: Most studies on automated lung nodule malignancy prediction use subjective labels provided by radiologists rather than a histopathological reference standard. The aim of this study was to investigate the performance of a deep learning system trained on subjective labels from LIDC-IDRI by testing it on two independent sets of nodules from the Danish Lung Cancer Screening Trial (DLCST) with histopathological proof or at least 2 years of follow-up, and by comparing its performance with that of a panel of 11 clinicians and the clinically established Brock risk model.
METHOD AND MATERIALS: We considered nodules from the LIDC-IDRI dataset that were annotated by at least 3 of 4 radiologists. The malignancy ratings were averaged and indeterminate nodules were excluded, resulting in 680 nodules (352 benign and 328 malignant) for development. We trained a deep learning system based on a 2D multi-view CNN and a 3D extension of VGGNet on the development set. We tested the system on two sets of nodules from DLCST. The first set, dataset A, consisted of 62 cancers and 120 randomly selected benign nodules; the second set, dataset B, consisted of the same 62 cancers and a size-matched group of 118 benign nodules. A group of 11 clinicians (4 radiologists, 5 radiology residents, and 2 pulmonologists) graded the malignancy risk of each nodule on a continuous scale from 0 to 100. Finally, the Brock risk model was applied to all nodules.
RESULTS: On dataset A, the deep learning system achieved an AUC of 0.941, which was higher than that of the average clinician (0.892, p = 0.02) and comparable to that of the Brock model (0.924, p = 0.35). On dataset B, the system achieved an AUC of 0.737, which was comparable to that of the Brock model (0.70, p = 0.268) but lower than that of the average clinician (0.80, p = 0.034).
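To make the AUC figures above concrete, the following is a minimal sketch of how an AUC can be computed from continuous risk scores (for example, ratings on a 0 to 100 scale) and binary reference labels. It uses the pairwise-ranking definition of the AUC; the function name `auc` and the toy scores are hypothetical and are not taken from the study, whose exact statistical procedure is not described in the abstract.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen malignant nodule
    (label 1) receives a higher score than a randomly chosen benign
    nodule (label 0); tied scores count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one malignant and one benign case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: three malignant and three benign nodules.
toy_scores = [90, 75, 40, 60, 20, 10]
toy_labels = [1, 1, 1, 0, 0, 0]
print(auc(toy_scores, toy_labels))
```

In the toy data, 8 of the 9 malignant/benign pairs are ranked correctly, so the AUC is 8/9. The same function could in principle be applied to the system, to each clinician, and to the Brock model scores on the same nodules; testing whether two AUCs differ requires a separate paired test, which is not sketched here.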
CONCLUSION: The deep learning system trained with subjective labels performed comparably with humans and the Brock model but showed certain vulnerabilities when presented with large benign nodules. It is important to recognize the challenges involved in classifying indeterminate lung nodules and we think the field would benefit from publicly available datasets with a reference standard set by histopathological proof or follow-up.
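As an illustration of the subjective labels referred to above, the sketch below shows one way radiologist ratings could be turned into development labels. LIDC-IDRI malignancy ratings are on a 1 to 5 scale; the rule used here (a mean rating of exactly 3 is indeterminate, below 3 is benign, above 3 is malignant) is an assumption for illustration, since the abstract does not state the exact exclusion rule. The function name `consensus_label` and its parameters are hypothetical.

```python
from statistics import mean
from typing import Optional, Sequence


def consensus_label(
    ratings: Sequence[int],
    min_readers: int = 3,        # nodules need ratings from at least 3 radiologists
    indeterminate: float = 3.0,  # ASSUMED midpoint of the 1-5 rating scale
) -> Optional[int]:
    """Return 1 (malignant), 0 (benign), or None if the nodule is excluded."""
    if len(ratings) < min_readers:
        return None  # too few annotations
    avg = mean(ratings)
    if avg == indeterminate:
        return None  # indeterminate under the assumed rule
    return 1 if avg > indeterminate else 0


# Hypothetical ratings for four nodules.
print(consensus_label([4, 5, 3, 4]))  # mean 4.0 -> malignant
print(consensus_label([1, 2, 2]))     # mean below 3 -> benign
print(consensus_label([2, 3, 4]))     # mean 3.0 -> excluded
print(consensus_label([5, 5]))        # only 2 readers -> excluded
```

Labels derived this way reflect reader opinion rather than histopathology or follow-up, which is the limitation the conclusion highlights.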