Deep Learning for Lung Cancer Detection in Screening CT Scans: Results of a Large-Scale Public Competition and an Observer Study with 11 Radiologists

C. Jacobs, A. Setio, E. Scholten, P. Gerke, H. Bhattacharya, F. M. Hoesein, M. Brink, E. Ranschaert, P. de Jong, M. Silva, B. Geurts, K. Chung, S. Schalekamp, J. Meersschaert, A. Devaraj, P. Pinsky, S. Lam, B. van Ginneken and K. Farahani

Radiology: Artificial Intelligence 2021;3(6):e210027.


Purpose
To determine whether deep learning algorithms developed in a public competition could identify lung cancer on low-dose CT scans with a performance similar to that of radiologists.

Materials and Methods
In this retrospective study, a dataset of 300 patient scans was used for model assessment: 150 patient scans from the competition set and 150 from an independent dataset. Both test datasets contained 50 patient scans with cancer and 100 without cancer. The reference standard was set by histopathological examination for cancer-positive scans and by imaging follow-up for at least 2 years for cancer-negative scans. The top three performing algorithms from the Data Science Bowl 2017 public competition (grt123, Julian de Wit & Daniel Hammack [JWDH], and Aidence) were applied to both test datasets. Model outputs were compared with the results of an observer study in which 11 radiologists assessed the same test datasets. Each scan was scored on a continuous scale by both the deep learning algorithms and the radiologists. Performance was measured using multireader multicase receiver operating characteristic (ROC) analysis.

Results
The area under the ROC curve (AUC) was 0.877 (95% CI: 0.842, 0.910) for grt123, 0.902 (95% CI: 0.871, 0.932) for JWDH, and 0.900 (95% CI: 0.870, 0.928) for Aidence. The average AUC of the radiologists was 0.917 (95% CI: 0.889, 0.945), which was significantly higher than that of grt123 (P = .02); no significant difference was found between the radiologists and JWDH (P = .29) or Aidence (P = .26).

Conclusion
Deep learning algorithms developed in a public competition reached a performance close to that of radiologists.
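
As a minimal illustration of the kind of per-reader metric reported above (not the paper's actual multireader multicase analysis), the sketch below computes an AUC with a percentile-bootstrap 95% CI from continuous suspicion scores and binary cancer labels; the function name, parameters, and data are hypothetical.

```python
# Illustrative sketch only: AUC with a bootstrapped 95% CI for one reader or
# algorithm, assuming continuous malignancy scores and binary labels
# (1 = cancer, 0 = no cancer). The study itself used multireader multicase
# ROC analysis across 11 radiologists and 3 algorithms.
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_with_bootstrap_ci(labels, scores, n_boot=2000, alpha=0.05, seed=0):
    """Return the AUC point estimate and a percentile bootstrap CI."""
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    rng = np.random.default_rng(seed)
    point = roc_auc_score(labels, scores)
    n = len(labels)
    boot = []
    while len(boot) < n_boot:
        idx = rng.integers(0, n, n)          # resample cases with replacement
        if labels[idx].min() == labels[idx].max():
            continue                         # need both classes for an AUC
        boot.append(roc_auc_score(labels[idx], scores[idx]))
    lo, hi = np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return point, (lo, hi)

# Hypothetical usage with made-up scores for a 150-scan test set:
# auc, ci = auc_with_bootstrap_ci(labels, scores)
# print(f"AUC = {auc:.3f} (95% CI: {ci[0]:.3f}, {ci[1]:.3f})")
```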