An observer study comparing radiologists with the prize-winning lung cancer detection algorithms from the 2017 Kaggle Data Science Bowl

C. Jacobs, E. Scholten, A. Schreuder, M. Prokop and B. van Ginneken

Annual Meeting of the Radiological Society of North America 2019.

PURPOSE: The 2017 Kaggle Data Science Bowl challenge awarded 1 million dollars in prize money for computer algorithms that predict, on the basis of a single low-dose screening CT scan, which individuals will be diagnosed with lung cancer within one year of the scan. Participating teams received a training set of around 1500 low-dose CT scans to develop and train their algorithms, and final performance was measured on a test set of 500 scans containing 151 lung cancer cases. Over 2000 teams submitted results. The 10 best algorithms all used deep learning and are freely available as open-source code. To gain insight into how the performance of these algorithms compares with that of radiologists, we conducted an observer study in which 11 readers read 150 cases from the test set.

METHOD AND MATERIALS: We randomly extracted 100 benign cases and 50 lung cancer cases from the test set of the challenge. Each algorithm assigned each test case a score between 0 (low) and 1 (high) for harboring a malignancy. We developed a web-accessible workstation in which human experts could review chest CT scans; it included the common tools found in a professional medical viewing workstation. We invited 11 readers, a mix of radiologists and radiology residents, to read these 150 CT cases and assign a score between 0 (low) and 100 (high) indicating the likelihood that the patient would be diagnosed with lung cancer within one year of the presented scan. ROC analysis was used to compare the performance of the human readers with that of the algorithms; the primary outcome was the area under the ROC curve. 95% confidence intervals were computed with 1000 bootstrap iterations and are reported in brackets.

RESULTS: The mean area under the ROC curve was 0.90 [0.85-0.94] for the human readers and 0.86 [0.81-0.91] for the algorithms. Mean reading time per case ranged from 96 to 275 seconds across readers.
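The analysis described above can be sketched as follows: a minimal, standard-library-only Python illustration of computing the area under the ROC curve (via the Mann-Whitney U statistic) and a percentile bootstrap 95% confidence interval over 1000 case resamples. The function names and toy data are illustrative, not the study's actual code.

```python
import random

def auc(labels, scores):
    """AUC via the Mann-Whitney U statistic: the fraction of
    (cancer, benign) pairs where the cancer case scores higher,
    counting ties as half a win. labels: 1 = cancer, 0 = benign."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(labels, scores, n_iter=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUC: resample cases with
    replacement, recompute the AUC, take the alpha/2 percentiles."""
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_iter:
        idx = [rng.randrange(n) for _ in range(n)]
        lab = [labels[i] for i in idx]
        if 0 < sum(lab) < n:  # resample must contain both classes
            aucs.append(auc(lab, [scores[i] for i in idx]))
    aucs.sort()
    lo = aucs[int(alpha / 2 * n_iter)]
    hi = aucs[int((1 - alpha / 2) * n_iter) - 1]
    return lo, hi

# Toy example: 2 benign and 2 cancer cases with reader-style scores.
labels = [0, 0, 1, 1]
scores = [10, 40, 35, 80]
print(auc(labels, scores))            # 0.75
print(bootstrap_ci(labels, scores))   # (low, high) percentile bounds
```

Reader scores on a 0-100 scale and algorithm scores on a 0-1 scale can be compared directly this way, since the AUC depends only on the ranking of the scores, not their absolute values.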
CONCLUSION: The top 10 algorithms from the 2017 Kaggle Data Science Bowl showed promising performance, but were still inferior to human readers. Future analysis will focus on understanding the strengths and weaknesses of the computer algorithms and the human readers, and how the two can be optimally combined.

CLINICAL RELEVANCE/APPLICATION: Fully automatic deep learning algorithms developed in a large-scale challenge show promising performance for lung cancer detection in chest CT, but performed worse than radiologists on this subset of the test set.