Deep learning for automatic calcium scoring in CT: Validation using multiple cardiac CT and chest CT protocols

S. van Velzen, N. Lessmann, B. Velthuis, I. Bank, D. van den Bongard, T. Leiner, P. de Jong, W. Veldhuis, A. Correa, J. Terry, J. Carr, M. Viergever, H. Verkooijen and I. Išgum

Radiology 2020;295(1):66-79.


Background: Although several deep learning (DL) calcium scoring methods have achieved excellent performance for specific CT protocols, their performance in a range of CT examination types is unknown.

Purpose: To evaluate the performance of a DL method for automatic calcium scoring across a wide range of CT examination types and to investigate whether the method can adapt to different types of CT examinations when representative images are added to the existing training data set.

Materials and Methods: The study included 7240 participants who underwent various types of nonenhanced CT examinations that included the heart: coronary artery calcium (CAC) scoring CT, diagnostic CT of the chest, PET attenuation correction CT, radiation therapy treatment planning CT, CAC screening CT, and low-dose CT of the chest. CAC and thoracic aorta calcification (TAC) were quantified using a convolutional neural network trained with (a) 1181 low-dose chest CT examinations (baseline), (b) a small set of examinations of the respective type supplemented to the baseline (data specific), and (c) a combination of examinations of all available types (combined). Supplemental training sets contained 199-568 CT images depending on the calcium burden of each population. The DL algorithm performance was evaluated with intraclass correlation coefficients (ICCs) between DL and manual (Agatston) CAC and (volume) TAC scoring and with linearly weighted κ values for cardiovascular risk categories (Agatston score; cardiovascular disease risk categories: 0, 1-10, 11-100, 101-400, >400).

Results: At baseline, the DL algorithm yielded ICCs of 0.79-0.97 for CAC and 0.66-0.98 for TAC across the range of different types of CT examinations. ICCs improved to 0.84-0.99 (CAC) and 0.92-0.99 (TAC) for CT protocol-specific training and to 0.85-0.99 (CAC) and 0.96-0.99 (TAC) for combined training. For assignment of cardiovascular disease risk category, the κ value for all test CT scans was 0.90 (95% confidence interval [CI]: 0.89, 0.91) for the baseline training. It increased to 0.92 (95% CI: 0.91, 0.93) for both data-specific and combined training.

Conclusion: A deep learning calcium scoring algorithm for quantification of coronary and thoracic calcium was robust, despite substantial differences in CT protocol and variations in subject population. Augmenting the algorithm training with CT protocol-specific images further improved algorithm performance.
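
The risk-category agreement measure described in the abstract can be illustrated with a minimal sketch. It assumes the five Agatston categories listed above (0, 1-10, 11-100, 101-400, >400) and the standard definition of linearly weighted Cohen's kappa. The function names and the example scores are hypothetical and are not taken from the study, and this is not the authors' evaluation code. The ICC computation is omitted because the abstract does not state which ICC form was used.

```python
from bisect import bisect_left

# Upper bounds of the first four categories named in the abstract:
# 0, 1-10, 11-100, 101-400; anything above 400 falls into the fifth.
UPPER_BOUNDS = [0, 10, 100, 400]


def risk_category(agatston_score):
    """Map an Agatston score to a category index 0..4 (hypothetical helper)."""
    return bisect_left(UPPER_BOUNDS, agatston_score)


def linear_weighted_kappa(rater_a, rater_b, n_categories=5):
    """Linearly weighted Cohen's kappa for two lists of category indices.

    Disagreement between categories i and j is weighted by |i - j|.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    k = n_categories

    # Confusion matrix of observed category pairs.
    observed = [[0] * k for _ in range(k)]
    for i, j in zip(rater_a, rater_b):
        observed[i][j] += 1

    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    # Weighted observed disagreement vs. disagreement expected by chance.
    weighted_observed = sum(
        abs(i - j) * observed[i][j] for i in range(k) for j in range(k)
    )
    weighted_expected = sum(
        abs(i - j) * row_totals[i] * col_totals[j] / n
        for i in range(k)
        for j in range(k)
    )
    if weighted_expected == 0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return 1.0 - weighted_observed / weighted_expected


# Made-up example scores, for illustration only.
manual_scores = [0, 5, 50, 250, 900]
automatic_scores = [0, 5, 50, 250, 350]

manual_categories = [risk_category(s) for s in manual_scores]
automatic_categories = [risk_category(s) for s in automatic_scores]
kappa = linear_weighted_kappa(manual_categories, automatic_categories)
print(f"linearly weighted kappa: {kappa:.3f}")
```

With linear weights, a scan placed one category away from the manual reference is penalized less than one placed several categories away, which suits ordered risk categories such as these.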