Consistency of breast density categories in serial screening mammograms: A comparison between automated and human assessment

K. Holland, J. van Zelst, G. den Heeten, M. Imhof-Tas, R. Mann, C. van Gils and N. Karssemeijer

Breast 2016;29:49-54.

Reliable breast density measurement is needed to personalize screening, both for using density as a risk factor and for offering supplemental screening to women with dense breasts. We investigated how consistently human readers and an automated system categorize pairs of subsequent screening mammograms into density classes. Software (VDG) and four readers, including three specialized breast radiologists, categorized 1000 mammograms belonging to 500 pairs of subsequent screening exams into either two or four density classes. We calculated percent agreement and the percentage of women who changed from dense to non-dense and vice versa. Inter-exam agreement (IEA) was calculated with kappa statistics. Results were computed for each reader individually and for the case in which each mammogram was classified by one of the four readers, assigned at random (group reading). Percent agreement was higher with VDG (90.4\%, CI 87.9-92.9\%) than with readers (86.2-89.2\%), and the less plausible change from non-dense to dense occurred less often with VDG (2.8\%, CI 1.4-4.2\%) than with group reading (4.2\%, CI 2.4-6.0\%). For the readers, IEA was 0.68-0.77 using two classes and 0.76-0.82 using four classes. IEA was significantly higher with VDG than with group reading. The categorization of serial mammograms into density classes is more consistent with automated software than with a mixed group of human readers. When breast density is used to personalize screening protocols, assessment with software may be preferred over assessment by radiologists.
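The agreement measures named in the abstract can be illustrated with a minimal sketch. It assumes a two-class (dense / non-dense) labelling and unweighted Cohen's kappa; the abstract says only "kappa statistics", so the exact variant used in the study is not confirmed here. The function names and the ten-pair toy data are hypothetical and are not taken from the study.

```python
from collections import Counter


def percent_agreement(prior, current):
    """Percentage of exam pairs assigned the same density class at both exams."""
    if len(prior) != len(current) or not prior:
        raise ValueError("prior and current must be non-empty and of equal length")
    matches = sum(p == c for p, c in zip(prior, current))
    return 100.0 * matches / len(prior)


def cohen_kappa(prior, current):
    """Unweighted Cohen's kappa between prior and current exam categories.

    Assumption: the study may have used a different (e.g. weighted) kappa.
    """
    if len(prior) != len(current) or not prior:
        raise ValueError("prior and current must be non-empty and of equal length")
    n = len(prior)
    observed = sum(p == c for p, c in zip(prior, current)) / n
    prior_counts = Counter(prior)
    current_counts = Counter(current)
    # Chance agreement: product of the marginal proportions, summed over classes.
    expected = sum(prior_counts[k] * current_counts[k] for k in prior_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # every exam falls in one class, so agreement is trivially perfect
    return (observed - expected) / (1.0 - expected)


def change_rates(prior, current, dense_label="dense"):
    """Percentage of pairs changing dense -> non-dense and non-dense -> dense."""
    n = len(prior)
    to_non_dense = sum(p == dense_label and c != dense_label for p, c in zip(prior, current))
    to_dense = sum(p != dense_label and c == dense_label for p, c in zip(prior, current))
    return 100.0 * to_non_dense / n, 100.0 * to_dense / n


# Hypothetical toy data: ten exam pairs, not values from the study.
prior = ["dense", "dense", "non-dense", "non-dense", "dense",
         "non-dense", "non-dense", "dense", "non-dense", "non-dense"]
current = ["dense", "non-dense", "non-dense", "non-dense", "dense",
           "non-dense", "dense", "dense", "non-dense", "non-dense"]

print(percent_agreement(prior, current))
print(round(cohen_kappa(prior, current), 3))
print(change_rates(prior, current))
```

In this toy data eight of ten pairs agree, and one pair changes in each direction. Kappa comes out lower than raw agreement suggests because it subtracts the agreement expected by chance from the marginal class frequencies.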