Purpose: Deep learning (DL) systems based on convolutional neural networks (CNNs) have achieved expert-level performance in a range of classification tasks and have shown the potential to significantly reduce the workload of experts. We explore this potential in the context of automated stratification of ophthalmic images. DL could accelerate the setup of clinical studies by filtering large numbers of images or patients based on specific inclusion criteria, and could aid in patient selection for clinical trials. DL could also allow automated categorization of incoming images in busy clinical or screening settings, enhancing data triaging, searching, retrieval, and comparison. Automated stratification could also facilitate data collection and the application of further DL-based phenotyping analysis by generating useful sets of images for expert annotation, or for training or testing segmentation algorithms. In our work, we focus on the stratification of color fundus images (CFI) based on multiple features related to age-related macular degeneration (AMD) at different hierarchical levels. We further analyze the robustness of the automated stratification system when the amount of data available for development is limited. We performed our validation on two different population studies.
Setting/Venue: Deep learning applied to ophthalmic imaging.
Methods: Automated stratification of CFI was performed based on the presence or absence of the following AMD features, following a hierarchical tree with different branches (Bi) and levels (Hi) from generic features (H0) to specific features (H3): AMD findings (H0); B1: drusen (H1), large drusen (H2), reticular pseudodrusen (H3); B2: pigmentary changes (H1), hyperpigmentation (H2), hypopigmentation (H2); B3: late AMD (H1), geographic atrophy (H2), choroidal neovascularization (H2). The automated stratification system consisted of a set of CNNs (based on the Inception-v3 architecture) able to classify the presence or absence of the multiple AMD features at higher and lower levels. This allowed incoming CFI to be stratified automatically into the hierarchical tree. CFI from the AREDS dataset were used for development (106,994 CFI) and testing (27,066 CFI) of the CNNs. We validated the robustness of the system to a gradual decrease in the amount of data available for development (100%, 75%, 50%, 25%, 10%, 5%, 2.5%, and 1% of development data). An external test set (RS1-6) was generated with 2,790 CFI from the Rotterdam Study. This allowed us to validate the performance of the automated stratification across studies in which different CFI grading protocols were used.
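The hierarchical tree described in the Methods can be written down as a small lookup table. The following is a minimal sketch, not the system used in the study: the names `AMD_TREE` and `stratify`, the 0.5 decision threshold, and the input format (one presence probability per feature, as a set of per-feature classifiers might produce) are all assumptions made for illustration. Only the feature names, branches, and levels are taken from the Methods.

```python
# Branch (B1-B3) and level (H0-H3) of each AMD feature, as listed in the Methods.
AMD_TREE = {
    "AMD findings": (None, 0),
    "drusen": ("B1", 1),
    "large drusen": ("B1", 2),
    "reticular pseudodrusen": ("B1", 3),
    "pigmentary changes": ("B2", 1),
    "hyperpigmentation": ("B2", 2),
    "hypopigmentation": ("B2", 2),
    "late AMD": ("B3", 1),
    "geographic atrophy": ("B3", 2),
    "choroidal neovascularization": ("B3", 2),
}


def stratify(probabilities, threshold=0.5):
    """Place one image into the tree from per-feature presence probabilities.

    `probabilities` maps feature name -> probability of presence.
    The threshold is a hypothetical choice, not a value from the study.
    Returns {branch: [(level, feature), ...]} for the features called present,
    ordered from generic to specific levels.
    """
    present = [
        feature
        for feature, p in probabilities.items()
        if feature in AMD_TREE and p >= threshold
    ]
    placement = {}
    for feature in sorted(present, key=lambda f: AMD_TREE[f][1]):
        branch, level = AMD_TREE[feature]
        placement.setdefault(branch or "root", []).append((f"H{level}", feature))
    return placement


# Hypothetical outputs for a single image.
example = {
    "AMD findings": 0.9,
    "drusen": 0.8,
    "large drusen": 0.7,
    "reticular pseudodrusen": 0.1,
    "late AMD": 0.2,
}
print(stratify(example))
```

Each feature is thresholded independently here. Whether a lower-level call should be conditioned on its higher-level parent is a design choice that the Methods do not specify.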
Results: Area under the receiver operating characteristic curve (AUC) was used to measure the performance of each feature’s classification within the automated stratification. The AUC averaged across AMD features when 100% of development data was available was 93.8% (95% CI, 93.4%-94.2%) in AREDS and 84.4% (82.1%-86.5%) in RS1-6. There was an average relative decrease in performance of 10.0±4.7% between AREDS and the external test set, RS1-6. The performance of the system decreased gradually with each development data reduction. When only 1% of data was available for development, the average AUC was 81.9% (81.0%-82.8%) in AREDS and 74.0% (70.8%-77.0%) in RS1-6. This corresponded to an average relative decrease in performance of 12.7±13.2% in AREDS and 12.6±7.8% in RS1-6.
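As a rough check on how the relative decreases relate to the reported AUCs, the short sketch below recomputes them from the averaged AUC values only. The function name `relative_decrease` is an assumption for illustration. The reported figures carry a standard deviation, which suggests they were averaged over per-feature decreases, so recomputing from the averaged AUCs need not reproduce them exactly.

```python
def relative_decrease(reference, value):
    """Relative decrease of `value` with respect to `reference`, in percent."""
    return 100.0 * (reference - value) / reference


# Averaged AUCs (%) reported in the Results.
auc_areds_full, auc_rs_full = 93.8, 84.4      # 100% of development data
auc_areds_1pct, auc_rs_1pct = 81.9, 74.0      # 1% of development data

# AREDS vs. external test set at 100% of development data.
print(round(relative_decrease(auc_areds_full, auc_rs_full), 1))

# 100% vs. 1% of development data, within each test set.
print(round(relative_decrease(auc_areds_full, auc_areds_1pct), 1))
print(round(relative_decrease(auc_rs_full, auc_rs_1pct), 1))
```

The first two values agree with the reported 10.0% and 12.7%. The third comes out slightly below the reported 12.6%, which is consistent with the reported value being a mean of per-feature decreases rather than a decrease of the mean.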
Conclusions: The automated stratification system achieved overall high performance in the classification of different features, independently of their hierarchical level. This shows the potential of DL systems to identify diverse phenotypes and to obtain an accurate automated stratification of CFI. The results showed that automated stratification was also robust to a dramatic reduction in the data available for development, maintaining the average AUC above 80% in AREDS. This is a positive observation, considering that the amount of data available for DL development can be limited in some settings and that gradings can be costly to obtain. Nevertheless, variability in performance across features could be observed, especially for those with very low prevalence, such as reticular pseudodrusen, where performance became more unstable when few data were available. The external validation showed that these observations held when the automated stratification was applied in a different population study, with an expected (but not drastic) drop in performance due to differences between datasets and their grading protocols. In conclusion, our work supports DL as a powerful tool for the filtering and stratification of ophthalmic images, with the potential to reduce the workload of experts while supporting them in research and clinical settings.