Quality assessment of colour fundus and fluorescein angiography images using deep learning

Background/aims Image quality assessment (IQA) is crucial both for reading centres in clinical studies and for routine practice, as only adequate image quality allows clinicians to correctly identify diseases and treat patients accordingly. Here we aim to develop a neural network for automated real-time IQA in colour fundus (CF) and fluorescein angiography (FA) images. Methods Training and evaluation of two neural networks were conducted using 2272 CF and 2492 FA images, with binary labels in four (contrast, focus, illumination, shadow and reflection) and three (contrast, focus, noise) modality-specific categories plus an overall quality ranking. Performance was compared with a second human grader, and evaluated on an external public dataset and in a clinical trial use-case. Results The networks achieved an F1-score/area under the receiver operating characteristic curve/area under the precision-recall curve of 0.907/0.963/0.966 for CF and 0.822/0.918/0.889 for FA in overall quality prediction, with similar results in most categories. A clear relation between model uncertainty and prediction error was observed. In the clinical trial use-case evaluation, the networks achieved an accuracy of 0.930 for CF and 0.895 for FA. Conclusion The presented method allows automated IQA in real time, demonstrating human-level performance for CF as well as FA. Such models can help to overcome the problem of human intergrader and intragrader variability by providing objective and reproducible IQA results. This is particularly relevant for real-time feedback in multicentre clinical studies, where images are uploaded to central reading centre portals. Moreover, automated IQA as a preprocessing step can support the integration of automated approaches into clinical practice.


Methods Details
Baseline method The approach of Sadeghipour et al [1] is used as a baseline for automated image quality assessment, allowing the results of the proposed deep learning (DL) model to be put into better context. This baseline builds on a handcrafted feature-based machine learning approach by Dias et al [2], utilizing an ensemble of classifiers for quality assessment, one per category. For each category, custom hand-crafted features are extracted from the input image and used to train a support vector machine, a naive Bayes classifier, a classification tree and an AdaBoost classifier. The model with the best validation performance is then selected as that category's classifier. Finally, a last classifier is trained using the predicted quality scores of all categories as input, predicting an overall quality score.
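The per-category model selection of the baseline can be sketched as follows. The candidate names and validation scores below are hypothetical stand-ins; the actual baseline trains an SVM, a naive Bayes classifier, a classification tree and an AdaBoost classifier on hand-crafted features before picking the best validation performer.

```python
# Sketch of the baseline's per-category model selection: each candidate
# classifier is scored on the validation set and the best one is kept.
# Candidate names and scores here are hypothetical.

def select_best_classifier(validation_scores):
    """Return the name of the candidate with the highest validation score."""
    return max(validation_scores, key=validation_scores.get)

# Hypothetical validation accuracies for one quality category.
candidates = {"svm": 0.81, "naive_bayes": 0.74, "tree": 0.78, "adaboost": 0.80}
best = select_best_classifier(candidates)
print(best)  # svm
```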
Pre-processing Images were resized to 512 x 512 pixels before being processed by the network. To keep the aspect ratio, the larger side was scaled to 512 pixels, while the smaller side was scaled by the same ratio and padded evenly on both sides with black pixels. Furthermore, different image augmentation techniques were randomly applied during training: random flipping in the vertical and horizontal direction with a probability of 0.5, random rotation between -15 and +15 degrees, vertical/horizontal translation of up to 20%/10% of the image size, as well as scaling of ±10%.
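The aspect-ratio-preserving resize described above can be sketched as follows; this is a minimal sketch computing only the geometry (scaled size and symmetric padding), while the actual pixel resampling and the augmentations would be handled by an image library.

```python
# Geometry of the aspect-ratio-preserving resize: scale the larger side to
# 512 pixels, scale the smaller side by the same factor, and pad the
# remainder evenly on both sides with black pixels.

def letterbox_geometry(width, height, target=512):
    """Return ((new_w, new_h), (left, top, right, bottom)) padding in pixels."""
    scale = target / max(width, height)
    new_w = round(width * scale)
    new_h = round(height * scale)
    pad_w = target - new_w
    pad_h = target - new_h
    # Split the padding as evenly as possible between the two sides.
    padding = (pad_w // 2, pad_h // 2, pad_w - pad_w // 2, pad_h - pad_h // 2)
    return (new_w, new_h), padding

# Example: a 6000 x 4000 pixel fundus image.
size, pad = letterbox_geometry(6000, 4000)
print(size, pad)  # (512, 341) (0, 85, 0, 86)
```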

Architecture
The architecture of the proposed neural network follows a conventional ResNet18 structure as proposed by He et al [3]. This network is composed of an initial convolution, followed by 4 ResNet layers, a global average pooling and a fully connected layer followed by a sigmoid function, forming the model output. The initial convolution consists of a convolutional layer followed by batch normalization, a ReLU activation and max pooling. A ResNet layer is composed of two convolutional blocks, each consisting of a convolutional and a batch normalization layer, with an intermediate ReLU between those two blocks. Furthermore, a residual connection joins the information before and after the ResNet layer. To be able to apply the principle of Monte Carlo dropout [4] and produce an uncertainty score for predictions, dropout layers have been added after each ResNet layer of the network. The model concludes with a fully connected layer with 4 or 5 output neurons, depending on the input modality. Each neuron provides a quality score between 0 and 1 for a specific category. A visualization of the architecture is shown in eFigure 3.

Supplemental material: BMJ Publishing Group Limited (BMJ) disclaims all liability and responsibility arising from any reliance placed on this supplemental material which has been supplied by the author(s).
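The Monte Carlo dropout principle mentioned above can be sketched as follows. The stochastic predictor is a hypothetical stand-in for the ResNet18 with its dropout layers kept active at inference; in practice each call would be a forward pass with a different random dropout mask, and the spread of the repeated predictions serves as the uncertainty score.

```python
# Sketch of Monte Carlo dropout at inference: repeat stochastic forward
# passes, report the mean as the prediction and the spread as uncertainty.
import random
import statistics

def stochastic_predict(image, rng):
    """Hypothetical stand-in for one forward pass with an active dropout mask."""
    base_score = 0.8  # hypothetical underlying quality score for the image
    return min(1.0, max(0.0, base_score + rng.gauss(0.0, 0.05)))

def mc_dropout_predict(image, passes=20, seed=0):
    """Average several stochastic passes; the standard deviation is the uncertainty."""
    rng = random.Random(seed)
    scores = [stochastic_predict(image, rng) for _ in range(passes)]
    return statistics.mean(scores), statistics.stdev(scores)

mean_score, uncertainty = mc_dropout_predict(image=None)
print(round(mean_score, 3), round(uncertainty, 3))
```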

Training details
Training of the DL models was performed using a minibatch size of 32. Both models were trained for 20000 iterations, interrupted by 200 uniformly distributed evaluations of the validation performance. A pre-trained ResNet18 with added dropout layers (p=0.2) was trained using the Adam optimizer with standard parameters and a learning rate of 5*10^-4. The binary cross entropy loss was computed for each output category and combined into a final overall loss using the (equally weighted) average: L_total = (1/K) * Σ_k BCE_k, where K is the number of output categories.

Visit-level prediction In the clinical trial use-case, a visit is composed of an image series of the retina covering a timespan of up to 20 minutes.
Therefore, predictions have to be aggregated from image level to visit level. To achieve this, we first apply the trained final model to each image of the visit. Second, the mean of all individual predictions is calculated to combine the individual results. Finally, a binary prediction on the quality of the whole visit is produced by applying a threshold (CF: 0.417; FA: 0.424) to this mean value. The thresholds were optimized on the validation set to allow an unbiased estimation of the performance on the test set.
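The visit-level aggregation described above can be sketched as follows; the per-image scores in the example are hypothetical, while the thresholds are the ones stated in the text.

```python
# Sketch of the visit-level aggregation: average the per-image overall
# quality scores of a visit and threshold the mean (CF: 0.417, FA: 0.424).

THRESHOLDS = {"CF": 0.417, "FA": 0.424}

def visit_quality(image_scores, modality):
    """Return (mean score, binary good-quality decision) for one visit."""
    mean_score = sum(image_scores) / len(image_scores)
    return mean_score, mean_score >= THRESHOLDS[modality]

# Hypothetical per-image overall quality scores of one FA visit.
scores = [0.61, 0.35, 0.52, 0.47]
mean_score, is_good = visit_quality(scores, "FA")
print(round(mean_score, 4), is_good)  # 0.4875 True
```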
Test set details
Data with regression labels used for evaluating the models was divided into a validation and a test set on a patient level with an approximate ratio of 1:2. The validation set was utilized for monitoring network training and threshold calculations, while the test set was used for the final performance evaluation. For all 264/321 test set images of CF/FA, the means of the human labels per category are shown in eTable 4 before and after binary transformation. The test sets of the clinical trial use-case were constructed as a balanced dataset of 44 (CF) and 86 (FA) visits, with a 50% share of good and bad quality samples each. After processing of the dataset through the model, the ground truth was revised, leading to a share of good quality visits of 0.59 (CF) and 0.47 (FA).

Likert scale evaluation
The model was trained on binary labels. To enable a more detailed evaluation of the prediction error, especially for borderline cases, evaluation samples were annotated with labels of higher granularity. In this experiment, the distribution of good/poor quality predictions provided by the model per Likert scale ground truth label is visualized (eFigure 2). For CF, a clear trend of an increasing number of positive predictions with increasing Likert scale label is visible, with the most even distribution of positive and negative predictions for label 3. For FA, a similar trend is visible apart from a deviation from label 4 to 5.
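The per-label distribution underlying this evaluation can be sketched as follows; the (Likert label, binary prediction) pairs below are hypothetical examples, not the study's data.

```python
# Sketch of the Likert-scale evaluation: group binary model predictions by
# their 1-5 Likert ground truth label and compute the share of 'good
# quality' predictions per label.
from collections import defaultdict

def positive_share_per_label(samples):
    """samples: iterable of (likert_label, binary_prediction) pairs."""
    counts = defaultdict(lambda: [0, 0])  # label -> [positives, total]
    for label, prediction in samples:
        counts[label][0] += int(prediction)
        counts[label][1] += 1
    return {label: pos / total for label, (pos, total) in sorted(counts.items())}

# Hypothetical annotated predictions.
samples = [(1, 0), (1, 0), (2, 0), (3, 0), (3, 1), (4, 1), (5, 1), (5, 1)]
print(positive_share_per_label(samples))  # {1: 0.0, 2: 0.0, 3: 0.5, 4: 1.0, 5: 1.0}
```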
Image size
Quality is assessed in multiple general image quality categories, which may contain artifacts sensitive to the size of images (e.g. focus, noise). We therefore conducted an experiment to evaluate the impact of the input image size on model performance. Four models were trained per modality (CF, FA) on the same training images with different rescaling factors, resulting in image pixel sizes of 256 x 256, 512 x 512, 1024 x 1024 and 1532 x 1532. The average results do not show any significant differences for the evaluated metrics in either modality (eTable 3a, eTable 3b). As an example, the performance of all four models is provided for the category 'noise' in eTable 3c, showing a certain relation between image resolution and model performance. While accuracy, precision and F1-score improve with increasing image size, recall and both area under the curve metrics (AUC-ROC, AUC-PRC) achieve the best results for an image size of 512 x 512 pixels.

Comparison of traditional machine learning baseline and proposed approach
In eTable 5, quantitative results for the hand-crafted feature baseline [1], the proposed deep learning approach and the second human grader are presented.
Our method clearly outperforms the baseline, showing higher numbers in almost all metrics in both modalities. While critical performance drops can be seen in multiple categories for the baseline approach, the proposed model shows a more stable behaviour.
These results are in line with findings that convolutional neural network (CNN) based approaches outperform conventional feature-based methods [7]. However, when considering accuracy and precision, the baseline seems to occasionally achieve better results than the presented approach. At the same time, the used dataset is not balanced for each category (eTable 4), meaning that accuracy is less conclusive compared with other measures (e.g. classifying all samples as good quality could result in a high accuracy due to the lower amount of bad quality samples). Furthermore, precision and recall are interdependent metrics and should only be considered together, e.g. in the form of their harmonic mean (F1-score).
eTable 1: List of imaging devices per manufacturer included in the provided datasets.

Dataset details
The used dataset consists of images acquired at more than 200 clinical sites, with different device manufacturers, varying diseases and pixel resolutions ranging from 496 x 512 to 6000 x 4000 pixels. Details regarding the distribution of image pixel resolutions in X and Y direction among training, validation and test set for CF and FA are visualized in eFigure 1. Information with respect to imaging devices per manufacturer is provided in eTable 1. The distribution of images per manufacturer and diseases among datasets and in total for CF and FA is shown in eTable 2.

Metrics details
Accuracy describes the proportion of correct predictions, precision the proportion of positive predictions which are actually positive, recall the fraction of correctly predicted positives out of all positive examples, and F1-score the harmonic mean of precision and recall [6]. Both AUC measures provide insight into the model performance across changing thresholds and therefore indicate performance stability.
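The threshold-dependent metrics described above can be sketched as follows for binary labels; the ground truth and predictions in the example are hypothetical.

```python
# Sketch of the evaluation metrics: accuracy, precision, recall and
# F1-score computed from binary labels and binary predictions.

def binary_metrics(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical ground truth (1 = good quality) and model predictions.
metrics = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(metrics)  # accuracy 0.6, precision/recall/F1 each 2/3
```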


eFigure 1: Resolution of images in X and Y direction among training, validation and test set for CF (left) and FA (right). A small jitter has been added to better visualize overlapping samples.

Visucam 500, Visucam 524, Visucam NM/FA, Visucam Pro NM

eTable 2: Distribution of images per manufacturer and diseases among datasets and in total for (a) CF and (b) FA.

eTable 3: Performance (accuracy, precision, recall, F1-score, AUC-ROC, AUC-PRC) of four models trained and evaluated on different image sizes (256, 512, 1024, 1532) on the test set. (a) The table on the top shows the performance results averaged over all target categories of CF, (b) the middle table provides the average performance results for FA, and (c) the table on the bottom illustrates metrics for the modality-specific category 'noise' as an exemplary result.

eTable 4: Mean of human labels of the used test sets for (a) CF and (b) FA before and after transformation of regression labels into binary labels.

eTable 5: Quantitative results of the baseline method, the second manual grading and the proposed DL approach on the test set for (a) CF and (b) FA. Accuracy, precision, recall, F1-score, AUC-ROC and AUC-PRC have been calculated for each category. In addition, the average across all categories is provided.