Abstract
The rapid advancements in generative artificial intelligence are set to significantly influence the medical sector, particularly ophthalmology. Generative adversarial networks and diffusion models enable the creation of synthetic images, aiding the development of deep learning models tailored for specific imaging tasks. Additionally, the advent of multimodal foundational models, capable of generating images, text and videos, presents a broad spectrum of applications within ophthalmology. These range from enhancing diagnostic accuracy to improving patient education and training healthcare professionals. Despite the promising potential, this area of technology is still in its infancy, and there are several challenges to be addressed, including data bias, safety concerns and the practical implementation of these technologies in clinical settings.
- Public health
- Medical Education
- Diagnostic tests/Investigation
- Prognosis
Data availability statement
Data sharing not applicable as no datasets generated and/or analysed for this study.
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
Introduction
Generative models have revolutionised the landscape of artificial intelligence (AI), offering groundbreaking capabilities in image generation that hold transformative potential.1 2 This innovation opened new possibilities in medical imaging, including ophthalmology, where generative adversarial networks (GANs) have been employed for tasks such as image synthesis, including the generation of ocular fundus photographs.3 Despite their success, GANs faced limitations like artefact generation and limited variety of image outputs (referred to as mode collapse), leading to the advent of diffusion models.4 Diffusion models have achieved high fidelity and diversity in image generation of colour fundus photographs (CFP) and optical coherence tomography (OCT) images that are indistinguishable from actual clinical photographs.5
The utility of synthetic images in healthcare, particularly for augmenting real training data, addresses critical challenges like sample bias and the under-representation of specific patient groups.6 In ophthalmology, leveraging synthetic images not only improves the training of deep learning (DL) models but also facilitates the development of innovative solutions for rare diseases such as inherited retinal disease (IRD).7 Concurrently, the advent of vision-language models (VLMs) has propelled significant advancements in natural language processing.8 These models, trained on extensive datasets of images paired with textual captions, show promising capabilities in generating visual content from text descriptions. This emerging technology along with future text-to-video enhancements holds potential for ophthalmology, offering novel ways to enhance surgical education and patient care.9–11
In this work, we explore the advancements of generative AI, focusing on the pivotal role of GANs, diffusion models and the burgeoning area of text-to-image models. We also discuss challenges of generative AI including bias, data privacy and patient safety.
History and overview of generative models
A foundational model is an AI system that is developed through extensive self-supervised training on large amounts of unlabelled data that can be later adapted for a variety of downstream tasks.2 Its capability to be trained on multimodal data—such as text and images—enables it to perform tasks like generating text from images and creating images from textual descriptions. Broadly, there are four categories of foundation models, as summarised in table 1: large language models (LLMs), large vision models (LVMs), VLMs and large multimodal models (LMMs).
LLMs are text-focused models trained on extensive textual corpora amassing trillions of words, enabling them to generate human-like text.2 12–16 LVMs specialise in image processing and are trained on vast image datasets for tasks like image recognition and generation. VLMs can generate unique images from textual prompts after being trained on datasets of image-text pairs.17 These datasets comprise millions of images paired with corresponding textual descriptions, enabling models like DALL-E and Stable Diffusion to understand and create complex visual content in response to textual prompts. LMMs represent the most advanced category: they are designed to process and generate content across multiple formats, including text, images, videos and music.18
The literature on foundation models in ophthalmology is expanding. A significant focus has been on LLM applications for medical question answering and clinical reasoning, as well as agentic AI that assists clinicians with workflow tasks such as summarising notes and medical documentation.19–21 While VLMs using Gemini have been tested for interpreting OCT pathology, their reliability remains unproven.22 In contrast, LVMs like RETFound represent a highly promising development. This pioneering model for CFP and OCT has demonstrated superior performance compared with traditional DL models. This achievement is largely due to the use of self-supervised learning, which allows the model to learn effectively from unlabelled images, bypassing the need for extensively annotated datasets. Additionally, the incorporation of Vision Transformers into the model architecture leverages their ability to handle varied image resolutions and complexities, further enhancing performance in analysing medical images.23 The subsequent sections concentrate on the image-generation capabilities of AI in ophthalmology, focusing on image-to-image and text-to-image models.
Image-to-image models
A pivotal moment in AI-driven image generation was the introduction of GANs in 2014 by Goodfellow et al.24 GANs consist of two neural networks: a generator, which creates images, and a discriminator, which evaluates them. The generator’s goal is to produce images so realistic that the discriminator cannot differentiate them from real images; this adversarial process results in the generator producing increasingly refined images. While GANs marked a significant advancement, they had limitations, including the potential for generating artefacts and the challenge of mode collapse, where the generator produces limited varieties of outputs. Diffusion models effectively mitigate both issues. For artefact generation, they enhance image quality by gradually removing noise added to an image in a controlled way; this series of small, precise adjustments yields superior performance metrics in tests of image diversity and realism.25 Regarding mode collapse, the iterative noise reduction process helps maintain a wide variety of image types, preventing the model from limiting its output variety. The training method of diffusion models is thus specifically designed to enhance both the detail and diversity of images, overcoming these major drawbacks of GAN-based methods.4 26
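To make the adversarial process concrete, the following is a minimal, purely illustrative sketch in Python/NumPy: a one-parameter ‘generator’ learns to mimic a one-dimensional ‘real’ data distribution by trying to fool a logistic-regression ‘discriminator’. The toy distribution, learning rates and all names are assumptions for illustration only, not drawn from any model cited above.

```python
import numpy as np

rng = np.random.default_rng(1)

# "Real" data: 1-D samples from N(3, 0.5), standing in for real images.
def sample_real(n):
    return rng.normal(3.0, 0.5, size=(n, 1))

# Generator: a single affine map from noise z to a synthetic sample.
g_w, g_b = np.array([[1.0]]), np.array([0.0])
def generate(z):
    return z @ g_w + g_b

# Discriminator: logistic regression returning P(sample is real).
d_w, d_b = np.array([[0.1]]), np.array([0.0])
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))
def discriminate(x):
    return sigmoid(x @ d_w + d_b)

lr, n = 0.05, 64
for step in range(500):
    z = rng.standard_normal((n, 1))
    fake, real = generate(z), sample_real(n)

    # Discriminator step: push D(real) towards 1 and D(fake) towards 0.
    p_real, p_fake = discriminate(real), discriminate(fake)
    d_w -= lr * (real.T @ (p_real - 1) + fake.T @ p_fake) / n
    d_b -= lr * (np.mean(p_real - 1) + np.mean(p_fake))

    # Generator step: push D(fake) towards 1, ie fool the discriminator.
    grad_fake = (discriminate(fake) - 1) * d_w[0, 0]  # dLoss/dfake
    g_w -= lr * (z.T @ grad_fake) / n
    g_b -= lr * np.mean(grad_fake, axis=0)

# After training, generated samples should drift towards the real mean.
print(float(np.mean(generate(rng.standard_normal((1000, 1))))))
```

In a real GAN the generator and discriminator are deep convolutional networks and the samples are images, but the alternating two-player update shown here is the same.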
We provide a comparative summary of GANs and diffusion models in table 2.
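The gradual, controlled noising that diffusion models learn to reverse can be sketched as follows. The linear noise schedule, the step count and the toy 32×32 ‘image’ are illustrative assumptions, not parameters of any model cited above.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000                               # number of forward diffusion steps
betas = np.linspace(1e-4, 0.02, T)     # linear noise schedule (an assumption)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)         # fraction of original signal retained

def q_sample(x0, t):
    """Sample x_t from q(x_t | x_0): corrupt the clean image x0 with
    Gaussian noise, keeping sqrt(alpha_bar[t]) of the signal."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

# A stand-in "fundus image": a 32x32 array of pixel intensities in [-1, 1].
x0 = rng.uniform(-1.0, 1.0, size=(32, 32))

x_early = q_sample(x0, t=10)     # early step: mostly signal
x_late = q_sample(x0, t=T - 1)   # final step: almost pure Gaussian noise

# A diffusion model is trained to predict the noise added at each step, so
# that sampling can run this process in reverse, from pure noise to image.
```

Because each step removes only a small amount of noise, the reverse process makes the series of small, precise adjustments described above, which underlies the fidelity and diversity advantages over GANs.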
Use cases
GANs and diffusion models are becoming increasingly pivotal in ophthalmology, as they offer a broad range of applications and capabilities. GANs excel in numerous areas: they can predict post-intervention outcomes such as reduced macular oedema and the results of oculoplastic surgeries, remove artefacts from CFPs, generate angiography images from CFPs through domain transfer, create synthetic images of rare diseases to augment data and perform segmentation tasks.3 27–33 These diverse applications underscore their versatility and value in advancing medical imaging and treatment prediction.
Diffusion models in ophthalmology are particularly noted for their ability to produce colour fundus photographs and OCT images that are virtually indistinguishable from actual clinical images.5 This capability not only enhances the realism of synthetic images but also supports more robust training of DL models.
The generation of synthetic images addresses several critical challenges in ophthalmology. Real training datasets often suffer from limitations such as sample bias, including under-representation of certain patient demographics and rare diseases.34 35 Synthetic images can augment these datasets, thereby reducing the number of real images needed and enhancing model performance across more diverse scenarios.36 37 This approach helps overcome the generalisability issues in AI, where models trained on limited datasets may not perform well on unrepresented image types. This lack of generalisability is a significant barrier to the deployment of AI in real-world settings.38
A particularly pressing issue is the under-representation of populations from low and middle-income countries in publicly available datasets, which could exacerbate health inequalities.39 Diffusion models show promise in mitigating this through the generation of CFPs from limited datasets, thus improving equal representation in training data.40
Moreover, synthetic images have shown potential in training models for rare conditions, such as IRD. For instance, the Eye2Gene DL model uses images to predict the genetic profiles of diseases but struggles with under-represented genetic subtypes. To address this, GANs like SynthEye are being developed to generate high-quality synthetic images that diversify the training datasets of diagnostic models, enhancing their accuracy and reliability.6 7
Text-to-image models
Simultaneously, the evolution of VLMs has marked a significant leap in the field of natural language processing and image generation.41 These models, incorporating both the intricate understanding of natural language and visual elements, are trained on vast datasets of images paired with textual captions.42 By integrating the capabilities of diffusion models for high-quality image generation with nuanced language understanding, VLMs such as DALL-E, Stable Diffusion and Midjourney have the ability to produce detailed and contextually relevant images from textual descriptions. Available for public use, these models have garnered considerable attention for their innovative approach to bridging the gap between textual input and visual output, showcasing the potential of combining language and vision in AI.43
Use cases
Currently, text-to-image models such as DALL-E, Midjourney and Stable Diffusion have generalist knowledge and are able to create meaningful images from text prompts. However, for medical use, a niche and specific field, text-to-image generation of medical imaging is not yet fully developed. Using DALL-E 2 to generate X-ray images of anatomical areas such as the skull, hands, chest and ankle produces realistic and well-proportioned images; however, the detailed structure of the bones is not correct.44 When tasked with more complex modalities such as CT and MRI, the model produces images with features that reveal the modality, but the image is overall nonsensical. These widely available text-to-image models appear to have some medical domain knowledge in their training data, but not enough to create realistic medical images.45 Models trained on a greater number of medical images will likely be needed for accurate text-to-image generation.
However, some current features of text-to-image models could already benefit ophthalmology. As most ophthalmic disease is associated with a change in vision, it would be useful to create images that demonstrate visual defects to accompany textual descriptions. This could help both patients and healthcare professionals. For example, ophthalmologists could generate images for clinicians to aid understanding of neuro-ophthalmological conditions such as oscillopsia in multiple sclerosis, Charles Bonnet syndrome, the Pulfrich phenomenon and rare conditions with no examination findings such as visual snow syndrome.10 46 Simulations of eye disorders do exist;47 however, ophthalmologists could create personalised prognostic markers in the form of images to help patients understand how their vision may change if their condition worsens, or what they could expect if treatment is started. Another area is improving informed consent for operations. For procedures that will change a person’s appearance, such as ptosis surgery or orbital decompression surgery, ophthalmologists can generate personalised images to show patients how they could look following surgery.48 We provide examples of generating images with ChatGPT-4 and the built-in DALL-E in figure 1.
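As a hypothetical sketch of how such a description-to-image workflow might be scripted, the function below composes a prompt simulating a visual symptom. The `build_prompt` helper and its wording are our own illustrative assumptions; the commented-out request shows the general shape of a call to a text-to-image API such as the OpenAI Python SDK.

```python
def build_prompt(condition: str, visual_effect: str) -> str:
    """Compose a text-to-image prompt simulating a visual symptom
    (illustrative wording only, not a validated clinical tool)."""
    return (
        f"A photograph of an everyday street scene as perceived by a "
        f"patient with {condition}: {visual_effect}. "
        f"Medical illustration style, no text or labels."
    )

prompt = build_prompt(
    "visual snow syndrome",
    "a persistent layer of flickering, static-like dots across the whole visual field",
)
print(prompt)

# With an API key configured, an image could then be requested, for example:
#   from openai import OpenAI
#   client = OpenAI()
#   image = client.images.generate(model="dall-e-3", prompt=prompt,
#                                  size="1024x1024")
```

Any image produced this way would need clinical review before being shown to patients, for the safety reasons discussed below.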
Further developments include text-to-video generation.49 50 In addition to providing a more immersive experience for the examples above, video generation could be a useful learning tool for surgical simulation and training. Within ophthalmology, practising through surgical simulation is well recognised,9 51 and AI is also being developed to provide real-time analysis during operations.11 Generative AI could enable the generation of videos of operations and procedures, serving as a tool for learning and simulation.
Addressing the challenges and future directions
While generative AI holds immense potential in healthcare, certain challenges demand attention.19 52–54 The emerging issues can be grouped under data bias, safety and implementation. These should not be overlooked by any stakeholder, including academics and the private entities whose presence in AI research has grown.55 56 As a whole, open communication, transparency and regulation are needed to safeguard public health.
Bias, inclusivity and copyright issues in model training
There is a perception that the massive datasets on which generative models are trained are diverse; however, this is difficult to audit.2 12 Neglected sampling errors and bias in these datasets can compromise the ground truth and lead to poor generalisability.19 52–54 This can affect political, racial and socioeconomic domains and can deepen inequalities in healthcare in terms of accessibility and quality of care. Minority groups, immigrants, rural populations and individuals with lower socioeconomic status are particularly vulnerable to such biases, as they are under-represented in online databases and electronic health records.57 58 Efforts to diversify datasets with different real-world resources present another issue, centred on copyright. There have been legal claims regarding the unauthorised use of publicly accessible original work to train generative algorithms.53 59 This necessitates greater transparency and defined standards in model training, along with heightened sensitivity to sources of bias. There is also an apparent need for a legal foundation governing dataset curation for pre-specified training purposes.19 53 54
Safety concerns regarding output integrity and data privacy
One concern regarding generative AI is the generation of incorrect or misleading results, also referred to as ‘hallucinations’. Hallucinations can harm patient safety and awareness through health-related misinformation.19 52–54 More serious complications can arise if these outputs are used within clinical decision support and reporting, or to train other diagnostic algorithms and even physicians.19 52–54 Given that the current legal framework is still immature, it is also unclear who will be held liable for such results.19 52 This can understandably lead to reluctance among end-users, physicians and patients, who are unable to discern hallucinations and, at times, AI-generated from real content.19 53 To avoid hallucinations, AI models nevertheless need to be trained on comprehensive health data. However, this poses a risk of security breaches if the data are shared with multiple centres, and especially with third-party developers.60 In cases of health data violations, in which anonymity is lost or the data are used for purposes other than those specified, negative outcomes such as mental stress, erosion of trust and even group-based harm and discrimination can arise.61 These issues therefore need to be discussed under an ethical umbrella that also covers consent for data processing and patient autonomy.61 62
The development of AI models has been rapid; however, the regulatory discussion and response have been slow. It has even been proposed that all development of the technology be paused until more stringent regulation is in place.63 Bodies like the European Union are moving forward with specific frameworks such as the ‘AI Act’, which categorises AI models by risk level and specifies corresponding guidelines.64 Other existing legal frameworks for digitalisation in healthcare will likely require AI-specific, and particularly generative AI-specific, clauses.60 64 65 Physician engagement in the design of health-related AI regulations through specific taskforces can help uphold patient safety.65
Challenges to implementation into workflow
It is crucial to identify early on which specific tasks in healthcare can be improved with the integration of generative AI.66 These tools need specific positioning within their use cases (ie, care delivery or training), with clear indications and contraindications to delineate appropriate use and accountability.61 62 64 65 67 Without such guidelines, end-users including physicians can be confused over responsibilities and eventually become discouraged.61 62 67 Reliable channels for troubleshooting and reporting adverse operations are also needed to assist end-users in times of need.67 68 User-friendliness is essential, as a complex system can counterintuitively burden staff already facing demanding workloads.68 Physician training on these models and the corresponding guidelines should be considered in medical curricula and as part of continuous professional training.65 Since the capabilities of AI systems are predicted to increase exponentially, the definition of such roles and boundaries within the delivery of care should not be delayed.69 Initiatives that lack this foresight about workflow may falter in real-world practice.
Conclusion
The fast-emerging field of generative AI has immense potential for progress in ophthalmology, including revolutionary advancements in diagnosis, accurate prognostication and professional training. However, challenges remain regarding data bias, safety and implementation. Addressing these through open conversations between academia, government and industry is vital for transparent and effective regulation that mitigates risks. Ultimately, as the digital and real worlds intersect further, we may look back on generative models as the beginning of a new and brighter chapter in healthcare and ophthalmology.
Ethics statements
Patient consent for publication
Ethics approval
Not applicable.
References
Footnotes
SCS and MS are joint first authors.
X @FaresAntaki, @pearsekeane
Contributors Conception and design of the study (SCS, MS, FA, JH and PAK); data collection (SCS and MS); writing of the manuscript (SCS, MS, FA and JH); preparation of figures (SCS and JH); supervision (PAK); review and discussion of the results (all authors); edition and revision of the manuscript (all authors). We have used AI solely in the generation of the Figure 1. We used ChatGPT-4 (OpenAI) to get descriptive written prompts and later to generate images of certain eye conditions based on those prompts. The process is described in detail in the description of the figure summary, submitted alongside the figure.
Funding This study was funded by UK Research and Innovation.
Competing interests None declared.
Provenance and peer review Not commissioned; externally peer reviewed.