New meaning for NLP: the trials and tribulations of natural language processing with GPT-3 in ophthalmology
  1. Siddharth Nath1,2,
  2. Abdullah Marie3,
  3. Simon Ellershaw4,
  4. Edward Korot5,
  5. Pearse A Keane2
  1. 1 Ophthalmology and Visual Sciences, McGill University, Montreal, Quebec, Canada
  2. 2 National Institute for Health Research, Biomedical Research Centre for Ophthalmology, UCL Institute of Ophthalmology, Moorfields Eye Hospital City Road Campus, London, UK
  3. 3 School of Medicine and Dentistry, Queen's University Belfast, Belfast, UK
  4. 4 UKRI Centre for Doctoral Training in AI-enabled Healthcare, University College London, London, UK
  5. 5 Byers Eye Institute, Stanford University, Stanford, California, USA
  1. Correspondence to Dr Pearse A Keane, National Institute for Health Research, Biomedical Research Centre for Ophthalmology, UCL Institute of Ophthalmology, Moorfields Eye Hospital City Road Campus, London, UK; p.keane{at}ucl.ac.uk

Abstract

Natural language processing (NLP) is a subfield of machine intelligence focused on the interaction of human language with computer systems. NLP has recently been discussed in the mainstream media and the literature with the advent of Generative Pre-trained Transformer 3 (GPT-3), a language model capable of producing human-like text. The release of GPT-3 has also sparked renewed interest in the applicability of NLP to contemporary healthcare problems. This article provides an overview of NLP models, with a focus on GPT-3, as well as discussion of applications specific to ophthalmology. We also outline the limitations of GPT-3 and the challenges with its integration into routine ophthalmic care.

  • Diagnostic tests/Investigation
  • Telemedicine

Data availability statement

Data sharing not applicable as no datasets generated and/or analysed for this study.

Introduction

Ophthalmologists have conventionally been trained to fear the acronym ‘NLP’, for it indicates ‘no light perception’—a level of vision where the patient is effectively blind. In a more modern context, and outside of ophthalmology, however, ‘NLP’ has grown to most commonly mean ‘natural language processing’—a subset of machine intelligence focused on the interaction of human language with computer systems.1

NLP has been an area of interest for as long as computers have existed. In 1950, Alan Turing proposed what is now called the ‘Turing test’, an experiment for determining whether the language generated by a computer is distinguishable from that produced by a human.2 Modern language models have come remarkably close to passing a Turing test, with OpenAI’s Generative Pre-trained Transformer 3 (GPT-3) able to write entire pieces of prose nearly indistinguishable from those of human authors.3–5

GPT-3 is the latest in a series of autoregressive language models that use deep learning to generate human-like text. It was trained on the Common Crawl, WebText, Books1 and Books2 datasets and all of English-language Wikipedia, and is therefore often described as having learnt a snapshot of the internet.6 7 Like other autoregressive language models, GPT-3 predicts the next text element in a sequence of words, based on a query, task or instruction given in natural human language. The emphasis with NLP models, such as GPT-3, is the production of a result in human-like text, allowing for broad use cases and easy, user-friendly interaction. Next-word prediction is a useful training and evaluation task because a model that correctly predicts the next word in a sequence is likely to have captured something of the meaning and context of the preceding words.
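
As a rough illustration of next-word prediction, the sketch below trains a toy model that only counts which word follows which in a three-sentence corpus, then generates text by repeatedly appending the most likely next word. This is not how GPT-3 is implemented: the corpus, the function names and the word-pair counting are all assumptions chosen to keep the example small. Only the generation loop, in which each output becomes part of the next input, mirrors the autoregressive idea described above.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count, for every word, which words follow it in the corpus."""
    follow = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            follow[prev][nxt] += 1
    return follow


def generate(follow, prompt, max_new_words=5):
    """Greedy autoregressive decoding: append the likeliest next word, repeat."""
    words = prompt.lower().split()
    for _ in range(max_new_words):
        candidates = follow.get(words[-1])
        if not candidates:
            break  # the last word was never followed by anything in training
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)


# Hypothetical toy corpus, for illustration only.
TOY_CORPUS = [
    "the patient reports blurred vision",
    "the patient reports eye pain",
    "the patient reports blurred vision in the left eye",
]

model = train_bigram(TOY_CORPUS)
print(generate(model, "the patient", max_new_words=3))
```

Because every new word is chosen from the words already generated, an early wrong choice is carried forward into everything that follows.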

These language models collectively form a group known as ‘transformers’, which predict text through a mechanism termed ‘attention’ that assigns a contextual weight to each input in order to determine the appropriate output. Attention builds on mechanisms used by earlier language models (called recurrent neural networks or RNNs), which were sequential in nature and would consider all prior inputs in order to determine an output.8 Transformer models typically use an encoder (responsible for translating the input sequence into vectors that can be interpreted by the model) and a decoder (which converts the model’s processing into human text) in their architecture to complete their queries.
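
The weighting performed by attention can be sketched in a few lines. The example below implements scaled dot-product attention, the standard formulation used in transformer models, in plain Python. It is a simplified sketch: the learned projection matrices, multiple attention heads and masking used in real models are omitted, and the two-dimensional input vectors are made-up values.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dimension).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # The output is the weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Three hypothetical token vectors; each token attends to all three.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how strongly one token draws on every token in the sequence; unlike an RNN, all positions are compared at once rather than one after another.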

Within the realm of transformers, GPT-3 is distinctive in that it has 175 billion parameters, over 10 times more than prior models with similar architecture, and in that it is able to learn using a ‘few-shot’ approach. Effectively, where other language models require substantial task-specific training to contextualise a response, GPT-3 is able to glean appropriate context from only a few short phrases, or even single words, supplied with the query, and is able to output text that is relevant without additional training. Moreover, GPT-3 differs from many other transformers in that it was developed primarily with a decoder architecture. Rather than passing user inputs through a separate encoder, GPT-3 generates a response one element at a time, feeding each of its own outputs back through the model, together with the preceding text, to predict the next. GPT-3 has gained considerable attention in the mainstream media for its ability to process natural language inputs across an array of disciplines and for the breadth of data on which it is trained. For instance, it has been used to write code (and even complete software programmes) through natural language instruction, develop dialogue for virtual reality experiences, translate between languages and even analyse customer experiences.9–11
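
In practice, a ‘few-shot’ approach usually means placing a handful of worked examples in the text sent to the model, followed by the new case. The sketch below only assembles such a prompt as a string; it does not call any model or API. The function name, the `Text:`/`Label:` layout and the example phrases are assumptions made for illustration and are not clinical guidance.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, worked examples and a new query into one prompt."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Text: {text}")
        lines.append(f"Label: {label}")
        lines.append("")
    # The final label is left blank for the model to complete.
    lines.append(f"Text: {query}")
    lines.append("Label:")
    return "\n".join(lines)


# Hypothetical examples, for illustration only.
EXAMPLES = [
    ("Sudden loss of vision in one eye this morning", "urgent"),
    ("Mild itching in both eyes for several weeks", "routine"),
]

prompt = build_few_shot_prompt(
    "Classify each message as urgent or routine.",
    EXAMPLES,
    "Flashes of light and a curtain over my vision since last night",
)
print(prompt)
```

The model's weights are not changed by these examples; the examples simply become part of the sequence whose continuation the model predicts.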

Despite their powerful processing capabilities, GPT-3 and other autoregressive language models like it have considerable, and even potentially harmful, limitations. As it was trained on a large portion of internet content, GPT-3 incorporates the biases found across the web in its own processing. These include gender, racial and geopolitical biases. In the assessment of gender biases, for example, the developers of GPT-3 focused on the associations between gender and occupation. They noted that occupations requiring higher levels of education, such as those in finance or academia, were more likely to be followed by male identifiers, while occupations such as nurse, midwife, receptionist and housekeeper were more likely to be followed by female identifiers.6 12 GPT-3 also has difficulty with ‘common sense physics’ and lacks context about the world (and consequently, the tasks requested of it), as it is not grounded in domains outside of language, such as video or real-world interaction. Like all autoregressive language models, GPT-3 is also unable to correct itself once it begins to make mistakes. In writing prose, for instance, it is unable to go back and edit, and one mistake will often lead to many more, as it uses preceding words to predict its next output.

Nonetheless, given its impressive capabilities and the potential for implementation without the need for ground-up programming, GPT-3 has garnered significant attention from the healthcare community, and offers compelling solutions to contemporary clinical problems.13 This article will provide an overview of some of the existing healthcare applications of GPT-3 and discuss potential uses and challenges in applying it to ophthalmology.

Existing healthcare applications of GPT-3

NLP models have found applications across various healthcare disciplines, most prominently in the assessment of electronic health record (EHR) data. Language models are particularly well suited to this task, as they can process large volumes of text at a scale unachievable by human assessors and require little-to-no visual context for extracting requested data. Researchers have developed NLP models for a variety of EHR use cases, from extracting physician-reported pain data from oncologic consultation notes, to identification of potential clinical trial participants by way of extracting inclusion and exclusion criteria.14 15

Because GPT-3 was initially released through a restricted-access programme requiring formal application and approval, healthcare applications of GPT-3 remain sparse. Most recently, Logé et al at Stanford University developed a dataset for assessing bias in medical question-answering systems in the context of pain management, and tested it using both GPT-3 and its predecessor, GPT-2.16 This work represents a particularly challenging use of autoregressive language models as it introduces a highly complex and multifactorial clinical problem. Pain management in the clinic is complicated by the individual lived experience of pain across patients, variable manifestations of pain sequelae, inherent subjectivity in pain reporting and clinician biases, both implicit and explicit.17–19 The authors noted that GPT-3 advocated for treating every patient with pain management; however, there was significant variability across individual medical contexts and the simulated patients’ gender and race. GPT-3 was 3.6% more likely to refuse pain treatment to black patients than white patients, and was likewise more likely to refuse pain treatment to women.16 This work demonstrates that, although highly trained autoregressive language models like GPT-3 have the potential to address complex clinical problems, clinicians should be wary that inherent societal biases may be incorporated in their training data and resulting implementations.

Potential applications of GPT-3 in ophthalmology

Within ophthalmology, NLP has been trialled in EHR-driven use cases as well.20 For instance, NLP pipelines have been developed to identify patients with an array of ocular diseases, including glaucoma, herpes zoster ophthalmicus, cataracts and pseudoexfoliation syndrome.21–24 NLP has also been developed to triage ophthalmic referrals and has found uses in predicting the rate of antibiotic use and intraoperative complications from operative notes, as well as predicting patient quality of life in association with vision loss in the setting of chronic ocular disease.25–28 All of the above applications required de novo development of pipelines along with requisite planning for large training and validation datasets. Although impressive in their capacity, these pipelines also remain siloed in their own use cases, unable to pull data across fields to inform their tasks. For instance, if a patient with cataracts presents and is triaged by a herpes zoster ophthalmicus model, the outputs of the model will not be meaningful.

GPT-3 offers ophthalmologists and researchers the opportunity to build from a ready-to-use application programming interface (API) and remains ‘informed’ about multiple disciplines, as its training effectively encompasses the English-language internet. GPT-3 is also noteworthy in that it can be adapted to a new task using minimal input (called zero-shot learning or few-shot learning), improving the ease with which solutions can be developed and deployed and also reducing start-up costs. Few-shot learning removes the need for costly, expansive training datasets, as GPT-3 can effectively ‘learn’ from a handful of inputs, and sometimes no inputs at all. For instance, GPT-3 could be used to form the backbone of an ophthalmic triage system that offers human-like responses to emergent questions, with minimal training on potential prompts. Such a system could be implemented in emergency departments or optometrist offices to ensure clinical resources are used in a timely, equitable and efficient manner. Moreover, this system could also use GPT-3’s language translation capabilities to accept questions in an array of local languages and relay responses back in a patient-friendly dialect. Intelligent triage systems have already shown promise in the practice of ophthalmology during the COVID-19 pandemic, and an implementation with GPT-3 would fill an important, unmet need.29

GPT-3 could also be used to develop a consultation note creation system that standardises ophthalmology notes across subspecialties and clinicians. Although NLP, and more broadly, machine intelligence, have been used to develop ambient virtual scribe systems, such as AutoScribe, these implementations remain limited as they can only record dialogue and offer suggestions to clinicians, requiring review of raw speech data and substantial cognitive effort and time.30 As GPT-3 has been shown to write prose from ‘few-shot’ learning, it could conceivably be prompted to create consult notes for ophthalmology from a few keywords entered by clinicians. Coupling this capacity with dictation software could create a powerful EHR tool, which would create standardised, mail-ready consult notes. Such a system could streamline already overloaded ophthalmology clinics, improve the timeliness of responses to referring physicians and ease involvement of patients in research. GPT-3’s intelligent language capabilities could also be used to create teaching tools for global ophthalmology outreach. For instance, charities, such as Orbis, could implement GPT-3 to improve translation of highly technical medical data and streamline patient education and local physician training on medical missions.

Finally, and perhaps most compellingly, GPT-3’s ability to write new code with little training could aid other ophthalmologists and vision science researchers in developing their own artificial intelligence (AI) solutions. Our group has previously evaluated performance of clinicians without coding experience in the development of AI solutions and we have concluded that clinician-driven AI requires appropriate education of physicians in the nuances of the field.31 32 A GPT-3-based coding platform would bypass many of these educational requirements and open the field of AI solution development to end-user clinicians, paving the way for previously unexplored applications.

Challenges in implementation of GPT-3 in ophthalmology

Despite these possibilities, use of GPT-3 in clinical practice carries inherent challenges. Because GPT-3 was trained using a large proportion of the internet, it carries corresponding biases of gender, race and politics, which, within a healthcare setting, could be devastating in patient-facing implementations. Such biases could alienate already vulnerable patient populations, lead to inequitable and inadequate care, and further cement biases in contemporary medical practice. To combat this, clinicians using GPT-3 should consider building safeguards; for instance, within ophthalmology, as gender and race data are not required for triaging many ocular emergencies, such data could be omitted at the collection stage.
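
One way such a safeguard might look in software is a filter that drops demographic fields from a triage record before any text is passed to a language model. The sketch below is a minimal illustration under stated assumptions: the field names, the record layout and the `strip_sensitive` helper are hypothetical, and a real system would also need to handle demographic details mentioned in free text.

```python
# Hypothetical field names; a real intake form would define its own.
SENSITIVE_FIELDS = frozenset({"gender", "race"})


def strip_sensitive(record, sensitive=SENSITIVE_FIELDS):
    """Return a copy of the record without the sensitive fields."""
    return {key: value for key, value in record.items()
            if key.lower() not in sensitive}


# Hypothetical intake record, for illustration only.
intake = {
    "symptom": "painful red eye with reduced vision",
    "duration_days": 2,
    "gender": "female",
    "race": "not stated",
}

safe_record = strip_sensitive(intake)
print(safe_record)
```

The original record is left unchanged, so demographic data collected for other legitimate purposes can be stored separately from the text that reaches the model.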

In addition to inherent biases, implementation of GPT-3 within clinical practice is also limited by its inability to correct mistakes and, more worryingly, its tendency to press on with an inaccurate processing stream. For instance, GPT-3 was widely criticised in the mainstream media for encouraging a simulated patient to commit suicide when trialled in a psychiatric question-answer implementation.33 In order to prevent similar catastrophic results from an ophthalmic implementation of GPT-3, clinicians could ensure that key outputs are assessed for raw quality by a human evaluator. Although this increases required resources, it would still create a ‘semiautonomous’ system, which, if implemented for instance as a triage tool, could nonetheless improve access to care. To minimise the need for clinician resources, a filtering system could be developed, whereby only significant interventions, if suggested by an intelligent system, need to be reviewed by a human evaluator. Interventions or recommendations that do not have potential for harm to patients may not require human approval, thereby reducing the required human cognitive resources. Such systems would be analogous to modern academic medical practice, where junior doctors often make simple clinical decisions independently, but review more complex interventions with senior supervisors or consultants.
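
The filtering idea can be sketched as a simple routing step. In the example below, each model suggestion carries a risk label, and anything not explicitly marked as low risk is queued for human review. The `Suggestion` class, the risk labels and the example texts are hypothetical; how risk would actually be assigned is a clinical and regulatory decision that this sketch does not address.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    text: str
    risk: str  # hypothetical label: "low" or "high"


def route(suggestions):
    """Split suggestions into auto-released and human-review queues."""
    auto_release, needs_review = [], []
    for suggestion in suggestions:
        # Fail safe: only an explicit "low" label skips human review.
        if suggestion.risk == "low":
            auto_release.append(suggestion)
        else:
            needs_review.append(suggestion)
    return auto_release, needs_review


# Hypothetical model outputs, for illustration only.
batch = [
    Suggestion("Provide clinic opening hours", "low"),
    Suggestion("Advise same-day emergency assessment", "high"),
    Suggestion("Recommend stopping a prescribed eye drop", "unknown"),
]

auto_release, needs_review = route(batch)
print(len(auto_release), "released;", len(needs_review), "for review")
```

Defaulting unlabelled or unrecognised items to review mirrors the supervision model described above, where only simple decisions are made independently.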

Implementation of GPT-3 into ophthalmology is also limited by its lack of image-processing capability. Ophthalmology is a highly visual specialty, with examination findings at the slit lamp or multimodal imaging guiding most treatment decisions. To address this, clinicians could turn to transformer models based on GPT-3, which combine text and imaging data, for their solutions. OpenAI, for instance, has recently announced the Contrastive Language-Image Pre-training (CLIP) model and the DALL-E 2 model, both of which combine text and image-processing functionality. In fact, DALL-E 2 has gained considerable attention recently for its ability to create novel images from text input, offering ophthalmologists a powerful augment to GPT-3’s language processing capabilities.34 35

Lastly, implementation of GPT-3 into routine ophthalmic care has been limited by access to the platform itself. GPT-3 was initially available only through a web-based API, managed by OpenAI, but licensed ‘exclusively’ to Microsoft. As of November 2021, OpenAI has expanded availability of the API to end users in a predefined list of countries; however, use of the API remains subject to adherence with OpenAI’s guidelines.36 While democratisation of AI tools and arguments for open access versus privately developed algorithms are beyond the scope of this article, it remains important to consider that any potential healthcare applications require compliance with institutional and regulatory agency privacy laws as well as consistent, reliable performance with limited downtime. Clinicians should consider real-world integration as an important aspect of any AI solution prior to piloting implementations.

Conclusions

In summary, autoregressive language models, such as GPT-3, offer clinicians tremendous potential for addressing complex clinical problems. Within ophthalmology, GPT-3 is especially compelling as it has the capability to improve use of valuable ophthalmic clinical resources, improve clinic flow and bring AI solution development to end-user clinicians. Ophthalmologists and researchers should remain wary of GPT-3’s inherent biases and consider their impact on patients within each specific use case. Careful implementation of GPT-3 and combination text–image models, such as DALL-E 2, holds the potential for transforming patient care and revolutionising contemporary ophthalmology.

Ethics statements

Patient consent for publication

Ethics approval

Not applicable.

Footnotes

  • Twitter @Sid_Nath, @pearsekeane

  • Contributors SN, AM and PAK conceived the review topic and identified relevant literature sources. SN, AM, SE, EK and PAK collaboratively wrote the manuscript. All authors reviewed and approved the final version of the article prior to submission.

  • Funding PAK is supported by a Moorfields Eye Charity Career Development Award (R190028A) and a UK Research & Innovation Future Leaders Fellowship (MR/T019050/1).

  • Competing interests EK has acted as a consultant for Google Health and Genentech and is an equity holder in Reti Health. PAK has acted as a consultant for DeepMind, Roche, Novartis, Apellis and BitFount, and is an equity owner in Big Picture Medical. He has received speaker fees from Heidelberg Engineering, Topcon, Allergan and Bayer.

  • Provenance and peer review Not commissioned; externally peer reviewed.
