Evaluation of the Nallasamy formula: a stacking ensemble machine learning method for refraction prediction in cataract surgery
  1. Tingyang Li (1),
  2. Joshua Stein (2, 3, 4),
  3. Nambi Nallasamy (1, 2)
  1. Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA
  2. Kellogg Eye Center, Department of Ophthalmology and Visual Sciences, University of Michigan, Ann Arbor, Michigan, USA
  3. Center for Eye Policy and Innovation, University of Michigan, Ann Arbor, Michigan, USA
  4. Department of Health Management and Policy, University of Michigan School of Public Health, Ann Arbor, Michigan, USA
  1. Correspondence to Dr Nambi Nallasamy, Kellogg Eye Center, Department of Ophthalmology and Visual Sciences, University of Michigan, Ann Arbor, Michigan, USA; nnallasa{at}


Aims To develop a new intraocular lens power selection method with improved accuracy for general cataract patients receiving Alcon SN60WF lenses.

Methods and analysis A total of 5016 patients (6893 eyes) who underwent cataract surgery at University of Michigan’s Kellogg Eye Center and received the Alcon SN60WF lens were included in the study. A machine learning-based method was developed using a training dataset of 4013 patients (5890 eyes), and evaluated on a testing dataset of 1003 patients (1003 eyes). The performance of our method was compared with that of Barrett Universal II, Emmetropia Verifying Optical (EVO), Haigis, Hoffer Q, Holladay 1, PearlDGS and SRK/T.

Results Mean absolute error (MAE) of the Nallasamy formula in the testing dataset was 0.312 dioptres (D) and the median absolute error (MedAE) was 0.242 D. Performance of existing methods was as follows: Barrett Universal II MAE=0.328 D, MedAE=0.256 D; EVO MAE=0.322 D, MedAE=0.251 D; Haigis MAE=0.363 D, MedAE=0.289 D; Hoffer Q MAE=0.404 D, MedAE=0.331 D; Holladay 1 MAE=0.371 D, MedAE=0.298 D; PearlDGS MAE=0.329 D, MedAE=0.258 D; SRK/T MAE=0.376 D, MedAE=0.300 D. The Nallasamy formula performed significantly better than all seven existing methods based on the paired Wilcoxon test with Bonferroni correction (p<0.05).

Conclusions The Nallasamy formula (available online) outperformed the seven other formulas studied on overall MAE, MedAE, and percentage of eyes within 0.5 D of prediction. Clinical significance may be primarily at the population level.

  • Lens and zonules
  • Optics and Refraction

Data availability statement

No data are available.

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See:

Key message

What is already known on this topic

  • Despite numerous recent advances, there is still room for improvement in intraocular lens (IOL) power selection.

What this study adds

  • The Nallasamy formula is an ensemble machine learning (ML)-based method for IOL power selection that outperformed seven existing IOL formulas, including Barrett, Emmetropia Verifying Optical and PearlDGS.

How this study might affect research, practice or policy

  • The Nallasamy formula offers cataract surgeons an accurate, ML-based method for IOL power selection and has the potential to help improve refractive outcomes after cataract surgery.


Introduction

Cataract surgery is the most commonly performed surgical procedure in the United States (approximately 4 million/year) and worldwide (approximately 23 million/year). The appropriate selection of intraocular lens (IOL) power based on accurate prediction of postoperative refraction is necessary for achieving a favourable refractive outcome and is closely associated with patient satisfaction. An inappropriate IOL power was found to be the indication for approximately 20% of cataract surgery cases that required secondary intervention, lens removal or lens exchange, according to analyses of records between 2002 and 2017.1 2

Various generations of IOL power calculation formulas have been published since the 1960s. From the earliest regression formulas (Binkhorst formula, SRK formula) to the fourth and fifth generation of vergence formulas which established the effective lens position (ELP) as a function of the axial length, lens thickness (LT) and keratometry, the accuracy of IOL power calculation has been substantially improved. Among existing formulas, the Barrett Universal II formula3 is widely used and several publications have demonstrated that Barrett Universal II has greater accuracy than other traditional formulas.4 5 In addition to the above-mentioned formulas, a number of new IOL formulas have been published recently, such as the Emmetropia Verifying Optical (EVO) formula,6 which is a theoretical thick lens formula, and the PearlDGS formula,7 8 which is a machine learning (ML)-based thick lens calculation method.

Although the methodology for IOL power selection has been studied for decades, patient expectations for refractive outcomes continue to rise and room remains for improvement in refraction prediction performance. ML and artificial intelligence have proven to be successful in many medical applications, including ophthalmology.9 10 Researchers have begun to incorporate ML into IOL power calculations in recent years.

However, recently published ML-based IOL calculation methods share key limitations: (1) performance comparisons limited to older generation formulas,11 (2) failure to achieve statistically significant improvement over current generation formulas,12 and (3) small datasets that leave the robustness and generalisability of the methods in question.13

With a goal of advancing the understanding of IOL power selection for general cataract patients and improving refraction prediction accuracy, in this study, we developed a novel ML-based IOL power calculation method, the Nallasamy formula, based on a large dataset of 5016 cataract patients. In this model, we employed ensemble ML methods and novel data augmentation methods. The performance of our method was compared with that of Barrett Universal II, EVO, Haigis, Hoffer Q, Holladay 1, PearlDGS, and SRK/T on an unseen testing dataset of 1003 patients.

Materials and methods

Data collection and preprocessing

This study focused on a subset of patients receiving care at the University of Michigan between 25 August 2015 and 27 June 2019. The preoperative biometry records were obtained from Lenstar LS 900 optical biometers (Haag-Streit USA, EyeSuite software V.i9.1.0.0) at University of Michigan’s Kellogg Eye Center. Patient demographics (including patient age, gender and ethnicity) and cataract surgery information were obtained via the Sight Outcomes Research Collaborative (SOURCE) Ophthalmology Data Repository. SOURCE is a data repository that tracks the electronic health record data of all patients receiving any eye care at participating academic medical institutions. The information deposited in SOURCE includes patient demographics, diagnoses identified based on International Classification of Diseases codes, procedures based on Current Procedural Terminology (CPT) codes, and structured and unstructured (free-text) data from all clinical encounters (clinic visits, operative reports, etc). Various studies using data from SOURCE have been published.14–18 Manifest refractions were performed at the end of the first postoperative month by trained technicians employed by University of Michigan’s Kellogg Eye Center. Manifest refraction data were obtained through the SOURCE repository.

The inclusion criteria for the cases were as follows: (1) cataract surgery was performed (CPT code 66982 or 66984); (2) an Alcon SN60WF one-piece acrylic monofocal lens was implanted; (3) no refractive surgery was performed before the cataract surgery; (4) no additional surgery was performed at the time of cataract surgery (cases with any CPT code other than 66982 or 66984 were excluded); (5) visual acuity was 20/40 or better; and (6) data were complete and were not out of bounds for any of the formulas with which performance was compared.

Stacking ensemble ML framework

After all preprocessing steps, we obtained a clean tabular dataset of 5016 patients wherein each eye had a complete profile of preoperative biometry, patient demographics (patient gender and age), the power of the implanted IOL and the postoperative refraction. Preoperative biometry included the axial length (AL), crystalline LT, anterior chamber depth (ACD), aqueous depth, astigmatism, white-to-white, central corneal thickness, and keratometry (K1 and K2, [equation: keratometry value derived from K1 and K2]). The postoperative refraction was calculated from the spherical component (SC) and the cylindrical component (CC) with an adjustment with regard to the lane length at Kellogg Eye Center (10 ft, 3.048 m): [equation: postoperative refraction from SC and CC with the lane-length adjustment] according to Simpson and Charman’s recommendation.19
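The equation for the postoperative refraction is not legible in this copy. As a hedged reconstruction only, assuming the usual spherical equivalent and a vergence correction from the lane distance d = 3.048 m to optical infinity, the calculation may take the following form. The symbols R_meas, R_inf and d are introduced here for illustration and are not taken from the original.

```latex
\begin{aligned}
R_{\text{meas}} &= SC + \tfrac{1}{2}\,CC \\
R_{\infty} &= R_{\text{meas}} - \frac{1}{d}
            = SC + \tfrac{1}{2}\,CC - \frac{1}{3.048}
            \approx SC + \tfrac{1}{2}\,CC - 0.328\ \text{D}
\end{aligned}
```

The sign follows from vergence at the spectacle plane: a target at distance d contributes a vergence of -1/d, so the lens that corrects the eye at the lane distance is 1/d more positive than the lens that corrects it at infinity.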

The prediction task was framed as a regression problem where the goal was to build an ML algorithm that predicts the postoperative refraction using available information. The value to be predicted is referred to as the target value (represented as Y in figure 1) and the inputs that are used to make the predictions are referred to as features or predictors (represented as X in figure 1). The dataset was randomly split at the patient level into a training/validation set with 4013 patients (5890 eyes), which was 80% of all patients, and a testing set with 1003 patients, which was 20% of all patients (figure 1). To make sure all samples in the testing set were independent, for every testing-set patient with both eyes available, one eye was selected at random and dropped, leaving 1003 eyes. The training/validation set was used for cross-validation and hyperparameter selection of the ML model. The testing set was used for performance comparison between the existing formulas and our ML-based method.
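The split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the record layout and the `patient_id` key are assumptions made for the example.

```python
import random


def split_by_patient(eyes, test_fraction=0.2, seed=0):
    """Split eye records into train and test sets at the patient level.

    `eyes` is a list of dicts with a hypothetical "patient_id" key.
    Every test patient contributes exactly one randomly chosen eye.
    """
    rng = random.Random(seed)
    patient_ids = sorted({eye["patient_id"] for eye in eyes})
    rng.shuffle(patient_ids)
    n_test = round(len(patient_ids) * test_fraction)
    test_ids = set(patient_ids[:n_test])

    # Training patients keep all of their eyes.
    train = [eye for eye in eyes if eye["patient_id"] not in test_ids]

    # Group the test eyes by patient, then keep one eye per patient.
    by_patient = {}
    for eye in eyes:
        if eye["patient_id"] in test_ids:
            by_patient.setdefault(eye["patient_id"], []).append(eye)
    test = [rng.choice(group) for _, group in sorted(by_patient.items())]
    return train, test
```

Splitting on patients rather than eyes keeps both eyes of one person on the same side of the split, so the testing set never contains an eye whose fellow eye was seen in training.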

Figure 1

The overall method pipeline. IOL, intraocular lens.

Ensemble learning is a technique that combines the predictions from base learners with the goal of reducing variance and achieving improved prediction performance. An ensemble model is generally expected to outperform its individual learners.20 Stacking (or stacked generalisation) is one of the most commonly used meta-learning paradigms, where a number of base learners are trained using the raw training data and a single meta-learner is trained to combine the predictions from the base learners.21 The reason for using an ensemble ML model in this study is to take advantage of different classes of ML algorithms and improve the overall performance of the model. The stacking model consists of two layers. In the first layer, a group of level-1 learners was trained on the raw data (preoperative patient data and the postoperative refraction). The second layer consists of the level-2 meta-model, which uses the output of the level-1 learners as its input features. Therefore, the number of input features for the level-2 model equals the number of level-1 models. The output from the level-2 meta-model is the final prediction result (figure 1).
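To make the two-layer idea concrete, here is a deliberately small stacking sketch in plain Python. It is not the Nallasamy formula: the level-1 learners (one least-squares line per feature column) and the level-2 model (a single least-squares blending weight) are toy choices assumed for illustration, and all names are hypothetical. The level-2 model is fitted on out-of-fold level-1 predictions so that it does not simply reward learners that memorise the training rows.

```python
class SingleFeatureLinear:
    """Least-squares line fitted on one feature column (toy level-1 learner)."""

    def __init__(self, column):
        self.column = column
        self.slope = 0.0
        self.intercept = 0.0

    def fit(self, X, y):
        xs = [row[self.column] for row in X]
        mean_x = sum(xs) / len(xs)
        mean_y = sum(y) / len(y)
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (t - mean_y) for x, t in zip(xs, y))
        self.slope = sxy / sxx if sxx else 0.0
        self.intercept = mean_y - self.slope * mean_x
        return self

    def predict(self, X):
        return [self.intercept + self.slope * row[self.column] for row in X]


def out_of_fold_predictions(make_learner, X, y, n_folds=5):
    """Predict each training row with a learner that never saw that row."""
    preds = [0.0] * len(X)
    for fold in range(n_folds):
        held_out = [i for i in range(len(X)) if i % n_folds == fold]
        kept = [i for i in range(len(X)) if i % n_folds != fold]
        learner = make_learner().fit([X[i] for i in kept], [y[i] for i in kept])
        for i, p in zip(held_out, learner.predict([X[i] for i in held_out])):
            preds[i] = p
    return preds


def fit_stack(X, y):
    """Two level-1 learners plus a blending weight as the level-2 model."""
    makers = [lambda: SingleFeatureLinear(0), lambda: SingleFeatureLinear(1)]
    p1, p2 = (out_of_fold_predictions(m, X, y) for m in makers)
    # Level 2: least-squares weight w for the blend w*p1 + (1 - w)*p2.
    num = sum((a - b) * (t - b) for a, b, t in zip(p1, p2, y))
    den = sum((a - b) ** 2 for a, b in zip(p1, p2))
    w = num / den if den else 0.5
    learners = [m().fit(X, y) for m in makers]  # refit on all training rows
    return learners, w


def predict_stack(model, X):
    """Blend the level-1 predictions with the learned level-2 weight."""
    (l1, l2), w = model
    return [w * a + (1 - w) * b for a, b in zip(l1.predict(X), l2.predict(X))]
```

In this toy version the level-2 model has as many inputs as there are level-1 learners (two), which mirrors the relationship stated in the text. A practical implementation would use richer learners and a more flexible meta-model.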

Lens constant optimisation of existing IOL formulas

The existing formulas Haigis, Hoffer Q, Holladay 1 and SRK/T were implemented in Python based on their specific equations.22–29 The results obtained were validated against printouts from Haag-Streit USA, EyeSuite software V.i9.1.0.0. The prediction results of Barrett Universal II,3 EVO (V2.0)6 and PearlDGS7 8 were obtained through their online calculators. The constants of the corresponding formulas were optimised based on the cases in the training dataset (4013 patients). The optimal constant was selected by zeroing the mean prediction error. The optimised constants are listed in table 1.
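Zeroing the mean prediction error amounts to a one-dimensional root search over the lens constant. The sketch below uses bisection and assumes the mean error is monotonic in the constant over the search interval. The `toy_formula` function is an illustrative stand-in, not one of the published formulas, and the dictionary keys are hypothetical. The sign convention chosen for the error (predicted minus measured) is also an assumption, but it does not affect where the mean error crosses zero.

```python
def optimise_constant(predict, eyes, lo, hi, tol=1e-6):
    """Bisection for the constant at which the mean prediction error is zero.

    `predict(constant, eye)` returns a predicted refraction; each eye dict
    carries the measured value under the hypothetical key "refraction".
    Assumes the mean error changes sign exactly once between lo and hi.
    """
    def mean_error(constant):
        errors = [predict(constant, eye) - eye["refraction"] for eye in eyes]
        return sum(errors) / len(errors)

    err_lo = mean_error(lo)
    if err_lo * mean_error(hi) > 0:
        raise ValueError("mean error does not change sign on [lo, hi]")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        err_mid = mean_error(mid)
        if err_lo * err_mid <= 0:
            hi = mid  # the zero crossing lies in [lo, mid]
        else:
            lo, err_lo = mid, err_mid  # the zero crossing lies in [mid, hi]
    return (lo + hi) / 2


def toy_formula(a_constant, eye):
    """Illustrative linear stand-in, NOT one of the published formulas."""
    emmetropic_power = a_constant - 2.5 * eye["al"] - 0.9 * eye["k"]
    return (emmetropic_power - eye["iol_power"]) / 1.5
```

For formulas reached only through an online calculator, the same search would have to be driven by repeated calculator runs rather than a local function.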

Table 1

The optimised lens constants

Cross-validation and hyperparameter tuning

During the development of the ML model, we performed model evaluation and selection through five-fold cross-validation. In each fold, the 4013 training/validation patients were divided into a training set and a validation set. For patients with both eyes in the validation set, one randomly chosen eye was removed. The optimisation of hyperparameters of the ML models, the combination of the level-1 models, and the selection of the level-2 model were performed by minimising the averaged mean absolute error (MAE) based on the cross-validation results.
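A hedged sketch of this procedure is given below. The fold assignment, the `patient_id` and `refraction` keys, and the `fit` and `predict` callables are all assumptions made for the example; the source does not describe how folds were assigned.

```python
import random


def patient_folds(eyes, n_folds=5, seed=0):
    """Yield (train, validation) pairs with each patient kept within one fold.

    In each validation set a patient with two eyes keeps one, chosen at random.
    "patient_id" is a hypothetical key name.
    """
    rng = random.Random(seed)
    ids = sorted({eye["patient_id"] for eye in eyes})
    rng.shuffle(ids)
    fold_of = {pid: i % n_folds for i, pid in enumerate(ids)}
    for fold in range(n_folds):
        train = [e for e in eyes if fold_of[e["patient_id"]] != fold]
        grouped = {}
        for e in eyes:
            if fold_of[e["patient_id"]] == fold:
                grouped.setdefault(e["patient_id"], []).append(e)
        validation = [rng.choice(group) for group in grouped.values()]
        yield train, validation


def cross_validated_mae(eyes, fit, predict, n_folds=5):
    """Average the validation MAE over the folds for one candidate model."""
    fold_maes = []
    for train, validation in patient_folds(eyes, n_folds):
        model = fit(train)
        errors = [abs(predict(model, e) - e["refraction"]) for e in validation]
        fold_maes.append(sum(errors) / len(errors))
    return sum(fold_maes) / len(fold_maes)
```

Model selection then reduces to calling `cross_validated_mae` once per candidate configuration and keeping the candidate with the smallest value.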

Performance comparison on the testing set

To compare the performance of our method with that of existing IOL formulas, we trained the ML model with the entire training dataset (5890 eyes) and made predictions on the testing dataset. We calculated the mean arithmetic error (ME), MAE and median absolute error (MedAE) of the postoperative refraction predictions, and the SD of the prediction error. We also calculated the number and percentage of patients with an absolute prediction error of less than or equal to 0.25 D, 0.50 D, 0.75 D and 1.00 D, and evaluated the statistical significance of the difference between formulas with Cochran’s Q test. The statistical significance of the difference between the testing set performance of the IOL formulas was assessed using a Friedman test followed by a paired Wilcoxon test with Bonferroni correction. To investigate the performance of our method in cases with different axial lengths, we calculated the SD, ME, MAE, and MedAE for patients in the short AL group (AL<22 mm), medium AL group (22 mm ≤AL ≤ 26 mm) and long AL group (AL>26 mm). In addition to the above metrics, we calculated the slope of the linear fit of the arithmetic error against AL, denoted m. Using the above variables, we computed the IOL Formula Performance Index (FPI) as recommended by Hoffer and Savini30 for each formula as follows, where n is the percentage of eyes with an absolute error within 0.5 D. A higher FPI means better accuracy.

[Equation: FPI as a function of SD, MedAE, m and n]
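The FPI equation itself is not legible in this copy. The sketch below computes the summary metrics named above and an FPI under an assumed form, 1 / (SD + MedAE + 10|m| + 10 / (10 n)), with n expressed as a fraction between 0 and 1. That form is a reconstruction and should be checked against reference 30 before use. Function names and the choice of the sample SD are likewise assumptions.

```python
import statistics


def summarise_errors(errors, axial_lengths):
    """Summary metrics for signed prediction errors (dioptres)."""
    abs_errors = [abs(e) for e in errors]
    # Least-squares slope of the signed error against axial length.
    mean_al = statistics.mean(axial_lengths)
    mean_err = statistics.mean(errors)
    sxx = sum((a - mean_al) ** 2 for a in axial_lengths)
    sxy = sum((a - mean_al) * (e - mean_err)
              for a, e in zip(axial_lengths, errors))
    # Fraction of eyes whose absolute error is within each threshold.
    within = {t: sum(a <= t for a in abs_errors) / len(abs_errors)
              for t in (0.25, 0.50, 0.75, 1.00)}
    return {
        "ME": mean_err,
        "MAE": statistics.mean(abs_errors),
        "MedAE": statistics.median(abs_errors),
        "SD": statistics.stdev(errors),  # sample SD; an assumption
        "slope": sxy / sxx,
        "within": within,
    }


def formula_performance_index(sd, medae, slope, fraction_within_half):
    """FPI under the ASSUMED form 1 / (SD + MedAE + 10|m| + 10 / (10 n))."""
    denominator = sd + medae + 10 * abs(slope) + 10 / (10 * fraction_within_half)
    return 1.0 / denominator
```

Under this assumed form every term in the denominator shrinks as a formula improves, which is consistent with the statement that a higher FPI means better accuracy.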

To investigate the effect of the size of the training data on the performance of the ML model, we randomly sampled 10%, 20%, …, 90% of the training data, then retrained and compared the alternative models’ results on the testing set. The proportions of training cases were adjusted before the application of data augmentation and data transformation techniques. All other configurations and hyperparameters were kept the same for alternative models except for the number of training cases.
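The subsampling experiment can be sketched as a simple loop. Here `fit` and `evaluate` are hypothetical callables: `fit` stands for the whole training procedure (including any augmentation, which per the text is applied after subsampling) and `evaluate` returns a score such as the testing-set MAE.

```python
import random


def learning_curve(train, test, fit, evaluate, seed=0):
    """Retrain on 10%, 20%, ..., 100% of `train`; return (fraction, score) pairs."""
    rng = random.Random(seed)
    results = []
    for tenths in range(1, 11):
        size = max(1, len(train) * tenths // 10)
        subset = rng.sample(train, size)  # sampling without replacement
        results.append((tenths / 10, evaluate(fit(subset), test)))
    return results
```

Each fraction is drawn independently here. Whether the original subsets were nested or independent is not stated in the text.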

The criterion for statistical significance was p<0.05. All statistical analyses were scripted with Python V.3.9.5. In this study, the refraction prediction error was defined as follows.

[Equation: definition of the refraction prediction error]


Results

Out of 5016 patients, 4013 patients (5890 eyes) were assigned to the training/validation dataset, and 1003 cases were isolated as a hold-out testing dataset for performance comparison. A summary of the patient demographics in the training and testing sets is shown in table 2. A total of 49 surgeons performed the surgeries included in the dataset. The distribution of data is shown in online supplemental figure S1.

Table 2

Summary of patient demographics

The performance of our method and existing methods is shown in table 3. According to the Wilcoxon test, our method performed significantly better than all the other seven methods with an MAE of 0.312 D, which was 4.9% lower than that of Barrett (0.328 D) and 3.1% lower than that of EVO (0.322 D). The specific p values can be found in online supplemental table S1. Our method also achieved the highest FPI.
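The relative reductions quoted above follow directly from the reported MAEs:

```latex
\frac{0.328 - 0.312}{0.328} = \frac{0.016}{0.328} \approx 0.049 \;(4.9\%),
\qquad
\frac{0.322 - 0.312}{0.322} = \frac{0.010}{0.322} \approx 0.031 \;(3.1\%)
```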

Table 3

Performance summary in the testing set

The percentage of patients with an absolute error less than or equal to 0.25 D, 0.50 D, 0.75 D and 1.00 D is shown in figure 2. Our method resulted in a larger percentage of patients in the absolute error ≤0.5 D group (80.2%) compared with Barrett (78.3%), EVO (79.8%) and PearlDGS (77.7%), and a larger percentage of patients in the absolute error ≤1.0 D group (97.6%) compared with Barrett (96.6%), EVO (96.9%) and PearlDGS (97.4%). Overall, our method achieved the highest percentage in the absolute error ≤0.5 D group among all eight formulas, and was statistically better than all other formulas except EVO on this metric (Cochran’s Q test p values are shown in online supplemental table S2).

Figure 2

The percentage of patients in each error category for each formula, calculated based on the results in the testing dataset. EVO, Emmetropia Verifying Optical.

We compared the performance of the tested formulas among patients with different axial lengths in table 4. Numerically, our method achieved the lowest MAEs and SDs among all eight formulas in all three AL groups. The relationship between the prediction errors and the ALs is shown in figure 3. The errors of our method remained close to zero across the whole span of ALs.

Figure 3

The mean prediction errors in the testing set grouped based on axial lengths. Each dot represents the mean prediction error of eyes with an axial length within a specific range. EVO, Emmetropia Verifying Optical.

Table 4

The postoperative refraction prediction performance of existing formulas and our method in short/medium/long AL groups in the testing set

When the model was trained with different proportions of the training data (figure 4), the corresponding performance on the testing set displayed a trend towards improving performance (decreased MAE) with increasing training set sizes.

Figure 4

The change of the mean absolute prediction error in the testing set when the machine learning method uses 10%, 20%, …, 100% of the training data.


Discussion

We have presented here a new ML-based IOL power calculation method which performs statistically significantly better than Barrett Universal II, EVO (V.2.0) and PearlDGS on a large unseen testing dataset. We chose an ensemble ML framework for this particular problem, and this choice allows the method to compensate for the potential biases of individual learners. During the development of the model, we designed and applied several data augmentation methods to enhance prediction performance. Data augmentation methods are beneficial not only for enlarging the dataset but also for addressing natural imbalances in clinical datasets. The biometry measures are not uniformly distributed, as shown in online supplemental figure S1. For example, the axial length has more instances in the medium group (between 22 mm and 26 mm) than in the long and short AL groups. The postoperative refractions and the implanted IOL powers were not uniformly distributed either. All IOL powers in the dataset were manually selected by surgeons with a particular target refraction in mind, typically between 0 D and −3 D. Data augmentation helps to account for the scarcity of extreme cases and for biases introduced by the clinical decision-making process.

In this study, we used a relatively large dataset of 6893 eyes. Evaluation of the relationship between the proportion of the available training data used and MAE demonstrated the expected inverse relationship. This trend continued even as the training set was increased from 90% to 100% of the available training data (figure 4), indicating the potential for further improvement as the same model is exposed to larger datasets.

We achieved lower MAEs than Barrett Universal II, PearlDGS, and EVO in all three axial length groups. Our method yielded 80.2% of eyes with a predicted refraction within ±0.5 D of the true refraction, which was approximately 2 percentage points more than that of Barrett (78.3%) (p=0.04). The Nallasamy formula also achieved 51.2% of eyes within ±0.25 D, which was approximately 2 percentage points more than that of all other methods (next closest was EVO at 49.3%). Owing to the sheer volume of cataract surgery worldwide (23 million cataract surgeries each year), achieving an additional 2% of patients with refractive error less than 0.25 D would likely be clinically relevant at a population level. At the same time, the difference in MAE between our method and the next closest (EVO) of 0.010 D is not likely to be of clinical significance for the average patient. This discrepancy in clinical relevance appears to arise from the difference between the average patient and the overall population. Table 4 demonstrates that the differences in MAE are smaller in the medium axial length group than in the short and long axial length groups. Since there are far more patients in the medium axial length group than in the short and long axial length groups, the reported MAE reflects the smaller difference in errors in the more common medium axial length group. The overall difference in percentage of patients with errors less than 0.25 D is reflective of larger errors typically seen in the short and long axial length groups. Figure 3 highlights the divergence in prediction error of the Nallasamy formula and other methods at the limits of axial length.

Recently, Hoffer et al proposed in Ophthalmology the use of the FPI as a means of evaluating and ranking the performance of IOL power calculation methods.30 Higher values of the FPI indicate higher performance. Our method strongly outperformed the existing formulas on FPI, achieving a 0.447 FPI while the existing formulas ranged from 0.085 to 0.312 (table 3). The FPI takes into account (1) the SD of the prediction error, (2) the MedAE, (3) the AL bias, and (4) the percentage of eyes with refraction predictions within 0.5 D of true refractions. Our method demonstrated superior performance on each of these individual metrics, as summarised in table 3. Of particular note is our method’s superior SD of the prediction error, which Holladay et al recently referred to as “the single best parameter to characterise the performance of an IOL power calculation formula.”31

Also of interest is the AL bias, which is calculated as the slope of the correlation of the AL and the prediction error for a given formula. The existing IOL formulas demonstrate strong correlations between AL and the prediction error, as depicted in figure 3. ML-based methods such as ours, on the other hand, have the potential to better capture the nonlinearity of the relationship between biometric variables, IOL power, and postoperative refraction, resulting in substantially smaller AL bias (eg, −0.03 for Nallasamy vs 0.31 for Barrett). This translates to improved performance across AL categories (short, medium, and long), and should obviate the need for using different formulas based on axial length.

We are aware of multiple limitations of our study. Our method has not yet been validated on a dataset from a different medical institution. Performance analysis on external datasets will be a focus of future work as we begin to apply our approach to different populations around the world. Another limitation is that we were not able to compare our performance with a few formulas such as Hill-RBF because of a lack of access. However, prior studies indicate that Barrett Universal II is a good reference point for top-tier IOL formulas.4 5 32 An additional limitation is that at present, our method has been customised for the Alcon SN60WF lens, and additional data will be needed to adjust the method for additional lens models. We were not able to test the Nallasamy formula’s performance on eyes with extremely long or extremely short axial lengths due to a lack of available data in our dataset. Considering the Nallasamy formula was not trained with those eyes either, we believe the Nallasamy formula is currently not suitable to be used for extreme eyes. The online Nallasamy formula calculator displays a warning message if AL is outside the range of 21 mm to 31.5 mm. Similarly, a warning is displayed if the K readings are outside the range of 37 D to 52 D.
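A range check of the kind described for the online calculator might look like the sketch below. Only the numeric bounds come from the text; the function name, the message wording and the treatment of the bounds as inclusive are assumptions, and this is not the calculator's actual code.

```python
def input_warnings(al_mm, k1_d, k2_d):
    """Return warning strings for inputs outside the ranges quoted in the text."""
    warnings = []
    # Bounds are treated as inclusive here; the source does not say which.
    if not 21.0 <= al_mm <= 31.5:
        warnings.append(f"AL {al_mm} mm is outside 21-31.5 mm")
    for name, value in (("K1", k1_d), ("K2", k2_d)):
        if not 37.0 <= value <= 52.0:
            warnings.append(f"{name} {value} D is outside 37-52 D")
    return warnings
```

Returning a list rather than raising an error matches the described behaviour of warning the user while still producing a prediction.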

An intrinsic difference between ML-based methods and the vergence formulas is that vergence formulas estimate the ELP as a vital variable during the calculation of the postoperative refraction, but ML-based methods usually take a one-step approach for prediction, unless the model is specifically designed to predict both the ELP and the postoperative refraction. In previously published work, we reported the development of an ML-based method for postoperative ACD estimation.16 17 However, the method presented here does not rely on prediction of a postoperative ACD or ELP as an intermediate variable, unlike the vergence formulas. This approach may allow the ML method to avoid the propagation of errors (however small) introduced during the prediction of the postoperative ACD or ELP.

While the theoretical optics-based methods remain crucial for special cases, ML offers improved performance for large populations through the identification of latent patterns in historical data that can go unrecognised by existing methods. To that end, we have reported here the successful development and testing of an ML-based approach to IOL power calculation for cataract surgery that outperforms Barrett Universal II, PearlDGS and EVO on all broadly accepted metrics of IOL calculation performance. The Nallasamy formula is now freely available to the public online.

Ethics statements

Patient consent for publication

Ethics approval

Institutional review board approval was obtained for the presented study. All subjects were fully anonymised and therefore informed consent was not required for this retrospective study. The study was carried out in accordance with the tenets of the Declaration of Helsinki.



  • Contributors TL: data analysis, programming and writing of the manuscript; JS: data collection; NN: data analysis, programming, data collection, guidance on method development, and writing of the manuscript. NN is responsible for the overall content as the guarantor.

  • Funding This work was supported by the Lighthouse Guild, New York, NY (JS) and National Eye Institute, Bethesda, MD, 1R01EY026641-01A1 (JS).

  • Competing interests None declared.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.
