Introduction

Ageing has a significant structural impact on the brain that correlates with decreased mental and physical fitness1 and increased risk of neurodegenerative diseases such as Alzheimer’s disease2 and Parkinson’s disease3. Recent publications, have demonstrated that MRIs can be used to predict chronological age with reasonably good accuracy1,4,5. Such predictions provide an estimate of biological brain age in independent samples. The traditional way to perform brain age prediction is to extract features from brain MRIs followed by classification or regression analysis. This includes extracting principal components4, cortical thickness and surface curvature6, volume of gray matter (GM), white matter (WM), and cerebrospinal fluid (CSF)7, and constructing a similarity matrix8. The drawback of using feature extraction methods is loss of information since the features are likely not designed explicitly for extracting information relevant to brain age. Recently, deep learning (DL) methods have garnered much interest9. These methods learn features that are important without a priori bias or hypothesis. Convolutional neural networks (CNNs)10 are deep learning techniques that are especially powerful for image processing and computer vision. Previously, they have been applied to brain age prediction11,12. Notably, Cole et al.12 implemented a 3D CNN trained on T1-weighted MRIs to predict brain age and achieved promising results.

PAD (the difference between predicted brain age and chronological age) estimates the deviation from healthy ageing. Studies have shown that positive PAD correlates with measures of reduced mental and physical fitness; including weaker grip strength, poorer lung function, slower walking speed, lower fluid intelligence, higher allostatic load, and increased mortality risk1. In addition, positive PAD has been shown to associate with cognitive impairments5,8,13,14, diabetes15, traumatic brain injuries8, schizophrenia16,17, and chronic pain18. On the other hand, a negative PAD associates with higher educational attainment19, increased physical activity19, and meditation20. Moreover, PAD has been demonstrated to be heritable12,21 and to have a polygenic overlap with brain disorders such as schizophrenia, bipolar disorder, multiple sclerosis, and Alzheimer’s disease21. Furthermore, the high degree of genetic correlation found among psychiatric and some neurological disorders suggests that current diagnostic boundaries do not necessarily reflect underlying biology22. Hence, defining a novel phenotype capturing global age-related changes in brain structure could, via variants in the sequence of the genome that associate with these changes, provide novel biological insights.

Here we present a new brain age prediction method (Fig. 1) that uses a 3D CNN trained on MRIs to predict brain age. The input data are a T1-weighted image registered to Montréal Neurological Institute (MNI) space and data derived from the T1-weighted image, i.e., a Jacobian map, and gray and white matter segmented images (Fig. 1). The input data also include information about the subject’s sex and the type of MRI scanner. The output of the network is the predicted brain age.

As mentioned above, Cole et al.12 trained a 3D CNN to perform brain age prediction. Our network is different in four key ways. (1) We use a significantly different architecture. While their architecture resembles a standard VGGNet architecture23 our architecture uses the recent ResNet design24. One of the drawbacks of the VGG architecture is that the vanishing gradient problem limits the potential depth of the network. In contrast, the ResNet architecture has no such depth limits. ResNets also have smoother loss surfaces25, which in turn helps speeding up convergence. (2) We add inputs to the final CNN layer to factor in information about sex and scanner. (3) Our technique is the first to use deformation information encoded in Jacobian maps to predict brain age. (4) As we have mentioned, our method combines predictions from multiple CNNs by either averaging predictions or by training a data blender.

In experiments, we compare our proposed method to a few brain age prediction methods based on feature extraction and machine learning. We also demonstrate that transfer learning is useful for adapting a CNN trained to predict brain age on one site to a new site while retaining predictive accuracy. And we look at how the PAD calculated with our method is affected by random weight initialization and retraining. We then check for associations between PAD and performance on neuropsychological tests. Finally, we perform genetic analysis on PAD using UK Biobank data, resulting in identification of associations with five sequence variants for which we provide detailed phenotypic characterizations.

Results

Combining CNN outputs improves prediction accuracy

Our brain age prediction method was developed using images from structural brain MRIs for 1264 healthy Icelanders. To overcome problems caused by training a DL method on such a small dataset we use multiple images of the same individuals and utilize a data augmentation strategy. We start off by training the method independently on the four previously mentioned image types (Table 1A). The CNN that predicts the test set with the least error is the CNN trained on T1-weighted images followed by the CNN trained on WM segmented images (Supplementary Figs. 4 and 5 show scatter plots of the CNN test set predictions against chronological age).

Table 1 Chronological age prediction accuracy for the considered methods.

Having four predictions from four different data sources opens up the possibility of combining the predictions. The most straightforward way of fusing the forecasts is by using a majority voting scheme, e.g., by averaging the predictions made by the four CNNs. Another way to combine forecasts is to implement a data blender, for example, by implementing a linear regression model trained to predict brain age from the four CNN brain age predictions. This technique attempts to find the best linear combination of the four brain age predictions so in theory it should be guaranteed to be at least as good as the best predicting CNN method. To demonstrate this, we tried combining CNN brain age predictions using majority voting and linear regression data blending (Table 1B). Comparing the test set results of Table 1B to the results in Table 1A, we see that combining predictions results in lower test error than achieved by the CNN trained on T1-weighted images.

It is not straightforward to compare the accuracy of our method to previous brain age prediction methods, because they are evaluated on other datasets. However, to establish a baseline for the CNN-based techniques, we investigated methods based on feature extraction such as surface-based morphometry (SBM)26, voxel-based morphometry (VBM)27, and similarity matrices. Machine learning regression methods were trained on these three types of features separately. For each feature type, we trained eight different types of regression methods. The list of methods we tried is far from exhaustive, instead these methods were chosen to represent commonly used and relatively simple to tune regression methods. In addition, we considered methods such as relevance vector regression (RVR)28 and Gaussian process regression (GPR)29 which have previously been successfully used to predict brain age4,8. Table 1C shows the prediction results for the regression models with the lowest test error for each feature type (the Methods section and Supplementary Table 1 include more information and results for the regression methods trained on these features). In addition, Table 1C shows results for combining the best predictions for SBM, VBM and similarity matrix features using the same methods used to combine the CNN predictions. Similarly to the CNNs, the combined predictions have lower test MAE than any of the methods limited to single feature types. However, if we compare the results in Table 1B and C we see that the predictions made with combined CNN outputs are more accurate than any of those based on the feature extraction methods.

Testing the CNN on other datasets

Next, we examine how the method performs if we predict brain age of images from other datasets. To do so, we evaluate it on the IXI and UK Biobank30 datasets and combine predictions using majority voting. We use this combination method rather than data blending because it has similar accuracy to linear regression blender with the added benefit that it is unnecessary to train an extra linear model on the predictions. We observe that the initial prediction error of the method is high (Table 2). The problem is that there can be subtle differences between data from different scanning sites which will cause a model trained on one site to fail when predicting on the other site. There are multiple reasons for this. The MRI scanner type and parameters between sites can be different, which can cause differences between resolution, contrast and noise levels. Also, the distribution of age can be different between sites, for example, it is problematic if the new site has a wider age range than the training set.

Table 2 UK Biobank and IXI prediction performance with and without transfer learning.

We hypothesize that a CNN that is already proficient at predicting brain age at one site only needs a small adjustment to adapt to data from a new site. A transfer learning strategy achieves this: First, we freeze the model weights of the convolutional layers so that only the fully connected layers are trainable. Second, the CNN is re-trained on a portion of the data from the new site. An advantage of this strategy is that there are now fewer parameters to train, which means we can use less data and training will be faster. We carry out the transfer learning strategy by retraining the majority voting CNN on 440 images from the IXI dataset.

The re-trained CNN is validated on 104 images from the IXI dataset left out during training (validation set) and tested on 12395 images from the UK Biobank dataset (test set). Table 2 shows that the prediction accuracy is increased significantly by doing so. In addition, the test set predictions before and after transfer learning are shown on a scatter plot against chronological age (Supplementary Fig. 6). Surprisingly the accuracy of predictions for the UK Biobank site improve even though the CNN was not explicitly trained on it. This is intriguing and is perhaps explained by the fact that the IXI set includes a wider age range than the Icelandic set and includes 3T MRI images unlike the Icelandic set.

In subsequent sections, the CNNs trained with transfer learning on the IXI sample will be used in downstream analysis of the UK Biobank sample. While it is likely that transfer learning on a small subset of the UK Biobank sample will lower the UK Biobank test MAE, we refrained from doing this because we want to use as many subjects as possible in the downstream analysis and because of the limited age range in the UK Biobank sample (all subjects are in the age range 45–80 years). Training on such a limited age range would severely bias the model towards predicting ages inside this range. To get around this, it is necessary to train the model on a sample with a wider age range. This is why we use the CNNs trained on the IXI sample, which includes subjects in the age range 20–86 years, in the downstream analysis.

Effect of random CNN weight initialization on PAD

We know that because CNNs start out in random initial states, and because they have highly non-convex loss functions25, it is possible that two randomly initialized instances of our brain age prediction method will converge to two distinct local minima. These states could in theory both predict age equally well but have uncorrelated PAD values. Here we face a potential problem, because in the absence of a ground truth for the PAD there is no way to tell if either one of these PAD predictions is accurate. This sort of unreliable CNN behavior would be problematic for any downstream analysis that utilizes the brain age prediction, because any conclusions made about the PAD would depend on the initialization of the CNN. In light of this, it would be reassuring if we could demonstrate that our method generally converges to similar PAD predictions after training.

To test this, four additional randomly initialized instances of our brain age prediction method are trained and the agreement between their PADs is examined. This procedure entails repeating these three main steps four times: (1) Train four CNNs on the Icelandic dataset on the four previously mentioned image types. (2) Freeze convolutions layers and train the CNNs on the IXI dataset (transfer learning step). (3) Predict brain age in the UK Biobank dataset using CNNs, combine the predictions with majority voting and calculate PAD values.

After repeating these steps, we get four instances of the brain age prediction method that predict brain age of the 12395 subjects in the UK Biobank with mean absolute error (MAE) equal to 4.6, 5.5, 5.4, and 4.9 years, respectively. The reason why the error is higher here compared with the original results is that we did not reinitialize and retrain the CNNs in cases were the optimization got stuck in a poor local minimum or a saddle point. Nevertheless, if we look at the agreement of the original and the four new PADs we find that the intraclass correlation (ICC) is estimated to be equal to 0.86 (95% confidence interval [CI] = [0.855, 0.863]). This indicates that the UK Biobank PAD calculated using our method stays rather consistent between the five different training runs and is relatively robust to random weight initialization.

Associations between PAD and performance on neuropsychological tests

As mentioned above, previous studies have linked high PAD to cognitive impairment5,8,13,14. In light of this, we are interested in looking at if PAD associates with performance on neuropsychological tests. Specifically, performance on tests administered by the UK Biobank that are designed to measure: fluid intelligence, numeric memory, visual memory, prospective memory, simple processing speed, complex processing speed, visual attention, and verbal fluency. To estimate PAD in the UK Biobank, we train four CNNs on the Icelandic set, then the IXI set using transfer learning, and combine their predictions using majority voting. We did not find evidence of association between PAD and performance on the fluid intelligence, numeric memory, pairs matching, and prospective memory tests (Supplementary Table 2 includes these results). However, we see from Table 3 that PAD is associated with worse performance on the digit substitution test (DSST), trail making tests (TMTs), and the reaction time test (a more detailed description of the tests can be found in Supplementary Notes 17). As expected, these results indicate that PAD is in fact associated with cognitive impairment.

Table 3 Pearson’s r correlation between PAD and performance on neuropsychological tests.

Genome-wide association study

PAD has previously been shown to be heritable12,21, however, to our knowledge no sequence variants conferring risk of or protecting against PAD have been identified. In order to look for such variants, we ran a genome wide association scan (GWAS) in the UK Biobank sample on PAD (same PAD as from the section Testing the CNN on other datasets) using BOLT-LMM31. This scan yields two sequence variants, rs2435204-G and rs1452628-T (Fig. 2 and Table 4A) (Supplementary Figs. 8 and 9 show locus zoom plots for the two genome-wide significant variants). In addition, given that sequence variants known to associate with brain structure are likely to be enriched for variants that associate with PAD. We decided to test a smaller set of 331 brain structure variants for association with PAD. This yielded associations with three additional variants (Table 4B). For more information, the ‘Statistical methods’ section contains information about how the brain structure variants were identified.

Fig. 1
figure 1

Illustration of the proposed method and input data. a A flowchart showing a high-level overview of the proposed brain age prediction system. b Examples of image types generated by the preprocessing step. From left to right: a registered T1-weighted slice, a Jacobian map slice, a GM segmented slice, and a WM segmented slice.

Fig. 2
figure 2

Manhattan plot of the GWAS results for the UK Biobank data. The horizontal line denotes the P value threshold for genome-wide significant effect.

Table 4 Sequence variants associated with PAD estimated using BOLT-LMM.

The high number of tests conducted in GWAS combined with the general small effect size of common markers greatly increases the risk of a false postives32. To protect against potential confound effects we adjusted for potential nuisance variables, such as age, gender, total intracranial volume, principal components from genetic ancestry analysis, head motion, genotyping array, and imaging center. In addition, we removed individuals of non-white British ancestry and one subject from each related pair of individuals (the Statistical Methods section provides more information about the exact adjustment procedure). And then to thoroughly vet each hit we took two steps. (1) We performed a replication test on held out data. (2) Checked if the reported variants associate with other phenotypes related to brain ageing.

(1) The five reported sequence variants also associated with PAD in a replication set of 4456 subjects (Table 4 [C, D]). Four other variants which came up in the discovery stage were omitted because they did not replicate. (2) The identified sequence variants also associate with brain structure likely to be affected by brain ageing. Associations between these sequence variants and SBM/VBM brain structure phenotypes and correlation between PAD and the brain structure phenotypes are shown in Supplementary Tables 4 and 5. Supplementary Table 4 shows that both PAD and rs1452628-T associate with lower CSF throughout the cerebral cortex which is consistent with reduced cortical sulcal openings. On the other hand, rs2435204-G associates with lower total white matter surface area, and reduced area in a number of cortical brain regions (Supplementary Table 5). Although it was known a priori that the other three sequence variants would associate with structural brain phenotypes, the specific structural brain phenotypes that associate with these variants are listed in Supplementary Tables 68.

Running LD score regression33 on the GWAS results, we estimated the SNP-heritability for PAD to be \({h}_{{\mathrm{snp}}}^{2}=0.264\) (95% CI = [0.178, 0.350]). In addition, the intercept of the LD score regression model is equal to 0.991 (95% CI = [0.979, 1.003]), which indicates that the model did not find any evidence of confounding effects in the PAD GWAS results. This \({h}_{{\mathrm{snp}}}^{2}\) estimate is close to the one previously estimated by Kaufmann et al.21 (\({h}_{{\mathrm{snp}}}^{2}=0.1828\) [SE \(=0.02\)]). And predictably our \({h}_{{\mathrm{snp}}}^{2}\) is lower than the narrow-sense heritability estimate of PAD estimated by Cole et al.12 (\({h}^{2}=0.66\) [SE \(=0.09\)]) using a twin study sample.

Discussion

Here, we have presented a novel deep learning approach, using residual convolutional neural networks to predict brain age from a T1-weighted MRI, a Jacobian map, and gray and white matter segmented images, to study the discrepancy between age-related structural brain changes and chronological age. The MRI based deep learning system was shown to predict brain age from T1-weighted MRI data with a \(\,\text{MAE}\,=3.39\) and \({R}^{2}=0.87\) on test data. Comparing our approach to other machine learning methods trained on surface-based morphometry, voxel-based morphometry, and similarity matrix features, we showed that our approach predicts brain age more accurately. We showed that transfer learning can be used to efficiently increase prediction accuracy for new sites. The PAD calculated using this method was shown to be relatively robust to random weight initialization and retraining, a result that indicates that the PAD estimated using our method can be used as a reliable phenotype in the study of brain ageing, as well as in the study of specific disorders of the brain. We also proposed that PAD could be an informative phenotype for genetic association studies, and indeed, our association analysis of PAD in a discovery set of 12378 subjects and replication set of 4456 subjects yielded five sequence variants.

The sequence variant with the strongest association, rs2435204-G, tags the H2 (inverted) form of the 17q21.31 inversion polymorphism34. This inversion spans ~1 Mb and includes 10 genes, including MAPT, a gene that encodes the tau protein which has been implicated in various dementias35. In addition, micro-deletions within the inversion are known to cause intellectual disability36. The H1 inversion haplotype has been associated with increased risk of Parkinson’s disease, male-pattern baldness, and several other phenotypes, whereas H2 has been associated with a number of phenotypes including neuroticism37, fibromyalgia18, lower educational attainment, increased fecundity38, and smaller intracranial volume (ICV)39 (Note that PAD is adjusted for ICV, thus the observed effect on PAD is not caused by ICV). Due to the extensive linkage disequilibrium (LD) the 17q21.31 inversion region, reported markers for various associations in the region often differ between studies. For example, the most recent GWAS meta-analysis of Parkinson’s disease reports an association with rs17649553-T, that is fixated on and highly correlated with the H2-tagging rs2435204-G (\({r}^{2}=0.82\), \(D^{\prime} =1\)), with OR \(=0.78\) (95% CI \(=[0.76,0.80]\)), P= \(1.26\times 1{0}^{-68}\) (their meta-analysis was carried out with a fixed-effects model based on inverse-variance weighting)40.

rs2435204-G also associates with brain structure phenotypes. Supplementary Table 5, shows that both PAD and rs2435204-G associate with increased thickness and decreased area in cortical brain regions. Interestingly, this pattern of increased thickness and decreased area has previously been associated with neuroticism41. Thus, lifestyle or phenotypes associated with a high neuroticism score, including anxiety, worry, fear, anger, frustration, depressed mood, and loneliness may associate with PAD.

The other genome-wide significant sequence variant, rs1452628-T, is located close to KCNK2 (also known as TREK1), which belongs to the two-pore domain potassium channel family and is mainly expressed in the brain42. In mice, KCNK2 has been implicated in neuroinflammation43, cerebral ischemia44, and blood-brain barrier dysfunction45. rs1452628-T correlates with SNPs that have previously been associated with cortical sulcal opening and GM thickness, rs6667184 (\({r}^{2}=0.68\)), and rs864736 (\({r}^{2}=0.49\))46.

In addition, we identified three sequence variants associated with PAD by restricting the analysis to SNPs known a priori to associate with structural phenotypes. (1) rs2790099-C is located in an intron of RUNX2, a gene that encodes the RUNX2 protein which is essential for osteoblastic differentiation and skeletal morphogenesis and has been shown to play several roles in cell cycle regulation47. Supplementary Fig. 7 shows that rs2790099-C is a possible cis-eQTL of RUNX2 and it is most expressed in the basal ganglia (caudate and putamen). This lines up with the a priori brain structure GWAS that shows that rs2790099-C has genome-wide significant associations with white matter volume of regions in the basal ganglia (putamen and pallidum) (Supplementary Table 6). (2) rs6437412-C is an intron variant of LINC01968 that associates with increased cortical CSF (Supplementary Table 7). (3) rs2184968-C is located in an intron of CENPW, a gene that has previously been associated with traits, such as, height48, cognitive performance49, and male-pattern baldness50. Our analysis shows that rs2184968-C is associated with increased CSF in subcortical regions and increased size of the fourth ventricle (Supplementary Table 8).

Confound effects are a problem for big imaging studies due to the huge number of imaging artifacts that can potentially influence both imaging and non-imaging variables of interest32. Some of the confound effects we have tried to control for are effects due to age, sex, head size, population structure, and scanner type. Head motion is another potentially problematic confound effect, because it causes reduction of estimated gray matter volume and thickness in MRI images similar to what we expect to see due to ageing51. While head motion is not important in the evaluation of our method (see Cole et al.12), it is potentially a problematic confound for GWAS analysis because certain clinical groups associate more with scanner motion. Elliott et al.52 suggest to use fMRI-derived head motion estimates to correct for confound effects due to head motion when running GWAS analysis on brain structure phenotypes. We adjusted PAD for head motion as they suggest, however, this correction only had a small effect on our results. Other potential confounds that we looked at were sample relatedness (the first 40 principal from components genetic ancestry analysis), genotyping array, and the assessment center where neuropsychological testing was performed. As with head motion, adjusting for these variables did not affect our results.

From our analysis we see that PAD associates with worse performance on neuropsychological tests, specifically poor performance on DSST, TMT, and the reaction time tests (Table 3). Interestingly, both the DSST and the reaction time test are designed to measure cognitive processing speed. The TMT is designed to asses visual attention. However, psychomotor speed is a factor in successful TMT performance53. Furthermore, a decline in processing speed along with impairment of reasoning, memory, and executive function are well documented to occur in age-associated cognitive decline54. As such, these results are in line with other studies that link high PAD to cognitive impairment5,8,13,14. We note, that the association between PAD and TMT is consistent with the previous finding of Cole et al.8. However, the large dataset used here gives more conclusive results. Supporting this, we additionally find that schizophrenia, a brain disorder characterized by complex patterns of cognitive impairment, correlates with positive PAD (greater brain ageing than chronological age) and (Supplementary Table 3).

In conclusion, we have presented a new method for predicting brain age using cutting-edge machine learning techniques. Our deep learning method produces a single measure (PAD) from raw MRI data that captures complex underlying correlated changes in MRI and can be used to study various traits and diseases, and in particular for genetic discovery. Using such a method represents one potential way for overcoming challenges with high dimensional data and multiple testing that plagues MRI research. Applying our method to large genomic datasets such as the UK Biobank has enabled us to identify novel associations between sequence variants and brain ageing. The variants identified are common SNPs with small effects on PAD (Table 4) accounting for only a fraction of the trait variance. However, these first findings provide a foothold, and further research into these loci as well as additional GWAS studies have potential to shed light on the biological underpinnings of the ageing brain and its connection to various diseases and disorders.

Methods

Datasets

The proposed method was evaluated on T1-weighted MR images from three independent datasets: an Icelandic dataset, the UK Biobank dataset, and the IXI dataset. DeCODE genetics provided the Icelandic MR data, consisting of scans from 1264 healthy subjects aged between 18 and 75 years. This dataset includes 1815 scans in total, since some subjects have several scans. The Icelandic data were acquired using two different scanners, a 1.5T Phillips Achieva scanner, and a 1.5T Siemens Magnetom Aera scanner. Scans were imaged using a T1-weighted gradient echo sequence (Philips Achieva: repetition time (TR) = 8.6 ms, echo time (TE) = 4.0 ms, flip angle (FA) = 8°, 170 slices, slice thickness = 1.2 mm, acquisition matrix = 192 \(\times\)192, FOV = 240 \(\times\)240 mm; Siemens Aera: repetition time (TR) = 2400 ms, echo time (TE) = 3.54 ms, flip angle (FA) = 8°, 160 slices, slice thickness = 1.2 mm, acquisition matrix = 192 \(\times\) 192, FOV = 240 \(\times\) 240 mm). Any serious neurological disorders were prescreened and removed. In addition, we removed from the training and holdout sets subjects diagnosed with neurodevelopmental and mental disorders such as autism, bipolar disorder, intellectual disability, or schizophrenia, and subjects with any copy number variations previously associated with neurodevelopmental or psychiatric disorders.

The UK Biobank dataset consists of T1-weighted MR images of 15040 healthy subjects aged between 46 and 79 years old. The data were all collected using a 3T Siemens Skyra scanner. It is well-known that the presence of undetected population structure can lead to both false positive results and failure to detect genuine associations in genetic association studies55, in an effort to combat this our analysis was constrained to 12378 individuals of white British ancestry. An additional release of MRI images by UK Biobank was added to a replication set. This set contains 6888 subjects (thereof 4456 subjects of white British ancestry) aged between 47 and 80 years old. The images in this set were collected using the same protocol as the previous UK Biobank set.

The IXI dataset consists of T1-weighted MR images of 544 healthy subjects and is freely available online. The subjects age at imaging was between 20 and 86 years old. The IXI data were collected from three different sites. The Hammersmith Hospital using a Philips 3T system, Guy’s Hospital using a Philips 1.5T system and the Institute of Psychiatry using a GE 1.5T system. Histograms of the age distribution of the three datasets mentioned are shown in Supplementary Figs. 13.

Preprocessing

Preprocessing was carried out using the computational anatomy toolbox (CAT12)56. First, the input data were inhomogeneity corrected. Then the skull and other non-brain elements were removed. Finally, the images were registered into the standard MNI space using the deformable registration algorithm DARTEL57. For further information, refer to the CAT12 manual58.

There are three types of images that the preprocessing step generates. The first is an MNI-registered image. Second, a Jacobian map which is a by-product of the deformable registration. Last, a gray matter and white matter soft segmented image. All of the image types mentioned above have voxel size 1.5 mm\({}^{3}\) and voxel resolution 121 × 145 × 121.

CNN architecture

The CNN architecture we developed is based on the residual architecture24 (Fig. 3). It was implemented using Keras with TensorFlow as backend59 and consists of five residual blocks, each followed by a max pooling layer of stride 2 × 2 × 2 and kernel size 3 × 3 × 3, and one fully connected block. The convolutional part of the CNN reduces the input image from size 121 × 145 × 121 to 128 feature maps of size 4 × 5 × 4. The fully connected part reduces these feature maps down to an age prediction.

Fig. 3
figure 3

A flowchart showing the components of the proposed CNN architecture. Residual (Res), fully connected (FC).

The residual block, displayed in Fig. 4, consists of a combination of layers which are repeated twice inside the residual blocks. This combination is composed of a 3D convolutional layer with stride 1 × 1 × 1 and kernel size 3 × 3 × 3, a batch re-normalization layer60, and an ELU activation function61. The defining element of the residual block is the skip connection which adds the signal feeding into the residual block to the output of a layer close to the end of the block. The number of feature maps in block number \(n\) was chosen by the rule \({2}^{n+2}\).

Fig. 4
figure 4

A flowchart showing the components of the proposed residual block. Batch re-normalization (BRN), convolutional layer (Conv).

The fully connected block, depicted in Fig. 5, is a multilayer perceptron (MLP)62 with one hidden layer. The input layer has \(128\times 4\times 5\times 4=10240\) neurons, the hidden layer (FC 1) has 256 neurons that use an ELU activation function, and the output layer has a single neuron. Following the hidden layer, a dropout63 layer with keep rate equal to 0.8 is employed. The output layer (FC 2) has no activation function which means that it performs a linear regression on the hidden layer features. To account for factors such as scanner type and sex that can affect the estimated brain age of an individual we include them as inputs in the linear regression by concatenating them with the hidden features of the MLP.

Fig. 5
figure 5

A flowchart showing the components of the proposed fully connected block. Fully connected layer one (FC1), concatenation layer (Concat), fully connected layer two (FC2).

The mean absolute error was used as the loss function and the CNN was optimized using Adam64 with parameters: learning rate \(=0.001\), decay = \(1{0}^{-6}\), \({\beta }_{1}=0.9\), \({\beta }_{2}=0.999\), and batch size \(=4\). The He initialization strategy65 was used to initialize the weights, and each trainable node in the CNN was regularized with \({l}_{2}\) weight decay66, with \(\lambda =5\times 1{0}^{-5}\). Early stopping67 with model checkpointing was used, i.e., if the validation error did not improve in 100 epochs the training was stopped and the weights with the lowest validation error selected. Furthermore, to reduce the risk of overfitting, data augmentation68 was used to generate new training instances by applying a coordinate transformation to a random subset of the training data, consisting of a combined 3D rotation and a 3D translation. The rotation angles were between −40° and 40° with equal probability, and the translation distance, for each direction, was selected between −10 and 10 voxels with equal probability.

Our CNN implementation uses about ~8 GB of memory and the training time using an Intel Xeon Gold 6130 Processor CPU with 32GB of RAM and an NVIDIA Tesla V100 16GB GPU was about 2 days.

SBM, VBM, and similarity matrix brain age prediction

The SBM features were generated using FreeSurfer’s recon-all algorithm69 and the VBM features were generated using the CAT12 toolbox (the specific names of the SBM and VBM features are listed in Supplementary Data 1). The similarity matrix was constructed by taking the inner product between the combined gray and white matter segmented images of each subject. The SBM and VBM features were adjusted for intracranial volume, sex and scanner type. The features were then zero centered and normalized to unit variance. The regression methods that were tested were, linear regression70, lasso71, ridge regression72, elastic net73, random forest regression74, support vector regression75, relevance vector regression28, and Gaussian process regression29. A grid search was used to find the tuning parameters corresponding to the lowest cross-validation error for the methods mentioned. The regression models were implemented using scikit-learn76, except relevance vector machines, which used scikit-rvm.

In addition, we tested combining predictions made by models trained on these three feature types. We decided to pick predictions for the method with the lowest CV MAE for each feature type. Specifically, these regression methods were GPR with a Matérn kernel for both the SBM and VBM, and ridge regression for the similarity matrix features. The methods were picked by CV MAE instead of test MAE to prevent data leakage. Since the SBM and VBM features were not available for every image in the Icelandic dataset, we first calculated the average brain age prediction for each subject, before inner joining the predictions by subject into a single data frame. This resulted in a combined data frame with 1246 rows and three columns containing brain age predictions for the three regression methods. Since training these regression models is faster than training the CNNs, we were able to combine them using 10-fold cross-validation predictions. Thus, the linear regression blender can train on predictions from the whole training set, instead of being limited to the 298 images in the validation set, as is the case for the CNN prediction combination.

Statistical methods

To assess the accuracy of the machine learning methods we performed simple training and validation splits, and selected a suitable model by evaluating the validation MAE. The subjects from the Icelandic sample were split between these three sets, and if a subject had multiple images, the images were all put in the same set. The data were divided into 64% training set (\({N}_{s}=809\), \({N}_{i}=1171\)), 16% validation set (\({N}_{s}=202\), \({N}_{i}=298\)), and 20% test set (\({N}_{s}=253\), \({N}_{i}=346\)), were \({N}_{s}\) is the number of subjects and \({N}_{i}\) is the number of images. When evaluating the machine learning models the MAE and \({R}^{2}\) score for the images in the validation and test set is calculated.

To assess the transfer learning performance, the IXI dataset was split into 80% training set (\(N=440\)), 20% validation set (\(N=104\)) and the whole UK Biobank dataset was used as a test set (\(N=12395\)). As before, we evaluate accuracy by calculating the MAE and \({R}^{2}\) score on the validation and test set.

In order to test the reliability of PAD, the intraclass correlation was calculated with ICCbare from the ICC R package. The 95% confidence interval was estimated using bootstrapping with 2000 sampling iterations.

The Pearson correlation coefficient was calculated in order to test for association between PAD and performance on neuropsychological tests. Before performing the association test we first removed individuals of non-white British ancestry and subjects from related pairs. We then adjusted the PAD for age at imaging visit, age\({}^{2}\), sex, age \(\times\) sex, age\({}^{2}\)\(\times\) sex, total intracranial volume, the first 40 principal components from genetic ancestry analysis, head motion, genotyping array, imaging center, and assessment center where neuropsychological tests were conducted. The adjustments was performed using linear regression. Adjustment for variables such as genotyping array are probably not necessary for testing for association between PAD and performance on neuropsychological tests. However, we included them to keep the adjusted PAD similar to the one we perform the GWAS on. Nine correlation tests were performed, so a Bonferroni adjusted significance level \({\alpha }_{B2}=0.05/9\approx 0.00556\) was used. We estimated the 95% confidence interval using bootstrapping with 2000 sampling iterations.

We performed a GWAS on PAD using BOLT-LMM31 to find associated sequence variants. For the genetic analysis we used version 3 of the imputed genetic dataset released by UK Biobank in July 201777. The UK Biobank genetic data were assayed using two very similar genotyping arrays (95% of marker content is shared). Roughly 10% of the subjects were genotyped using applied Biosystems UK BiLEVE Axiom Array by Affymetrix and the rest using the closely related Applied Biosystems UK Biobank Axiom Array77. Variants with imputation quality score below 0.3, and minor allele frequency below 0.1% were filtered out, which left ~20 million variants to be considered for GWAS. Before performing GWAS, we removed individuals of non-white British ancestry and subjects from related pairs. We then adjusted the PAD for age at imaging visit, age\({}^{2}\), sex, age \(\times\) sex, age\({}^{2}\)\(\times\) sex, total intracranial volume, the first 40 principal components from genetic ancestry analysis, head motion, genotyping array, and imaging center using linear regression. The adjusted PAD was then normalized with an inverse normal transformation. After normalization the linear regression adjustments were reapplied. Sequence variants associated with PAD are only reported if they reach genome-wide significance. If two genome-wide significant variants are in LD (\({r}^{2}\,> \, 0.1\)) we report the variant with the lower P-value.

In addition, we tested for association between PAD and sequence variants known to associate with structural brain phenotypes. These variants were found by performing GWAS separately on 305 SBM phenotypes generated with recon-all by using the Freesurfer 6.0 software69 and 540 VBM phenotypes generated by using CAT1256. All genome-wide significant markers were then aggregated into a single list. In cases where variants were in LD (\({r}^{2}\,> \, 0.5\)), only the variant with the lower P-value was selected. The final list included 331 variants, to account for testing test variants for the second time a Bonferroni adjusted significance level \({\alpha }_{B3}=\frac{0.05}{2\cdot 331}\approx 7.5\,\times 1{0}^{-5}\) was used for the PAD association test.

To reduce the risk of false positive sequence variant associations we additionally checked for association in a replication set of 4456 subjects. To pass this test the association between the variants under consideration and PAD need to show evidence of statistical significance (\({\alpha }_{R}\,<\, 0.05\)).

Heritability analysis

To estimate SNP-heritability (\({h}_{{\mathrm{snp}}}^{2}\)) we ran LD score regression33 on the PAD GWAS summary statistics. We used the ldsc command line tool and followed standard procedure when running it. To train the LD score regression model, we used precomputed European 1000 Genomes phase 3 LD Scores, and filtered out rare variants with MAF < 0.01 and imputation quality score < 0.9. The slope of the trained regression model times the number of SNPs and divided by the sample size is an estimate of \({h}_{{\mathrm{snp}}}^{2}\)33. In addition, the intercept of the model minus one is a measure of confounding bias in the test statistics due to confounding effect, such as cryptic relatedness and population stratification33.

eQTL analysis

To investigate if any of the variants are expression quantitative trait loci (eQTLs) we used the GTEx database (GTEx Analysis Release V7 [dbGaP Accession phs000424.v7.p2])78. Our eQTL analysis was carried out by logging onto https://gtexportal.org, typing in the corresponding rs number of identified variants, and checking if they have any associated eQTLs. However, identifying whether a variant is truly causal in both GWAS and eQTL is challenging because of the uncertainty caused by LD79. Therefore, we only report variants as eQTLs of genes if they are close to being the most significant eQTL of that specific gene.

Ethical regulations

The Icelandic participants in this study were recruited by deCODE genetics to study the cognitive and neurological effects of rare variants previously associated with schizophrenia and autism spectrum disorder. The UK Biobank oversaw the recruitment of subjects of British nationality. Approval for the aforementioned schizophrenia study was obtained from the National Bioethics Committee of Iceland and the Icelandic Data Protection Authority. Written informed consent was obtained from all participants or their guardians before blood samples or phenotypic data were obtained. All sample identifiers were encrypted in accordance with the regulations of the Icelandic Data Protection Authority. Information about ethics oversight in the UK Biobank can be found at https://www.ukbiobank.ac.uk/ethics/.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.