Predicting EQ-5D Index Scores from the PROMIS-29 Profile for the United Kingdom, France, and Germany

Background: EQ-5D health utility (HU) scores are commonly used in health economics to compute quality-adjusted life years (QALYs). EQ-5D scores, which are country-specific, can be derived directly or by mapping from self-reported health-related quality of life (HRQoL) scales such as the PROMIS-29 profile. The PROMIS-29 from the Patient-Reported Outcomes Measurement Information System is a comprehensive assessment of self-reported health with excellent psychometric properties. We sought to find optimal models for predicting EQ-5D scores from the PROMIS-29 in the United Kingdom, France, and Germany and compared their prediction performance with that of a US model. Methods: We collected EQ-5D-5L and PROMIS-29 profiles from three samples representative of the general populations in the UK (n=1,509), France (n=1,501), and Germany (n=1,502). We used stepwise regression with backward selection to find the best models to predict the EQ-5D score from all seven PROMIS-29 domains. We investigated the agreement between the observed and predicted EQ-5D scores in all three countries using various indices of prediction performance, including Bland-Altman plots to examine the performance along the HU continuum. Results: The EQ-5D index scores were best predicted in Germany (RMSE GER = 0.10, MAE GER = 0.06), followed by France (RMSE FR = 0.11, MAE FR = 0.08) and the UK (RMSE UK = 0.12, MAE UK = 0.09). The Bland-Altman plots show that the inclusion of higher-order effects reduced the underprediction of low HU scores. Conclusions: Our models provide a valid method to predict EQ-5D-5L index scores from the PROMIS-29 for the UK, France, and Germany.

Introduction

The concepts of quality-adjusted life years, health utility, and the EuroQoL
Quality-adjusted life years (QALYs) are routinely used in cost-utility analyses to evaluate the economic effectiveness of health care innovations or interventions (1). QALYs are of particular importance in health technology assessments (HTAs) (2). Both the National Institute for Health and Care Excellence (NICE) in England and Wales and the US Panel on Cost-Effectiveness in Health and Medicine have endorsed QALYs to compare health care interventions from an economic perspective (1). In light of budget constraints in publicly funded health care systems, QALYs serve as a benchmark for the allocation of scarce resources in a way that maximizes utility to individuals and to society (2).
A QALY is defined as the product of the number of life years and a health utility (HU) score that represents the value of a particular health state. HU scores cannot exceed 1 (full health). A value of 0 corresponds to being dead, and health states with a negative value are considered worse than dead. Individual HU scores are patient-reported, preference-based ratings of health-related quality of life (HRQoL) (3). The most frequently used HRQoL measure, the EuroQoL EQ-5D-5L (EQ-5D), covers the following five domains: mobility, self-care, usual activities, pain/discomfort, and anxiety/depression (4-7). Each of these domains is rated on a five-point scale, thus differentiating 3125 (i.e., 5^5) health states. In valuation surveys in the general population of 10 different countries, these health states were ranked and linked to a single EQ-5D index value, expressing country-specific valuations of HRQoL (8-12).
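The two quantities defined above can be illustrated with a minimal Python sketch. The function name `qalys` and the example numbers are assumptions made for illustration only; they do not come from the study.

```python
from itertools import product

def qalys(life_years: float, health_utility: float) -> float:
    """QALYs as the product of life years and a health utility (HU) score."""
    if health_utility > 1.0:
        raise ValueError("HU cannot exceed 1 (full health)")
    return life_years * health_utility

# Each EQ-5D-5L health state is a five-digit profile with one level (1..5)
# per domain, e.g. "11111" (no problems in any domain).
states = ["".join(map(str, levels)) for levels in product(range(1, 6), repeat=5)]

print(len(states))      # 3125
print(qalys(10, 0.75))  # 7.5
```

Negative HU values (states considered worse than dead) are deliberately allowed by the check, since only the upper bound of 1 is fixed by definition.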

Indirect derivation of individual EQ-5D index values by mapping and the Patient-Reported Outcomes Measurement Information System (PROMIS)
EQ-5D index values for individuals are best obtained directly using the EQ-5D-3L or EQ-5D-5L questionnaire. If direct assessment is not feasible, a common strategy is to estimate HU scores by using a "mapping" or "crosswalk" algorithm from a non-preference-based patient-reported outcome measure (PROM) (13,14). Little consensus exists on which linking method is the most appropriate choice. In a recent systematic review, 147 studies mapping the EQ-5D index values were identified (13). In more than 75% of all mappings in this review, ordinary least squares (OLS) linear regression was used. Although OLS linear regression showed robust results compared to alternative methods, it has several drawbacks (15,16). First, predicted HU scores may fall outside the possible range of the metric (i.e., values greater than one). Second, the relationship between non-preference-based PROM and HU scores might be non-linear, meaning that the impact of symptoms and/or health domains differs across the HU continuum (16).
Developing a mapping algorithm to link the health domains of the Patient-Reported Outcomes Measurement Information System (PROMIS) to the EQ-5D index value is of special importance because PROMIS is increasingly used due to its favourable psychometric properties. It constitutes a collection of generic and condition-specific, non-preference-based PROMs that have been developed using item response theory (IRT) (17). For each PROM, so-called item banks have been developed comprising items that are highly informative regarding the construct to be measured and that do not function substantially differently across the most prominent demographic groups (e.g., women and men) (18,19). These item banks can be used to develop tailored short forms or for computerized adaptive testing (CAT) (20). PROMIS overcomes significant limitations of legacy instruments such as ceiling effects and is becoming the reference measurement approach to PROMs (21). Due to the invariance property of PROMIS-29 domains, for each health domain, PROMIS scores are obtained on the same metric, regardless of which item sets have been utilized (18,19).
This property holds even if a measure has been used that is only linked to one of the PROMIS metrics: respondents' item answers can still be placed on the metric of that PROMIS health domain. For example, self-reported anxiety measured by MASQ, PANAS and GAD-7 is linked to the PROMIS Anxiety metric (22). Depressive symptoms measured by BDI-2, CES-D, and PHQ-9 can be expressed on the PROMIS Depression metric (23). Therefore, mapping from PROMIS T-scores to EQ-5D creates the potential to link a broad range of PROMs to HU expressed by the EQ-5D.
Using OLS linear regression on data collected in the US, Revicki (2009) estimated a model to predict EQ-5D index scores from five PROMIS T-scores in the US (24). For this PROMIS domain model, Revicki reports that approximately 57% (adjusted R^2) of the variance in EQ-5D index scores can be explained by the variables in the model, and the intraclass correlation coefficient (ICC) is 0.73. Furthermore, 95% of all the residuals are between -0.20 (2.5%) and 0.15 (97.5%). The relatively small width of these so-called empirical limits of agreement (LoA) is indicative of an appropriately fitted model. However, Revicki also reported that this equation does not work very well for low levels of health (EQ-5D < 0.40). Revicki used the EQ-5D-3L questionnaire and applied the US EQ-5D-3L value set by Shaw (2015) (25). EQ-5D index values and mappings are country-specific (8,26). Revicki's model can therefore only be used to predict EQ-5D scores in the US.

Aims of this study and research questions
As EQ-5D scores are known to be country-specific, the primary aim of this study was to develop mapping functions to link PROMIS-29 to EQ-5D index values for the UK, France, and Germany. For each health domain, we explored the form of its relationship with HU expressed by the EQ-5D and examined whether these relationships would be the same across the three countries under investigation. Furthermore, we investigated whether the optimal models would be structurally equivalent across countries and compared the prediction performance of the final models to the prediction performance of Revicki's model.

Samples
Data were collected online by an independent polling company (Ipsos) in April and May 2015. Quota sampling was employed to obtain samples representative of the general population with respect to the marginal distributions of sex, age, occupation, region, and population density of the UK (n=1,509), France (n=1,501), and Germany (n=1,502). Sample weights were calculated using the random iterative method (RIM) to match the latest data available in each country (census 2011 for the UK and Germany, census 2012 for France).
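RIM weighting is commonly implemented as iterative proportional fitting (raking). The following Python sketch shows that idea under this assumption; the function `rake`, the two margins, and the target shares are hypothetical, and the respondents are simulated rather than taken from the survey.

```python
import numpy as np

def rake(categories, targets, n_iter=100, tol=1e-10):
    """Iteratively rescale weights until weighted margins match target shares.

    categories: dict name -> integer-coded array, one entry per respondent
    targets:    dict name -> target population shares (summing to 1)
    Assumes every category occurs at least once in the sample.
    """
    n = len(next(iter(categories.values())))
    w = np.ones(n)
    for _ in range(n_iter):
        max_shift = 0.0
        for name, codes in categories.items():
            target = np.asarray(targets[name], dtype=float)
            current = np.bincount(codes, weights=w, minlength=len(target))
            factor = target / (current / current.sum())
            w = w * factor[codes]          # adjust this margin
            max_shift = max(max_shift, float(np.abs(factor - 1).max()))
        if max_shift < tol:                # all margins already matched
            break
    return w * n / w.sum()                 # normalise to a mean weight of 1

# Simulated respondents with two coded margins (illustrative only).
rng = np.random.default_rng(0)
sex = rng.integers(0, 2, size=1000)
age_group = rng.integers(0, 3, size=1000)
target_shares = {"sex": [0.49, 0.51], "age_group": [0.30, 0.45, 0.25]}
weights = rake({"sex": sex, "age_group": age_group}, target_shares)

print(np.bincount(sex, weights=weights) / weights.sum())
print(np.bincount(age_group, weights=weights) / weights.sum())
```

Only the marginal distributions are matched; the joint distribution of the weighting variables is left as observed in the sample.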
We only briefly summarize the most important differences between the three samples here. The interested reader is referred to Table A.1 (Appendix) for a comprehensive overview of the marginal distributions of sex, age, educational level, occupational status, and income in the three samples. Participants in the German sample (mean age = 50.0 years) were slightly older than participants in the French (48.4 years) and UK samples (47.8 years). Participants in the German sample were more likely to have a low educational background (23.4%) than participants in the French (7.6%) and UK samples (8.1%). Participants in the French sample were more likely to be unemployed/inactive (48.4%) than participants in the German (41.5%) and UK samples (39.4%).
As participants could only proceed through the survey by answering each item, there were no missing data.

PROMIS domains and item banks
We used the PROMIS-29 v2.0 Profile to assess seven core domains of health: physical function, fatigue, pain, anxiety, depression, sleep disturbance, and the ability to participate in social roles and activities (referred to as participation in the remainder of this article) (27). The visual analogue scale (VAS) item expressing pain intensity on a scale ranging from 0 to 10 was not used in this study. Each domain is assessed with four items, and the domain scores are expressed as T-scores (M = 50, SD = 10) with the US general population as a reference. Note that due to the invariance property of IRT, T-scores obtained from the PROMIS-29 are on the same metric as the scores Revicki used in his analysis, though these scores were generated using different items. For desirable constructs (e.g., physical function), higher T-scores indicate better health, whereas for undesirable domains (e.g., depression), higher T-scores indicate poorer health states.
The psychometric properties of the PROMIS-29 profile, including evidence of construct and criterion validity, have been reported elsewhere (28-31). An earlier analysis of the data used in this study revealed that scores on the seven health domains of the PROMIS-29 are measurement invariant across the UK, France, and Germany except for one item (32). Hence, the predictor scores of self-reported health that we used in this study are invariant with respect to nationality.

EQ-5D-5L
The EQ-5D-5L is a standardized, patient-reported, and preference-based instrument to measure generic health (3-8). Five health dimensions are involved: mobility, self-care, usual activities, pain/discomfort, and anxiety/depression. Each dimension of the EQ-5D-5L has five levels (i.e., response options): "No problems" (1), "Slight problems" (2), "Moderate problems" (3), "Severe problems" (4), and "Extreme problems" (5). These define 5^5 or 3125 different health states. The value assigned to each of these health states is determined by so-called value sets, developed by EuroQoL using time trade-off (TTO) and visual analogue scale (VAS) as preference elicitation methods (4,8). The maximum value for a health state is 1.00 or "full health". The minimum value depends on the value set applied and can be negative, in which case the state is considered "worse than dead". For example, a pattern of 11111 is translated to a health state value of 1, while the pattern 54545 may correspond to -0.2. Note that persons in different countries value health states differently, so the EQ-5D index value is country-specific (8,9,11,12,25).
EQ-5D index values can be derived from the EQ-5D-5L using either the crosswalk to the 3L value set or the new 5L value sets (8). Crosswalks to the 3L value sets are available for ten countries, including the US, the UK, France, and Germany (4,8). A 5L value set is available for Germany (12). There is also one for England, which is not equivalent to the UK, and none yet for France (9,10). We therefore used the 3L crosswalk set for all three samples, thereby ensuring comparability among our samples and with Revicki's model, which used the 3L value set for the US (8,24,25).

Statistical analysis

Relationships among individual health domains and health utility across the UK, France, and Germany
To obtain a first impression of the form of the relationships among individual health domains and HU and to judge whether the relationships are stable across the three countries under investigation, we plotted the seven domain scores against health utility in the UK, France, and Germany.

Optimal models for predicting health utility in the three countries
We applied stepwise regression with backward selection to find the best models to predict the EQ-5D score for the UK, France, and Germany, starting with full models that incorporated linear, quadratic, and cubic effects for the same seven PROMIS domains as Revicki. Because sociodemographic factors such as age and sex are known to be useful in predicting HU, they were also entered as possible predictors (13).
The Bayesian information criterion (BIC) was used to steer the inclusion and exclusion of predictors in the stepwise regression analyses (33). To minimize the risk of significance by chance, for each model estimated, we used 10-fold cross-validation (34). With this in-sample cross-validation technique, the initial dataset is randomly split into 10 subsamples of approximately equal size. One of these subsamples is kept for validation, while the other nine subsamples are used for parameter estimation. This process is repeated ten times, and the results are averaged across repetitions.
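The procedure described above can be sketched in Python as follows. This is a simplified illustration, not the code used in the study (the analyses were run in R, SPSS, and Excel): the data are synthetic, only two domains plus age are simulated, predictors are centred and scaled for numerical convenience, and cross-validation is applied once to the selected model. All function and variable names are assumptions.

```python
import numpy as np

def design(X):
    """Prepend an intercept column."""
    return np.column_stack([np.ones(len(X)), X])

def fit_ols(X, y):
    """Ordinary least squares; returns coefficients (intercept first)."""
    beta, *_ = np.linalg.lstsq(design(X), y, rcond=None)
    return beta

def bic(X, y):
    """Gaussian BIC up to an additive constant: n*log(RSS/n) + k*log(n)."""
    n = len(y)
    rss = np.sum((y - design(X) @ fit_ols(X, y)) ** 2)
    return n * np.log(rss / n) + (X.shape[1] + 1) * np.log(n)

def backward_select(X, y):
    """Drop one column at a time for as long as doing so lowers the BIC."""
    keep = list(range(X.shape[1]))
    best = bic(X[:, keep], y)
    while len(keep) > 1:
        score, drop = min(
            (bic(X[:, [j for j in keep if j != d]], y), d) for d in keep
        )
        if score >= best:
            break
        best = score
        keep.remove(drop)
    return keep

def cv_rmse(X, y, n_folds=10, seed=0):
    """Mean out-of-fold RMSE over random folds of roughly equal size."""
    idx = np.random.default_rng(seed).permutation(len(y))
    scores = []
    for test in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, test)
        beta = fit_ols(X[train], y[train])
        resid = y[test] - design(X[test]) @ beta
        scores.append(np.sqrt(np.mean(resid ** 2)))
    return float(np.mean(scores))

# Synthetic toy data (NOT the study data): two T-score domains plus age.
rng = np.random.default_rng(1)
n = 1500
pf = (rng.normal(50, 10, n) - 50) / 10    # physical function, scaled
dep = (rng.normal(50, 10, n) - 50) / 10   # depression, scaled
age = (rng.normal(48, 15, n) - 48) / 15   # age, scaled; no true effect here
y = 0.8 + 0.08 * pf - 0.03 * pf**2 - 0.05 * dep + rng.normal(0, 0.05, n)

names = ["pf", "pf^2", "pf^3", "dep", "dep^2", "dep^3", "age"]
X = np.column_stack([pf, pf**2, pf**3, dep, dep**2, dep**3, age])

keep = backward_select(X, y)
selected = [names[j] for j in keep]
rmse = cv_rmse(X[:, keep], y)
print(selected, round(rmse, 3))
```

Because the outcome is simulated with a noise standard deviation of 0.05, the cross-validated RMSE should land close to that value; the numbers say nothing about real EQ-5D data.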
The root mean square error (RMSE) and the mean absolute error (MAE) were used as measures of the prediction precision. Note that we deliberately chose to use different criteria than those used by Revicki because measures of precision and bias, such as the RMSE and the MAE, are preferred over either R^2-based or information-based (AIC and BIC) criteria (35). In addition, we determined the width between the 95% empirical limits of agreement and compared them to the 95% theoretical limits of agreement (i.e., ± 1.96 * SD(residuals)). To check the prediction performance along the HU continuum, especially for low levels of HU, Bland-Altman plots were used. We used R version 3.4.1, IBM SPSS Statistics version 23, and Microsoft Excel version 15 to run the analyses.
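A minimal Python sketch of these performance indices is given below. The function name `agreement_summary` and the four example values are hypothetical. The theoretical limits are centred on the mean residual, which is an assumption that coincides with ± 1.96 * SD(residuals) whenever the mean residual is zero.

```python
import numpy as np

def agreement_summary(observed, predicted):
    """Prediction-performance indices for observed vs. predicted HU scores."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    resid = observed - predicted
    lower, upper = np.percentile(resid, [2.5, 97.5])   # empirical 95% LoA
    half_width = 1.96 * resid.std(ddof=1)              # theoretical 95% LoA
    bias = float(resid.mean())
    return {
        "rmse": float(np.sqrt(np.mean(resid ** 2))),
        "mae": float(np.mean(np.abs(resid))),
        "bias": bias,
        "empirical_loa": (float(lower), float(upper)),
        "theoretical_loa": (bias - half_width, bias + half_width),
        # Bland-Altman coordinates: mean of both scores vs. their difference
        "ba_mean": (observed + predicted) / 2,
        "ba_diff": resid,
    }

# Made-up example values, for illustration only.
summary = agreement_summary([1.0, 0.8, 0.6, 0.4], [0.9, 0.8, 0.7, 0.6])
print(round(summary["rmse"], 4), round(summary["mae"], 4), round(summary["bias"], 4))
```

Plotting `ba_diff` against `ba_mean` gives the Bland-Altman view used to inspect performance along the HU continuum.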

Impact of misspecified mapping functions on the prediction performance
To the best of our knowledge, as of February 2020, the mapping function reported by Revicki was the only one available for predicting EQ-5D scores from the PROMIS-29 (24). Hence, we were interested in quantifying the detrimental effect of applying this foreign mapping function to the data collected in Europe. Note that application of Revicki's model to the data collected in the UK, France and Germany (i) disregards the country specificity of the EQ-5D, (ii) does not utilize the potential predictive value of the PROMIS-29 health domains not used by Revicki, (iii) does not take higher-order effects into account, and in combination with the foregoing, (iv) disregards country dependency of the form of relationships (i.e., the specific values of the regression coefficients used).
Because we were also interested in which factor is mainly responsible for the differences in prediction performance, we moved stepwise from Revicki's model to our models as follows: First, we used the five health domains of Revicki's model, but with regression coefficients optimized towards the data collected in each country separately. Second, we investigated the incremental value of adding either sleep disturbance, participation, or both to the prediction equation. Third, we allowed for incorporation of quadratic and/or cubic effects (M3).

The relationships among the seven PROMIS domains and HU expressed by the EQ-5D score in the three European countries are displayed in Figure 1.

Optimal models for predicting health utility in the three countries
Recall that we used stepwise regression with backward selection to find optimal models for predicting the EQ-5D scores for the UK, France, and Germany. The primary models thus comprised linear, quadratic, and cubic effects for each PROMIS domain plus effects for age and sex. Effects that did not significantly improve the prediction performance were sequentially removed from these models. The final models to optimally estimate the EQ-5D score by the PROMIS-29 for the UK, France, and Germany can be found in Table 1 below (the standardized coefficients are in parentheses). All the models were confirmed by 10-fold cross-validation. The unstandardized coefficients displayed in Table 1 can be used to compute EQ-5D scores from the PROMIS T-scores. However, interpretation of the regression coefficients needs to take into account two specifics of polynomial regression models.
First, the regression coefficients of the higher-order effects (quadratic and cubic effects) appear to be much smaller than those for the linear effects because the values of the predictor variables (with mean = 50) are raised to the power of two for the quadratic effects (M^2 = 2,500) and to the power of three for the cubic effects (M^3 = 125,000). Hence, these coefficients have a substantially larger impact on the scale of the criterion than their magnitude suggests.
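The scale argument can be made concrete with a short Python sketch. The three coefficients are hypothetical values chosen purely for illustration; they are not taken from Table 1.

```python
T = 50.0  # a PROMIS T-score at the reference-population mean

# Hypothetical coefficients (illustration only, not from Table 1).
b_linear, b_quadratic, b_cubic = 1e-2, 1e-4, 1e-6

contributions = {
    "linear": b_linear * T,            # coefficient times T
    "quadratic": b_quadratic * T ** 2, # coefficient times T squared
    "cubic": b_cubic * T ** 3,         # coefficient times T cubed
}

print(T ** 2, T ** 3)  # 2500.0 125000.0
for term, value in contributions.items():
    print(term, round(value, 3))
```

Although the cubic coefficient here is four orders of magnitude smaller than the linear one, its contribution to the predicted score is of the same order, which is why raw coefficient size is a poor guide to importance in polynomial models.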
Second, the single standardized regression coefficients shown in Table 1 should not be used to infer the form of the relationship between the individual health domains and the EQ-5D score because we have up to three effects (linear, quadratic, and cubic) in each health domain, and the relationship thus must be described by the sum of all three effects. Furthermore, not all the coefficients are in agreement with Figure 1. In Figure 1, we plotted the relationship of a single health domain with the EQ-5D score, irrespective of the values in all the other health domains. By contrast, the regression coefficients displayed in Table 1 are optimal given all the other effects already taken into account (stepwise procedure), which also explains why the final models in the three countries are so different. Age, for example, has a positive effect on HU in the UK, a negative effect on HU in France, and no effect on HU in Germany. Although out of the 23 possible predictors twelve (UK and France) and ten (Germany) were kept in the final models, only four effects were consistently chosen across countries: the linear effect of participation, the quadratic effect of physical functioning, and cubic effects of depression and pain interference.
The prediction performance of these models is summarized in Table 2 below. HU expressed by the EQ-5D score can be best mapped from the PROMIS-29 in Germany (RMSE GER = 0.10, MAE GER = 0.06), followed by France (RMSE FR = 0.11, MAE FR = 0.08) and the UK (RMSE UK = 0.12, MAE UK = 0.09). Furthermore, for all three countries, the widths of the empirical limits of agreement are always smaller than the widths of the theoretical limits of agreement.

The prediction performances of the final models along the HU continuum are depicted in the Bland-Altman plots below. Note that especially in the German sample, there are not many respondents with low health utility (EQ-5D < 0.2). Furthermore, prediction performance appears to be slightly better for high levels of HU (EQ-5D > 0.8) than for intermediate or low HU.

The differences in the prediction performances between the application of Revicki's model versus our models are depicted in Table 3 below. The application of Revicki's model to the data collected in Europe would systematically underestimate the HU score for the UK (-0.10) and for France (-0.09) but not for Germany. As was the case for our models, the prediction performance of Revicki's model is the best in Germany, and the differences in the prediction performances between Revicki's and our mapping function are smaller in Germany than for the UK or for France, as indicated by the values of the RMSE, MAE, and empirical LoAs. The last step in our analyses was to investigate which factor was mainly responsible for the observed

In this paper, we developed optimal models for mapping the EQ-5D index values from the PROMIS-29 for the UK, France, and Germany. In contrast to Revicki's model, which was optimized towards valuations of health states in the US, our models can be used to optimally predict HU expressed by the EQ-5D score for the UK, France, and Germany.
Furthermore, we showed that the incorporation of higher-order effects into the regression equations substantially reduced the underestimation of low health utilities. The EQ-5D index value can therefore now be predicted from the PROMIS-29 in three major European countries for use in economic evaluations of health interventions. Our results in terms of the RMSE and MAE are well within the limits of what is usually reported for mapping algorithms (13,36-40). The global underestimation of the predicted EQ-5D values in OLS has also been reported in dialysis patients (41).
We furthermore demonstrated that the application of a foreign model, in this case a US model, to European data will yield biased results, especially for poor health states; however, this model performs well in upper ranges of health. One might therefore consider using a foreign model with domestic data as a second-best option if a country-specific mapping algorithm is not available. This decision might make sense, for example, when using our German model for Austrian data in or using Revicki

Strengths and limitations
This study was conducted using three large samples representative of the general population in three European countries. To ensure comparability, the sampling strategies were the same across countries. This strength of our study is directly related to its foremost weakness: severe health states are not frequently observed in the general population, and the proposed models therefore rely on few observations for low health states. Furthermore, our models allowed judgement of the incremental value of incorporating two additional health domains and higher-order effects for HU prediction.
Finally, some authors have argued against OLS regression as a mapping method even though, as outlined above, it is the most widely used method. First, the method is subject to the phenomenon of regression to the mean. Second, linear regression models tend to predict HU scores greater than one, a value that is impossible by definition of HU (16). In our study, the risk of predicting HU values greater than one is circumvented by incorporation of non-linear trends.

Directions for future research and the PROMIS Preference Score (PROPr) for QALYs
Our mapping functions should be confirmed in samples with a greater frequency of low health states. Therefore, we are planning to replicate our findings with data collected from spine patients who were assessed before surgery. In this study, we will also investigate whether regressing the EQ-5D dimensions on the PROMIS domain scores first and then calculating the EQ-5D index values from the EQ-5D dimensions has incremental value (44).
The PROMIS Preference Score (PROPr) has been developed to compute HU for QALYs directly from 7 PROMIS health domains: cognition, depression, fatigue, pain, physical function, sleep disturbance, and ability to participate in social roles and activities (in this paper, this domain is referred to as participation) (45-49). Note that these 7 PROMIS domains are not entirely equivalent to the 7 domains of the PROMIS-29 profile (anxiety is missing in the PROPr, while cognition is missing in the PROMIS-29). The PROPr was valued on the basis of US preferences using the standard gamble (SG) method, while the EQ-5D uses TTO (10,25,47,50).
The PROPr could potentially be used instead of the EQ-5D index in cost-effectiveness analyses. Since many European HTA authorities such as NICE specifically demand the use of the well-established EQ-5D index value to measure HU in cost-effectiveness analyses, mapping the PROMIS-29 to the EQ-5D will still be needed (43).
Furthermore, although PROMIS is being used more frequently in the US, it has not yet attained a dominant role in HRQoL assessments in Europe. In addition, as of February 2020, there is no PROPr value set for European preferences (47,48).

Conclusion
Our mapping functions can be used to predict EQ-5D index values from the PROMIS-29 for cost-utility analyses in health technology assessments in the UK, France and Germany. The inclusion of polynomial regression terms decreases the prediction bias for lower health states.
Our results support the assertion that mapping functions are country-specific. The application of Revicki's model to the data collected in the three European countries leads to biased HU estimates for the UK and France and to less precise estimates in all three countries. Estimation of country-specific regression coefficients for the five health domains identified by Revicki strongly improves the average prediction performance but does not remedy the underprediction of low health states.

Declarations
Compliance with ethical standards
Funding: This study was funded by the Centre Virchow-Villerme.
Conflict of interest: The authors declare that they have no conflict of interest.
Ethical approval: All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.
Informed consent: Informed consent was obtained from all individual participants included in the study.

Figure 1 Relationships among the PROMIS domains and health utility expressed by the EQ-5D score
Bland-Altman plots of the predicted and observed health utility scores for the UK, France, and Germany