Comparing the mapping between EQ-5D-5L, EQ-5D-3L and the EORTC-QLQ-C30 in non-small cell lung cancer patients

Background Several mapping algorithms have been published with the EORTC-QLQ-C30 for estimating EQ-5D-3L utilities. However, none are available with the EQ-5D-5L. Moreover, a comparison between mapping algorithms in the same set of patients has not been performed for these two instruments simultaneously. In this prospective data set of 100 non-small cell lung cancer (NSCLC) patients, we investigate three mapping algorithms using the EQ-5D-3L and EQ-5D-5L and compare their performance. Methods A prospective non-interventional cohort of 100 NSCLC patients was followed up for 12 months. EQ-5D-3L, EQ-5D-5L and EORTC-QLQ-C30 were assessed monthly. The EQ-5D-5L was completed at least 1 week after the EQ-5D-3L. A random effects linear regression model, a beta-binomial (BB) model and a limited dependent variable mixture model (LDVMM) were used to determine mapping algorithms between the EQ-5D-3L, EQ-5D-5L and QLQ-C30. Simulation, cross-validation and other statistical measures were used to compare the performance of the algorithms. Results Mapping from the EQ-5D-5L was better: lower AIC, RMSE and MAE and higher R² were reported with the EQ-5D-5L than with the EQ-5D-3L, regardless of the functional form of the algorithm. The BB model proved the most useful for both instruments: for the EQ-5D-5L, the AIC was -485, R² 75 %, MAE 0.075 and RMSE 0.092; for the EQ-5D-3L these were -385, 69 %, 0.099 and 0.113 respectively. The mean observed vs. predicted utilities were 0.572 vs. 0.577 and 0.515 vs. 0.523 for the EQ-5D-5L and EQ-5D-3L respectively for the random effects model; for the BB model these were 0.572 vs. 0.575 and 0.515 vs. 0.518 respectively; and for the LDVMM, 0.532 vs. 0.515 and 0.569 vs. 0.572 respectively. Less over-prediction at poorer health states was observed with the EQ-5D-5L. Conclusions The BB mapping algorithm is confirmed to offer a better fit for both the EQ-5D-3L and EQ-5D-5L. The results confirm previous and more recent findings on the use of BB-type modelling approaches for mapping.
It is recommended that in studies where EQ-5D utilities have not been collected, an EQ-5D-5L mapping algorithm is used.


Background
Health Related Quality of Life (HRQoL) is an important outcome from both clinical and economic perspectives. For cancer patients, it can be considered as a measure of the trade-off between survival benefit, toxicity from treatments and the physical and emotional well-being of the patients [1]. HRQoL is also considered to be an important predictor of survival [2]. Furthermore, HRQoL is critical for understanding the economic value of (cancer) treatments, because some cancer treatments are not only expensive but also the clinical benefits are modest and the burden of adverse events is quite high. Therefore, the risk-benefit relationship of cancer treatments can be guided by HRQoL outcomes [3].
One feature of health economic evaluation is the use of generic HRQoL measures to determine patient-level health utilities for adjusting clinical outcomes to generate Quality Adjusted Life Years (QALYs) [4]. In some cases, utilities from commonly used generic HRQoL measures such as the EQ-5D-3L or EQ-5D-5L are not available. In such cases, reliance is placed on alternative approaches that estimate patient-level utilities through 'mapping' or 'cross-walking', where a statistical algorithm developed from a condition-specific measure (such as the cancer-specific EORTC-QLQ-C30) is used.
The advantages and limitations of mapping have been discussed in detail elsewhere (Brazier, 2010) [5,6]. Recently, Crott (2014), Arnold (2015) and Doble (2015) [7][8][9] examined the performance of the most common mapping algorithms applied to the QLQ-C30. Several limitations of some of the simpler mapping algorithms from the EQ-5D-3L were noted. These related to untenable assumptions of linearity, homoscedasticity, multimodality, skewness and censoring, an over-reliance on R² as the metric of model performance, and in some cases poor prediction, particularly at poorer health states [5,7,8,10,11]. Mapping algorithms based on the EQ-5D-3L have been shown to consistently over-predict utilities, particularly at poorer health states [5,6]. To address some of these limitations, alternative functional and statistical forms of mapping algorithms were examined (Kharroubi, 2007; Crott, 2010; Hernandez, 2012; Sabourin et al., 2015) [5,[10][11][12][13]. These functional forms in some cases generated improved predictive capability (e.g. Hernandez, 2012 [10]). In other cases, however, changing the functional form did not offer improved prediction over and above simpler models [5,6]. Moreover, when applied to external data, some of the algorithms performed poorly [7,8]. In addition to the statistical framework of mapping algorithms, questions have been raised about the usefulness, and indeed validity, of mapping (Round, 2012) [14]. It has been suggested that it is unclear exactly what is being predicted from mapping models, because the target is unknown (Round, 2012) [14]. However, this is precisely what a mapping model is supposed to do: estimate the unknown utilities, which we assume to be 'knowable' under reasonable assumptions. Although these and other criticisms of mapping are important [5,6,15], they are perhaps not strong enough to dismiss mapping altogether.
Nevertheless, about 25 % of health technology appraisal (HTA) submissions to NICE in the UK have used mapping (Longworth, 2013) [16], while in Australia this was reported to be about 24 % (Scuffham, 2008) [17]. Moreover, the published mapping models (for the QLQ-C30) suggest the unknown utilities are likely to be 'knowable' to some extent, because some mapping algorithms have been shown to yield close approximations of the target mean utility. Therefore, mapping can serve a useful purpose for estimating patient-level utilities and, despite these criticisms, continues to be used in HTAs of cancer drugs for estimating utilities (or for sensitivity analyses).
Separately, concerns have also been raised about the sensitivity of the EQ-5D-3L and, by extension, about the mapped utilities derived from it [18][19][20][21]. Most mapping algorithms using the EORTC-QLQ-C30 (QLQ-C30) are based on the EQ-5D-3L. Given the reported limitations of, and criticisms levelled against, the EQ-5D-3L, and the consequent development of the EQ-5D-5L, a mapping algorithm for the EQ-5D-5L appears to be the next logical step in this area of research.
There are two commonly used generic HRQoL measures for determining utilities in health economic evaluation: the EQ-5D-3L and the more recent EQ-5D-5L. The main difference between these two instruments is that the latter has responses measured on a 5-point scale, with many more health states [22]. The EQ-5D-3L has been suggested to have limited discriminative ability and less power to detect between-group differences compared with the EQ-5D-5L [22][23][24]. Research is ongoing as to the best value sets for use with the EQ-5D-5L. Meanwhile, an interim scoring is available for the EQ-5D-5L using a cross-walk algorithm from the EQ-5D-3L.
In this research we compare the performance of three mapping algorithms (from the QLQ-C30): a random effects linear model, a beta-binomial (BB) model and a limited dependent variable mixture model (LDVMM), applied separately to each of two utility measures, the EQ-5D-5L and the EQ-5D-3L. To our knowledge, no mapping study has compared algorithms from both instruments in the same set of patients, and none is available between the EQ-5D-5L and the QLQ-C30, particularly in a non-small cell lung cancer (NSCLC) population. Khan & Morris (2014), using data from a randomized controlled trial (RCT) [5], showed that a three-part BB model performed best among other commonly used algorithms. This analysis examines mapping models using data from NSCLC patients in a real-world NHS setting. This offers researchers a way of computing patient-level utilities from the EQ-5D-5L (and EQ-5D-3L) with greater generalizability than an RCT.

Study design
A single-cohort prospective (non-interventional) follow-up study of 100 NSCLC patients was designed. Patients with histologically confirmed NSCLC gave informed consent (for data collection and follow-up) and were followed during their routine anti-cancer treatment and cancer management for a period of at least 12 months. Patients were recruited between March 2014 and July 2015 from the Liverpool and Clatterbridge Cancer Centre. The study received local ethics approval (Liverpool Central) and the research was conducted in compliance with the Declaration of Helsinki.
EQ-5D-5L, EQ-5D-3L and QLQ-C30 assessments were carried out monthly from registration. EQ-5D-3L and EQ-5D-5L were assessed at least 1 week apart to avoid potential for 'carry over'. Patients were given the HRQoL forms to take home and they returned them by post or when they attended their next hospital visit. They were instructed to complete the EQ-5D-3L in the first week and the EQ-5D-5L in the second (or third) week of each month.

Instruments
The EQ-5D-3L is widely used for economic evaluation; it defines 243 health states, each with a corresponding utility value [5,6]. In this paper, we use the UK tariffs based on the Time Trade-Off (TTO) method [23]. The raw scores from the EQ-5D were converted into an index ranging from -0.549 to 1, where 1 denotes 'perfect' quality of life, 0 denotes death and values below 0 denote states 'worse than death'. The EQ-5D-5L consists of five questions identical to those of the EQ-5D-3L (mobility, self-care, usual activities, pain/discomfort, and anxiety/depression), but with an expanded 5-point response scale (compared with the 3-point scale of the EQ-5D-3L) [25]. The levels are 'no problems', 'slight problems', 'moderate problems' and 'severe problems' in all five dimensions, plus 'unable to' in mobility, self-care and usual activities or 'extreme problems' in pain/discomfort and anxiety/depression. In the absence of a full value set, the scoring of the EQ-5D-5L uses an interim cross-walk algorithm (UK value sets) between the EQ-5D-3L and EQ-5D-5L (van Hout, 2012) [22,26].
The EORTC QLQ-C30 is an established instrument for measuring HRQoL in various cancers [27].

Statistical methods
Three models were used to develop and compare the mapping algorithms.

Linear random effects model
The linear model with a random effect is an extension of the ordinary least squares (OLS) model. One important difference is that subject-level effects are included (this is sometimes called a mixed effects model). In the context of mapping, because utility scores are observed for each subject on more than one occasion, the responses are not independent. The subject-level differences (between-subject variability) can be modelled with a random effect. The model is termed a mixed effects model because variability of utilities occurs both between and within subjects. This model is relatively easy to use when applied to an external dataset to predict patient-level utilities. This is important because, in practice, a mapping algorithm should be usable as simply as possible.
Overly complicated models require more assumptions and hence introduce greater uncertainty. The principle of parsimony should be adopted when developing a mapping model. The model form in a general linear mixed model framework is:

Y = Xβ + Zu + ε

where Y is the vector of observed utilities, β is the vector of fixed-effect parameters (e.g. the 15 coefficients of the QLQ-C30 domain scores) with design matrix X, u is the vector of random (subject) effects with design matrix Z, and ε is the residual error term.

Limited dependent variable mixture model (LDVMM)
A second model, proposed by Hernandez et al. [10] and belonging to the class of limited dependent variable (LDV) models, is the so-called Adjusted Limited Dependent Variable Mixture Model (ALDVMM) [10]. This model has several noteworthy features. First, it assumes additivity of effects (as in a linear model). Second, it involves a latent variable that is censored. The censoring occurs (as similarly applied in a Tobit model) because some values are considered unobservable. Hernandez et al. [10] noted that since there is a gap in utilities between the values 0.833 and 1 for the EQ-5D-3L, preferences for health states are in effect 'cut off' at (or above) 0.833 and assigned a value of 1, essentially capturing the ceiling effect. That is, if a patient's (true) utility is >0.833, the instrument (EQ-5D) cannot capture this and a value of 1 is assumed.
LDV-type models generate predicted estimates in a more complex way, which involves finding the probability that the unobserved (latent) value is above or below the censoring threshold (e.g. 0.833) using the ratio of the probability density function (PDF) to the cumulative distribution function (CDF). This feature of LDV models allows the possibility of modelling several distributions simultaneously. Hernandez et al. [10] modelled data against the (simpler) health assessment questionnaire (HAQ) in an arthritis population. The greater the number of latent classes, the greater the complexity of interpretation. Applying three classes in the context of 15 QLQ-C30 domain parameters is likely to lead to a much more complex latent class structure, and therefore two classes (two mixed distributions) are used for both the EQ-5D-3L and EQ-5D-5L in this analysis. This is justified by the kernel density estimates, which suggest a bimodal distribution for the EQ-5D-3L (values from about -0.549 to 0.3 and from 0.3 to 1) in this data set (see Fig. 1). For the EQ-5D-5L, the mixture of distributions is not obvious, although there is marked skewness.
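The PDF-to-CDF ratio that underlies these censored predictions can be illustrated for a single threshold. The sketch below uses assumed values for the latent mean and SD (μ = 0.7, σ = 0.2) and the 0.833 threshold quoted above; it is a textbook truncated-Normal calculation, not the full ALDVMM prediction.

```python
import numpy as np
from scipy.stats import norm

mu, sigma, c = 0.7, 0.2, 0.833       # latent mean/SD (assumed values) and censoring point
z = (c - mu) / sigma

# Mean of the latent variable conditional on falling below the threshold:
# E[Y* | Y* <= c] = mu - sigma * pdf(z) / cdf(z)
below_mean = mu - sigma * norm.pdf(z) / norm.cdf(z)

# Observed utility is censored to 1 whenever the latent value exceeds c
predicted = norm.cdf(z) * below_mean + (1 - norm.cdf(z)) * 1.0

# Monte Carlo check of the analytic prediction
draws = np.random.default_rng(1).normal(mu, sigma, 200_000)
mc = np.where(draws > c, 1.0, draws).mean()
print(predicted, mc)                 # the two values should agree closely
```

The mixture models extend this idea by combining such component-wise expectations with estimated mixing probabilities.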
The model form for the mixture model used in this context is now described in further detail. Assume responses Y (i.e. EQ-5D utilities) whose distribution depends on an unobservable random variable S; S can occupy one of k states (k = 2 in this example), the number of which might be unknown but is at least known to be finite. Since S is not observable, it is referred to as a latent variable. Let π_j denote the probability that S takes on state j. For example, in the case of the EQ-5D-3L for the ALDVMM, j = 1 might refer to EQ-5D-3L values ≤ 0.833 and j = 2 to states where EQ-5D-3L utilities are > 0.833.
Conditional on S, the distribution of the response Y is assumed to be f_j(y; α_j, β_j | S = j). This expression means that, depending on the state S = j, a model of the form f_j(y; α_j, β_j) is used to determine the relationship between Y (the EQ-5D) and a set of predictors β (e.g. the 15 QLQ-C30 coefficients). For example, for j = 1 (EQ-5D-3L values between -0.549 and 0.3), the EQ-5D-3L utilities are assumed to follow a Normal distribution; for values between 0.3 and 1 (j = 2), the data can be considered to follow a Beta Binomial (BB) distribution. In another scenario, a Weibull distribution could be used for j = 1 and a Normal distribution for j = 2; there would then be six parameters to estimate (two for the Weibull, two for the Normal and two mixing probabilities, π_1 and π_2, giving the probability of observations belonging to one or the other class). These six parameters do not include the QLQ-C30 predictors, for which a further 16 parameters are estimated.
The following mixture models were simultaneously fitted: (i) EQ-5D as a function of the 15 QLQ-C30 domain scores (e.g. a Normal distribution assumed between -0.549 and 0.30); (ii) EQ-5D as a function of the 15 QLQ-C30 domain scores (e.g. a Beta Binomial distribution assumed between 0.30 and 1); and (iii) the mixing probabilities as a function of the 15 QLQ-C30 domain scores (two mixing probabilities which classify observations as belonging to the distributions in (i) or (ii)). Clearly, this modelling approach is complex, perhaps unnecessarily so, and can lead to non-convergence. Its practical implementation as an external algorithm is therefore an important consideration. A transformation may be needed if specific distributions are assumed, because some distributions cannot accommodate negative values: for example, for values between -0.549 and 0.30, a Gamma (or Beta Binomial) distribution would not be possible.
Therefore, in this analysis two distributions (Normal and Beta) are considered for modelling. The predicted estimates are determined from the ratio of the PDF to the CDF of the EQ-5D responses together with the estimated mixing probabilities. The mixing probabilities can be interpreted as the proportions of observations belonging to each of the two distributions: if the mixing probabilities were both 0.5, then 50 % of the EQ-5D-3L values might be considered to follow a Normal distribution and the remaining 50 % a different distribution. A useful exposition of finite mixture models can be found in Schlattman (2009) [28].
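As a minimal illustration of the mixture idea (not the full ALDVMM, which also regresses the component means and the mixing probabilities on the QLQ-C30 scores), a two-component Gaussian mixture can be fitted to a bimodal utility distribution with a short EM loop. The cluster locations and sizes below are simulated assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
# Simulated bimodal utilities: a 'poor health' and a 'good health' cluster
y = np.concatenate([rng.normal(0.1, 0.1, 300), rng.normal(0.75, 0.1, 700)])

# EM algorithm for a two-component Gaussian mixture
pi, mu, sd = np.array([0.5, 0.5]), np.array([0.0, 0.9]), np.array([0.2, 0.2])
for _ in range(200):
    # E-step: posterior probability of each component for each observation
    dens = pi * norm.pdf(y[:, None], mu, sd)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: update mixing probabilities, component means and SDs
    pi = resp.mean(axis=0)
    mu = (resp * y[:, None]).sum(axis=0) / resp.sum(axis=0)
    sd = np.sqrt((resp * (y[:, None] - mu) ** 2).sum(axis=0) / resp.sum(axis=0))

print(pi, mu)   # estimated mixing probabilities and component means
```

A predicted utility for a new observation would then combine the component means weighted by the mixing probabilities, which in the ALDVMM themselves depend on the covariates.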
Maximum likelihood estimation for continuous and discrete response distributions was used, based on a dual quasi-Newton optimisation algorithm in the SAS® software [29]. A global maximum was sought by searching for local maxima from initial starting values and then re-running the model using estimates generated from previous runs.

Beta binomial model
For the ALDVMM described previously, censoring occurs at 0.833 for the EQ-5D-3L. This is not the case for the EQ-5D-5L, where values between 0.833 and 1 do exist. For this reason (Fig. 1) the distribution of the EQ-5D-5L can be considered appropriate for modelling on a continuous-type scale between -0.549 and 1.0 (after a transformation (Y - a)/(b - a)), and the BB model is therefore the third model considered for mapping. The details of the BB model are elaborated in Khan & Morris (2014) [5], where it showed an improved fit compared with simpler linear and LDV-type models (e.g. Tobit and CLAD).
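The rescaling and a likelihood-based fit on the (0, 1) interval can be sketched as below. This is a simple beta regression with a logit mean link, intended only to illustrate the transform-and-fit idea; the published BB model in [5] has a different, beta-binomial specification, and the covariate and coefficient values here are assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit
from scipy.stats import beta as beta_dist

rng = np.random.default_rng(3)
n = 500
x = rng.uniform(0, 100, n)                    # a hypothetical QLQ-C30 domain score

# Work on the rescaled (0, 1) scale; observed utilities on [a, b] = [-0.549, 1]
# would first be transformed via (Y - a)/(b - a)
a, b = -0.549, 1.0
mu_true = expit(-1.0 + 0.03 * x)              # assumed true mean on the (0, 1) scale
phi = 25.0                                    # precision parameter
y01 = np.clip(rng.beta(mu_true * phi, (1 - mu_true) * phi), 1e-6, 1 - 1e-6)

def negloglik(params):
    b0, b1, logphi = params
    mu = np.clip(expit(b0 + b1 * x), 1e-6, 1 - 1e-6)
    p = np.exp(logphi)
    return -beta_dist.logpdf(y01, mu * p, (1 - mu) * p).sum()

fit = minimize(negloglik, x0=[0.0, 0.01, np.log(10.0)], method="Nelder-Mead",
               options={"maxiter": 5000, "maxfev": 5000})
b0_hat, b1_hat = fit.x[0], fit.x[1]
utility_pred = a + (b - a) * expit(b0_hat + b1_hat * x)   # back-transform to the utility scale
print(b0_hat, b1_hat)
```

The back-transform at the end guarantees predictions stay within the feasible utility range, which is one of the attractions of this family of models over unconstrained linear ones.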

Model performance criteria
Several model performance statistics were used, including the root mean square error (RMSE), a measure of model fit (lower values indicate better fit), the mean prediction error, R², the mean absolute error (MAE), and the percentage of predictions >1 or < -0.594. Chai (2014) argues that the RMSE is more appropriate than the MAE, particularly if the error distribution is Normal [30]. In addition, the Akaike Information Criterion (AIC) values and the percentage predicted within a target range (e.g. ±5 %, ±10 %) of the observed values were determined.
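For reference, the main fit statistics can be computed as follows. This is a straightforward sketch; the function names and the small example arrays are illustrative.

```python
import numpy as np

def rmse(obs, pred):
    """Root mean square error: lower values indicate better fit."""
    return np.sqrt(np.mean((np.asarray(obs) - np.asarray(pred)) ** 2))

def mae(obs, pred):
    """Mean absolute error."""
    return np.mean(np.abs(np.asarray(obs) - np.asarray(pred)))

def r2(obs, pred):
    """Proportion of variance in the observed utilities explained by the predictions."""
    obs, pred = np.asarray(obs), np.asarray(pred)
    ss_res = np.sum((obs - pred) ** 2)
    ss_tot = np.sum((obs - obs.mean()) ** 2)
    return 1 - ss_res / ss_tot

observed = np.array([0.20, 0.45, 0.60, 0.85])
predicted = np.array([0.25, 0.40, 0.65, 0.80])
print(rmse(observed, predicted), mae(observed, predicted), r2(observed, predicted))
```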

Simulation and cross validation
Multivariate simulation (1,000 simulations using Fleishman's method) [31,32] was used to test the uncertainty of the models. Fleishman's method uses higher-order moments (skewness and kurtosis) to generate correlated simulated data regardless of the distribution of each of the original variables. The steps involved require computing the mean, SD, skewness and kurtosis for each of the 15 observed QLQ-C30 domain scores. The Fleishman (1978) [31] power transform is:

Y = α + βZ + δZ² + γZ³

where Z is randomly generated data following a Normal distribution with mean 0 and variance 1. The values of α, β, δ and γ are estimated through a process of iteration from the observed measures of skewness and kurtosis, so that Y can be determined. The derived Y (e.g. the 15 QLQ-C30 scores) are simulated (correlated) responses which are not necessarily Normally distributed. Khan et al. [5] have shown that the QLQ-C30 scores are unlikely to follow a Normal distribution in most cases.
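Solving for the power-transform coefficients can be sketched as follows, using Fleishman's standard moment equations (written here with coefficients a, b, c, d and the usual constraint a = -c so the simulated variable has mean 0 and variance 1). The target skewness and excess kurtosis values are illustrative, not taken from the study data.

```python
import numpy as np
from scipy.optimize import fsolve

def fleishman_equations(coeffs, skew, ekurt):
    """Fleishman's moment conditions for Y = a + b*Z + c*Z^2 + d*Z^3 with a = -c,
    so that Y has mean 0, variance 1 and the target skewness / excess kurtosis."""
    b, c, d = coeffs
    var = b**2 + 6*b*d + 2*c**2 + 15*d**2 - 1
    sk = 2*c*(b**2 + 24*b*d + 105*d**2 + 2) - skew
    ku = 24*(b*d + c**2*(1 + b**2 + 28*b*d)
             + d**2*(12 + 48*b*d + 141*c**2 + 225*d**2)) - ekurt
    return [var, sk, ku]

skew, ekurt = 0.8, 2.4          # illustrative target moments for one QLQ-C30 domain
b, c, d = fsolve(fleishman_equations, x0=[1.0, 0.1, 0.0], args=(skew, ekurt))

# Generate a non-Normal variable with (approximately) the target moments
z = np.random.default_rng(4).standard_normal(100_000)
y = -c + b*z + c*z**2 + d*z**3
print(y.mean(), y.std())        # approximately 0 and 1 by construction
```

Correlation between the simulated domain scores is then induced by generating the underlying Z values from a correlated multivariate Normal distribution before applying each domain's transform.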
For each simulated dataset, cross-validation was used. Half (50 %) of the simulated dataset (randomly selected) was used to develop the mapping model and the other half was used to test the model (out-of-sample predictions). For each realisation (i.e. simulated dataset), the model performance statistics (e.g. RMSE and R²) were generated and reported. Although there is no theoretical reason for using 50 % of the data to develop the model, other splits (e.g. 75 % vs. 25 %) were also considered.
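The split-sample validation step can be sketched as below. This is a minimal linear-model illustration on simulated data (three hypothetical domain scores and assumed coefficients); in the analysis itself this loop would run once per simulated dataset and per model.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1_000
scores = rng.uniform(0, 100, (n, 3))            # three hypothetical QLQ-C30 domains
X = np.column_stack([np.ones(n), scores])       # design matrix with intercept
beta_true = np.array([0.1, 0.004, 0.002, 0.001])
y = X @ beta_true + rng.normal(0, 0.08, n)      # simulated utilities

# 50/50 random split: develop the model on one half, test on the other
idx = rng.permutation(n)
train, test = idx[: n // 2], idx[n // 2 :]
beta_hat, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)

# Out-of-sample performance on the held-out half
resid = y[test] - X[test] @ beta_hat
oos_rmse = np.sqrt(np.mean(resid ** 2))
print(oos_rmse)   # should be close to the simulated noise SD (0.08)
```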

Random effects model
The performance of the random effects model was comparable to the LDVMM. Table 3 shows the parameter estimates for the 15 QLQ-C30 coefficients. If all scores for the functional domains, Global score and Finance score are assumed to be perfect (i.e. scores of 100) and no signs and symptoms are present (i.e. scores of 0), the predicted EQ-5D-3L and EQ-5D-5L utilities are estimated to be about 0.89 and 0.96 respectively. On the other hand, if symptom and functional scores are the worst possible (scores of 0 and 100 for function and symptoms respectively), the predicted EQ-5D-3L and EQ-5D-5L utilities fall to about 0.10 and 0.09 respectively. The EQ-5D-5L therefore predicts better at both extremes (Table 5).

LDVMM
The LDVMM estimates are more complicated to generate because they involve two distributions and two mixing probabilities. Consequently, more than 32 parameters are involved in determining predictions for the best and worst case scenarios (Table 6). The LDVMM also predicts well at the extremes, despite similar R² and RMSE to the random effects model (Tables 5 and 7). However, the LDVMM is much more complex to use as an algorithm: users would also need to know the details of the mixing probabilities and make stronger assumptions about the mixed distribution. Other mixtures were also considered, but the Normal/Beta mixture offered the best-fitting model (smallest AIC). The results echo a limitation reported previously [5]: over-prediction at poorer health states. There does, however, appear to be some evidence that mapping algorithms based on the EQ-5D-5L may yield improved predicted utilities at poorer health states. In particular, the BB model showed improved predictions regardless of the instrument. The predictions at poorer health states (Fig. 4) present some interesting findings. Modelling with the LDVMM used a BB and a Normal distribution, with values >0.30 modelled assuming a BB distribution. Predictions at poorer health states (assumed to be -0.549 to 0.30) appear slightly worse, while better predictions with the LDVMM are observed for EQ-5D values ≥0.30. This supports the BB model as a plausible basis for developing a mapping algorithm.
The predicted values are notably worse for the EQ-5D-3L. About 50 % of predicted utilities were over-predictions (higher than the observed value by any amount) with the EQ-5D-5L; for the EQ-5D-3L this was 67 %; 93 % vs. 97 % of utilities were over-predictions for the EQ-5D-5L and EQ-5D-3L respectively.

Simulation and cross validation
Each simulated dataset of 985 observations for the EQ-5D-5L and EQ-5D-3L was subjected to cross-validation using a 50 % random sample (about 492 observations each for the EQ-5D-5L and EQ-5D-3L respectively) for the BB model. Hence, a total of 1,000 R², RMSE and mean predicted values were obtained (Table 8 and Figs. 5-8). For the EQ-5D-5L and EQ-5D-3L respectively, the average (mean) R² from the BB model was 76 % (range 51 % to 89 %) and 68 % (range 38 % to 79 %); RMSEs averaged around 0.099 (range 0.069 to 0.155) and 0.113 (range 0.058 to 0.177). Simulations from the random effects and LDVMM models showed similar performance, but both were worse than the BB. Predicted mean utilities were closer to the observed values for the EQ-5D-5L: 0.572 vs. 0.575, whereas for the EQ-5D-3L these were 0.515 vs. 0.518 (Table 8 and Figs. 5, 6, 7 and 8). Hence, out-of-sample predictions for the EQ-5D-5L appeared more accurate than those for the EQ-5D-3L, particularly with the BB model. When a different split was used (e.g. 75 % to develop the model and 25 % for prediction), the conclusions did not change.

Discussion
We have developed and compared three mapping algorithms for the EQ-5D-5L and EQ-5D-3L using contemporary and novel modelling methods. We have shown that the EQ-5D-5L may offer better prediction at poorer health states, where several previous algorithms with the EQ-5D-3L have, by and large, over-predicted. Modest improvements of an algorithm based on the EQ-5D-5L over one based on the EQ-5D-3L in terms of statistical metrics (e.g. R², percent predicted) have been confirmed with a BB model in this and previous analyses [5]. Young et al. [33] suggested that two-part models may offer a way to predict the different parts of the distribution in the context of mapping, with improved handling of over-prediction. More recently, Crott [34] confirmed the suitability of BB-type models over other models. In this analysis we have confirmed the bimodal nature of the EQ-5D-5L value sets noted earlier (Oppe et al.) [24] (Fig. 6). To our knowledge, this is the first time a mapping algorithm has been developed simultaneously from the EQ-5D-5L and EQ-5D-3L in the same lung cancer patients using the EORTC-QLQ-C30, with the two compared in a real-world NHS setting. Previous work with the EQ-5D-5L highlighted some of the limitations of the EQ-5D-3L, such as bimodality of utilities and a lack of sensitivity to detect differences between treatment groups [35][36][37]. Some earlier mapping models did not take this into account: Cheung et al. [25], for example, report an algorithm using the FACT-B in a breast cancer population with an R² of around 48 % (AIC was not reported).
In this analysis, over-prediction at poorer health states still exists with the EQ-5D-5L, although it is not as marked as with the EQ-5D-3L. It remains to be seen whether the final value sets (Oppe et al.) [24], currently being developed and validated, will affect predictions at poorer health states. The reasons for over-prediction may be due to several factors, including the functional form of the model, the range of the scale (5-point vs. 3-point), the number of health states and other clinical characteristics. Khan & Morris [5] previously suggested that over-estimation at poorer health states may be related to other factors such as poorer prognosis. Preliminary evidence of this is shown by the relationship between ECOG performance status and EQ-5D utilities (Table 9). It is possible that further complexity is required in the modelling, for example by using the joint distribution of utilities and other outcomes (e.g. adverse events) to model the QLQ-C30 scores.
In this study, the EQ-5D-5L and EQ-5D-3L assessments were taken close together in time; there may therefore be some concern about 'carry-over' or recall bias. To check this, we determined whether health state responses were recorded similarly. For example, if a response of 11112 was observed for the EQ-5D-3L, we checked whether this was also observed for the EQ-5D-5L (response levels >3 are not possible for the EQ-5D-3L). We noted that for 15 of the 146 (EQ-5D-5L) health states, the responses for the EQ-5D-5L and EQ-5D-3L were the same; for example, patients responded 11111 to both the EQ-5D-5L and EQ-5D-3L in 18 of the 985 pairs of observations (<2 %). In the vast majority of cases the responses were different. This suggests that patients did not recall their previous responses and that carry-over is unlikely.
There are several limitations of this research. First, the sample size is small with relatively few health states, although it is larger than that used for the algorithm reported by Kontodimopoulos (2009) [38]. Second, inferences should be restricted to a similar NSCLC population until further evidence of wider applicability across tumour types emerges. Third, external validation in an independent dataset was not possible; cross-validation, accompanied by simulation for out-of-sample predictions, was therefore used as a 'second best'. Fourth, insufficient numbers of events were available for reliable computation of QALYs, and therefore the impact on QALYs could not be assessed at this time. Finally, the EQ-5D-5L values are cross-walked from the EQ-5D-3L and are therefore subject to uncertainty. However, in the absence of a readily identified value set, and given that the EQ-5D-5L is being used in current clinical research, using the EQ-5D-3L cross-walk value sets should be considered acceptable in the interim.
Despite these limitations, this is the first mapping algorithm for the EQ-5D-5L using real-world data, with enhanced generalizability outside the RCT context. Further research is consequently required.