Mapping the EQ-5D index from the cystic fibrosis questionnaire-revised using multiple modelling approaches

Background This study was designed to develop a mapping algorithm to estimate EQ-5D utility values from Cystic Fibrosis Questionnaire-Revised (CFQ-R) data. Methods A cross-sectional survey of adults with cystic fibrosis (CF) was conducted in the UK. The survey consisted of the CFQ-R, the EQ-5D and a background questionnaire. Eight regression models, exploring item and domain level predictors, were evaluated using three different modelling approaches: ordinary least squares (OLS), Tobit, and a two-part model (TPM). Predictive performance in each model was assessed by intraclass correlations, information criteria (Bayesian information criterion and Akaike information criterion), and root mean square error (RMSE). Results The survey was completed by 401 participants. For all modelling approaches the best performing item level model included all items, and the best performing domain level model included the CFQ-R Physical-, Role- and Emotional-functioning, Vitality, Eating Disturbances, Weight, and Digestive Symptoms domains and a selection of squared terms. Overall, the item level TPM including age and gender covariates performed best in within-sample validation, but OLS and TPM domain models with squared terms performed best out-of-sample and are recommended for mapping purposes. Conclusions Domain and item level models using all three modelling approaches reached an acceptable degree of predictive performance, with domain models performing well in out-of-sample validation. These mapping functions can be applied to CFQ-R datasets to estimate EQ-5D utility values for economic evaluations of interventions for patients with cystic fibrosis. Further research evaluating model performance in an independent sample is encouraged. Electronic supplementary material The online version of this article (doi:10.1186/s12955-015-0224-6) contains supplementary material, which is available to authorized users.


Background
Cystic fibrosis (CF) is a hereditary and life-threatening autosomal recessive disorder. An estimated 80,000 children and young adults suffer with CF worldwide, with a rate of 1 case per 2,500 births [1]. If untreated, patients are likely to suffer from chronic respiratory infections, pancreatic enzyme insufficiency and associated complications. Advances in treatment and management have resulted in an increase in survival rates. The predicted median age of survival for a person with CF is in the late 30s, and over half of children born in the 1990s are expected to survive into their fifth decade [2]. Despite these advances, the disease still represents a very significant burden for patients in terms of their symptoms, loss of functioning and poor health-related quality of life (HRQL) [3].
HRQL is a multi-dimensional concept, which reflects an individual's subjective evaluation of his or her daily functioning (i.e. physical, psychological, emotional and social functioning) and well-being. Poor lung functioning [Forced Expiratory Volume in 1 second (FEV1) < 30% predicted] and pulmonary exacerbations in the past 6 months have been related to poor HRQL [4,5]. The Cystic Fibrosis Questionnaire-Revised (CFQ-R) is a validated patient-reported outcome (PRO) measure of HRQL specifically designed for individuals with CF [6,7]. The CFQ-R is commonly used in CF clinical trials, where it has demonstrated responsiveness [8,9] and has been used to support PRO label claims.
Decision makers within drug licensing authorities such as the US Food and Drug Administration (FDA) and payers such as the National Institute for Health & Care Excellence (NICE) in the UK have become increasingly interested in the information that can be captured from HRQL PROs. NICE, and many other health technology assessment bodies globally, are interested in understanding the benefits of health technologies in terms of quality-adjusted life years (QALYs): a metric incorporating length and quality of life. Estimating QALYs requires a specific type of HRQL data that reflects the value that people place on HRQL rather than just a psychometric score. This value is referred to as utility and is measured on a scale of 0 (dead) to 1 (full health). UK national guidelines regarding the data used in health technology appraisals recommend the use of generic preference-based measures to capture utility, with a stated preference for the EQ-5D questionnaire [10]. However, these data are not always collected in clinical trials. To address this data gap it is possible to estimate EQ-5D scores from a different PRO, such as the CFQ-R, with the development of a robust mapping algorithm. Mapping studies often also incorporate demographic characteristics into model estimation to increase a model's predictive performance [11][12][13]. This approach is endorsed by NICE [14] and there is a growing body of literature related to the development of mapping functions linking source disease-specific HRQL measures onto target preference-based measures using regression models [15].
The present study was designed to develop a mapping algorithm to estimate EQ-5D utility values from CFQ-R data, with and without adjustment for demographic characteristics (age and gender). This will enable existing and future trial datasets, which include CFQ-R (but not EQ-5D), to be used by decision makers to understand the value of new health technologies in CF.

Study design and participants
A cross-sectional observational study conducted as an online survey was undertaken in the UK. The option to complete a pen and paper survey through the post was provided but not utilised by any respondents. The survey was advertised by the Cystic Fibrosis Trust (CF Trust) through adverts on the CF Trust website, forum, Facebook page and Twitter account, and through Google AdWords. Potential respondents were informed that the CF Trust would receive a £50 donation for every completed survey; respondents did not receive any direct remuneration for their participation.
All participants had a self-reported clinical diagnosis of CF, were aged 18 years or above and currently resident in the UK. Participants were also asked to rate their CF severity as mild, moderate or severe during screening to ensure sample variability in HRQL item responses.

Ethics
Independent ethical review was sought and granted by Schulman Associates Independent Institutional Review Board Inc. Informed consent was obtained from all participants prior to completion of the online survey.

Survey
Interested participants followed a link provided by the CF Trust to be taken to an information sheet describing the purpose of the survey, the consent form and the survey. The survey was conducted from January to March 2012. The survey consisted of three questionnaires: the CFQ-R, the EQ-5D, and a demographic/clinical background form. Each of these measures is described in more detail below.

CFQ-R
The CFQ-R is a validated disease-specific questionnaire measuring health-related quality of life in CF patients [6,7]. The teen/adult UK English version of the questionnaire, suitable for ages 14+, was used. This consists of 50 items across 12 domains: 'physical functioning', 'role functioning', 'emotional functioning', 'vitality', 'social functioning', 'body image', 'eating disturbances', 'treatment burden', 'health perceptions', 'weight', 'respiratory symptoms', and 'digestive symptoms'. All items use categorical response options, with values ranging from 1 to 4. Domain scores were calculated using the developer's guidelines, which produce a potential range of scores from 0 to 100, with higher scores indicating better HRQL.
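The domain scoring step can be illustrated with a minimal sketch. It assumes the commonly described rescaling of the mean item response from the 1 to 4 range onto 0 to 100, with items already reverse-coded where needed so that higher responses mean better health. The function name `cfqr_domain_score` is hypothetical, and the developer's manual (including its rules for missing items) remains the authoritative source.

```python
def cfqr_domain_score(item_responses):
    """Rescale the mean of 1-4 item responses to a 0-100 domain score.

    Assumption: responses are already reverse-coded where required,
    so that 4 is the best and 1 the worst response on every item.
    """
    if not item_responses:
        raise ValueError("at least one item response is required")
    if any(r not in (1, 2, 3, 4) for r in item_responses):
        raise ValueError("responses must be integers from 1 to 4")
    mean_response = sum(item_responses) / len(item_responses)
    # Shift to a 0-3 range, then stretch to 0-100.
    return (mean_response - 1) / 3 * 100
```

Under this assumption a respondent answering 4 on every item of a domain scores 100, and one answering 1 on every item scores 0.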

EQ-5D
The EQ-5D-3L is a generic preference-based measure of HRQL [16][17][18]. The questionnaire consists of five domains: 'mobility', 'self-care', 'usual activity', 'pain/discomfort', and 'anxiety/depression'. Participants also indicate their current health on a visual analogue scale ranging from 0 (worst imaginable health state) to 100 (best imaginable health state). Health utilities were derived from the EQ-5D using UK general population preference weights [19], which provide a potential range of scores from −0.59 to 1.0; a score of 1 represents full health, a score of 0 represents a state equivalent to dead, and a score below 0 represents a state worse than dead. NICE states that EQ-5D is the preferred source of utility values for use in economic evaluation [10].
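As an illustration of how a five-digit EQ-5D-3L health state is converted to a utility, the sketch below applies the decrements commonly reported for the UK time trade-off tariff. The coefficient values and the function name `uk_utility` are not taken from this paper and should be checked against the published value set [19] before any use.

```python
DIMENSIONS = ("mobility", "self_care", "usual_activity", "pain", "anxiety")

# Decrements per dimension for levels 1, 2 and 3 (assumed UK TTO tariff values).
DECREMENTS = {
    "mobility": {1: 0.0, 2: 0.069, 3: 0.314},
    "self_care": {1: 0.0, 2: 0.104, 3: 0.214},
    "usual_activity": {1: 0.0, 2: 0.036, 3: 0.094},
    "pain": {1: 0.0, 2: 0.123, 3: 0.386},
    "anxiety": {1: 0.0, 2: 0.071, 3: 0.236},
}
CONSTANT = 0.081  # applied when any dimension is worse than level 1
N3 = 0.269        # applied when any dimension is at level 3


def uk_utility(state):
    """Return the utility for a health state such as '11211'."""
    levels = [int(ch) for ch in state]
    if len(levels) != 5 or any(lv not in (1, 2, 3) for lv in levels):
        raise ValueError("state must be five digits, each 1, 2 or 3")
    if all(lv == 1 for lv in levels):
        return 1.0  # full health
    utility = 1.0 - CONSTANT
    for dim, lv in zip(DIMENSIONS, levels):
        utility -= DECREMENTS[dim][lv]
    if 3 in levels:
        utility -= N3
    return round(utility, 3)
```

With these assumed weights the worst state, 33333, evaluates to −0.594, which matches the lower bound of the range quoted above.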

Demographic/clinical background form
The demographic/clinical background form gathered data on respondents' age, sex, ethnicity, employment status, time since CF diagnosis, FEV1 (if known), date of last FEV1 assessment, and exacerbation occurrence since last FEV1 assessment.
None of the respondents had missing data in the EQ-5D, CFQ-R, age or gender.

Analysis
Model development and specifications
Figure 1 shows the distribution of EQ-5D utility scores, which was used to determine which modelling approaches to use. Of the sample, 19% had a score of 1 (i.e. full health) while 3% had a score less than 0. Three regression modelling approaches were used to identify the most parsimonious prediction model with the best fit: an ordinary least squares (OLS) model, a Tobit model, and a two-part model (TPM). The OLS approach is used to estimate the unknown parameters in a linear regression model by minimizing the sum of squared errors from the data. This model has frequently been identified as the most parsimonious and best fitting model in utility mapping studies when compared to other methods designed to cope with bounded and multi-modal distributions [15,20]. The Tobit model (also known as the censored regression model) takes better account of the censored nature of EQ-5D data, deals with truncated data and can approximate for skewed data by setting the upper limit to 1. Censored least absolute deviation (CLAD) models have also been advocated to deal with censoring, but these are median-based models while most economic evaluation models are mean-based [14]; therefore CLAD was not assessed. The TPM approach deals with the high proportion of values that are at 1.0 [21][22][23]. The first part of the two-part model uses a logit regression to estimate the probability that an individual is in full health. The second part estimates EQ-5D utilities for the remaining observations using a truncated OLS model, whose predicted values can lie between −0.594 and 0.99. The two parts of the model are combined using the expected value method to calculate the EQ-5D score as:
$$\mathrm{EV}(\text{EQ-5D}) = P(\text{EQ-5D} = 1) \times 1 + P(\text{EQ-5D} \neq 1) \times \text{Predicted EQ-5D}_{\text{part 2}}$$
where EV = expected value; P(EQ-5D = 1) = probability of being at score 1, predicted from part 1; P(EQ-5D ≠ 1) = 1 − P(EQ-5D = 1), the probability of not being at 1; and Predicted EQ-5D part 2 = predicted EQ-5D from a truncated OLS regression for those who score less than 1.
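The expected value combination of the two parts can be sketched as follows. The function names and the example inputs are hypothetical; in practice the probability would come from the fitted logit model and the part 2 prediction from the fitted truncated OLS model.

```python
import math


def logistic(linear_predictor):
    """Inverse logit: converts a part 1 linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-linear_predictor))


def tpm_expected_utility(p_full_health, predicted_part2):
    """Combine the two parts of a two-part model by the expected value method.

    p_full_health   -- P(EQ-5D = 1) predicted by part 1 (logit)
    predicted_part2 -- predicted EQ-5D for those scoring below 1 (part 2)
    """
    if not 0.0 <= p_full_health <= 1.0:
        raise ValueError("p_full_health must be a probability")
    return p_full_health * 1.0 + (1.0 - p_full_health) * predicted_part2
```

For example, a respondent with a 20% predicted probability of full health and a part 2 prediction of 0.60 would receive 0.2 × 1 + 0.8 × 0.60 = 0.68.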
Based on recommendations in the literature [14], separate models were tested in which the CFQ-R domains and the CFQ-R items were used to predict EQ-5D utility scores using the three modelling approaches. The CFQ-R Health domain and its constituent items were not selected for the regression models as all of the remaining CFQ-R items also measured health; thus these items would either be redundant or cause problems of multicollinearity, which would violate the regression assumptions and render the model unreliable. Item 43 (How has your mucus been?) from the CFQ-R was also removed from the regression analysis as this item was a sub-question to which not all participants provided a response. Gender and age were also included in one of the regression specifications for all three models. Self-reported FEV1 was not included in any models as the aim was to estimate a mapping function specifically from the CFQ-R, rather than from a combination of measures of CF. In total, eight different sets of independent variables were evaluated to ensure the best model specification was selected, and each was repeated using the OLS, Tobit and TPM mapping methods. In all item level models, the items were reverse coded if appropriate, and dummy coded with a score of 1 (poor health) as the reference category. In Model 7, unordered items (where coefficients did not follow the predicted order of magnitude across good to poor response options) were dichotomised to 'no problems' versus 'other'. In the TPM model, item level models were also collapsed to 2 or 3 levels for those in full health.
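The item coding described above can be sketched as below. This is an illustration only: it assumes that reverse coding on a 1 to 4 scale is done as 5 minus the response, and the helper names are hypothetical. Which items need reverse coding is defined by the CFQ-R scoring guidelines and is not reproduced here.

```python
def reverse_code(response):
    """Flip a 1-4 response so that 1 always denotes the poorest health."""
    if response not in (1, 2, 3, 4):
        raise ValueError("response must be an integer from 1 to 4")
    return 5 - response


def dummy_code(response, levels=(1, 2, 3, 4), reference=1):
    """Return indicator variables for every level except the reference.

    With the defaults, a response of 1 (poor health) maps to [0, 0, 0]
    and responses of 2, 3 and 4 each switch on one indicator.
    """
    if response not in levels:
        raise ValueError("response is not a valid level")
    return [1 if response == lv else 0 for lv in levels if lv != reference]
```

Each item therefore contributes three dummy variables to an item level model, with coefficients interpreted relative to the poorest response category.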
The Ramsey Regression Equation Specification Error Test was used to assess misspecification in the linear models obtained using OLS. The linktest was used to assess misspecification in the Tobit model and the second part of the TPMs. Multicollinearity was assessed using the variance inflation factor (VIF) with values greater than 10 indicating a problem. Bootstrapped bias-corrected (2000 replications) or robust standard errors are reported for all models.
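The multicollinearity check can be illustrated with a short, dependency-free sketch of the variance inflation factor, VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing predictor j on the remaining predictors. The helper names are hypothetical, the solver is a plain Gaussian elimination chosen for brevity rather than numerical robustness, and perfectly collinear predictors (which would give a division by zero) are not handled.

```python
def _solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def r_squared(y, columns):
    """R^2 from an OLS regression of y on the given columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(k)]
           for a in range(k)]
    xty = [sum(design[i][a] * y[i] for i in range(n)) for a in range(k)]
    beta = _solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def vif(columns):
    """Variance inflation factor for each predictor column."""
    factors = []
    for j, col in enumerate(columns):
        others = columns[:j] + columns[j + 1:]
        factors.append(1.0 / (1.0 - r_squared(col, others)))
    return factors
```

Uncorrelated predictors give values close to 1, whereas a predictor that is nearly a copy of another gives values well above the threshold of 10 used in this study.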

Model validation and comparison
Model goodness of fit was assessed by adjusted/pseudo R² statistics (OLS and Tobit models only), Bayesian information criterion (BIC) and Akaike information criterion (AIC) statistics. Lower BIC and AIC values indicate a better fitting model. To examine the predictive performance of the models, the differences between the predicted and observed EQ-5D scores at the individual level were examined by computing the mean squared error (MSE) and root MSE (RMSE). Smaller error values are indicative of better performing models. Plots of the observed and predicted EQ-5D scores were used to examine the performance of the models. Predicted and observed EQ-5D utility scores and RMSE were also compared across different EQ-5D ranges and CF severity as measured by percentage of predicted FEV1 (FEV1 groups: mild = >70%, moderate = 41% to 70%, severe = <41%). ANOVA models were used to examine differences in predicted scores across EQ-5D ranges and FEV1 severity groups. Intra-class correlations, which measure the level of agreement between the predicted and observed scores, were also assessed.
It is recommended that, where possible, an external dataset is used as a validation dataset to determine the accuracy of predicted utility values of the selected models out-of-sample [14]. However, no external dataset was available for the present study and therefore the performance of the mapping algorithms was assessed using a cross-validation approach. The sample was randomly split into four groups of 25% each. The best fitting models within sample were re-run on three of the four groups and applied to the excluded group, in an iterative process, until each of the groups had been used in both estimation and validation. In each iteration, 75% of the data were therefore used as an estimation dataset for building models, and 25% were used as a validation dataset. The proportion of responses for the estimation dataset is larger than for the validation dataset to enhance model accuracy with a greater number of responses.
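The error and information criterion statistics named above follow standard definitions, sketched here for reference. The function names are hypothetical, and the log-likelihood and parameter count are assumed to come from whichever fitted model is being evaluated.

```python
import math


def mse(observed, predicted):
    """Mean squared error between observed and predicted utilities."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def rmse(observed, predicted):
    """Root mean squared error; smaller values indicate better prediction."""
    return math.sqrt(mse(observed, predicted))


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln(L)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood
```

Because BIC multiplies the parameter count by ln(n), it penalises additional parameters more heavily than AIC once the sample exceeds about eight observations, which matters when comparing item level models (many dummy variables) with domain level models.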
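The four-group cross-validation procedure can be sketched as follows. To keep the example short and self-contained it fits a simple OLS regression with a single predictor, standing in for the study's multi-variable models; the function names, the random seed and the single-predictor simplification are all assumptions made for illustration.

```python
import random


def fit_simple_ols(x, y):
    """Return (intercept, slope) of a one-predictor least squares fit."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def cross_validated_rmse(x, y, n_groups=4, seed=0):
    """Estimate on all groups but one, validate on the held-out group, repeat."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)  # random allocation to groups
    groups = [indices[k::n_groups] for k in range(n_groups)]
    results = []
    for held_out in groups:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        intercept, slope = fit_simple_ols([x[i] for i in train],
                                          [y[i] for i in train])
        squared_errors = [(y[i] - (intercept + slope * x[i])) ** 2
                          for i in held_out]
        results.append((sum(squared_errors) / len(squared_errors)) ** 0.5)
    return results  # one out-of-sample RMSE per validation group
```

Each observation is used for validation exactly once and for estimation in the other three iterations, which is the property the procedure above relies on.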
All regression analyses were conducted using STATA v 11.

Sample characteristics
A total of 401 participants completed the survey; all surveys were completed online. The demographic and clinical characteristics of participants, by FEV1 severity group and for the sample as a whole, are presented in Table 1. The sample represented a broad range in terms of demographics and disease severity.

Descriptive statistics for EQ-5D and CFQ-R
Observed EQ-5D utility and CFQ-R domain scores of the participants are shown in Table 2. The mean EQ-5D score was 0.67 (SD = 0.28), ranging from −0.35 to 1, which is only slightly narrower than the theoretical range of −0.59 to 1. Both the EQ-5D and CFQ-R mean scores reflect the self-reported disease severity as measured by FEV1, with utility and almost all CFQ-R domain scores declining with increased severity. The digestive symptoms domain was the only domain not reflecting FEV1 severity.

Regression modelling
In total, 24 models were explored (8 specifications each for OLS, Tobit and TPM). The goodness of fit and predictive performance statistics from the best domain and item level model for each regression type are presented in Table 3. The identification of the best domain and item level models was based on an examination of all goodness of fit and predictive performance statistics. The performance statistics of the 18 models not presented are available upon request.
In the OLS, Tobit and TPM regressions, the best performing domain level model within sample was model 3: including statistically significant domains at the 10% level, plus significant squared terms. There was no evidence of multicollinearity in any of the domain level models (mean and individual variable VIF < 10) apart from where expected when squared terms are included. There was evidence of misspecification in all the OLS models including model 3 but the Tobit and TPM model 3 were not misspecified. The best performing item level model within sample was model 5 (all CFQ-R items included in analysis) for OLS, Tobit and TPM; however, the TPM model 5 was improved with the addition of age and gender as covariates (model 8). Item level models had mean VIF <10 but some individual dummy variables (19/139) had VIF greater than 10 which indicates problems with multicollinearity when all items were included. Item level models also had evidence of misspecification for all models apart from model 7.
As shown in Table 3, all six best performing models demonstrated good predictive performance within sample; all predicted means (0.671 to 0.691) were within 0 to 0.02 of the observed mean (0.671), and the fitted ranges of the EQ-5D preference-based values were within 0.128 to 0.296 of the lower bound observed value (−0.349). As would be expected, only OLS models exceeded the upper bound observed value of 1. In all instances the item level models performed marginally better than the domain level models. All the models showed good ICC (>0.7) between predicted and observed EQ-5D values. This is further illustrated in Figure 2 and Table 4, where the mean observed and predicted EQ-5D preference-based values by health state ranking indicate over prediction for more severe health states (where the observed EQ-5D value was less than 0.3), and under prediction for very mild health states (where the observed EQ-5D value was above 0.9). However, Table 4 also illustrates that all six best performing domain and item level models demonstrated responsiveness to severity as assessed by EQ-5D and FEV1 sub-groups. There were statistically significant differences (all p < 0.001) across EQ-5D and FEV1 health states for each model's predicted EQ-5D values. The best performing within sample model overall was the item level TPM, including age and gender covariates. This model performed best when predicting values across the range of EQ-5D observed scores, did not include out of range predicted values, and demonstrated good predictive performance with the lowest RMSE values.
All 6 models were tested in the out-of-sample cross-validation. A one way analysis of variance test indicated no significant differences between the mean observed EQ-5D values of the validation and estimation samples across the 4 samples (F(3, 397) = 0.05, p = 0.985).
Table 5 provides summary statistics of the observed and predicted EQ-5D utility scores in each of the four samples, based on models run in the other 3 samples; for example, sample 1 predicted scores are based on models estimated in the combined samples 2 to 4. Predicted mean values differ from the observed mean values (difference 0.001 to 0.02) for most of the models, with either the OLS or Tobit domain model (model 3) having the smallest differences in samples 1 to 3 and the TPM item model (model 8) having the smallest difference in sample 4. In all the samples apart from sample 2, all the models perform poorly at predicting the full observed range, particularly at the poor end of health (difference 0.03 to 0.62). In sample 2 the OLS item model (model 5) and the TPM domain model (model 3) are within 0.004 of the observed minimum score. Tobit and TPM item models (5 and 8) predict the maximum accurately, while OLS models predict values greater than 1, particularly in the item models. In all the samples, RMSE is smallest in the OLS and TPM domain models (0.118 to 0.146) and largest in the TPM item level models (0.182 to 0.223). ICC is larger in the domain models (0.50 to 0.81) compared to the item models (0.29 to 0.56), indicating better agreement between observed and predicted scores in the former. Assessment of RMSE across the EQ-5D range indicates that all models are poor at predicting at the poor end of health, but TPM item level models also have larger RMSE in other parts of the EQ-5D range (see Additional file 1, which details the results of Table 4 for each of the 4 cross-validation samples).
Based on RMSE and ICC and mean predicted values, the OLS and TPM domain model (model 3) perform best out-of-sample but are not good at predicting the range of values. This contrasts with within sample predictions where TPM item model (model 8) performs best. This may be in part due to poor performance of these models when the samples are smaller as is the case when running the models in only 75% of the sample. However, the item models also have misspecification and multicollinearity, which may increase the variation in predicted scores. We therefore recommend the OLS or TPM model 3 (Table 6) for generating EQ-5D utility scores where they are not available.

Discussion
This study is the first attempt, to our knowledge, to develop a mapping function to estimate EQ-5D preference-based values from a condition-specific measure for patients with CF. The results from this relatively large survey of 401 patients with different levels of disease severity confirmed that EQ-5D preference-based values, or utility values, can be estimated from the CFQ-R using mapping functions. These predicted utility values can be used to inform cost-effectiveness models. The study sample included a diverse range of CF severity, as measured by FEV1 and observed EQ-5D values, with good sample sizes across FEV1 severity categories and close to the full range of theoretical EQ-5D scores represented (1 to −0.35 versus 1 to −0.59). The range of CFQ-R scores was also broad, with means from 21 to 77. This represents a broader and more severe range than that included in the CFQ-R validation (mean range = 51 to 92) [24], but is similar to that reported by Bradley et al. (mean range = 25 to 85) [25]. The slight difference in ranges may be due to the sampling methodology, which allowed completion of the questionnaires in the privacy of the patients' homes rather than on site; as participants were not recruited through clinics, they may also represent a less adherent/controlled group. In addition, our sample only included adults (aged 18+) and had a slightly higher proportion of females; age and female gender have both been associated with lower (worse health) scores [24]. As mapping is best supported by datasets with a rectangular distribution to increase the predictive performance of the final algorithm across the entire spectrum of scores, this diversity is likely to have contributed to the consistently strong mapping results seen across regression approaches and item and domain level models.
Assessment of models within sample indicated that the item level models (model 5) outperformed the domain models in terms of predicting the mean, the range, minimising RMSE and levels of agreement with the observed EQ-5D utility scores. However, item models suffered from misspecification and there was evidence of some multicollinearity. Domain level models with squared terms were better specified than the item level models and had no problems with multicollinearity apart from where expected in the squared terms. The domain model with squared terms also performed relatively well within sample in terms of RMSE and ICC. For within-sample predictions, the TPM performed marginally better than the OLS or Tobit models in terms of RMSE, ICC and the range of predictions.
In the out-of-sample validation, testing of the best performing domain (model 3) and item level models (model 5 or 8) showed that unlike within sample, domain level models performed better in terms of predicting the mean, minimising the RMSE and level of agreement between observed and predicted scores based on ICC, while item level models performed better in terms of predicting the range of scores. OLS models were better at predicting the mean and minimising the RMSE while TPM models tended to have larger RMSEs. The TPM models performed better in terms of ICC with slightly higher ICCs in TPM model 3 compared to the same model in OLS. Overall, the best performing models in out-of-sample validation were the OLS and TPM domain models (model 3); these included the 'physical functioning', 'role functioning', 'emotion', 'vitality', 'eat', 'weight', and 'digestion' domains. Thus, given the misspecification and multicollinearity problem associated with item level models, these two domain models are recommended for generating EQ-5D utility scores from CFQ-R data when no utility data exists. These domain model algorithms can be applied to item level data when domain scores are generated, or when item level data is not available as is often the case when effectiveness information is drawn from published trial data.
When comparing the ranges of the predicted values of all mapping functions to the observed range of EQ-5D values, there was a tendency towards over prediction in all models for observed EQ-5D values lower than 0.3 and, to a lesser extent, under prediction for observed EQ-5D values above 0.9, both within sample and in out-of-sample validation. Over prediction of low preference-based values is not uncommon in the mapping literature when mapping to the EQ-5D [13,26,27]. The sample did not cover the full range of EQ-5D scores and only a small proportion (3%) had scores less than 0, which makes it difficult to accurately predict in this part of the scale. However, as the over prediction occurs at the very severe end of the EQ-5D spectrum, lower than the observed EQ-5D mean reported in the self-reported FEV1 'severe' group, this is likely to have limited impact on the application of these algorithms. It is important that the uncertainty around mapped estimates is considered when applying these values to cost-effectiveness analysis. It is interesting to note that the respiratory symptoms domain was not a significant predictor of EQ-5D utility in any of the models. This is likely to be due to the fact that the impact of respiratory symptoms is captured through functioning dimensions of the CFQ-R, which map onto the dimensions in the EQ-5D. It is not uncommon for symptoms that are very specific to a condition to be unrelated to utility scores. However, given the focus on respiratory symptoms in CF trials, it may be worth exploring the potential to increase sensitivity in utility scores by developing a condition-specific preference-based measure.

Limitations
Recruitment was conducted through the CF Trust rather than clinical sites; thus diagnosis and FEV1 values were self-reported. However, this method allowed for the recruitment of a diverse range of participants, with good age and gender variability across severity levels, and FEV1 values in line with a CF population [28,29]. Furthermore, the key measures included in the present study were the EQ-5D and the CFQ-R, which are developed to be patient-reported, and the values reported in the present study are in line with those previously reported in CF [25]. A second limitation is the use of a split-sample method for the estimation and validation of the best fitting model. Validation of the model should be conducted on an independent sample rather than a subset, as was required here due to sample size. However, the cross-validation method employed in this study permitted the best use of the data to maximise the assessment of model performance.
Table note: OLS, ordinary least squares; TPM, two-part model; Model 3 = CFQ-R domains that are statistically significant at the 10% level + statistically significant squared terms. SE = standard error; ***p < 0.01, **p < 0.05, *p < 0.1.