Mapping onto EQ-5D for patients in poor health

Background An increasing number of studies report mapping algorithms that predict EQ-5D utility values from disease-specific non-preference-based measures. Yet many mapping algorithms have been found to systematically overpredict EQ-5D utility values for patients in poor health. Currently there are no guidelines on how to deal with this problem. This paper examines why overestimation of EQ-5D utility values occurs for patients in poor health, and explores possible solutions.
Method Three existing datasets are used to estimate mapping algorithms and to assess existing mapping algorithms from the literature that map the cancer-specific EORTC QLQ-C30 and the arthritis-specific Health Assessment Questionnaire (HAQ) onto the EQ-5D. Separate mapping algorithms are estimated for poor health states. Poor health states are defined using a cut-off point for QLQ-C30 and HAQ, which is determined using association with EQ-5D values.
Results All mapping algorithms suffer from overprediction of utility values for patients in poor health. The large decrement for reporting 'extreme problems' in the EQ-5D tariff, few observations with the most severe level in any EQ-5D dimension and many observations at the least severe level in any EQ-5D dimension led to a bimodal distribution of EQ-5D index values, which is related to the overprediction of utility values for patients in poor health. Separate algorithms are here proposed to predict utility values for patients in poor health, where these are selected using cut-off points for HAQ-DI (> 2.0) and QLQ-C30 (< 45 average of QLQ-C30 functioning scales). The QLQ-C30 separate algorithm performed better than existing mapping algorithms for predicting utility values for patients in poor health, but still did not accurately predict mean utility values. A HAQ separate algorithm could not be estimated due to data restrictions.
Conclusion Mapping algorithms overpredict utility values for patients in poor health but are used in cost-effectiveness analyses nonetheless. Guidelines can be developed on when the use of a mapping algorithm is inappropriate, for instance through the identification of cut-off points. Cut-off points on a disease-specific questionnaire can be identified through association with the causes of overprediction. The cut-off points found in this study represent severely impaired health. Specifying a separate mapping algorithm to predict utility values for individuals in poor health greatly reduces overprediction, but does not fully solve the problem.


Background
In recent years there has been an increasing number of publications concerned with 'mapping' condition-specific measures onto EQ-5D to estimate EQ-5D utility values. Mapped EQ-5D utility values are accepted as evidence in cost-utility analyses by reimbursement agencies such as the National Institute for Health and Clinical Excellence (NICE) [1] (see § 5.4.6) but suffer from non-trivial problems such as the overprediction of utility values for patients in poor health. A mapping algorithm can be estimated by regressing a preference-based measure on a non-preference-based measure in a dataset external to the study dataset [2]. The resulting mapping equation is used to estimate the utility values of the preference-based measure in the study dataset, where such a measure is absent. Criteria for the quality of a mapping algorithm do not currently exist, although it is well known that utilities estimated by mapping algorithms typically have larger errors for lower utility values [2] and that mapped EQ-5D utilities show a systematic overprediction of utility values for patients in poor health [3]. For instance, a study mapping SF-12 onto EQ-5D reports predicted values under 0.5 to be notably higher than observed values, for both 2nd and 4th order models [4]. Another study, mapping the modified Rankin scale, which assesses disability after stroke, onto EQ-5D, reports decreased accuracy for patients in poor health and significant overprediction of low values [5]. While such overprediction is unlikely to be a problem in all samples, given that many studies have reasonably high mean EQ-5D values [6], it is likely to occur in patient (sub)samples containing a significant proportion of individuals in poor health. The current study explores whether the causes of overprediction of utility values for patients in poor health found in the literature can inform a method to minimize that overprediction.
The proposed solution involves the use of a different algorithm for patients in poor health, where health status is determined using available information from a condition-specific non-preference-based measure.
There are several causes for the overprediction of low utility values. First, the non-preference-based measure may have different severity content than the preference-based measure. For instance, the most severe range of scores on the Health Assessment Questionnaire Disability Index (HAQ-DI), between 2.5 and 3.0, is not necessarily associated with the lowest value of -0.59 on the EQ-5D, but with a value near 0.1 [7], as the HAQ measures different dimensions of health [8]. Adding covariates to the mapping functions, such as clinical variables or dimension scores of other questionnaires, may overcome this problem, but this limits the use of the function to datasets that hold all those variables.
Second, in many clinical studies, health states are not normally distributed: most patients typically experience mild to moderate health problems and few experience severe problems [6,8,9]. Progression from moderate to poor states, for instance moving from 'some problems with washing or dressing myself' to 'unable to wash or dress myself', results in a steep drop in utilities. This 'drop' may not be adequately predicted in a linear model which is powered on the large group of patients which reports mild to moderate health problems. This has led to the suggestion that using Ordinary Least Squares regression on the entire sample, which is more accurate for mean values than for extremes, may contribute to the problem of overprediction [2]. Specifying other models may lead to better predictions, but will rarely overcome overprediction.
Alternatively, one option is to specify a separate mapping function for patients in poor health whose utility values are overpredicted. Such an approach would require a method to identify the 'poor health' population. A study, mapping SF-36 onto EQ-5 D, reported overprediction of utility values for poorer health states (EQ-5 D index values < 0.5) for existing algorithms from the literature and algorithms estimated in the study [3]. The study hypothesized that this may be observed because more severe health states (utility value <0.5) have at least one of five EQ-5 D health dimensions at the most severe level causing the aforementioned steep decline in utility values. Further support for this hypothesis is that in many patient populations a 'bimodal distribution' of EQ-5 D utility values is observed. Bimodal distribution refers to the observation of high (> 0.5) mean utility values for EQ-5 D states with no dimensions at the most severe level and low (< 0.5) mean utility values for EQ-5 D states with one or more dimensions at the most severe level. This bimodal distribution has a 'gap' in the distribution of EQ-5 D utility values around the .5 value [9]. This observation is limited to EQ-5 D, as prediction errors are also increased for patients in poor health when mapping to SF-6 D [10], but no systematic overprediction is present.
This suggests that the alternative mapping function ought to be estimated on the lower part of the bimodal distribution of EQ-5D values. However, as the EQ-5D is absent by definition when a mapping algorithm is applied, it is difficult to assess which predicted values are overpredicted. It is plausible that values can be identified on the condition-specific instrument that are associated with the lower part of the EQ-5D utility distribution, which represents 'poor health'. To this purpose, mapping algorithms and datasets for two condition-specific measures, the arthritis-specific Health Assessment Questionnaire (HAQ) and the cancer-specific EORTC Quality of Life Questionnaire C-30 (version 2), are investigated. When available mapping algorithms systematically overpredict utility values for patients in poor health, it is explored whether it is possible to identify the 'poor health' population by the health status reported on the condition-specific measure. If so, we estimate a separate mapping algorithm for use in patients in poor health.

Method
Existing and new mapping algorithms will be applied to one sample of patients with arthritis [11] (arthritis sample) and two samples of patients with cancer: patients with multiple myeloma (MM sample) and patients with non-Hodgkin's lymphoma (NH sample) [12,13]. A short description of the population characteristics of the samples (pooled data for 8 follow-up time points of QLQ-C30, baseline of HAQ) on which the algorithms are run is presented in Table 1. All work presented in this paper is thus performed using these datasets, which limits generalizability to different types of cancer.

Instruments
The EuroQol EQ-5 D is a generic preference-based measure of health related quality of life. It classifies health states on five dimensions (mobility; self-care; usual activities; pain/discomfort and anxiety/depression)  (16,9) with three severity levels each: level one represents no problems; level two represents some problems; and level three represents extreme problems. The classification system defines 243 unique health states which are given a utility score using an existing tariff. The EQ-5 D tariff represents the preferences of the general public as elicited using time trade-off, and differs per country. Here the UK tariff [14] and Dutch tariff [15] are used. The EORTC QLQ-C30 (version two) is a cancer specific questionnaire which consists of 30 items across 6 functioning scales (physical, role, cognitive, emotional, social, global quality of life) and 9 symptom scales (fatigue, nausea and vomiting, pain, dyspnoea, sleep disturbance, appetite loss, constipation, diarrhoea, financial impact). High scores on the functioning and global health status scales reflect good quality of life, while high scores on the symptom scales represent a high level of symptoms [16].
The Health Assessment Questionnaire (HAQ) was first developed for use in patients in rheumatology. The most widely used version of the HAQ assesses the functional ability of patients using 20 items across eight domains (dressing, arising, eating, walking, hygiene, reach, grip and usual activities) [17]. Questions are scored on a four level disability scale from zero to three, where three represents the highest degree of disability. Scores for the eight domains are adjusted for the use of aids or devices and averaged into an overall disability index value, HAQ-DI (Health Assessment Questionnaire Disability Index), with a range from zero to three and adjacent steps of 0.125 (e.g. 0, 0.125, 0.250), which represents the extent of functional ability of the patient. A value between one and two represents moderate to severe disability [18].

Algorithms
Algorithms are taken from the literature and predict EQ-5 D index values from either the QLQ-C30 (version 2) or the HAQ. All algorithms have been tested on another dataset with the exception of one HAQ model that was developed for this article, from now on referred to as a test model.
The original articles in which the algorithms were presented labelled them as suitable for estimating utility values [8,19,20]. Details of the algorithms are presented in Table 2. All models were developed using ordinary least squares regression. The HAQ algorithm developed and tested by Bansback et al. [19] was estimated on patient samples from Canada (N = 319) and the United Kingdom (N = 151) who were clinically diagnosed with rheumatoid arthritis (RA). The algorithm computes EQ-5 D utility values based on the UK tariff. We estimated one additional HAQ algorithm, the test model, for this article based on a larger group of patients than was used for the published algorithm, as this sample holds more patients in severe conditions [8]. The test model was developed using the Rotterdam Early Arthritis Cohort with 493 patients with and without clinically diagnosed RA recruited from the Erasmus Medical Centre in the Netherlands. It is not recommended for use as not all patients are clinically diagnosed with RA. A tested HAQ model that predicts Dutch utilities is presented elsewhere [8]. The QLQ-C30 algorithm by McKenzie & Van der Pol [20] was developed on a sample of 199 patients with inoperable esophageal cancer. The algorithm computes EQ-5 D utility values based on the UK tariff. The QLQ-C30 algorithm by Versteegh et al. [8] was developed and tested on pooled data from two clinical trials for patients with multiple myeloma (pooled N = 723) and patients with aggressive non-Hodgkin's lymphoma (pooled N = 789). It computes EQ-5 D utility values based on the Dutch tariff.
All models used in this study were thus taken from other studies. Despite their use to investigate our methodological point, generalizability of mapping functions between different types of cancer or arthritis is an empirical matter that still needs thorough investigation.

Analysis
First we determine if the mapping algorithms estimated on a relatively healthy patient sample overestimate utility values of patients in poor health. As the EQ-5 D is absent by definition, we need to specify a threshold value on the condition specific measure for which we would expect a regular mapping algorithm to overpredict utility values to be able to anticipate whether a mapping algorithm is expected to be inaccurate in a certain population. Then we develop a mapping algorithm for that population. Six steps are described below, aimed at systematically exploring the topic.
Step one. Each published algorithm used here was found in its original article to be successful at predicting mean EQ-5D values. The same diagnostics have also been applied to the test model and indicate that this model is successful at predicting mean EQ-5D values. However, a successful prediction of a mean EQ-5D utility value in a sample with a relatively high mean value does not guarantee a successful prediction in a sample with a much lower mean EQ-5D value. Therefore the predicted values are compared to the observed values over the range of observed EQ-5D values.
Step two. It has been suggested that reporting a level '3' answer on EQ-5 D and the large utility decrement associated with it in the EQ-5 D country tariff is a cause of overprediction [3]. Using the UK tariff [14] an EQ-5 D utility value of .52 is the lowest obtainable value without a level 3 answer (state 22222), and 0.56 is the highest obtainable value with a level 3 answer (state 11311), which is respectively 0.57 and 0.64 for the Dutch tariff. These values will be used to interpret the distribution of utility values in the three samples.
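The threshold values quoted in step two can be reproduced with a short calculation. The sketch below is illustrative only: it assumes the coefficients of the UK time trade-off tariff [14] as they are commonly reported (a constant, level 2 and level 3 decrements per dimension, and the N3 term), and these should be checked against the original tariff publication before any real use. The function and variable names are ours.

```python
# Assumed coefficients of the UK EQ-5D tariff (N3 model), as commonly reported.
# Each dimension maps to (level 2 decrement, level 3 decrement).
UK_DECREMENTS = {
    "mobility":           (0.069, 0.314),
    "self_care":          (0.104, 0.214),
    "usual_activities":   (0.036, 0.094),
    "pain_discomfort":    (0.123, 0.386),
    "anxiety_depression": (0.071, 0.236),
}
UK_CONSTANT = 0.081  # subtracted once if any dimension is above level 1
UK_N3 = 0.269        # subtracted once if any dimension is at level 3
DIMENSIONS = list(UK_DECREMENTS)  # dict order matches the EQ-5D state digits


def uk_utility(state: str) -> float:
    """Return the tariff value for a five-digit EQ-5D state such as '11311'."""
    levels = [int(ch) for ch in state]
    if len(levels) != 5 or any(lv not in (1, 2, 3) for lv in levels):
        raise ValueError(f"not a valid EQ-5D state: {state!r}")
    if all(lv == 1 for lv in levels):
        return 1.0  # full health carries no decrement at all
    value = 1.0 - UK_CONSTANT
    for dim, lv in zip(DIMENSIONS, levels):
        if lv > 1:
            value -= UK_DECREMENTS[dim][lv - 2]
    if 3 in levels:
        value -= UK_N3
    return round(value, 3)


for s in ("22222", "11311", "33333"):
    print(s, uk_utility(s))
```

Under these assumed coefficients the states 22222, 11311 and 33333 evaluate to 0.516, 0.556 and -0.594, which round to the 0.52, 0.56 and -0.59 used in the text.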
Step three. The frequently observed bimodal distribution of utility values in patient samples has been associated with the 'N3 term' [9], and the bimodal pattern has been presented by others as a specific feature of the EQ-5D [21]. The N3 term is a model feature of the UK and Dutch EQ-5D country tariffs and adds an extra utility decrement if any dimension on the EQ-5D scores a '3', representing extreme problems. However, it is hypothesized here that the 'N3' term in itself does not cause a bimodal distribution. To test this, a random set of EQ-5D cases is generated (N = 300) with an equal distribution of answer categories across the 5 domains.
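The random-sample experiment of step three could be set up roughly as follows. This is a sketch under stated assumptions, not the procedure actually used in the study: the seed, the bin width of the text histogram and all names are ours, and the tariff coefficients are the commonly reported UK values, which should be verified against [14].

```python
import random
from collections import Counter

# Assumed UK tariff: (level 2, level 3) decrements in the order mobility,
# self-care, usual activities, pain/discomfort, anxiety/depression.
DECREMENTS = [(0.069, 0.314), (0.104, 0.214), (0.036, 0.094),
              (0.123, 0.386), (0.071, 0.236)]
CONSTANT, N3 = 0.081, 0.269


def uk_utility(levels):
    """Tariff value for a tuple of five levels, each 1, 2 or 3."""
    if all(lv == 1 for lv in levels):
        return 1.0
    value = 1.0 - CONSTANT
    for (d2, d3), lv in zip(DECREMENTS, levels):
        value -= (0.0, d2, d3)[lv - 1]
    if 3 in levels:
        value -= N3  # extra decrement for any 'extreme problems' answer
    return round(value, 3)


def simulate_states(n=300, seed=0):
    """Draw n states with each level equally likely on every dimension."""
    rng = random.Random(seed)
    return [tuple(rng.choice((1, 2, 3)) for _ in range(5)) for _ in range(n)]


states = simulate_states()
utilities = [uk_utility(s) for s in states]

# Crude text histogram (bins of width 0.2) to eyeball the shape.
bins = Counter(int((u + 0.6) // 0.2) for u in utilities)
for b in sorted(bins):
    low = -0.6 + 0.2 * b
    print(f"{low:5.1f} to {low + 0.2:4.1f}: {'#' * bins[b]}")
print("unique states:", len(set(states)))
```

Because every answer level is equally likely, most simulated states contain at least one level 3 answer, so the N3 term is applied to the bulk of the sample rather than to a small subgroup. The counts produced by this sketch will not match the figures reported for the study's own random sample.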
Step four. Step one and two investigate whether the utility values of patients who report 'extreme problems' on at least one of the EQ-5D dimensions are overpredicted. The next step is to investigate which QLQ-C30 and HAQ value is associated with level '3' answers on the EQ-5D. The use of this exercise is to identify scores on the condition specific measure that are related to a possible cause of overprediction in mapped utility values: at those scores standard mapping algorithms might be inaccurate. As the QLQ-C30 provides no overall score, the functioning scale scores are used, since these have the highest correlation with EQ-5D scores [22]. For the HAQ, the HAQ-DI value (which combines all items) is used.
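The association explored in step four amounts to asking, for a candidate cut-off on the condition-specific measure, what share of patients below it reports at least one level 3 answer. A minimal sketch, using made-up records and hypothetical function names, might look like this:

```python
def has_level3(state: str) -> bool:
    """True if any EQ-5D dimension is at level 3 ('extreme problems')."""
    return "3" in state


def share_level3_below(records, cutoff):
    """Share of patients with a level 3 answer among those whose functioning
    average is below `cutoff`; None if nobody falls below the cut-off."""
    selected = [state for score, state in records if score < cutoff]
    if not selected:
        return None
    return sum(has_level3(s) for s in selected) / len(selected)


# Made-up records for illustration only:
# (mean of the QLQ-C30 functioning scales, observed EQ-5D state).
records = [
    (30.0, "22323"), (40.0, "23322"), (42.0, "22222"), (44.0, "21321"),
    (50.0, "22221"), (52.0, "22322"), (60.0, "11221"), (70.0, "21221"),
    (85.0, "11121"), (95.0, "11111"),
]

for cutoff in (45, 55, 100):
    print(cutoff, share_level3_below(records, cutoff))
```

For the HAQ-DI the comparison would be reversed (scores above the cut-off), since higher HAQ-DI values indicate more disability.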
Step five. The next step is exploring the performance of a separate algorithm for use on patients in poor health. An alternative algorithm will be developed on a sample in poor health, in this case on a within sample selection of patients which are in poor health as determined by the cut-off point identified in step 4. The utility value of the EQ-5 D, using the UK tariff will be regressed on the disease specific questionnaires. In the cancer population the algorithm will be developed on the multiple myeloma sample and tested on the non-Hodgkin's sample. A variety of model specifications are estimated using OLS. All algorithms are applied at the individual level. Mean utility values are used to compare predicted and observed values.
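The mechanics of step five (select the poor-health subsample with the cut-off, fit OLS within it, and compare errors on a second sample) can be sketched as below. This is a deliberate simplification: it uses a single continuous predictor (the functioning average) and entirely made-up data, whereas the models in this study use questionnaire items as predictors. All names and numbers are illustrative assumptions.

```python
def fit_simple_ols(x, y):
    """Closed-form OLS for y = a + b * x; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def rmse(observed, predicted):
    """Root mean squared error between two equally long sequences."""
    n = len(observed)
    return (sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n) ** 0.5


CUTOFF = 45  # functioning average below this counts as 'poor health'

# Made-up development sample: (functioning average, observed utility).
development = [(20, -0.10), (25, 0.05), (30, 0.10), (35, 0.20), (40, 0.30),
               (60, 0.62), (75, 0.76), (90, 0.88)]
# Made-up validation sample, poor-health patients only.
validation = [(22, -0.02), (33, 0.15), (41, 0.33)]

# Whole-sample model versus a model fitted on the poor-health subsample.
a_all, b_all = fit_simple_ols(*zip(*development))
poor = [(s, u) for s, u in development if s < CUTOFF]
a_low, b_low = fit_simple_ols(*zip(*poor))

scores = [s for s, _ in validation]
observed = [u for _, u in validation]
rmse_all = rmse(observed, [a_all + b_all * s for s in scores])
rmse_low = rmse(observed, [a_low + b_low * s for s in scores])
print(f"whole-sample model RMSE: {rmse_all:.3f}")
print(f"poor-health model RMSE:  {rmse_low:.3f}")
```

With so few observations below the cut-off, the subsample model has few degrees of freedom, which is the same constraint that limits the models estimated here.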
Step six. Typically mapping algorithms are used to predict the mean utility value of a population that is in moderate to good health. In step 5 we specify a separate algorithm for patients in poor health which may reduce overprediction of utility values for patients in poor health. If only a part of the patient population is in poor health, a second algorithm is needed to be able to estimate the mean utility value of the entire sample. Thus computing utilities with the 'low utility' algorithm and a separate algorithm for patients in relatively good health may reduce prediction errors for a 'typical' sample where the majority of respondents are in moderate to good health. Such an approach would require two algorithms: one for the part of the population which is in poor health, as determined by a score under a cut-off point on the condition specific measure, and one for the population in better health, determined by a score higher than the cut-off point specified under step 4, as sketched in Figure 1. The 'low utility' algorithm estimated in step five will be complemented by a 'high' utility algorithm and tested on the non-Hodgkin's sample.
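The two-algorithm approach of step six can be expressed as a simple routing rule on the condition-specific score. In the sketch below the two linear models and their coefficients are placeholders invented for illustration; they are not the estimated models. How to treat a score exactly at the cut-off is also our assumption (it is routed to the 'high utility' model here).

```python
CUTOFF = 45.0  # mean of the QLQ-C30 functioning scales, as in step four


def make_two_part_predictor(low_model, high_model, cutoff=CUTOFF):
    """Return a function that applies low_model below the cut-off and
    high_model at or above it."""
    def predict(score: float) -> float:
        model = low_model if score < cutoff else high_model
        return model(score)
    return predict


# Placeholder models with made-up coefficients, for illustration only.
def low_model(score: float) -> float:
    return -0.46 + 0.019 * score


def high_model(score: float) -> float:
    return 0.30 + 0.007 * score


predict = make_two_part_predictor(low_model, high_model)

# Predict at the individual level, then summarize with the sample mean.
scores = [30.0, 44.9, 45.0, 80.0]
predicted = [predict(s) for s in scores]
mean_predicted = sum(predicted) / len(predicted)
print(predicted, mean_predicted)
```

Predictions are made per individual and only then averaged, in line with the individual-level application of the algorithms described above.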

Results
All mapping algorithms applied here suffer from overprediction at the lower end of the scale, where predicted values are higher than observed values for observed  EQ-5 D utility values below ≈.5. Figure 2 and 3 compare predicted and observed EQ-5 D utility values, and are representative for the other mapping algorithms investigated in this study.
Step one. Figure 2 and 3 indicate that overprediction begins to occur around EQ-5D utility value ≈.5. As mentioned in the method section, the utility value of ≈.5 is related to scoring 'extreme problems' on any EQ-5D dimension. Patients who have one or more dimensions at level 3 have a maximum observed EQ-5D UK tariff score of 0.56 in the MM and NH samples and of 0.43 in the arthritis sample. Patients who have no dimensions at level '3' have a minimum observed EQ-5D UK tariff score of 0.52 in all samples (state 22222). A utility value lower than 0.52 guarantees the presence of at least one level 3 answer in the UK tariff. Scores of 0.52 or higher but below 0.57 do not guarantee the absence of a level 3 answer. Interestingly, Figure 3 shows overprediction to occur at a slightly higher value, but not at the expected 11311 score with utility value 0.64. Upon inspection, the highest observed Dutch utility value for a state with a '3' is 0.55, for state 11321; thus the graph shows overprediction to start at that state.
Step two. Minimum and maximum EQ-5 D scores of patients with or without at least one dimension at level 3 on the EQ-5 D inform our interpretation of Figure 4 and 5, which indicate bimodal distributions for MM and NH samples. A patient with a 'level 3' answer on EQ-5 D belongs to the left side 'poor health' distribution with a lower mean and less frequent observations than a patient without a 'level 3' answer. The area around a utility value of .5 can fall under either distribution, as indicated by the overlap in minimum and maximum values for cases with and without level 3 answers mentioned in step one.
Step three. Figure 6 shows the distribution of utility values for the randomly generated sample. The utility values have a normal distribution, suggesting that the bimodal distribution is not solely caused by the 'N3' term. The random sample (N = 300) had 163 unique health states. The 34 most frequent health states account for 36% of the observations, which is in stark contrast to the other samples. The NH sample (pooled N = 783) had 78 unique health states of which six states accounted for 53.5% of all observations. The MM
Figure 1 Hypothetical use of two separate algorithms.
Step four. Mapping algorithms overpredict utility values under 0.5, which belong to patients with 'extreme problems' on at least one of the five EQ-5D dimensions. This means that mapped utility values are inaccurate for those patients with scores on the condition-specific measure that are associated with an EQ-5D utility value below 0.5. However, scores on the HAQ and QLQ-C30 do not provide a straightforward indication of the accuracy of a mapping algorithm. For example, a patient average on the QLQ-C30 functioning scales of 70 could correspond to an EQ-5D utility value as low as .21 or as high as 1. However, Figure 7 shows that at least half of the patients with an average value of the QLQ-C30 functioning scale lower than 55 have level 3 answers on the EQ-5D. Although it is a somewhat arbitrary cut-off point, an average of 45 on the functioning scales is a clear indication of the expected overprediction of a mapping algorithm, for at that value approximately 86% of patients in these samples have at least one level 3 response.

McKenzie prediction overvalues states under .5 in NH sample
The HAQ-DI values faced similar problems: a HAQ-DI value of 1.5 (moderate to severe disability) can be associated with an EQ-5D utility value as low as .21 to .3 or as high as .71 to .8. Figure 8 does indicate that at HAQ-DI values > 1.6, over 50% of patients have at least one level 3 response on the EQ-5D. A HAQ-DI > 2.0 is a clear indication of the expected overprediction of a regular mapping algorithm, for at that value approximately 72% of patients in this sample have at least one level 3 response.
Step five. The within-sample population of cases in poor health (QLQ-C30 < 45, HAQ-DI > 2.0) was relatively small (N = 18 Arthritis sample, N = 25 at t = 0 NH-sample, N = 40 at t = 0 MM-sample). Within those subsamples, EQ-5D was regressed on QLQ-C30 and HAQ using a variety of regression model specifications. The mapping model was developed on the MM-sample and tested on the NH-sample. The QLQ mapping algorithm contained 5 items after backward selection, and included items as categorical variables. The mapping algorithm was applied to the NH sample for patients with a QLQ average on the functioning scales < 45. For this selection of the sample, the utility model for patients in poor health outperforms the standard mapping algorithm from the literature (Table 3) and reduces root mean square error by .06 in the first 4 timepoints. As can be seen from the maximum score, 1 individual did not seem to have filled in the EQ-5D correctly and had a utility value of 1 (but a low score of 25 on the EQ-5D visual analogue scale). A similar pattern was observed for the last four timepoints, but was not deemed trustworthy due to small sample size (N < 8 for the last 4 timepoints of the QLQ-C30 follow-up data). The predicted values showed less prediction error than the standard mapping algorithms, but still did not accurately predict mean utility values in this selection of the sample, with a root mean squared error of 0.18.
For the REACH study, only a development dataset was available but for both cut-off points (HAQ-DI > 1.6 and HAQ-DI >2.0) the regression model was underpowered with no significant predictor variables due to the small sample size and low correlations between HAQ sum scores and EQ-5 D utilities.
Step six was performed with QLQ-C30 models only.
Step six. The 'high' and 'low' utility algorithms, predicting UK EQ-5 D utilities are presented in Table 4. The low utility model in step five was supplemented with a high utility model developed on patients with an average sum score on the functioning scales of the QLQ-C30 >45. Application of the algorithm in the non-Hodgkin's sample was similar to the development: patients who were in poor health got assigned the utility value as predicted from the 'low utility' model and the rest got assigned the utility value as predicted from the 'high utility' model. The combined variable of predicted values had a lower root mean square error (0.02 lower on average) and a larger range of predicted values than the other QLQ-C30 models discussed in this paper. This suggests a modest improvement and indeed led to a slightly better estimate of the mean utility values (Table 5). Due to data restrictions like few observations of poor health states and the model specifications (items treated as categorical variables) the uncertainty around the parameter estimates of the low utility model was almost three times higher than the uncertainty around the parameter estimates of the high utility model.

Discussion
This paper explored causes of the overprediction of EQ-5D utility values for patients in poor health when mapping from a non-preference-based measure, and investigated a possible solution to the problem. We examined the association between the cause of the overestimation and values on the condition-specific questionnaire at which overprediction occurs. Our findings suggest that the main cause of overestimation is a combination of the large decrement in utility values in the UK and Dutch EQ-5D tariffs for having one or more dimensions at level '3', along with few observed responses at level '3'. We argue that this, alongside the large number of EQ-5D responses at the least severe level, leads to a bimodal distribution of the utility data. A result is that most linear prediction models cannot adequately describe low utility values. We found that the values on the condition-specific questionnaire can help inform decisions about the expected errors and hence accuracy of standard mapping algorithms, and that the use of a separate mapping algorithm specified for patients in poor health reduces the amount of overprediction for these patients. Combining such a function with a 'high utility' algorithm leads to a modest improvement of predictions.
Our findings, in accordance with the literature, suggest that the ≈.5 value of the EQ-5D UK tariff is the point at which mapping algorithms start to overpredict utility values. The reason is that values under ≈.5 belong to patients who have extreme problems on at least one dimension of EQ-5D. As the purpose of mapping algorithms is to predict EQ-5D values when EQ-5D was not included in the trial, such a value is not informative for the application of mapping algorithms. Here we explored the use of condition-specific measures (that we are mapping from) to indicate the expected accuracy of a standard mapping algorithm. An alternative mapping algorithm can then be developed for use in patients in poor health. We found that the ≈.5 utility value itself is not a very useful measure of association with QLQ-C30 or HAQ-DI values, since there is no one-to-one relationship between the measures: a large range of QLQ-C30 and HAQ scores is associated with the ≈.5 EQ-5D value. Since scoring a '3' on the descriptive system of EQ-5D is related to the problem of overprediction, we took an alternative approach using the scores on the condition-specific measure that correspond to having at least one level '3' response. Below a QLQ-C30 functioning scale average of 55, about half of the patients score level 3 answers on the EQ-5D, as do patients with HAQ-DI > 1.6. At these scores, standard mapping algorithms are likely to overpredict utility values. More conservative and somewhat arbitrary cut-off values we determined are > 2.0 for HAQ-DI and < 45 for the average of the QLQ-C30 functional scales. These cut-off points represent very severe health problems: 45 for the QLQ-C30 is associated with severe cases like postradiotherapy patients with metastatic and/or cardiorespiratory disease [23]; a HAQ-DI value above 2.0 represents severe to very severe RA [18]. At these more conservative values, a standard mapping algorithm is likely to be inaccurate.
A separate utility mapping algorithm estimated on a sample with poor health status is far better at predicting utility values for patients in poor health, when it is possible to estimate such a function. However, using categorical variables introduced problems with perfect collinearity in the low utility model, and the HAQ sample did not allow the estimation of a low utility model due to poorer correlation with EQ-5D and a smaller sample size than for the QLQ-C30. A model based on sum scores did not suffer from these restrictions but introduced larger prediction errors. The result is a model for low utilities that uses only 5 items of the QLQ-C30 as predictor variables. Items 3 (trouble taking a short walk), 4 (need to stay in bed or a chair), 5 (need help with eating, dressing, washing or using the toilet), 9 (pain) and 21 (feeling tense) together represent physical functioning, emotional functioning and pain. Consequently, other quality of life drivers such as role functioning or fatigue are not represented, which may lead to problems when applying the function in other cancer types. Furthermore, the OLS models used in all mapping algorithms reported here are more precise around mean values than for extremes, which also results in underprediction for utility values near 1, most notably when regressing EQ-5D on HAQ. Thus estimating and applying mapping algorithms on datasets with large deviations in health status is likely to be problematic. The extent to which a deviation can be considered 'large' is difficult to assess, since it depends on how a change on the scale of the questionnaire relates to a change in the EQ-5D index values.
Cut-off points like the ones specified in this study can be used to inform whether a regular mapping algorithm from the literature would suffice or whether a 'low utility algorithm' is better at assessing the quality of life for those patients. Cut-off points can indicate whether there are patients in poor health and therefore whether predicted utility values are likely to suffer from overprediction if only a standard mapping algorithm has been used. Cut-off points can therefore inform users and policy makers whether mapped estimates should be treated with great caution. A weakness of the approach may be that there is no clear cut relation between the break point of utility values in the distribution and values on the condition specific measures. Besides, prediction errors might be reduced even more if there were several mapping functions for each 'severity group'. However, the relation between the condition specific measure and the preference-based measure may not be clear cut enough to identify more sub-groups.
Although overprediction proved to be less of a problem for patients in poor health with our combined prediction model, the largest part of the sample is not in very poor health. This explains why predictions of the mean, as presented in Table 5 do not show much improvement compared to the McKenzie model. However, predicted EQ-5 D values do not capture the full range of observed EQ-5 D values due to overprediction.
As a consequence, QALY estimates based on mapped values have 'tighter' confidence intervals, as presented in Table 6 (survival is hypothetical). In probabilistic sensitivity analysis this results in less uncertainty around the estimate of cost per QALY, but that is an incorrect representation of reality.
In addition to the tighter confidence intervals, using mapped utility values may result in an underestimation of the utility gain between time intervals. As the utility values of patients in poor health are systematically overpredicted, individuals who in reality improve from poor health to better health (i.e. from a value <0.5 to a value >0.5) would have an underestimated utility gain when mapped EQ-5D utilities are used.
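A small worked example may make the size of this effect concrete. All utility values below are hypothetical and chosen only for illustration; they are not taken from the datasets in this study. The sketch assumes that only the baseline value (in poor health) is overpredicted, while the follow-up value is predicted accurately.

```python
def utility_gain(baseline: float, follow_up: float) -> float:
    """Change in utility between two measurement points."""
    return follow_up - baseline


# Hypothetical patient: observed EQ-5D utility improves from 0.30 to 0.70.
observed_gain = utility_gain(baseline=0.30, follow_up=0.70)

# Hypothetical mapped values: the poor-health baseline is overpredicted
# (0.45 instead of 0.30); the follow-up value is assumed to be accurate.
mapped_gain = utility_gain(baseline=0.45, follow_up=0.70)

# The shortfall equals the overprediction at baseline.
shortfall = observed_gain - mapped_gain
```

Under these assumptions the mapped gain falls short of the observed gain by exactly the amount of overprediction at baseline.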
A main point of concern in any effort to map onto a preference-based questionnaire is generalizability of the results. As mentioned earlier, it must be stressed that although the cut-off points presented here are empirically supported by our study, they cannot be considered transferable or generalizable to other types of cancer or arthritis samples prior to thorough empirical testing in different datasets.
The issue of generalizability also applies to the presented methodology. This study focussed on mapping onto EQ-5D for patients in poor health. The methodology proposed here only applies to mapping onto EQ-5D using the UK or the Dutch country tariffs. We observed that individuals who report 'extreme problems' on one of the five EQ-5D dimensions receive overestimated utility values from published mapping functions. We suggest that this is caused by the large utility decrement applied to scoring 'extreme problems' in the UK and Dutch EQ-5D country tariffs, combined with the small number of observations of 'extreme problems'. However, other EQ-5D country tariffs may not have large utility decrements for all 'extreme problems' scores. For instance, health state 13111 ('extreme problems' on the self-care dimension of EQ-5D only) has a total utility decrement of 0.564 in the UK tariff and 0.254 in the Japanese tariff. These differences in preferences between populations may influence the methodology used to identify the part of the population that is in poor health and in which increased prediction errors are observed. However, if those patients can be identified, specifying a separate mapping function for that part of the population is still a suggested option to reduce prediction error.
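The decrements quoted for state 13111 can be reproduced with a short sketch. The coefficients below are our recollection of the published UK (MVH) and Japanese EQ-5D-3L value sets and should be verified against the original tariffs before any real use; the data layout and function name are hypothetical. The UK tariff adds an extra term (often called N3) whenever any dimension is at level 3, whereas the Japanese tariff has no such term.

```python
DIMENSIONS = ("mobility", "self_care", "usual_activities",
              "pain_discomfort", "anxiety_depression")

# Coefficients recalled from the published tariffs (assumption: check the
# original value sets). Each tuple holds the (level 2, level 3) decrement.
UK_TARIFF = {
    "constant": 0.081,  # applied to any state other than 11111
    "n3": 0.269,        # applied once if any dimension is at level 3
    "decrements": {
        "mobility": (0.069, 0.314),
        "self_care": (0.104, 0.214),
        "usual_activities": (0.036, 0.094),
        "pain_discomfort": (0.123, 0.386),
        "anxiety_depression": (0.071, 0.236),
    },
}
JAPAN_TARIFF = {
    "constant": 0.152,
    "n3": 0.0,          # no separate level-3 term
    "decrements": {
        "mobility": (0.075, 0.418),
        "self_care": (0.054, 0.102),
        "usual_activities": (0.044, 0.133),
        "pain_discomfort": (0.080, 0.194),
        "anxiety_depression": (0.063, 0.112),
    },
}


def total_decrement(state: str, tariff: dict) -> float:
    """Total utility decrement from full health for a five-digit EQ-5D-3L state."""
    levels = [int(ch) for ch in state]
    if len(levels) != 5 or any(level not in (1, 2, 3) for level in levels):
        raise ValueError(f"not a valid EQ-5D-3L state: {state!r}")
    if all(level == 1 for level in levels):
        return 0.0
    total = tariff["constant"]
    for dim, level in zip(DIMENSIONS, levels):
        if level > 1:
            total += tariff["decrements"][dim][level - 2]
    if 3 in levels:
        total += tariff["n3"]
    return total
```

With these coefficients, state 13111 decomposes into 0.081 + 0.214 + 0.269 under the UK tariff and 0.152 + 0.102 under the Japanese tariff, which shows that more than half of the UK decrement is not specific to the self-care dimension.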
We also investigated the option of combining a low and a high utility model, to see whether the improvement found for low utility values would contribute to a better estimate of mean EQ-5D utility values in a sample where only part of the patients are in poor health. The combined model led to a modest improvement in root mean square error and in the range of the predicted values. The range of the values is important, as a wider range allows more statistical sensitivity. Further research is needed to determine whether specifying two functions and combining them is to be favoured over other approaches. For instance, the problem mentioned above of the limited number of items available due to collinearity may be solved by using a larger dataset which provides more accurate predictions for summed scores. The approach could also be undertaken using regression techniques such as the probit model or a two-part model; this is an area for future research. An obvious attempt would be to raise variables to a power to allow non-linearity, but a recent study still reported overprediction below a utility value of around 0.6 for a model with significant second-order predictors [24]. Alternatively, stepped linear regression with a specified break-point may allow the utility function to 'curve' according to observed values, but specifying such a break-point is not clear-cut, as is shown in this study.
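The idea of combining a low and a high utility model can be sketched as follows. This is a minimal illustration on synthetic data with a single sum-score predictor, not a reproduction of the models estimated in this study: the data-generating coefficients, the noise level and all names are assumptions, and only the cut-off of 45 is taken from the paper.

```python
import math
import random

CUTOFF = 45.0  # QLQ-C30 functioning cut-off from this paper
rng = random.Random(0)


def simulate_utility(score: float) -> float:
    """Hypothetical data-generating process with a steeper slope in poor health."""
    base = -0.40 + 0.018 * score if score < CUTOFF else 0.50 + 0.004 * score
    return min(1.0, base + rng.gauss(0, 0.08))


def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def rmse(y, yhat):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))


scores = [rng.uniform(0, 100) for _ in range(500)]
utilities = [simulate_utility(s) for s in scores]
low = [(s, u) for s, u in zip(scores, utilities) if s < CUTOFF]
high = [(s, u) for s, u in zip(scores, utilities) if s >= CUTOFF]

# One model on the full sample versus separate low and high utility models.
a_all, b_all = fit_line(scores, utilities)
a_low, b_low = fit_line(*zip(*low))
a_high, b_high = fit_line(*zip(*high))


def predict_single(s):
    return a_all + b_all * s


def predict_combined(s):
    # Route each patient to the model estimated for their side of the cut-off.
    return a_low + b_low * s if s < CUTOFF else a_high + b_high * s


low_s, low_u = zip(*low)
rmse_single_low = rmse(low_u, [predict_single(s) for s in low_s])
rmse_combined_low = rmse(low_u, [predict_combined(s) for s in low_s])
rmse_single_all = rmse(utilities, [predict_single(s) for s in scores])
rmse_combined_all = rmse(utilities, [predict_combined(s) for s in scores])
```

Within the estimation sample the combined model cannot have a larger squared error than the single model in either subgroup, because each sub-model minimises that error over its own subgroup. Whether the gain carries over to new data is a separate question that this sketch does not address.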

Conclusion
As the use of mapping in cost-effectiveness analyses of medical interventions is becoming more frequent, guidelines on the appropriateness of using mapping and specific mapping algorithms are needed. We investigated the often observed problem of overprediction in mapping and analysed the use of cut-off scores for the condition specific measures QLQ-C30 and HAQ-DI to indicate when the use of a separate mapping algorithm for patients in poor health is the favoured approach. Overprediction of utility values for patients in poor health can be greatly reduced by predicting the utility values of these patients using a separate mapping algorithm specified and estimated specifically for these patients, when deemed necessary.