A comparison of EQ-5D index scores using the UK, US, and Japan preference weights in a Thai sample with type 2 diabetes

Background Data are scarce on the comparison of EQ-5D index scores using the UK, US, and Japan preference weights in other populations. This study aimed to examine the differences and agreement among these three weights, their psychometric properties (test-retest reliability, convergent validity, and known-groups validity), and the impact of differences in the EQ-5D scores on the outcome of cost-utility analysis in Thai people.
Methods A convenience sample of 303 type 2 diabetic outpatients (18 years or older) from a cross-sectional study was examined. ANOVA and post hoc Bonferroni tests were used to determine the differences among the three EQ-5D scores. Agreement among the EQ-5D scores was assessed using intraclass correlation coefficients (ICCs) and Bland-Altman plots. ICCs were also used to examine test-retest reliability. Spearman's rho correlation coefficients were used to assess convergent validity between the EQ-5D scores and sociodemographic and clinical data and health status. Mann-Whitney U tests were used to test the differences in EQ-5D scores between known groups defined by HbA1c level (cut point of 7%) and by the presence of diabetic complications, namely neuropathy, retinopathy, nephropathy, and cardiovascular diseases. Seven hypothetical decision trees were created to evaluate the impact of differences in the EQ-5D scores on the incremental cost-utility ratio (ICUR).
Results The US weights yielded higher scores than the UK and the Japan weights (p < 0.001 for both), while the UK and the Japan weighted scores did not differ (p > 0.05). The UK and US scores agreed more closely with each other than with the Japan scores. Regarding psychometric properties, the Japan scheme provided better test-retest reliability, convergent validity, and known-groups validity than both the UK and US schemes. The variation in EQ-5D scores estimated from the UK, US, and Japan preference weights had a marginal impact on ICUR (range: 1.23–6.32%).
Conclusion Since the Japan model showed more favorable psychometric properties than the UK and US models, and the differences in these EQ-5D scores had a small impact on ICUR, we recommend that the Japan scheme be used in Thai people for both clinical and policy purposes. However, more research needs to be done.


Background
The health utility (HU) approach to assessing health-related quality of life (HRQoL) is a commonly used technique for determining preferences for health outcomes in evaluations of public health and healthcare interventions, such as cost-utility analyses (CUA) [1,2]. In CUA, each health state is assigned a utility score on a cardinal scale, where dead = 0 and perfect health = 1, to indicate preferences for different outcomes. The utility score is incorporated into quality-adjusted life-years (QALYs), which combine, in a single index, gains or losses in quantity of life (life expectancy) and quality of life (HU). The Euro-QoL (EQ-5D) is the most frequently used HU instrument for calculating QALYs based on actual measurements of patients' HRQoL [3].
The EQ-5D instrument consists of a five-item descriptive system of health states and a visual analog scale (VAS). Responses to the five items can be converted into a utility index score by using value sets (preference weights) elicited from a general population. The best-known preference weights, and the original ones for estimating EQ-5D index scores, were derived from samples of the United Kingdom (UK) population [4]. The UK-based preference weights are applied to other populations when country-specific weights are not available. However, evidence suggests that valuations of health states can differ between countries because of differences in demographic backgrounds, social-cultural values, and economic systems [5-8]. Thus, it is advisable to use country-specific weights in a given country if available.
Unfortunately, EQ-5D preference weights for Thai people are not yet available. A nationwide valuation of the EQ-5D health states is a complex, time-consuming, and expensive task, so existing preference weights from elsewhere must be applied when none are available in the country. Nevertheless, it is not known which country-specific weighting scheme is appropriate for the Thai population. Besides the UK, a number of other countries have their own population-based preference weights for the EQ-5D [7,9-14]. Of these, the United States (US) weighting scheme is a unique D1 model [13], different from the UK model (N3 model) that was adopted in other countries' models. Studies have also shown that EQ-5D scores derived from the US weights differ from those derived from the UK weights [15-17].
Japan was the first Asian country to develop its own preference weights for the EQ-5D, in 2002 [11]. The Japan model was therefore chosen to represent Asian preference weights. We were interested in how much EQ-5D index scores differ when the UK, US, and Japan preference weights are used. Little was also known about the psychometric properties of these schemes in different cultural contexts and in specific patient samples (all models were developed in general populations). Therefore, we set out to determine the differences and agreement among these three countries' preference-weighted scores (the three countries are also located on three different continents) using a Thai patient sample. Their psychometric properties, including test-retest reliability, convergent validity, and known-groups validity, were also explored. The psychometric properties would provide additional evidence of validity for the use of the EQ-5D index score in Thai settings. Moreover, we examined the impact of differences in the EQ-5D scores on the outcome of CUA using hypothetical scenarios.

Subjects and procedures
The data used in this paper were derived from a cross-sectional study [18]. In that study, a convenience sample of 303 type 2 diabetic outpatients was recruited from the General Police Hospital in Bangkok, Thailand, between January and June 2007. Patients with type 2 diabetes who were waiting to see their physicians were approached to participate. Eligible patients were at least 18 years old and able to understand the Thai language. Patients with health problems or cognitive impairments that prevented them from completing the interview were excluded. The face-to-face interviews included the Morisky Medication Adherence Scale, the Center for Epidemiologic Studies Depression (CES-D) scale, the EQ-5D questionnaire, the VAS, and sociodemographic and clinical data, together with a review of medical records. In addition, about one-fifth of the sample (N = 64) was randomly selected for a one- to two-week test-retest reliability assessment conducted by telephone. This study was approved by the Ethics Committee of the Police Hospital.

EQ-5D: UK, US, and Japan preference weights
The EQ-5D includes a five-item descriptive system, with one item for each of the following health attributes: mobility, self care, usual activity, pain/discomfort, and anxiety/depression. Each attribute has three levels: no problem, some problem, and major problem. A total of 243 possible health states are generated.
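As a quick check on the count above, the health states can be enumerated directly: five attributes with three levels each give 3^5 = 243 combinations. The sketch below is only illustrative; the variable names and the five-digit state codes (for example "11111" for no problems on any attribute) are conventions assumed here, not taken from the study.

```python
from itertools import product

# Attribute names follow the descriptive system described above;
# the identifiers themselves are illustrative.
DIMENSIONS = (
    "mobility",
    "self_care",
    "usual_activity",
    "pain_discomfort",
    "anxiety_depression",
)
LEVELS = (1, 2, 3)  # 1 = no problem, 2 = some problem, 3 = major problem

# Every combination of one level per attribute is one health state.
health_states = list(product(LEVELS, repeat=len(DIMENSIONS)))

# Five-digit codes such as "11111" (assumed notation for illustration).
state_codes = ["".join(str(level) for level in state) for state in health_states]

print(len(health_states))               # 243
print(state_codes[0], state_codes[-1])  # 11111 33333
```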
The UK valuation study followed the Measurement and Valuation of Health (MVH) protocol and surveyed a general adult population in the United Kingdom (England, Scotland, and Wales) [4,19]. The preference values for 42 core health states were elicited using the time trade-off (TTO) method. The valuations of the 42 health states were then fitted with regression models to predict the index scores for all possible EQ-5D health states. The UK model consists of a set of variables representing each EQ-5D health dimension, with two dummy variables representing the levels of each dimension. A dichotomous variable (N3) was also added to the model to indicate whether level 3 (major problem) occurs in at least one dimension.
The US health state valuation study was based on the UK MVH protocol, but the US algorithm replaced the N3 variable with D1, which represents the number of dimensions at level 2 or 3 beyond the first [13].
The Japan valuation study was a quasi-replication of the UK MVH study using a modified protocol in which each respondent was presented with 17 health states instead of 42. The plain main effects model was preferred [11].
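The structural difference between the N3 and D1 terms described above can be made concrete with a small sketch. It computes only the two indicator terms, following the definitions given in this section; the function names are hypothetical, and the published algorithms contain further terms and coefficients that are not reproduced here.

```python
def n3_term(state):
    """N3 as described above: 1 if any dimension is at level 3, else 0."""
    return int(any(level == 3 for level in state))


def d1_term(state):
    """D1 as described above: the number of dimensions at level 2 or 3
    beyond the first such dimension."""
    n_problems = sum(1 for level in state if level > 1)
    return max(0, n_problems - 1)


# States are tuples of levels (1-3) for the five EQ-5D dimensions.
for state in [(1, 1, 1, 1, 1), (1, 1, 2, 1, 1), (2, 1, 2, 3, 1)]:
    print(state, "N3 =", n3_term(state), "D1 =", d1_term(state))
```

For the state (2, 1, 2, 3, 1), three dimensions have problems and one is at level 3, so N3 = 1 and D1 = 2. A state with a single level-2 problem triggers neither term.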

Data analysis
The EQ-5D index scores were calculated using the UK, US, and Japan preference weights. We first determined the differences among the three index scores using ANOVA, followed by post hoc Bonferroni tests. Agreement among the EQ-5D scores using the UK, US, and Japan preference weights was also assessed using intraclass correlation coefficients (ICCs) and Bland-Altman plots [20]. We then examined the psychometric properties of these EQ-5D scores using the following approaches: one- to two-week test-retest reliability, convergent validity, and known-groups validity [21].
To evaluate test-retest reliability, ICCs were employed. For convergent validity, we used Spearman's rho correlation coefficients to assess the associations between the three EQ-5D scores and sociodemographic data, clinical data, and health status, including age, gender, income, duration of diabetes, body mass index (an indicator of obesity), HbA1c level, number of diabetic complications, CES-D scores, and VAS scores.
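For readers unfamiliar with the statistic, Spearman's rho is the Pearson correlation computed on ranks, with tied values given the average of their ranks. The following is a minimal standard-library sketch of that definition, not the SPSS procedure used in the study; the function names and the example values are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the rank-transformed samples."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical values for illustration only (not study data).
eq5d_scores = [0.85, 0.73, 0.62, 0.90, 0.55]
cesd_scores = [5, 12, 20, 3, 25]
print(spearman_rho(eq5d_scores, cesd_scores))
```

In this made-up example the two variables are perfectly inversely ordered, so rho is -1; real data, as in the results reported later, give much weaker correlations.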
Concerning known-groups validity, we examined the ability of the three EQ-5D scores using the UK, US, and Japan preference weights to discriminate between clinical known groups, defined by HbA1c level (below versus equal to or above 7%) and by the presence or absence of diabetic complications, namely neuropathy, retinopathy, nephropathy, and cardiovascular diseases. Mann-Whitney U tests were used to test the differences in EQ-5D index scores between these known groups because the distributions of EQ-5D utility scores had a number of outliers. All analyses were performed using SPSS version 13.0.
To evaluate the impact of the differences in EQ-5D index scores using the UK, US, and Japan preference weights on CUA, seven hypothetical decision trees were created. We compared a new drug (Drug A) with an existing drug (Drug B). The details of each data component of the base-case scenario (decision tree 1) are presented in Table 1. We also created decision trees 2-7, which overestimated the base-case utility scores by the mean and median differences in EQ-5D index scores between three pairs of preference weights: UK versus US, UK versus Japan, and US versus Japan, respectively. We computed the incremental cost-utility ratio (ICUR), which is the ratio of the incremental cost (cost of drug A minus cost of drug B) to the incremental QALYs (QALYs of drug A minus QALYs of drug B).

Bland-Altman Plots
Bland-Altman plots were created to compare the agreement among the three EQ-5D scores (Figures 2A-C). These plots show the differences between scores (Y-axis) against the means of the scores (X-axis). The mean of the differences (d) and the limits of agreement are indicated by dotted lines. The 95% limits of agreement were obtained using the following formula [20]: d ± 1.96*SD of d.
The Bland-Altman plot of UK and US weights showed that 96.4% of the difference scores were between the limits of agreement, 3.3% below the lower agreement line, and 0.3% above the upper agreement line (Figure 2A). Approximately 64% of the UK weights were lower than the US weights (difference less than zero), 31% were equal, and 5% were higher (difference greater than zero).
The Bland-Altman plot of UK and Japan weights showed that 96% of the difference scores were between the limits of agreement, 4% below the lower agreement line, and none above the upper agreement line (Figure 2B). Approximately 64% of the UK weights were higher than the Japan weights (difference greater than zero), 21% were equal, and 15% were lower (difference less than zero).
The Bland-Altman plot of US and Japan weights showed that 96.4% of the difference scores were between the limits of agreement, 3.6% below the lower agreement line, and none above the upper agreement line (Figure 2C). Approximately 75% of the US weights were higher than the Japan weights (difference greater than zero), 21% were equal, and 4% were lower (difference less than zero).
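The limits-of-agreement formula given at the start of this section (d ± 1.96*SD of d) can be sketched in a few lines. This is an illustration under stated assumptions: the function name and the paired scores are hypothetical, and the sample standard deviation is used, which the text does not specify.

```python
from statistics import mean, stdev


def bland_altman_limits(scores_a, scores_b):
    """Return (mean difference, lower limit, upper limit) for paired scores.

    Limits follow d +/- 1.96 * SD of d; the sample SD is an assumption.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    d_bar = mean(diffs)
    sd = stdev(diffs)
    return d_bar, d_bar - 1.96 * sd, d_bar + 1.96 * sd


# Hypothetical paired index scores for four respondents (not study data).
uk_scores = [0.70, 0.80, 0.60, 0.90]
us_scores = [0.75, 0.84, 0.68, 0.93]

d_bar, lower, upper = bland_altman_limits(uk_scores, us_scores)
print(f"mean difference = {d_bar:.3f}")
print(f"95% limits of agreement = ({lower:.3f}, {upper:.3f})")

# X-axis values of the plot: the mean of each pair of scores.
pair_means = [(a + b) / 2 for a, b in zip(uk_scores, us_scores)]
```

Each plotted point pairs one value of `pair_means` (X-axis) with the corresponding difference (Y-axis); points falling outside the two limits are the ones counted as below or above the agreement lines.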

Test-retest reliability
The one- to two-week test-retest reliability of EQ-5D index scores using the UK, US, and Japan preference weights (N = 64) is presented in Table 6. The Japan scheme provided the highest ICC (0.78) among the three schemes, while the UK and US weights had the same ICC of 0.74. Rosner suggests that ICC < 0.40 indicates poor agreement, 0.40 ≤ ICC < 0.75 indicates fair to good agreement, and ICC ≥ 0.75 indicates excellent agreement [23]. Based on this criterion, the Japan weights had excellent agreement, whereas both the UK and US weights had good agreement on test-retest reliability. It should be noted that in this study the retest was conducted by telephone interview, for which test-retest correlations are generally lower than for face-to-face interviews [24]. Had the retest been done face to face, the ICCs of the three approaches would likely have been higher, and the UK and US weights might then have reached excellent agreement. However, this would not change the finding that the Japan scheme yielded the highest ICC, because all three preference weights would have had higher ICCs.
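The Rosner thresholds quoted above amount to a simple three-way classification, sketched below. The function name is hypothetical; the cut points are those stated in the text.

```python
def rosner_agreement(icc):
    """Classify an ICC using the thresholds attributed to Rosner above."""
    if icc < 0.40:
        return "poor"
    if icc < 0.75:
        return "fair to good"
    return "excellent"


# ICCs reported in this section.
for scheme, icc in [("UK", 0.74), ("US", 0.74), ("Japan", 0.78)]:
    print(scheme, icc, rosner_agreement(icc))
```

The UK and US values of 0.74 fall just under the 0.75 cut point, which is why a modest increase from face-to-face retesting could move them into the excellent category.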

Convergent validity
EQ-5D scores derived from the UK, US, and Japan preference weights were significantly associated with all sociodemographic, clinical, and health status variables except for age (Table 7). Spearman's rho correlation coefficients ranged from -0.14 to -0.50. Based on Colton's criteria [25], EQ-5D scores had little to medium correlation with these variables. However, most correlations between the Japan weighted scores and these variables were stronger than those between the UK or US weighted scores and the same variables. The magnitudes of correlation between the UK and US weights and all variables were quite similar.

Known-groups validity
Among the three weighting schemes, the Japan weights clearly showed better discriminant validity than both the UK and US weights for all known groups, including HbA1c level (above and below 7%) and diabetic complications (presence and absence), namely neuropathy, retinopathy, nephropathy, and cardiovascular diseases (Table 8). The relative precision values suggest that the Japan weights discriminated more efficiently than the UK and US weights (the ratios of Japan versus UK and Japan versus US were greater than 1.00). Between the UK and US weights, the UK weights discriminated better for the presence and absence of neuropathy, retinopathy, and cardiovascular complications (the ratios of UK versus US were greater than 1.00), whereas the US weights discriminated more efficiently for HbA1c level and the presence and absence of nephropathy (the ratios of UK versus US were less than 1.00).
Figure 1. Distribution of EQ-5D scores derived from the UK, US, and Japan preference weights.

The impact of the differences in EQ-5D index scores using UK, US, and Japan preference weights on cost-utility analysis
As shown in Table 9, the incremental cost of drug A over drug B was 300,000 Baht for all scenarios. In the base-case scenario, the incremental QALY of using drug A over drug B was 2.4, giving an ICUR of 125,000 Baht/QALY. The ICURs for the alternative decision trees ranged from 117,096 Baht/QALY (6.32% difference from the base case) to 123,457 Baht/QALY (1.23% difference from the base case). The seventh decision tree, which used the median difference between US and Japan weights, had the largest percent difference in ICUR from the base case, while the third decision tree, which used the mean difference between UK and Japan weights, had the smallest.
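The arithmetic behind these figures can be reproduced from the numbers reported above. The sketch below uses only the incremental cost, the base-case incremental QALY, and the two extreme ICURs quoted in this section; the function names are hypothetical, and the percent difference is assumed to be taken relative to the base-case ICUR.

```python
def icur(incremental_cost, incremental_qaly):
    """Incremental cost-utility ratio: incremental cost per QALY gained."""
    return incremental_cost / incremental_qaly


def percent_difference(value, reference):
    """Absolute difference from the reference, as a percentage of it."""
    return abs(value - reference) / reference * 100


base_case = icur(300_000, 2.4)  # Baht per QALY
print(round(base_case))         # 125000

# Extreme ICURs reported for the alternative decision trees.
print(round(percent_difference(117_096, base_case), 2))  # 6.32
print(round(percent_difference(123_457, base_case), 2))  # 1.23
```

Because the incremental cost is fixed at 300,000 Baht in every scenario, all of the variation in ICUR comes from the denominator, that is, from how the overestimated utility scores change the incremental QALYs.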

Discussion
To the best of our knowledge, this is the first study examining the differences and cross-cultural validation between EQ-5D scores derived from the UK, US, and Japan preference weights. The results showed significant differences across the three EQ-5D index scores. The US weights yielded higher scores than the UK and Japan weights (p < 0.001 for both), while the UK and Japan weighted scores did not differ (p > 0.05). The EQ-5D index scores derived from both the UK and Japan weights were also comparable to those of a previous study in which patients with type 2 diabetes had a mean EQ-5D score of 0.75 [26]. The UK and US scores agreed more closely with each other than with the Japan scores. As for psychometric properties, the Japan scheme provided better test-retest reliability, convergent validity, and known-groups validity than both the UK and US schemes. We also determined the impact of the differences in these EQ-5D index scores on the outcome of CUA. Variation in utility scores estimated from the UK, US, and Japan preference weights had a relatively small impact on CUA (range: 1.23-6.32%).
Our study showed that the US weighted scores were higher than the UK weighted scores. This result is consistent with a previous study conducted in US patients living with HIV infection [17]. However, our study yielded a larger mean difference (0.05) than the previous study (0.03). This may be due to differences in the health states of the patient populations. Johnson et al found that the discrepancy between the US and UK schemes was smaller for better health states but larger for extreme health problems [15]. This finding is consistent with our results (see Figure 2A). In the previous US study, the HIV patients had better health (mean EQ-5D scores using the US and UK weights were 0.87 and 0.84, respectively) than the diabetic patients in the present study (mean EQ-5D scores using the US and UK weights were 0.81 and 0.76, respectively). This may explain why a larger mean difference between the US and UK scores was found in the present study.
Figure 2. The Bland-Altman plots of EQ-5D scores derived from the UK, US, and Japan preference weights.
This study also showed that the EQ-5D index scores using the US scheme were higher than those using the Japan scheme, with an estimated mean difference of 0.07, while the UK model yielded slightly higher scores than the Japan model, with a mean difference of 0.02 (not statistically significant). No previous study has compared US and Japan weighted scores; however, the large discrepancy may be attributable to differences in algorithms, cultures, research methods, and/or other factors. Tsuchiya and colleagues reported that the Japan scheme yielded consistently higher scores than the UK weights except for very mild states [11]. This finding contrasts with our results, in which the mean UK weighted scores were slightly higher than the mean Japan index scores, although the difference was not significant. Also, the Bland-Altman plot (Figure 2B) showed that the majority of the UK weighted scores (62%) were higher than the Japan weighted scores, except for the extreme health states. These different results may be because the previous study was conducted in a general population, whereas we used a real patient population. Utility weights derived from a heterogeneous general population and applied to a patient population may be less precise in detecting differences across cultures. In addition, because population ratings and healthcare settings differ between Japan and Thailand, EQ-5D valuations may perform differently when applied to different populations.
It is not surprising that the UK and US preference weights agreed more closely with each other than with the Japan weights, because the UK and the US are Western countries whose cultures differ from that of Japan, an Asian country. Moreover, the Japan scheme provided better test-retest reliability, convergent validity, and known-groups validity than both the UK and US schemes in this Thai sample. These results may reflect the fact that Thailand is an Asian country whose culture is closer to that of Japan than to those of the UK and the US. Thus, the Japan weights appear more suitable than the UK or US weights for EQ-5D valuations in Thai people.
Even though our results showed differences in EQ-5D scores derived from the UK, US, and Japan weights, the impact on ICUR was marginal. This leads to the question of which preference weights should be used and in what situations. Our results suggest that if the EQ-5D index score is used as an HRQoL measure for clinical decision making, such as using the utility score as a clinical indicator to monitor patients' health status, the Japan weights should be applied for Thais. However, if one would like to conduct a CUA or CEA whose outcome is QALYs gained, the choice of weighting scheme matters little. Nevertheless, if we have to recommend one scheme, the Japan weights would be the most appropriate because they demonstrated better psychometric properties than the UK and US weights.
The results of this study need to be interpreted in light of the following limitations. First, we used only cross-sectional data. Differences in change scores are likely to have a greater impact on ICUR than differences in absolute scores. Thus, further study should be done with longitudinal data. Second, our data were derived from diabetic outpatients, so the results are limited to a specific patient group. The findings are unlikely to be generalizable to other patient populations, and other clinical populations need investigation. Finally, we used a simple hypothetical decision tree model to examine the impact of variability in EQ-5D index scores on ICUR. Using real CUA data would be more informative.
Decision tree 1 represents the base-case scenario, while decision trees 2-7 represent the scenarios in which the base-case utility scores were overestimated by 0.051 (mean difference between UK versus US weights), 0.016 (mean difference between UK versus Japan weights), 0.067 (mean difference between US versus Japan weights), 0.035 (median difference between UK versus US weights), 0.028 (median difference between UK versus Japan weights), and 0.081 (median difference between US versus Japan weights), respectively.

Conclusion
In this study, we compared EQ-5D valuations using algorithms developed in the UK, US, and Japan general populations and cross-validated them using a Thai patient sample. Our results suggest that the US scheme provided higher EQ-5D index scores than the UK and Japan schemes, while the UK and Japan weighted scores did not differ significantly. However, the impact of the differences in these EQ-5D index scores on the outcome of CUA was quite small. The UK and US scores agreed more closely with each other than with the Japan scores. The Japan scheme provided better test-retest reliability, convergent validity, and known-groups validity than both the UK and US schemes. We recommend that, among these three weights, the Japan model be used in Thai people. However, more research needs to be done.