A head-to-head comparison of the EQ-5D-3L index scores derived from the two EQ-5D-3L value sets for China

Objective Two EQ-5D-3L (3L) value sets (developed in 2014 and 2018) co-exist in China. The study examined the level of agreement between the index scores for all 243 health states derived from them at both absolute and relative levels and compared the responsiveness of the two indices. Methods The intraclass correlation coefficient (ICC) and Bland–Altman plot were adopted to assess the degree of agreement between the two indices at the absolute level. Health gains for 29,403 possible transitions between pairs of 3L health states were calculated to assess the agreement at the relative level. Their responsiveness for the transitions was assessed using Cohen effect size. Results The mean (SD) value was 0.427 (0.206) and 0.649 (0.189) for the 3L2014 and 3L2018 index scores, respectively. Although the ICC value showed good agreement (i.e., 0.896), 88.9% (216/243) of the points were beyond the minimum important difference limit according to the Bland–Altman plot. The mean health gains for the 29,403 health transitions were 0.234 (3L2014 index score) and 0.216 (3L2018 index score). The two indices predicted consistent transitions in 23,720 (80.7%) of 29,403 pairs. For the consistent pairs, the Cohen effect size was 1.05 (3L2014 index score) and 1.06 (3L2018 index score), and the 3L2014 index score yielded only 0.007 more utility gain. However, the results based on the two measures varied substantially according to the direction and magnitude of health change. Conclusion The 3L2014 and 3L2018 index scores are not interchangeable. The choice between them is likely to influence QALY estimates.


Introduction
The EQ-5D-3L (3L) is one of the most widely used utility instruments for measuring health-related quality of life (HRQoL) [1][2][3][4] for use in the calculation of quality-adjusted life years (QALYs). It has a classification system consisting of five dimensions: mobility (MO), self-care (SC), usual activities (UA), pain/discomfort (PD), and anxiety/depression (AD), with three functioning levels (no problems, some problems, and extreme problems) in each dimension. The system thus defines 243 ($3^5$) possible health states [5], and each of them can be coded as a five-digit number ranging from "11111" to "33333" (e.g., 12321 means no problems in mobility, some problems in self-care, extreme problems in usual activities, some problems in pain/discomfort and no problems in anxiety/depression). A single utility index score can be assigned to each health state by using a value set, which is developed in a valuation study based on the health preferences of the general population.
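As an illustration of the coding scheme just described, the following minimal Python sketch enumerates the 243 health states and decodes a five-digit state into its dimension levels. The function names `all_states` and `decode` are hypothetical helpers introduced here for illustration only; they are not part of any EQ-5D software.

```python
from itertools import product

# Dimension order used in the five-digit state code
DIMENSIONS = ("MO", "SC", "UA", "PD", "AD")


def all_states():
    """Return all 3**5 = 243 health states as five-digit strings."""
    return ["".join(levels) for levels in product("123", repeat=5)]


def decode(state):
    """Map a five-digit state such as '12321' to {dimension: level}."""
    if len(state) != 5 or any(ch not in "123" for ch in state):
        raise ValueError(f"not a valid 3L state: {state!r}")
    return {dim: int(ch) for dim, ch in zip(DIMENSIONS, state)}


states = all_states()
print(len(states), states[0], states[-1])  # 243 11111 33333
print(decode("12321"))
```

Here `decode("12321")` returns level 1 for MO and AD, level 2 for SC and PD, and level 3 for UA, matching the example in the text.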
Open Access. † Ruo-Yu Zhang and Wei Wang contributed equally to this work. *Correspondence: wang_p@fudan.edu.cn
Since health preferences differ across populations [6,7], a number of 3L value sets have been derived in different countries/regions [8]. Some countries (e.g., Korea, USA, and China) have even developed two value sets, each for its own reasons [9][10][11][12][13][14]. Taking China as an example, whereas the first value set, developed in 2014 (i.e., 3L 2014 value set), used a sample comprising residents mainly from urban areas, the second value set, developed in 2018 (i.e., 3L 2018 value set), adopted a more representative sample of residents from both rural and urban areas (Table 1).
Despite the availability of the EQ-5D-5L (5L, a new version of the 3L) index score with improved psychometric properties [15][16][17][18], the 3L index score remains highly useful because of the need for consistency and continuity in the decision-making process [19]. Indeed, the National Health Service Survey in China continued to use the 3L to measure the HRQoL of Chinese residents even after the publication of the 5L value set for China in 2017 [20]. Moreover, the 3L value set can also be used to generate a 5L index score from 5L responses via a crosswalk function [21], thus utilizing the advantages of the 5L descriptive system.
Similarly, the 3L 2014 value set is still more frequently used than the 3L 2018 value set, despite the limitations of its sampling method. According to Web of Science, the former had been cited in 139 articles as of April 16, 2021, 62 of which cited it after the latter became available. In contrast, the 3L 2018 value set had been cited only 18 times since its publication [22,23]. Given the noticeable differences in the coefficients of the scoring algorithms for the two value sets (Table 1), it is unlikely that the two value sets would yield identical utility index scores for the same health state. However, it remains unclear to what extent the use of different utility scores generated from the two value sets would affect the results of QALY computation, which mainly depends on the difference in utility scores rather than on the absolute utility scores. Moreover, it is not known whether the difference in the utility scores is clinically important. Our previous study compared the two 3L indices in diabetes patients and found that they had different discriminative power and that the choice between them may affect QALY estimation [24]. Another study compared them in patients with gastric cancer and healthy controls, and showed that the 3L 2014 index score was better able to distinguish the patients from the controls [25]. A study published in Chinese also compared the two 3L value sets in measuring the HRQoL of Tibet residents and concluded that they could not be used interchangeably [26]. Nevertheless, all the previous studies were based on either a single disease group or a special group, so it is not known whether the findings can be generalized to the general population or to other patients in China. Hence, the study aimed to: (1) examine the level of agreement, at both absolute and relative levels, of all 243 index scores derived from the two 3L value sets for China; and (2) compare the responsiveness of the two indices (i.e., their ability to capture real changes in health states over time).

The two 3L indices generated from the two 3L value sets for China
The two 3L value sets were developed using different sampling methods, valuation protocols [27,28], and modeling methods, leading to distinct algorithms for calculating the 3L index scores (Table 1). For example, the utility score for health state "23221" is 0.466 (i.e., 1-0.039-0.099-0.208-0.074-0.092-0.022) according to the 2014 algorithm or 0.568 (i.e., 1-0.077-0.291-0.037-0.027) according to the 2018 algorithm. In the study, both algorithms were used to generate the two index scores of all 243 3L health states for analysis. There are three main differences between the two value sets. First, for the 3L 2014 value set, respondents were selected from urban areas through quota sampling, while for the 3L 2018 value set, a more representative sample of respondents was obtained from both rural and urban areas by using a random sampling method. Second, the 3L 2014 and 3L 2018 value sets were developed using the Paris protocol and the Measurement and Valuation of Health (MVH) protocol, respectively, whereby the former protocol is an improvement of the latter. Third, the time trade-off (TTO) technique used for the 3L 2014 value set referred to the 'death' state when eliciting health utility scores, whereas that used for the 3L 2018 value set did not.
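The worked example for state "23221" can be reproduced with a short Python sketch. It includes only the coefficients quoted in the text; the complete coefficient sets are in Table 1 and are not reproduced here, so the function raises a `KeyError` for states that need other coefficients. The rules for when the constant and the N3 term apply (any dimension above level 1, and any dimension at level 3, respectively) are assumptions that agree with the arithmetic shown in the text. The name `index_score` is a hypothetical helper.

```python
# Only the coefficients quoted in the text; full sets are in Table 1.
COEF_2014 = {"const": 0.039, "N3": 0.022,
             "MO2": 0.099, "SC3": 0.208, "UA2": 0.074, "PD2": 0.092}
COEF_2018 = {"MO2": 0.077, "SC3": 0.291, "UA2": 0.037, "PD2": 0.027}

DIMENSIONS = ("MO", "SC", "UA", "PD", "AD")


def index_score(state, coef):
    """Utility index = 1 minus the decrements that apply to the state."""
    levels = [int(ch) for ch in state]
    score = 1.0
    for dim, level in zip(DIMENSIONS, levels):
        if level > 1:
            # KeyError here means the coefficient was not quoted in the text
            score -= coef[f"{dim}{level}"]
    # Assumption: constant applies to any state other than "11111"
    if any(level > 1 for level in levels):
        score -= coef.get("const", 0.0)
    # Assumption: N3 applies when at least one dimension is at level 3
    if any(level == 3 for level in levels):
        score -= coef.get("N3", 0.0)
    return round(score, 3)


print(index_score("23221", COEF_2014))  # 0.466
print(index_score("23221", COEF_2018))  # 0.568
```

Keeping the coefficients in plain dictionaries makes it straightforward to substitute the full Table 1 values when scoring all 243 states.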

Statistical analysis
We assessed the distributions of the two indices (i.e., 3L 2014 index score and 3L 2018 index score) using the Shapiro-Wilk test. A t-test or Wilcoxon rank-sum test was then used to compare their mean values, as appropriate.
A two-way mixed intraclass correlation coefficient (ICC) [29] and Bland-Altman plot [30] were adopted to assess the degree of agreement between the two indices at the absolute level. The agreement was considered good when the ICC value was higher than 0.7. The Bland-Altman plot was used to visualize and assess the level of agreement across different utility segments, whereby the Y-axis depicts the differences in score between the two indices, and the X-axis represents their mean values. A limit of 0.074, which is the minimally important difference (MID) of the 3L index score [31], was used to determine whether the magnitude of the difference would be clinically important.
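The Bland-Altman quantities and the MID check can be sketched as follows. This is a minimal illustration and not the authors' code: the ICC is not computed, the 1.96 SD limits of agreement are the conventional choice rather than something stated in the text, and `bland_altman` is a hypothetical helper. The toy input uses the two states whose scores are quoted in the text ("23221" and "33333") plus "11111", which is assumed to score 1.0 under both value sets.

```python
from statistics import mean, stdev

MID = 0.074  # minimally important difference of the 3L index score [31]


def bland_altman(scores_a, scores_b, mid=MID):
    """Return Bland-Altman coordinates and the share of points beyond the MID."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]        # Y-axis
    means = [(a + b) / 2 for a, b in zip(scores_a, scores_b)]  # X-axis
    bias = mean(diffs)
    sd = stdev(diffs)  # sample SD of the differences
    return {
        "means": means,
        "diffs": diffs,
        "bias": bias,
        "limits": (bias - 1.96 * sd, bias + 1.96 * sd),
        "share_beyond_mid": sum(abs(d) > mid for d in diffs) / len(diffs),
    }


# Scores for states "11111" (assumed 1.0 in both), "23221" and "33333"
scores_2014 = [1.0, 0.466, -0.149]
scores_2018 = [1.0, 0.568, 0.170]
result = bland_altman(scores_2014, scores_2018)
print(result["bias"], result["share_beyond_mid"])
```

Applied to all 243 states, `share_beyond_mid` corresponds to the proportion of points outside the MID limit on the plot.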
To examine the agreement of the two 3L index scores at the relative level, we simulated all the possible health state transitions that may occur over time. All 243 health states were paired to form 29,403 ($C_{243}^{2}$) health state combinations, each of which was used to simulate a pair of health states before and after treatment. It was assumed that the health state with the higher index score was the state after treatment (post-treatment), and the one with the lower score was the state before treatment (pre-treatment) [32]. Hence, the health gains of our simulated treatment were always positive. However, the index score of the same health state may vary when changing from one value set to the other; thus, a health state labeled as pre-treatment when using the 3L 2014 value set may instead represent post-treatment when using the 3L 2018 value set in the same pair, or vice versa. This was what we considered an "inconsistent" pair of health states [33], whereby the choice of index scores would have a substantial impact on health outcomes, i.e., one may generate a positive health gain, while the other may result in a health loss.
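The pairing and the consistent/inconsistent distinction can be sketched in Python as below. The score dictionaries are toy values with hypothetical state labels, not the published index scores, and treating ties (a zero gain under either value set) as a separate category is an assumption of this sketch, since the text does not say how ties were handled.

```python
from itertools import combinations
from math import comb

print(comb(243, 2))  # 29403 unordered pairs of health states


def classify_pairs(scores_a, scores_b):
    """Split all state pairs by whether both value sets rank them the same way."""
    result = {"consistent": [], "inconsistent": [], "tie": []}
    for s, t in combinations(sorted(scores_a), 2):
        gain_a = scores_a[t] - scores_a[s]  # change from s to t, value set A
        gain_b = scores_b[t] - scores_b[s]  # change from s to t, value set B
        if gain_a == 0 or gain_b == 0:
            result["tie"].append((s, t))           # assumption: ties kept apart
        elif (gain_a > 0) == (gain_b > 0):
            result["consistent"].append((s, t))    # same state is pre-treatment
        else:
            result["inconsistent"].append((s, t))  # pre/post labels swap
    return result


# Toy scores for three hypothetical states (illustrative values only)
toy_a = {"s1": 0.70, "s2": 0.73, "s3": 0.40}
toy_b = {"s1": 0.90, "s2": 0.80, "s3": 0.60}
pairs = classify_pairs(toy_a, toy_b)
print(len(pairs["consistent"]), len(pairs["inconsistent"]))  # 2 1
```

In the toy data, the pair (s1, s2) is inconsistent because s2 scores higher than s1 under the first set but lower under the second.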
In contrast, for a "consistent" pair, the health state representing pre-treatment remained unchanged regardless of whether the 3L 2014 or 3L 2018 value set was used. Given that the magnitude of health gains may vary from one value set to another, the consistent group was further divided into four subgroups according to the perceived direction and magnitude of the change before and after treatment: (1) major improvement (i.e., at least one dimension in the health transition improves from level 3 to level 1 or level 2, and no dimension deteriorates); (2) minor improvement (i.e., at least one dimension in the health transition improves from level 2 to level 1, no dimension improves from level 3 to level 1 or 2, and no dimension deteriorates); (3) mixed response with minor deterioration (i.e., at least one dimension deteriorates from level 1 to 2 and no dimension deteriorates from level 1 or 2 to 3); (4) mixed response with major deterioration (i.e., at least one dimension deteriorates from level 1 or 2 to 3) [33]. It should be noted that, if the level of one dimension deteriorates yet the level of the others improves in a health transition, it would be considered a mixed response with some deterioration and thus assigned to either subgroup 3 or 4.
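The four subgroup definitions translate into a small rule-based classifier. The sketch below is one reading of the definitions above, not the authors' implementation; `classify_transition` is a hypothetical helper, and the "no change" label for identical states is added only so that the function is total.

```python
def classify_transition(pre, post):
    """Assign a pre -> post transition to one of the four subgroups."""
    changes = [(int(a), int(b)) for a, b in zip(pre, post)]
    # Deterioration: the level number goes up (more severe problems)
    worse_to_3 = any(b == 3 and a < 3 for a, b in changes)
    worse_to_2 = any(a == 1 and b == 2 for a, b in changes)
    # Improvement: the level number goes down
    better_from_3 = any(a == 3 and b < 3 for a, b in changes)
    better_from_2 = any(a == 2 and b == 1 for a, b in changes)

    if worse_to_3:
        return "mixed response with major deterioration"  # subgroup 4
    if worse_to_2:
        return "mixed response with minor deterioration"  # subgroup 3
    if better_from_3:
        return "major improvement"                        # subgroup 1
    if better_from_2:
        return "minor improvement"                        # subgroup 2
    return "no change"  # identical states; not a transition in the study


print(classify_transition("33333", "11111"))  # major improvement
print(classify_transition("21111", "11111"))  # minor improvement
print(classify_transition("31111", "12111"))  # mixed response with minor deterioration
print(classify_transition("11131", "11113"))  # mixed response with major deterioration
```

Checking deterioration before improvement reflects the rule that any deterioration places a transition in subgroup 3 or 4, whatever happens in the other dimensions.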
We then compared the health gains yielded from the two 3L indices for all the transitions, the consistent transitions, and each subgroup of the consistent transitions. Moreover, to help understand how a single-level change in severity in the descriptive system would result in different utility changes between the two value sets, we computed the changes in utility values between pairs of adjacent health states for each value set. We called two health states "adjacent" when they are exactly the same except for one dimension in which the severity level differs by one only [15][34][35][36]. For example, health states "21111" and "11111" were considered adjacent.
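The adjacency definition can be expressed directly in code. In this minimal sketch, `are_adjacent` is a hypothetical helper name; it returns True only when exactly one dimension differs and that difference is a single level.

```python
from itertools import combinations, product


def are_adjacent(state_a, state_b):
    """True if the states differ in exactly one dimension, by one level."""
    diffs = [abs(int(a) - int(b)) for a, b in zip(state_a, state_b)]
    return sum(d != 0 for d in diffs) == 1 and sum(diffs) == 1


print(are_adjacent("21111", "11111"))  # True
print(are_adjacent("31111", "11111"))  # False: differs by two levels
print(are_adjacent("21111", "12111"))  # False: two dimensions differ

# Enumerate every adjacent pair among the 243 states
states = ["".join(levels) for levels in product("123", repeat=5)]
adjacent_pairs = [(s, t) for s, t in combinations(states, 2) if are_adjacent(s, t)]
print(len(adjacent_pairs))
```

The utility change for each adjacent pair is then simply the difference between the two states' index scores under a given value set.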
We also compared the responsiveness of the two 3L indices within the consistent group by using Cohen effect size [37]. It is commonly used to measure the effect size of a treatment and, unlike a significance test, is independent of the sample size. It is calculated as the difference in the mean scores between post-treatment and pre-treatment divided by the standard deviation of the pre-treatment scores. The effect size was categorized as small (0.2-0.5), moderate (> 0.5-0.8), or large (> 0.8) [37]. Given that the hypothetical treatment was fixed in our simulation, the effect size would reflect the ability of an index score to discern changes between two known health states. The higher the effect size, the more responsive the index score is. We calculated and compared Cohen effect size for all the consistent pairs and each subgroup of the pairs. Microsoft Excel, Stata, and SAS were used for statistical analysis.
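The effect size calculation described above can be sketched as follows. Using the sample standard deviation (n - 1 denominator) is an assumption, since the text does not specify it; the function names and the "below small" label for values under 0.2 are introduced here for illustration, and the input scores are toy values.

```python
from statistics import mean, stdev


def cohen_effect_size(pre, post):
    """(mean post - mean pre) / SD of the pre-treatment scores."""
    return (mean(post) - mean(pre)) / stdev(pre)  # assumption: sample SD


def categorize(effect_size):
    """Apply the thresholds quoted in the text [37]."""
    es = abs(effect_size)
    if es > 0.8:
        return "large"
    if es > 0.5:
        return "moderate"
    if es >= 0.2:
        return "small"
    return "below small"  # label not from the source


# Toy pre- and post-treatment index scores (illustrative values only)
pre = [0.2, 0.4, 0.6]
post = [0.5, 0.6, 0.7]
es = cohen_effect_size(pre, post)
print(round(es, 2), categorize(es))
```

By these thresholds, the values of 1.05 and 1.06 reported for the consistent pairs both fall in the "large" category.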

Results
The two 3L indices were both normally distributed according to the Shapiro-Wilk test (Fig. 1). Overall, the 3L 2014 value set generated systematically lower index scores than the 3L 2018 value set. The mean (SD) value of all the index scores was 0.427 (0.206) for the former and 0.649 (0.189) for the latter, a difference in means of 0.222 (p < 0.001) (Table 2); the 3L 2014 value set also had lower scores for 239 out of 243 health states. Meanwhile, the difference and variance between the two index scores were not constant but generally increased with increasing health-state severity (Fig. 2). For example, the index score of the second-best health state was 0.887 (for state "11211") according to the 3L 2014 value set and 0.973 (for state "11121") according to the 3L 2018 value set, while the minimum index score (for the worst state "33333") was −0.149 and 0.170, respectively. Although the overall agreement between the two index scores was good (ICC = 0.896), 88.9% (216/243) of the points were beyond the MID limit according to the Bland-Altman plot (Fig. 3).
On the other hand, the difference between the two indices was less pronounced for the 29,403 health transitions: the mean health gain was 0.234 (SD 0.173) for the 3L 2014 index score and 0.216 for the 3L 2018 index score. The two indices predicted consistent transitions in 23,720 (80.7%) of the 29,403 pairs, and for these consistent pairs the 3L 2014 index score yielded only 0.007 more utility gain. In the subgroups of major/minor improvement, the 3L 2014 index score yielded a greater magnitude of health gains, at 0.411/0.151 (vs. 0.310/0.072 from the 3L 2018 index score). However, it generated similar or lower health gains compared to the 3L 2018 index score in the subgroups of "mixed response with minor deterioration" (health gains: 0.246 for both index scores) and "mixed response with major deterioration" (health gains: 0.069 vs. 0.118) (Table 3). The absolute change in utility values between any two adjacent states computed using the 3L 2014 value set was larger than that computed using the 3L 2018 value set, except for pairs that involve a change between level 2 and level 3 in either the "mobility" or "self-care" dimension (Table 4). Essentially, this reflects the fact that the differences between coefficients of the same dimension in the scoring algorithm vary from one value set to another.
The two indices also showed a similar level of sensitivity to change across all the consistent transitions, with Cohen effect size values of 1.05 and 1.06, respectively. Nevertheless, the value varied substantially across the subgroups. In the subgroups of major/minor improvement, the 3L 2014 index score demonstrated higher values than the 3L 2018 index score (Table 3).

Discussion
In the study, we compared the agreement between the two 3L index scores generated from the two 3L value sets for China. We found that the 3L 2014 index score was systematically lower than the 3L 2018 index score at the absolute level, but their differences at the relative level varied with the direction and magnitude of the health change.
It is not surprising that the 3L 2014 index score was much lower, given that the 3L 2014 algorithm has larger values in 8 out of 10 parameters and two more terms (i.e., constant and N3) that further pull down the scores (Table 1). The difference and variance between the two index scores also increased with increasing health-state severity. Regarding the former, the difference in level-3 (L3) parameters between the two algorithms is in general larger than the difference in level-2 (L2) parameters. This, plus the use of the N3 term, leads to the increased difference. The latter could be ascribed to the fact that the 3L 2018 algorithm has two L3 parameters with larger values (i.e., MO3 and SC3) than those of the 3L 2014 algorithm. As a result, for health states including these problems, the difference between the index scores may be reduced rather than increased, resulting in larger variance for all health states including L3 problems. Differences in algorithm parameters may be attributed to several factors such as the valuation protocol, the modeling method, and the sample used [13,14]. The sample for the 3L 2018 algorithm included the rural population, who may be more likely to have lived with economic hardship for years. Hence, they may be able to endure more pain and suffering, leading to relatively higher utility estimates for health problems than those of better-off residents. In addition, the 3L 2018 value set used an open-ended TTO question. The developers of the 3L 2018 value set believed that, for cultural reasons, death is a taboo in China, especially in rural areas. When using the TTO method, the interviewers did not tell the respondents to imagine dying immediately after living in a hypothetical health state for a period of time. Therefore, the respondents may have made varying assumptions about the length of life and the health states of the continued lives, which may have led to an overestimation of the TTO values.
Table 3 Responsiveness of the two EQ-5D index scores in simulated transitions between EQ-5D-3L health states
The two indices generated consistent results for the majority (80.7%) of health transitions. For the transitions involving improvement only, the results would always be consistent regardless of the differences in scoring algorithms. On the other hand, inconsistent results would arise for the transitions including both improvement and deterioration in different dimensions. Compared to the 3L 2014 algorithm, the parameter coefficients of the 3L 2018 algorithm display greater variance. The parameter values for L2 and L3 problems in the 3L 2018 algorithm varied from 0.027 (PD2) to 0.077 (MO2) and from 0.041 (PD3) to 0.291 (SC3), respectively, while those in the 3L 2014 algorithm ranged from 0.074 (UA2) to 0.099 (MO2) and from 0.205 (AD3) to 0.246 (MO3). For example, the health transition from state "11131" to "11113" would be considered a health gain according to the 3L 2014 algorithm (0.031) but a health loss according to the 3L 2018 algorithm (−0.136).
With regard to all the consistent health transitions, both the index scores showed similar health gains and responsiveness, but they varied considerably across the four subgroups. The health gains and responsiveness of the 3L 2014 index score were found to be better or greater than those of the 3L 2018 index score in the "major improvement" and "minor improvement" subgroups, which suggested that the use of the 3L 2014 algorithm would tend to result in larger QALY gains for the two subgroups. On the other hand, in the subgroups of "mixed response with minor deterioration" and "mixed response with major deterioration", the two index scores generated similar or even reversed results. For the subgroups 1 & 2, the 3L 2014 algorithm overall has larger parameter values, indicating the health gain from a transition from extreme/some problems to no problems is much greater according to it. Similarly, the magnitude of difference between L2 and L3 parameters is also generally larger for the 3L 2014 algorithm, leading to comparable conclusions for the transitions from extreme problems to some problems. This point became clearer when we compared changes in utility values of two adjacent health states between the two value sets, as shown in Table 4. For the subgroups 3 & 4, the 3L 2014 algorithm has relatively similar parameter values across the five L2 and the five L3 parameters. Hence, for a health transition involving both improvement and deterioration, the magnitude of health gain from the improvement in a certain dimension may be offset to a large extent by the deterioration from another dimension according to the 3L 2014 algorithm. The resulting health gains and responsiveness were therefore not larger or better than those based on the 3L 2018 algorithm in the subgroups.
It should be borne in mind that in reality the frequencies of the 243 health states and 29,403 transitions would be distributed disproportionately. For example, the state "11111" has been the most frequently observed in a number of studies in China [24], which may lead to different conclusions. When measuring individuals who are expected either to remain stable or to improve in all five dimensions of the 3L as a result of an intervention, the 3L 2014 value set may be the preferable choice. In other scenarios, however, the choice becomes less straightforward, and it is thus recommended to apply both value sets in data analyses as part of a robustness check. In addition, the absolute utility score could influence the QALY calculation to some extent. Hence, more empirical studies are warranted to further assess the impact in various settings in China.
Table 4 Differences in utility change of adjacent health states between two value sets. *For illustration, only some adjacent health states are presented to reflect that a single "one-level" change in the 3L descriptive system would result in a change in utility values. †Column "change" lists all the possible absolute changes in utility values between any pair of adjacent health states for each value set.
We also acknowledge that a new 3L value set for China's rural population, developed by Liu et al., has recently become available [38]. They also found that the utility scores generated from that value set were generally lower than those of the two 3L value sets used in the current analysis. We did not include the value set because we had finished the analysis and the writing of the paper before its publication. Nevertheless, the differences among the three kinds of 3L utilities may necessitate the valuation of 5L health states by both rural and urban respondents, since the current 5L value set for China is based on urban respondents only.

Conclusion
Our results suggested a substantial difference between the 3L 2014 and 3L 2018 index scores at the absolute level, while their differences at the relative level varied according to the type of health change. Our findings suggest that the choice of value set used to generate the 3L index score is very likely to influence QALY estimates in China.