Mapping the Functional Assessment of Cancer Therapy - Breast (FACT-B) to the 5-level EuroQoL Group’s 5-dimension questionnaire (EQ-5D-5L) utility index in a multi-ethnic Asian population

Purpose: To develop an algorithm for mapping the Functional Assessment of Cancer Therapy – Breast (FACT-B) to the 5-level EuroQoL Group’s 5-dimension questionnaire (EQ-5D-5L) utility index. Methods: A survey of 238 breast cancer patients in Singapore was conducted. Models using various regression methods, with or without recognizing the upper boundary of utility values at 1, were fitted to predict the EQ-5D-5L utility index from the five subscale scores of the FACT-B. Data from a follow-up survey of these patients were used to validate the results. Results: A model that maps the physical, emotional and functional well-being subscales and the breast cancer concerns subscale of the FACT-B to the EQ-5D-5L utility index was derived. The social well-being subscale was not associated with the utility index. Although its theoretical assumptions may not be valid, ordinary least squares outperformed the other regression methods. Within each performance status level at follow-up, the mean predicted utility index deviated from the observed mean by less than the minimally important difference of the EQ-5D for cancer patients. Conclusions: The mapping algorithm converts the FACT-B to the EQ-5D utility index. This enables oncologists, clinical researchers and policy makers to obtain a quantitative utility summary of a patient’s health status when only the FACT-B is assessed.


Introduction
The EuroQoL Group's 5-dimension questionnaire (EQ-5D) is a generic, preference-based instrument that measures respondents' health status using a utility index. This interval-type utility index indicates the value of specific health states, obtained using standardized valuation methods such as time trade-off or standard gamble, and is essential in health economic evaluations such as cost-utility and quality-adjusted life-year analyses. The EQ-5D is preferred by health technology assessment organisations such as the National Institute for Health and Care Excellence (NICE), United Kingdom, for eliciting health utility values (http://www.nice.org.uk, accessed 6 March 2014). However, preference-based measures are not always used in clinical studies. Most disease-specific instruments are profile-based and only provide ordinal-level measurement scales that cannot be used for health economic analysis. One way to use non-preference-based instruments in economic evaluation is to map such measures to the EQ-5D utility index. NICE has released guidelines and technical support documents to guide the selection and use of mapping algorithms to the EQ-5D [1-3]. Since then, there has been a substantial increase in the number of studies reporting mapping algorithms to the EQ-5D, from one study per year between 2000 and 2003 to 17 studies per year in 2012 and 2013 [4].
The Functional Assessment of Chronic Illness Therapy (FACIT) Measurement System is a collection of health-related quality-of-life (HRQoL) questionnaires for various chronic diseases [5,6]. The core component of the FACIT system is the Functional Assessment of Cancer Therapy - General (FACT-G) for patients with any type of cancer. This 27-item instrument can be extended to a more specific instrument by adding a module specific to a cancer type. For example, the FACT-G becomes the FACT-Breast (FACT-B) if a 10-item breast cancer-specific module is added. Studies that map the scores of the FACT instruments to the EQ-5D utility index are not rare. Examples include the mapping of the FACT-G [7], FACT-Prostate [8] and FACT-Melanoma [9] to the EQ-5D utility index. However, no mapping algorithm is available in the literature to convert the FACT-B to the EQ-5D. Breast cancer has been reported in the Global Cancer Statistics 2011 as the most common type of cancer among women worldwide [10]. In 2008, 1.38 million women were diagnosed with breast cancer, accounting for 23% of the total new cancer cases. The purpose of the current study is to formulate a mapping algorithm for generating the EQ-5D health utility index from the FACT-B, so that oncologists and clinical researchers can obtain both a psychometric description and a quantitative utility summary of a patient's health status from a single assessment using the FACT-B, without imposing additional assessment burden on patients.

Design and recruitment
This is a secondary analysis of a study that aimed to evaluate the measurement properties of the English and Chinese versions of two instruments, namely the FACT-B and the 5-level EQ-5D (EQ-5D-5L), in breast cancer patients in Singapore [11]. The study was approved by the Singapore Health Services Institutional Review Board. Inpatients were recruited from the oncology wards of the Singapore General Hospital, and outpatients from the specialist outpatient clinics of the National Cancer Centre, Singapore. Patients were recruited if they were aged 21 years or above, had histologically confirmed breast cancer, were able to understand Chinese or English or both, had no evidence of brain metastasis, psychosis or severe depression, and were willing to give written informed consent. They chose to answer either a Chinese or an English questionnaire package according to their preference. Each package included the EQ-5D-5L and FACT-B, together with some questions on demographics and performance status. Approximately one week after the baseline assessment, the patients were sent a similar package, with a postage-paid return envelope enclosed, to evaluate their HRQoL, health utility and performance status for assessment of the test-retest reliability of the instruments. Up to two reminders with the questionnaire package were sent at two-weekly intervals if the package was not returned. Other clinical information was provided by the patients' treating oncologists. At baseline, the questionnaire package was self-administered by the patients, or administered by a research assistant upon the patient's request. At follow-up, it was self-administered by the patients. Questionnaires that were not self-administered were excluded from this analysis.

Instruments and variables
The FACT-B is a breast cancer-specific HRQoL instrument of the FACIT system. The 37-item English and (simplified) Chinese FACT-B version 4 are divided into five subscales, namely physical well-being (PWB), social/family well-being (SWB), emotional well-being (EWB), functional well-being (FWB), and the additional concerns for breast cancer (BCS) [12,13]. We have reported the validity and reliability of, and the comparability between, the two language versions in an earlier study [14]. Each item is rated on a 5-point Likert scale. Negatively worded items were recoded such that a higher score indicates a better HRQoL. The FACT-B total score is the sum of the scores of all five subscales, the FACT-G score is the sum of the PWB, SWB, EWB and FWB scores, and the Trial Outcome Index (TOI) is the sum of the PWB, FWB and BCS scores. Missing values were imputed as the mean of the observed items, provided that more than half of the items comprising a subscale were answered, i.e. the "half-rule" [5].
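The scoring rules described above can be sketched in a few lines of Python. This is only an illustration and not the official FACIT scoring program: the function names and the example responses are hypothetical, and the item scores passed in are assumed to have already been recoded so that a higher score indicates a better HRQoL.

```python
from typing import Dict, Optional, Sequence


def subscale_score(items: Sequence[Optional[float]]) -> Optional[float]:
    """Sum score of one subscale with half-rule imputation.

    `items` holds the recoded item scores (higher = better); None marks a
    missing answer. Each missing item is replaced by the mean of the
    answered items, which equals prorating the sum of the answered items.
    """
    answered = [x for x in items if x is not None]
    # Half-rule: score only if MORE than half of the items were answered.
    if 2 * len(answered) <= len(items):
        return None
    return sum(answered) / len(answered) * len(items)


def summary_scores(sub: Dict[str, Optional[float]]) -> Dict[str, Optional[float]]:
    """FACT-B total, FACT-G and TOI as sums of subscale scores."""
    def total(names):
        values = [sub[name] for name in names]
        # A summary score is left missing if any component is missing.
        return None if any(v is None for v in values) else sum(values)

    return {
        "FACT-B": total(["PWB", "SWB", "EWB", "FWB", "BCS"]),
        "FACT-G": total(["PWB", "SWB", "EWB", "FWB"]),
        "TOI": total(["PWB", "FWB", "BCS"]),
    }


# Made-up responses for illustration only (one item missing out of seven).
pwb_example = subscale_score([4, 3, None, 2, 4, 3, 1])
# Made-up subscale scores for illustration only.
example_summary = summary_scores(
    {"PWB": 20.0, "SWB": 22.0, "EWB": 18.0, "FWB": 19.0, "BCS": 25.0}
)
```

Leaving a summary score missing whenever one of its subscales is missing is a conservative choice made for this sketch; the source does not state how such cases were handled beyond the exclusion of patients whose missing values could not be imputed.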
The EQ-5D-5L contains five questions (mobility, self-care, usual activities, pain/discomfort, and anxiety/depression), plus a vertical, 0-100-point visual analogue scale for rating the overall health status. Respondents choose one of five levels to describe their health state on the day of the survey. These five levels are "no problem", "slight problems", "moderate problems" and "severe problems" in all five dimensions, and "unable" in mobility, self-care and usual activities or "extreme problems" in pain/discomfort and anxiety/depression [15]. In this study, experimental English and Chinese versions of the EQ-5D-5L were used because the official version was not available at the time. The differences between the official and experimental versions are minor [16]. In the official English version the responses start with "I have" or "I am," which are omitted in the experimental English version. For the self-care dimension, the official Chinese version asks about the degree of problems in "washing and dressing" while the experimental Chinese version asks about "combing, washing and dressing." The other differences in the Chinese version only involve words that are synonyms commonly used in daily life, e.g., the English word "usual" is translated as "ri chang" (in Chinese phonetic transcription here) in the official Chinese version and "pin chang" in the experimental version. Recently, we reported the validity and reliability of, and the comparability between, the English and Chinese versions of the EQ-5D-5L [16].
The answers to the same five questions in an older 3-level version, the EQ-5D-3L, can be converted to a utility index through a country-specific value set. For the conversion of the EQ-5D-5L to a utility index, however, an official value set has not been released by the EuroQoL Group at the time of writing this report. Instead, a crosswalk project converting the EQ-5D-3L value sets to the EQ-5D-5L was conducted by the EuroQoL Group, which resulted in the interim value sets [17]. We obtained and used the Japanese value set (Dr. Rosalind Rabin, the EuroQoL Group), the only Asian value set available in the crosswalk project. Using this value set, the EQ-5D-5L utility index has a possible range of −0.111 to 1 [18]. As a sensitivity analysis, the UK value set, with a possible range of −0.594 to 1, was also employed. A higher index means a better health state: a value of 1 corresponds to full health, 0 to death, positive values between 0 and 1 to health states better than death but worse than full health, and negative values to health states worse than death.
The performance status is known to be strongly associated with patients' quality of life [19], and can be assessed by both the oncologists and the patients themselves [20]. Respondents choose the most appropriate one that describes the cancer patient from five options ranging from 0 (without symptoms) to 4 (bedridden) [21]. The score of 5 (death) was not applicable in this study.

Statistical analysis
The baseline survey was used for developing the mapping functions. The EQ-5D-5L utility index was regressed on a combination of the five subscales of the FACT-B. Five different models were examined. Model 1 included all five subscales of the FACT-B, Model 2 the four subscales of the FACT-G, and Model 3 the three subscales of the TOI. In preliminary analyses, the coefficients of the SWB subscale in Models 1 and 2 were found to be insignificant and small in magnitude, and in some cases negative. Therefore, two more models (Models 4 and 5) were fitted by dropping the SWB subscale from Models 1 and 2, respectively. Since equivalence between the English and Chinese versions of the two instruments could not be conclusively confirmed in the previous studies [14,16], we also conducted sensitivity analyses by fitting models that included language, as well as interactions between language and the subscales, as predictors to investigate whether language should be included in the mapping algorithm.
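The five predictor sets and an OLS fit can be sketched as follows. This is a minimal pure-Python illustration, not the SAS or Stata code used in the study: the data are synthetic, the "true" coefficients are invented for the demonstration and are not the estimates in Table 3, and the maximum subscale scores assume items scored 0 to 4.

```python
import random

# Predictor sets of the five models described in the text.
MODELS = {
    1: ["PWB", "SWB", "EWB", "FWB", "BCS"],  # all FACT-B subscales
    2: ["PWB", "SWB", "EWB", "FWB"],         # FACT-G subscales
    3: ["PWB", "FWB", "BCS"],                # TOI subscales
    4: ["PWB", "EWB", "FWB", "BCS"],         # Model 1 without SWB
    5: ["PWB", "EWB", "FWB"],                # Model 2 without SWB
}


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(rows, y, predictors):
    """OLS with an intercept via the normal equations (X'X) beta = X'y."""
    X = [[1.0] + [row[p] for p in predictors] for row in rows]
    k = len(X[0])
    xtx = [[sum(xi[a] * xi[b] for xi in X) for b in range(k)] for a in range(k)]
    xty = [sum(xi[a] * yi for xi, yi in zip(X, y)) for a in range(k)]
    return dict(zip(["intercept"] + predictors, solve(xtx, xty)))


# Synthetic subscale scores (assumed ranges; illustration only).
rng = random.Random(0)
MAX_SCORE = {"PWB": 28, "SWB": 28, "EWB": 24, "FWB": 28, "BCS": 40}
rows = [{k: rng.uniform(0, m) for k, m in MAX_SCORE.items()} for _ in range(50)]

# Invented coefficients, NOT those of the paper; the outcome is noise-free.
TRUE = {"intercept": 0.2, "PWB": 0.010, "EWB": 0.005, "FWB": 0.008, "BCS": 0.002}
y = [TRUE["intercept"] + sum(TRUE[k] * row[k] for k in MODELS[4]) for row in rows]

fit4 = fit_ols(rows, y, MODELS[4])  # recovers the invented coefficients
fit1 = fit_ols(rows, y, MODELS[1])  # SWB coefficient is numerically zero here
```

Because the synthetic outcome does not depend on SWB, Model 1 assigns it a coefficient of essentially zero, which mirrors in a stylized way the reason given above for dropping that subscale.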
The EQ-5D-5L utility index is bounded from above by 1, which indicates full health. Previous studies have pointed out that this upper bound may invalidate the normality assumption of the ordinary least squares (OLS) method. The Tobit model, an alternative to OLS for dealing with censored data, has been suggested for use in mapping. However, its estimates are inconsistent in the presence of heteroscedasticity [22,23]. Censored least absolute deviations (CLAD) regression is a solution to the heteroscedasticity problem in the Tobit model [24], and it has also been used in the current context. However, an upper bound is not the same as a censoring threshold. Thus, apart from the OLS, Tobit and CLAD models, we also considered two other regression methods, namely quantile regression and logistic quantile regression [25,26]. Quantile regression is also called median regression if the 0.5 quantile, i.e. the median, is fitted (as in this study) [25]. The difference between CLAD and median regression is that the former assumes the dependent variable is a censored value of some unobserved latent variable, while the latter does not. In other words, CLAD is a censored version of median regression.
Logistic quantile regression is an approach for handling bounded outcomes [26], which is theoretically correct for mapping of utility values. Suppose the outcome $y$ is bounded from below and from above by two known constants, $y_{\min}$ and $y_{\max}$, respectively. A logistic transform is applied to the outcome to obtain

$$h(y) = \log\left(\frac{y - y_{\min}}{y_{\max} - y}\right),$$

and a quantile regression is then fitted by regressing the transformed outcome $h(y)$ on the independent variables. In practice, to ensure that the logistic transform is defined for all observed values, $y_{\min}$ is set to be slightly smaller than the smallest observed outcome, and $y_{\max}$ slightly larger than the largest observed outcome. In this study, we set $y_{\min}$ = 0.17885 for the Japanese value set and −0.28265 for the UK value set (i.e. half of the smallest observed increment below the smallest observed utility value from the respective value set [27]), and $y_{\max}$ = 1.001 for both value sets. The visual analogue scale of the EQ-5D was not used in the analysis.
Model performance was examined by several goodness-of-fit measures. The coefficient of determination, R², in OLS may not be well-defined in other regression methods. The pseudo-R², an alternative to R² defined by the likelihood ratio between the intercept-only model and the full model, is not comparable across different regression methods [28]. Instead, we computed the square of the correlation coefficient (r) between the observed and predicted values from each model. Note that R² is equivalent to r² in OLS. Parallel to OLS, to penalize for the complexity of the model, we considered an adjusted r², which adjusts r² for the sample size n and the number of parameters p in the model. Goodness-of-fit was also examined by the mean square error (MSE) and the mean absolute deviation (MAD). Because OLS minimizes the sum of squared deviations whereas CLAD and quantile regression minimize the sum of absolute deviations, one would expect the MSE to favor OLS and the MAD to favor CLAD and quantile regression. Therefore, the interpretation should not focus exclusively on one index. The distributions of the observed and mapped utility values were also compared.
The follow-up survey was used for validating the resulting algorithms. The differences between the observed and predicted utility values at follow-up were tested by signed-rank tests within each performance status level. All statistical analyses were performed in SAS version 9.3 (SAS Institute, Cary, NC, USA) and Stata version 10 (StataCorp, College Station, TX, USA).
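The logistic transform used in logistic quantile regression, together with its inverse for converting a fitted value on the transformed scale back to the utility scale, can be sketched as follows. The function names are hypothetical, and only the bounds quoted in the text for the Japanese value set are used; the quantile regression step itself is not shown.

```python
import math

# Bounds quoted in the text for the Japanese value set.
Y_MIN = 0.17885
Y_MAX = 1.001


def logistic_transform(y: float, y_min: float, y_max: float) -> float:
    """h(y) = log((y - y_min) / (y_max - y)), defined for y_min < y < y_max."""
    return math.log((y - y_min) / (y_max - y))


def inverse_logistic_transform(h: float, y_min: float, y_max: float) -> float:
    """Map a value on the transformed scale back to the utility scale.

    Solving h = log((y - y_min) / (y_max - y)) for y gives
    y = y_min + (y_max - y_min) / (1 + exp(-h)).
    """
    return y_min + (y_max - y_min) / (1.0 + math.exp(-h))


# Full health (utility 1) is mapped to a finite value because y_max > 1.
h_full_health = logistic_transform(1.0, Y_MIN, Y_MAX)
back = inverse_logistic_transform(h_full_health, Y_MIN, Y_MAX)
```

Whatever value the regression produces on the transformed scale, the back-transformed prediction stays between the two bounds, which is the property that motivates this approach for bounded utility values.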

Results
Two hundred and eighty female breast cancer patients consented to participate and answered the questionnaire package. At baseline, 39 patients did not self-administer the questionnaire package and were thus removed from the analysis. Three patients were further excluded due to missing values in the EQ-5D-5L or FACT-B that could not be imputed by the half-rule. As a result, 238 patients at baseline formed the development sample. Their demographic and clinical information at baseline is summarized in Table 1. Most of the patients answered an English package (67.2%), were ethnic Chinese (81.1%), married (70.9%) and outpatients (70.6%). Among them, 221 returned the follow-up questionnaire package with no missing values and were used for validation. Table 2 describes the distributions of the EQ-5D-5L utility index and the FACT-B total and subscale scores of the sample at baseline and follow-up. Approximately a quarter of the patients reported the full EQ-5D-5L utility value at baseline and one fifth at follow-up. No patient reported the maximum FACT-B total score at either time point, but the FACT-G score and four subscales (PWB, SWB, EWB and FWB) reached the upper bound in some patients. For the BCS subscale, one patient had the maximum score at baseline, but none at follow-up.
The results of the regression analyses using the Japanese value set are displayed in Table 3. For each regression method, five models were fitted. Among the five models, Model 4, consisting of the PWB, EWB, FWB and BCS, had the largest adjusted r², regardless of the regression method used. The MSE and MAD of the five models were similar, but Model 4 mostly had the smallest values, the only exception being its MAD (0.0913), which was marginally larger than that of Model 1 (0.0912) for CLAD and quantile regression. Among the five regression methods, OLS generally had the largest adjusted r² (0.4782 to 0.4887) and the smallest MSE (0.0132 to 0.0135) and MAD (0.0913 to 0.0925), followed by CLAD and quantile regression (adjusted r² ranged from 0.4775 to 0.4882; MSE from 0.0133 to 0.0135; MAD from 0.0912 to 0.0923). The latter two had the same point estimates for the regression coefficients in all models. However, due to different model assumptions and estimation processes, their standard errors were not the same (details not shown), and hence the p-values were different. In each of the models that included language and/or interactions between language and the subscales as predictors, the corresponding coefficient estimates were small in magnitude and insignificant. Moreover, adding language to the model neither improved the goodness-of-fit nor qualitatively altered the parameter estimates of the other variables. Models with interaction and quadratic terms of the subscales were also fitted. However, these terms were insignificant and did not improve the goodness-of-fit of the model (details not shown). The results using the UK value set were similar to those using the Japanese value set and are therefore not shown here either. One point worth noting is that CLAD and quantile regression had different point estimates when the UK value set was used.
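The goodness-of-fit measures reported in Table 3 can be computed from paired observed and predicted values, as in the sketch below. The squared correlation, MSE and MAD follow the definitions in the text. The adjusted r² uses the conventional OLS-style formula 1 − (1 − r²)(n − 1)/(n − p − 1), with p counted as the number of predictors excluding the intercept; that is an assumption of this sketch, since the exact expression used in the study is not reproduced in the text. The example values are made up.

```python
def fit_measures(observed, predicted, n_predictors):
    """Squared correlation, adjusted r^2, MSE and MAD for one model."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    r2 = cov * cov / (var_o * var_p)
    # ASSUMPTION: conventional adjusted R^2 form; p excludes the intercept.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)
    mse = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n
    mad = sum(abs(o - p) for o, p in zip(observed, predicted)) / n
    return {"r2": r2, "adj_r2": adj_r2, "mse": mse, "mad": mad}


# Made-up utility values for illustration only.
observed = [0.5, 0.7, 0.9, 1.0]
predicted = [0.6, 0.7, 0.8, 0.9]
measures = fit_measures(observed, predicted, n_predictors=1)
```

Reporting the MSE and MAD side by side reflects the point made in the statistical analysis section: the former tends to favor OLS and the latter tends to favor CLAD and quantile regression, so neither should be read in isolation.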
The observed and predicted EQ-5D-5L utility index values by Models 3 through 5 at the baseline survey were compared (Table 4). For central tendency, the five methods performed differently. As a consequence of the estimation process, the mean of the predicted values based on OLS was always the same as that of the observed values, but the median of the predicted values was larger than that of the observed ones. The Tobit models tended to produce predicted values larger than the observed ones, hence resulting in larger means and medians. CLAD, quantile regression and logistic quantile regression had mean predicted values slightly smaller than the observed mean, but their medians were also larger than the observed median. For dispersion, however, the predicted values by all methods had a smaller spread than the observed values. Compared with the observed values, the predicted values by each model had a smaller standard deviation, a larger minimum and 10th percentile, and a smaller 90th percentile and maximum. It was possible for the predicted EQ-5D-5L utility index to fall outside the defined range of −0.111 to 1 (for the Japanese value set). However, this only happened in less than 3% of the sample in Models 3 and 4 when the Tobit method was used.
Table 5 presents the mean observed and predicted EQ-5D-5L utility index by Models 3 through 5 classified by performance status at follow-up. The utility index was significantly overestimated in the group with performance status of 1, but underestimated in the group with performance status of 0 only when the logistic quantile regression was used. The largest differences between observed and predicted values were approximately 0.06. The predicted utility index showed a significantly decreasing trend with performance status (all p-values for trend < 0.001).
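The group-level comparison by performance status can be sketched as below. The records are made-up values, the function and variable names are hypothetical, and the threshold is passed in as a parameter (for example a minimally important difference) rather than fixed. The signed-rank test used in the study is not implemented here.

```python
from collections import defaultdict


def compare_by_status(records):
    """Mean observed and predicted utility within each performance status.

    `records` is an iterable of (status, observed, predicted) tuples.
    The difference is defined as predicted mean minus observed mean.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for status, obs, pred in records:
        acc = sums[status]
        acc[0] += obs
        acc[1] += pred
        acc[2] += 1
    table = {}
    for status, (obs_sum, pred_sum, n) in sorted(sums.items()):
        table[status] = {
            "n": n,
            "observed": obs_sum / n,
            "predicted": pred_sum / n,
            "difference": (pred_sum - obs_sum) / n,
        }
    return table


def within_threshold(table, threshold):
    """True if every group-level mean difference is below the threshold."""
    return all(abs(row["difference"]) < threshold for row in table.values())


# Made-up follow-up data: (performance status, observed, predicted).
records = [
    (0, 0.90, 0.86), (0, 0.80, 0.82),
    (1, 0.70, 0.76), (1, 0.60, 0.62),
    (2, 0.40, 0.45),
]
table = compare_by_status(records)
ok = within_threshold(table, threshold=0.06)
```

A positive difference indicates overestimation by the mapping algorithm within that performance status level, and a negative one indicates underestimation.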

Discussion
The regression analyses showed that the EQ-5D-5L index of the breast cancer patients in our sample was best predicted by the model consisting of four FACT-B subscales, i.e., PWB, EWB, FWB and BCS (Model 4). Although the coefficients of BCS and EWB were significant in Model 3 and Model 5, respectively, for most regression methods, they were not both significant in Model 4. However, this study aimed to construct a model that best predicts the utility index, so whether the regression coefficients are statistically significant is a secondary consideration. The most important criterion is the accuracy of the prediction. The observed and predicted values correlated more closely in Model 4 than in Models 3 and 5 even after penalizing for model complexity, and the MSE and MAD were the smallest in Model 4. Although the BCS contains several emotion-related items, those items are more specific to breast cancer patients than those in the EWB subscale, e.g., "I feel sexually attractive," "I am able to feel like a woman", etc. Therefore, it is appropriate to include both BCS and EWB in the model. On the other hand, due to a lack of social-related items in the EQ-5D-5L, SWB was not associated with the utility index. In the current study and two previous mapping studies of the FACT-G and FACT-Prostate, the coefficient for SWB was negative, although insignificant [7,8]. This is counter-intuitive, as one would not expect better social well-being to lead to lower health utility, but it is possible that patients in declining health perceive social-related quality-of-life more positively than those in better health.
In this study, OLS had the best goodness-of-fit, while the performance of CLAD and quantile regression was close to that of OLS, but the Tobit model and logistic quantile regression did not perform well. Previous studies showed that CLAD and OLS were superior to the Tobit model in developing a mapping algorithm to the EQ-5D utility index, but whether CLAD or OLS performed better, or in what situation (e.g. proportion of patients attaining full health) one method outperformed the other, was inconclusive [28,30,31]. The OLS approach has some merits that are not shared by other methods, such as well-developed diagnostic checking techniques and its availability in common statistical packages. In fact, as reported in two recent reviews, OLS was the most commonly used method in mapping studies [4,32]. Both reviews, however, raised concerns about the quality of the mapping algorithms using OLS. Hence, further research on the usability of OLS in mapping studies is warranted. Similar to our findings from the Japanese value set, Austin et al. also obtained identical point estimates in CLAD and quantile regression [22]. It has been pointed out that Tobit and CLAD models should not be used for analyzing a health utility index because it is conceptually bounded from above by 1, rather than censored at 1 [24]. In contrast, quantile regression recognizes that the data truly cannot exceed the upper boundary, as opposed to assuming censoring at the boundary. Therefore, quantile regression is conceptually more appropriate than Tobit or CLAD. Logistic quantile regression, although not performing well in the current study, has the theoretical advantage that the logistic transform can deal with the boundary problem [26]. This not only suits the distributional property of the health utility index, but also guarantees that the predicted values would not exceed the boundary of 1.
Table 3 Coefficient estimates and goodness-of-fit measures of various regression models mapping the FACT-B subscales to EQ-5D-5L utility index based on baseline survey (N = 238)
The r² values were no larger than 0.5 for all models (Table 3), implying that at least half of the variance in the data was not accounted for. This moderate level of accuracy is not uncommon in studies mapping FACT instruments to the EQ-5D utility index, e.g., R² ranged from 0.317 to 0.451 for the FACT-G [7], from 0.535 to 0.582 for the FACT-Prostate [8], and from 0.328 to 0.499 for the FACT-Melanoma [9]. On the other hand, the models were also evaluated in terms of the mean utility index in patients with different levels of performance status at follow-up. Some of the differences between the mean observed and predicted values within the same level were statistically significant. However, the differences were small and less than the minimally important differences of the EQ-5D for cancer patients (0.08 for the UK value set and 0.06 for the US value set) [33]. Thus we believe that the differences were clinically unimportant. Furthermore, the mean predicted values showed a clear trend in relation to performance status, suggesting good discriminating ability. Hence, the mapping algorithm appears to be appropriate for use at the group level, but may not be so at the individual level.
We also tried mapping the individual items, instead of the subscales, of the FACT-B to the EQ-5D-5L utility index using OLS with stepwise selection. Although the model using items as predictors obtained r², MSE and MAD comparable to those of Model 4 using OLS in Table 3, it resulted in a smaller adjusted r² due to a larger number of predictors, which implies a more complex model. More importantly, in practice, missing values can be imputed by the "half-rule" if subscales are being mapped to the utility index, but may not be properly handled if individual items are used. Another possibility for deriving a FACT-B preference-based score is direct valuation. However, unlike the EQ-5D, which has only 5 items, the FACT-B has 37 items, which makes direct valuation difficult due to the large number of possible health states. Therefore, it is not feasible in practice.
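The scale of the direct-valuation problem can be illustrated with a short calculation. The sketch assumes that every combination of item responses counts as a distinct health state and that each of the 37 FACT-B items has five response levels, as implied by the 5-point Likert scale described earlier; the function name is hypothetical.

```python
def n_health_states(n_items: int, n_levels: int) -> int:
    """Number of distinct response patterns for an instrument."""
    return n_levels ** n_items


# EQ-5D-5L: five dimensions, five levels each.
eq5d5l_states = n_health_states(n_items=5, n_levels=5)

# FACT-B: 37 items, five response levels each (assumption stated above).
factb_states = n_health_states(n_items=37, n_levels=5)

# How many times larger the FACT-B state space is than the EQ-5D-5L one.
ratio = factb_states // eq5d5l_states
```

Under these assumptions the EQ-5D-5L describes 3125 health states, whereas the FACT-B describes on the order of 10^25, which gives a sense of why valuing FACT-B states directly is impractical.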
A limitation of this study is that the value set for the 5-level EQ-5D instrument was based on a crosswalk project that linked the EQ-5D-3L to the EQ-5D-5L [17]. Some recently published studies have reported the value sets of several Asian countries for the EQ-5D-3L but not for the EQ-5D-5L. The EQ-5D-5L value sets are best estimated through direct valuation of the health states using methods such as time trade-off. Nevertheless, this crosswalk algorithm is currently the only one available for converting the EQ-5D-5L responses to a utility index, and the results are robust, as indicated by a sensitivity analysis using the UK value set. In fact, both Japan and Singapore are highly modernized and their well-educated populations are exposed to international influence. In a review of the development of HRQoL instruments in Asian languages, Cheung and Thumboo found that QoL concepts do not vary much across cultures, and items that are highly specific to a society are uncommon [34]. Sakthong et al. compared the psychometric properties of 3 value sets of the EQ-5D-3L using a sample of individuals from Thailand, another south-east Asian country, and found that the Japanese value set had better test-retest reliability, convergent validity and known-group validity than the UK and US value sets [35]. Therefore, the use of the Japanese value set, though not ideal, is expected to be appropriate for Singapore. That said, our findings should be verified when a local value set becomes available in the future. Moreover, nearly 90% of the patients in this study had a performance status of 0 to 1. This may limit the generalizability of the results to patients with more severe disease. Furthermore, instead of answering both the English and Chinese versions of the instruments, bilingual subjects only selected the language they preferred to use. This limited head-to-head comparison between the language versions and assessment of equivalence. That said, survey language was not significantly associated with the EQ-5D-5L utility index and neither improved the goodness-of-fit nor qualitatively influenced the coefficient estimates of the other independent variables, suggesting that pooling the data from the English and Chinese versions for analysis is reasonable.
In conclusion, this study confirmed the feasibility of mapping the FACT-B subscale scores to the EQ-5D-5L utility index. Among the various regression methods tested, OLS not only provided the best goodness-of-fit measures, but is also the most widely available in statistical packages for analysis and diagnostic checking. Hence, we recommend OLS for mapping the FACT-B to the EQ-5D-5L utility index. The findings are useful in cost-utility and quality-adjusted life-year analyses for breast cancer patients when only FACT-B data are available.

Competing interests
We declare that we have no potential conflict of interest; the funding agencies have no involvement in any procedure of this study. This manuscript has not been published and is not under consideration for publication elsewhere. We have full control of all primary data and we agree to allow the journal to review the data if requested.