Mapping SF-36 onto the EQ-5D index: how reliable is the relationship?
© Rowen et al; licensee BioMed Central Ltd. 2009
Received: 14 October 2008
Accepted: 31 March 2009
Published: 31 March 2009
Mapping from health status measures onto generic preference-based measures is becoming a common solution when health state utility values are not directly available for economic evaluation. However, the accuracy and reliability of the models employed are largely untested, and there is little evidence of their suitability in patient datasets. This paper examines whether mapping approaches are reliable and accurate in terms of their predictions for a large and varied UK patient dataset.
SF-36 dimension scores are mapped onto the EQ-5D index using a number of different model specifications. The predicted EQ-5D scores for subsets of the sample are compared across inpatient and outpatient settings and medical conditions. This paper compares the results to those obtained from existing mapping functions.
The model including SF-36 dimensions, squared and interaction terms estimated using random effects GLS has the most accurate predictions of all models estimated here and existing mapping functions as indicated by MAE (0.127) and MSE (0.030). Mean absolute error in predictions by EQ-5D utility range increases with severity for our models (0.085 to 0.34) and for existing mapping functions (0.123 to 0.272).
Our results suggest that models mapping the SF-36 onto the EQ-5D have similar predictions across inpatient and outpatient settings and medical conditions. However, the models overpredict for more severe EQ-5D states; this problem is also present in the existing mapping functions.
Clinical trials use a multitude of health status measures to measure health and health-related quality of life. However, most of these measures cannot be used in assessments of cost effectiveness using cost per Quality Adjusted Life Year (QALY). Preference-based measures such as the EQ-5D are commonly used for this purpose, but are not always included in clinical studies. One solution to this problem is to apply a mapping function to convert non-preference-based health data into one of the generic preference-based measures; this is helpful to those submitting evidence to agencies such as NICE. However, the accuracy and reliability of the mapping models employed are largely untested, and there is little evidence of their suitability in patient datasets.
A recent review of mapping non-preference-based measures onto generic preference-based measures found 29 studies. However, most of these used simple OLS modelling procedures on comparatively small data sets. Further, existing studies have neglected to investigate the robustness of the models across patient data sets.
The purpose of this paper is to examine whether mapping models are reliable and accurate in terms of their predictions for a large and varied patient dataset. The mapping relationship examined here is between the EQ-5D index, a generic preference-based measure of health related quality of life and the SF-36, a generic non-preference-based health status measure commonly used in clinical trials. A mapping relationship is estimated using a range of techniques and statistical specifications. We examine the mapping relationship across inpatient and outpatient settings and medical conditions according to ICD classification. Furthermore, we compare the mapping approach used here to existing models [3, 4] in terms of predictive performance.
The SF-36 assesses health across eight dimensions using 36 items. The SF-36 produces a score on a 0–100 scale for each of the eight dimensions, which are specific health domains such as physical functioning, social functioning and vitality. These scores are not comparable across dimensions and are not based on individual preferences, so they cannot be used to generate QALYs. The SF-36 can be used to generate a preference-based index via the SF-6D.
The EQ-5D is the most widely used generic preference-based measure of health-related quality of life, and produces utility scores anchored at 0 for dead and 1 for perfect health. The utility scores represent preferences for particular health states. The descriptive system has 5 dimensions (mobility, self-care, usual activity, pain/discomfort and anxiety/depression), each with 3 levels (no problems, some problems, extreme problems), which together create 243 unique health states. This study uses the UK TTO value set in its main analysis. The EQ-5D valued using the UK TTO value set is preferred by NICE. The SF-6D has been found to differ from the EQ-5D, and so to achieve comparability between studies using different measures this paper explores an alternative strategy of mapping.
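The state count follows from three levels on each of five dimensions (3^5 = 243). As a minimal sketch, the Python below enumerates the states and scores them with an additive tariff of the kind used for the UK TTO value set. The decrement values are those commonly reported for the Dolan (1997) model and are an assumption here, as are all function and variable names; they should be checked against the published value set before any real use.

```python
from itertools import product

DIMENSIONS = ("mobility", "self_care", "usual_activity",
              "pain_discomfort", "anxiety_depression")

# Decrements for level 2 and level 3 on each dimension.
# ASSUMPTION: values as commonly reported for the UK TTO tariff; verify before use.
DECREMENTS = {
    "mobility": (0.069, 0.314),
    "self_care": (0.104, 0.214),
    "usual_activity": (0.036, 0.094),
    "pain_discomfort": (0.123, 0.386),
    "anxiety_depression": (0.071, 0.236),
}
CONSTANT = 0.081  # subtracted for any state other than 11111
N3 = 0.269        # subtracted once if any dimension is at level 3


def uk_tto_utility(state):
    """Score a 5-tuple of levels (1, 2 or 3) with the additive tariff above."""
    if all(level == 1 for level in state):
        return 1.0
    utility = 1.0 - CONSTANT
    for dimension, level in zip(DIMENSIONS, state):
        if level > 1:
            utility -= DECREMENTS[dimension][level - 2]
    if 3 in state:
        utility -= N3
    return round(utility, 3)


# Every combination of three levels over five dimensions.
states = list(product((1, 2, 3), repeat=5))
```

Under these assumed values the worst state, 33333, scores -0.594, which is why the index can take values below 0.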
Regression analysis is used to examine the relationship between the EQ-5D utility score and the SF-36 using the 8 dimension scores (physical functioning, role-physical, bodily pain, general health, vitality, social functioning, role-emotional and mental health), squared dimension scores and interaction terms derived as the product of two dimension scores. The dependent variable, the EQ-5D utility score, is measured on a -1 to 1 scale. The 8 dimension scores of the SF-36 are rescaled onto a 0–1 scale to enable easier interpretation of the results, and the squared terms and interaction terms are generated using the rescaled scores.
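Written out from the definitions that follow, the full specification with squared and interaction terms can be sketched as below. The coefficient symbols $\alpha$, $\beta$, $\theta$ and $\delta$ are labels assumed here for illustration; only $y$, $x$, $r$, $z$ and $\varepsilon_{ij}$ are named in the text.

```latex
y_{ij} = \alpha + \beta' x_{ij} + \theta' r_{ij} + \delta' z_{ij} + \varepsilon_{ij}
```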
where i = 1,2,..., n represents individual respondents and j = 1,2,..., m represents the 8 different dimensions. The dependent variable, y, represents the EQ-5D utility score, x represents the vector of SF-36 dimensions, r represents the vector of squared terms, z represents the vector of interaction terms and ε_ij represents the error term. This is an additive model which imposes no restrictions on the relationship between dimensions. The squared terms are designed to pick up non-linearities in the relationship between dimension scores and the EQ-5D index. There is no reason for this relationship to be linear, and there is evidence for physical functioning, for example, that the same differences in scores at the lower end of the scale indicate larger differences in functioning than at the upper end. Interaction terms are important since there is evidence from other measures that dimensions are not additive. Statistical measures of explanatory power, predictive ability, and model specification are reported.
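The construction of the regressors described above can be sketched in a few lines of Python. The abbreviations, their ordering and the function names are illustrative assumptions; the sketch rescales the eight 0–100 dimension scores to 0–1, then appends the eight squared terms and the 28 pairwise products, giving 44 regressors per observation.

```python
from itertools import combinations

# ASSUMPTION: abbreviations and ordering chosen for illustration.
DIMS = ["PF", "RP", "BP", "GH", "VIT", "SF", "RE", "MH"]
PAIRS = list(combinations(range(len(DIMS)), 2))  # 28 unordered pairs


def feature_names():
    """Names of the 44 regressors, in the order produced by feature_row."""
    return (
        DIMS
        + [f"{d}^2" for d in DIMS]
        + [f"{DIMS[a]} x {DIMS[b]}" for a, b in PAIRS]
    )


def feature_row(scores_0_100):
    """Map one respondent's eight 0-100 dimension scores to the 44 regressors."""
    if len(scores_0_100) != len(DIMS):
        raise ValueError("expected one score per SF-36 dimension")
    x = [score / 100.0 for score in scores_0_100]   # rescale to 0-1
    squared = [value * value for value in x]        # squared terms
    interactions = [x[a] * x[b] for a, b in PAIRS]  # pairwise products
    return x + squared + interactions
```

Squaring and multiplying after rescaling keeps every regressor on the 0–1 interval, which makes the estimated coefficients easier to compare.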
The sample used here is a patient dataset (described below) where respondents are included each time they are treated, and hence some respondents have multiple observations. Random effects models are used to take account of this data structure. The estimated models are used to generate predicted EQ-5D scores. Predictive ability is assessed using line graphs of the observed and predicted EQ-5D utility scores ordered by observed tariff value of EQ-5D state, mean error, mean absolute error and mean squared error.
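The three summary measures of predictive ability can be computed as in the sketch below, overall or grouped by observed utility range. The sign convention (predicted minus observed, so that a positive mean error indicates overprediction), the cut points and the function names are assumptions made for illustration.

```python
from bisect import bisect_right


def prediction_errors(observed, predicted):
    """Return mean error, mean absolute error and mean squared error."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    # ASSUMPTION: error = predicted - observed, so positive means overprediction.
    errors = [p - o for o, p in zip(observed, predicted)]
    n = len(errors)
    return {
        "ME": sum(errors) / n,
        "MAE": sum(abs(e) for e in errors) / n,
        "MSE": sum(e * e for e in errors) / n,
    }


def errors_by_range(observed, predicted, cut_points):
    """Summarise errors within ranges of the observed utility.

    cut_points must be sorted ascending; range k holds observed values
    with cut_points[k-1] <= value < cut_points[k].
    """
    groups = {}
    for o, p in zip(observed, predicted):
        obs, pred = groups.setdefault(bisect_right(cut_points, o), ([], []))
        obs.append(o)
        pred.append(p)
    return {k: prediction_errors(obs, pred) for k, (obs, pred) in sorted(groups.items())}
```

Reporting the errors by range matters because an acceptable overall MAE can conceal large errors concentrated in the most severe states.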
where y*_i is the observed EQ-5D utility score and y_i is the bounded measure of the EQ-5D score.
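For reference, a tobit model with censoring from above is conventionally written as below. This is the textbook form, assuming an upper limit of 1 (full health), an underlying score $y_i^*$ driven by regressors $x_i$ with coefficients $\beta$, and normally distributed errors; the random effects term is omitted for brevity, and the expression is offered as a sketch rather than as the exact specification estimated here.

```latex
y_i^* = x_i'\beta + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2), \qquad
y_i =
\begin{cases}
  y_i^* & \text{if } y_i^* < 1 \\
  1     & \text{if } y_i^* \ge 1
\end{cases}
```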
However, the tobit model also produces biased estimates in the presence of heteroscedasticity or non-normality [10, 11]. The censored least absolute deviations (CLAD) model is also used here since it produces consistent estimates in the presence of heteroscedasticity and non-normality [10, 12]. STATA version 9 was used for all regression analysis and CLAD was performed using programs written for ; SPSS version 12 was used for statistical analysis.
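For reference, the CLAD estimator for data censored from above chooses the coefficients that minimise the sum of absolute deviations between the observed score and the censored linear prediction. The form below is the standard textbook objective, with an upper limit of 1 assumed for the EQ-5D and $x_i$, $\beta$ used as illustrative symbols; it is a sketch rather than the exact routine run in STATA.

```latex
\hat{\beta}_{\mathrm{CLAD}} = \arg\min_{\beta} \sum_{i=1}^{n} \left| \, y_i - \min\!\left(1,\; x_i'\beta\right) \right|
```

Because the objective targets the conditional median rather than the conditional mean, it does not rely on normal or homoscedastic errors.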
Reliability and robustness
In order to examine whether the estimated relationships are reliable and robust across inpatient and outpatient settings and medical conditions, we estimate model (3) as outlined above for subsets of the sample data (see note i). The model is estimated for inpatients and outpatients and for the medical conditions of neoplasms, diseases of the circulatory system and diseases of the digestive system as measured according to ICD classifications C, I and K respectively.
Comparison to existing mapping functions
Our models are compared to existing approaches [3, 4, 10] to determine whether their mapping approaches are more or less reliable for a patient dataset. The existing models from the literature are estimated using the published results and algorithms rather than re-estimating the models using our dataset. We take this approach because mapping is used in economic evaluations to estimate the EQ-5D using the SF-36 (or SF-12) when this is the only health status measure that has been included in the trial. Therefore in practical applications the published results and algorithms are used and it is not feasible to re-estimate the model.
Franks et al. regress the EQ-5D utility score on PCS-12 and MCS-12, squared terms and cross-products using OLS. PCS and MCS are the physical and mental component summary scores estimated using factor analysis and shown to contain most of the information contained in the 8 dimensions of the SF-36. In accordance with this approach, PCS-12 and MCS-12 are centred on the means used in the paper and the published coefficients are used to produce predicted EQ-5D utility scores (see note ii). Another study  uses similar variables and estimation techniques to  in order to predict EQ-5D scores from the SF-12, and hence the model is not analysed here separately.
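Applying a published function of this form amounts to centring the two summary scores and evaluating a quadratic. The sketch below shows the mechanics only: the centring means and every coefficient in PLACEHOLDER_COEF are hypothetical values invented for illustration, not the published estimates, and the function and key names are likewise assumptions.

```python
# HYPOTHETICAL values for illustration only; these are NOT published coefficients.
PLACEHOLDER_COEF = {
    "intercept": 0.8,
    "pcs": 0.01,
    "mcs": 0.005,
    "pcs_sq": -0.0001,
    "mcs_sq": -0.0001,
    "pcs_x_mcs": 0.0002,
}


def predict_from_summary_scores(pcs, mcs, pcs_mean, mcs_mean, coef):
    """Quadratic mapping from centred PCS-12 and MCS-12 to a predicted utility."""
    p = pcs - pcs_mean  # centre on the mean used when the model was estimated
    m = mcs - mcs_mean
    return (
        coef["intercept"]
        + coef["pcs"] * p
        + coef["mcs"] * m
        + coef["pcs_sq"] * p * p
        + coef["mcs_sq"] * m * m
        + coef["pcs_x_mcs"] * p * m
    )
```

With centred scores the intercept is the predicted utility for a respondent at the mean of both summary scores, which is why the centring means must match those used when the coefficients were estimated.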
Gray et al. use a response mapping approach that uses a multinomial logit model to estimate the probability that a respondent will choose a particular level for each dimension of the EQ-5D using responses to the 12 items included in the SF-12 (general health, climbing stairs, moderate activities, accomplish less due to physical health, work limitations, accomplish less due to emotional problems, work carefully, pain interference, calm, energy, down-hearted and low, interference with social activities). Subsequently predicted EQ-5D level responses for each dimension are generated using Monte Carlo simulation methods and the corresponding EQ-5D utility score for that health state is calculated. We use the available algorithm to predict EQ-5D utility scores (see note iii).
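The simulation step of response mapping can be sketched as follows, taking the fitted level probabilities as given. This is not the published algorithm: the function names, the use of a seeded generator, the averaging over draws and the caller-supplied `tariff` function are all assumptions made to illustrate the idea.

```python
import random


def simulate_state(level_probabilities, rng):
    """Draw one EQ-5D state from per-dimension probabilities of levels 1, 2 and 3."""
    state = []
    for probs in level_probabilities:
        if len(probs) != 3 or abs(sum(probs) - 1.0) > 1e-9:
            raise ValueError("each dimension needs three probabilities summing to 1")
        state.append(rng.choices((1, 2, 3), weights=probs, k=1)[0])
    return tuple(state)


def simulated_mean_utility(level_probabilities, tariff, draws=1000, seed=0):
    """Average the tariff value over Monte Carlo draws of the predicted state.

    ASSUMPTION: averaging over draws is one way to summarise the simulation;
    tariff is any function mapping a 5-tuple of levels to a utility score.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        total += tariff(simulate_state(level_probabilities, rng))
    return total / draws
```

Because the utility is computed from simulated level responses rather than predicted directly, every predicted value corresponds to a real EQ-5D health state or to an average over such states.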
Sullivan and Ghushchyan  regress the US EQ-5D utility score on PCS-12 and MCS-12, the product of PCS-12 and MCS-12 and sociodemographic variables using OLS, tobit and CLAD. It is not appropriate to use the exact model  as they use the US-based EQ-5D values  rather than the UK-based values  and further only report models including sociodemographic variables unavailable in our dataset. Instead we have used the tobit and CLAD estimation techniques suggested in  as outlined above and re-estimated the model using our dataset.
The Health Outcomes Data Repository, HODaR, is a dataset collated by Cardiff Research Consortium. The data are collected from a prospective survey of inpatients and outpatients at Cardiff and Vale NHS Hospitals Trust, which is a large University hospital in South Wales, UK. The survey is linked to existing routine hospital health data to provide a dataset with sociodemographic, health-related quality of life and ICD classification data (see note iv). The survey includes all subjects aged 18 years or older and excludes individuals who are known to have died. The survey also excludes people with a primary diagnosis on admission of a psychological illness or learning disability. As well as information on inpatients, the survey includes outpatient clinics on a rotational basis where all patients within the selected clinic are surveyed. The response rate in HODaR prior to October 2003 was around 36%, and strategies were subsequently implemented to improve response rates to around 50%.
The inpatient sample has 31,236 eligible observations across 27,620 individuals from August 2002 to November 2004, and of these there are 25,783 complete responses across 23,179 individuals for SF-36 and EQ-5D questions and hence this is the sample used here. The outpatient sample has 9,081 eligible observations across 8,610 individuals collected from June 2002 to November 2004, and of these there are 7,465 complete responses across 7,122 individuals. The dataset covers a wider range of conditions and severity than the general population datasets used in existing mapping approaches, and hence may be more similar to datasets used in economic evaluation.
[Table: Descriptive data for the inpatient and outpatient samples, with UK population norms for comparison. Rows: SF-36 dimension scores; SF-12 summary scores (physical component score, mental component score).]
[Table: Prediction models for inpatients using dimensions, squared terms and interaction terms, estimated using random effects GLS. Rows: SF-36 dimension scores (physical functioning, PF; role physical, RP; bodily pain, BP; general health, GH; social functioning, SF; mental health, MH), their squared terms, and the 28 pairwise interaction terms among PF, RP, BP, GH, VIT, SF, RE and MH (PF × RP through RE × MH).]
[Table: Mean error, mean absolute error and mean squared error of predicted compared to actual utility scores by EQ-5D utility range, for the random effects GLS models, random effects tobit models, CLAD model, Franks et al. model and Gray et al. model.]
Inpatients and outpatients
Figure 1 shows the observed and predicted EQ-5D scores for inpatients and outpatients, ordered by observed tariff value of the EQ-5D state. The predictions are generated using model (3) estimated using random effects GLS. The mapping relationship follows the same pattern across inpatient and outpatient settings, and both sets of predictions overpredict for more severe EQ-5D states. Wald tests of whether the coefficients estimated for inpatients equal those estimated for outpatients, using exactly the same specification, indicate that the coefficients are not equal and hence that the models are not robust to different samples. However, differences in predictions are small, with a mean absolute difference at the state level of 0.069 and a mean squared difference of 0.012. Wald test statistics were also calculated for subsets of the inpatient sample defined by medical condition, using the ICD classifications with the largest number of observations in the dataset: neoplasms (n = 2,574), diseases of the circulatory system (n = 3,522) and diseases of the digestive system (n = 3,114), corresponding to ICD classifications C, I and K respectively. The test statistics again indicate that the estimated coefficients are not equal and hence are not robust across medical conditions, but differences in predictions are small, with a highest mean absolute difference at the state level of 0.054 and a highest mean squared error of 0.005.
Comparison to existing mapping
Re-estimation of the EQ-5D
The patient dataset used here covers a wider range of conditions and severity of health than general population datasets. Our results suggest that the mapping relationship between the EQ-5D index and the SF-36 for a large and varied UK patient dataset is reliable and accurate across inpatient and outpatient settings and medical conditions. One advantage of using this approach in the UK is that the EQ-5D is currently recommended by NICE (2008) for use in economic evaluation. NICE (2008) also state that mapping can be used when EQ-5D was not included in the trial. However, our results indicate that the mapping relationship is not accurate and reliable for more severe EQ-5D health states. The inclusion of squared and interaction terms in the models improves diagnostics, mean error, MAE and MSE, suggesting that the mapping relationship is non-linear and that dimensions are not additive. The mapping approach used here is compared to existing approaches [3, 4] and all suffer from overprediction for more severe EQ-5D health states. The added complexity of the response mapping approach used by Gray et al. does not seem to improve the predictability for all health states in comparison to our approach.
One potential reason for the overprediction for more severe health states is the floor effects of the SF-36. We have tried to account for these floor effects by using squared terms and interaction terms in our model, but, as the figures illustrate, this does not resolve the problem. We also tried re-estimating the EQ-5D utility tariff using the original dataset used to estimate the UK tariff but omitting the N3 term. Although Figure 3 demonstrates better predictions for more severe health states, the problem of overprediction is still evident. Indeed, if the preferences regarding more severe health states are a property of the dataset rather than the estimation technique, then the valuation produced here will still demonstrate the same properties. We also estimated our model using the US-based EQ-5D values, and although Figure 4 demonstrates better predictions for more severe health states, again the problem of overprediction is still evident.
The importance of the problem of overprediction in economic evaluations is difficult to measure, since it depends on the patient group and the effect of treatments. Ara and Brazier predict mean cohort EQ-5D utility values using mean cohort scores for the dimensions of the SF-36 from published datasets. They find mean errors of 0.285 and 0.158 in prediction for the 5 out of 63 cohorts in an out of sample dataset with mean EQ-5D utility value below 0.175 and between 0.175 and 0.35 respectively. The impact at the group level may be less important since few patients have EQ-5D utility values below 0.5: the inpatient and outpatient datasets used here each have 17% of observations with an EQ-5D utility value below 0.5, suggesting that relatively few observations will be affected by the overprediction for more severe states that is presented here. For most studies the overprediction may therefore not matter; it becomes a concern only where many patients have EQ-5D utility values below 0.5.
The results suggest that there are differences in the EQ-5D and SF-36 health status measures for more severe health states which make mapping unreliable for these states. Another finding is that the vitality, role physical and role-emotional dimensions of the SF-36 did not significantly affect the EQ-5D index, hence interventions aimed at improving these dimensions will not be reflected in the mapping model. However, these domains were found to be important to members of the public in the valuation of the SF-6D. Mapping is increasingly being used between condition specific measures and generic measures of health (refer to ). However, the lack of overlap in the dimensions covered by many condition specific measures and EQ-5D limits the usefulness of this approach, as these problems may be worsened if the health domains included in the measures are different.
Mapping enables utility scores to be estimated in trials where a non-preference based health status measure has been used but no generic preference-based measure. Our results suggest that approaches mapping the SF-36 onto the EQ-5D are robust across setting and medical condition but overpredict for more severe EQ-5D states. Our results raise doubt over the suitability of mapping for patient datasets which have a proportion of subjects with poorer health or where dimensions are not represented in the target measure. Potential policy implications are that mapping the SF-36 onto the EQ-5D can be useful, but may not be suitable for all populations.
i The estimation results are not reported here but are available from the authors.
ii Other models are estimated in  but these are not analysed here as these models use demographic variables not available in the dataset used here. Furthermore it was found that more complex models explained only minimal additional variance.
iii The algorithm is available from the HERC website http://www.herc.ox.ac.uk/downloads/supp_pub/sf12eq5d
iv See  for further details on HODaR.
We would like to thank Cardiff Research Consortium for use of the HODaR data. We would also like to thank Fotios Psarras for preliminary analysis.
1. NICE: Guide to the methods of technology appraisal. NICE, London; 2008. [http://www.nice.org.uk/aboutnice/howwework/devnicetech/technologyappraisalprocessguides/guidetothemethodsoftechnologyappraisal.jsp]
2. Brazier J, Yang Y, Tsuchiya A: Review of methods for mapping between condition specific measures onto generic measures of health. Report prepared for the Office of Health Economics; 2007.
3. Franks P, Lubetkin EI, Gold MR, Tancredi DJ, Haomiao J: Mapping the SF-12 to the EuroQol EQ-5D Index in a National US Sample. Medical Decision Making 2004, 24: 247–254. doi:10.1177/0272989X04265477
4. Gray AM, Rivero-Arias O, Clarke PM: Estimating the Association between SF-12 Responses and EQ-5D Utility Values by Response Mapping. Medical Decision Making 2006, 26: 18–29. doi:10.1177/0272989X05284108
5. Brazier J, Roberts J, Deverill M: The estimation of a preference-based measure of health from the SF-36. Journal of Health Economics 2002, 21: 271–292. doi:10.1016/S0167-6296(01)00130-8
6. Dolan P: Modeling Valuations for EuroQol Health States. Medical Care 1997, 35: 1095–1108. doi:10.1097/00005650-199711000-00002
7. Brazier J, Roberts J, Tsuchiya A, Busschbach J: A comparison of the EQ-5D and SF-6D across seven patient groups. Health Economics 2004, 13: 873–884. doi:10.1002/hec.866
8. Brazier J, Harper R, Thomas K, Jones N, Underwood T: Deriving a preference based single index measure from the SF-36. Journal of Clinical Epidemiology 1998, 51: 1115–1129. doi:10.1016/S0895-4356(98)00103-6
9. Feeny D, Furlong W, Torrance GW, Goldsmith CH, Zhu Z, DePauw S, Denton M, Boyle M: Multiattribute and Single-Attribute Utility Functions for the Health Utilities Index Mark 3 System. Medical Care 2002, 40: 113–128. doi:10.1097/00005650-200202000-00006
10. Sullivan PW, Ghushchyan V: Mapping the EQ-5D Index from the SF-12: US General Population Preferences in a Nationally Representative Sample. Medical Decision Making 2006, 26: 401–409. doi:10.1177/0272989X06290496
11. Greene WH: Econometric Analysis. New Jersey: Prentice Hall; 2000.
12. Powell JL: Least Absolute Deviations Estimation for the Censored Regression Model. Journal of Econometrics 1984, 25: 303–325. doi:10.1016/0304-4076(84)90004-6
13. Chay KY, Powell JL: Semiparametric Censored Regression Models. Journal of Economic Perspectives 2001, 15: 29–42.
14. Ware JE, Kosinski M, Keller SD: How to score the SF-12 physical and mental health summaries: a user's Manual. Boston: The Health Institute, New England Medical Centre, Boston, MA; 1995.
15. Lawrence WF, Fleishman JA: Predicting EuroQoL EQ-5D Preference Scores from the SF-12 Health Survey in a Nationally Representative Sample. Medical Decision Making 2004, 24: 160–169. doi:10.1177/0272989X04264015
16. Shaw JW, Johnson JA, Coons SJ: US valuation of the EQ-5D health states: development and testing of the D1 valuation model. Medical Care 2005, 43: 203–220. doi:10.1097/00005650-200503000-00003
17. Currie CJ, McEwan P, Peters JR, Patel TC, Dixon S: The Routine Collation of Health Outcomes Data from Hospital Treated Subjects in the Health Outcomes Data Repository (HODaR): Descriptive Analysis from the First 20,000 Subjects. Value in Health 2005, 8: 581–590. doi:10.1111/j.1524-4733.2005.00046.x
18. Kind P, Hardman G, Macran S: UK Population Norms for EQ-5D. In Centre for Health Economics Discussion Paper 172. University of York, York; 1999.
19. Jenkinson C, Layte R, Wright L, Coulter A: The UK SF-36: An analysis and interpretation manual. Oxford: Health Services Research Unit; 1996.
20. Ara R, Brazier J: Deriving an Algorithm to Convert the Eight Mean SF-36 Dimension Scores into a Mean EQ-5D Preference-Based Score from Published Studies (Where Patient Level Data Are Not Available). Value in Health 2008, 11: 1131–1143. doi:10.1111/j.1524-4733.2008.00352.x
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.