Properties of patient-reported outcome measures in individuals following acute whiplash injury

Background: The aim of this study was to assess the acceptability, reliability, validity and responsiveness of the Short-Form Health Survey (SF-12) and its preference-based derivative (SF-6D), the EQ-5D and the Neck Disability Index (NDI) in patients recovering from acute whiplash injury.
Methods: Data from the Managing Injuries of the Neck Trial of 3,851 patients with acute whiplash injury formed the basis of this empirical investigation. The EQ-5D and SF-12 were collected at baseline, and all three outcome measures were then collected at 4 months, 8 months and 12 months post-randomisation. The measures were assessed for their acceptability (response rates), internal consistency, validity (known groups validity and discriminant validity) and their internal and external responsiveness.
Results: Response rates were broadly similar across the measures, with evidence of a floor effect for the NDI and a ceiling effect for the EQ-5D utility measure. All measures had Cronbach’s α statistics of greater than 0.7, indicating acceptable internal consistency. The NDI and EQ-5D utility score correlated more strongly with the physical component scale of the SF-12 than the mental component scale, whilst this was reversed for the SF-6D utility score. The smaller standard deviations in SF-6D utility scores meant there were larger effect sizes for differences in utility score between patients with different injury severity at baseline than for the EQ-5D utility measure. However, the EQ-5D utility measure and NDI were both more responsive to longitudinal changes in health status than the SF-6D.
Conclusions: There was no evidence of differences between the EQ-5D utility measure and NDI in terms of their construct validity, discriminant validity or responsiveness in patients with acute whiplash injury. However, both demonstrated superior responsiveness to longitudinal health changes than the SF-6D.


Introduction
Whiplash injuries are soft tissue injuries of the neck that result from an acceleration-deceleration energy transfer mechanism. The prevalence of whiplash injuries is high and is increasing worldwide, particularly within developed countries [1]. Within the United Kingdom (UK) alone the incidence of whiplash injuries is suggested to be around 400,000 per year [1], with the Association of British Insurers noting a 25% rise in whiplash claims during 2002-2008 [1]. Approximately 30-50% of people suffering whiplash injuries report chronic symptoms [2], with an annual cost to the UK economy in 2002 of over £3.1 billion, made up primarily of health service costs and productivity losses [3]. Various treatments for whiplash associated disorders have been proposed, including advice, active management consultations and physiotherapy sessions, but there has been a lack of evidence for both the effectiveness and cost-effectiveness of these interventions [4]. The Managing Injuries of the Neck Trial (MINT) was conducted to fill some of the gaps in this evidence base [5].
Patient reported outcome (PRO) instruments can be used to measure the effects of whiplash injuries in terms of health-related quality of life (HRQoL), and measure the benefits of interventions aimed at their prevention or alleviation. However, there is currently a paucity of evidence on the measurement properties of these instruments when completed by individuals with whiplash injuries. Patient-reported outcome measures (PROMs) are increasingly important outputs for randomised controlled trials [6], as they provide a scientifically robust way of reflecting the patient perspective in the assessment process [7]. Moreover, trial-based economic evaluations are often reliant on preference-based PROMs to calculate the HRQoL component of the quality-adjusted life-year (QALY) metric. Such measures are favoured by decision-making bodies, including the National Institute for Health and Care Excellence (NICE) in England and Wales, for economic evaluations [8].
With the increasing need for quantitative assessment of the impact of preventive or treatment interventions, it is important to identify appropriate outcome measures for use in patients with whiplash injuries. Furthermore, these measures should ideally possess properties, such as internal consistency and construct validity, that satisfy broader regulatory and reimbursement requirements [9]. The MINT study included two generic instruments, the Short-Form Health Survey version 1 (SF-12) [10] and the preference-based EQ-5D-3L [11], and one neck injury specific measure, the Neck Disability Index (NDI) [12]. Generic instruments are designed to be applicable across a range of health conditions and patient populations, and can be useful to detect unexpected outcomes or side-effects of interventions, which may not be picked up by condition specific measures designed to capture the predicted health status changes. Conversely, more narrowly targeted condition specific measures can provide outputs with a greater clinical relevance, and are often associated with an increased responsiveness compared to generic measures [13]. These differing properties have led to the recommendation for the joint use of generic and condition specific measures in clinical trials [7]. This study compares the three different measures listed above, all of which are commonly used in whiplash injury trials, in terms of their acceptability, reliability, validity and responsiveness in patients with whiplash injuries [14,15].

Study population
Data for this study were drawn from MINT, a pragmatic, cluster randomised controlled trial that recruited patients with acute whiplash injury from 15 NHS emergency departments in the UK [5]. To be eligible for inclusion, patients needed to have a whiplash associated disorder of grades I-III [16]. Patients younger than 18 years of age, with a non-transient loss of consciousness, a Glasgow Coma Score of 12 or less, fractures or dislocations of the spine or other bones, requiring inpatient admission or having a severe psychiatric illness, were excluded. The centres were randomised to provide either active management (including The Whiplash Book [17]) or usual care. Patients with substantial symptoms persisting beyond 3 weeks were eligible for further individual randomisation to either a single or six physiotherapy sessions. Since in this study we are primarily interested in the properties of the outcome measures used within MINT, rather than any evaluation of the interventions in the trial, all MINT participants were included in these analyses, regardless of trial allocation.
Patients who consented to be part of the MINT study at the emergency departments were sent an information letter and questionnaires to complete within three days of attendance, including the SF-12 and EQ-5D health outcome measures. These data, which are used as baseline measurements for the analysis, were returned an average of two weeks post emergency department attendance. Further data were collected by postal questionnaires at 4, 8 and 12 months after the initial emergency department attendance, with SF-12 and EQ-5D data, as well as the NDI, collected at each of these time points. On each follow-up occasion, patients failing to respond within a week were sent a second questionnaire and reminder letter, with those still not responding within a further week called twice over the telephone in an attempt to obtain the core MINT outcome measures (NDI and EQ-5D).

Data collection instruments
The SF-12 consists of 12 questions with a one week recall period measuring various aspects of physical and mental health, from which three summary scores can be extracted. The Physical Component Summary Score (PCS) and Mental Component Summary Score (MCS) are both standardised to have a mean of 50 and a standard deviation of 10 [18], whilst a six dimension health-state classification based on the SF-12, called the SF-6D, can also be constructed, containing 7,500 potential health states with utility values, calculated using the standard gamble technique, ranging from 0.345 to 1 [19].
The EQ-5D contains six items, and asks people about their health state on the day they complete the questionnaire. The first five items ask the respondent to describe their mobility, self-care, usual activities, pain/discomfort and anxiety/depression in the form of a health state classification system. Responses to each of these five dimensions are divided into three ordinal levels coded: (1) no problems; (2) some or moderate problems; and (3) severe or extreme problems. A total of 243 ($3^5$) health states are generated by the EQ-5D descriptive system. Responses to the five item descriptive system can be converted into utility scores using a UK specific tariff [20], calculated from a time trade-off study, taking values between −0.59 and 1, with 1 corresponding to "perfect health" and 0 representing a health state considered to be equivalent to death [11]. The sixth item of the EQ-5D consists of a visual analogue scale (VAS) and asks people to rate their current overall health on a scale from 0 (the worst health state they can imagine) to 100 (the best health state they can imagine).
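As an illustration of the descriptive system described above, the following minimal sketch enumerates the 243 possible health states. The five-digit profile coding (for example "11111" for no problems on any dimension) and the function names are assumptions made for illustration; the sketch deliberately does not attempt to reproduce the UK tariff values.

```python
from itertools import product

# Dimension names and level labels as described in the text.
DIMENSIONS = (
    "mobility",
    "self-care",
    "usual activities",
    "pain/discomfort",
    "anxiety/depression",
)
LEVEL_LABELS = {
    1: "no problems",
    2: "some or moderate problems",
    3: "severe or extreme problems",
}


def all_health_states():
    """Return every profile as a five-digit string (hypothetical coding)."""
    levels = sorted(LEVEL_LABELS)
    return ["".join(str(l) for l in combo)
            for combo in product(levels, repeat=len(DIMENSIONS))]


def describe(state):
    """Pair each digit of a five-digit profile with its dimension name."""
    if len(state) != len(DIMENSIONS) or any(ch not in "123" for ch in state):
        raise ValueError("expected five digits, each 1, 2 or 3")
    return {dim: LEVEL_LABELS[int(ch)] for dim, ch in zip(DIMENSIONS, state)}


if __name__ == "__main__":
    states = all_health_states()
    print(len(states))            # number of distinct profiles
    print(describe("11213"))
```

Converting a profile into a utility score would additionally require the published tariff [20], which is not reproduced here.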
The NDI consists of 10 questions measuring neck pain-related activity restrictions, with each item scored on a scale from 0 (no restriction) to 5 (severe restriction). These scores are summed to give a total ranging from 0 to 50, which is then doubled to give a score from 0 to 100. Vernon et al. have published a categorisation for these NDI scores, with a total score of 4 or less corresponding to no disability, 5-14 mild disability, 15-24 moderate disability, 25-34 severe disability and greater than 34 complete disability [12]. These categorisations are now a commonly used approach in analyses making use of the NDI [21].
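The scoring rules above can be sketched as follows. This is a minimal illustration rather than the trial's scoring code: the function names are hypothetical, the Vernon bands are assumed to apply to the raw 0 to 50 total, and no rule for handling missing items is attempted because none is described in the text.

```python
def ndi_raw_score(items):
    """Sum ten NDI item responses (each 0-5) into a raw total from 0 to 50."""
    if len(items) != 10:
        raise ValueError("the NDI has exactly 10 items")
    if any(not 0 <= x <= 5 for x in items):
        raise ValueError("each item must be scored from 0 to 5")
    return sum(items)


def ndi_scaled_score(raw):
    """Double the raw 0-50 total to rescale it to 0-100."""
    return 2 * raw


def vernon_category(raw):
    """Map a raw 0-50 total to a Vernon disability category.

    Assumption: the published bands refer to the raw (undoubled) total.
    """
    if raw <= 4:
        return "none"
    if raw <= 14:
        return "mild"
    if raw <= 24:
        return "moderate"
    if raw <= 34:
        return "severe"
    return "complete"


if __name__ == "__main__":
    responses = [1, 2, 0, 3, 1, 2, 1, 0, 2, 1]
    raw = ndi_raw_score(responses)
    print(raw, ndi_scaled_score(raw), vernon_category(raw))
```

Note that higher scores indicate greater disability, the opposite direction to the utility measures.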

Statistical analysis
The MINT study contained no specific questions looking at the acceptability of the different outcome measures used, and no information was collected on the reasons for missing data where a particular questionnaire was returned, but not all items within it were completed. Therefore, the acceptability of the measures (EQ-5D (utility), EQ-5D (VAS), SF-6D, SF-12(PCS), SF-12(MCS) and NDI) was assessed by looking at the response rates to each measure at each time point of assessment, as well as the individual item and dimension completion rates [22]. Whilst this will provide less information than would have been available if patients had been directly questioned [23], there is evidence of a link between response rates and acceptability of a questionnaire to respondents [24,25].
The internal consistency of the EQ-5D(utility), SF-6D and NDI, that is, the extent to which multiple items in each scale measure the same underlying concept, was assessed by calculating Cronbach's α coefficients [26]. This has also often been used, in the absence of any better method, as a proxy for the overall reliability of the instrument, that is, the stability and consistency of the concept being measured. Whilst there are questions as to how relevant a concept internal consistency is for preference-based measures [22], an established convention has been to deem a score of ≥0.70 to be sufficient for use in research, and a score of ≥0.90 for broader use in routine clinical practice [25]. It would be expected that the NDI, since it covers a narrower range of outcomes than the EQ-5D and SF-6D, would have the highest Cronbach's α.
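For readers unfamiliar with the statistic, Cronbach's α for a scale of k items is k/(k−1) multiplied by one minus the ratio of the summed item variances to the variance of the total score. The sketch below is a minimal illustration under the assumption of complete cases only; the function name and the toy data are hypothetical and do not come from MINT.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for complete-case data.

    rows: one list per respondent, each holding that respondent's k item scores.
    """
    if len(rows) < 2:
        raise ValueError("need at least two respondents")
    k = len(rows[0])
    if k < 2 or any(len(r) != k for r in rows):
        raise ValueError("every respondent needs the same k >= 2 items")
    columns = list(zip(*rows))                        # one tuple per item
    sum_item_var = sum(variance(col) for col in columns)
    total_var = variance([sum(r) for r in rows])      # variance of total scores
    if total_var == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    return k / (k - 1) * (1 - sum_item_var / total_var)


if __name__ == "__main__":
    toy = [[1, 2, 1], [2, 2, 3], [3, 4, 3], [4, 4, 5]]
    print(round(cronbach_alpha(toy), 3))
```

Under the convention cited above, the returned value would be compared against the 0.70 and 0.90 thresholds.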
The construct validity of the measures was assessed in terms of both known groups validity and discriminant validity [14]. In known groups validity, we take prespecified groups where we would expect there to be a difference in health status, and thus instrument scores. The different scores between groups for alternative measures can then be compared to see if there is a pattern in the sensitivity to these expected differences [26]. We classified patients according to whiplash associated disorder (WAD) grades at baseline, and performed independent samples t-tests for differences in baseline EQ-5D(utility), EQ-5D(VAS), SF-6D, SF-12(PCS) and SF-12(MCS) scores, or differences in NDI at 4 months, the latter measure not having been included at baseline. The magnitude of these differences was compared by calculating effect sizes, i.e. the mean difference between the WAD grade groups (either WAD grade 1 versus WAD grade 2 or WAD grade 1 versus WAD grade 3) standardised by dividing by the pooled standard deviation of the two groups. This standardisation allows for the unbiased comparison of measures with differing scales [27]. A standard, if again largely arbitrary, classification system devised by Cohen regards an effect size of 0.20 as small, 0.50 as moderate and 0.80 or greater as large [27].
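The known groups effect size described above can be sketched as below. This is an illustrative assumption about the calculation: the text does not state the exact pooling formula, so the usual degrees-of-freedom weighted pooled standard deviation is used, and the label for values under 0.20 is a placeholder since Cohen's scheme as quoted starts at "small".

```python
from math import sqrt
from statistics import mean, variance


def pooled_sd(a, b):
    """Pooled SD of two independent groups, weighted by degrees of freedom."""
    na, nb = len(a), len(b)
    return sqrt(((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2))


def effect_size(a, b):
    """Mean difference between two groups divided by their pooled SD."""
    return (mean(a) - mean(b)) / pooled_sd(a, b)


def cohen_label(d):
    """Classify the magnitude of an effect size using Cohen's thresholds."""
    d = abs(d)
    if d >= 0.80:
        return "large"
    if d >= 0.50:
        return "moderate"
    if d >= 0.20:
        return "small"
    return "below small"  # placeholder label, not part of the quoted scheme


if __name__ == "__main__":
    grade_1 = [0.80, 0.69, 0.76, 0.88, 0.62]   # toy utility scores, not MINT data
    grade_2 = [0.69, 0.59, 0.76, 0.52, 0.66]
    d = effect_size(grade_1, grade_2)
    print(round(d, 3), cohen_label(d))
```

Because the denominator is the pooled standard deviation, a measure with a narrower spread of scores can yield a larger effect size for the same mean difference.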
Discriminant validity, the extent to which different instruments with overlapping constructs converge or diverge, was tested by calculating Pearson's correlation coefficients and Spearman's rank correlation coefficients between each of the summary scores of the EQ-5D, SF-6D and NDI. A higher correlation between one of the two utility measures and the NDI cannot be interpreted as evidence of superiority in psychometric terms over the other utility measure (the NDI cannot be regarded as a gold standard and generic, preference based measures are not intended to measure the same constructs as condition specific measures [22]). Nevertheless, it could be regarded as evidence of a greater degree of construct overlap between that utility measure and the NDI. Spearman's correlations were also calculated for the individual items within and between each measurement, with the assumption being that similar dimensions in different measures should correlate more highly than different dimensions within the same measure.
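The two correlation coefficients used here can be sketched in a few lines. The implementation below is a minimal illustration with hypothetical function names; Spearman's coefficient is computed as Pearson's coefficient applied to average ranks, which is one standard way of handling ties and is an assumption about, not a statement of, how the original analysis treated them.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson's product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson's r on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    utility = [0.85, 0.69, 0.52, 0.76, 0.30]   # toy values, not MINT data
    ndi = [6, 14, 22, 10, 31]
    print(round(pearson(utility, ndi), 3), round(spearman(utility, ndi), 3))
```

Since higher NDI scores correspond to worse health while higher utility scores correspond to better health, correlations between the NDI and a utility measure would be expected to be negative unless one scale is reversed first.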
We assessed both the internal and external responsiveness of the EQ-5D(utility), SF-6D and NDI. The internal responsiveness of a measure represents its ability to detect changes over a specified timeframe [28]. We calculated effect sizes (the mean change in measure over time divided by the standard deviation pooled across the two time points) and standardised response means (these differ from effect sizes as they are standardised by dividing by the standard deviation of the difference between the measures at the two time points) for the changes in each measure over time, together with the associated 95% confidence intervals [29]. We also calculated the proportion of patients in floor (the lowest possible) or ceiling (the highest possible) health states at each time point, as a high proportion of individuals at one end of the scale can indicate a lack of specificity in that region as well as a lack of responsiveness to change.
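The difference between the two internal responsiveness statistics lies only in the denominator, as the following minimal sketch shows. It assumes paired observations for the same patients at two time points, takes "pooled across the two time points" to mean the square root of the average of the two variances, and omits the 95% confidence intervals; the function names and toy scores are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev, variance


def change_effect_size(before, after):
    """Mean change divided by the SD pooled across the two time points."""
    diffs = [a - b for b, a in zip(before, after)]
    pooled = sqrt((variance(before) + variance(after)) / 2)  # assumed pooling
    return mean(diffs) / pooled


def standardised_response_mean(before, after):
    """Mean change divided by the SD of the individual change scores."""
    diffs = [a - b for b, a in zip(before, after)]
    return mean(diffs) / stdev(diffs)


def proportion_at(scores, value):
    """Share of scores sitting exactly at a floor or ceiling value."""
    return sum(1 for s in scores if s == value) / len(scores)


if __name__ == "__main__":
    baseline = [0.52, 0.69, 0.76, 0.62, 0.80]   # toy utility scores
    month_4 = [0.69, 0.76, 1.00, 0.69, 1.00]
    print(round(change_effect_size(baseline, month_4), 3))
    print(round(standardised_response_mean(baseline, month_4), 3))
    print(proportion_at(month_4, 1.00))          # ceiling proportion
```

When patients change by similar amounts, the standard deviation of the differences is small and the standardised response mean can be considerably larger than the effect size for the same data.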
External responsiveness considers whether the changes registered by a measure over time correspond to those expected based on an external reference measure of health [28]. We made use of two different reference measures to capture two different aspects of responsiveness: responsiveness to self-reported changes in neck injury status and responsiveness to changes in NDI score. Firstly, we used a question put to the patient at each follow-up point as to whether their neck injury was much worse, worse, the same, better or much better than at the time of completion of the previous questionnaire. Mean differences, standardised response means and effect sizes were calculated for the changes in outcome measures for patients in each of these self-reported groups. A more responsive measure should show larger differences between the self-reported groups. Secondly, we used various categorisations of the NDI as our reference measure, to see which of the utility measures, EQ-5D or SF-6D, better captured changes in neck disability. The NDI categorisations used were: change in NDI score between 4 months and 12 months, change in Vernon category between 4 months and 12 months, 4 month Vernon categories, 12 month Vernon categories and finally a categorisation of patient outcome trajectories defined by Sterling et al., where neck injuries are classed as either mild, moderate or chronic-severe [30]. These trajectories had been constructed from a previous data set of 155 individuals monitored for one year post whiplash injury [30].
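The anchor-based approach described above amounts to grouping change scores by the external reference category and summarising each group. The sketch below illustrates this under simple assumptions: each record is a hypothetical tuple of (anchor category, earlier score, later score), and the standardised response mean is left undefined for groups with fewer than two patients or no variation in change.

```python
from collections import defaultdict
from statistics import mean, stdev


def responsiveness_by_anchor(records):
    """Summarise change scores within each anchor category.

    records: iterable of (anchor_category, score_before, score_after).
    Returns {category: {"n", "mean_change", "srm"}}.
    """
    changes = defaultdict(list)
    for category, before, after in records:
        changes[category].append(after - before)

    summary = {}
    for category, diffs in changes.items():
        srm = None
        if len(diffs) > 1:
            sd = stdev(diffs)
            if sd > 0:
                srm = mean(diffs) / sd
        summary[category] = {
            "n": len(diffs),
            "mean_change": mean(diffs),
            "srm": srm,
        }
    return summary


if __name__ == "__main__":
    toy = [                      # toy data, not MINT results
        ("better", 0.62, 0.76),
        ("better", 0.69, 0.80),
        ("same", 0.76, 0.76),
        ("same", 0.69, 0.73),
        ("worse", 0.80, 0.62),
        ("worse", 0.76, 0.69),
    ]
    for category, stats in responsiveness_by_anchor(toy).items():
        print(category, stats)
```

A responsive measure would be expected to show mean changes ordered consistently across the anchor categories, from the largest deterioration in the "much worse" group to the largest improvement in the "much better" group.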

Results
3,851 individuals were randomised in MINT, 1,006 of whom were complete responders, that is, they returned the SF-12 and EQ-5D measures at baseline, 4 months, 8 months and 12 months and the NDI at 4 months, 8 months and 12 months. The baseline characteristics of the whole population and of complete versus non-complete responders are given in Table 1. There were significant differences (at the 5% level) between responders and non-responders in all the characteristics examined, with the exception of WAD grade at baseline [5]. Complete responders tended to be older, and were more likely to be female, with lower pain intensity at baseline. Response rates were also higher from people randomised to the control group (at either randomisation) than those assigned to the MINT interventions.

Acceptability
Table 2 shows the response rates (assessed in terms of complete responses to all relevant questions) for each of the measures. Baseline response rates varied from 78.6% (the SF-6D subset of the SF-12) to 89.1% (EQ-5D utility), whilst response rates at the end of the follow-up period varied from 50.4%-69.2%. There were very low rates (<2%) of partial completion (defined as failure to complete at least one item) across measures and follow-up points, with individuals tending to either complete the whole measure or not respond at all.

Reliability
Cronbach's alpha scores were 0.790, 0.871 and 0.922 for the EQ-5D(utility), SF-6D and NDI, respectively, all above the threshold of 0.70 recommended for use in research. The Cronbach's alpha score for the NDI was also above the 0.90 cut-off recommended for use in routine clinical practice. A higher value would be expected for this measure due to the narrower range of impacts it aims to capture.

Validity
Descriptive statistics for each of the measures at baseline are shown in Table 3 (with the exception of the NDI for which descriptive statistics at the 4 month follow-up are presented). There is some evidence of a floor effect (scores of 0) with the NDI and a ceiling effect (scores of 1) with the EQ-5D utility measure, but no measure has more than 11.5% of scores at either extreme of a scale.
The results of tests of known groups validity summarised in Table 4 show that there were differences in scores for all measures at baseline between pre-specified WAD groups (1 versus 2 and 1 versus 3). All differences are statistically significant at the 5% level between WAD grades 1 and 2. The small number of individuals in WAD grade 3 (n = 104) meant that only the EQ-5D(VAS), SF-6D and SF-12(MCS) differences are significant between WAD grades 1 and 3, despite the magnitude of the differences being larger in all cases than those observed for WAD grades 1 versus 2. The SF-6D had larger effect sizes than the EQ-5D utility measure across both comparisons (grade 1 versus grade 2 and grade 1 versus grade 3), though they both fall into the small-moderate range as defined by the Cohen classifications. Specifically, the effect sizes for the EQ-5D and SF-6D, respectively, were 0.310 and 0.364 between grades 1 and 2, and 0.353 and 0.496 between grades 1 and 3.
Table 5 shows the correlation coefficients between the various summary measures, with all correlations statistically significant at the 1% level. The SF-6D correlates more strongly with the mental component scale (rather than the physical component scale) of the SF-12, whilst the EQ-5D (both utility and VAS measures) and NDI correlate more strongly with the physical component scale, with the NDI being more strongly correlated with the EQ-5D utility measure than the SF-6D. Individual item correlations followed the expected patterns (i.e. significant positive correlations between worsening health states on all items within and between the SF-6D, EQ-5D utility and NDI measures) with a smallest correlation coefficient of 0.202 (between the self-care dimension from the EQ-5D and the mental health dimension from the SF-6D). Dimensions measuring similar constructs also correlated more highly than others with, as an example, the pain questions on each measure all having correlations of greater than 0.615 between one another.

Responsiveness
Tables 6, 7 and 8 display measures of the responsiveness of the EQ-5D(utility), SF-6D and NDI, respectively, using self-reported change in neck injury as the referent. Tables 9 and 10 display similar results for the EQ-5D(utility) and SF-6D, but using the NDI as the referent. In Table 6, when data were combined across all possible time points of comparison, there were statistically significant differences in changes in EQ-5D utility scores between alternative categories of self-reported neck injury, ranging from a change of −0.2961 for patients reporting their injury had got much worse to a change of 0.0955 for those reporting it had got much better. This was also the case for the SF-6D (Table 7) with the exception of the difference in change in utility score between the better (0.0643) and much better (0.0613) self-reported categories, which went in the reverse order to that which would be expected. Effect sizes and standardised response means were consistently larger for the EQ-5D(utility) than for the SF-6D (by an average of 49.8%), and were also consistently ordered across self-reported categories for the EQ-5D utility measure, which was not the case for the SF-6D. There was no consistent pattern of differences between the EQ-5D(utility) and NDI, with effect sizes differing by a smaller average of 16.9%. Furthermore, there was a consistent pattern across all three measures for individuals reporting that their neck injury was the same as 4 months previously, with all showing (when time points were pooled) a small improvement in score.
For the analyses using NDI categorisations as reference categories, summarised in Tables 9 and 10, both the EQ-5D(utility) and the SF-6D were consistently more responsive when a longitudinal reference category was used, that is, the referent was delineated as a change in a measure rather than a value at a given time point. Whilst there was considerable variability between effect sizes and standardised response means based on the reference category used, the EQ-5D utility measure again came out as consistently more responsive than the SF-6D (effect sizes and standardised response means were respectively, on average, 22.5% and 13.1% higher for the EQ-5D utility measure than for the SF-6D).

Discussion
The intention of this study was to compare the properties of different patient-reported outcome measures that have been used following acute whiplash injury. The results show significant variation between instrument properties (known groups discrimination, responsiveness etc.) when used in this population.
When comparing different patient-reported outcome measures, there are a number of specific difficulties with interpretation that it is important to note [33]. First, the underlying concepts and domains of health measured will not be the same with, in our case, the EQ-5D and SF-12 being generic health measures whilst the NDI is neck-injury specific. They also relate to different time periods, with the EQ-5D asking specifically about an individual's health 'today', the version of the SF-12 in MINT using a one-week recall period and the NDI asking about current capabilities without specifying a time frame. Scales and the outcome space of possible answers also differ, a problem that can be partially, though not entirely, addressed by standardisation (i.e. effect sizes or standardised response means), and the directions of values for better health are not always the same, with higher NDI scores corresponding to worse health, the reverse being the case for the other measures. When considering effect sizes and standardised response means, it is important to remember that differences between measures can be driven by differences in magnitude, differences in variability or both, which can make interpretations of these statistics more difficult.
With all these provisos taken into account, there was little evidence of differences in response or completion rates between the different measures. Whilst there were higher response rates to the EQ-5D and NDI as opposed to the SF-12 this can be explained, at least in part, by the follow-up methodology within MINT (missing EQ-5D and NDI questionnaires were chased by postal reminders and telephone contacts, whilst missing SF-12 questionnaires were chased by postal reminders only). There were no meaningful differences if response rates were compared prior to the additional telephone contacts. In the postal questionnaires, the NDI was presented first, the SF-12 second and the EQ-5D third, meaning that if questionnaire length were leading to participant fatigue and subsequent non-completion, we would expect higher response rates to the NDI than to the EQ-5D. However, we in fact find the reverse pattern, with very slightly (though non-significantly) higher response rates to the EQ-5D.
The EQ-5D(utility) and NDI both appear to be more responsive to longitudinal changes in health status than the SF-6D and give results consistent with the expected trend of deteriorating health status resulting in lower utility values (EQ-5D) or increasing scores (NDI), whilst the SF-6D does not. The EQ-5D(utility) correlates more strongly with the NDI than the SF-6D does, perhaps implying a higher level of construct overlap, and both the EQ-5D(utility) and NDI correlate more strongly with the PCS of the SF-12 than the MCS, the opposite of the case for the SF-6D. The low level of correlation between the MCS and PCS scales of the SF-12 (0.298, the lowest between any two measures) indicates that these constructs are indeed non-overlapping to a considerable extent.
[Table 6: Responsiveness of the EQ-5D over time, anchored by self-reported change in neck injury]
In contrast, the SF-6D produces larger effect sizes for differences in injury severity (WAD grade) than the EQ-5D utility measure at a fixed time point. This may, however, be driven by the lower standard deviation for SF-6D values (in turn driven, at least in part, by the lower possible range of outcome values) rather than larger differences between the groups themselves. Indeed, the differences in mean utility values between the groups are again larger for the EQ-5D utility measure than for the SF-6D. The SF-6D does have the advantage of showing no discernible floor or ceiling effects, in contrast to both the NDI and EQ-5D utility measure. The EQ-5D-5L, a modification of the standard EQ-5D that provides five response levels in each dimension, should help to address this issue, but it is not yet in widespread use [34].
To understand the reasons for these differences, it is important to consider both the descriptive systems of the instruments and, for preference based measures, the valuation methods [22], and there are marked differences between the SF-6D and EQ-5D in both these areas [35]. The SF-6D has more levels than the EQ-5D, and is more concentrated on milder health problems, with the worst states in the SF-6D descriptive system arguably less severe than those in the EQ-5D descriptive system [35,36]. There is evidence that the SF-6D is better able to detect small changes in health, and is more sensitive to changes in health status at the top end of the distribution, whilst the EQ-5D is more sensitive to health change in individuals with poor baseline health [35]. There are also differences in the valuation method, with the EQ-5D valued using the time trade-off approach and the SF-6D valued using the standard gamble approach. There is empirical evidence that the time trade-off approach results in higher values for milder states and lower values for more severe states, which can thus partially account for the greater range of index values for the EQ-5D [36].
The fact that utility scores appear to change over time, when patients report that their neck injury is the same, is evidence of potential response shift bias, where a patient's subjective views and expectations change over time, causing a drift in the outcome score [37]. However, we have no evidence that this is more pronounced in one measure. There are specific tests available to assess whether this utility drift is actually the result of a response shift, rather than simply measurement error, such as a then-test, where patients are asked to retrospectively recall their health status (as they now perceive it) at a previous time point, and these are compared to the answers they gave at that time point itself [38]. However, such data were not available from the MINT study so no such test could be performed.
[Table 9: Responsiveness of the EQ-5D to changes between 4 and 12 months, anchored by NDI classifications]
This study was helped by having access to a large cohort of patients with whiplash associated disorders, in contrast to many studies looking at the properties of instruments that have much smaller sample sizes. The collection of data at four separate time points is also an advantage over simply having two data points per individual. However, since the data used came from a clinical trial, all the usual caveats apply about the differences between trials and clinical practice, and the possibility for this to bias results, though since this was a pragmatic trial this should have less of an effect than in other situations [39]. Further, the lack of NDI data at baseline is a substantial limitation, making comparison between the NDI and other measures much more problematic than for those where we have contemporaneous data. There is also a concern due to the large amount of missing data in the study (less than 50% of participants returned questionnaires at all four time points), which could introduce bias. However, these response rates were similar to those for patient-reported outcome measures in other trials looking at whiplash interventions [40,41].
In conclusion, the evidence suggests that, for whiplash studies where only one generic health outcome measure is to be included, the EQ-5D is likely to offer advantages over the SF-12 and its preference-based derivative (SF-6D). Whilst this is the first study to look specifically at whiplash injuries, the finding that the EQ-5D and SF-6D do not provide interchangeable utility values, and that the EQ-5D is likely to have advantages over the SF-6D, is supported by other studies looking at neck injuries [42]. Comparisons with the NDI are more difficult, as there may be particular reasons for incorporating a condition-specific measure as opposed to a generic one in studies of whiplash associated disorders, whilst conversely the EQ-5D has the advantage of being preference-based, and can thus be used in cost-utility evaluations. Previous studies have shown the NDI to have good internal consistency, test-retest reliability and responsiveness [12,43]. Nevertheless, we found little evidence for better performance by the NDI when compared with the EQ-5D.
[Table 10: Responsiveness of the SF-6D to changes between 4 and 12 months, anchored by NDI classifications]