Interpreting change from patient reported outcome (PRO) endpoints: patient global ratings of concept versus patient global ratings of change, a case study among osteoporosis patients

Background: Regulatory guidance recommends anchor-based methods for interpretation of treatment effects measured by PRO endpoints. Methodological pros and cons of patient global ratings of change vs. patient global ratings of concept have been discussed but empirical evidence in support of either approach is lacking. This study evaluated the performance of patient global ratings of change and patient global ratings of concept for interpreting patient stability and patient improvement.
Methods: Patient global ratings of change and patient global ratings of concept were included in a psychometric validation study of an osteoporosis-targeted PRO instrument (the OPAQ-PF) to assess its ability to detect change and to derive responder definitions. 144 female osteoporosis patients with (n = 37) or without (n = 107) a recent (within 6 weeks) fragility fracture completed the OPAQ-PF and global items at baseline, 2 weeks (no recent fracture), and 12 weeks (recent fracture) post-baseline.
Results: Results differed between the two methods. Recent fracture patients reported more improvement while patients without recent fracture reported more stability on ratings of change than ratings of concept. However, correlations with OPAQ-PF score change were stronger for ratings of concept than ratings of change (both groups). Effect sizes for OPAQ-PF score change increased consistently with level of change in ratings of concept but inconsistently with ratings of change, with the mean AUC for prediction of a one-point change being 0.72 vs. 0.56.
Conclusions: This study provides initial empirical support for methodological and regulatory recommendations to use patient global ratings of concept rather than ratings of change when interpreting change captured by PRO instruments in studies evaluating treatment effects. These findings warrant being confirmed in a purpose-designed larger scale analysis.
Electronic supplementary material The online version of this article (doi:10.1186/s12955-016-0427-5) contains supplementary material, which is available to authorized users.


Background
As with all outcome measures used to evaluate the impact of a medical product, one of the biggest challenges for patient reported outcome (PRO) endpoints is how to interpret the change in scores between two time points or the difference in change scores between treatment groups [1]. Most PRO instruments comprise one or more scales, with items aggregated into multi-item scales and scores for each individual often considered at group-level. Interpretation of PRO data requires an understanding of potential complexities associated with self-report, comparison of different response scales, and of psychometrics in general [1]. Further, the statistical significance of any score change over time does not guarantee that differences are clinically meaningful, with statistical significance sometimes achieved for notably small score changes, particularly if the sample size is large [2,3].
The history of debate over the methods for interpreting change in longitudinal studies has been recently summarised by Wyrwich and colleagues [4]. The debate heightened during the consultation period following publication of the Food and Drug Administration (FDA) draft guidance [5] and continued after the publication of the FDA final guidance report for Industry PRO measures [2], with Burke and Trentacosti [6] offering further considerations beyond those provided in the final FDA guidance. A key shift in approach over this period has been a move away from using the group level minimum important difference (MID) to evaluate treatment benefit, the focus of much of the prior efforts to develop values to aid interpretation of change [4], to using the patient level responder definition. The MID is the amount of difference between treatment groups in the change observed in a PRO measure that can be interpreted as a treatment benefit [5], whereas the responder definition is the amount of change in an individual patient which can be interpreted as a treatment benefit: the proportions of individuals in each trial arm who meet this threshold (or indeed a variety of thresholds) for PRO score change are compared between treatment arms.
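As a minimal sketch of the responder analysis described above, the following Python snippet computes the proportion of patients in each trial arm whose PRO score change meets a candidate threshold. The function name, the thresholds and the change scores are hypothetical and serve only as an illustration; none of them are taken from this study.

```python
def responder_proportion(changes, threshold):
    """Proportion of patients whose PRO score change is at least `threshold`."""
    if not changes:
        raise ValueError("changes must not be empty")
    responders = sum(1 for c in changes if c >= threshold)
    return responders / len(changes)


# Hypothetical change scores (follow-up minus baseline) for two trial arms.
treatment = [12.0, 3.5, 15.0, 9.0, 20.0]
control = [2.0, 11.0, -4.0, 5.0, 0.0]

# Compare arms across a variety of candidate responder thresholds.
for thr in (5.0, 10.0, 15.0):
    p_t = responder_proportion(treatment, thr)
    p_c = responder_proportion(control, thr)
    print(f"threshold {thr}: treatment {p_t:.2f}, control {p_c:.2f}")
```

Reporting the proportions over several thresholds, rather than a single one, mirrors the "variety of thresholds" mentioned in the text.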
Empirical approaches for defining both the MID and the responder definition can be either anchor-based or distribution-based. The FDA has stated a preference for anchor-based rather than distribution-based approaches for establishing the responder definition [2]. Anchor-based methods explore the association between the PRO instrument and a related external anchor, where different types of anchors can be utilised. Distribution-based approaches evaluate score change in the context of score variability, e.g. ½ standard deviation or 1 standard error of measurement (SEM). Typically, distribution-based approaches have been used to establish the MID. Where anchor-based approaches have been used, the resulting value tends to be termed the minimum clinically important difference (MCID), although these terms have been used interchangeably. The FDA considers distribution-based approaches to be supportive of anchor-based approaches, providing the minimum value for a responder definition derived from anchor-based methods. This is because a responder definition must be at least large enough to be beyond a score change that might reasonably be expected by chance alone [2].
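The two distribution-based quantities mentioned above can be sketched as follows. This is an illustrative example only: it assumes the conventional formula SEM = SD × √(1 − reliability), where the reliability coefficient (for example a test-retest ICC) is supplied by the user, and the baseline scores shown are hypothetical.

```python
import math
import statistics


def half_sd(baseline_scores):
    """Half of the sample standard deviation of the baseline scores."""
    return 0.5 * statistics.stdev(baseline_scores)


def standard_error_of_measurement(baseline_scores, reliability):
    """SEM = SD * sqrt(1 - reliability); reliability must lie in [0, 1]."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return statistics.stdev(baseline_scores) * math.sqrt(1.0 - reliability)


# Hypothetical baseline scores on a 0-100 scale and a hypothetical reliability.
scores = [40.0, 50.0, 60.0, 70.0, 80.0]
print(f"0.5 SD: {half_sd(scores):.2f}")
print(f"1 SEM:  {standard_error_of_measurement(scores, 0.91):.2f}")
```

Either value can then serve as a lower bound against which an anchor-based responder definition is checked.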
The FDA PRO Guidance [2] makes specific reference to the use of patient ratings of change as an anchor, for which the patient rates how much change they have experienced on a single-item scale. Importantly, this scale must relate conceptually to the content of the PRO instrument that will be used to evaluate treatment benefit (e.g., pain or physical function), an approach originally developed by Jaeschke et al. [7] and the most commonly reported anchor-based method. This single-item scale has response options ranging from deterioration through to improvement in the concept of interest, with a midpoint of no change. The corresponding PRO score change for patients who have rated a certain level of change (such as small or moderate) is then taken to indicate a meaningful PRO score change for use in PRO score interpretation and to identify responders (i.e., those whose level of PRO score change has met the responder definition). There are, however, concerns about the use of patient global ratings of change, specifically concerning recall bias associated with retrospective assessments over long periods of time [8] and for diseases with a high level of symptom variability across short periods of time, such as irritable bowel syndrome [6]. Thus, the FDA has more recently recommended the use of patient global ratings of concept items, for which the patient rates their current state on a relevant PRO concept at each key time-point, with the change in global concept ratings between the time-points calculated for analysis [6].
Whilst methodological pros and cons of patient global ratings of change versus change in patient global ratings of concept have been discussed [4], empirical evidence for the relative benefits of each approach is lacking. This study compares the statistical performance of two anchor-based approaches, patient global rating of change and patient global rating of concept, for interpreting patient stability and patient improvement using data collected for the psychometric validation of an osteoporosis-targeted PRO, the Osteoporosis Assessment Questionnaire-Physical Function (OPAQ-PF).

Methods
The design, sample and procedures of the full OPAQ-PF psychometric validation study are reported in detail elsewhere [9].

Design
Post-menopausal women aged ≥50 years, diagnosed with moderate-to-severe osteoporosis, with or without a recent (within six weeks prior to baseline) fragility fracture, were recruited through ten clinical sites in the US. These patients completed the OPAQ-PF and global items at baseline (global concept), 2 weeks (no recent fracture, global concept and change) and 12 weeks (recent fracture, global concept and change) post-baseline (Fig. 1). Participants who had not experienced a recent fracture were expected to experience stability in their ability to perform daily activities of physical functioning at week two while those with recent osteoporotic fracture were expected to experience change, specifically improvement, in their ability to perform daily activities of physical functioning between baseline and week 12. Institutional review board (IRB) approval was obtained for the study (Protocol OXO2550; Independent Investigational Review Board, Inc.: 21 October 2011).

Measures
The OPAQ-PF [9,10] is designed to evaluate the participant's ability to perform their daily activities of physical function during the past seven days. The instrument covers mobility (5 items), physical positions (6 items) and transfers (4 items). Items are rated on a six-point Likert response scale ranging from 'no difficulty' [score 0] to 'completely avoided doing this' [score 5] (subsequently modified to 'unable to do'; see discussion). All 15 items are reverse scored, summed and transformed to a 0-100 scale to provide a total score, where 0 indicates the most difficulties and 100 no difficulties. A qualitative study with 32 participants demonstrated content validity of the OPAQ-PF in post-menopausal women who had, and had not, previously sustained a fracture [9]. A prospective study of 144 post-menopausal women with moderate-to-severe osteoporosis demonstrated that the OPAQ-PF: was unidimensional; had good internal consistency (α = 0.974); had good test-retest reliability (ICC = 0.993); differentiated between patients with/without a recent fracture and by severity of osteoporosis; and correlated strongly with hypothesized related scales and performance-based measures (r ≥ 0.6, p < 0.001) [10].
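The scoring rule described above (reverse score, sum, rescale to 0-100) can be sketched as follows. This is an illustration under stated assumptions rather than the official scoring algorithm: it assumes a simple linear rescaling of the reversed sum and complete responses on all 15 items, since the handling of missing items is not described here. The function and constant names are hypothetical.

```python
N_ITEMS = 15        # mobility (5) + physical positions (6) + transfers (4)
MAX_ITEM_SCORE = 5  # 0 = 'no difficulty' ... 5 = most severe response option


def opaq_pf_total(responses):
    """Total score on a 0-100 scale, where 100 means no difficulties."""
    if len(responses) != N_ITEMS:
        raise ValueError(f"expected {N_ITEMS} item responses")
    if any(r not in range(MAX_ITEM_SCORE + 1) for r in responses):
        raise ValueError("item responses must be integers from 0 to 5")
    # Reverse score each item so that higher values mean better function.
    reversed_sum = sum(MAX_ITEM_SCORE - r for r in responses)
    # Assumed linear transformation of the 0-75 sum to a 0-100 scale.
    return 100.0 * reversed_sum / (N_ITEMS * MAX_ITEM_SCORE)


print(opaq_pf_total([0] * 15))  # no difficulty on any item
print(opaq_pf_total([5] * 15))  # most severe response on every item
```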
Three patient global ratings of concept items (ratings of concept) and three patient global ratings of change items (ratings of change) were developed to evaluate the ability of the OPAQ-PF to detect change and to evaluate interpretation of change [4]. These ratings reflected the three content areas of the OPAQ-PF (mobility, physical positions and transfers). Ratings of concept items were self-completed by participants to reflect overall difficulty in the last seven days in these areas due to osteoporosis. For example, for mobility participants were asked "Overall, how much difficulty have you had with mobility (e.g. walking or climbing stairs) due to your osteoporosis in the last 7 days?" The participant rated difficulty on a five-point scale ranging from 'no difficulty' (0), through 'a little difficulty' (1), 'some difficulty' (2), 'moderate difficulty' (3), to 'severe difficulty' (4). Ratings of change items were self-completed by participants who were asked to rate their overall change in the same three areas since the last study visit. For example, the mobility rating of change item asked "Overall, compared to your last visit, how has your mobility (e.g. ability to walk or climbing stairs) due to your osteoporosis changed?" Participants rated each item on a seven-point scale ranging from 'much better' (3), through 'moderately better' (2), 'a little better' (1), 'no change' (0), 'a little worse' (−1), 'moderately worse' (−2), to 'much worse' (−3).

Procedures
Full study procedures are reported elsewhere [10]. Participants completed the OPAQ-PF and the three ratings of concept at baseline. Participants without a recent fracture completed the OPAQ-PF, the three ratings of concept, and the three ratings of change two weeks (median 14 days, inter-quartile range (IQR) 14-18 days) after baseline, over which time change was not expected. Participants with a recent fracture attended a visit at 12 weeks (median 12, IQR 12.0-12.7) post-baseline and completed the OPAQ-PF, the three ratings of concept and the three ratings of change items at each follow-up visit (Fig. 1). Improvement was expected among recent fracture participants during this time period. Participants with a recent fracture also attended a visit at 24 weeks post-baseline [10]; these data are not used in the current analysis because the rating of change items asked the patients to compare their functioning with the previous visit (week 12) rather than baseline.

Statistical analysis
The change on each rating of concept was calculated at week two (no recent fracture group) and week 12 (recent fracture group) relative to the score at baseline. In order to evaluate possible recall bias, ratings of change were correlated with ratings of concept completed at the same and previous time-point using Spearman's correlation coefficient (r_s). Retrospective recall is considered unbiased if ratings of change are positively correlated with follow-up scores and negatively correlated with baseline scores [8], and to an equal degree [11]. Correlations were expected to be at least moderate at ≥ |0.30| [12]. Change scores in OPAQ-PF at weeks two and 12 were calculated, and OPAQ-PF score changes were correlated with patient global ratings of concept and patient global ratings of change scores (r_s). Mean and median OPAQ-PF change scores were compared between participants in each patient-rated level of change and change in patient ratings of concept group using ANOVA and Kruskal-Wallis (K-W) tests. Tests for linear trend (and departures from linearity) were conducted within the ANOVA. Cohen's d effect sizes for OPAQ-PF change (mean change / baseline standard deviation, SD; [12]) were calculated at each level of rating of change and change in ratings of concept.
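The two main quantities used in this analysis, Spearman's rank correlation and the effect size defined as mean change divided by baseline SD, can be sketched in plain Python as follows. This is a minimal illustration, not the study's analysis code: the helper names and all data are hypothetical, tied values receive average ranks, and the baseline SD is assumed to be the sample SD of the same group of patients passed to the function.

```python
import statistics


def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman's r_s: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def effect_size(baseline, follow_up):
    """Mean change (follow-up minus baseline) divided by the baseline SD."""
    changes = [f - b for b, f in zip(baseline, follow_up)]
    return statistics.fmean(changes) / statistics.stdev(baseline)


# Hypothetical OPAQ-PF change scores and ratings of change for six patients.
opaq_change = [0.0, 2.5, 5.0, 12.0, 20.0, 18.0]
rating_of_change = [0, 0, 1, 2, 3, 3]
print(f"r_s = {spearman(opaq_change, rating_of_change):.2f}")
print(f"d   = {effect_size([40.0, 50.0, 60.0], [50.0, 60.0, 70.0]):.2f}")
```

The same `spearman` helper could be applied twice for the recall bias check, once against follow-up ratings of concept and once against baseline ratings, and the two coefficients compared.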
Receiver operating characteristic (ROC) curves were used to identify the OPAQ-PF change score which best distinguishes individual patients who improved to a specified extent from those who did not (the 'best cut point', BCP) [13,14]. ROC curves were plotted for participants reporting at least a one-unit improvement on the specific ratings of concept and ratings of change at week 2 (no recent fracture group) and week 12 (recent fracture group). ROC curves plot sensitivity, the proportion of true 'positives' detected (y-axis), against 1 − specificity, where specificity is the proportion of true 'negatives' detected (x-axis), for all possible cut-points of the OPAQ-PF. The 'best cut point' is identified as the test value which maximises the sum of sensitivity and specificity, i.e. the test value associated with the point closest to the top left hand corner of the ROC space. The area under the curve (AUC) can also be calculated: the closer it is to 1.0, the better the differentiation of the scale. Therefore, the greater the AUC, the greater the ability of the OPAQ-PF to differentiate those who reported change from those who did not.
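The cut-point selection and AUC described above can be sketched as follows. This is an illustrative example with hypothetical data and function names, not the study's analysis code. It assumes that a patient is classified as 'improved' when the change score is at least the cut-point, picks the cut-point that maximises sensitivity plus specificity, and computes the empirical AUC as the proportion of improved/not-improved patient pairs in which the improved patient has the higher change score (ties count one half).

```python
def split_by_anchor(scores, improved):
    """Separate change scores by anchor status (improved vs. not improved)."""
    pos = [s for s, flag in zip(scores, improved) if flag]
    neg = [s for s, flag in zip(scores, improved) if not flag]
    if not pos or not neg:
        raise ValueError("both anchor groups must be non-empty")
    return pos, neg


def roc_points(scores, improved):
    """(cut, sensitivity, specificity) for every observed change score."""
    pos, neg = split_by_anchor(scores, improved)
    points = []
    for cut in sorted(set(scores)):
        sensitivity = sum(s >= cut for s in pos) / len(pos)
        specificity = sum(s < cut for s in neg) / len(neg)
        points.append((cut, sensitivity, specificity))
    return points


def best_cut_point(scores, improved):
    """Cut-point with the largest sum of sensitivity and specificity."""
    return max(roc_points(scores, improved), key=lambda p: p[1] + p[2])


def auc(scores, improved):
    """Empirical AUC from pairwise comparisons of the two anchor groups."""
    pos, neg = split_by_anchor(scores, improved)
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical change scores and anchor status (1 = improved by >= 1 point).
change = [-5.0, 0.0, 2.0, 4.0, 8.0, 10.0, 12.0, 15.0]
anchor = [0, 0, 0, 0, 1, 0, 1, 1]
cut, sens, spec = best_cut_point(change, anchor)
print(f"best cut point {cut}: sensitivity {sens:.2f}, specificity {spec:.2f}")
print(f"AUC {auc(change, anchor):.2f}")
```

An AUC of 0.5 corresponds to chance-level discrimination, which is why values at or below 0.5 are read as non-predictive.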
Statistical significance throughout was taken at the 5 % level (p < 0.05).

Sample
The overall sample (n = 144) comprised 107 patients without recent fracture and 37 recent fracture patients. Baseline sample characteristics are presented in Table 1 (further details reported elsewhere [10]).

Patient ratings of change and change in global concept
At week 2, the no recent fracture patients were more likely to report stability (no change) on the ratings of change than on the ratings of concept (i.e., difference in ratings of concept between time-points = 0): mobility n = 79 (75 %) vs. n = 72 (69 %), physical positions n = 81 (77 %) vs. n = 65 (62 %), transfers n = 88 (84 %) vs. n = 62 (60 %) (Fig. 2). At week 12, the recent fracture patients were generally less likely to report stability or a small degree of improvement on the ratings of change, being instead much more likely to report feeling 'much better' vs. an improvement of 3 or more on the ratings of concept: mobility n = 12 (35 %) vs. n = 2 (6 %), physical    (Fig. 2). Correlations between patient ratings of change at 2 and 12 weeks and patient ratings of concept at the same assessment were at least moderate and statistically significant, whereas correlations with ratings of concept at the previous assessment were smaller and (at 12 weeks) generally not significant. Thus, the correlations between 2-week ratings of change and concept were:

Correlation with OPAQ-PF score change
Correlations with the OPAQ-PF score change were stronger for change in ratings of concept (Table 2) than ratings of change (Table 3).

OPAQ-PF score changes by patient ratings of change and change in global concept
In terms of comparisons of OPAQ-PF change scores between the categories of ratings of concept (Table 2) and ratings of change (Table 3), while there were significant differences in scores for two of the six evaluations of global ratings of concept (Transfers at week 2, p < 0.05, and Physical Positions at week 12, p < 0.01) and global ratings of change (Mobility at week 2 and Physical Positions at week 12, both p < 0.05), the associations were more likely to show significant linearity for the ratings of concept: Physical Positions at week 2, Transfers at week 2, Mobility at week 12 (all p < 0.05), Physical Positions at week 12, Transfers at week 12 (both p < 0.01); compared with the ratings of change: Transfers at week 12 (p < 0.05) and Physical Positions at week 12 (p < 0.01).

Effect sizes
In line with the patterns shown for linearity, effect sizes for change in OPAQ-PF score at each time point were notably irregular across categories of ratings of change (Table 3) while generally increasing consistently by level of change for ratings of concept (Table 2; Additional file 1: Figure S1). For example, at week 12, effect sizes for OPAQ-PF score change increased from 0.10 in those reporting no change in the Physical Positions concept, to 0.82 in those reporting a 1-point change, 3.48 in those reporting a 2-point change, and 4.97 in those reporting a 3 to 4-point change. In terms of Physical Positions ratings of change, OPAQ-PF effect sizes were 0.30 for 'no change', 1.11 for 'a little better', 6.4 for 'moderately better', and 0.86 for 'much better'.

ROC curves
ROC curves were obtained for the OPAQ-PF based on the ratings of concept and ratings of change for a minimum of a one-point change (Additional file 2: Figure S2). The characteristics of the ROC curves are summarised in Table 4, which shows the AUC (with 95 % confidence intervals) and best OPAQ-PF cut-points for at least a one-point improvement on the ratings of concept and ratings of change items at weeks 2 and 12. The OPAQ-PF showed good ability to differentiate patients who had/had not shown a one-point improvement on the ratings of concept/ratings of change, although disparities were found between the two methods, with the ratings of concept generally being associated with greater predictive power of the OPAQ-PF. The ratings of change results at each time point had an overall mean AUC of 0.56 (range 0.37-0.78), with the AUC being less than 0.5 for each of the week 2 ROC curves, indicating that the OPAQ-PF change score was not predictive of these ratings. For the ratings of concept, the mean AUC was 0.73 (range 0.60-0.87), and all were ≥0.5 and therefore predictive. The ratings of concept had slightly worse sensitivity but better specificity compared with the ratings of change (mean sensitivity over all time points 0.68, range 0.48-0.88 vs. mean 0.76, range 0.71-0.79; specificity mean 0.66, range 0.23-0.79 vs. mean 0.49, range 0.27-0.73), with the greater sensitivity of the ratings of change being obtained at the expense of low specificity.

Discussion
This study provides empirical data to support previous discussions of the methodological advantages and disadvantages of patient global ratings of change and patient global ratings of concept for interpreting PRO score change [4]. This study included osteoporosis patients who were expected to remain stable in terms of the concept of interest over a two-week period from study baseline (no recent fracture patients) and those who were expected to report improvement on the concept of interest over 12 weeks (recent fracture patients). Therefore, the study design allowed for an evaluation of the performance of patient global ratings of change and patient global ratings of concept for interpreting both patient stability and improvement in the PRO of interest, the OPAQ-PF. Substantial disparities were found between the performance of the ratings of change and ratings of concept in terms of level of change identified, but the ratings of concept consistently outperformed the ratings of change in terms of better informing interpretation of change in OPAQ-PF scores. Thus, while the patients without recent fracture were more likely to report stability (no change) two weeks after baseline on the ratings of change items, any changes which were reflected in the change in OPAQ-PF scores were more likely to be identified by the ratings of concept: correlations with OPAQ-PF score change at two weeks were higher for ratings of concept than ratings of change. Similarly, although the recent fracture patients were more likely to report substantial improvement on the ratings of change, in line with expected change, correlations with OPAQ-PF score change were stronger and more likely to be statistically significant for changes in the ratings of concept than the ratings of change.
Thus, OPAQ-PF change scores were more likely to be different between the ratings of concept change than the ratings of change categories across both time points: effect sizes for OPAQ-PF score change generally increased linearly by level of ratings of concept change but showed an irregular pattern for ratings of change. The ROC curves also indicated that in terms of relative balance between sensitivity and specificity and the overall AUC, the OPAQ-PF had stronger discriminating properties in terms of the ratings of concept than those based on the ratings of change. Results are stronger for the week 12 data because of the greater likelihood of stability at week 2 rather than change.
It is important to note that, as the patients completed both measures of change and concept on the same occasions, the discrepancies identified in this analysis reflect differences in the way in which patients completed the ratings of change compared with the ratings of concept. At week 12, patients were required to think back 12 weeks to their baseline visit in order to evaluate change in the concept of interest (e.g. mobility). It is likely to have been a challenge for participants to think back accurately over this time period in order to be able to rate their change. Correlation analysis conducted in this study indicates a systematic bias in patient ratings of change. The greater correlation between ratings of change and ratings of concept at the same time-point, compared with correlations with baseline ratings of concept, suggests patients are influenced more by how they feel currently than by an accurate assessment of how they felt previously. This is consistent with previous reports of retrospective recall of change at follow-up being positively correlated with concurrent PRO scores and either uncorrelated or positively correlated with baseline scores [8,15]. These findings indicate that respondents with good health at follow-up are more likely to assume that their health has recently improved, and respondents with poor health are more likely to assume that it has worsened [16].
There are issues associated with osteoporosis which are likely to increase measurement error in patients' self-reports. Specifically, the length of recall required in order to capture change associated with the healing of a fracture meant that patients were asked to recall over a substantial period of time (12 weeks) for the global rating of change item. Given the older age of osteoporosis patients, this length of recall may be a specific challenge for these patients, leading to greater recall inaccuracies than may be experienced in other indications. It was for this reason that the global ratings of change at 24 weeks asked about change since the last visit rather than from baseline. The significant comorbidity experienced in osteoporosis may also influence the reporting on both the ratings of concept and ratings of change, as patients may find it hard to separate physical function impacts that are a specific consequence of osteoporosis from those associated with comorbidities. Specifically, over a third of the patients with a recent fracture had osteoarthritis. The global items asked patients to report their difficulty with or change in mobility, physical positions and transfers 'due to your osteoporosis', and the extent to which patients were able to attribute their experience to their osteoporosis or osteoarthritis was not evaluated in this analysis. It is possible that change in difficulty with mobility, physical positions and transfers may have occurred due to the patient's osteoarthritis, which might have presented a reporting challenge to these patients or meant that these patients were not as stable as the analysis has understood them to be. This study had several limitations, most notably that only those in the relatively small 'recent fracture' group in this study were hypothesized to change, and therefore there were few subjects in each of the relevant categories of change. This leads to instability of and uncertainty around the estimates calculated.
This study was not purpose-designed to evaluate the research question presented, and instead represents secondary analysis of data that were collected to evaluate the psychometric measurement properties of the OPAQ-PF reported elsewhere [10]. This study is therefore limited to providing an indication of the relative performance of the two approaches. The findings from this study need to be confirmed in a purpose-designed larger-scale analysis before firm conclusions can be drawn regarding the statistical performance of patient global ratings of concept compared with patient global ratings of change.
Further limitations to the study include the fact that the sample was more suitable for evaluating stability or change in terms of improvement rather than decline. Although the ratings of change and ratings of concept allowed for report of decrement, the validation study inclusion criteria were designed to identify patients who were anticipated to remain stable (with no recent fracture history) or improve (following a recent fracture). Further work is required in order to determine whether the benefit of ratings of concept over ratings of change reported here is maintained in a study which sees patients experiencing decrement in the concept of interest. Secondly, no overall physical function global ratings of concept and change items were developed to match conceptually the final uni-dimensional structure of the OPAQ-PF; instead three patient global ratings of concept and equivalent ratings of change were developed to reflect the three content areas of the OPAQ-PF (mobility, physical positions and transfers). However, this did provide more granular results than would be possible with a single global item. Following completion of the data collection on which this analysis is based, the OPAQ-PF response option 'completely avoided doing this' was changed to 'unable to do' in line with feedback from the regulatory authorities. Finally, the impact of response shift has not been considered, where an individual's criteria for the construct of interest change during the course of illness and treatment, possibly leading to a modification of their internal standards, values and conceptualization of the target construct [17]. Response shift is an issue for patients reporting on ratings of concept as much as it is for those reporting on ratings of change, adding unmeasured variability not considered in our results.

Conclusion
This study provides initial empirical support for methodological and regulatory recommendations to use patient global ratings of concept when evaluating interpretation of change for PRO instruments in studies evaluating treatment effects. It provides further evidence for the role of present state bias in leading patients systematically to overestimate their degree of improvement (or worsening) when using patient global ratings of change. These findings warrant being confirmed in a purpose-designed larger scale study.

Additional files
Additional file 1: Figure S1. Effect sizes for OPAQ-PF total score change from baseline at 2 weeks (no recent fracture patients) and 12 weeks (recent fracture patients) by Mobility, Physical Positions, and Transfers global ratings of change and change in ratings of concept. (DOCX 169 kb)
Additional file 2: Figure S2. ROC curves identifying the best cut point (BCP, indicated by the arrow) of OPAQ-PF change scores for an improvement of 1 point on Mobility, Physical Positions, and Transfers ratings of change and ratings of concept at weeks 2 (no recent fracture patients) and 12 (recent fracture patients). (DOCX 188 kb)

Competing interests
At the time of the research AN, HD and CK were employees of ICON Patient Reported Outcomes, Oxford, UK. RB and ANN are employees of Eli Lilly. ICON Patient Reported Outcomes received payment from Eli Lilly for conducting the analysis and writing the manuscript. Eli Lilly paid the article-processing charge.

Authors' contributions
AN: Took the lead on the design of the validation study that provided the data for this study, oversaw data collection, contributed to results interpretation, and took the lead on manuscript development. HD: Designed the analysis, performed the analysis, took the lead on results interpretation, and contributed to manuscript development. CK: Contributed to the study design of the validation study, oversaw data collection, contributed to results interpretation, and contributed to manuscript development. RB: Contributed to the design of the validation study, contributed to results interpretation and contributed to manuscript development. ANN: Contributed to the design of the validation study, contributed to results interpretation and contributed to manuscript development. All authors read and approved the final manuscript.