Detecting short-term change and variation in health-related quality of life: within- and between-person factor structure of the SF-36 health survey

Background A major goal of much aging-related research and geriatric medicine is to identify early changes in health and functioning before serious limitations develop. To this end, regular collection of patient-reported outcome measures (PROMs) in a clinical setting may be useful to identify and monitor these changes. However, existing PROMs were not designed for repeated administration and are more commonly used as one-time screening tools; as such, their ability to detect variation and their measurement properties when administered repeatedly remain unknown. In this study we evaluated the potential of the RAND SF-36 Health Survey as a repeated-use PROM by examining its measurement properties when modified for administration over multiple occasions. Methods To distinguish between-person (i.e., average) from within-person (i.e., occasion) levels, the SF-36 Health Survey was completed by a sample of older adults (N = 122, Mage = 66.28 years) daily for seven consecutive days. Multilevel confirmatory factor analysis (CFA) was employed to investigate the factor structure at both levels for two- and eight-factor solutions. Results Multilevel CFA models revealed that the correlated eight-factor solution provided better model fit than the two-factor solution at both the between-person and within-person levels. Overall model fit for the SF-36 Health Survey administered daily was not substantially different from standard survey administration, though both were below optimal levels as reported in the literature. However, individual subscales did demonstrate good reliability. Conclusions Many of the subscales of the modified SF-36 for repeated daily assessment were found to be sufficiently reliable for use in repeated measurement designs incorporating PROMs, though the overall scale may not be optimal.
We encourage future work to investigate the utility of the subscales in specific contexts, as well as the measurement properties of other existing PROMs when administered in a repeated measures design. The development and integration of new measures for this purpose may ultimately be necessary.


Background
A chief goal of aging-related research and geriatric medicine is to enhance quality of life by identifying early changes in health and functioning that may herald more serious problems in the future, and to intervene before serious limitations develop. A potential avenue to monitor these changes is through regular collection of patient-reported outcome measures (PROMs), self-reports related to symptoms (e.g., type, frequency, severity, duration), functioning (e.g., health limitations, activities of daily living), perceptions (e.g., satisfaction with treatment) and overall well-being. The early cognitive, behavioral and physical changes that characterize advancing age are difficult to detect and may vary from day to day within individuals [1]. Consequently, opportunities to intervene at clinically relevant times are diminished, which may reduce possibilities for prevention and advanced care planning. The assessment of change and variation over time using PROMs is a promising but under-examined area of the literature.
Both patient self-report and direct measurement of functioning can be collected regularly, in a clinic or home setting, to monitor a patient's health and change in functioning. Based on the premise that the most robust clinical approaches for detecting change or the effects of clinical interventions require repeated measurements, regular collection of PROMs is a means to establish stable patient baselines against which fluctuations and systematic changes can be identified and used to trigger clinical interventions. Frequent assessment provides less biased and more representative sampling of patient symptoms, functioning and quality of life indicators than single assessments, which are susceptible to recall bias and other errors [2]. In addition, measures taken at a single point in time evaluate scores against norm-referenced standards, and their clinical sensitivity and specificity are inherently limited by the heterogeneity of the populations within which they are employed. However, very few research studies have investigated the utility and impacts of repeated PROM administration as a means for enabling the individual to establish their own baseline or reference standard. Repeated assessment of PROMs permits the detection of individual change, such as in response to treatment, with a high degree of sensitivity, and allows change to be interpreted relative to the patient's own prior level rather than through the more typical normative between-person interpretation.
Despite the measurement limitations inherent in the norm-referenced paradigm, one-time and short-term use of PROMs has demonstrated improvements in patient satisfaction, well-being and autonomy, patient-physician communication and the detection of mental health diagnoses [3][4][5][6]. In addition, patient reports have been found to contribute unique predictive power to models of mortality [7] and in some cases are more informative than physician ratings of patient status [8][9][10], which may reflect the broader impacts of symptoms on everyday life and overall well-being captured by self-reports [7]. As such, regular utilization of PROM data within the health care system has foreseeable benefits for both clinical practice and research, and may provide an important complement to clinician-derived health data. With recent wider adoption of electronic health records and renewed patient-centred focus, Wu and colleagues [11] have argued that the integration of PROMs in health care is now more feasible than ever before. Despite these calls for action [12], PROMs are not typically part of routine clinical care appointments or standard prognostic assessments [13,14], with few exceptions (e.g., [15]). This may result in a loss of relevant individualized patient information which may be useful to complement health records, guide medical decision-making and patient management, and inform research efforts [16,17].
Clinical use of PROMs may prove useful to improve patient health behaviours, outcomes and patient management, though evidence is mixed (e.g., [18,19]). In large part, this lack of consensus stems from a general lack of repeated measures studies to drive evidence-based medicine; relatedly, the psychometric properties of PROMs when administered over multiple occasions remain largely unknown. Given that the majority of these measures were designed for one-time use to compare between-person (i.e., average) differences, their suitability to monitor fluctuations and systematic changes should not be taken for granted [20]. However, while the development of new measures is a time-consuming process, the adaptation and evaluation of an already-available measure may be a more practicable approach to achieving the goal of regular in-clinic patient assessment. Examples include the NIH-funded Patient-Reported Outcomes Measurement Information System [21], the Functional Assessment of Chronic Illness Therapy questionnaires [22], the Health Assessment Questionnaire [23], the RAND 36-Item Short Form Health Survey (SF-36; [24]) and disease-specific scales such as the Diabetes Health Profile [25]. The SF-36 is brief, disease-generic and readily available free of charge (version 1.0); despite critiques of its measurement properties [26][27][28][29][30], its ubiquity and its conceptualization of health as a well-rounded concept comprising eight domains make it a suitable first candidate for this investigation.
Two major research questions remain to be answered, and are addressed in the present study. First, is the SF-36 sensitive enough for reliable detection of short-term variation at the within-person (i.e., occasion) level? This demonstration is an important first step in the selection of a measure to monitor long-term change. The second question concerns the psychometric properties of the SF-36 when administered over multiple occasions: when used in this new way, are its factor structure and goodness of fit indices different from standard administration? To explore these questions, the present study applied an intensive repeated measures design to disaggregate within-person variance (i.e., daily deviations from personal levels) from between-person variance. This approach allows simultaneous but separate modeling of the daily within-person fluctuations and the between-person differences to yield both within-person and between-person factor structures. Such a continuously intensive design is likely not feasible in real-world applications of PROMs, where more widely spaced assessments repeated at regular intervals in the context of a measurement burst design would suffice to broadly capture person-level change by distinguishing short-term variations from long-term changes [31,32]. As yet, the within-person structure of the SF-36 remains unexamined, though this analysis is central to our understanding of its potential as a repeated-use PROM.

Sample
We recruited 122 older adults through advertisements placed in a local family health clinic seeking people aged 50 and older for research on health and well-being during aging. The sample for analysis had a mean age of 66.28 years (SD = 8.57, range: 50-88), was evenly split between the sexes (55 % female) and rated general health as good on the SF-36 general health item (M = 6.73, SD = 1.62). All participants provided informed consent to participate and ethical approval was obtained from the University of Victoria and Vancouver Island Health Authority Joint Research Ethics Sub-Committee (protocol number J2012-70).

Measures
The RAND 36-Item Short Form Health Survey (SF-36; version 1.0) was developed at RAND Health as part of the Medical Outcomes Study. It is a brief and easily administered measure of health-related quality of life and consists of 36 multiple-choice items assessing eight health domains: physical functioning; role limitations due to physical health; role limitations due to emotional problems; vitality; mental health; social functioning; bodily pain; and general health. Physical and mental component summary scores can also be computed. Scores for each domain range from 0 to 100 where 100 indicates an excellent health state and no reported symptoms. This simple linear transformation was performed to improve interpretation of small estimates.

Procedures
Participants completed the standard SF-36 at baseline and provided up to seven responses on consecutive days to the survey modified for repeated administration. This simple modification instructed participants to respond based on the previous 24 h by adjustment of the timescale to which items referred. Exemplars are presented in Table 1. Items 33-36 under the General Health subscale were not included in the daily survey as they were not relevant to daily experience. Data from two items were lost due to technical difficulties with the electronic medical record system used for data collection. Of a possible 854 total assessments (122 patients × 7 days), 694 complete observations were obtained (81 %; M = 5.69), which provides sufficient statistical power for our analyses. Each session was completed via computer through a web-based patient portal survey tool and required approximately 20 min daily. Though burden can be a concern in intensive repeated measures designs such as this, over 96 % of participants reported willingness to take part in similar future studies, an indication that the time commitment was not too great.

Analytic approach
To evaluate our first research question, the ability of the SF-36 to detect short-term variation at the within-person level, we computed intraclass correlation coefficients (ICC) for the 30 items and eight subscales of the survey. This metric gives the ratio of between-person variance to total variance. The remaining proportion of the variability (i.e., 1 - ICC) gives an indication of the amount of within-person variability. Thus small ICC values (i.e., < 0.50) indicate items which capture more within-person (i.e., occasion) than between-person (i.e., average) variation. To evaluate our second research question, confirmatory factor analyses (CFA) were run based on the published eight-factor (eight subscales among the 36 items) and two-factor (two summary component scores among the 36 items) SF-36 structure [33]. Single-level CFA models were run for the standard survey administered at baseline and multilevel CFAs were run for the modified survey administered daily to evaluate both the within-person and the between-person factor structure. In multilevel factor analysis, the within-person factor structure reflects common covariance among the items on each specific day, pooled across days and individuals. The between-person factor structure reflects common covariance in individual mean levels of each item aggregated across time [34].
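The variance decomposition behind the ICC can be illustrated with a short sketch. It is not the estimation routine used in this study: it assumes a one-way random-effects ANOVA estimator that tolerates unequal numbers of days per person, and the function name `icc_oneway` and the toy data are hypothetical.

```python
from collections import defaultdict


def icc_oneway(person_ids, scores):
    """ICC from a one-way random-effects ANOVA (assumed estimator, not the paper's).

    Returns between-person variance / (between-person + within-person variance).
    Handles unequal numbers of daily observations per person.
    """
    groups = defaultdict(list)
    for pid, y in zip(person_ids, scores):
        groups[pid].append(y)

    n_total = sum(len(v) for v in groups.values())
    n_groups = len(groups)
    grand_mean = sum(sum(v) for v in groups.values()) / n_total

    # Sums of squares around the grand mean (between) and person means (within).
    ss_between = sum(
        len(v) * (sum(v) / len(v) - grand_mean) ** 2 for v in groups.values()
    )
    ss_within = sum(
        (y - sum(v) / len(v)) ** 2 for v in groups.values() for y in v
    )

    ms_between = ss_between / (n_groups - 1)
    ms_within = ss_within / (n_total - n_groups)

    # Effective number of observations per person for unbalanced data.
    k0 = (n_total - sum(len(v) ** 2 for v in groups.values()) / n_total) / (
        n_groups - 1
    )

    # Negative variance estimates are truncated at zero.
    var_between = max((ms_between - ms_within) / k0, 0.0)
    return var_between / (var_between + ms_within)


if __name__ == "__main__":
    # Hypothetical toy data: two people, two days each.
    ids = [1, 1, 2, 2]
    y = [1.0, 3.0, 5.0, 7.0]
    icc = icc_oneway(ids, y)
    print(f"ICC = {icc:.3f}, within-person share = {1 - icc:.3f}")
```

Under this reading, an item with an ICC below 0.50 has more than half of its variance located at the occasion level, which is the interpretation applied to the daily SF-36 items and subscales.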
Goodness of fit and sources of model misfit were examined with the comparative fit index (CFI), Tucker-Lewis index (TLI), root-mean-square error of approximation (RMSEA) and standardized root-mean-square residual (SRMR). Ideal values range from .90 (acceptable) to greater than .95 (good) for CFI and TLI; and from .08 (plausible/acceptable) to less than .05 (good) for RMSEA and SRMR [35]. While the CFI, TLI and RMSEA are indicators of overall fit, the SRMR provides separate fit indices for the within- and between-person levels. All models were estimated using Mplus Version 7 [36] with robust maximum likelihood estimation (MLR).
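The cut-offs above can be written as a small lookup, shown here only to make the decision rule explicit. The function name `rate_fit` is hypothetical, and the treatment of values falling exactly on a boundary (e.g., RMSEA = .08 counted as acceptable) is an assumption, since the text does not specify it.

```python
def rate_fit(index, value):
    """Classify a fit index value as 'good', 'acceptable' or 'poor'.

    Thresholds follow the ranges quoted in the text; the handling of
    boundary values is an assumption made for this sketch.
    """
    name = index.upper()
    if name in ("CFI", "TLI"):
        # Higher is better: > .95 good, >= .90 acceptable.
        if value > 0.95:
            return "good"
        if value >= 0.90:
            return "acceptable"
        return "poor"
    if name in ("RMSEA", "SRMR"):
        # Lower is better: < .05 good, <= .08 acceptable.
        if value < 0.05:
            return "good"
        if value <= 0.08:
            return "acceptable"
        return "poor"
    raise ValueError(f"Unknown fit index: {index}")


if __name__ == "__main__":
    # Hypothetical values for illustration only, not results from this study.
    for idx, val in [("CFI", 0.91), ("TLI", 0.88), ("RMSEA", 0.06), ("SRMR", 0.04)]:
        print(idx, val, rate_fit(idx, val))
```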

Detection of within-person variability by the daily SF-36
Means, standard deviations and ICC values for the 30 items and eight subscales of the daily SF-36 are presented in Table 2. At the subscale level, we observed a wide range of ICC values, an indication that some subscales captured more daily fluctuations than others. The Emotional Role Limitations subscale with an ICC of .38 captured the largest proportion (62 %) of within-person variability; that is, response patterns to this subscale were more closely aligned with occasion-specific fluctuations than with stable differences between individuals. On the other hand, the Physical Functioning subscale with an ICC of .89 captured only 11 % of within-person variability. The majority of the variance in responses to this subscale was due to between-person differences. As with the subscales, item-level ICC values ranged from .09 for yesterday (General Health; 91 % within-person variability) to .84 for walk mile (Physical Functioning; 16 % within-person variability). Items within the same subscale generally exhibited similar proportions of within-person variation with only a few exceptions (see the Physical Functioning and General Health subscales). Figure 1 provides an illustration of the extent of dynamic, within-person variation from day to day on each SF-36 subscale (variation around the person-mean), as well as the degree of stable, between-person differences (variation around the sample mean).
Within- and between-person reliability estimates were computed for each subscale from the application of the multilevel omega (ω) to both levels of the multilevel CFA models [37]. Reliability estimates could not be computed for three subscales (i.e., Social Functioning, Bodily Pain and General Health) due to an insufficient number of items per subscale. Within-person reliability estimates ranged from .60 (Mental Health) to .74 (Vitality). Between-person reliability ranged from .90 (Mental Health) to .96 (Physical Functioning) and was consistently higher for the daily SF-36 than for the standard SF-36, which ranged from .76 (Mental Health) to .92 (Physical Functioning).
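As a rough illustration of how an omega coefficient is obtained from CFA estimates, the sketch below uses the common composite-reliability form, ω = (Σλ)² · ψ / ((Σλ)² · ψ + Σθ), applied separately to the loadings (λ), factor variance (ψ) and residual variances (θ) of one level. This form is assumed here rather than taken from the study, and the function name `omega` and the numeric inputs are hypothetical.

```python
def omega(loadings, residual_variances, factor_variance=1.0):
    """Composite reliability (omega) for one factor at one level.

    Assumed formula: (sum of loadings)^2 * factor variance divided by
    that quantity plus the sum of residual variances. For a multilevel
    model it would be applied once to the within-person estimates and
    once to the between-person estimates.
    """
    if len(loadings) != len(residual_variances):
        raise ValueError("Need one residual variance per loading.")
    true_variance = factor_variance * sum(loadings) ** 2
    return true_variance / (true_variance + sum(residual_variances))


if __name__ == "__main__":
    # Hypothetical level-specific estimates for a three-item subscale.
    within = omega([0.5, 0.6, 0.4], [0.75, 0.64, 0.84])
    between = omega([0.9, 0.9, 0.8], [0.19, 0.19, 0.36])
    print(f"within-person omega  = {within:.2f}")
    print(f"between-person omega = {between:.2f}")
```

Because the within-person loadings in such a model are typically smaller relative to the residual variances, the same subscale can show a noticeably lower omega at the within-person level than at the between-person level.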

Within- and between-person factor structure of the standard and daily SF-36
Factor loadings and goodness of fit indices for the correlated eight-factor and two-factor models are presented in Tables 3 and 4, respectively.
The eight-factor model fit to the standard SF-36 administered at baseline was not optimal as per Hu and Bentler's criteria [35]. All items loaded onto their respective subscale factor with the exception of one item under General Health (last year). Significant, moderate-to-high correlations were observed between all eight factors (range r = .41 to .80, ps < .001). Model fit was not substantially different for the modified SF-36 administered daily; all items loaded significantly onto their respective subscale factor with the exception of one item at the between-person level (yesterday under General Health) and three items at the within-person level (vigorous under Physical Functioning; calm and happy under Mental Health). All factors were significantly correlated at the between-person level (range r = .34 to .84, ps < .01). Within-person factor correlations were smaller (range r = .29 to .85, ps < .01) and not found between the Mental Health factor and others, or between the Physical Functioning and Physical or Emotional Role Limitations factors (see Table 5).
For both versions of the SF-36, overall model fit as assessed by the CFI, TLI and RMSEA was better in the eight-factor model than in the two-factor model representing physical and mental summary components. All items in the standard SF-36 loaded onto their respective summary component, but the within-person factor loadings of several physical summary scale items on the daily SF-36 were non-significant. This includes all of the Physical Role Limitations items and one each under Physical Functioning, Bodily Pain and General Health. At the between-person level in the daily SF-36, only yesterday under General Health did not load onto the physical summary while the Emotional Role Limitations items and two Mental Health items did not load onto the mental summary factor. Both summary factors were significantly correlated to a moderate-to-high degree in both the standard (r = .68, p < .001) and daily (r within = .61, p < .05; r between = .63, p < .05) SF-36 models. As per the hypothetical factor structure originally proposed by Ware, Kosinski and Keller [38], orthogonal models were evaluated for both solutions. Model fit across all indices was poor (results not shown here).

Discussion
In recent years, the integration of PROMs into clinical practice to improve health outcomes and the patient experience [3][4][5][6] has been increasingly recognized as a worthwhile pursuit feasible through the use of electronic medical record systems [11][12][13]. Regular use may facilitate the identification of early changes that may herald more serious health problems in the future, and may provide opportunities for clinical intervention. The first step to achieving this goal and evaluating its impact is to investigate which measures are best able to detect short-term variation and systematic change (from an established baseline) at a within-person level. However, the majority of available PROMs were designed to detect between-person differences, a snapshot of one point in time, rather than monitor change. This hurdle may help to explain why the literature is largely missing repeated measures studies of PRO to address the first step. To facilitate this type of research and avoid the necessarily time-consuming complexities of the development and validation of a new survey measure, the present study investigates whether the widely-used and disease-generic SF-36 can serve as a repeated PROM; that is, whether it can reliably detect person-level variation without sacrificing measurement properties as determined by its factor structure.
Our first research question asked whether the SF-36 is sensitive enough for the detection of short-term within-person variation. This was answered through inspection of ICC values for each item and subscale after seven consecutive days of responses. We found a wide range of ICC values, indicating that some items captured a greater proportion of daily dynamics relative to stable, between-person differences, than others. Visual inspection of scatterplots for a random sample further illustrated varying degrees of within-person variation across days and subscales. In particular, the Emotional Role Limitations, Mental Health, Social Functioning and General Health subscales revealed the largest magnitude of within-person variation, suggesting that these components of health may be key indicators for PROM monitoring, perhaps because they are more likely to be impacted by daily events and activities. The presence of day-to-day variation highlights the need to utilize repeated measurements in order to disaggregate within-person variations from between-person differences. Failing to account for these within-person fluctuations in health outcomes assumes that they are stable and prevents us from understanding the impact that daily variations in health have on the individual.
Our second research question asked whether the psychometric properties of the SF-36 were maintained during "off-label" use as a repeated PROM. That is, do items continue to load onto their respective subscales, and subscales onto summary components, to comprise the same latent factors yielded by standard use of the survey? To evaluate this, we compared the factor structure of the standard survey administered at baseline to that of the daily survey. We found no substantial differences between them, indicating that summarizing item responses by subscales and summary components is appropriate to monitor person-level change. However, the fit indices of both versions were sub-optimal. To evaluate the sources of model misfit, we inspected the modification indices and noted that the primary sources were in the Vitality and Mental Health subscales, particularly at the within-person level, such that the positive items loaded together. This is in line with the structure of most measures of positive and negative affect such as Watson's Positive and Negative Affect Schedule [34,39]. We evaluated an alternative factor structure for both the baseline and daily SF-36, allowing positive and negative Vitality and Mental Health items to load onto separate factors, but found that it did not substantially improve overall model fit (results not shown).
Table 4 Standardized factor loadings and goodness of fit indices from multilevel confirmatory factor analyses of the baseline (standard) and daily administrations of the SF-36 (correlated 2-factor model)
Our findings on the sub-optimal model fit of the SF-36 are consistent with previous work which has raised issue with the factor structure and construct validity obtained by the recommended orthogonal scoring procedure [25][26][27][28][29] and the reduction to summary component measures [27,[40][41][42][43][44]. Thus, although the daily SF-36 exhibited similar psychometric properties to the standard survey, sub-optimal fit indices in both cases lead us to recommend caution in using the SF-36 in its entirety as a repeated PROM. However, while the overall multifactor model of the SF-36 exhibited sub-optimal fit indices, many of the subscales demonstrated acceptable to good reliability estimates when examined independently. Researchers may find utility in focusing on improving and expanding the specific subscales for use in certain contexts. Including additional items for the subscales that contained only two items and reconsidering the arrangement of the Mental Health and Vitality subscales into positive and negative affect subscales (e.g., [34,39]) are two potentially fruitful avenues to explore.
A limitation of this study is the relatively healthy sample, which may explain why some items on the daily SF-36 (e.g., walk mile under Physical Functioning, magnitude under Bodily Pain) exhibited little within-person variation. Alternatively, this may be because some health-related factors are simply unlikely to show short-term change, or because some items are not sensitive enough to detect the occurrence of short-term changes. A second limitation is the extent to which sample heterogeneity may have contributed to the overall poor fit. There is some evidence that the SF-36 factor structure may differ among patient subgroups, particularly those with comorbidity (e.g., [42,44,45]); that is, some survey subscales may have disease-specific relationships with either summary score. However, this has only been found to be a concern for the two-factor structure of the survey, so it is unlikely to have substantially affected our results given our focus on the eight-factor structure.
This study extends prior research on PRO assessment to consider the utility of the SF-36 as a PRO measure for repeated administration. This was accomplished through evaluation of its factor structure at the within-person level. We found that the SF-36 modified for repeated administration has a similar factor structure to the standard version, indicating maintenance of measurement properties when used "off-label," though model fit remained sub-optimal. However, many subscale reliabilities ranged from acceptable to good at both the within-person and between-person levels. Therefore, while we conclude that the SF-36 in its entirety may not be an adequate measure for repeated PRO assessment, we recommend future work to examine the utility of the subscales in specific contexts, as well as the within-person factor structure of other PROMs currently in use (e.g., [20][21][22]). This is an important first step in the measurement of daily PRO assessments in primary health care. Future research can build upon this work in moving toward the goal of regular in-clinic patient assessment, early detection of the cognitive, behavioral and physical changes that characterize potentially reversible conditions, and personalized interventions and health care. This may be more easily facilitated by the adaptation and integration of existing measures than by the development of new surveys.

Conclusions
Many of the subscales of the modified SF-36 for repeated daily assessment were found to be sufficiently reliable for use in repeated measurement designs incorporating PROMs, though the overall scale may not be optimal. We encourage future work to investigate the utility of the subscales in specific contexts, as well as the measurement properties of other existing PROMs when administered in a repeated measures design. The development and integration of new measures for this purpose may ultimately be necessary.