Detecting short-term change and variation in health-related quality of life: within- and between-person factor structure of the SF-36 health survey



A major goal of much aging-related research and geriatric medicine is to identify early changes in health and functioning before serious limitations develop. To this end, regular collection of patient-reported outcome measures (PROMs) in a clinical setting may be useful to identify and monitor these changes. However, existing PROMs were not designed for repeated administration and are more commonly used as one-time screening tools; as such, their ability to detect variation and their measurement properties when administered repeatedly remain unknown. In this study we evaluated the potential of the RAND SF-36 Health Survey as a repeated-use PROM by examining its measurement properties when modified for administration over multiple occasions.


To distinguish between-person (i.e., average) from within-person (i.e., occasion) levels, the SF-36 Health Survey was completed by a sample of older adults (N = 122, mean age = 66.28 years) daily for seven consecutive days. Multilevel confirmatory factor analysis (CFA) was employed to investigate the factor structure at both levels for two- and eight-factor solutions.


Multilevel CFA models revealed that the correlated eight-factor solution provided better model fit than the two-factor solution at both the between-person and within-person levels. Overall model fit for the SF-36 Health Survey administered daily was not substantially different from standard survey administration, though both were below optimal levels as reported in the literature. However, individual subscales did demonstrate good reliability.


Many of the subscales of the modified SF-36 for repeated daily assessment were found to be sufficiently reliable for use in repeated measurement designs incorporating PROMs, though the overall scale may not be optimal. We encourage future work to investigate the utility of the subscales in specific contexts, as well as the measurement properties of other existing PROMs when administered in a repeated measures design. The development and integration of new measures for this purpose may ultimately be necessary.


A chief goal of aging-related research and geriatric medicine is to enhance quality of life by identifying early changes in health and functioning that may herald more serious problems in the future, and to intervene before serious limitations develop. A potential avenue to monitor these changes is through regular collection of patient-reported outcome measures (PROMs), self-reports related to symptoms (e.g., type, frequency, severity, duration), functioning (e.g., health limitations, activities of daily living), perceptions (e.g., satisfaction with treatment) and overall well-being. The early cognitive, behavioral and physical changes that characterize advancing age are difficult to detect and may vary from day to day within individuals [1]. Consequently, opportunities to provide intervention efforts at clinically relevant times are diminished, which may reduce possibilities for prevention and advanced care planning. The assessment of change and variation over time using PROMs is a promising but under-examined area of the literature.

Both patient self-report and direct measurement of functioning can be collected regularly, in a clinic or home setting, to monitor a patient’s health and change in functioning. Based on the premise that the most robust clinical approaches for detecting change or the effects of clinical interventions require repeated measurements, regular collection of PROMs is a means to establish stable patient baselines against which fluctuations and systematic changes can be identified and used to trigger clinical interventions. Frequent assessment provides less biased and more representative sampling of patient symptoms, functioning and quality of life indicators than single assessments, which are susceptible to recall bias and other errors [2]. In addition, measures taken at a single point in time evaluate scores against norm-referenced standards, and their clinical sensitivity and specificity are inherently limited by the heterogeneity of the populations within which they are employed. However, very few research studies have investigated the utility and impacts of repeated PROM administration as a means for enabling individuals to establish their own baseline or reference standard. Repeated assessment of PROMs permits the detection of individual change, such as in response to treatment, with a high degree of sensitivity, and allows change to be interpreted relative to the individual’s own prior level rather than through the more typical normative between-person comparison.

Despite the measurement limitations inherent in the norm-referenced paradigm, one-time and short-term use of PROMs has demonstrated improvements in patient satisfaction, well-being and autonomy, patient-physician communication and the detection of mental health diagnoses [3–6]. In addition, patient reports have been found to contribute unique predictive power to models of mortality [7] and in some cases are more informative than physician ratings of patient status [8–10], which may reflect the broader impacts of symptoms on everyday life and overall well-being captured by self-reports [7]. As such, regular utilization of PROM data within the health care system has foreseeable benefits for both clinical practice and research, and may provide an important complement to clinician-derived health data. With recent wider adoption of electronic health records and renewed patient-centred focus, Wu and colleagues [11] have argued that the integration of PROMs in health care is now more feasible than ever before. Despite these calls for action [12], PROMs are not typically part of routine clinical care appointments or standard prognostic assessments [13, 14], with few exceptions (e.g., [15]). This may result in a loss of relevant individualized patient information which may be useful to complement health records, guide medical decision-making and patient management, and inform research efforts [16, 17].

Clinical use of PROMs may prove useful to improve patient health behaviours, outcomes and patient management, though evidence is mixed (e.g., [18, 19]). In large part, this lack of consensus stems from a general lack of repeated measures studies to drive evidence-based medicine; related to this, the psychometric properties of PROMs when administered over multiple occasions remain largely unknown. Given that the majority of these measures were designed for one-time use to compare between-person (i.e., average) differences, their suitability to monitor fluctuations and systematic changes should not be taken for granted [20]. However, while the development of new measures is a time-consuming process, the adaptation and evaluation of an already-available measure may be a more practicable approach to achieving the goal of regular in-clinic patient assessment. Examples include the NIH-funded Patient-Reported Outcomes Measurement Information System [21], the Functional Assessment of Chronic Illness Therapy questionnaires [22], the Health Assessment Questionnaire [23], the RAND 36-Item Short Form Health Survey (SF-36; [24]) and disease-specific scales such as the Diabetes Health Profile [25]. The SF-36 is brief, disease-generic and readily available free of charge (version 1.0); despite critiques of its measurement properties [26–30], its ubiquity and its conceptualization of health as a well-rounded concept comprising eight domains make it a suitable first candidate for this investigation.

Two major research questions remain to be answered, and are addressed in the present study. First, is the SF-36 sensitive enough for reliable detection of short-term variation at the within-person (i.e., occasion) level? This demonstration is an important first step in the selection of a measure to monitor long-term change. The second question concerns the psychometric properties of the SF-36 when administered over multiple occasions: when used in this new way, are its factor structure and goodness of fit indices different from standard administration? To further explore these questions, the present study applied an intensive repeated measure design to disaggregate within-person variance (i.e., daily deviations from personal levels) from between-person variance. This approach allows simultaneous but separate modeling of the daily within-person fluctuations and the between-person differences to yield both within-person and between-person factor structures. Such a continuously intensive design is likely not feasible in real-world applications of PROMs, where wider assessments repeated at regular intervals in the context of a measurement burst design would suffice to broadly capture person-level change by distinguishing short-term variations from long-term changes [31, 32]. As of yet, the within-person structure of the SF-36 remains unexamined, though this analysis is central to our understanding of its potential as a repeated-use PROM.



We recruited 122 older adults through advertisements placed in a local family health clinic seeking people aged 50 and older for research on health and well-being during aging. The sample for analysis had a mean age of 66.28 years (SD = 8.57, range: 50–88), was evenly split between the sexes (55 % female) and rated general health as good on the SF-36 general health item (M = 6.73, SD = 1.62). All participants provided informed consent to participate and ethical approval was obtained from the University of Victoria and Vancouver Island Health Authority Joint Research Ethics Sub-Committee (protocol number J2012-70).


The RAND 36-Item Short Form Health Survey (SF-36; version 1.0) was developed at RAND Health as part of the Medical Outcomes Study. It is a brief and easily-administered measure of health-related quality of life and consists of 36 multiple-choice items assessing eight health domains: physical functioning; role limitations due to physical health; role limitations due to emotional problems; vitality; mental health; social functioning; bodily pain and general health. Physical and mental component summary scores can also be computed. Scores for each domain are linearly transformed to range from 0 to 100, where 100 indicates an excellent health state and no reported symptoms. This simple transformation was performed to improve interpretation of small estimates.


Participants completed the standard SF-36 at baseline and provided up to seven responses on consecutive days to the survey modified for repeated administration. This simple modification instructed participants to respond based on the previous 24 h by adjustment of the timescale to which items referred. Exemplars are presented in Table 1. Items 33–36 under the General Health subscale were not included in the daily survey as they were not relevant to daily experience. Data from two items were lost due to technical difficulties with the electronic medical record system used for data collection. Of a possible 854 total assessments (122 patients × 7 days), 694 complete observations were obtained (81 %; M = 5.69 per participant), which provides sufficient statistical power for our analyses. Each session was completed via computer through a web-based patient portal survey tool and required approximately 20 min daily. Though burden can be a concern in intensive repeated measures designs such as this, over 96 % of participants reported willingness to take part in similar future studies, an indication that the time commitment was not too great.

Table 1 Original SF-36 exemplar items and modification for daily survey

Analytic approach

To evaluate our first research question, the ability of the SF-36 to detect short-term variation at the within-person level, we computed intraclass correlation coefficients (ICC) for the 30 items and eight subscales of the survey. This metric gives the ratio of between-person variance to total variance. The remaining proportion of the variability (i.e., 1 − ICC) indicates the amount of within-person variability. Thus, small ICC values (i.e., <0.50) indicate items that capture more within-person (i.e., occasion) than between-person (i.e., average) variation.
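
As a rough illustration of the ICC logic described above, the following Python sketch estimates an ICC from long-format daily data. The function name `icc_oneway`, the toy scores and the choice of a one-way random-effects ANOVA estimator are assumptions made for illustration; the text does not state which estimator was used in the study.

```python
from collections import defaultdict


def icc_oneway(person_ids, scores):
    """Estimate the ICC as between-person variance / total variance.

    Illustrative one-way random-effects ANOVA estimator (an assumption,
    not necessarily the study's method). It tolerates unequal numbers of
    daily observations per person.
    """
    groups = defaultdict(list)
    for pid, score in zip(person_ids, scores):
        groups[pid].append(score)

    n_total = len(scores)
    n_persons = len(groups)
    grand_mean = sum(scores) / n_total

    # Between-person and within-person sums of squares.
    ss_between = sum(
        len(vals) * (sum(vals) / len(vals) - grand_mean) ** 2
        for vals in groups.values()
    )
    ss_within = sum(
        (v - sum(vals) / len(vals)) ** 2
        for vals in groups.values()
        for v in vals
    )

    ms_between = ss_between / (n_persons - 1)
    ms_within = ss_within / (n_total - n_persons)

    # Effective number of observations per person for unbalanced data.
    sum_sq_sizes = sum(len(vals) ** 2 for vals in groups.values())
    k0 = (n_total - sum_sq_sizes / n_total) / (n_persons - 1)

    # Negative variance estimates are truncated at zero.
    var_between = max((ms_between - ms_within) / k0, 0.0)
    total = var_between + ms_within
    return var_between / total if total > 0 else float("nan")


# Hypothetical toy data: two people, three daily scores each.
ids = ["p1", "p1", "p1", "p2", "p2", "p2"]
daily_scores = [50, 60, 55, 80, 85, 90]
icc = icc_oneway(ids, daily_scores)
within_share = 1 - icc  # share of variance attributed to daily fluctuation
```

Under this reading, an item with an ICC below 0.50 would have a `within_share` above 0.50, that is, more occasion-level than person-level variation.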

To evaluate our second research question, confirmatory factor analyses (CFA) were run based on the published eight-factor (eight subscales among the 36 items) and two-factor (two summary component scores among the 36 items) SF-36 structure [33]. Single-level CFA models were run for the standard survey administered at baseline and multilevel CFAs were run for the modified survey administered daily to evaluate both the within-person and the between-person factor structure. In multilevel factor analysis, the within-person factor structure reflects common covariance among the items on each specific day, pooled across days and individuals. The between-person factor structure reflects common covariance in individual mean levels of each item aggregated across time [34].
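
The two levels described above can be approximated descriptively by person-mean centering. The sketch below is only an illustration under stated assumptions: the function name `split_within_between` and the toy data are hypothetical, and a multilevel CFA estimates both levels simultaneously within the model rather than through this manual split (in particular, the covariance of observed person means still contains some within-person sampling error).

```python
from collections import defaultdict


def _mean_vector(rows):
    """Column-wise means of a list of equal-length rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def _cross_products(rows):
    """Sum of outer products of the given row vectors."""
    p = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]


def split_within_between(person_ids, item_rows):
    """Return (within_cov, between_cov) for daily item responses.

    person_ids : one person label per daily record
    item_rows  : one list of item scores per daily record
    """
    by_person = defaultdict(list)
    for pid, row in zip(person_ids, item_rows):
        by_person[pid].append(list(row))

    person_means = {pid: _mean_vector(rows) for pid, rows in by_person.items()}

    # Within level: daily deviations from each person's own mean, pooled
    # across days and individuals.
    deviations = [
        [value - mean for value, mean in zip(row, person_means[pid])]
        for pid, rows in by_person.items()
        for row in rows
    ]
    df_within = len(item_rows) - len(by_person)
    within_cov = [[s / df_within for s in line] for line in _cross_products(deviations)]

    # Between level: person means centered on the mean of the person means.
    means = list(person_means.values())
    centre = _mean_vector(means)
    centred = [[m - c for m, c in zip(row, centre)] for row in means]
    df_between = len(means) - 1
    between_cov = [[s / df_between for s in line] for line in _cross_products(centred)]
    return within_cov, between_cov


# Hypothetical toy data: two people, two days each, two items.
within_cov, between_cov = split_within_between(
    [1, 1, 2, 2],
    [[1, 2], [3, 4], [10, 20], [12, 22]],
)
```

A factor model fitted to `within_cov` would describe how items move together from day to day, whereas one fitted to `between_cov` would describe how people's average levels differ.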

Goodness of fit and sources of model misfit were examined with the comparative fit index (CFI), Tucker-Lewis index (TLI), root-mean-square error of approximation (RMSEA) and standardized root-mean-square residual (SRMR). Ideal values range from .90 (acceptable) to greater than .95 (good) for CFI and TLI; and from .08 (plausible/acceptable) to less than .05 (good) for RMSEA and SRMR [35]. While the CFI, TLI and RMSEA are indicators of overall fit, the SRMR provides separate fit indices for both the within- and between-person levels. All models were estimated using Mplus Version 7 [36] with maximum likelihood for robust estimates (MLR).
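
The cutoffs listed above can be expressed as a small helper. This is a minimal sketch: the function name `classify_fit`, the handling of values that fall exactly on a boundary, and the example values are assumptions for illustration and are not results from the study.

```python
def classify_fit(cfi, tli, rmsea, srmr):
    """Label each fit index as 'good', 'acceptable' or 'poor'.

    Cutoffs follow the text: CFI/TLI of .90 (acceptable) to above .95
    (good); RMSEA/SRMR of .08 (acceptable) to below .05 (good).
    """

    def higher_is_better(value):
        if value > 0.95:
            return "good"
        return "acceptable" if value >= 0.90 else "poor"

    def lower_is_better(value):
        if value < 0.05:
            return "good"
        return "acceptable" if value <= 0.08 else "poor"

    return {
        "CFI": higher_is_better(cfi),
        "TLI": higher_is_better(tli),
        "RMSEA": lower_is_better(rmsea),
        "SRMR": lower_is_better(srmr),
    }


# Hypothetical values, not taken from the study's tables.
labels = classify_fit(cfi=0.96, tli=0.91, rmsea=0.04, srmr=0.10)
```

In a multilevel model the SRMR check would be applied twice, once to the within-person value and once to the between-person value.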


Detection of within-person variability by the daily SF-36

Means, standard deviations and ICC values for the 30 items and eight subscales of the daily SF-36 are presented in Table 2. At the subscale level, we observed a wide range of ICC values, an indication that some subscales captured more daily fluctuations than others. The Emotional Role Limitations subscale, with an ICC of .38, showed the largest proportion of within-person variability (62 %); that is, response patterns to this subscale were more closely aligned with occasion-specific fluctuations than with stable differences between individuals. On the other hand, the Physical Functioning subscale, with an ICC of .89, showed only 11 % within-person variability; the majority of the variance in responses to this subscale was due to between-person differences. As with the subscales, item-level ICC values varied widely, from .09 for yesterday (General Health; 91 % within-person variability) to .84 for walk mile (Physical Functioning; 16 % within-person variability). Items within the same subscale generally exhibited similar proportions of within-person variation with only a few exceptions (see the Physical Functioning and General Health subscales). Figure 1 provides an illustration of the extent of dynamic, within-person variation from day to day on each SF-36 subscale (variation around the person-mean), as well as the degree of stable, between-person differences (variation around the sample mean).

Table 2 Means, standard deviations, intraclass correlation coefficients and reliability estimates (ω) for the baseline (standard) and daily administrations of the SF-36
Fig. 1 Panel plot illustrating between- and within-person variability across subscales on the daily SF-36. Thin lines indicate raw scores across sessions for three randomly selected participants. Thick lines indicate person-mean and sample-mean (black) scores. Note. PF = physical functioning; RP = role physical; RE = role emotional; VT = vitality; MH = mental health; SF = social functioning; BP = bodily pain; GH = general health

Within- and between-person reliability estimates were computed for each subscale from the application of the multilevel omega (ω) to both levels of the multilevel CFA models [37]. Reliability estimates could not be computed for three subscales (i.e., Social Functioning, Bodily Pain and General Health) due to an insufficient number of items per subscale. Within-person reliability estimates ranged from .60 (Mental Health) to .74 (Vitality). Between-person reliability ranged from .90 (Mental Health) to .96 (Physical Functioning) and was consistently higher for the daily SF-36 than for the standard SF-36, which ranged from .76 (Mental Health) to .92 (Physical Functioning).
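
For readers unfamiliar with omega, the sketch below shows the standard composite-reliability form of the coefficient, which in a multilevel setting would be applied separately to the within-person and between-person loadings and residual variances. It assumes a single factor per subscale with factor variance fixed to 1 and uncorrelated residuals; the function name `omega` and the loadings shown are hypothetical and are not estimates from the study.

```python
def omega(loadings, residual_variances):
    """Omega = (sum of loadings)^2 / [(sum of loadings)^2 + sum of residual variances].

    Assumes one factor with variance fixed to 1 and uncorrelated residuals.
    """
    if len(loadings) != len(residual_variances):
        raise ValueError("need one residual variance per loading")
    true_part = sum(loadings) ** 2
    return true_part / (true_part + sum(residual_variances))


# Hypothetical unstandardized estimates for one subscale at each level.
omega_within = omega([0.6, 0.5, 0.7, 0.6], [0.8, 0.9, 0.7, 0.8])
omega_between = omega([1.2, 1.1, 1.3, 1.2], [0.3, 0.4, 0.2, 0.3])
```

With these made-up inputs the between-person value exceeds the within-person value, which mirrors the direction of the pattern reported above but says nothing about its size.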

Within- and between-person factor structure of the standard and daily SF-36

Factor loadings and goodness of fit indices for the correlated eight-factor and two-factor models are presented in Tables 3 and 4, respectively.

Table 3 Standardized factor loadings and goodness of fit indices from multilevel confirmatory factor analyses of the baseline (standard) and daily administrations of the SF-36 (correlated 8-factor model)
Table 4 Standardized factor loadings and goodness of fit indices from multilevel confirmatory factor analyses of the baseline (standard) and daily administrations of the SF-36 (correlated 2-factor model)

The eight-factor model fit to the standard SF-36 administered at baseline was not optimal as per Hu and Bentler’s criteria [35]. All items loaded onto their respective subscale factor with the exception of one item under General Health (last year). Significant, moderate-to-high correlations were observed between all eight factors (range r = .41 to .80, ps < .001). Model fit was not substantially different for the modified SF-36 administered daily; all items loaded significantly onto their respective subscale factor with the exception of one item at the between-person level (yesterday under General Health) and three items at the within-person level (vigorous under Physical Functioning; calm and happy under Mental Health). All factors were significantly correlated at the between-person level (range r = .34 to .84, ps < .01). Within-person factor correlations were smaller (range r = .29 to .85, ps < .01) and not found between the Mental Health factor and others, or between the Physical Functioning and Physical or Emotional Role Limitations factors (see Table 5).

Table 5 Between-person and within-person correlation coefficients between subscales of the eight-factor solution for the baseline (standard) and daily administrations of the SF-36

For both versions of the SF-36, overall model fit as assessed by the CFI, TLI and RMSEA was better in the eight-factor model than in the two-factor model representing physical and mental summary components. All items in the standard SF-36 loaded onto their respective summary component, but the within-person factor loadings of several physical summary scale items on the daily SF-36 were non-significant. These included all of the Physical Role Limitations items and one item each under Physical Functioning, Bodily Pain and General Health. At the between-person level in the daily SF-36, only yesterday under General Health did not load onto the physical summary factor, while the Emotional Role Limitations items and two Mental Health items did not load onto the mental summary factor. Both summary factors were significantly correlated to a moderate-high degree in both the standard (r = .68, p < .001) and daily (within-person r = .61, p < .05; between-person r = .63, p < .05) SF-36 models. As per the hypothetical factor structure originally proposed by Ware, Kosinski and Keller [38], orthogonal models were evaluated for both solutions. Model fit across all indices was poor (results not shown here).


In recent years, the integration of PROMs into clinical practice to improve health outcomes and the patient experience [3–6] has been increasingly recognized as a worthwhile pursuit feasible through the use of electronic medical record systems [11–13]. Regular use may facilitate the identification of early changes that may herald more serious health problems in the future, and may provide opportunities for clinical intervention. The first step to achieving this goal and evaluating its impact is to investigate which measures are best able to detect short-term variation and systematic change (from an established baseline) at a within-person level. However, the majority of available PROMs were designed to detect between-person differences, a snapshot of one point in time, rather than monitor change. This hurdle may help to explain why the literature is largely missing repeated measures studies of PROs to address the first step. To facilitate this type of research and avoid the necessarily time-consuming complexities of the development and validation of a new survey measure, the present study investigates whether the widely-used and disease-generic SF-36 can serve as a repeated PROM; that is, whether it can reliably detect person-level variation without sacrificing measurement properties as determined by its factor structure.

Our first research question asked whether the SF-36 is sensitive enough for the detection of short-term within-person variation. This was answered through inspection of ICC values for each item and subscale after seven consecutive days of responses. We found a wide range of ICC values, indicating that some items captured a greater proportion of daily dynamics relative to stable, between-person differences, than others. Visual inspection of scatterplots for a random sample further illustrated varying degrees of within-person variation across days and subscales. In particular, the Emotional Role Limitations, Mental Health, Social Functioning and General Health subscales revealed the largest magnitude of within-person variation, suggesting that these components of health may be key indicators for PROM monitoring, perhaps because they are more likely to be impacted by daily events and activities. The presence of day-to-day variation highlights the need to utilize repeated measurements in order to disaggregate within-person variations from between-person differences. Failing to account for these within-person fluctuations in health outcomes assumes that they are stable and prevents us from understanding the impact that daily variations in health have on the individual.

Our second research question asked whether the psychometric properties of the SF-36 were maintained during “off-label” use as a repeated PROM. That is, do items continue to load onto their respective subscales, and subscales onto summary components, to comprise the same latent factors yielded by standard use of the survey? To evaluate this, we compared the factor structure of the standard survey administered at baseline to that of the daily survey. We found no substantial differences between them, indicating that summarizing item responses by subscales and summary components is appropriate to monitor person-level change. However, the fit indices of both versions were sub-optimal. To evaluate the sources of model misfit, we inspected the modification indices and noted that the primary sources were in the Vitality and Mental Health subscales, particularly at the within-person level, such that the positive items loaded together. This is in line with the structure of most measures of positive and negative affect such as Watson’s Positive and Negative Affect Schedule [34, 39]. We evaluated an alternative factor structure for both the baseline and daily SF-36, allowing positive and negative Vitality and Mental Health items to load onto separate factors, but found that it did not substantially improve overall model fit (results not shown).

Our findings on the sub-optimal model fit of the SF-36 are consistent with previous work which has taken issue with the factor structure and construct validity obtained by the recommended orthogonal scoring procedure [25–29] and the reduction to summary component measures [27, 40–44]. Thus, although the daily SF-36 exhibited similar psychometric properties to the standard survey, sub-optimal fit indices in both cases lead us to recommend caution in using the SF-36 in its entirety as a repeated PROM. However, while the overall multifactor model of the SF-36 exhibited sub-optimal fit indices, many of the subscales demonstrated acceptable to good reliability estimates when examined independently. Researchers may find utility in focusing on improving and expanding the specific subscales for use in certain contexts. Including additional items for the subscales that contained only two items and reconsidering the arrangement of the Mental Health and Vitality subscales into positive and negative affect subscales (e.g., [34, 39]) are two potentially fruitful avenues to explore.

A limitation of this study is the relatively healthy sample, which may explain why some items on the daily SF-36 (e.g., walk mile under Physical Functioning, magnitude under Bodily Pain) exhibited little within-person variation. Alternatively, this may be because some health-related factors are simply unlikely to show short-term change, or because some items are not sensitive enough to detect the occurrence of short-term changes. A second limitation is the extent to which sample heterogeneity may have contributed to the overall poor fit. There is some evidence that the SF-36 factor structure may differ among patient subgroups, particularly those with comorbidity (e.g., [42, 44, 45]); that is, some survey subscales may have disease-specific relationships with either summary score. However, this has only been found to be a concern for the two-factor structure of the survey so is unlikely to have substantially affected our results given our focus on the eight-factor structure.

This study extends prior research on PRO assessment to consider the utility of the SF-36 as a PRO measure for repeated administration. This was accomplished through evaluation of its factor structure at the within-person level. We found that the SF-36 modified for repeated administration has a similar factor structure to the standard version, indicating maintenance of measurement properties when used “off-label,” though model fit remained sub-optimal. However, many subscale reliabilities ranged from acceptable to good at both the within-person and between-person levels. Therefore, while we conclude that the SF-36 in its entirety may not be an adequate measure for repeated PRO assessment, we recommend future work to examine the utility of the subscales in specific contexts, as well as the within-person factor structure of other PROMs currently in use (e.g., [20–22]). This is an important first step in the measurement of daily PRO assessments in primary health care. Future research can build upon this work in moving toward the goals of regular in-clinic patient assessment, early detection of the cognitive, behavioral and physical changes that characterize potentially reversible conditions, and personalized interventions and health care. This may be more easily facilitated by the adaptation and integration of existing measures than the development of new surveys.




CFI: comparative fit index
ICC: intra-class correlation
PRO: patient-reported outcome
RMSEA: root-mean-square error of approximation
SF-36: RAND 36-Item Short Form Health Survey 1.0
SRMR: standardized root-mean-square residual


  1. Martin M, Hofer SM. Intraindividual variability, change, and aging: Conceptual and analytical issues. Gerontology. 2004;50:7–11.

  2. Kaplan RM, Stone AA. Bringing the laboratory and clinic to the community: Mobile technologies for health promotion and disease prevention. Annu Rev Psychol. 2013;64:471–98.

  3. Chen J, Ou L, Hollis SJ. A systematic review of the impact of routine collection of patient reported outcome measures on patients, providers and health organisations in an oncologic setting. BMC Health Serv Res. 2013;13:211.

  4. Detmar SB, Muller MJ, Schornagel JH, Wever LDV, Aaronson NK. Health-related quality-of-life assessments and patient-physician communication. JAMA. 2002;288:3027–35.

  5. Marshall S, Haywood K, Fitzpatrick R. Impact of patient-reported outcome measures on routine practice: A structured review. J Eval Clin Pract. 2006;12:559–68.

  6. Velikova G, Booth L, Smith AB, Brown PM, Lynch P, Brown JM, et al. Measuring quality of life in routine oncology practice improves communication and patient well-being: A randomized controlled trial. J Clin Oncol. 2004;22:714–24.

  7. Quinten C, Maringwa J, Gotay CC, Martinelli F, Coens C, Reeve BB, et al. Patient self-reports of symptoms and clinician ratings as predictors of overall cancer survival. J Natl Cancer Inst. 2011;103:1851–8.

  8. Gotay CC, Kawamoto CT, Bottomley A, Efficace F. The prognostic significance of patient-reported outcomes in cancer clinical trials. J Clin Oncol. 2008;26:1355–63.

  9. Singh JA, Nelson DB, Fink HA, Nichol KL. Health-related quality of life predicts future health care utilization and mortality in veterans with self-reported physician-diagnosed arthritis: The Veterans Arthritis Quality of Life Study. Semin Arthritis Rheum. 2005;34:755–65.

  10. Cunningham WE, Crystal S, Bozzette S, Hays RD. The association of health-related quality of life with survival among persons with HIV infection in the United States. J Gen Intern Med. 2005;20:21–7.

  11. Wu AW, Kharrazi H, Boulware LE, Snyder CF. Measure once, cut twice – adding patient-reported outcome measures to the electronic health record for comparative effectiveness research. J Clin Epidemiol. 2013;66:S12–20.

  12. Black N. Patient reported outcome measures could help transform healthcare. Brit Med J. 2013;346:f167.

  13. Snyder CF, Jensen RE, Geller G, Carducci MA, Wu AW. Relevant content for a patient-reported outcomes questionnaire for use in oncology clinical practice: Putting doctors and patients on the same page. Qual Life Res. 2010;19:1045–55.

  14. Basch E, Abernethy AP. Commentary: Encouraging clinicians to incorporate longitudinal patient-reported symptoms in routine clinical practice. J Oncol Pract. 2011;7:23–5.

  15. Bausewein C, Simon ST, Benalia H, Downing J, Mwangi-Powell FN, Daveson BA, et al. Implementing patient-reported outcome measures (PROMs) in palliative care: Users’ cry for help. Health Qual Life Outcomes. 2011;20:27–37.

  16. Chang C. Patient-reported outcomes measurement and management with innovative methodologies and technologies. Qual Life Res. 2007;16:157–66.

  17. Snyder CF, Aaronson NK, Choucair AK, Elliott TE, Greenhalgh J, Halyard MY, et al. Implementing patient-reported outcomes assessment in clinical practice: A review of the options and considerations. Qual Life Res. 2011;21:1305–14.

  18. Hilarius DL, Kloeg PH, Gundy CM, Aaronson NK. Use of health-related quality-of-life assessments in daily clinical oncology nursing practice: A community hospital-based intervention study. Cancer. 2008;113:628–37.


  19. Valderas JM, Kotzeva A, Espallargues M, Guyatt G, Ferrans CE, Halyard MY, et al. The impact of measuring patient-reported outcomes in clinical practice: A systematic review of the literature. Qual Life Res. 2008;17:179–93.


  20. McKenna SP. Measuring patient-reported outcomes: Moving beyond misplaced common sense to hard science. BMC Med. 2011;9:86.


  21. Cella D, Yount S, Rothrock N, Gershon R, Cook K, Reeve B, et al. The Patient-Reported Outcomes Measurement Information System (PROMIS): Progress of an NIH Roadmap Cooperative Group during its first two years. Med Care. 2007;45:S3–11.


  22. Webster K, Cella D, Yost K. The Functional Assessment of Chronic Illness Therapy (FACIT) Measurement System: Properties, application, and interpretations. Health Qual Life Outcomes. 2003;1:1–7.


  23. Hays RD, Sherbourne CD, Mazel RM. The RAND 36-Item Health Survey 1.0. Health Econ. 1993;2:217–27.


  24. Fries JF, Spitz P, Kraines RG, Holman HR. Measurement of patient outcomes in arthritis. Arthritis Rheum. 1980;23:137–45.


  25. Meadows K, Steen N, McColl E, Eccles M, Shiels C, Hewison J, et al. The Diabetes Health Profile (DHP): A new instrument for assessing the psychosocial profile of insulin requiring patients – Development and psychometric evaluation. Qual Life Res. 1996;5:242–54.


  26. de Vet HCW, Adèr HJ, Terwee CB, Pouwer F. Are factor analytical techniques used appropriately in the validation of health status questionnaires? A systematic review on the quality of factor analysis of the SF-36. Qual Life Res. 2005;14:1203–18.


  27. Taft C, Karlsson J, Sullivan M. Do SF-36 summary component scores accurately summarize subscale scores? Qual Life Res. 2001;10:395–404.


  28. Tucker G, Adams R, Wilson D. Observed agreement problems between sub-scales and summary components of the SF-36 Version 2 – An alternative scoring method can correct the problem. PLoS One. 2013;8:1–10.


  29. Reed PJ. Medical Outcomes Study Short Form 36: Testing and cross-validating a second-order factorial structure for health system employees. Health Serv Res. 1998;33:1361–80.


  30. Wolinsky FD, Stump TE. A measurement model of the MOS 36-Item Short Form Health Survey (SF-36) in a clinical sample of disadvantaged, older, Black and White men and women. Med Care. 1996;34:537–48.


  31. Nesselroade JR. The warp and the woof of the developmental fabric. In: Downs RM, Liben LS, Palermo DS, editors. Visions of aesthetics, the environment & development: the legacy of Joachim F Wohlwill. Hillsdale: Erlbaum; 1991. p. 213–40.


  32. Sliwinski MJ. Measurement-burst designs for social health research. Soc Personal Psychol Compass. 2008;2:245–61.


  33. Ware Jr JE, Gandeck B. Overview of the SF-36 Health Survey and International Quality of Life Assessment (IQOLA) Project. J Clin Epidemiol. 1998;51:903–12.


  34. Rush J, Hofer SM. Differences in within- and between-person factor structure and predictors of positive and negative affect: Analysis of two intensive measurement studies using multilevel SEM. Psychol Assess. 2014;26:462–73.


  35. Hu L, Bentler PM. Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Struct Equ Model. 1999;6:1–55.


  36. Muthén LK, Muthén BO. Mplus user’s guide. 7th ed. Los Angeles: Muthén & Muthén; 1998–2012.


  37. Geldhof GJ, Preacher KJ, Zyphur MJ. Reliability estimation in a multilevel confirmatory factor analysis framework. Psychol Methods. 2014;19:72–91.


  38. Ware Jr JE, Kosinski M, Keller SD. SF-36 Physical and Mental Health Summary Scales: A User’s Manual. Boston: The Health Institute; 1994.


  39. Watson D, Clark LA, Tellegen A. Development and validation of brief measures of positive and negative affect: The PANAS scales. J Pers Soc Psychol. 1988;54:1063–70.


  40. Wu C, Lee K, Yao G. Examining the hierarchical factor structure of the SF-36 Taiwan version by exploratory and confirmatory factor analysis. J Eval Clin Pract. 2007;13:889–900.


  41. Banks P, Martin CR, Petty RKH. The factor structure of the SF-36 in adults with progressive neuromuscular disorders. J Eval Clin Pract. 2012;18:32–6.


  42. Beals J, Welty TK, Mitchell CM, Rhoades DA, Yeh J, Henderson JA, et al. Different factor loadings for SF36: the strong heart study and the national survey of functional health status. J Clin Epidemiol. 2006;59:208–15.


  43. Farivar SS, Cunningham WE, Hays RD. Correlated physical and mental health summary scores for the SF-36 and SF-12 Health Survey, V.1. Health Qual Life Outcomes. 2007;5:54–8.


  44. Hann M, Reeves D. The SF-36 scales are not accurately summarised by independent physical and mental component scores. Qual Life Res. 2008;17:413–23.


  45. Banks P, Martin CR. The factor structure of the SF-36 in Parkinson’s disease. J Eval Clin Pract. 2009;15:460–3.




This study was supported, in part, by a grant from Island Health. Portions of this study were presented at the Family Medicine Forum (November 6, 2013, Vancouver, BC).

Author information



Corresponding authors

Correspondence to Amanda Kelly or Scott M. Hofer.

Additional information

Competing interests

The authors report no competing interests.

Authors’ contributions

All authors designed the research study and analytic approach. AK, ES and JR were involved in data collection. Data analysis was performed by AK, JR, PR and SMH. All authors contributed to writing the manuscript, provided critical comments and approved the final manuscript.

Rights and permissions

Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated.


About this article


Cite this article

Kelly, A., Rush, J., Shafonsky, E. et al. Detecting short-term change and variation in health-related quality of life: within- and between-person factor structure of the SF-36 health survey. Health Qual Life Outcomes 13, 199 (2015).
