This is the first study to examine the measurement properties of the generic SF-36v2 in Dutch patients with RA. The scaling assumptions of the SF-36v2 were generally supported and the questionnaire demonstrated internal reliability and internal construct validity similar to those found in the general US population. The individual scales and components demonstrated the expected pattern of associations with patient-reported and clinical outcome measures and were able to discriminate well between patients with low and moderate to high levels of disease activity. The physical scales in particular were adequately responsive to changes in disease activity. Overall, the findings suggest that the SF-36v2 is a psychometrically robust measure of HRQOL in Dutch patients with RA.
Excellent scaling success rates were found for four of the SF-36 scales (RP, BP, RE, and MH), which corresponds with findings from the original SF-36 version in the general Dutch population and in chronic disease populations . All items of the SF-36v2 passed the test for item internal consistency, except for item 9a (Did you feel full of life?). This item correlated too weakly with the other vitality items and slightly more strongly with the mental health scale. Although this finding is not too surprising given the item phrasing, it has not been reported in previous studies. Because the overall internal consistency of the vitality scale was acceptable, however, this deviation did not substantially affect the performance of the scale. The finding that the overall general health item (item 1) also correlated substantially with several other scales corresponds with previous studies in specific patient samples [9, 32, 33], which also showed the lowest percentage of scaling successes for item discriminant validity of the GH scale in these populations. Despite these deviations, all eight scales met the internal reliability standards required for comparing groups of patients, and the physical functioning, role-physical, and role-emotional scales appear to be suitable for monitoring individuals.
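As an illustration of the item-level tests discussed above, the following is a minimal sketch of how item internal consistency and item discriminant validity could be checked for a single item. It is not the analysis code used in this study. The function names, the 0.40 cut-off (a commonly used criterion for item internal consistency), and the simple "higher than" comparison are assumptions; formal scaling analyses typically require the own-scale correlation to exceed the competing correlation by a margin of about two standard errors.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def scaling_check(item, own_scale_items, other_scale_items, threshold=0.40):
    """Check one item against its own scale and one competing scale.

    item              -- responses to the item under test
    own_scale_items   -- responses to the OTHER items of the same scale
                         (the item itself is excluded, i.e. the correlation
                         is corrected for overlap)
    other_scale_items -- responses to the items of a competing scale
    threshold         -- assumed cut-off for item internal consistency
    """
    own_total = [sum(values) for values in zip(*own_scale_items)]
    other_total = [sum(values) for values in zip(*other_scale_items)]
    r_own = pearson(item, own_total)
    r_other = pearson(item, other_total)
    return {
        "r_own": r_own,
        "r_other": r_other,
        "internally_consistent": r_own >= threshold,
        # Simplified criterion: own-scale correlation is merely higher.
        "scaling_success": r_own > r_other,
    }


# Hypothetical responses from five patients (illustrative only).
result = scaling_check(
    item=[1, 2, 3, 4, 5],
    own_scale_items=[[1, 2, 3, 4, 5], [2, 2, 3, 5, 5]],
    other_scale_items=[[3, 1, 4, 2, 3], [2, 3, 1, 3, 2]],
)
```

In this framing, an item such as 9a would be flagged when `r_own` falls below the threshold or when `r_other` is at least as large as `r_own`.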
In general, the observed high percentage of scaling successes lends strong support to the hypothesized scale structure of the SF-36v2 in patients with RA. The internal construct validity was further supported by the scales’ correlations with the physical and mental components of health. Principal component analysis supported the existence of the two hypothesized dimensions underlying the SF-36v2. Together the two dimensions accounted for a substantial proportion (73.45%) of the reliable variance in the eight scale scores. The correlations of the scales with their principal components were as expected and were fairly similar to the hypothesized measurement model of the SF-36v2 in the general population [16, 17] and those found for the original SF-36 in previous studies in patients with RA [7, 9].
The vitality scale, however, correlated equally strongly with both components, whereas it correlated most strongly with mental health in the general population. Apparently, vitality is closely related to the other physical problems associated with RA, such as pain and physical functioning, a finding that is supported by the recent attention focused on the issue of increased fatigue in RA [34–36]. Similar problems with the vitality scale have also been observed in patients with severe functional somatic syndromes  and in people with ischemic stroke . Other studies have also challenged the assumption that the way in which the eight scales relate to the physical and mental components is uniform across both diseased and healthy individuals. Findings from these studies generally suggest that the vitality scale in particular may relate to physical and mental health differently, depending upon whether a patient’s main condition is a physical or mental illness . The finding that all other scales were associated with the two dimensions as expected and the high percentage of scaling successes for all scales, however, does support the legitimacy of generating scores for the eight scales and two summary measures using the standard algorithms. Moreover, using the standard US-based scoring algorithm, the PCS and MCS were negligibly correlated (r = 0.16), further supporting the orthogonal nature of the US-based component summary scores.
One of the aims of the developers of the SF-36v2 was to increase the internal reliability and to reduce the floor and ceiling effects that have been reported in the literature for the role-emotional and role-physical scales by increasing the number of response options for these scales from two to five . The findings in this study suggest that these scales are indeed more reliable than in the previous version [16, 32] and that especially their floor effects have been strongly reduced. Both role scales and the social functioning scale still demonstrated substantial ceiling effects, although these were much smaller than those observed in the general population [17, 18]. These improvements are likely to have increased the ability of the SF-36v2 scale to discriminate between groups and to detect changes over time as compared with the original version.
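For readers unfamiliar with how floor and ceiling effects are quantified, a minimal sketch follows. It assumes scale scores on the conventional 0 to 100 metric and defines the floor and ceiling effects as the percentage of respondents with the lowest and highest possible score, respectively. The function name and the example scores are hypothetical and do not reproduce the study data.

```python
def floor_ceiling(scores, minimum=0.0, maximum=100.0):
    """Return (floor %, ceiling %) for a list of scale scores.

    The floor effect is the percentage of respondents at the lowest
    possible score; the ceiling effect is the percentage at the highest.
    """
    n = len(scores)
    floor_pct = 100.0 * sum(1 for s in scores if s == minimum) / n
    ceiling_pct = 100.0 * sum(1 for s in scores if s == maximum) / n
    return floor_pct, ceiling_pct


# Hypothetical role-physical scores for eight patients (illustrative only).
example_scores = [0, 25, 50, 100, 100, 75, 100, 50]
floor_pct, ceiling_pct = floor_ceiling(example_scores)
```

With more response options per item, fewer respondents land exactly on the extreme scores, which is the mechanism by which the five-level response format is expected to lower both percentages.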
The SF-36v2 demonstrated excellent convergent/discriminant and known-groups validity. The DREAM registry data allowed for a direct comparison of SF-36v2 scores with simultaneously collected self-reported and clinical core disease activity parameters . The different scales of the SF-36v2 correlated as expected with the core measures of disease activity. All scales were additionally able to distinguish between patients with low disease activity and those with moderate to high disease activity as measured with the DAS-28. The DAS-28 is currently the standard-of-care measure of disease activity in RA  and the best determinant of the physician’s clinical judgment of response to treatment . As expected, the physical scales, including bodily pain, were most discriminative. However, the physical functioning scale did not perform as well as the HAQ-DI, which over the years has become the standard measure of self-reported disability in many rheumatic conditions . The HAQ-DI was still about 53% more effective in distinguishing between known groups, a finding similar to the one recently observed in patients with gout .
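The known-groups comparison above can be illustrated with a short sketch. It assumes, as is common in this literature but not confirmed by the text of this section, that discriminative ability is summarized as the ratio of one-way ANOVA F-statistics between two measures (relative efficiency). All names and numbers below are hypothetical.

```python
def f_statistic(group_a, group_b):
    """One-way ANOVA F-statistic for two independent groups of scores."""
    groups = [group_a, group_b]
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def relative_efficiency(f_measure, f_reference):
    """Ratio of F-statistics; values above 1 favor the first measure."""
    return f_measure / f_reference


# Hypothetical scores in low vs. moderate-to-high disease activity groups.
f_example = f_statistic([1, 2, 3], [4, 5, 6])

# Under the F-ratio assumption, a ratio of 1.53 would read as
# "about 53% more effective" than the reference measure.
re_example = relative_efficiency(15.3, 10.0)
```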
The finding that the SF-36v2 was able not only to discriminate well between patients with low and moderate to high disease activity, but also to detect improvements over the first six months of treatment, suggests that it can be useful for both discriminative and evaluative purposes  in patients with RA. The generic nature of the SF-36v2 additionally offers the opportunity of comparing the HRQOL of RA patients with that of patients with other rheumatic and non-rheumatic conditions and with general population norms. These advantages, however, come with a potential loss of responsiveness and relevance to specific patient groups [2, 11, 12]. Nevertheless, in accordance with previous studies examining the original SF-36 in RA [10, 13–15], this study showed that the generic nature of the SF-36v2 did not result in substantially reduced responsiveness. The physical and bodily pain scales in particular showed moderate or large improvements in those patients achieving low disease activity. Moreover, although the disease-specific HAQ-DI had better known-groups validity than the physical functioning scale, it was only slightly and non-significantly more responsive to improvements over time.
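To make the notion of responsiveness concrete, the sketch below computes a standardized response mean (SRM), that is, the mean change score divided by the standard deviation of the change scores. Whether the SRM or another effect size index was used in this study is not stated in this section, so the choice of index, the function name, and the data are assumptions. By a widely used convention, values around 0.2, 0.5, and 0.8 are read as small, moderate, and large.

```python
from statistics import mean, stdev


def standardized_response_mean(baseline, follow_up):
    """Mean change divided by the sample SD of the change scores."""
    changes = [after - before for before, after in zip(baseline, follow_up)]
    return mean(changes) / stdev(changes)


# Hypothetical bodily pain scores at baseline and after six months.
baseline_scores = [40, 45, 50, 55]
follow_up_scores = [50, 50, 60, 70]
srm = standardized_response_mean(baseline_scores, follow_up_scores)
```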
The finding that the SF-36v2 meets psychometric criteria does not necessarily mean that the questionnaire covers all issues specifically relevant to patients with RA. For instance, the physical functioning scale of the SF-36v2 mainly covers functions related to mobility and other activities requiring the use of the lower extremities, whereas finger function is not captured at all and arm function only by three items related to daily activities . A recent review further indicated that the scale has limited content validity as it has no content relevant to the assessment of domestic life . Therefore, for a thorough and comprehensive assessment of health-related quality of life the common recommendation to use both disease-specific and generic measures if possible [15, 46] still holds.
In this light, recent initiatives to integrate and cross-calibrate generic and disease-specific measures of health-related quality of life using applications of item response theory and computerized adaptive testing are particularly interesting. Based on existing questionnaires, the NIH Patient-Reported Outcomes Measurement Information System (PROMIS) project has developed large calibrated item banks that can be used to measure key symptoms and health concepts across a wide variety of chronic diseases and in the general population . This blended approach is likely to overcome the limitations of the current generation of disease-specific and generic questionnaires and may allow for more relevant, precise, and efficient assessment of health status and comparability of experiences across diseases.
Finally, it should be noted that in the current study several comparisons were made with normative data from the US general population as no Dutch norms are currently available for version 2 of the SF-36. The US norms, however, are not necessarily generalizable to other countries or cultures. Some studies comparing the US norms with those of other countries have suggested that although the magnitude of differences is generally small, for some scales they are close to or just above the difference of 5 points considered to be clinically meaningful [48–50].
In conclusion, the SF-36v2 demonstrated adequate psychometric properties in patients with RA. Using the SF-36v2 along with disease-specific measures will allow the identification of HRQOL issues and changes in HRQOL that are important to patients and will facilitate comparisons across different disease states.