Validity of SF-12 summary scores in a Greek general population

Background The 12-item Health Survey (SF-12) was developed as a shorter alternative to the SF-36 for use in large-scale studies, particularly when overall physical and mental health are the outcomes of interest instead of the typical eight-scale profile. The main purpose of this study was to assess the validity of the Greek version of the SF-12.
Methods A stratified representative sample (N = 1005) of the Greek general population was interviewed. The survey included the SF-36, the EQ-5D and questions on socio-demographic and health-related characteristics. SF-12 summary scores were derived using the standard US algorithm. Factor analysis was used to confirm the hypothesized component structure of the SF-12 items. Construct validity was investigated with "known groups" validity testing and via convergent and divergent validity, which in turn were assessed by the correlations with the EQ-5D dimensions. Concurrent validity was assessed by comparisons with SF-36 summary scores.
Results SF-12 summary scores distinguished well, and in the expected manner, between groups of respondents on the basis of gender, age, education, socio-economic status, self-reported health problems and health services utilization, thus providing evidence of construct validity. Effect size differences between SF-36 and SF-12 summary scores were generally small (<0.2), supporting concurrent (criterion) validity. Significantly lower mean PCS-12 and MCS-12 scores were observed for respondents reporting chronic conditions compared to those without (P < 0.001). Convergent and divergent validity were supported by expected relationships with the EQ-5D. Reporting a problem in an EQ dimension was associated with lower SF-12 summary scores, supporting concurrent validity. Sensitivity of the Greek SF-12 and replication of the original measurement and conceptual model were demonstrated.
Conclusion The results provide evidence on the validity of the Greek SF-12 and, in conjunction with future studies addressing test-retest reliability and responsiveness, support its use in Greek health status studies as a brief, yet valid, alternative to the SF-36.


Background
In medical research and evaluation, there is an increasing interest in instruments used to measure health-related quality of life (HRQOL) in general population surveys, as well as across a variety of diseases and conditions. HRQOL is a multidimensional concept that includes physical, psychological and social domains of health and is generally accepted as an important outcome measure of health care [1]. The two main approaches to measuring HRQOL are generic and disease-specific instruments, and the majority of experts recommend the use of both concurrently [2]. Regarding the generic instruments, the Short Form Health Survey (SF-36) is probably the one that is most widely used [3].
The SF-36 includes eight dimensions: physical functioning (PF), role physical (RP), bodily pain (BP), general health (GH), vitality (VT), social functioning (SF), role emotional (RE), and mental health (MH) [4]. Each dimension is scored on a 0-100 scale with 0 and 100 corresponding to worst and best HRQOL respectively [5], and the eight dimensions can be summarized in two summary scores of physical and mental health [6], hereinafter referred to as PCS-36 and MCS-36.
The 12-item Health Survey (SF-12) was developed as a shorter alternative to the SF-36 for use in large-scale studies, and its reliability and validity have been documented [7]. Scale scores are estimated for four of the health concepts (PF, RP, RE and MH) using two items each, whereas the remaining four (BP, GH, VT and SF) are represented by a single item. All 12 items are used to calculate the physical and mental component summary scores (PCS-12 and MCS-12) by applying a scoring algorithm empirically derived from the data of a US general population survey [8]. Performance of the component summary scores was initially studied in nine languages and it has been recommended that the US-derived summary scores, which yield a mean of 50 and a SD of 10, be used in order to facilitate cross-cultural comparison of results [9].
The SF-12 has been extensively used in health status studies involving the general population [10][11][12], as well as in studies with disease groups [13][14][15][16]. As for the SF-36, it has been translated into Greek and its reliability and validity were established in a sample of 1007 adults living in the greater Athens area. It was found to have high internal consistency reliability and convergent and discriminative validity, and to be able to distinguish between groups of respondents in the expected manner (known-groups validity) on the basis of gender, age and socio-economic status [17]. Using the same sample, the eight-scale structure of the Greek version of the instrument has been confirmed as well [18].
Recently, the SF-12 (embedded within the SF-36) was administered to a nationally representative sample in a large-scale study aiming to assess the health of the Greek population. The aim of the present study was to examine the psychometric properties of the SF-12 summary scores in terms of the measurement and conceptual model, sensitivity, "known groups" construct validity, convergent and divergent validity and concurrent (criterion) validity and, hence, to increase confidence in using the SF-12 in Greek studies as an alternative to the more time-demanding SF-36.

Sample and data collection
The study was conducted in September 2006 and involved a sample (>18 years old) residing in urban (>2,000 inhabitants) and rural (<2,000 inhabitants) areas of the country and in each of the 13 geographical regions. According to the latest Population Census (2001), the survey population consisted of 8,880,924 individuals. Non-fluent Greek speakers, institutionalized subjects and those incapable of reasoning and decision-making on their own were excluded. Participants were grouped, proportionally to the Greek population, by socio-demographic characteristics according to a three-staged sampling methodology. In the first stage, a random sample of building blocks was selected proportionally to size. In the second, households were randomly selected by systematic sampling. In the third stage an eligible participant was selected by simple random sampling in each household. In total 1,005 willing subjects, out of 1,388 initially approached (response rate 72.4%), agreed to participate and were interviewed face-to-face by trained interviewers. The Research Committee of the Hellenic Open University ethically approved the study and all subjects provided informed consent.

Survey
Two HRQOL instruments, the SF-36 and the EuroQol EQ-5D, were included in the study. The latter is a two-part, preference-based HRQOL measure, developed by a multidisciplinary transnational consortium of investigators [19]. It addresses five domains: mobility, self-care, usual activities, pain/discomfort and anxiety/depression, with each divided into three severity levels. The second part consists of a vertical 0-100 visual analogue scale (VAS). The EQ-5D has been translated into most major languages, including Greek, and initial evidence on its applicability and adaptability to the Greek environment has been provided [20]. Currently, a large-scale general population study is in progress aiming to demonstrate the construct validity of the Greek version of the instrument. Subjects reported information on socio-demographic variables such as gender, age, marital status, education and employment, with the latter two serving as proxy-estimators of socio-economic status, as information on income could not be reliably collected. Data were also collected regarding various clinical conditions, self-reports of which are known to be reliable [21,22]. Utilization of health services, such as past-month physician consultations and past-year hospital admissions, was also recorded, as it has been shown to be associated with HRQOL [23,24].

Psychometric properties
The sensitivity of the SF-12 measurement model was evaluated by examining: i) response distributions for each item in order to ensure that the full range of possible responses is used and ii) summary floor and ceiling effects to assess the ability of the items to capture the full range of health states. To ensure that the original conceptual model is satisfactorily replicated, principal components factor analysis with varimax rotation was used, a procedure previously performed in similar studies [13,25]. It was hypothesized that two factors would be obtained. In addition, items originally belonging to the PF, RP, BP and GH domains were hypothesized to load higher on the physical health factor, whereas the VT, SF, RE and MH items were hypothesized to load higher on the mental health factor. However, it has been suggested that VT and SF crossload on both physical and mental components [8] and a crossloading of ≥0.40 is considered to be meaningful [26].
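The floor and ceiling check described above can be illustrated with a minimal sketch. It assumes that a floor or ceiling effect is summarized as the share of respondents at the lowest or highest possible value; the function name `floor_ceiling` and the example responses are hypothetical and are not taken from the study data.

```python
def floor_ceiling(scores, minimum, maximum):
    """Return the share of respondents at the lowest and highest possible score."""
    n = len(scores)
    if n == 0:
        raise ValueError("scores must not be empty")
    # Proportion of respondents at each extreme of the measurement scale.
    floor = sum(1 for s in scores if s == minimum) / n
    ceiling = sum(1 for s in scores if s == maximum) / n
    return floor, ceiling


# Hypothetical responses to one item scored 1-5 (illustrative only).
responses = [5, 4, 5, 3, 1, 5, 2, 4, 5, 4]
floor, ceiling = floor_ceiling(responses, minimum=1, maximum=5)
print(f"floor: {floor:.0%}, ceiling: {ceiling:.0%}")
```

The same function could be applied to each item and to each summary score; what share counts as a meaningful floor or ceiling effect is a judgment that this sketch does not make.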
Furthermore, the correlation between physical component items and the PCS score should be higher than with the MCS score and vice versa, i.e. the correlation between mental component items and the MCS score should be higher than with the PCS score [7]. These relationships were examined for the SF-12 items. The proportion of the total variance of PCS-36 and MCS-36 scores explained by PCS-12 and MCS-12 scores respectively was used to assess content validity and the expected standard was ≥90% [8]. This was further evaluated by Pearson's correlation coefficients between SF-12 and SF-36 PCS and MCS scores, and the expected standard was ≥0.9 [8,9]. "Known groups" construct validity was assessed by examining hypothesized relationships between sociodemographic and health-related variables and SF-12 component scores. Specifically, it was expected that females, older subjects, widowed or divorced persons, those with less education and the unemployed would report poorer health [10,11]. It was also expected that those reporting greater use of health services and/or existing clinical conditions would have a lower HRQOL as well [21][22][23][24]. Effect size differences between corresponding SF-12 and SF-36 PCS and MCS scores were used to determine if the SF-12 gave similar results to the SF-36 (criterion validity). The effect size difference between SF-36 and SF-12 scores was calculated by dividing their difference by the standard deviation of the SF-36 summary score. To assess the relative magnitude of change, it has been suggested that an effect size of 0.2 is regarded as small, 0.50 as moderate and 0.80 as large [27].
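As a minimal sketch of the effect size calculation described above, the difference between the mean SF-36 and SF-12 summary scores is divided by the standard deviation of the SF-36 score. The function name `effect_size_difference`, the use of the sample (n - 1) standard deviation and the example values are assumptions made for illustration; they are not study data.

```python
from statistics import mean, stdev


def effect_size_difference(sf36_scores, sf12_scores):
    """(mean SF-36 - mean SF-12) divided by the SD of the SF-36 scores."""
    # Assumption: sample standard deviation (n - 1 denominator).
    return (mean(sf36_scores) - mean(sf12_scores)) / stdev(sf36_scores)


# Hypothetical summary scores for three respondents (illustrative only).
pcs36 = [40, 50, 60]
pcs12 = [41, 50, 62]
es = effect_size_difference(pcs36, pcs12)
# An absolute value below 0.2 falls under the "small" benchmark cited in the text.
print(round(es, 3), abs(es) < 0.2)
```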
The ability of the SF-12 to discriminate between different levels of health was determined by comparing mean summary scores for subjects reporting no problem, a moderate problem or a serious problem for a given EQ-5D dimension, and it was expected that scores would be higher in the first case [28]. Convergent and divergent validity of the SF-12 were examined via the relationships with the EQ-5D, and it was expected that comparable summary scores and dimensions, e.g. PCS-12 with mobility, self-care, usual activities and pain/discomfort and MCS-12 with anxiety/depression, would correlate better than less comparable dimensions. In contrast, the EQ VAS should correlate reasonably well with both SF-12 summary scores [29].

Statistical analysis
Data were analyzed using SPSS ver. 13.0 (SPSS Inc., Chicago IL). Summary scores, according to subgroups, were compared with the t-test and ANOVA. Linear regression was performed to determine the total variance of the PCS-36 and MCS-36 scores explained by the SF-12 items. Pearson's correlation coefficient (r) was used to measure the association between SF-12 and SF-36 PCS and MCS scores and between EQ dimensions and VAS with SF-12 summary scores. Correlations >0.50 were regarded as strong [30]. For all tests, statistical significance was assumed for P values <0.05.
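For illustration, Pearson's r can be computed as in the sketch below. The analysis itself was run in SPSS, so this is not the authors' code; the function name `pearson_r` and the example values are hypothetical. In a simple linear regression with a single predictor, the squared correlation equals the proportion of variance explained, which is how the two quantities in the text relate in that special case.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's product-moment correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical paired summary scores (illustrative only).
pcs36 = [35.0, 42.0, 50.0, 55.0, 58.0]
pcs12 = [36.0, 41.0, 51.0, 54.0, 59.0]
r = pearson_r(pcs36, pcs12)
print(f"r = {r:.3f}, r squared = {r * r:.3f}")
```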

Results
The SF-12 item and summary descriptive statistics are presented in Table 1. Four of the items were recoded so that higher scores correspond to better health. Responses were clustered at the upper end of the measurement scale, as could have been expected in a general population. Despite this, the full range of possible responses has been used satisfactorily, supporting the overall sensitivity of the measurement model. The PCS-12 and MCS-12 summary scores were negatively skewed since respondents scored towards the higher end of the health spectrum. However, no floor or ceiling effects were observed, implying that the SF-12 items captured the full range of health states.
The two-factor conceptual structure of the SF-12 was confirmed (Table 2). Principal components analysis, after varimax rotation, showed that PF, RP, BP and GH items loaded higher on the physical component, whereas RE and MH items loaded higher on the mental component. As expected, the VT and SF items loaded on both components; similar results have been reported elsewhere [7,13]. Correlations (Pearson's r) of individual items and the SF-12 summary scores are also shown in Table 2. Items comprising the PF, RP, BP and GH domains correlated higher with the PCS score, whereas the SF, RE and MH items correlated better with the MCS score. These results confirmed the hypothesized item-component correlations, with one exception, namely the VT item appearing to correlate slightly higher with the PCS score.
The PCS-12 and MCS-12 summary scores explained 93.2% and 86.9% of the total variance of the PCS-36 and MCS-36 summary scores respectively (expected standard ≥90%). The PCS-12 thus met the standard and the MCS-12 fell slightly short of it, which overall supports the content validity of the Greek SF-12. This was further supported by the correlations between SF-36 and SF-12 summary scores, which exceeded the expected 0.9 standard: r = 0.97 (P < 0.01) between PCS-36 and PCS-12 and r = 0.93 (P < 0.01) between MCS-36 and MCS-12. These high correlations are an indication of the validity of the SF-12 scores, with the SF-36 scores acting as criterion variables.
Table 2 note: Higher loadings of each item on a factor and higher correlations with an SF-12 component are indicated in bold. P < 0.01 for all correlations.
Significant differences were observed within both SF-12 component scores across the distributions of the demographic and health-related variables (Table 3). Men scored higher than women and both summary scores were negatively associated with age. The adopted proxy indicators of socio-economic status (education and employment) were positively related to HRQOL. Furthermore, being divorced/widowed, suffering from a clinical condition and higher health service utilization (physician consultations and hospital admissions) all correlated negatively with PCS-12 and MCS-12 summary scores. These differences were statistically significant (P < 0.01) and confirmed expected relationships in support of the construct validity of the instrument.
SF-12 summary scores were compared to the respective SF-36 components and the concordance between them was noteworthy. Scores were almost identical and, in any case, differences were never greater than two percentage points in any of the subgroups on either the PCS-12 or MCS-12 components, and such a difference would not be subjectively or clinically meaningful [3,5]. Effect size differences between SF-36 and SF-12 scores were generally small (<0.2), implying that the SF-12 gave similar results to the SF-36 and supporting the concurrent (criterion) validity of the Greek version of the instrument.
Significantly lower mean PCS-12 and MCS-12 scores were observed for respondents reporting specific health problems, compared to those without (Table 4). It should be noted that "sleeping disorders" refers to negative responses, from the subjects, to duration- and quality-of-sleep questions, and not to diagnosed insomnia, and this perhaps explains the high prevalence of this condition in the sample. Along with hypertension and obesity, these subgroups contained the largest number of positive respondents, i.e. people reporting the specific health problem, and this perhaps implies that the score differences observed in the other disease groups were sometimes non-significant due to the smaller number of people reporting those particular conditions. In any case, the results are indicative of the discriminative ability of the SF-12 since for every health problem, at least one summary score was significantly lower in the group of positive respondents.
Subjects indicating a moderate or severe problem on any of the EQ-5D dimensions had significantly lower (P < 0.001) mean SF-12 component scores compared to subjects reporting no problems, confirming the ability of the SF-12 to discriminate between different levels of health (Table 5). The MCS-12 summary scores appeared to differentiate better than the PCS-12 ones between the three levels in each EQ dimension, except for usual activities where mean scores were quite similar for those reporting moderate and severe problems. On the other hand, the PCS-12 summary scores discriminated better between respondents of the lowest and highest EQ-5D levels (approximately 20 percentage points or more). It should be noted that the number of severe problem reporters in the mobility, self-care and usual activities dimensions was small and could have affected these particular results.

Discussion
This study reports on the first ever examination of the psychometric properties of the Greek SF-12 and is expected to add to the growing list of languages and cultures for which the instrument has been evaluated. Initial evidence was provided on the construct and concurrent validity of the instrument, supported by self-reported data on sociodemographic and clinical characteristics. This implies that the SF-12 is potentially suitable for inclusion in large-scale health surveys in Greece and for cross-cultural quality of life comparisons, as a valid alternative to the SF-36.
The embedded form of the SF-12, i.e. as a subset of the SF-36, was used in the present study. It has been demonstrated that both the embedded and stand-alone versions are similar in terms of item ordering, that factor content and structure are equivalent [8] and that responses to the SF-12 items abstracted from the SF-36 are the same as those obtained from the SF-12 administered alone [31]. Perhaps the unembedded form would have been ideal for this study in terms of time saving; however, the use of the embedded form does not pose a threat to the validity of the results.
The two-factor structure of the instrument and the item-factor loadings were confirmed using principal components analysis, thus ensuring that the conceptual model of the original US version was satisfactorily replicated. Hypotheses regarding individual item-component correlations were tested and confirmed, except for the VT item, which appeared to correlate higher with the PCS-12 than with the MCS-12 score. In general, this was expected since the VT scale is a general measure and usually correlates with both components [6]. Furthermore, VT loaded highly on both summary components. In the cross-cultural context, this particular result has been observed in studies involving general as well as disease populations [7,13,32].
No floor or ceiling effects in the SF-12 scores were observed in this general population sample, indicating the ability of the instrument to capture a full range of health states. Correlations between SF-36 and SF-12 summary scores reached the expected 0.9 standard and the variability in the PCS-36 explained by the PCS-12 and in the MCS-36 explained by the MCS-12 was 93.2% and 86.9% respectively. The concordance between PCS-12 and PCS-36 and between MCS-12 and MCS-36 observed here is in agreement with results from general population studies in the US [7] and Europe [9] as well as with others involving patient populations [13,31,33].
The SF-12 summary scores were able to distinguish between groups of respondents in the expected manner (known-groups validity) on the basis of gender, age, socio-economic status, self-reported health problems and health services utilization (a proxy of HRQOL), providing evidence of its construct validity. The finding that MCS scores decreased with increasing age is not consistent with the majority of the literature that notes that MCS scores tend to improve with increasing age (as opposed to PCS scores which generally decline). A possible explanation for our finding is that 43% and 59% of the 55-64 and >65 age groups respectively reported suffering from multimorbidity, i.e. the co-occurrence of two or more chronic conditions [34], and specifically diabetes, hypertension and heart problems, all of which are clearly associated with impaired HRQOL in all domains [35,36]. In a recent Greek study involving elderly diabetic multimorbid patients, SF-36 subscales hypothetically correlating with the MCS (i.e. VT, SF, RE and MH), were significantly reduced [37]. In another SF-36 study involving a Greek general population, the MCS scores also appeared to decline with increasing age [24].
In a future study specifically aimed at measuring HRQOL, it would be interesting to examine the effect of each sociodemographic and health-related characteristic since, e.g. lower scores for divorced/widowed persons may be due to being older. The same applies for being retired. SF-12 summary scores were compared to SF-36 scores and were found to be very close, within two percentage points at most. These differences are small and unlikely to be of clinical relevance, since it has been suggested that a minimal threshold difference for the SF-36 is around five points [38]. These results, in conjunction with the small effect size differences between the SF-36 and SF-12 scores (<0.2), provide evidence to support the content and concurrent (criterion) validity of the Greek SF-12.
Health conditions, known to be reliable when self-reported, had an effect on SF-12 summary scores and, as expected, significantly lower mean PCS-12 and MCS-12 scores were recorded for respondents reporting diabetes, hypertension, heart problems, asthma, hip/knee problems, depression, sleeping disorders or obesity, compared to those without. Using the EQ-5D as a previously tested and accepted standard helped to further support validity. The SF-12 discriminated well between subjects reporting no problem, a moderate problem or a serious problem for a given EQ-5D dimension, since indicating a health problem resulted in significantly (P < 0.001) lower mean SF-12 scores. However, the number of respondents reporting severe problems in some EQ-5D dimensions was small, and this implies that these results should be dealt with cautiously. Finally, convergent and divergent validity of the SF-12 were confirmed by the relationships with the EQ-5D. Comparable summary scores and dimensions correlated higher than less comparable ones, whereas the EQ VAS correlated reasonably well with both SF-12 summary scores.

Conclusion
Based on the results from this study, the psychometric properties of the Greek SF-12 appear to be sound and suggest its potential for measuring health status in large-scale studies, particularly when overall physical and mental health are the outcomes of interest instead of the typical eight-scale profile. Its major advantage stems from its brevity, which reduces the burden on researchers and respondents. It appears to satisfactorily replicate SF-36 summary scores, making it an attractive generic instrument to use in clinical practice or research when studying HRQOL. In this particular study, cross-sectional construct validity and sensitivity of the Greek SF-12 have been reasonably well demonstrated. On the other hand, issues such as test-retest reliability, longitudinal construct validity and responsiveness have not been addressed and should be considered for future studies. This is particularly important as health status changes over time and the instrument must be able to detect these changes, particularly those of clinical importance.

Competing interests
The author(s) declare that they have no competing interests.

Authors' contributions
NK was responsible for analyzing and interpreting the data and drafting the manuscript. EP assisted in the statistical analysis. DN designed the study and revised the manuscript for intellectual content. YT was responsible for conception of the study. All authors have read and approved the final manuscript.