Psychometric evaluation of the SF-36 (v.2) questionnaire in a probability sample of Brazilian households: results of the survey Pesquisa Dimensões Sociais das Desigualdades (PDSD), Brazil, 2008

Background: In Brazil, despite the growing use of the SF-36 in different research settings, most psychometric evaluations of the translated questionnaire have come from studies with samples of patients. The purpose of this paper is to examine whether the Brazilian version of the SF-36 satisfies the scaling assumptions, reliability and validity required for valid interpretation of the SF-36 summated rating scales in the general population.
Methods: A total of 12,423 individuals (heads of households and their spouses) living in 8,048 households were selected from a stratified sample of all permanent households across the country and interviewed using the Brazilian SF-36 (version 2). Psychometric tests based on the IQOLA methodology were performed to evaluate the scaling assumptions.
Results: Data quality was satisfactory, with a questionnaire completion rate of 100%. The ordering of the item means within scales clustered as hypothesized. All item-scale correlations exceeded the suggested criteria for reliability, with a scaling success rate of 100% and low floor and ceiling effects. All scales reached the criteria for group comparison, and factor analysis identified two principal components that jointly accounted for 67.5% of the total variance. Role Emotional and Vitality were strongly correlated with the physical and mental components, respectively, while Social Functioning was moderately correlated with both components. The Role Physical and Mental Health scales were, respectively, the most valid measures of the physical and mental health components. In the comparisons between groups that differed by the presence or absence of depression, subjects who reported having the disease had lower mean scores in all scales, and the Mental Health scale discriminated best between the two groups. Among healthy respondents and those with one, two, or three or more chronic illnesses, the average scores were inversely related to the number of diseases. Bodily Pain, General Health and Vitality were the most discriminating scales between healthy and diseased groups. Higher scores were associated with male sex, age below 40 years and higher schooling.
Conclusions: The Brazilian version of the SF-36 performed well, and the findings suggest that it is a reliable and valid measure of health-related quality of life in the general population as well as a promising measure for research on health inequalities in Brazil.


Background
The use of standardised questionnaires with general health measures provides the opportunity to compare the health profiles of groups with different diagnoses, illness severities, or treatment regimens; to monitor transitions in health status over time [1]; to measure the burden of disease in populations with chronic and psychiatric diseases and in healthy populations; and to compare health outcomes across different health systems [2]. The standardised Short Form Health Survey 36 (SF-36) is one of the most common instruments used in health research, both in population-based surveys and in studies to evaluate health policies [3]. Its aim is to detect medically and socially relevant differences in health status and changes in health status over time using a small number of statistically efficient dimensions. For this purpose, a multi-item scale was developed that employed multidimensional health concepts used in comprehensive health surveys, including measures of well-being and self-evaluation of health status [4][5][6]. The items in the questionnaire were selected from the set of 149 items of the Functioning and Well-Being Profile, which covered 40 health concepts used in the Medical Outcomes Study (MOS), and were organised in a standard version, which has been available since 1990 [7]. The Short Form 36 (SF-36) consists of 36 questions: one of them measures health transitions over a one-year period and is not used in scale calculation, and the remaining questions are grouped into eight scales or domains. The eight scales can be aggregated into two independent summary measures: the physical component summary (PCS) and the mental component summary (MCS). Higher scores indicate better health.
The SF-36 has been translated into various languages and used in several countries to assess the health perceptions of both the general population and people affected by disease [4,7]. Even though its accuracy is 10% to 20% lower than that of the longer questionnaires used in the MOS, its completion time of 5-10 minutes, versatility of use (self-completion, personal or telephone interview with persons aged over 14 years), and levels of reliability and validity above the recommended minimum standards make it an attractive tool for use in combination with other questionnaires in population surveys. Study results show that the SF-36 meets the criteria for data quality and scaling assumptions: the two main components used in the scales, Physical (PCS) and Mental (MCS), explained 74% of the total variance. Experiences using the questionnaire and its reported shortcomings, such as cross-cultural non-equivalence, difficulties with some word meanings, floor and ceiling effects, poor performance of the two Role Function scales and the standard layout, were used as a basis for implementing changes in the second version (v.2) of the SF-36, in use since 1996 [8]. These changes included adjusting the layout horizontally, improving the wording of questions to make them less ambiguous, changing the response options of items in the Role Physical and Role Emotional scales from binary to ordinal, eliminating one response option from the Vitality and Mental Health scales, and normalising scale values in order to improve comparability among different groups [4]. The results of studies that used the SF-36 version 2 showed an improvement in accuracy, reliability and validity, without compromising the underlying structure of the conceptual model [6,9].
In Brazil, the SF-36 has been used in studies on the quality of life of patients with end-stage renal disease undergoing intermittent haemodialysis [10], hypertensive patients [11], patients subjected to surgical repair of hip fracture [12], patients living with HIV/AIDS [13], and in a household survey of residents of the state of São Paulo [14]. In these studies, the scores for SF-36 domains obtained in adult populations showed high reliability and good criterion validity compared to other instruments for assessing quality of life. In 2008, a survey on the social dimensions of inequality named Pesquisa Dimensões Sociais das Desigualdades (PDSD), coordinated by Instituto Universitário de Pesquisas do Rio de Janeiro (IUPERJ) with the participation of various teaching and research institutions in Brazil (UFMG, UFF, FIOCRUZ, UFRJ, PUC-RJ, UFBA), interviewed people around the country to assess the current situation of Brazilian society with regard to education, health, and professional paths, with the objective of informing social policies. The Health module of the PDSD evaluated several aspects of health using the standard SF-36 (v.2), whose questions relate to the 4 weeks prior to the interview. Unlike previous applications in the country, which dealt with limited samples of individuals with specific health problems, the PDSD used the SF-36 on a probability sample of Brazilian households, thus estimating national scores to be used in future applications of this instrument. The aim of this paper is to assess whether the scales obtained from the SF-36 (v.2) questionnaire used in the PDSD project meet the minimum psychometric standards of data quality, scaling assumptions, reliability, and validity; whether they reproduce the hypothesised mental and physical dimensions; and whether the relations between factors and scales predict their associations with external criteria for physical and mental health.

Data source and sampling
The Survey on the Social Dimensions of Inequality (PDSD) was a population-based household survey that interviewed, from July to December 2008, 12,423 heads of households and their spouses living in 8,048 permanent private households in common, non-special areas (including slums) in all regions of Brazil, in both urban and rural settings. The population was divided into sets called domains, defined according to region and setting (urban or rural); 6 domains were established, and the study aimed to obtain indicators for each of them, as well as for the population as a whole. Moreover, since the subject of the study was inequality, a sampling stratum consisting of the richest 10% of each census tract was created in order to improve the accuracy of the indicators of inequality. The sample comprised 1,374 census tracts, divided as follows: 200 in urban areas of the North and Central-West Regions (1,320 households); 336 in urban areas of the Northeast Region (1,776 households); 368 in urban areas of the Southeast Region (1,840 households); 260 in urban areas of the South Region (1,300 households); 60 richest tracts in metropolitan areas (420 households); 54 richest tracts in other areas (432 households); 48 tracts in rural areas of the Northeast Region (480 households); and 48 in other rural areas of the country (480 households). The percentage of households with only one eligible respondent ranged from 96% in rural areas of the Northeast to 31% in the metropolitan region of Rio de Janeiro, and 23% in the richest tracts of metropolitan areas. The estimated number of households in the sample accounted for replacement, in every socioeconomic stratum, due to absence from the household or refusal to participate in the study.
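As a quick arithmetic check, the allocation listed above can be tallied to confirm that the eight groups of census tracts add up to the reported 1,374 tracts and 8,048 households. The sketch below simply transcribes the figures from the text; the dictionary labels are shorthand introduced here for illustration.

```python
# Census tracts and households per sampling group, transcribed from the text.
# The keys are illustrative shorthand labels, not official domain names.
allocation = {
    "urban North and Central-West": (200, 1320),
    "urban Northeast": (336, 1776),
    "urban Southeast": (368, 1840),
    "urban South": (260, 1300),
    "richest tracts, metropolitan areas": (60, 420),
    "richest tracts, other areas": (54, 432),
    "rural Northeast": (48, 480),
    "other rural areas": (48, 480),
}

total_tracts = sum(tracts for tracts, _ in allocation.values())
total_households = sum(households for _, households in allocation.values())

print(total_tracts)      # 1374
print(total_households)  # 8048
```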
Among the households in the initial sample, 571 were ineligible and 20% were replaced, mainly due to the refusal of one spouse to take part in the study or because one of the spouses was not at home at the time of the interview, even though it had been scheduled in advance. To circumvent this problem, a pair of interviewers returned to such households during weekends to interview the couples simultaneously in different rooms of the house. In the upper-class (wealthier) tracts, apart from the difficulties mentioned above, contact with the subjects was more complicated due to the inaccessibility of buildings and private neighbourhoods (even when not gated) and the difficulty of convincing residents to answer the questionnaire. As for the collection process, the material produced each day was counted, checked, and filtered by the supervisors; the interviewer was contacted and returned to the field when necessary. After this process, all the questionnaires from each census tract were submitted to the team responsible for collecting field data. The questionnaires were then coded, typed, and had their logical and analytical consistency checked (via SPSS syntax) by a team of 20 researchers, who returned to the field when necessary for correction or confirmation. Data entry used automated controls that restricted input to the valid values for each question. Ten percent of all the typed material, stratified by the 30 data typists, was reviewed to ensure the quality of data entry. The sample size in this study met the International Quality of Life Assessment Project (IQOLA) criteria for comparison between sexes and age groups [15]. Research procedures were in accordance with the Declaration of Helsinki for the protection of human subjects from research risks, and consent of research subjects and informants was obtained in advance as mandated by the Code of Ethics of the International Sociological Association.

Data Collection Instrument
The instrument used in the PDSD included, apart from the Brazilian version of the SF-36 (v.2) [16], questions related to education, work, relationships and housing. The Brazilian version differed from the original questionnaire only in questions 3B, 3G, 3H, and 3I, since bowling and golf are not popular activities in Brazil and because the metric system of units is used in the country. The theoretical model of the SF-36 assumes that the Physical Functioning (10 items), Bodily Pain (2 items), and Role Physical (4 items) scales correlate strongly with the Physical Component and its summary measure (PCS). In turn, the Mental Health (5 items), Role Emotional (3 items), and Social Functioning (2 items) scales correlate more strongly with the Mental Component and its summary measure (MCS). Scales related to physical health are also expected to identify groups of respondents who have physical conditions and to perform less well than scales related to mental health in identifying groups with mental conditions. The Vitality (4 items), General Health (5 items) and Social Functioning (2 items) scales should correlate with both components. Thus, scales more focused on the PCS are more sensitive to treatments that target physical diseases, whereas scales more focused on the MCS are more sensitive to drugs and therapies that target mental diseases. Scoring followed the recommendations of the SF-36 developers for calculating the domains [17]: items were recoded, the responses to the items that make up each scale were summed, the scales were transformed into scores ranging from 0 to 100, and the scores were standardised and normalised so that they are centred on a mean of 50 with a dispersion (standard deviation) of 10.
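The two transformations described above, rescaling a summed raw score to the 0 to 100 range and then expressing it on a metric centred on 50 with a dispersion of 10, can be sketched as follows. This is a minimal illustration rather than the developers' scoring software: the function names are hypothetical, the example assumes a 10-item scale with items coded 1 to 3, and the reference mean and standard deviation are placeholder values, not published norms.

```python
def transform_0_100(raw_score, lowest_possible, highest_possible):
    """Rescale a summed raw scale score to the 0-100 range."""
    score_range = highest_possible - lowest_possible
    return (raw_score - lowest_possible) / score_range * 100.0


def norm_based_score(scale_score, reference_mean, reference_sd):
    """Express a 0-100 score on a metric with mean 50 and SD 10."""
    z_score = (scale_score - reference_mean) / reference_sd
    return 50.0 + 10.0 * z_score


# Assumed example: 10 items coded 1 to 3, so raw sums range from 10 to 30.
pf_0_100 = transform_0_100(raw_score=25, lowest_possible=10, highest_possible=30)
print(pf_0_100)  # 75.0

# Hypothetical reference values, used only to show the arithmetic.
pf_norm = norm_based_score(pf_0_100, reference_mean=80.0, reference_sd=20.0)
print(pf_norm)  # 47.5
```

On the normalised metric, a score below 50 indicates a value below the mean of whichever reference population supplied the mean and standard deviation.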

Data analysis
The socio-demographic characteristics of respondents are described in a frequency table. The completeness, distribution and internal consistency of items and scales were calculated in accordance with methods described in the literature for testing scaling assumptions [7,17].
The internal consistency of items was evaluated by analysing the correlations between the items and their respective scales, applying a correction for the effect of adding items to, or subtracting them from, the scale total (item overlap) [18]. Estimates of internal consistency with values above 0.40 were considered satisfactory. Measures of asymmetry in the distribution of scores were calculated, and the internal consistency of scales was estimated using Cronbach's alpha coefficient; values greater than 0.70 were taken as the minimum condition for analysis at the group level. In addition, the consistency of responses to the 15 pairs of questions was evaluated, as suggested by the authors of the SF-36 (v.2) [17]. The discriminant validity of items was calculated to assess the integrity of scale construction. For each scale, the success rate was calculated as the ratio of the number of successes to the total number of items tested; a success was counted whenever the correlation between the item and its respective scale was at least two standard errors above the correlations between the same item and the other scales. The percentage of respondents who achieved the highest (ceiling effect) or lowest (floor effect) scores was calculated to assess the instrument's ability to detect changes over time. The equality of item-scale correlations was assessed based on each item's contribution to the total score of the hypothesised scale, and when these correlations ranged from 0.40 to 0.70 it was assumed that the item contributed substantially to the score.
The associations between scales and the summary measures of components were calculated using Spearman's correlation coefficients and rotation matrices in factor analysis. Exploratory factor analysis using principal component analysis of the 8 SF-36 scale scores was conducted to extract the hypothesized two components from the correlations among the SF-36 scales. Two factors with eigenvalues greater than 1 were extracted and rotated to orthogonal simple structure using the varimax method to facilitate comparisons with published results and for ease of interpretation. The construct validities of the scales for each component were obtained through the ratio of the squared loading of each scale on the factor and the highest common variance of the respective component. Total, explained and reliable variance were obtained, respectively, from the extraction value of the communalities in each scale and from the division of this value by the scale's Cronbach's alpha. The construct validity of each scale was also measured by its ability to detect statistically significant variations between groups defined by the presence or absence of chronic disease, using the ratio of F-statistic values obtained from the comparison of these groups. The relative validity estimated for each scale was calculated as the ratio of the largest F-value obtained among scales to the F-value of the scale. Data from the heads of households and their spouses were weighted to represent the total Brazilian population. The software SPSS v.17 was used for statistical analysis.
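Three of the quantities described above (Cronbach's alpha, the item-scale correlation corrected for overlap, and floor and ceiling percentages) can be illustrated with a short sketch. It is not the SPSS procedure used in the study: the function names and the five-respondent data set are hypothetical, sampling weights are ignored, and the overlap correction is implemented in one common way, by correlating each item with the sum of the remaining items in its scale.

```python
from statistics import variance


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5


def cronbach_alpha(rows):
    """Cronbach's alpha; rows = one list of item responses per respondent."""
    k = len(rows[0])
    item_variances = [variance([row[j] for row in rows]) for j in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def corrected_item_scale_correlation(rows, j):
    """Correlate item j with the sum of the other items in its scale."""
    item = [row[j] for row in rows]
    rest = [sum(row) - row[j] for row in rows]
    return pearson(item, rest)


def floor_ceiling(scores, lowest=0.0, highest=100.0):
    """Percentage of respondents at the lowest and highest possible score."""
    n = len(scores)
    floor_pct = 100.0 * sum(s == lowest for s in scores) / n
    ceiling_pct = 100.0 * sum(s == highest for s in scores) / n
    return floor_pct, ceiling_pct


# Hypothetical responses of five people to a three-item scale.
responses = [[1, 2, 1], [2, 2, 3], [3, 4, 3], [4, 5, 5], [5, 4, 5]]
alpha = cronbach_alpha(responses)
print(alpha > 0.70)  # group-level criterion used in the text
print([corrected_item_scale_correlation(responses, j) > 0.40 for j in range(3)])
print(floor_ceiling([0, 50, 100, 100]))
```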

Characteristics of the Sample
Among study participants, 5,255 (42.3%) were male, and about half of the respondents were between 40 and 64 years of age (mean: 48.5, SD = 16.0 years), self-classified as white and had more than 4 years of schooling (Table 1). The presence of at least one chronic disease was reported by 63.3% of respondents; the most common conditions were diseases of the vertebral column (36.0%) and hypertension (28.3%). The vast majority (71%) of respondents were married or lived with a partner.

Characteristics of the Scales
The response rate for the SF-36 was 100%, i.e., all questions were answered by all respondents, despite the fact that 20% of households were replaced due to refusal. However, these replacements do not represent sampling losses or selection bias, since the sampling design allowed for a surplus of about 25% of cases. The indicator for the quality of understanding of the 15 pairs of questions revealed that only 7.4% of respondents showed inconsistency for a single pair of questions, while 7.3% showed inconsistency for 2 to 4 pairs of questions. In the pair of responses that showed the greatest inconsistency (3.7%), respondents claimed both severe limitation of activities such as bathing or dressing and no limitation of vigorous activities. The distribution of items showed that respondents used all categories, with a tendency towards more favourable health status among males, respondents aged under 40, and those with a higher educational level. All scales showed monotonically decreasing gradients with regard to co-morbidities and reported health status (p < 0.05).
The order of the means of item scores within each scale was consistent with the hypothesised expectations (Table 2). In the Physical Functioning scale, the item about vigorous activities (3D) had the lowest mean, and the item about milder activities (3J) had the highest mean. The means decreased over items about functioning ordered in a Guttman scale; for example, a higher frequency of limitations was reported when walking more than 1 km than when walking 100 m. Items in the Physical Functioning scale had the lowest mean scores. The mean scores of items that assessed whether the respondent had accomplished less than he/she would like (physical and emotional aspects) were high, indicating little disability. In the Vitality scale, the mean scores of items that addressed energy (well-being) were higher than the mean scores of items that addressed fatigue. In the Mental Health scale, item 9H (positive aspect of affection) had the highest mean and item 9B (negative aspect of affection) had the lowest mean. The mean score of the item that addressed health transitions was 2.90, which shows that respondents considered that their health was a little better than a year before the interview.
The descriptive and consistency measures for the eight dimensions addressed by the SF-36 are shown in Table 3. All correlations of items with their respective scales exceeded the suggested criterion (r = 0.40) for the internal consistency of items (median = 0.69). The internal consistency coefficients of the scales also exceeded the suggested criterion, ranging from 0.73 for Social Functioning (SF) and Vitality (VT) to 0.96 for Physical Functioning (PF) and Role Physical (RP). The scales had success rates of 100%, and the smallest difference between the correlations of items with the hypothesised and non-hypothesised scales was 0.10 (9H-MH and 9H-VT), which is more than two standard errors. The General Health, Vitality and Mental Health scales showed the lowest ceiling and floor effects.
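The statement that a difference of 0.10 exceeds two standard errors can be checked with a rough calculation. The sketch below assumes the common large-sample approximation SE ≈ 1/√n for a correlation coefficient and uses the unweighted number of respondents; the study's own computation may have differed, for example by accounting for sampling weights.

```python
import math

n_respondents = 12423          # heads of households and spouses interviewed
smallest_difference = 0.10     # 9H-MH versus 9H-VT, as reported in the text

# Assumption: standard error of a correlation approximated by 1 / sqrt(n).
standard_error = 1.0 / math.sqrt(n_respondents)
threshold = 2.0 * standard_error

print(round(threshold, 3))              # 0.018
print(smallest_difference > threshold)  # True
```

Under this approximation, the smallest observed difference is more than five times the two-standard-error threshold.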
The Physical and Mental Components explained 67.5% of the variance. The correlations between scales in the two dimensions of health showed a pattern that resembles the one described in the literature [7], except for the Role Emotional and Vitality scales, which were strongly correlated with the Physical and Mental Components, respectively, and the Social Functioning scale, which was moderately correlated with both components (Table 4). The Role Physical and Mental Health scales were, respectively, the most valid measures of the Physical and Mental Health Components.
In the comparisons between groups that differed by the presence or absence of depression, subjects who reported having the disease had lower mean scores in all scales (Table 5); the Mental Health scale (MH) discriminated best between the two groups, followed by SF and VT. Among the healthy group and the groups with one, two, or three or more conditions, mean scores decreased as the number of conditions increased. The Bodily Pain, General Health and Vitality scales discriminated best between those groups. Table 6 summarizes the comparisons between groups according to certain socio-demographic characteristics. The mean scores in all scales were higher in men than in women, and decreased with increasing age. Comparisons according to years of schooling showed that respondents with lower educational level had lower mean scores in all scales. The differences related to age and schooling were statistically significant (p < 0.05). Respondents who self-classified as black reported worse health status in all scales, but these differences were significant only for Role Physical, General Health, Social Functioning, and Role Emotional. The Mental Health, Vitality, and Bodily Pain scales discriminated best between sexes, while the Physical Functioning, Role Physical, and General Health scales discriminated best between groups that differed by age, schooling, and race/colour.
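The ranking of scales by how well they discriminate between groups rests on the relative validity ratio defined in the data analysis section (the largest F-value among scales divided by each scale's F-value). A minimal sketch of that ratio follows; the function name and the F-values are hypothetical and do not come from Tables 5 or 6. Other reports may present the reciprocal of this ratio.

```python
def relative_validity(f_values):
    """Largest F-value among scales divided by each scale's F-value."""
    largest_f = max(f_values.values())
    return {scale: largest_f / f for scale, f in f_values.items()}


# Hypothetical F-statistics from comparing two groups on three scales.
example_f = {"MH": 200.0, "SF": 100.0, "VT": 50.0}
print(relative_validity(example_f))
# {'MH': 1.0, 'SF': 2.0, 'VT': 4.0}
```

With this definition, the best-discriminating scale has a ratio of 1 and larger values indicate weaker discrimination.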

Discussion
The findings in this study showed that the psychometric properties of the Brazilian version of the SF-36 (v.2) questionnaire meet the standards established by the IQOLA project [7]. Even though the SF-36 had been previously tested in samples of the Brazilian population, this is the first time the Brazilian translation of the questionnaire has been used in a nationally representative probability sample. Data quality was satisfactory, with a high response rate and use of all response categories, suggesting that there were no problems related to the translation of items and categories in the questionnaire. Mean item scores corresponded to the hypothesised scales, except for the Role Physical and Role Emotional scales, due to the change in the SF-36 (v.2) questionnaire from binary to ordinal response options and the consequent increase in the number of response options and categories. The items in the Role Physical scale showed higher mean scores than those found in other studies [19]. These results suggest either that the presence of physical and emotional problems in the study population did not lead to significant impairment of daily activities or that, since this is a sensitive question asked by an interviewer, respondents tended not to report that kind of impairment [20].
The reliability estimates exceeded the minimum level (α = 0.70) suggested for comparisons between groups, especially in the case of the Role Physical and Role Emotional scales, which had the highest coefficients and a reduction in ceiling and floor effects. Compared with the estimates in the original version, substantial improvements were noted in item correlations and in [...] [6,9]. These improvements, as well as the higher sensitivity shown by the Role Physical scale to discriminate between groups that differ by age, schooling, and race/colour, can be attributed to changes in the categorisation of the items that make up these scales. The correlations between items and their respective scales and the success of scaling were consistent with previous studies [19,21,22]. The correlations between scales and components also showed patterns similar to other studies that used the SF-36, except for the Role Emotional scale, which showed a strong correlation with the Physical Component, in contrast with what was predicted by the model and observed in other studies that used the SF-36 (v.2) [6,9].
In general, construct validity tests showed that PCS scales discriminated better between groups that differed by the presence or absence of chronic diseases, while MCS scales discriminated better between groups that differed by the presence or absence of mental diseases. Men reported better health status than women, age was an important factor related to health, and lower educational levels were associated with poorer health status [23]. Similarly, the percentage of respondents who self-rated their health status as fair or poor was higher among women and increased with age, a pattern also found in the reports of limitation of physical activities and presence of chronic disease. These findings are consistent with previous studies [14,24,25]. The findings of this study showed that the Brazilian version of the SF-36 (v.2) questionnaire has good discriminatory power between groups of people with or without chronic diseases, suggesting good construct validity. On the other hand, the validity of the Mental Component of the Brazilian version of the SF-36 (v.2) was lower than reported in other studies, in view of the lower factor loadings of the Social Functioning and Role Emotional scales used to estimate this component. It has been speculated that cultural and social aspects in developing countries play a pivotal role in individuals' daily lives and may influence the performance of the Social Functioning and Role Emotional scales [26].

Conclusions
The findings of this study show that the changes made to the SF-36 (v.2) resulted in improved accuracy, reliability, and validity; the study also showed that the Portuguese translation of the questionnaire is adequate, given the completeness of responses and its internal consistency. The results of tests of scaling assumptions support the hypothesised scale structure of the SF-36 questionnaire in Brazil, and the factor loadings obtained can be used to weight the dimensions of the Physical and Mental Components in studies using population samples.