Measurement properties and normative data for the Norwegian SF-36: results from a general population survey

Background The interpretation of the SF-36 in Norwegian populations largely relies on normative data from 1996. This study presents data for the general population from 2002–2003, which have been used for comparative purposes but have not been assessed for measurement properties. Methods As part of the Norwegian Level of Living Survey 2002–2003, a postal survey was conducted in which 9,164 members of the general population aged 16 years and over, representative for Norway, received the Norwegian SF-36 version 1.2. The SF-36 was assessed against widely applied criteria including data completeness and assumptions relating to the construction and scoring of multi-item scales. Normative data are given for the eight SF-36 scales and the two summary scales (PCS, MCS) for eight age groups and gender. Results There were 5,396 (58.9%) respondents. Item levels of missing data ranged from 0.6 to 3.0% with scale scores computable for 97.5 to 99.8% of respondents. All item-total correlations were above 0.4 and were of a similar level with the exceptions of the easiest and most difficult physical function items and two general health items. Cronbach's alpha exceeded 0.8 for all scales. Under 5% of respondents scored at the floor for six scales. Role-physical had the highest floor effect (14.6%) and together with role-emotional had the highest ceiling effects (66.3 and 76.8% respectively). With three exceptions across the eight age groups, females had lower scores than males on the eight health scales. The two youngest age groups (<30 years) had the highest scores for physical aspects of health: physical function, role-physical, bodily pain and general health. The age groups 40–49 and 60–69 years had the highest scores for role-emotional and mental health respectively. Conclusions These SF-36 data meet necessary criteria for applications of normative data. The data are more recent and include more respondents, including older people, than the original Norwegian normative data from 1996, and can help the interpretation of SF-36 scores in applications that include clinical and health services research.


Background
The Short Form 36 (SF-36) Health Survey is the most evaluated health status instrument and the most reported within randomized controlled trials [1,2]. The instrument has been translated into many languages and the results of these studies are published in peer-reviewed journals [3]. SF-36 Version 1 [4] and the RAND-36 [5] include the same items and continue to be widely used, including in the great majority of Norwegian studies that include this instrument. The SF-36 is available in self- or interview-administered formats and with standard (four weeks) and acute (one week) recall periods.
The SF-36 was developed as part of the Medical Outcomes Study (MOS), a key objective of which was to develop more practical tools for monitoring the outcomes of medical care [4,6,7]. The instrument includes 36 items or questions that assess functional health and well-being from the perspective of the patient. The items contribute to eight health domains of physical functioning, role limitations due to physical problems, bodily pain, general health, vitality, social functioning, role limitations due to emotional problems and mental health. The eight domains all contribute to physical component summary (PCS) and mental component summary (MCS) scores, with their relative weights based on the results of factor analysis [8]. Short-forms include the SF-12 [9] and SF-8 [10] which give summary scores along with single item scores for each domain in the case of the latter.
Normative data derived from surveys of representative samples of the general population aid the interpretation of the SF-36 scale and summary scores [11]. Normative data have been available since early evaluations of the instrument, for example as part of the International Quality of Life Assessment (IQOLA) Project [3,12]. Much of this data was collected in the 1990s following forward-backward translation and testing for cross-cultural equivalence [3,13,14]. These normative data continue to be used [15][16][17] but more recent data are available for countries that were not included in the IQOLA Project [18][19][20].
The Norwegian SF-36 version 1.1 was forward-backward translated according to the IQOLA procedures and evaluated in patients with rheumatoid arthritis recruited from a patient register for Oslo [21]. Problems with missing data and suboptimal psychometric characteristics led to slight revisions to five items in version 1.2 [12], the one commonly used in Norway. This version was evaluated in a nationally representative sample of the Norwegian general population in the spring of 1996 and was used to derive the Norwegian norms [12]. The data is over 20 years old and may no longer be representative of the general population due to changes in both the composition of the general population and how individuals respond to such questions.
The present study presents more recent normative data for the Norwegian SF-36 v1.2 [22]. This data has been used to help the interpretation of SF-36 scores in Norwegian studies since 2013 [23][24][25]. Compared to the original Norwegian norms [12], there are a larger number of respondents including older people, which further contributes to the appropriateness of this new normative data. However, the measurement properties of this normative data have not been reported. Norms are also given for the SF-36 summary scales, which were developed later and hence were not included in the original normative data. The study also presents norms for the two scales that have a different scoring algorithm according to the RAND scoring together with alternative scoring for the summary scales [26][27][28]. The present study follows the IQOLA project and existing studies that have evaluated the SF-36 in general populations including tests of data quality and internal consistency.

Methods
Data collection
The postal survey comprised 9,164 members of the general population aged 16 years and over who were representative for Norway (Fig. 1). It was conducted as part of the Norwegian "Level of Living Survey 2002", a cross-sectional study on health undertaken by Statistics Norway that included home and telephone interviews prior to the postal survey [22]. The postal questionnaire, which included the Norwegian SF-36 version 1.2, was mailed in the period 15 November 2002 to 15 May 2003. SF-36 data were available from the Norwegian Social Science Data Services AS (NSD) only for the 5,396 respondents who had also participated in the interviews.

Measurement properties
The analysis followed the measurement criteria evaluated as part of the IQOLA project that included the Norwegian version of the SF-36 [3]. Data completeness was evaluated by considering the percentage of respondents with missing data at the item and scale levels including the percentage of scale scores calculable according to the SF-36 scoring. According to classical test theory and the construction of summated rating scales, item means are expected to be roughly equal but this is seldom the case due to heterogeneity of item content. For the physical functioning scale it was hypothesized that items assessing the least strenuous activities would have the highest mean scores and that the climbing stairs and walking items would have item means ordered as a Guttman scale. For the two role functioning scales it was hypothesized that the items relating to "accomplished less than you would like" would have the lowest item means. For the vitality scale it was hypothesized that items assessing well-being would have lower mean scores than items assessing disability, since the former define higher levels of health. For the mental health scale it was hypothesized that items assessing positive affect would have lower item means than those assessing negative affect. Internal consistency was assessed by item-total correlation and Cronbach's alpha. Item-total correlations of 0.4 or higher were considered satisfactory and should be approximately equal within each scale [3]. Definite scaling success was defined as an item correlating more highly with its own scale than with another scale by two standard errors or more, and probable scaling success as a correlation that was higher but by less than two standard errors [3]. Cronbach's alpha should be at least 0.70 and 0.90 for group and individual level analyses respectively [3]. Floor and ceiling effects were assessed through the percentage of respondents with the lowest and highest scale scores.
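The two internal consistency statistics named above can be illustrated with a minimal Python sketch. It is not the SPSS procedure used in the study: the function names and the toy responses are hypothetical, only complete cases are handled, and the item-total correlation is assumed to be corrected for overlap (each item is correlated with the sum of the other items in its scale).

```python
from statistics import mean, variance


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def cronbach_alpha(items):
    """items: list of k lists, one per item, each holding n complete responses."""
    k = len(items)
    totals = [sum(resp) for resp in zip(*items)]  # summated scale score per respondent
    item_variance_sum = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


def corrected_item_total(items):
    """Correlate each item with the sum of the remaining items (assumed overlap correction)."""
    correlations = []
    for i, item in enumerate(items):
        rest = [sum(resp) - resp[i] for resp in zip(*items)]
        correlations.append(pearson(item, rest))
    return correlations


# Made-up responses for a three-item scale and six respondents (illustration only).
toy_items = [[1, 2, 3, 4, 5, 3], [2, 2, 3, 5, 5, 3], [1, 3, 3, 4, 4, 2]]
alpha = cronbach_alpha(toy_items)
item_total = corrected_item_total(toy_items)
```

The results would then be compared with the criteria in the text: 0.4 for each item-total correlation, and 0.70 or 0.90 for alpha at group or individual level.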

Normative data
Normative data are presented in the same manner as previous SF-36 studies and are broken down by age and gender [12,14]. For the PCS and MCS, normative data are given for the standard scoring derived using an uncorrelated (orthogonal) factor solution [8] and scoring based on a correlated (oblique) factor solution [26]. The former is based on data for the general population of the US standardized to have a mean of 50 and standard deviation of 10 [8]. The latter uses weights derived from an oblique factor solution [26] standardized to have a mean of 50 and standard deviation of 10 in the current sample. The RAND scoring of the SF-36 is an alternative scoring for the same questionnaire (here Norwegian version 1.2). It has slightly different scoring for the bodily pain and general health scales. This study gives normative data for these scales alongside the alternative scoring for the PCS and MCS.
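The standardization to a mean of 50 and standard deviation of 10 can be sketched as follows. This is an illustration under stated assumptions, not the published scoring algorithm: the scale names, reference means, reference standard deviations and weights below are made-up placeholders, and the published factor score coefficients [8,26] are deliberately not reproduced.

```python
from statistics import mean, stdev


def rescale_to_t(values):
    """Linearly rescale scores so that this sample has mean 50 and SD 10."""
    m, s = mean(values), stdev(values)
    return [50 + 10 * (v - m) / s for v in values]


def norm_based_summary(scales, ref_mean, ref_sd, weights):
    """Weighted sum of z-scores against a reference population, expressed as a T-score.

    scales, ref_mean, ref_sd and weights are dicts keyed by scale name.
    """
    z = {name: (scales[name] - ref_mean[name]) / ref_sd[name] for name in weights}
    aggregate = sum(weights[name] * z[name] for name in weights)
    return 50 + 10 * aggregate


# Hypothetical placeholder values, NOT the published norms or weights.
ref_mean = {"PF": 80.0, "MH": 75.0}
ref_sd = {"PF": 20.0, "MH": 15.0}
weights = {"PF": 0.4, "MH": -0.2}

respondent = {"PF": 100.0, "MH": 75.0}
summary = norm_based_summary(respondent, ref_mean, ref_sd, weights)
```

A respondent who scores exactly at the reference means receives a summary score of 50, which is what makes such scores directly interpretable against the reference population.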
IBM SPSS 23 was used for descriptive statistics and to assess the measurement properties.

Results
Data collection
Of 9,675 eligible members of the general population, 511 people did not receive a questionnaire because of disability, language difficulties, or refusal. Of the 9,164 who received a questionnaire, SF-36 data were available for the 5,396 (58.9%) respondents who had also participated in the interviews (Fig. 1) and their background characteristics are shown in Table 1 [22]. Table 2 shows that the item levels of missing data ranged from 0.6 to 3.0% for the bodily pain item "how much did pain interfere with your normal work" and general health item "I seem to get sick easier than others" respectively. Levels of complete data for the eight scales ranged from 95.4 to 98.6% for general health and social functioning respectively. Following score computation the level of missing data ranged from 0.2 to 2.5% for these two scales. Levels of missing data were slightly higher for the summary scales, which are dependent on complete data for scale scores.
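The gap between complete item data and computable scale scores follows from how the standard SF-36 scoring handles missing items. The sketch below assumes the commonly described half-scale rule (a scale is scored when at least half of its items are answered, using the mean of the answered items) and assumes that all recoded items in a scale share the same response range. The function names and example rows are hypothetical.

```python
def scale_score(responses, lo, hi):
    """Return a 0-100 scale score, or None if fewer than half the items are answered.

    responses: recoded item scores for one scale, with None marking a missing item.
    lo, hi: lowest and highest possible recoded item score (assumed equal across items).
    """
    answered = [r for r in responses if r is not None]
    if len(answered) * 2 < len(responses):
        return None  # half-scale rule not met
    item_mean = sum(answered) / len(answered)  # equivalent to person-mean imputation
    return 100 * (item_mean - lo) / (hi - lo)


def percent_computable(rows, lo, hi):
    """Percentage of respondents for whom the scale score can be computed."""
    scores = [scale_score(row, lo, hi) for row in rows]
    return 100 * sum(s is not None for s in scores) / len(scores)


# Made-up respondents for a three-item scale scored 1-3 (illustration only).
toy_rows = [[1, 2, 3], [None, None, 3], [None, 2, 3], [None, None, None]]
share = percent_computable(toy_rows, lo=1, hi=3)
```

Under this rule a respondent with some missing items can still receive a scale score, which is why the percentage of computable scores exceeds the percentage with complete data.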

Measurement properties
For the physical functioning scale, the easiest and most difficult items had the highest and lowest means respectively (Table 2). Item means increased with Guttman scale ordering across the two sets of items relating to climbing stairs and walking. The items "accomplished less than you would like" had the lowest means for the two role functioning scales. For vitality, the item "have a lot of energy" had the lowest mean score. For mental health the two items assessing positive affect had the lowest mean scores. The mental health item assessing the worst mental health state "so down in the dumps that nothing could cheer you up" had the highest mean score. The item score standard deviations were roughly equivalent within scales with the exceptions of the easiest and most difficult physical functioning items and the vitality and mental health scale items relating to positive and negative aspects of health.
The item-total correlations all exceeded the 0.4 criterion and in general were fairly similar in size, with the exceptions of the easiest and most difficult physical functioning items. The two general health items "I seem to get sick easier than others" and "I expect my health to get worse" also had somewhat lower correlations than the other items for this scale. With the exception of the physical functioning item relating to vigorous activities, which had two correlations indicative of probable scaling success (within two standard errors) with the role-physical and general health scales, there was 100% definite scaling success for all of the items. Cronbach's alpha exceeded 0.8 for all scales and the physical functioning, role-physical and pain scales met the criterion for individual level analysis.
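The scaling success criterion can be sketched as a simple comparison of two correlations. The sketch assumes that the standard error of a correlation is approximated by 1/sqrt(n), which is an assumption made here and not a detail stated in the text. The function name and the example correlations are hypothetical.

```python
from math import sqrt


def scaling_success(r_own, r_other, n):
    """Classify one item-by-scale comparison.

    r_own: correlation of the item with its own scale.
    r_other: correlation of the same item with a competing scale.
    n: number of respondents (used for the assumed standard error 1/sqrt(n)).
    """
    se = 1 / sqrt(n)  # assumed approximation of the standard error
    difference = r_own - r_other
    if difference >= 2 * se:
        return "definite success"
    if difference > 0:
        return "probable success"
    return "not successful"


# Made-up correlations; n matches the number of respondents reported in the text.
label = scaling_success(0.60, 0.40, 5396)
```

With 5,396 respondents, two standard errors under this approximation is about 0.027, so even small differences between correlations count as definite success.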
Less than 5% of respondents scored at the floor for six scales. The highest floor effect of 14.6% was for the role-physical scale, which together with the role-emotional scale also had the highest ceiling effects of 66.3 and 76.8% respectively. Ceiling effects were also high for the social functioning scale and over 35% for the physical functioning and bodily pain scales.
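Floor and ceiling effects as defined in the methods are the percentages of respondents at the lowest and highest possible scale score. A minimal sketch, assuming 0-100 scale scores with None for scores that could not be computed (the function name and toy scores are hypothetical):

```python
def floor_ceiling(scores, lowest=0, highest=100):
    """Return (floor %, ceiling %) among respondents with a computable score."""
    valid = [s for s in scores if s is not None]
    floor_pct = 100 * sum(s == lowest for s in valid) / len(valid)
    ceiling_pct = 100 * sum(s == highest for s in valid) / len(valid)
    return floor_pct, ceiling_pct


# Made-up scale scores (illustration only).
toy_scores = [0, 50, 100, 100, None]
floor_pct, ceiling_pct = floor_ceiling(toy_scores)
```

Scales with few items and dichotomous responses, such as the two role scales, have few possible score values, which helps explain their large floor and ceiling percentages.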
Normative data
PCS and MCS were computable for 95.5% of respondents, with mean (SD) scores of 49.5 (10.2) and 51.2 (9.1) respectively for the standard scoring. Tables 3 and 4 give the normative data by gender for the different age groups. Table 3 is based on the standard scoring for the PCS and MCS [8] and Table 4 is based on the oblique (correlated) factor solution [26] and also includes the RAND scoring for bodily pain and general health. Across the age groups, females had lower scores than males, the only exceptions being small differences for physical functioning for 15-19 years, bodily pain for 20-29 years and general health for those over 79 years. Most of the differences were within two scale points up to the age group 50-59 years. However, females scored up to seven scale points lower for role-emotional in the age range 15-19 years. Much smaller differences were found for the remaining groups up to 50-59 years, where females scored two or more points lower for all scales with the exception above. For this and the older groups, the differences between the two genders generally increased for physical function, role-physical, bodily pain and social function, with the largest difference for the oldest age group being for physical functioning at over 14 points. The difference for the remaining scales decreased for the two oldest age groups. The two youngest age groups had the highest scores for physical aspects of health: physical function, role-physical, bodily pain and general health. The age groups 40-49 and 60-69 years had the highest scores for role-emotional and mental health respectively.
Across the age groups females had lower PCS and MCS scores, the only exception being for MCS in the age group over 79 years with the standard scoring (Table 3). The younger age groups had the highest PCS scores, which declined with successive age groups. For the standard scoring the MCS scores increased with successive age groups until the age group 60-69 and declined in the two older age groups. For the alternative scoring, MCS scores were very similar across the age groups above 15-19 years and there was a slightly sharper decline in scores for the two oldest age groups compared to that for the standard scoring (5 versus 2.6 points).

Discussion
This study was based on a general population survey from 2002-2003 [22] and provides more recent normative data for the Norwegian SF-36 version 1.2. This version of the SF-36 continues to be by far the most widely used in Norway together with normative data from 1996. The composition of the Norwegian general population has changed within this time, and the way individuals interpret and respond to items within health surveys may also have changed. Three Norwegian studies have used this more recent general population data for normative comparisons [23][24][25]. The current study is the first to assess this data for necessary measurement properties that have been widely applied in studies relating to normative data for the SF-36 including the IQOLA project [3]. The results of these analyses are an important prerequisite to publishing new normative data and using it for score interpretation. They show that the SF-36 has high data completeness and that the instrument meets the criteria underlying the construction and scoring of multi-item scales [3]. Levels of missing data were low and scaling assumptions were met in this population. With the exception of one item relating to bodily pain, items had lower levels of missing data than those for the Norwegian general population data collected as part of the IQOLA project [12]. The Scandinavian countries taking part in the IQOLA project had consistently higher levels of missing data across the 36 items than the other eight countries [3]. The present study found rates of missing data that were more in line with those for the other countries. All the correlations between the items and hypothesized scales met the criterion of 0.4. The levels of correlation were roughly equivalent with the same exceptions as those found in the IQOLA project [3]. Cronbach's alpha was greater than the criterion of 0.7 for group analyses and met the criterion of 0.9 for individual analyses for three scales. The levels were comparable to those found for Norway in the IQOLA Project with a slightly higher range of 0.81-0.92 compared to 0.79-0.90 [3]. Item means within the scales were generally similar to the original Norwegian normative data [3]. Compared to the earlier norms, item means were slightly lower for the physical functioning, role-physical, general health and role-emotional scales. They were slightly higher for vitality and mental health. The levels of floor and ceiling effects were broadly comparable to those found in the IQOLA project.
There are three possible reasons for the differences with the original Norwegian normative data. First, the composition of the general population changed in the intervening period, including its age composition and an increased number of immigrants. Second, the way in which individuals respond to SF-36 items may have changed, which might follow increasing education and welfare levels. Third, this is the same version of the SF-36 as that used in the IQOLA project but subtle differences in the design and layout may have influenced responses. The IQOLA survey used an early standard layout for the SF-36 whereas the present survey used a slightly different, more compact layout. It is only possible to speculate about the role of these different factors but together they represent good grounds for collecting and making available up-to-date normative data for widely used generic instruments including the SF-36.
Compared to the original normative Norwegian SF-36 data [12] this study has three important strengths. First, there are 3,000 more respondents in the current study compared to the original normative data, which makes the data a more suitable basis for interpreting SF-36 scores and changes in those scores for respondents with different health problems. Normative data often has a lower proportion of older respondents, particularly those aged 70 and over. Life expectancy continues to increase and an increasing proportion of applications of the SF-36 will include older people. The present study included 619 respondents in this age range who completed at least one SF-36 scale compared to just 227 for the original Norwegian normative data [12]. Moreover, there were 181 respondents aged over 79 years in the current study, which will improve the interpretation of SF-36 scores for older people with health problems. Second, during the two decades up to 2010, Norway experienced better living standards coupled with changes in the composition of the general population including increasing numbers of immigrants, older people and people living alone. Such changes will contribute to changes in the health status of the general population and therefore there is a need for more recent normative data. Third, the standard scoring for the SF-36 summary scales has been criticized [5][26][27][28]. The current study includes normative data for both the PCS and MCS summary scores and the alternative RAND scoring for both these and the scales of bodily pain and general health. This normative data has not previously been reported for Norway. The alternative scoring algorithm is based on a correlated (oblique) physical and mental health factor model that is considered more appropriate given the moderate level of correlation found between physical and mental health [5][26][27][28]. The authors of the alternative scoring algorithm recommend that weights be derived from other samples [26], which might include Norwegian data together with a comparison of weights based on the standard scoring. However, the use of the published US weights, as in the present study, enables comparisons with existing studies.
There are several possible study limitations. The main weaknesses of the present study are that it was not specifically designed for collecting normative data and the age of the data. Studies that are designed to collect normative data are costly and rarely undertaken. The study was pragmatic in its use of the most recent general population data available in Norway with a sufficient sample size. This data was used for comparative purposes in three recent Norwegian studies [23][24][25], which may be seen as a response to the need for more up-to-date normative data. It was therefore necessary to assess data completeness and to test the assumptions underlying the eight multi-item scales which comprise the SF-36 in this general population. The survey was part of a larger survey [22], which included home or telephone interviews with respondents prior to the postal survey described here. It is possible that prior contact including interviews may have influenced the response rate or responses to the postal questionnaire but assessment of such bias was not possible given the study design.

Conclusion
In conclusion, more recent data for the SF-36 version 1.2 from a large-scale survey of the Norwegian general population met important criteria described in the IQOLA Project [3]. The study found adequate evidence to support the use of the data for normative comparisons in Norwegian studies. It is recommended that this data be used in clinical and health services research for normative comparisons until more up-to-date general population data, derived from a survey specifically designed for this purpose, are available for the SF-36 in Norway.