Comparing the reliability and validity of the SF-36 and SF-12 in measuring quality of life among adolescents in China: a large sample cross-sectional study

Objective We compared the reliability and validity of the Short Form 36 (version 1, SF-36) and the Short Form 12 (version 1, SF-12) in adolescence (10–19 years), the period from puberty to maturity that ends legally at the age of majority, in order to supply evidence for the selection of instruments measuring quality of life (QOL) and for decision-making processes concerning adolescents in China.
Methods Stratified cluster random sampling was adopted according to geographical location, and the SF-36 was administered to assess QOL. The Pearson correlation coefficient was used to show correlation. Cronbach’s alpha and construct reliability (CR) were used to evaluate the reliability of the SF-36 and SF-12, while criterion validity and average variance extracted (AVE, convergence validity) were used to evaluate validity. Confirmatory factor analysis was used to calculate the factor loadings for the items of the SF-36 and SF-12, from which the CR and AVE were obtained. The Samejima graded response model (two-parameter logistic model) in item response theory was used to estimate item discrimination, item difficulty, and item average information for the items of the SF-36 and SF-12.
Results A total of 19,428 respondents were included in the study. The mean age of respondents was 14.78 years (SD = 1.77). Reliability of each domain of the SF-36 was better than for the corresponding domain of the SF-12. The domains of PF, RP, BP, and GH in the SF-36 had good construct reliability (CR > 0.6). The criterion validities of some domains of the SF-36 were a little higher than those of the corresponding domains of the SF-12, except for PCS. The convergence validities of the SF-12 were higher than those of the SF-36 in PF, RP, BP, and PCS. The BP, SF, RP, and VT items in the SF-12 had acceptable discrimination, which was higher than in the SF-36. The average amounts of item information for BP, VT, SF, RE, and MH in the SF-36 and SF-12 were poor.
Conclusion Two component (PCS and MCS) measurements of the SF-12 appeared to perform at least as well as the SF-36 in cross-sectional settings in adolescence, but the reliability and validity of the 8 domains of the SF-36 were better than those of the SF-12. Some domains, for instance SF and BP, were not suitable for adolescents or need to be studied further.


Introduction
Youth involves identity-building. Experiences during this developmental period can shape long-term attributes and attitudes and may lead to the adoption of a lifetime of healthy or risky behaviors [1]. The determinants of current and future health and disease for adolescents span the social and psychological fields [2]. A deeper understanding of how adolescents view their lives allows a greater understanding of their present health. The health-related quality of life (HRQOL) of school-age adolescents has been the subject of international interest. The term refers to a comprehensive model of subjective health that covers physical, social, psychological, and functional aspects of individual well-being as a multidimensional and subjective construct [3,4]. The point of all this interest is to guide the organization of resources and decision-making processes to promote adolescents' quality of life. To accomplish this, understanding the current quality of adolescents' life is essential [5,6].
The SF-36 was developed and validated as a generic short-form instrument for measuring HRQOL; it was widely applied to assess important QOL domains in the Medical Outcomes Study [7]. The SF-36 consists of eight QOL domains: PF, physical functioning; RP, role physical; BP, bodily pain; GH, general health; VT, vitality; SF, social functioning; RE, role emotional; and MH, mental health; with two summary components having been constructed to summarize the physical and mental components (PCS and MCS, respectively) [8]. The factor structures of SF-36 that have been identified in China suggest that PCS is primarily a comprehensive measure of PF, RP, BP, and GH and that MCS mainly encompasses the domains of VT, SF, RE, and MH. However, the two components somewhat overlap, and especially the VT, GH, and SF domains have noteworthy correlations with both components [9].
One of the major advantages of using the SF-36 is that it allows for QOL scores to be compared with scores in different groups [10]. However, because the SF-36 was not originally designed to measure important aspects of the QOL of adolescents specifically, some studies have determined that the instrument, especially the mental component (MCS), is relatively insensitive to variations in different populations over time [11][12][13].
A substantially shorter instrument, the SF-12, was developed by Ware and colleagues as an abbreviated version of the SF-36, reducing the number of items from 36 to 12 [14,15]. Most of the respondents testing the new instrument completed the SF-12 in less than a third of the usual time needed to complete the SF-36 [8]. Ware showed that the two instruments are highly correlated, and about 90% of the variation in both the physical and the mental component summaries measured in the SF-36 was explained by the same summary measures of the SF-12 [16]. Subsequent studies comparing the two scales have reported varying results depending on the disease or health condition of interest [17][18][19]. The SF-12 and SF-36 are available in many languages and have been applied to all kinds of groups, including adolescents [15,20,21,22]. Since studies have demonstrated that both scales are valid instruments for this age group, they have been used to evaluate the QOL of adolescents in China [23] as more and more studies have focused on the quality of life of healthy adolescents in that country.
Most studies of adolescent QOL in China have surveyed the perception of QOL among chronically ill adolescent patients and were conducted in hospital or outpatient settings [25,26]. Recently there has been a growing interest in the study of healthy groups of adolescents, leading to studies being performed in other contexts, such as in schools [27,28], because of a growing awareness of the need to recognize and monitor adolescents who are most vulnerable to a poor health-related quality of life [29,30]. In some studies, though the SF-12 and SF-36 were used to investigate perceived adolescent QOL, it was unclear which of the two instruments was more suitable to the age group [23]. Thus, our study aimed to evaluate the QOL of healthy adolescent students at schools in China by using the SF-36 and SF-12 and comparing the reliabilities and validities of both, supplying evidence for the selection of instruments measuring quality of life and decision-making processes and thereby promoting the quality of life of adolescents.

Study design and sample
Stratified cluster random sampling was adopted [31], first dividing regions by geographical location: Dongguan, Shanghai, Shenyang, Wuhan, Xi'an, and Kunming represented the south, east, north, central, northwest, and southwest regions, respectively. These areas were chosen in order to ensure proper representation by including participants from geographically diverse areas. Second, middle schools were randomly selected and followed by grade (first grade of junior school to third grade of high school).
Keywords: Quality of life, Reliability, Validity, Discrimination, Average information, SF36, SF12, Chinese adolescents

The sampling frame comprised all middle schools, as reported by each city. In each city, middle schools were selected by simple random sampling according to a random number table. Finally, 17 middle schools were included (4 in Dongguan, 1 in Shanghai, 3 in Shenyang, 1 in Wuhan, 4 in Xi'an, and 4 in Kunming). The number of schools in each city was limited by the research group's local investigative capacity.
All students enrolled from the first grade of junior high school to the third grade of high school were included in the survey. Students were excluded if they had any physical or mental condition that made them unable to complete questionnaires, or if they or their parents had not signed an informed consent form. The study was approved by the Institutional Review Board (IRB) at the Affiliated Hospital of Guangdong Medical University. Verbal informed consent for publication was obtained from the participants and/or their relatives, as approved by the IRB. The response rate was almost 80%. The present study included 19,428 adolescents with complete information on quality of life measures. The sample sizes for each region were Dongguan (4490, 23.1%), Shanghai (1039, 5.3%), Shenyang (3539, 18.2%), Wuhan (1371, 7.1%), Xi'an (4197, 21.6%), and Kunming (4792, 24.7%).

Instruments and variables
SF-36 (version 1) was used to assess QOL. Version 1 differs from version 2 in two respects. First, the answer ranks of RP, RE, MH, and VT are different, and second, the scoring rules are different [32]. Since the use of SF-36 (version 2) requires authorization, version 1 was used in this study. Based on the response to individual items comprising the 8 subscales and using a z-score transformation, the scores of each subscale were calculated [33]. First, the domain items were coded; second, the items were scored; and finally, the scores were converted as shown in Formula 1.
$$\text{Score} = \frac{\text{actual score} - \text{lowest possible score of the subscale}}{\text{highest possible score of the subscale} - \text{lowest possible score of the subscale}} \times 100\% \tag{1}$$

Scoring norms for the Chinese versions of the SF-36 (version 1) and SF-12 are not currently available, so scoring of these instruments in China has mainly been based on American norms, which have been shown to be valid [23,32,34]. Using z-transformed scores and factor score coefficients, we calculated PCS and MCS scores of the SF-36 according to Formulas 2 and 3. SF-12 scores (eight subscales, PCS-12, and MCS-12) were calculated using the SF-12 items that are embedded in the SF-36 [35]. This method has been presented as being equivalent to calculating the SF-12 as a stand-alone questionnaire [17]. All summary scores range from 0 to 100, where higher scores indicate better QOL. We calculated PCS and MCS scores of the SF-12 according to the SF-12 scoring algorithm proposed by John E Ware in 1995, which has been widely used in China [36].
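The 0 to 100 conversion in Formula 1 can be illustrated with a short sketch. The function name and the example raw range (a subscale of ten items scored 1 to 3) are hypothetical choices for illustration and do not reproduce the authors' scoring scripts.

```python
def transform_scale_score(raw_score, lowest_possible, highest_possible):
    """Rescale a raw subscale score to the 0-100 range, following Formula 1."""
    if highest_possible <= lowest_possible:
        raise ValueError("highest_possible must exceed lowest_possible")
    if not lowest_possible <= raw_score <= highest_possible:
        raise ValueError("raw_score lies outside the possible range")
    score_range = highest_possible - lowest_possible
    return (raw_score - lowest_possible) / score_range * 100.0


# Hypothetical example: ten items scored 1-3 give a raw range of 10-30.
example_score = transform_scale_score(raw_score=25, lowest_possible=10, highest_possible=30)
print(f"Transformed score: {example_score:.1f}")
```

A raw score at the bottom of the range maps to 0 and a raw score at the top maps to 100, so higher transformed scores indicate better QOL.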

Statistical analysis
For descriptive analyses, we aimed to show overall demographics and QOL. We calculated means and standard deviations of the QOL scores on the SF-36 and SF-12. To assess how closely the two instruments agree, the Pearson correlation coefficient was calculated between the corresponding domains of the SF-36 and SF-12.
Cronbach's alpha for domains composed of multiple items and construct reliability (CR) were used to evaluate the reliability of the SF-36 and the SF-12, and validity indicators were represented by criterion validity and convergence validity (average variance extracted, AVE). Criterion validity was expressed by the correlation between the response of each domain and self-reported health status. The formulas for CR and AVE are given in Formulas 4 and 5, where λ = factor loading and θ = measurement error.
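As a minimal sketch of these reliability and validity indices, the code below assumes the commonly used forms CR = (Σλ)² / [(Σλ)² + Σθ] and AVE = Σλ² / (Σλ² + Σθ), together with the standard Cronbach's alpha formula. Setting θ = 1 − λ² for standardized loadings when no error variances are supplied is an assumption of this sketch, and all function names are hypothetical.

```python
from statistics import variance


def _error_variances(loadings, errors):
    # Assumption: with standardized loadings, theta = 1 - lambda^2.
    return [1.0 - lam ** 2 for lam in loadings] if errors is None else list(errors)


def construct_reliability(loadings, errors=None):
    """CR = (sum of loadings)^2 / ((sum of loadings)^2 + sum of error variances)."""
    theta = _error_variances(loadings, errors)
    squared_sum = sum(loadings) ** 2
    return squared_sum / (squared_sum + sum(theta))


def average_variance_extracted(loadings, errors=None):
    """AVE = sum of squared loadings / (sum of squared loadings + sum of error variances)."""
    theta = _error_variances(loadings, errors)
    sum_of_squares = sum(lam ** 2 for lam in loadings)
    return sum_of_squares / (sum_of_squares + sum(theta))


def cronbach_alpha(item_scores):
    """item_scores holds one list per item, each with one value per respondent."""
    k = len(item_scores)
    if k < 2:
        raise ValueError("Cronbach's alpha needs at least two items")
    totals = [sum(responses) for responses in zip(*item_scores)]
    item_variance_sum = sum(variance(item) for item in item_scores)
    return k / (k - 1) * (1.0 - item_variance_sum / variance(totals))


# Hypothetical loadings for a three-item domain.
loadings = [0.7, 0.7, 0.7]
print(f"CR  = {construct_reliability(loadings):.3f}")
print(f"AVE = {average_variance_extracted(loadings):.3f}")
```

Because CR sums the loadings before squaring, it grows with the number of items, whereas AVE does not; this is one reason a short domain can show acceptable AVE but low CR.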
The sample was randomly split into a training set (50%) and a validation set (50%) to examine the construct validity of the SF-36 and SF-12. Using the training set, an exploratory factor analysis (EFA) was performed to explore the latent structure based on the correlation matrix, and factor loadings (λ) were estimated by maximum likelihood estimation and rotated by promax. Using the validation set, confirmatory factor analysis (CFA) was used to validate the two-factor structure previously identified in some Chinese populations. The weighted least-squares method was used for the estimation of CFA parameters. Standardized regression coefficients were taken as factor loadings (λ). The classic goodness-of-fit χ² statistic and its degrees of freedom were reported. However, as the χ² statistic is highly sensitive in large samples, assessment of goodness-of-fit was based on the recommended fit indices: the root mean square error of approximation (RMSEA, close to 0.06 or lower) and the comparative fit index (CFI, close to 0.95) [37]. The basic two-factor CFA model (Model I) without correlated errors was assessed first (PCS associated with PF, RP, BP, and GH, whereas MCS associated with VT, SF, RE, and MH). Subsequently, a model in which PCS and MCS were each associated with most of the 8 domains (Model II), as described above, was also assessed [23]. Then, the EFA or CFA was repeated on another data set, and mean estimates were reported.
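To make the two fit indices concrete, the following sketch computes RMSEA and CFI from χ² statistics using their usual textbook definitions. The input values are hypothetical, and software packages differ in small conventions (for example, N versus N − 1 in the RMSEA denominator), so this is not a reproduction of the Amos output.

```python
import math


def rmsea(chi_square, degrees_of_freedom, sample_size):
    """Root mean square error of approximation, using N - 1 in the denominator."""
    excess = max(chi_square - degrees_of_freedom, 0.0)
    return math.sqrt(excess / (degrees_of_freedom * (sample_size - 1)))


def cfi(chi_square_model, df_model, chi_square_baseline, df_baseline):
    """Comparative fit index of a model relative to the baseline (independence) model."""
    model_excess = max(chi_square_model - df_model, 0.0)
    baseline_excess = max(chi_square_baseline - df_baseline, model_excess)
    if baseline_excess == 0.0:
        return 1.0
    return 1.0 - model_excess / baseline_excess


# Hypothetical values, chosen only to illustrate the cut-offs quoted above.
print(f"RMSEA = {rmsea(100.0, 50, 501):.4f}")
print(f"CFI   = {cfi(100.0, 50, 1066.0, 66):.3f}")
```

Both indices work with the excess of χ² over its degrees of freedom rather than with χ² itself, which is why they are less affected by very large samples than the χ² test.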
According to the evaluation results of the samples, and taking into account the ordered, multi-category form of the instrument items, the Samejima graded response model (two-parameter logistic model) in item response theory was used to estimate the discrimination, difficulty, and average information of each item [38]. RStudio, Amos 20.0, and Multilog 7.03 were used to process data.
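The graded response model can be sketched as follows for a single item with discrimination a and ordered difficulty thresholds b. The category probabilities and item information follow the standard form of Samejima's model; averaging the information over an evenly spaced ability grid from −3 to 3 is an assumption of this sketch and may differ from how Multilog summarizes item information. All names and parameter values are hypothetical.

```python
import math


def _cumulative_probabilities(theta, discrimination, thresholds):
    # P*(k): probability of responding in category k or higher, padded with 1 and 0.
    if list(thresholds) != sorted(thresholds):
        raise ValueError("thresholds must be in ascending order")
    inner = [1.0 / (1.0 + math.exp(-discrimination * (theta - b))) for b in thresholds]
    return [1.0] + inner + [0.0]


def grm_category_probabilities(theta, discrimination, thresholds):
    """Probability of each response category at ability level theta."""
    cum = _cumulative_probabilities(theta, discrimination, thresholds)
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]


def grm_item_information(theta, discrimination, thresholds):
    """Item information: sum over categories of (dP_k/dtheta)^2 / P_k."""
    cum = _cumulative_probabilities(theta, discrimination, thresholds)
    information = 0.0
    for k in range(len(cum) - 1):
        probability = cum[k] - cum[k + 1]
        slope = discrimination * (cum[k] * (1 - cum[k]) - cum[k + 1] * (1 - cum[k + 1]))
        if probability > 0.0:
            information += slope ** 2 / probability
    return information


def average_item_information(discrimination, thresholds, low=-3.0, high=3.0, steps=61):
    """Mean information over an evenly spaced ability grid (an assumption of this sketch)."""
    grid = [low + i * (high - low) / (steps - 1) for i in range(steps)]
    return sum(grm_item_information(t, discrimination, thresholds) for t in grid) / steps


# Hypothetical five-category item.
a, b = 1.5, [-2.0, -1.0, 0.0, 1.0]
print([round(p, 3) for p in grm_category_probabilities(0.0, a, b)])
print(round(average_item_information(a, b), 3))
```

Information scales with the square of the discrimination parameter, so items with low discrimination contribute little information at any ability level.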

Sample characteristics
Of the 20,226 questionnaires received, 798 had missing responses on some of the SF-36 items. In the end, 19,428 respondents were included in the study. The mean age of respondents was 14.78 years (standard deviation [SD] = 1.77), and 49.4% (9,595) were boys. Among the SF-36 and SF-12 domains, the PF mean score was the highest, and the RE mean score was the lowest. The mean PCS score was higher than the mean MCS score. The biggest mean difference in scores between the two instruments was in the domain of SF.
Of the corresponding domains, the RE domains were the most strongly correlated (r = 0.923), while the smallest correlation coefficient was in the VT domains (r = 0.670), which means domains of the SF-12 could reflect the information from 67.0% to 92.3% of the corresponding domains of the SF-36 (Table 1).

The reliability and validity in classical test theory

Factor analysis by EFA
The construct validity of the SF-36 was good in adolescents, as determined by the Kaiser-Meyer-Olkin Measure of Sampling Adequacy (0.884). Communalities of all variables were over 0.5. Factors with eigenvalues greater than 1 were extracted and rotated by the varimax method. Eight components were produced and explained 69.21% of the total variance. The structure loading of factors extracted and the component score coefficient matrix are presented in Table 2. The identified structure of the 8 domains (PF, RP, BP, GH, VT, SF, RE, and MH) was not supported by EFA. The domains of BP, SF, VT, and MH did not separate into the identified structures, due to the strong correlations between BP and SF and between VT and MH (Table 2).
Similarly, the construct validity of the SF-12 was also good in adolescents; the Kaiser-Meyer-Olkin Measure of Sampling Adequacy was 0.732. Eight components were extracted and explained 63.50% of the total variance. Due to the strong correlations between MH and SF and between VT and MH, the domains of SF, VT, and MH were not divided into identified structures in the SF-12 (Table 3).

Factor analysis by CFA
We tested two conceptual models. Conceptual Model I assumed that PCS was associated with PF, RP, BP, and GH, whereas MCS was associated with VT, SF, RE, and MH. Conceptual Model II assumed that PCS and MCS were each associated with most of the 8 domains. Fit indices of the two models revealed that, for both the SF-36 and the SF-12, Conceptual Model I fitted the identified structures better than Conceptual Model II (Table 4). The structure of Model I has been used widely in studies in China. In our study, we selected the structure of Model I for the two summary scales (PCS and MCS) of the SF-36 and the SF-12. Standardized parameter estimates for CFA on each path are shown in Fig. 1.

Validity and reliability of domains of SF-36 and SF-12
As mentioned above, standardized parameter estimates for CFA in Model I were used as factor loadings. CR and AVE were calculated according to Formulas 4 and 5.
Except for the SF domain in the SF-36 (Cronbach's α = 0.211), domains composed of multiple items had generally acceptable internal reliability (Table 2). The low internal reliability of the SF domain was probably due to inconsistent understanding of its only two items, which might be biased or difficult to parse for adolescents ("To what extent has your physical health or emotional problems interfered with…" and "How much of the time has your physical health or emotional problems interfered with…"). Moreover, consistent with related studies, the internal reliability of the MH domain in the SF-12 was low (Cronbach's α = 0.369). In addition, the internal reliability of the SF-36 in each domain was better than that of the corresponding domain of the SF-12, which is consistent with internal reliability tending to be higher when a domain has more items. The domains of PF, RP, BP, GH, and PCS in the SF-36 had good construct reliability (CR > 0.6). Except for RP and PCS, the domains in the SF-12 did not show good construct reliability, especially the domains of GH, VT, and SF. The criterion validity was calculated based on the item of self-reported health ("In general, would you say your health is…"). It is worth noting that criterion validities of all the domains of the two instruments were low, but especially so for PF, RP, and SF, which suggests that the correlation between physical health and self-perceived health was weak. Moreover, in PCS, the criterion validity of the SF-12 was much higher than the criterion validity of the SF-36. Although the criterion validities of the SF-36 were higher in the other corresponding dimensions, the gaps were small. PF, RP, and PCS had generally acceptable convergence validity in both the SF-36 and the SF-12. Moreover, in the RP and PCS domains, the convergence validities of the SF-12 were higher than those of the SF-36, while there was little difference in the other domains except BP, GH, and VT (Table 5).

Validity and reliability in item response theory
The parameter values and information content of the items according to the Samejima graded response model are shown in Table 6. Item discrimination ranged from 0.45 to 2.73, a wide range. The difficulty of the items increased monotonically from the lowest level to the highest level, which met the difficulty assumptions of the model. The average amount of information of each item was between 0.07 and 1.02. In the SF-36, the domains of PF, RP, GH, and RE had acceptable discrimination of items (> 1), but the remaining dimensions were less differentiated, especially BP and SF, probably because for teenagers there was strong homogeneity between individuals in terms of physical pain and social function. On the other hand, in the SF-12, BP, SF, RP, and VT had higher discrimination of items than in the SF-36.

Table 3 Results of factor analysis of SF-12 among adolescents (n = 9741). Components: 1-PF, 2-RP, 3-BP, 4-GH, 5-VT/MH, 6-SF/MH, 7-RE, 8-RE

According to the relevant literature, a total amount of information on a scale above 25 indicates that the quality of the evaluation items is good, whereas a total below 16 indicates that the evaluation items are poor [31]. Given that the SF-36 has 36 items, we divided 25 and 16 by 36 to get the average information amount for each item, so as to obtain the determination criterion: items with an average information amount > 0.69 (25/36) were judged to be excellent, while items < 0.44 (16/36) were judged to be poor. Similarly, for the SF-12, items with an average information amount > 2.08 (25/12) were judged to be excellent, while items < 1.33 (16/12) were judged to be poor. Except for PF05 and PF09, the items of the PF domain in the SF-36 were excellent, and the items of the GH domain in the SF-36 were excellent too, though the items of BP, VT, SF, RE, and MH were poor. On the other hand, the average amounts of information in the SF-12 items were poor (Table 6).
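The per-item cut-offs can be reproduced with a few lines of arithmetic. The sketch below simply divides the scale-level thresholds of 25 and 16 by the number of items; the function names and the "intermediate" label for values between the two cut-offs are hypothetical.

```python
def per_item_cutoffs(n_items, good_total=25.0, poor_total=16.0):
    """Divide the scale-level information thresholds by the number of items."""
    return good_total / n_items, poor_total / n_items


def classify_item(average_information, n_items):
    """Label an item by its average information relative to the per-item cut-offs."""
    good, poor = per_item_cutoffs(n_items)
    if average_information > good:
        return "excellent"
    if average_information < poor:
        return "poor"
    return "intermediate"  # hypothetical label for values between the cut-offs


for name, n_items in (("SF-36", 36), ("SF-12", 12)):
    good, poor = per_item_cutoffs(n_items)
    print(f"{name}: excellent > {good:.2f}, poor < {poor:.2f}")
```

Because the same scale-level thresholds are spread over fewer items, each SF-12 item must carry three times as much information as an SF-36 item to reach the same rating. With the highest average item information reported above being 1.02, every SF-12 item falls below the 1.33 cut-off.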

Discussion
Psychometric standards were used to evaluate the reliability and validity of the standard Chinese SF-36 and SF-12 instruments in a large sample of Chinese adolescents. Our study suggested that the SF-12 and the SF-36 correlated very highly in this population. Although the reliability and average amount of information of the SF-12 domains and items were lower than those of the SF-36, the convergence validity and item discrimination of some domains in the SF-12 were somewhat better than those of the corresponding domains in the SF-36. For both the SF-36 and the SF-12, high correlations existed between some domains, for example, between MH and the VT and SF domains. The psychometric properties of the two broader components (PCS and MCS) were better than those of the individual domains. Studies have shown that the two instruments discriminated between adolescents with physical and mental health problems and performed well in associating with other clinical criteria [39][40][41]. A study of 31,357 adolescents in Hong Kong showed that the two components and a single general health component of the standard Chinese SF-12 were appropriate health indicators for Chinese adolescents [23]. Studies have also shown that the SF-12 correlated highly with the SF-36 in obese and non-obese patients [3,4]. However, many problems with the two instruments remain, such as a high correlation between the two components, low internal reliability, and the ceiling effect within individual domains [42]. Comparing the SF-12 and the SF-36, previous studies in patients with specific diseases or health conditions have generally found moderate to high correlations between corresponding domains and components of both instruments [15,19]. Our study also demonstrated these correlations. Since the SF-12 is embedded in the SF-36, we expected reasonably high correlations.
Overall, the dimensions of the SF-12 scale could reflect 64.5% to 92.3% of the corresponding dimensions of the SF-36 scale in Chinese adolescents, with low internal reliabilities and convergence validities found in some domains.
Low reliability and validity of the social functioning domain were noted. This might indicate questionable reliability and validity of the instruments or a lack of representation [3]. On the other hand, it might also be attributed to inconsistent responses, which might occur if respondents completed a questionnaire without comprehending the items, as might happen with adolescents [23]. Related research has shown that, because of the brevity of the SF-12, it is not possible to get reliable information for each of the eight domains, so one would not be able to draw conclusions about specific domains [43]. Indeed, we found the SF-36 was better than the SF-12 in terms of reliability. At the same time, comparing the SF-12 and the SF-36 in terms of validity, no loss in effectiveness was shown, and there was even a slight improvement. But we also found that the criterion validities of PF, SF, and MH were low. Relevant literature has found that for most adolescents, performing moderately strenuous activities or climbing several flights of stairs would not present problems because this age group is typically physically fit and active, but when combined with a limited social life and less satisfactory mental state, inconsistent responses would be possible [23].
Unlike previous studies [21,42,43,44], we found the domains of BP and SF in general had poor discrimination of items, while PF in general, as well as BP, SF, RP, and VT on the SF-12, had higher discrimination of items than in the SF-36. We suggest that, compared with PF items, the items in these other domains were not easy for teenagers to understand, resulting in a lack of sensitivity in the measurement. Similarly, the SF-12 lost some of the information provided by the eight dimensions of the SF-36, but using the two summary dimensions of the SF-12 had an advantage for adolescents, which was consistent with the results of other population studies [23].
Methodological limitations should be mentioned. The participants were stratified regarding geographical areas in order to minimize the risk of possible regional differences. However, the regions chosen were vast and included small towns and big cities as well as rural areas [45,46]. Differences due to these circumstances might exist but not have come to light in this design. Additionally, there was a difference in response consistency between the samples because of the characteristics of adolescence, leading to bias in the results [47].

Conclusion
In general, our study suggested that the SF-12 correlated highly with the SF-36 in adolescent groups in China. If focus is restricted to the two broad component measurements (PCS and MCS), the SF-12 appeared to perform at least as well as the SF-36 in cross-sectional settings in adolescence; hence, using the SF-12 in place of the SF-36 might be appropriate in this situation. At the same time, the question of whether some domains, for instance SF and BP, are not suitable for adolescents needs further study.