Feasibility, reliability and validity of the health-related quality of life instrument Child Health Utility 9D (CHU9D) among school-aged children and adolescents in Sweden

Background This study was conducted in a general population of schoolchildren in Sweden, with the aim of assessing the psychometric properties of a generic preference-based health-related quality of life (HRQoL) instrument, the Swedish Child Health Utility 9D (CHU9D), among schoolchildren aged 7–15 years, and in subgroups aged 7–9, 10–12 and 13–15 years. Methods In total, 486 school-aged children, aged 7–15 years, completed a questionnaire including the CHU9D, the Pediatric quality of life inventory 4.0 (PedsQL), KIDSCREEN-10, and questions on general health, long-term illness, and sociodemographic characteristics. Psychometric testing covered feasibility, internal consistency reliability, test–retest reliability, construct validity, factorial validity, concurrent validity, convergent validity and divergent validity. Results The CHU9D evidenced very few missing values, minimal ceiling effects, and no floor effects. The instrument achieved satisfactory internal consistency (Cronbach’s alpha > 0.7) and strong test–retest reliability (r > 0.6). Confirmatory factor analyses supported the proposed one-factor structure of the CHU9D. For the child algorithm, RMSEA = 0.05, CFI = 0.95, TLI = 0.94, and SRMR = 0.04. For the adult algorithm, RMSEA = 0.04, CFI = 0.96, TLI = 0.95, and SRMR = 0.04. The CHU9D utility value correlated moderately or strongly with KIDSCREEN-10 and PedsQL total scores (r > 0.5–0.7). The CHU9D discriminated as anticipated on health and on three of five sociodemographic characteristics (sex, age, and custody arrangement, but not socioeconomic status and ethnic origin). Conclusions This study provides evidence that the Swedish CHU9D is a feasible, reliable and valid measure of preference-based HRQoL in children. The study furthermore suggests that the CHU9D is appropriate for use among children 7–15 years of age in the general population, as well as among subgroups aged 7–9, 10–12 and 13–15 years.


Introduction
Economic evaluations, including cost-utility analysis (CUA), have a central role in healthcare decision-making [1]. CUA results are typically expressed as the incremental cost of an intervention per quality-adjusted life year (QALY), a common outcome unit that enables comparisons across clinical areas. QALYs are calculated by multiplying the time spent in a health state by the health-related quality of life (HRQoL) utility weight associated with that health state, using the area under the curve (AUC) method [1]. HRQoL can be captured by multi-attribute utility (MAU) instruments addressing key generic health domains.
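As an illustration of the AUC calculation described above, the following minimal sketch computes QALYs from repeated utility measurements. It assumes linear interpolation between measurement occasions (the trapezoidal rule), which is one common choice but is not specified in this paper. The function name and the example values are hypothetical.

```python
def qalys_auc(times_years, utilities):
    """Approximate QALYs as the area under the utility curve.

    Assumption: utility changes linearly between measurement occasions
    (trapezoidal rule). times_years must be in ascending order.
    """
    if len(times_years) != len(utilities) or len(times_years) < 2:
        raise ValueError("need at least two matching time/utility points")
    total = 0.0
    for i in range(1, len(times_years)):
        dt = times_years[i] - times_years[i - 1]
        if dt < 0:
            raise ValueError("times must be in ascending order")
        # Time spent in the interval multiplied by the average utility in it
        total += dt * (utilities[i] + utilities[i - 1]) / 2.0
    return total


# Hypothetical utilities at baseline, 6 months and 12 months
print(round(qalys_auc([0.0, 0.5, 1.0], [0.70, 0.80, 0.90]), 3))
```

With a constant utility the result reduces to duration multiplied by the utility weight, which is the simple product described in the text.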
Existing pediatric MAU instruments are mostly adapted from adult instruments or developed from the perspective and preferences of adults [2][3][4]. Such adult-based HRQoL measures may be cognitively challenging for younger populations [5]. They may also capture health aspects less pertinent to pediatric populations, while failing to tap into others of particular importance to children's HRQoL, and adult preferences for health states may differ from child preferences [6]. To overcome some of these problems, a MAU instrument, the Child Health Utility 9D (CHU9D), was developed specifically with and for children [7][8][9], with scoring algorithms obtained from a parent [10] as well as a child population [11].
The original English version of the CHU9D has demonstrated sound psychometric properties in 7 to 11-year-olds in the UK [8] and in 11 to 17-year-olds in Australia [12][13][14]. Linguistic and cognitive skills, however, vary over the course of childhood, and the ability to understand and respond to a questionnaire may differ between children of different ages. Therefore, acceptable psychometric properties should also be assured in narrower age strata. Furthermore, to enable valid and reliable CUA in pediatric populations more widely, the CHU9D is being translated into other languages [15]. However, HRQoL is context dependent [16]. Therefore, the psychometric properties of translated instruments should be assured in their specific cultural contexts [17].
The CHU9D has been translated into Swedish. The current study is the first to investigate the psychometric properties of the Swedish CHU9D.

Data collection procedure
Headmasters and class teachers were informed about the study. Following their consent, children were informed about the study in class and parents received written information. Informed consent to participate was sought from the child. For children below 15 years of age, parental consent was also requested. Once consent had been obtained, children attending school on the day of the survey filled in a questionnaire at school, assisted by a specially trained research assistant (school nurse). Approximately 59% of the approached students participated in the study. Parents of children in grades 1-3 filled in a short questionnaire at home. In the younger age groups (grades 1-3), the research assistant read the questions and response alternatives aloud, one by one. In each grade, approximately half of the participating schoolchildren filled in the CHU9D again, 7-15 days after the first assessment. Whole classes were randomly selected for the second assessment. Teachers facilitated the process and were present, but were not actively involved in the data collection.

Measures
The children answered a questionnaire including three measures of HRQoL, the CHU9D, KIDSCREEN-10 [18,19] and the Pediatric quality of life inventory 4.0 generic core scale (PedsQL) [20,21], along with measures of health and socio-demographic background. Given the lower cognitive ability of the younger children, some questions were omitted from the child questionnaire in grade 1, and in grades 1-3, some background questions were parent-reported.

CHU9D
The preference-based CHU9D measures HRQoL by nine dimensions: worried, sad, pain, tiredness, annoyed, schoolwork/homework, sleep, daily routines, and ability to join in activities. Recall time is today or last night, and each dimension has five severity levels [7,22]. Responses were converted to utilities (on a scale from 0, implying dead, to 1, implying perfect health) using both available scoring algorithms: the child-generated Australian algorithm [11] and the adult-generated UK algorithm [10], hereafter referred to as the child and adult algorithms.

KIDSCREEN-10
KIDSCREEN-10 addresses 8-18-year-old children and captures overall non-preference-based HRQoL based upon 10 underlying dimensions: feeling fit and well, full of energy, sad, lonely, having enough time for oneself, being able to do what you want in your spare time, being treated fairly by parent(s), having fun with friends, doing well in school, and being able to pay attention (in school) [18]. Recall time is one week and items are answered on a 5-point scale from 1 (not at all or never) to 5 (extremely or always). Item scores were recoded so that higher values equal better HRQoL. Then, HRQoL sum scores were calculated and assigned Rasch person parameters (PP), which were further transformed into values with a mean of 50 and a standard deviation of approximately 10 [19]. KIDSCREEN-10 has demonstrated acceptable reliability, construct and criterion validity [18]. This instrument was only filled in by children in grades 2-9.

PedsQL
The non-preference-based PedsQL has 23 age-specific items that capture 4-18-year-old children's overall HRQoL, as well as the underlying psychosocial (15 items) and physical (8 items) dimensions [21]. The psychosocial dimension has three sub-dimensions: emotional, social and preschool/school functioning and wellbeing (5 items each). Recall time is one month and items are scored on a 5-point scale between 0 (never a problem) and 4 (almost always a problem). Scores were reversed and linearly transformed to a 0-100 scale, with higher scores representing higher HRQoL. Mean values were computed for each dimension. The instrument has demonstrated acceptable psychometric properties in numerous countries, including Sweden [23].
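The reverse scoring and linear transformation described above can be sketched as follows. This is a minimal illustration, assuming the transformation maps 0 to 100 and 4 to 0 in equal steps of 25 and that unanswered items (given as None) are simply left out of the mean. The handling of missing items and the function names are assumptions, not details taken from this paper.

```python
def pedsql_item_to_score(raw):
    """Reverse a raw item response (0-4) and rescale it to 0-100.

    0 (never a problem) becomes 100 and 4 (almost always a problem) becomes 0.
    """
    if raw not in (0, 1, 2, 3, 4):
        raise ValueError("raw item response must be an integer from 0 to 4")
    return (4 - raw) * 25.0


def pedsql_dimension_mean(raw_items):
    """Mean transformed score for one dimension.

    Assumption: items answered as None are skipped rather than imputed.
    Returns None when no item was answered.
    """
    scores = [pedsql_item_to_score(r) for r in raw_items if r is not None]
    if not scores:
        return None
    return sum(scores) / len(scores)


# Hypothetical responses to the five items of one sub-dimension
print(pedsql_dimension_mean([0, 1, 0, 2, 1]))
```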

Health
Self-reported general health was assessed by the question "How do you feel in general?". The question has five response alternatives, which were merged into three categories: not good/fair (hereafter named not good), good, and very good/excellent (hereafter named very good). This question was only assessed in grades 2-9.
Long-term illness/disability was studied by asking about the presence of seven specified pediatric chronic or long-term health problems (eczema, asthma, allergies, depression, epilepsy, ADHD, diabetes) and one alternative named "others". In grades 1-3, this information was obtained from the parents and in grades 4-9 from the child. Children with at least one chronic or long-term health problem were classified as having a long-term illness/disability.

Socio-demographic variables
Sex, grade, children's and parents' country of birth, and custody arrangement were measured by specific questions. In grades 1-3, parents provided information about country of birth.
Socio-economic status was measured by the Family Affluence Scale (FAS) [24]. FAS assesses self-reported own bedroom, dishwasher at home, and the number of family cars, holidays abroad in the past 12 months, bathrooms at home and computers at home. A sum score was generated, ranging from 0 to 13 points, with higher scores indicating higher affluence. The FAS index has shown acceptable cross-cultural reliability and criterion validity in a study of eight European countries, including Denmark and Norway [24].

Major life events
To assure equivalence of CHU9D scores at the two assessments used for the test-retest analyses (see statistical analysis), children were asked "Has anything out of the ordinary happened that makes you feel better or worse today?". The response alternatives were "no" and "yes, please describe". The answers were independently reviewed by two of the authors (KL and MV) to determine whether any child needed to be excluded from the test-retest analysis due to a major life event at either the first or the second time of participation. If there were any uncertainties, a third author (SP) was consulted.

Statistical analysis
Differences in CHU9D utility scores between the child and the adult algorithms were estimated using a one-sample multivariate test of means.
Feasibility was examined by estimating floor and ceiling effects and the proportion of missing values. Cronbach's coefficient α was used to evaluate internal consistency reliability, with values ≥ 0.7 considered acceptable for group comparisons and ≥ 0.9 for individual comparisons [25]. Furthermore, utility scores from the first and second data collection were compared (test-retest reliability) using intraclass, canonical and Spearman's correlations for scale comparisons, along with Spearman's correlations and weighted kappa statistics for dimension comparisons. Correlations of 0.00-0.19 were considered very weak, 0.20-0.39 weak, 0.40-0.59 moderate, 0.60-0.79 strong and 0.80-1.00 very strong [26]. Kappa values below 0.00 were considered to signal poor agreement, 0.00-0.20 slight, 0.21-0.40 fair, 0.41-0.60 moderate, 0.61-0.80 substantial, and 0.81-1.00 almost perfect agreement [27].
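To make the reliability criteria above concrete, the following sketch computes Cronbach's α with the standard formula, α = k/(k − 1) × (1 − Σ item variances / variance of the sum score), and labels correlation and kappa values using the bands quoted in the text. The analyses in this study were run in Stata, so this is an illustrative re-implementation rather than the authors' code. The function names are hypothetical, and treating each band's lower limit as a threshold (and using the absolute value of a correlation) is an assumption made to cover values that fall between the printed bands.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(rows[0])
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least two items and two respondents")
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def correlation_strength(r):
    """Label a correlation using the bands quoted in the text [26]."""
    a = abs(r)  # assumption: strength is judged on the absolute value
    if a < 0.20:
        return "very weak"
    if a < 0.40:
        return "weak"
    if a < 0.60:
        return "moderate"
    if a < 0.80:
        return "strong"
    return "very strong"


def kappa_agreement(kappa):
    """Label a kappa value using the bands quoted in the text [27]."""
    if kappa < 0.0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Hypothetical responses from four children to three items
example = [[1, 2, 1], [2, 2, 3], [4, 3, 4], [5, 4, 5]]
print(round(cronbach_alpha(example), 2))
print(correlation_strength(0.61), kappa_agreement(0.32))
```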
Construct validity was assessed through confirmatory factor analysis testing the hypothesized one-factor structure of the CHU9D (factorial validity). The following model fit indices and cut-offs were used to confirm model adequacy: comparative fit index (CFI) and Tucker-Lewis index (TLI) values ≥ 0.90 (acceptable fit) or ≥ 0.95 (excellent fit) [28], root mean square error of approximation (RMSEA) values ≤ 0.08 (acceptable fit) or ≤ 0.05 (good fit) [29], and standardized root mean square residual (SRMR) values ≤ 0.08 (acceptable fit) [30]. Construct validity was also explored by comparing the total CHU9D utility scores to the total KIDSCREEN-10 and PedsQL scores, using Spearman's correlations (concurrent validity). Spearman's correlations were furthermore used to test whether the conceptually similar dimensions of the CHU9D and the other HRQoL instruments (see Table 1) were correlated (convergent validity), and differences between correlations of conceptually similar and dissimilar dimensions were explored (divergent validity).
Finally, construct validity was assessed by the known-groups method, i.e. by comparing CHU9D utility scores depending on sex, school stage (grades 1-3, 4-6, 7-9), parental country of birth (Sweden: none, one or both parents), custody arrangement (living with both parents or not), family affluence (FAS: < 25th percentile, 25th-75th percentile, > 75th percentile), general health status (not good, good, very good), and having a long-term illness or disability (yes or no). Differences between groups were estimated using the Mann-Whitney U test and the Kruskal-Wallis test. Lower utility scores were anticipated in girls (although no sex differences were expected in early pre-adolescence) and older age groups, and in those with a foreign background, not living with both parents, having lower family affluence, not having good general health, or having a long-term illness or disability [12,31,32,33].
The analyses were performed for grades 1-9 in total, and separately for each of the three school stages studied, using Stata version 16.1 (Stata Corp LP, College Station, Texas, USA). Relationships mentioned in the Results section are statistically significant at the 5% level (p < 0.05).
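The cut-offs listed above can be expressed as a small helper that labels each fit index. This is only a sketch of the decision rule: the function name is hypothetical, and the label "below cut-off" for values that fail the acceptable threshold is an assumption, since the text does not name that category. The example call uses the whole-sample values reported for the child algorithm in the abstract.

```python
def classify_fit(cfi, tli, rmsea, srmr):
    """Label CFA fit indices using the cut-offs quoted in the text."""

    def incremental(value):
        # CFI and TLI: >= 0.95 excellent, >= 0.90 acceptable
        if value >= 0.95:
            return "excellent"
        if value >= 0.90:
            return "acceptable"
        return "below cut-off"

    def rmsea_label(value):
        # RMSEA: <= 0.05 good, <= 0.08 acceptable
        if value <= 0.05:
            return "good"
        if value <= 0.08:
            return "acceptable"
        return "below cut-off"

    return {
        "CFI": incremental(cfi),
        "TLI": incremental(tli),
        "RMSEA": rmsea_label(rmsea),
        # SRMR: <= 0.08 acceptable (no stricter band is given in the text)
        "SRMR": "acceptable" if srmr <= 0.08 else "below cut-off",
    }


# Whole-sample values for the child algorithm, as reported in the abstract
print(classify_fit(cfi=0.95, tli=0.94, rmsea=0.05, srmr=0.04))
```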

Power
The suggested minimum sample size for feasibility and reliability tests is n = 50, and for factor analysis 4-10 persons per item or at least n = 100 [34].
There is also a rule of thumb stating that validations of questionnaires should include at least 5-10 persons per item, here equivalent to a sample of n = 45-90 [35]. Thus, for the current analyses, a minimum sample of 50 children in each grade allows for analyses at least at the school-stage level (n of at least 150 per stage).
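The arithmetic behind these rules of thumb can be checked with a short sketch. It assumes the nine CHU9D dimensions are counted as nine items and that each school stage comprises three grades, as described in the text. The function and variable names are hypothetical.

```python
def persons_per_item_range(n_items, low=5, high=10):
    """Sample size range implied by a persons-per-item rule of thumb."""
    return n_items * low, n_items * high


CHU9D_ITEMS = 9          # nine dimensions, one item each
GRADES_PER_STAGE = 3     # e.g. grades 1-3 form one school stage
MIN_PER_GRADE = 50       # minimum number of children per grade

print(persons_per_item_range(CHU9D_ITEMS))   # 5-10 persons per item
print(GRADES_PER_STAGE * MIN_PER_GRADE)      # minimum n per school stage
```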

Descriptive characteristics
Data were collected from 486 children. Of these, 473 (97%) answered all CHU9D questions and were thus included in the analysis. These participants were evenly distributed between the three studied school stages, but included slightly fewer girls than boys (Table 2). The great majority (92%) were born in Sweden, had parents who were both born in Sweden (77%), and lived with both parents (81%). Affluence ranged between 2 and 13 on the Family Affluence Scale, with only two children scoring below 6 points (not seen in the table). Two thirds of the children reported having very good general health and about half reported having a long-term illness/disability, mainly allergy, asthma, and eczema (not seen in the table). Mean CHU9D utility scores were 0.74 (SD ± 0.21) when using the child algorithm and 0.85 (SD ± 0.11) when using the adult algorithm (p < 0.001). Measured by KIDSCREEN-10 and PedsQL, the corresponding mean scores were 41.17 (SD ± 6.10) and 82.12 (SD ± 12.94), respectively.

Test-retest reliability and internal consistency
In total, 255 children filled in the CHU9D twice. Of these, 13 children were excluded from the test-retest analyses: 11 because of a major event happening between rating occasions and 2 because of missing values in several dimensions at the second assessment. Thus, 242 children were included in the test-retest analyses, 73 from grades 1-3, 81 from grades 4-6 and 88 from grades 7-9.
At scale level, canonical, Spearman's and intraclass correlations all showed strong (> 0.7) or very strong (> 0.8) correlations between the two occasions (Table 4). Similar results were seen for each of the three school stages studied. In addition, all analyses of internal consistency revealed Cronbach's alpha values above 0.7. Thus, overall, and at each of the studied school stages, the CHU9D scale met the reliability criteria for group-level comparisons.
When comparing the individual dimension scores from the test-retest occasions, one by one, correlations were moderate or strong (0.41-0.71), while kappa agreement was moderate or close to moderate for 6 dimensions (0.39-0.54) and fair for 2 dimensions ("pain" 0.32 and "annoyed" 0.35) (Table 5). These results were to some extent similar for the separate school stages, but at each of the three school stages, there were cases of weak or non-significant correlations and fair or non-significant kappa agreements. The specific
Higher score means higher health-related quality of life
Table 3 Floor and ceiling effects for the CHU9D among school children in grades 1-9 and stratified by school stages

Construct validity
Factorial validity
Confirmatory factor analyses showed that all dimensions loaded significantly on the latent factor and that the loadings were all above 0.4, except for ability to join in activities: 0.32 (child algorithm). Furthermore, as seen in Table 6, the CFI, TLI, SRMR and RMSEA values met the criteria for acceptable or excellent fit for the studied single-factor model, i.e. supporting that the nine dimensions of the CHU9D measure a single latent construct. Acceptable model fit was demonstrated when using both the child and the adult algorithms, and it was seen in the whole sample (grades 1-9: CFI and TLI ≥ 0.94; RMSEA and SRMR ≤ 0.05), but also in two of the three school stages separately (grades 4-6 and 7-9: CFI and TLI ≥ 0.92; RMSEA and SRMR ≤ 0.06). In grades 1-3, there was weaker support for the model. Modification indices suggested a high correlation between the dimensions "worried" and "sad", which is theoretically plausible. Model fit improved for grades 1-3 after allowing the correlation between these two dimensions in the adjusted model (CFI and TLI 0.83-0.89; RMSEA and SRMR ≤ 0.07).

Concurrent validity
Strong correlations were found between the total score of the CHU9D (child/adult algorithm) and the total scores of KIDSCREEN-10 (0.61/0.62) and PedsQL (0.62/0.61). Stratified analyses showed that, in the two older school stages, the CHU9D total scores correlated strongly with the KIDSCREEN-10 and PedsQL scores (r > 0.6 and 0.7, respectively). In grades 1-3, these correlations were moderate, with r just above 0.5 (both KIDSCREEN-10 and PedsQL).

Convergent and divergent validity
The nine individual CHU9D dimensions generally demonstrated moderate (r > 0.4), or close to moderate, relationships to the conceptually similar KIDSCREEN-10 and PedsQL dimensions (Table 7). Also, there was a pattern of stronger correlations between similar dimensions than between dissimilar dimensions. One exception was that, unexpectedly, the CHU9D dimension capturing ability to do daily routines had a slightly closer relationship to the PedsQL dimensions emotional and school functioning (r = 0.38 and 0.36, respectively) than to the physical functioning dimension (r = 0.31). Another exception was that the dimension on ability to join in activities showed the strongest relationship with the KIDSCREEN-10 dimension "fit and well" (r = 0.33), and not with the expected dimension regarding ability to do desirable spare time activities ("able to do things", r = 0.26). In each of the three separate school stages, the results overall follow the same pattern as described above, although correlations seem to be slightly weaker among the youngest children.
Table 7 Correlations between CHU9D dimension scores and dimension scores in KIDSCREEN-10 and PedsQL, among children (n = 473)

Known-groups validity
Table 8 shows that CHU9D utility scores mostly differed as expected when comparing children with varying characteristics. As compared to their counterparts, higher scores were found among boys, younger children, those living with both parents, those reporting better general health and those without a long-term illness or disability (both child and adult algorithms). However, there were no utility score differences depending on parental country of birth or family affluence. To further investigate the lack of statistical differences in utility scores by family affluence, we conducted several other analyses, in which we stratified FAS scores into 2, 4 and 5 categories with different fixed and relative cut-offs, tested with and without imputation for the 37 cases with missing FAS scores. All analyses confirmed the initial results (data not shown).
The sample size only allowed grade-stratified analyses by sex and long-term illness/disability. In grades 4-6 and 7-9, these analyses showed a similar pattern to that reported above, but for long-term illnesses, only when applying the child algorithm. In grades 1-3, no such differences were seen.

Discussion
This study investigated the feasibility, reliability and construct validity of the Swedish CHU9D among school children attending grades 1-9 in Sweden (ages 7-15), and separately for children in grades 1-3, 4-6 and 7-9 (ages 7-9, 10-12 and 13-15 years). Very few missing values, minimal ceiling effects, and no floor effects support the general feasibility of the Swedish CHU9D. However, ceiling effects were relatively high for several of the underlying CHU9D dimensions. This is not surprising given that many children are expected to be in good health in general populations. Similar results have been shown when using the English, Danish and Chinese CHU9D in general populations of school-aged children [13,14,31,33,36]. Notably, though, across these studies, as well as in the current study, floor effects in the CHU9D dimensions are mainly below 5% and ceiling effects below 85%. This indicates that the CHU9D is capable of detecting improvement in general population studies of school-aged children, in Sweden and elsewhere.
The reliability of the Swedish CHU9D was supported by strong test-retest correlations and agreements, along with established internal consistency. Again, this is in line with the results of studies using the English, Danish and Chinese CHU9D in their cultural contexts [33,36,37]. Notably, though, for most of the individual CHU9D dimension scores, we only found moderate or close to moderate test-retest correlations and agreements. Similar findings were reported in a UK study of 6-7-year-olds using a shorter test-retest timeframe (morning-to-afternoon) [31] and a Chinese two-week test-retest study of 8-17-year-olds [36]. Thus, across language versions and cultural contexts, the CHU9D dimensions show signs of some inconsistency over time. This may be due to the shifting nature of the concepts studied, in combination with the short reference time (today). Furber et al. [37] report that one third of the children do not consider the day of the study a typical day in terms of the assessed CHU9D concepts, indicating that these concepts fluctuate from day to day. Further research should investigate how this potential shortcoming influences the instrument's sensitivity to change over time (responsiveness).
The current study confirms the proposed one-factor structure of the CHU9D. In the absence of a gold standard for HRQoL measurement, it is not possible to prove conclusively that this factor measures HRQoL. However, we found a strong correlation between the total scale score of the CHU9D and two HRQoL instruments with demonstrated reliability and validity (KIDSCREEN-10 and PedsQL). Furthermore, the strongest correlations were seen between the CHU9D dimensions and the conceptually overlapping KIDSCREEN-10 and PedsQL dimensions, while correlations were weaker for non-overlapping concepts. Other studies have shown similar results when comparing the original English CHU9D to KIDSCREEN-10 and PedsQL [14,38] or the Danish and Chinese CHU9D to the PedsQL [33,36]. Taken together, these results indicate that the CHU9D may be used as a measure of HRQoL.
Although we and other researchers [14,33,38] find correlations to be strongest between the CHU9D and conceptually similar KIDSCREEN-10 and PedsQL dimensions, these correlations are merely moderate. This is not surprising, given that the questions and response alternatives are phrased somewhat differently in the three instruments. Also, the recall period is "today" in the CHU9D, while KIDSCREEN-10 and PedsQL have recall times of one week and one month, respectively.
Consistent with studies from other countries [12,31,33,36], we confirmed that the Swedish CHU9D is able to detect anticipated HRQoL differences depending on health outcomes and sociodemographic characteristics such as sex, age and custody arrangement. We did not, however, replicate previously shown differences by socioeconomic status (SES) [13,14] and ethnic origin [32]. This may be attributed to the fact that we required children to be fluent Swedish speakers, leading to the exclusion of families who are less rooted in Swedish society and thereby potentially to less diversity by ethnicity and affluence. Thus, our results overall support the known-groups validity of the Swedish CHU9D, but additional studies are required to confirm the ability to discriminate between children with different SES and ethnic origin.
The CHU9D was initially developed for ages 7-11 years [8]. Our study supports that the instrument is acceptable for use among children up to 15 years of age. Furthermore, in each of the three school stages studied, we found acceptable floor and ceiling effects, strongly correlated test-retest CHU9D utility scores, an internal consistency allowing for group comparisons, and moderate to strong correlations between the CHU9D scale and two established HRQoL scales. These findings suggest that the CHU9D is feasible, reliable and valid for use, not only in wide age ranges, but also in narrower strata comprising only children aged 7-9 years, 10-12 years, or 13-15 years.
Our results confirm earlier findings showing that CHU9D utility scores are higher when applying the adult algorithm as compared with the child algorithm [9,37,39]. This may be attributed to the disparities in valuation methods used to assess utility weights in child (best-worst) and adult populations (standard gamble) [39]. It may also be explained by adults and children giving different values to CHU9D-generated health states, i.e. that adults, as compared to children, generally place less weight on mental health impairment (sadness, worries, being annoyed) and impairment in daily functioning (schoolwork, daily routines, activities) but comparably higher weight on health states dominated by physical impairment (pain, tiredness, sleep problems) [6]. Thus, the CHU9D adult algorithm may not accurately reflect children's preferences.
Notably, we found that the algorithm used influenced the size of between-group differences. Comparing, for instance, those with "not good" and "very good" general health, we found the utility score difference to be twice as high when applying the child algorithm as compared with the adult algorithm (mean score difference: 0.33 vs. 0.17). Such differences suggest that the choice of algorithm may influence the interpretation of future economic evaluations of pediatric health interventions, highlighting the importance of this choice.
We acknowledge some limitations. Although including only fluently Swedish-speaking children diminishes biases due to language barriers, which is a strength, it may also have biased the discriminative analysis regarding SES and ethnic origin. Also, given that the study was based on a convenience sample, it cannot provide normative information about HRQoL levels. Another limitation is that this general population study, with self- and parent-reported health, could not evaluate the applicability of the CHU9D in clinical populations. Likewise, the study design did not allow evaluation of responsiveness to change in health status over time. In addition, the long-term illness/disability measure was based on self-report by children or parents, and was not confirmed via medical records.

Conclusions
This study provides support that the Swedish CHU9D is a feasible, reliable and valid measure of HRQoL that holds psychometric qualities comparable to those of the original English CHU9D. The study furthermore suggests that the CHU9D is appropriate for use among 7-15-year-old children in the general population, as well as among subgroups aged 7-9, 10-12 and 13-15 years. To provide further support for the CHU9D as a useful health outcome measure in health economic evaluations, future studies should investigate the performance of the CHU9D in clinical samples. Longitudinal studies are also needed to test the instrument's sensitivity to change.