Assessing the effect of child’s gender on their father–mother perception of the PedsQL™ 4.0 questionnaire: an iterative hybrid ordinal logistic regression/item response theory approach with Monte Carlo simulation

Background This study aimed to investigate the possible confounding effect of children’s gender on parent dyads’ perception of their child’s HRQoL at both the item and scale levels of the PedsQL™ 4.0 questionnaire. Methods The PedsQL™ 4.0 Generic Core Scales were completed by 573 children and their father-and-mother dyads. An iterative hybrid ordinal logistic regression/item response theory model with Monte Carlo simulation was used to detect differential item functioning (DIF) across mothers/fathers and daughters/sons. Results Assessing DIF across mother–daughter, father–daughter, mother–son, and father–son dyads revealed that although parents and their children perceived the meaning of some items of the PedsQL™ 4.0 instrument differently, the pattern of fathers’ and mothers’ reports did not vary much across daughters and sons. Conclusion In the Persian version of the PedsQL™ 4.0, the child’s gender is not a confounding factor in mothers’ and fathers’ reports of their daughters’ and sons’ HRQoL. Hence, paternal proxy-reports can be included in studies along with maternal proxy-reports, and the reports can be combined without regard to the child’s gender when examining parent–child agreement.


Background
The inclusion of multiple informants in the field of health-related quality of life (HRQoL) of children has become the norm in clinical research practice [1]. The child's self-report and the fathers' and mothers' proxy-reports are the most important sources of information when assessing children's HRQoL [2,3]. Agreement between self- and proxy-ratings continues to be a controversial issue in pediatric HRQoL studies [1,4]. It has been shown that child-parent agreement can be affected by child characteristics, such as age, sex and health condition [5,6]. Previous studies have also indicated that parents often underestimate HRQoL for sick children, but tend to rate healthy children higher than the children rate themselves [1,4]. However, the potential influence of the child's gender has rarely been assessed in the literature, especially with respect to which parent is selected as the proxy respondent. In one of our recent studies, the potential interchangeability of the parent dyads in reporting children's HRQoL was assessed at both the item and scale levels of the PedsQL™ 4.0 instrument [7]. The study showed that parent-child agreement was not affected by the parents' gender, but discrepancies between parents and children related to the child's gender were not taken into account, although they could have affected the reports. A literature review in the field of child HRQoL indicated that daughters and sons had different relationships with each of their parents; it also showed that fathers and mothers had different perspectives on their child's HRQoL [4]. Therefore, it is not easy to distinguish how far the item ratings of fathers and mothers are linked to their child's gender [8]. Based on the results of several studies, it could be hypothesized that mother/daughter and father/son dyads are interesting subgroups in which to analyze the influence of gender on the interchangeability of parent proxy-reports about their children's HRQoL [8][9][10].
Although the agreement between children's and their parents' perception of the children's HRQoL has been investigated at the item and scale levels [11][12][13], it has never been evaluated at the item level of the PedsQL™ 4.0, or of any other instrument, by simultaneously considering the children's and the parents' gender. According to a systematic review, the PedsQL™ 4.0 questionnaire is the most widely used instrument for measuring HRQoL amongst children and adolescents [14]. Therefore, the present study aimed to assess the effect of the child's gender on the father-mother perception of their child's HRQoL at both the item and scale levels of the generic PedsQL™ 4.0. In other words, we attempted to evaluate the measurement invariance of this instrument among daughter-mother, son-mother, daughter-father and son-father dyads (at the item level) and the discrepancy between their scores (at the scale level), to clarify how a child's gender can affect the agreement between fathers and mothers.
It should be mentioned that the evaluation of agreement amongst informants regarding their perception of a child's HRQoL is currently in transition from classic approaches (e.g. calculating the intra-class correlation or comparing the means) to more modern methods, such as differential item functioning (DIF) analysis. DIF analysis examines whether or not people in different groups respond consistently to a particular item within a scale after controlling for the underlying construct measured by the scale. There are two types of DIF: uniform and non-uniform. Uniform DIF is evident when the difference in item response probabilities is constant across the complete construct domain. Non-uniform DIF occurs when the direction of DIF differs in various parts of the scale [15]. Hence, the results of this study can provide further evidence on the comparability of HRQoL scores across different informants in the child self-reports and parent proxy-reports of the PedsQL™ 4.0, using the iterative hybrid ordinal logistic regression/item response theory (OLR/IRT) approach.

Participants and instrument
The participants comprised Iranian secondary school children from four educational districts with diverse socioeconomic backgrounds in Shiraz, a major metropolitan city in southern Iran, along with their mothers and fathers. A two-stage cluster random sampling method was used for the selection process. Out of 60 secondary schools in each district, four were chosen at random (first stage). In the second stage, two classes were chosen from each school by simple random sampling, using a random number table, and all the children in the selected classes were included in the sample.
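The two-stage selection described above can be illustrated with a minimal Python sketch. The sampling frame here is hypothetical: the district, school and class labels are invented, and the number of classes per school (six) is an assumption made only so the example runs; the text specifies only the four districts, 60 schools per district, four schools per district and two classes per school.

```python
import random


def two_stage_cluster_sample(frame, schools_per_district=4,
                             classes_per_school=2, seed=0):
    """Return (district, school, class) tuples chosen in two random stages."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    selected = []
    for district, schools in frame.items():
        # Stage 1: draw schools at random within each district.
        for school in rng.sample(sorted(schools), schools_per_district):
            # Stage 2: draw whole classes at random within each chosen school;
            # every child in a chosen class would enter the sample.
            for class_id in rng.sample(schools[school], classes_per_school):
                selected.append((district, school, class_id))
    return selected


# Hypothetical frame: 4 districts x 60 schools, 6 classes per school (assumed).
frame = {
    f"district_{d}": {
        f"school_{d}_{s}": [f"class_{c}" for c in range(1, 7)]
        for s in range(1, 61)
    }
    for d in range(1, 5)
}

sample = two_stage_cluster_sample(frame)
n_schools = len({(district, school) for district, school, _ in sample})
print(len(sample), "classes in", n_schools, "schools")
```

Under these assumptions the design yields 4 × 4 × 2 selected classes, which matches the 32 classes within 16 schools reported for the distributed questionnaires.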
The questionnaire consisted of the child self-report and parent proxy-report forms of the Persian version of the PedsQL™ 4.0 Generic Core Scales, which had previously been translated and validated in Iran [16]; it was completed by the children and by their mothers and fathers. A trained researcher explained the objective of the survey and distributed a set of documents to the children, containing the child's self-report, two parent proxy-reports, and the parents' informed consent form. The children were asked to take the documents to their parents.
Parents and their children filled out the questionnaires at home and returned them to the research team. Out of the 950 triplet questionnaires distributed in 32 classes within 16 secondary schools, 573 were filled out completely, giving an overall return rate of 60%. (No more than 5% missing item responses were considered acceptable; on this basis, two students were excluded from the analysis.) The final sample included 281 (49%) male and 292 (51%) female students with their parents. The study was approved by the local ethics committee of Shiraz University of Medical Sciences. The mean ± standard deviation of the fathers', mothers', boys' and girls' ages were 45.6 ± 6.1, 39.9 ± 6.4, 14.48 ± 1.31 and 14.42 ± 1.58 years, respectively. The characteristics of the participants are presented in Table 1.
The PedsQL™ 4.0 is a 23-item generic instrument consisting of four scales: physical, emotional, social and school functioning (one eight-item scale and three five-item scales). The participants responded to the items on a five-point Likert scale (0 = never a problem, 1 = almost never a problem, 2 = sometimes a problem, 3 = often a problem, and 4 = almost always a problem). Under the PedsQL™ 4.0 scoring protocol used here, items are scored so that higher scores indicate lower HRQoL.

Statistical analysis
Differential item functioning analysis with iterative hybrid OLR/IRT approach
In this study, the iterative hybrid OLR/IRT approach, as implemented in the R package "lordif" [17], was used to examine DIF across daughters/sons and mothers/fathers in the PedsQL™ 4.0 questionnaires. Along with statistical tests that identify items exhibiting uniform and non-uniform DIF, the OLR/IRT approach provides several magnitude and impact measures to quantify DIF. Its special feature is that it matches respondents on the IRT trait estimate rather than on the observed scale score used in traditional OLR. OLR/IRT uses an iterative procedure that detects DIF items while purifying the trait score estimate during the analysis. First, the algorithm fits a graded response model (GRM) [18] to obtain trait estimates. Next, a series of nested OLR models is fitted to detect DIF items based on the OLR model criterion, conditioning on the trait score estimated at the previous stage. Then, the GRM is refitted to obtain a revised trait estimate that accounts for the items identified with DIF in the previous step. In the following stage, DIF items are flagged again, and the results are compared with the previous ones. If the same items are flagged, the analysis stops; if different items are identified, the analysis is iterated until the sets of DIF and non-DIF items match those detected in the previous run (for more details, refer to Choi et al. [17]). The three nested OLR models used to identify DIF items can be written, respectively, as:
Model 1: $\operatorname{logit} P(Y_i \ge k) = \alpha_k + \beta_1 \cdot \text{trait}$
Model 2: $\operatorname{logit} P(Y_i \ge k) = \alpha_k + \beta_1 \cdot \text{trait} + \beta_2 \cdot \text{group}$
Model 3: $\operatorname{logit} P(Y_i \ge k) = \alpha_k + \beta_1 \cdot \text{trait} + \beta_2 \cdot \text{group} + \beta_3 \cdot \text{trait} \times \text{group}$
where $P(Y_i \ge k)$ is the probability of a response in category $k$ or higher on item $i$, $\alpha_k$ is the intercept term for the $k$th category of item $i$, $\beta_1$ represents the effect of the trait (e.g. emotional functioning), $\beta_2$ represents the effect of the group (fathers/mothers and daughters/sons), and $\beta_3$ represents the interaction effect between trait and group.
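To make the roles of the group effect and the trait-by-group interaction concrete, the following Python sketch evaluates the cumulative-logit form of the nested models for a single response category. It is an illustration only, not the lordif implementation: the coefficient values are invented, and the group is coded 0/1 as an assumption.

```python
import math


def cumulative_prob(alpha_k, trait, group, beta1, beta2=0.0, beta3=0.0):
    """P(Y >= k) under a cumulative-logit model.

    beta2 = beta3 = 0 gives Model 1 (trait only); beta3 = 0 gives Model 2
    (trait + group); all three terms give Model 3 (adds trait x group).
    """
    eta = alpha_k + beta1 * trait + beta2 * group + beta3 * trait * group
    return 1.0 / (1.0 + math.exp(-eta))


def log_odds_gap(alpha_k, trait, beta1, beta2, beta3):
    """Log-odds of P(Y >= k) in group 1 minus group 0 at a given trait level."""
    def logit(p):
        return math.log(p / (1.0 - p))

    p1 = cumulative_prob(alpha_k, trait, 1, beta1, beta2, beta3)
    p0 = cumulative_prob(alpha_k, trait, 0, beta1, beta2, beta3)
    return logit(p1) - logit(p0)


trait_levels = [-1.0, 0.0, 1.0]

# Uniform DIF (illustrative values): the gap is the same at every trait level.
uniform = [log_odds_gap(-0.5, t, 1.2, 0.5, 0.0) for t in trait_levels]

# Non-uniform DIF (illustrative values): the gap changes sign along the trait.
non_uniform = [log_odds_gap(-0.5, t, 1.2, 0.0, 0.8) for t in trait_levels]

print("uniform gaps:", [round(g, 3) for g in uniform])
print("non-uniform gaps:", [round(g, 3) for g in non_uniform])
```

With a non-zero group effect and no interaction, the two groups differ by a constant amount on the log-odds scale, which is the uniform case; a non-zero interaction makes the difference depend on the trait level, which is the non-uniform case.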
Uniform DIF can be detected by comparing the log-likelihood values of Models 1 and 2 (i.e. testing $\beta_2 \neq 0$), and non-uniform DIF can be tested by comparing the log-likelihood values of Models 2 and 3 (i.e. testing $\beta_3 \neq 0$). In each comparison, the likelihood-ratio statistic (twice the difference in log-likelihoods) is compared to the Chi-square distribution with one degree of freedom.
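As a worked illustration of this comparison, the sketch below computes a likelihood-ratio test with one degree of freedom using only the Python standard library. It assumes the usual likelihood-ratio form (twice the log-likelihood difference between the fuller and the reduced model); the function name and the log-likelihood values are hypothetical.

```python
import math


def lr_test_1df(loglik_reduced, loglik_full):
    """Likelihood-ratio statistic and p-value for two nested models (1 df)."""
    # Twice the gain in log-likelihood from adding one parameter.
    stat = max(2.0 * (loglik_full - loglik_reduced), 0.0)
    # Upper tail of the Chi-square distribution with 1 df:
    # P(Z^2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical log-likelihoods for one item: Model 1 vs Model 2 (uniform DIF).
stat, p_value = lr_test_1df(loglik_reduced=-812.40, loglik_full=-807.10)
print(f"LR statistic = {stat:.2f}, p = {p_value:.4f}")
```

The same function would apply to the Model 2 versus Model 3 comparison for non-uniform DIF, since that step also adds a single parameter.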
Since the statistical power for testing uniform and non-uniform DIF is highly dependent on the sample size, a slight difference in the log-likelihood of the nested models can be statistically significant if the sample is sufficiently large. In response to this concern, we used the McFadden pseudo-$R^2$ estimate [19,20] to quantify the magnitude of DIF and determine the clinical importance of DIF items. In most traditional analyses, DIF is classified according to the Zumbo guidelines ($R^2$ < 0.13 as negligible, $R^2$ between 0.13 and 0.26 as moderate, and $R^2$ > 0.26 as large) [21], but in this approach a Monte Carlo simulation-based procedure derives the thresholds, or empirical criteria, for deciding whether items have DIF, based on Type-I error rates found empirically in the simulated data. The empirical threshold values for the Chi-square statistics and the magnitude measures were obtained item by item from 1000 Monte Carlo simulations with α = 0.01 (α was set to 0.01 because DIF procedures based on logistic regression are known to yield inflated Type-I error rates, especially when the groups differ substantially in the trait being measured [22,23]). This is a distinctive feature of the lordif package that is not available in other DIF detection approaches (interested readers can refer to Choi et al. [17]).
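The two ideas in this paragraph, a pseudo-R² change as a magnitude measure and an empirical cut-off taken from simulated no-DIF data, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the lordif procedure: the log-likelihoods are invented, the simulated null statistics are random placeholders rather than output of a fitted model, and the quantile rule shown is one simple choice.

```python
import math
import random


def mcfadden_r2(loglik_model, loglik_null):
    """McFadden pseudo-R^2: 1 - LL(model) / LL(intercept-only model)."""
    return 1.0 - loglik_model / loglik_null


def delta_r2(loglik_reduced, loglik_full, loglik_null):
    """Change in McFadden pseudo-R^2 between two nested models."""
    return (mcfadden_r2(loglik_full, loglik_null)
            - mcfadden_r2(loglik_reduced, loglik_null))


def zumbo_label(r2_change):
    """Fixed Zumbo-style classification quoted in the text."""
    if r2_change < 0.13:
        return "negligible"
    if r2_change <= 0.26:
        return "moderate"
    return "large"


def empirical_threshold(null_values, alpha=0.01):
    """(1 - alpha) empirical quantile of values simulated under no DIF."""
    ordered = sorted(null_values)
    index = min(len(ordered) - 1,
                math.ceil((1.0 - alpha) * len(ordered)) - 1)
    return ordered[index]


# Hypothetical log-likelihoods for one item.
change = delta_r2(loglik_reduced=-800.0, loglik_full=-790.0, loglik_null=-1000.0)

# Placeholder for 1000 simulated no-DIF pseudo-R^2 changes (assumed values).
rng = random.Random(1)
null_changes = [rng.uniform(0.0, 0.02) for _ in range(1000)]
cutoff = empirical_threshold(null_changes, alpha=0.01)

print("delta R2:", round(change, 4), "| fixed rule:", zumbo_label(change))
print("exceeds empirical cutoff:", change > cutoff)
```

The contrast shown is that a fixed rule can label a change negligible while an item-specific empirical cut-off, being tied to what no-DIF data actually produce, may or may not flag the same item.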

Analysis of cross-informants agreement
After using the DIF detection technique to evaluate the measurement invariance of the instrument, the paired-sample t-test and the intra-class correlation coefficient (ICC, as a measure of agreement) [24] were used to compare the parents' and children's scores and to assess agreement within each dyad in reporting children's HRQoL, respectively. The mean difference was also determined and standardized by dividing it by the pooled standard deviation of both scores (effect size). To ascertain the magnitude of these differences, Cohen's effect size was categorized as small (ES = |0.2|), medium (ES = |0.5|) and large (ES = |0.8|) [25]. The ICC values for agreement were considered poor (< 0.40), moderate (0.41-0.60), good (0.61-0.80) and excellent (> 0.81) [24]. To assess whether the observed subscale scores across daughter/son and mother/father reports were significantly affected by DIF items, we removed the items with uniform DIF from all subscales. It is accepted that when the effect of an item with uniform DIF is not cancelled out by another uniform DIF item in the opposite direction, its effect can be transferred to the scale level. This part of the analysis was carried out using SPSS 18.0 [26].
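The standardized mean difference and the ICC categories described above can be sketched as follows. This is a minimal illustration with invented scores; it assumes equal group sizes (as in paired dyads) for the pooled standard deviation, treats values below 0.2 as negligible, and treats the category boundaries as contiguous, which the quoted ranges leave unspecified.

```python
import math
from statistics import mean, stdev


def cohens_effect_size(scores_a, scores_b):
    """Mean difference divided by the pooled SD (equal group sizes assumed)."""
    pooled_sd = math.sqrt((stdev(scores_a) ** 2 + stdev(scores_b) ** 2) / 2.0)
    return (mean(scores_a) - mean(scores_b)) / pooled_sd


def effect_size_label(es):
    """Cohen's benchmarks applied to the absolute effect size."""
    size = abs(es)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


def icc_label(icc):
    """Agreement categories quoted in the text, with contiguous boundaries."""
    if icc < 0.40:
        return "poor"
    if icc <= 0.60:
        return "moderate"
    if icc <= 0.80:
        return "good"
    return "excellent"


# Hypothetical subscale scores for five parent-child pairs.
parent_scores = [10, 12, 14, 16, 18]
child_scores = [11, 13, 15, 17, 19]

es = cohens_effect_size(parent_scores, child_scores)
print("effect size:", round(es, 3), effect_size_label(es))
print("ICC 0.57:", icc_label(0.57), "| ICC 0.31:", icc_label(0.31))
```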

Results
The results of cross-informant consistency at both the item and scale levels of the PedsQL™ 4.0 are presented in the following part. First, mothers' and fathers' perceptions of their daughters' and sons' HRQoL are presented and analyzed at the item level of the PedsQL™ 4.0 questionnaire, focusing on the effect of the adolescents' gender on the fathers' and mothers' reports. Second, agreement between the informants is analyzed at the scale level of the PedsQL™ 4.0, controlling for the children's gender.

DIF analysis
Tables 2, 3, 4 and 5 present the results of the hybrid OLR/IRT model for detecting DIF across mothers and daughters, fathers and daughters, mothers and sons, and fathers and sons, respectively. To evaluate the possible confounding effect of the child's gender, the following sections compare the results of the DIF analysis across father-child reports with those across mother-child reports, taking the child's gender into account.

DIF analysis between mothers and daughters compared to fathers and daughters
Comparison of the P values with the threshold values for the nominal α level associated with the Chi-square test of the DIF analysis across mothers and daughters (Table 2) indicated that 11 out of 23 items (48%) were flagged with DIF: one item in the physical, two items in the emotional, three items in the social and all the items in the school subscales. Amongst these items, six (55%) exhibited uniform and five (45%) non-uniform DIF (items with uniform DIF in the presence of non-uniform DIF should be considered non-uniform DIF items [20], e.g. item four in the social subscale). For all six items with statistically significant uniform DIF, the differences in McFadden pseudo-$R^2$ ($\Delta R^2$) from Model 1 to Model 2 ranged from 0.0097 to 0.0472, which were greater than their own empirical criteria (i.e. all of them are practically important, except item 5 in the emotional subscale). Moreover, for the same six items with uniform DIF, the absolute proportionate $\beta_1$ change effect size ($\Delta\beta_1$) ranged from 0.0135 to 0.1236, which were greater than their own empirical threshold values, except for item 5 in the physical subscale. Furthermore, for the four items with statistically significant non-uniform DIF, $\Delta R^2$ from Model 2 to Model 3 varied from 0.0069 to 0.0742, all of which were greater than the threshold values identified in the Monte Carlo simulations. The results of the DIF analysis across fathers and daughters are shown in Table 3. As indicated by the results, 10 out of 23 items (43%) exhibited DIF; six of them (60%) were flagged with uniform and four (40%) with non-uniform DIF, of which one item was in physical, two items in emotional, four items in social and three items in school functioning. According to $\Delta R^2$ and $\Delta\beta_1$, all of them are practically important.
Table 2 The results of the hybrid OLR/IRT DIF analysis across mother and daughter on the PedsQL
Table 3 The results of the hybrid OLR/IRT DIF analysis across fathers and daughters on the PedsQL
Therefore, comparing the results of the DIF analysis across fathers and daughters with those across mothers and daughters indicated that the distribution of DIF items over the subscales was very similar. This result is represented graphically in the first row of Fig. 1, which shows that the expected score function for item 5 in the physical subscale (as an example of a DIF item) exhibited DIF in the same direction between mothers and daughters and between fathers and daughters. A similar result was obtained for the other DIF items when comparing mother-reports with father-reports in rating their daughters.
Since readers may wish to compare the DIF pattern of the mother/father-daughter dyads with that of the mother/father-son dyads for item 5 in the physical subscale, the graphical representation of the latter is also included in Fig. 1 (second row).

DIF analysis between mothers and sons compared to fathers and sons
Tables 4 and 5 present the results of the DIF analysis across mothers and sons and across fathers and sons, respectively. Although 9 out of 23 items (39%) were flagged with DIF in both analyses, the composition of the DIF items and the numbers of uniform and non-uniform DIF items across the subscales differed slightly. Amongst the mothers and sons, seven items (78%) exhibited uniform and two items (22%) non-uniform DIF (Table 4), while exactly the reverse pattern was seen in the DIF analysis across fathers and sons (Table 5). More specifically, in the former, two items in each of the physical, emotional and school subscales and three items in social functioning exhibited DIF, whereas in the latter two items in each of the physical and school subscales, one item in emotional and four items in social functioning showed DIF. The magnitude measures, $\Delta R^2$ and $\Delta\beta_1$, indicated that all of these items were practically important. It should be mentioned that in DIF analysis the effect of items with uniform DIF can be cancelled out at the domain level by other uniform DIF items in the opposite direction. For example, as presented in Fig. 2, of the two items showing uniform DIF in the social subscale, item 1 showed DIF in one direction, whereas item 3 exhibited DIF in the opposite direction; hence, they cancelled each other out (this held for both parents rating their sons' HRQoL). The same result was obtained for items 3 and 5 in the emotional subscale. Accordingly, comparing mother- and father-reports in rating their sons indicated that although the pattern of DIF items was slightly different, in general most uniform DIF cancelled out at the subscale level.

Measure of cross-informants agreement
Table 6 shows the agreement of mothers and fathers individually with their daughters and sons, with and without the items with DIF. Within all dyads and based on the ICC, small-to-moderate agreement was found in all the subscales. The highest agreement was found for physical health and the lowest for social functioning, both between mothers and daughters (ICC = 0.57 and ICC = 0.31, respectively). In general, concordance between mothers and children was greater than that between fathers and children in most subscales, regardless of the child's gender.
Also listed in Table 6 are the means and standard deviations (SD) of the mothers', fathers' and children's scores, and the related effect sizes (ES). Although the mean score of the parents' report was significantly different from that of their children in a few subscales, all the Cohen's effect sizes were negligible. These findings reveal that fathers and mothers did not differ much in rating their daughters and sons, and both tended to report slightly worse HRQoL than their child, except for emotional functioning. It should be mentioned that the results of cross-informant agreement did not change significantly after correction for DIF items (Table 6).

Discussion
This is the first study investigating the effect of children's gender on fathers' and mothers' reports of their children's HRQoL at both the item and scale levels of the PedsQL™ 4.0 questionnaire. The results are unique in integrating mothers' and fathers' views on daughters' and sons' HRQoL. Assessing DIF across mother-daughter, father-daughter, mother-son and father-son dyads revealed that although parents and their children perceived the meaning of several items of the PedsQL™ 4.0 instrument differently, the pattern of fathers' and mothers' reports did not vary much across daughters and sons. In other words, in the Persian version of the PedsQL™ 4.0, the child's gender was not a confounding factor when mothers and fathers reported their daughters' and sons' HRQoL.
Fig. 1 Comparison of father-daughter invariance to mother-daughter invariance (first row) and father-son invariance to mother-son invariance (second row) in item 5 in the physical subscale
In our previous study, it was shown that in the proxy version of the PedsQL™ 4.0, the parents' gender was not a confounding factor in reporting the child's HRQoL [7]. The present study revealed that the child's gender did not affect the parents' reports of their children's HRQoL either. Although the children and their parents interpreted several items differently, the pattern of DIF items across the father-son, father-daughter, mother-son and mother-daughter dyads (e.g. Figs. 1, 2) indicates that, in the PedsQL™ 4.0, neither the parents' nor the children's gender acted as a confounder when assessing the children's HRQoL.
As far as we know, there is no similar study with which our findings can be compared directly. In the closest study, the measurement invariance of another pediatric HRQoL instrument (KIDSCREEN-27) across son-parent and daughter-parent dyads was evaluated [8]. Although that report highlights the importance of taking the child's gender into account when evaluating measurement invariance, its authors noted that this conclusion cannot be definitive without knowing the parent's gender.
The parental evaluation of the child's HRQoL at the scale level of the PedsQL™ 4.0 revealed a small to moderate level of agreement between the parents' and children's reports in all subscales (ICC = 0.31-0.57). It should be mentioned that the degree of parental agreement differed slightly across daughters and sons; although both fathers and mothers tended to underestimate their children's general HRQoL (except for emotional functioning, which was overestimated), both parents had greater agreement with their daughters, and father-son agreement was the lowest in all domains. This finding could be due to the fact that boys, as compared to girls, tend to be more independent in their activities [27]. In this study, a greater degree of agreement was detected between children and their mothers, especially for girls, who see their mothers as their confidants, and this could be the result of the parents' distinct roles in a family. In most cultures, including Iran, fathers are the providers while mothers are involved in rearing and raising their children. In a recent systematic review, Hemmingsson et al. assessed all studies related to parent-child agreement in HRQoL research [28]. Despite showing a small to moderate level of agreement, they could not reach consistent results concerning whether or not parent-child agreement was related to the children's gender. For example, two studies found higher parent-child agreement in daughters [29,30], which is in line with the current findings. In contrast, Carlston and Ogles showed greater disagreements between daughters and parents, while sons and parents exhibited more pervasive but less severe discrepancies [31]. Buck et al. also found that parents overestimated their daughters' overall HRQoL on the psychosocial functioning scale of the PedsQL questionnaire, but underestimated their sons' [32]. In several aspects, these findings contrast with our results, which might be due to differences in the study design and the statistical methods used for data analysis.
From a methodological point of view, the measurement invariance of the PedsQL™ 4.0 across the informants was assessed using the hybrid OLR/IRT model through lordif, a powerful freeware package in R for DIF detection [17]. One unique feature of this platform is the ability to detect DIF based on Type-I error rates found empirically in simulated data. For example, when the McFadden pseudo-$R^2$ is used to quantify the magnitude of DIF, the values might vary from item to item, depending on the distribution within each response category and the number of response categories [19]. Accordingly, using a single threshold could result in varying power across items to detect DIF [33]. Hence, simulations can help to inform the choice of sensible thresholds. In other words, if a single threshold is to be used across all items, it should be set above the highest value identified in the simulations. For instance, the maximum McFadden pseudo-$R^2$ in Table 2 was 0.0189; thus, a rational lower bound that could avoid Type-I errors might be 0.02, which interestingly corresponds with a non-negligible (i.e. small) Cohen effect size [25].
This study had a number of limitations that have to be considered before drawing any conclusion. First, the majority of the participants were parents and children from an apparently healthy population; if the children or parents had had a serious chronic illness, cross-informant agreement could have been affected. For example, in adolescents with significant health conditions, fathers and mothers attend to the daily functioning of their children. It seems that, in Iran, mothers, as compared with fathers, are more concerned about their children's health; thus, it is unclear to what extent a child's health status could influence the results of DIF analysis across fathers/mothers and daughters/sons. As a second limitation, the current study was limited to adolescents aged 13-17 years, since the fathers' and mothers' item response patterns were likely to be biased in samples that combine younger children and adolescents. Given the amount of time that adolescents, especially boys, spend away from home, agreement across father/mother and son/daughter might be attenuated and the results of the DIF analysis might be confounded. Therefore, the results of this study cannot be generalized to children younger than 13 years. A third limitation arises from the fact that the hybrid OLR/IRT models were fitted separately in each domain to evaluate DIF items. Multidimensional approaches to analyzing a multidimensional PRO instrument, such as the PedsQL™ 4.0, could deal much better with the correlation amongst subscales and might substantially change our results [34][35][36]. Further studies are warranted to identify the possible effect of multidimensional analysis in exploring DIF items. The fourth limitation is the potential dependency between the parent and child groups. No simulation-based study so far has extended the iterative hybrid OLR/IRT approach to longitudinal data, which could handle the dependency amongst the groups much better and control its possible effect on DIF detection [37,38]. However, some other DIF detection techniques have been introduced that can deal with this problem and model the between-group covariance. The actor-partner interdependence models [39,40] and the longitudinal factor analysis-based models [41], which test measurement invariance over time, are among these methods. Nonetheless, none of these methods provides a simulation-based mechanism for evaluating the statistical criteria for detecting DIF. Therefore, developing a longitudinal version of the iterative hybrid OLR/IRT approach with Monte Carlo simulation could be considered in future studies. The fifth limitation arises from the fact that 40% of the students did not return the questionnaires to the research team. Since no socioeconomic indicators were available for the nonparticipant students, we could not evaluate the potential enrollment bias. Finally, further research should consider these limitations and try to extend the findings to other pediatric HRQoL measures, such as KIDSCREEN-27 and KINDL, in order to develop a more reliable assessment tool for parent-child agreement studies in different cultures.

Conclusion
In conclusion, this study revealed that although fathers/mothers and daughters/sons perceived the meaning of some PedsQL™ 4.0 items differently, the pattern of the fathers' and mothers' reports did not vary much across daughters and sons. In the Persian version of the PedsQL™ 4.0, the child's gender was not a confounding factor when the parents reported their daughters' and sons' HRQoL. This indicates that the mothers' and fathers' scores in reporting their children's HRQoL are comparable without taking the child's gender into account, suggesting that in Iran paternal proxy-reports can be included alongside maternal proxy-reports, and that the reports can be combined without considering the children's gender.

Abbreviations
HRQoL: Health-related quality of life; DIF: Differential item functioning; OLR: Ordinal logistic regression; IRT: Item response theory; GRM: Graded response model.