Validation of the Dutch version of the Swallowing Quality-of-Life Questionnaire (DSWAL-QoL) and the adjusted DSWAL-QoL (aDSWAL-QoL) using item analysis with the Rasch model: a pilot study

Background: The Swallowing Quality-of-Life Questionnaire (SWAL-QoL) is considered the gold standard for assessing health-related QoL in oropharyngeal dysphagia. The Dutch translation (DSWAL-QoL) and its adjusted version (aDSWAL-QoL) have been validated using classical test theory (CTT). However, these scales have not been tested against the Rasch measurement model, which is required to establish the structural validity and objectivity of the total scale and subscale scores. Thus, the purpose of this study was to examine the psychometric properties of these scales using item analysis according to the Rasch model.
Methods: Item analysis with the Rasch model was performed using RUMM2030 software with previously collected data from a validation study of 108 patients. The assessment included evaluations of overall model fit, reliability, unidimensionality, threshold ordering, individual item and person fits, differential item functioning (DIF), local item dependency (LID) and targeting.
Results: The analysis could not establish the psychometric properties of either of the scales or their subscales because they did not fit the Rasch model, and multidimensionality, disordered thresholds, DIF, and/or LID were found. The reliability and power of fit were high for the total scales (PSI = 0.93) but low for most of the subscales (PSI < 0.70). The targeting of persons and items was suboptimal. The main source of misfit was disordered thresholds for both the total scales and subscales. Based on the results of the analysis, adjustments to improve the scales were implemented as follows: disordered thresholds were rescaled, misfit items were removed and items were split for DIF. However, the multidimensionality and LID could not be resolved. The reliability and power of fit remained low for most of the subscales.
Conclusions: This study represents the first analyses of the DSWAL-QoL and aDSWAL-QoL with the Rasch model. Relying on the DSWAL-QoL and aDSWAL-QoL total and subscale scores to make conclusions regarding dysphagia-related HRQoL should be treated with caution before the structural validity and objectivity of both scales have been established. A larger and well-targeted sample is recommended to derive definitive conclusions about the items and scales. Solutions for the psychometric weaknesses suggested by the model and practical implications are discussed.
Electronic supplementary material: The online version of this article (doi:10.1186/s12955-017-0639-3) contains supplementary material, which is available to authorized users.


Background
Health-related quality of life (HRQoL) is a complex, multidimensional construct based on the individual's subjective perception of functioning and wellbeing across the physical, psychological and social domains of health [1][2][3]. As a latent construct, HRQoL cannot be observed or measured directly [4] and should preferably be measured with patient-reported outcome (PRO) measures using multiple items, each assessing a different aspect of the underlying construct [5]. Many PROs have been developed to measure HRQoL in patients with oropharyngeal dysphagia [1]. These PROs use self-reported questionnaires, assess the presence and severity of dysphagia symptoms, and measure the influence of dysphagia on a person's HRQoL [1,[6][7][8]. These PROs add useful information to the clinical swallowing examination and instrumental investigations [9] and can be used as outcome measures of therapeutic interventions [1,[6][7][8]. The applicability and appropriateness of a PRO in a specific population depend on the target population (i.e., persons with oropharyngeal dysphagia), its feasibility and the quality of its psychometric properties (i.e., reliability and validity) [4,10].
The Swallowing Quality-of-Life questionnaire (SWAL-QoL) is a 44-item disease-specific scale that is distributed into 10 subscales and the Symptom scale [11]. The SWAL-QoL is considered the gold standard for assessing HRQoL in oropharyngeal dysphagia [12]. The psychometric properties of the SWAL-QoL as well as the Dutch translation of the SWAL-QoL (DSWAL-QoL) and its adjusted version (aDSWAL-QoL) have been demonstrated to be sufficient according to classical test theory (CTT) [7,11,13]. The CTT-psychometric assessment included an examination of internal consistency based on Cronbach's alpha, test-retest reliability via the intraclass correlation coefficient (ICC), and/or construct validity based on principal component analysis (PCA) techniques [7,11,13]. Some drawbacks related to CTT methods are recognized [14,15], such as test and sample dependence [14][15][16] and the assumption of equal weight for all of the items even if there is a difference in the level of difficulty [16]. The scale's total sum score is based on ordinal values and the standard error of measurement is assumed to be constant [10,15], in contrast to the Rasch methodology.
The Rasch model within modern item response theory (IRT) has been considered the gold standard against which scales summarizing item responses must be tested [17]. Item analysis using the Rasch model involves formal testing of a scale against a mathematical measurement model that specifies what should be expected in the item responses to provide interval-based measures instead of ordinal values [18,19]. Interval measures are preferable to ordinal scales because they provide meaningful information about the relative differences and equivalences within the categories of the scale and enable the use of parametric statistics, which provide more powerful and precise results [14,20]. If the observed data fit the model, the following can be concluded: interval data have been generated, the measurement scale demonstrates structural validity and objectivity, and the total score is statistically sufficient [17]. Structural validity is an aspect of construct validity and evaluates the extent to which the scores of an HRQoL-PRO are an adequate reflection of the dimensionality of the construct being measured [4]. Objectivity implies invariance, which indicates that the comparison between two persons should be independent of which particular items have been used and vice versa [21]; therefore, the instrument should work in the same manner across all persons and items. In contrast to CTT, Rasch measurements allow for the provision of scale-independent person estimates and sample-independent item estimates [5]. The total score of a scale is statistically sufficient if the assessment of the latent variable (i.e., HRQoL) is only a function of that total score and does not depend on the conditional distribution of the item responses underlying the total score [17].
Four assumptions should be satisfied for a measurement scale to meet the criteria of validity, objectivity and statistical sufficiency: 1) unidimensionality (all items in the scale measure the same single construct) [17,21], 2) monotonicity (the scale items function hierarchically from easy to difficult, with increased item scores corresponding to increased levels of underlying ability) [17,22], 3) local item independency (a person's score on one item does not depend on their score on another item), and 4) no differential item functioning (DIF, i.e., a particular item's score does not differ due to other factors, e.g., age, for persons with equal ability levels) [17,21]. In addition to identifying measurement weaknesses, analysis with the Rasch model provides potential solutions for scale improvement. Such improvement has previously been demonstrated with the Taiwan Chinese version of the EORTC QLQ-PR25 questionnaire [23], St. George's Respiratory Questionnaire (SGRQ) [16], and the Patient-Rated Elbow Evaluation (PREE) questionnaire [24].
The purpose of this study was to assess the structural validity and objectivity of both the DSWAL-QoL and aDSWAL-QoL scales and subscales and the statistical sufficiency of the total score and subscale scores using item analysis with the Rasch model.

Participants
A portion of the data was derived from a previous validation study of the aDSWAL-QoL, which has been extensively reported elsewhere [13]. Therefore, the design will be briefly described. A cross-sectional study using convenience sampling was conducted and included 108 persons, among whom 78 were involved in the previous study [13]. People were selected if they were (1) native Dutch speakers, (2) adults (age ≥18 years old) and (3) had oropharyngeal dysphagia of mechanical or neurological origin as assessed with the Mann Assessment of Swallowing Ability (MASA) [25] and/or the Fiberoptic Endoscopic Evaluation of Swallowing (FEES) [26]. Persons without oropharyngeal dysphagia but with a confirmed language and/or cognitive impairment as measured by the auditory and visual comprehension subtests of the Akense Afasie Test (AAT) [27] and the Mini Mental State Examination (MMSE) [28] were also included in this study. Persons were classified into three groups according to whether they suffered from dysphagia (Dys group), had dysphagia accompanied by a language impairment and/or cognitive disorder (DysLC group), or suffered from a language impairment and/or cognitive disorder without the presence of dysphagia (LC group). The proposed criteria for the MASA [25] (further specified in Table 1) and the standardized cut-off scores of 107 for the AAT [27] and 27 for the MMSE [29,30] were used to compose the groups. The exclusion criteria were as follows: (1) severe problems understanding written and spoken Dutch resulting in the inability to complete the questionnaires; (2) severe attention and/or concentration problems that affected the person's ability to maintain concentration during the assessment; (3) the presence of purely esophageal dysphagia; (4) anosognosia, i.e., being unaware of the existence of dysphagia despite clinical confirmation; and (5) severe visual and hearing impairments that prevented the investigators from successfully providing assistance when required. 
The people were recruited from different settings that included hospitals, rehabilitation centers, nursing homes and private speech-language pathologist (SLP) practices and were identified by SLPs, the appropriate staff in nursing homes and medical doctors based on the inclusion criteria. Verbal and written consent were obtained from the participants prior to the start of the study. Ethical approval for the consent procedure and the experimental protocol of the study was granted by the Committee for Medical Ethics of the Antwerp University Hospital and Antwerp University (B300201318058), and the study was conducted in full accordance with the Declaration of Helsinki.

DSWAL-QoL and aDSWAL-QoL
The DSWAL-QoL is a condition-specific PRO scale that measures the effect of dysphagia on a person's HRQoL [7]. The DSWAL-QoL has been validated for a Flemish population [7] and consists of 44 items that are grouped into the following 11 subscales: General burden, Eating desire, Eating duration, Symptoms, Food selection, Communication, Fear of eating, Mental health, Social functioning, Sleep and Fatigue. The DSWAL-QoL uses a 5-point Likert scale that ranges from 1 = 'severely impaired quality of life' to 5 = 'no impairment'. Based on Likert's method of summated ratings, the scores are transformed into subscale and scale scores that range from 0 = 'strong effect of dysphagia on HRQoL' to 100 = 'no effect on HRQoL' [31]. To increase the feasibility of the DSWAL-QoL for DysLC people, an adjusted version (aDSWAL-QoL) has been developed [13]. Both versions (DSWAL-QoL and aDSWAL-QoL) have been validated using CTT [7,13]. The aDSWAL-QoL has similar content as the DSWAL-QoL (the abbreviated item contents of the DSWAL-QoL and aDSWAL-QoL are presented in Additional file 1) and also uses 5-point response categories. In contrast to the DSWAL-QoL, the number of different response formats in the aDSWAL-QoL is reduced to three (instead of six) and the subscales following the same response format are placed together. Additionally, the response categories in the aDSWAL-QoL are supported by visual line drawings, symbols and colors (Additional file 2).
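The 0 to 100 transformation described above can be sketched as follows. The source states only the endpoints of the transformed scale; the linear rescaling of the mean item score used here, and the function name `to_0_100`, are illustrative assumptions rather than the published scoring algorithm.

```python
def to_0_100(item_scores, low=1, high=5):
    """Rescale the mean of Likert item scores (low..high) to a 0-100 score.

    Assumption: a simple linear transformation of the mean item score.
    0 corresponds to all items scored `low`, 100 to all items scored `high`.
    """
    if not item_scores:
        raise ValueError("need at least one item score")
    mean_score = sum(item_scores) / len(item_scores)
    return (mean_score - low) / (high - low) * 100.0


# Hypothetical three-item subscale answered 2, 3 and 4 on the 5-point scale
subscale_score = to_0_100([2, 3, 4])
```

Under this assumption, equal steps between response categories are treated as equal amounts of HRQoL, which is exactly the ordinal-as-interval assumption that the Rasch analysis is meant to test.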

Procedures
People completed both the DSWAL-QoL and aDSWAL-QoL in a random order to minimize recall effects. A minimum of 15 and a maximum of 30 min elapsed between the administration of the first and the second questionnaires. The people were encouraged to complete the scales as independently as possible, while assistance was provided when required (i.e., on request or when the patient failed to provide a response) [13].

The Rasch model
The Rasch model is based on a probabilistic Guttman pattern [10,18]. This means that the probability of affirming a certain response to an item is a logistic function of the difference between the level of the measured construct as expressed by the person and as represented by the item, and only a function of that difference [18,21]. Person and item parameter estimates are placed on the same linear logit scale by transforming the original ordinal raw data into equal-interval level measures (logits or log-odds units). The logit scale represents the latent trait [18] (i.e., dysphagia-related HRQoL) and both parameters are centered around a mean item location of zero [21]. Positive values for the person and item parameters indicate high ability levels (i.e., better HRQoL) and difficult items, and negative values indicate low ability levels (i.e., worse HRQoL) and easy items [32].
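For the simplest (dichotomous) case, the logistic relationship described above is usually written as shown below. The notation is an assumption introduced here for illustration: \theta_n denotes the location of person n and b_i the location of item i, both in logits.

```latex
P(X_{ni} = 1 \mid \theta_n, b_i) = \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)},
\qquad
\ln\!\left(\frac{P(X_{ni} = 1)}{1 - P(X_{ni} = 1)}\right) = \theta_n - b_i
```

The second expression shows why the logit is the natural unit of the scale: the log-odds of affirming the item equal the person-item difference, so a person located exactly at the item location has a probability of 0.5.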

Data analysis
The DSWAL-QoL and aDSWAL-QoL include polytomous variables, and significant likelihood ratio tests (p < 0.001) indicated that the unrestricted parameterization of the model (partial credit) should be used rather than the rating scale model [33]. The item analysis with the Rasch model was performed using RUMM2030 [34], which integrates a pairwise conditional maximum likelihood algorithm in the estimation of the item and person parameters [22]. The following properties were examined: overall fit to the model, internal consistency reliability, unidimensionality, threshold ordering, individual item and person fits and differential item functioning (DIF), local item dependency (LID) and targeting. Item analysis with the Rasch model also yielded an iterative process in which strategies such as rescaling the response categories and item reductions were applied to improve the model fit and the construction of the scale.
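As a minimal sketch of the partial credit parameterization mentioned above, the category probabilities of one polytomous item can be computed from a person location and the item's thresholds. The function name and the threshold values are hypothetical; this is not the estimation routine used by RUMM2030, only the response model evaluated at known parameters.

```python
import math


def pcm_probabilities(theta, thresholds):
    """Category probabilities under the partial credit model.

    theta      -- person location in logits
    thresholds -- item thresholds in logits (m thresholds give m + 1 categories)
    """
    # Cumulative sums of (theta - threshold); category 0 has an empty sum of 0.
    cumulative = [0.0]
    for tau in thresholds:
        cumulative.append(cumulative[-1] + (theta - tau))
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(cumulative)
    weights = [math.exp(c - peak) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical 5-category item with four ordered thresholds
probs = pcm_probabilities(0.0, [-1.0, 0.0, 1.0, 2.0])
```

With a person located exactly at a threshold, the two adjacent categories are equally probable, which is what the intersections in a category probability curve represent.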
Overall fit to the model The overall fit to the model was assessed by evaluating three overall fit statistics, specifically two item-person interaction statistics and one item-trait interaction statistic [35]. The overall item and person fit were evaluated by inspecting the mean item and mean person standardized fit residuals (FRs) [35], which should be close to zero with a standard deviation (SD) of < 1.4 [32]. The item-trait interaction, which assesses whether the relative difficulties of the items remain constant across the different ability groups of patients [32,35], was measured using a chi-square statistic (χ²). Specifically, the χ² summarizes the differences between the observed and the expected values, and a nonsignificant result (p > 0.05) was taken to indicate fit to the model expectations.
Reliability The internal consistency reliability was assessed with the Person Separation Index (PSI), which is an estimate similar to Cronbach's alpha (α) coefficient [35]. The PSI assesses how adequately the set of items can distinguish subjects on different levels of the scale [36], and a value ≥ 0.70 is required [18,32]. The PSI is also an indicator of the power of the generated fit statistics [24].
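A separation index of this kind is commonly defined as the share of the observed variance in person estimates that is not attributable to measurement error. The sketch below follows that general definition under stated assumptions; the function name and the input values are hypothetical, and the exact computation in RUMM2030 may differ in detail (for example, in how extreme scores are handled).

```python
from statistics import mean, variance


def person_separation_index(estimates, standard_errors):
    """(observed variance - mean error variance) / observed variance.

    estimates       -- person location estimates in logits
    standard_errors -- standard error of each person estimate
    """
    observed_var = variance(estimates)  # sample variance of person locations
    error_var = mean([se ** 2 for se in standard_errors])
    return (observed_var - error_var) / observed_var


# Hypothetical estimates: well-spread persons measured with small errors
psi = person_separation_index([-2.0, -1.0, 0.0, 1.0, 2.0], [0.5] * 5)
```

The same formula makes clear why short subscales tend to score low: few items give large standard errors, so the error variance takes up a larger part of the observed variance.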
Unidimensionality The unidimensionality of the scale was measured by performing t-tests on the two most divergent subsets of items [32]. The items with the greatest positive and negative loadings on their first residual factor (resulting from PCA) were used to create the two subsets [32,37]. The scale was considered unidimensional if < 5% of the person estimates exhibited a significant difference in the scores for the two subtests [32,37] or if the lower bound of an exact binomial confidence interval is < 5% [19]. Performing the t-test requires at least 12 category thresholds in each of the two subsets [19].
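The decision rule above can be sketched in two steps: count the persons whose two subset estimates differ significantly, then compute the lower bound of an exact (Clopper-Pearson) binomial confidence interval for that proportion. The critical value of 1.96, the function names and the example data are assumptions made for illustration; the source does not state the critical value.

```python
import math


def count_significant(pairs, critical=1.96):
    """Count persons whose two subset estimates differ significantly.

    pairs -- tuples (estimate_1, se_1, estimate_2, se_2), one per person
    """
    count = 0
    for est1, se1, est2, se2 in pairs:
        t = (est1 - est2) / math.sqrt(se1 ** 2 + se2 ** 2)
        if abs(t) > critical:
            count += 1
    return count


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k, n + 1))


def exact_lower_bound(k, n, alpha=0.05):
    """Lower limit of the two-sided exact binomial CI, found by bisection."""
    if k == 0:
        return 0.0
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if binom_upper_tail(k, n, mid) < alpha / 2:
            lo = mid  # p too small: the upper tail is still below alpha / 2
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical data: two persons, one with clearly divergent estimates
example = [(1.0, 0.3, -1.0, 0.3), (0.0, 0.5, 0.1, 0.5)]
n_sig = count_significant(example)
lower = exact_lower_bound(n_sig, len(example))
unidimensional = n_sig / len(example) < 0.05 or lower < 0.05
```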
Threshold ordering of the polytomous items After the investigation of the overall fit statistics, the ordering of the response categories was examined using a threshold map and category probability curves [18,21]. In the cases of both the DSWAL-QoL and aDSWAL-QoL items, there are 5 response categories, resulting in 4 thresholds (= transitional points) [14]. Figure 1 represents a category probability curve in which the thresholds of a certain item are well ordered and form distinctive regions. Monotonicity was expected, and in cases of disordered thresholds, the item was rescored by combining adjacent categories [18,32].
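The check and the rescoring step described above can be illustrated with a short sketch. The threshold values and the collapsing map below are hypothetical and are not the solutions reported for any DSWAL-QoL item; in practice the choice of which adjacent categories to combine is guided by the category probability curves.

```python
def thresholds_ordered(thresholds):
    """True if each threshold lies higher on the logit scale than the previous one."""
    return all(a < b for a, b in zip(thresholds, thresholds[1:]))


def collapse_categories(responses, mapping):
    """Rescore responses by merging adjacent categories according to `mapping`."""
    return [mapping[r] for r in responses]


# Hypothetical item with 5 categories (0-4) and a disordered third threshold
ordered_before = thresholds_ordered([-0.5, 0.8, 0.2, 1.1])

# Hypothetical rescoring: categories 1 and 2 are merged, giving 4 categories (0-3)
merge_1_and_2 = {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}
rescored = collapse_categories([0, 1, 2, 3, 4, 2], merge_1_and_2)
```

After rescoring, the item is re-estimated and the thresholds are checked again, which is why further disordered items can surface during the iterative process.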
Individual item and person fit Individual item and person fit were assessed using the FR and χ². Person and item FRs within ± 2.5 and a χ² p-value above the Bonferroni-adjusted α-value of 0.05 [32,38] indicated a fit to the model. Misfitting items or persons were removed to improve the overall fit of the model.
Differential item functioning Items were also checked for DIF to ensure that the items of the scale were not biased by the person factors (i.e., language and cognitive impairments and dysphagia) and that the different class intervals followed the expected values of the characteristics of the items themselves. Thus, it was possible to investigate whether the different groups of the sample responded differently to an individual item despite the equal location on the latent trait [21]. The detection of DIF (i.e., uniform [18] and non-uniform DIF [18,21,37]) was made possible via the application of analysis of variance (ANOVA) to the fit residuals [21]. Uniform DIF was adjusted by splitting the item into group-specific items [21]. Items with nonuniform DIFs were considered to misfit the model and were removed [35].
Local item dependency Local item dependency, which might be caused by response dependency (i.e., when a person's response to an item depends on the response to another item) or by trait dependency (i.e., multidimensionality), was investigated using the residual correlation matrix [22,39]. Local item dependence was considered to be present if the item residual correlations were > 0.3 above the average of all of the item residual correlations [22,38]. By grouping the items into one "super-item," called a testlet, the LID can be adjusted [37]. For all analyses, the Bonferroni correction was applied to adjust for multiple testing and was calculated based on the number of items [21].
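The LID criterion and the Bonferroni adjustment described above can be sketched as follows. The residual correlation matrix is a hypothetical three-item example and the function names are illustrative; the residuals themselves would come from the fitted Rasch model.

```python
def flag_lid_pairs(corr, margin=0.3):
    """Return item pairs whose residual correlation exceeds the average
    off-diagonal residual correlation by more than `margin`."""
    n = len(corr)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    average = sum(corr[i][j] for i, j in pairs) / len(pairs)
    return [(i, j) for i, j in pairs if corr[i][j] > average + margin]


def bonferroni_alpha(alpha, n_items):
    """Significance level adjusted for multiple testing across items."""
    return alpha / n_items


# Hypothetical residual correlations for three items (symmetric matrix)
residual_corr = [
    [1.0, 0.5, -0.1],
    [0.5, 1.0, -0.1],
    [-0.1, -0.1, 1.0],
]
dependent_pairs = flag_lid_pairs(residual_corr)  # indices are zero-based

# Adjusted alpha for a 44-item scale
adjusted = bonferroni_alpha(0.05, 44)
```

Using a cutoff relative to the average residual correlation, rather than a fixed value, accounts for the fact that residual correlations are negative on average in short scales.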
Targeting of persons and items (person-item threshold distributions) Targeting was examined after fitting the best solutions for the DSWAL-QoL and aDSWAL-QoL total scales and subscales. Targeting was analyzed by comparing the person and item threshold distributions. To be acceptable, the mean person locations were expected to approximate the mean item threshold location (i.e., 0.0 logits) and the item locations were expected to cover approximately the same range of the logit scale as the person locations [21,35].

Sample size
A sample size of 108 persons was suggested to provide 95% confidence that the item calibration or the estimated item difficulty will be within ± 0.5 logits [40].

Participants
In total, 108 persons were included. The Dys group consisted of 35 persons, 43 persons comprised the DysLC group, and 30 persons were in the LC group. The mean age of the total sample was 73.50 years (SD: 14.79). Table 1 presents the demographic characteristics of the persons. Comparison of the three groups revealed that head and neck cancer (48.6%) was most common in the Dys group, stroke (53.5%) was most common in the DysLC group, and dementia (60.0%) was most common in the LC group.

Evaluation of the measurement properties of the DSWAL-QoL and aDSWAL-QoL
Tables 2 and 4 display the results for the overall fit statistics before and after the implementation of the solutions suggested by the Rasch model for the DSWAL-QoL and aDSWAL-QoL scales, respectively. Table 3 provides an overview of the item level fit statistics of both scales.

Evaluation of the measurement properties of the DSWAL-QoL
The analysis revealed that the reliability was good, with a PSI of 0.93 and an excellent power of fit without extreme scores. However, the total DSWAL-QoL scale was found to misfit the Rasch model as indicated by the item FR SD and person FR SD > 1.4, and by the presence of a significant item-trait interaction (Table 2). Multidimensionality was present as confirmed by the 16.09% statistically significant different person estimates based on the two subsets of items. Disordered thresholds were found in 38 items, which indicated that the categorization of these items did not work as intended (Table 3). For example, the category probability curve for item 22 revealed that the estimates of the thresholds defining categories 2 and 3 did not form distinctive regions on the latent trait; therefore, these scores (i.e., 'somewhat' and 'a little') were at no time the most probable responses (Fig. 2). Seven items did not fit (items 5, 6, 7, 32, 34, 40 and 41; Table 3), and the individual person fit revealed that 11 persons fell outside the FR range of ± 2.5. As illustrated in Fig. 3, item 5 exhibited a uniform DIF by group for all three groups, and the cognitive group obtained a prominently lower score compared with persons in the other two groups given an equal ability level. Residual correlations > 0.3 were found for clusters of items within and across subscales; thus, LID was present. To achieve a satisfactory overall model fit, it was necessary to rescore 38 items. For example, scores 2 and 3 for item 22 were collapsed into the score '1'; therefore, the Rasch-suggested scoring solution revealed a three-point response category, i.e., '0, 1 and 2', for this item (Additional file 1). During this process of rescoring items, three additional items exhibited disordered thresholds and were also rescored. It was also necessary to delete six items (items 5, 6, 7, 34, 40 and 41; Table 2).
The overall model fit improved for the items and no further DIF was present. However, the overall fit worsened for the person FR SD, the item-trait interaction remained significant, and 14 persons did not fit. Due to the limited sample size, the misfit persons were not removed. Multidimensionality remained despite the adjustments, and LID could not be resolved due to its unclear pattern. For the Symptoms and Mental health subscales, the reliabilities were acceptable (PSI ≥ 0.75; Table 2). For the other subscales, the reliabilities were below the recommended level and the power of fit was low. A large number of extreme scores were present for all subscales with the exception of the Symptoms scale. There was no pattern of the extremes across the three person groups. All of the DSWAL-QoL subscales, with the exception of the Eating desire and Fear of eating subscales, exhibited satisfactory overall fit statistics. For the Eating desire subscale, the person FR mean (SD) of −0.10 (1.46) indicated some misfit of the persons. Inspection of the individual person fits revealed five misfit persons. The Symptom subscale exhibited a lack of unidimensionality, whereas the other subscales could not be subjected to t-tests due to insufficient numbers of items (i.e., thresholds). None of the items exhibited misfit, but disordered thresholds were found for a majority of the items within all subscales with the exception of the Communication subscale (Table 3). A uniform DIF was identified for item 26 from the Fear of eating subscale and was biased toward the cognitive group, which obtained higher scores. Local item dependency was demonstrated between items 8 and 9 from the Symptoms subscale. Adjustments of the subscales were performed, and after the items with disordered thresholds were rescored and item 26, which exhibited DIF, was split, all of the items exhibited ordered thresholds, the LID disappeared and the item-trait interaction improved for the Fear of eating subscale (Table 2). However, the overall person FR mean values and/or SDs significantly increased for some of the subscales (i.e., General burden, Eating duration, Eating desire, Social functioning and Sleep), and the item-trait interaction remained significant for the Eating desire subscale. For the General burden, Eating duration, Eating desire, Social functioning and Sleep subscales, the numbers of misfit persons were N = 4, N = 14, N = 11, N = 7, and N = 33, respectively. Unidimensionality could not be established for the Symptom subscale, and improvement for reliability also could not be identified for the subscales. After the solutions, the numbers of extreme scores increased for the General burden, Eating duration, Social functioning and Fatigue subscales (Table 2). The misfit persons were not removed because of the limited sample size.
Fig. 2 Category probability curve with disordered thresholds. Category probability curve graphically highlighting the disordered thresholds for item 22 of the total DSWAL-QoL scale. The point at which the lines for the adjacent response categories intersect in item 22 indicates that the transition between categories 2 and 3 is lower on the trait than the transition between categories 0 and 1. Response categories 2 and 3 never have a point on the continuum at which the most probable response is located.
Fig. 3 Item characteristic curve displaying a uniform DIF. Item characteristic curve displaying a uniform DIF for item 5. Despite the equal ability level, the three groups responded differently. The cognitive group obtained a prominently lower score than those of the two other groups.

Evaluation of the measurement properties for the aDSWAL-QoL
The analysis of the total aDSWAL-QoL scale revealed that the reliability was good (PSI = 0.93), the power of fit was excellent, and there were no extreme scores. Nonetheless, the total aDSWAL-QoL scale significantly deviated from the Rasch model (Table 4). Approximately 18% of the person estimates on the two most divergent subsets of items were significantly different, which indicated multidimensionality. Individual item analysis revealed 37 items with disordered thresholds (Table 3). Eight items (items 2, 5, 6, 25, 29, 32, 40 and 41) exhibited individual item misfit, and six items (1, 2, 6, 9, 32 and 43) exhibited uniform DIFs ( Table 3). Most of the items with DIF were biased toward the LC group, which obtained higher scores, with the exception of item 43. Individual person misfits were found for 16 persons. Local item dependency was present between several item pairs within and across the subscales. After the rescoring of 37 items and six more items that exhibited disordered thresholds during the iterative process, the individual item fits improved for items 2 and 25 and the DIFs disappeared for items 6, 9 and 43. The other misfit items and the items that exhibited DIF remained. The iterative process revealed that to improve the overall fit, it was necessary to remove items 5, 6, 7, 29, 32, 40 and 41 and to split five items that displayed uniform DIFs (items 1, 2, 27, 43 and 44; Table 4). The person FR still indicated some misfit among the persons (SD > 1.4). The misfit persons (N = 19) were not removed because of the relatively limited sample size. Assessing the unidimensionality of the scale was no longer possible because of the item split. Since the LID showed an unclear pattern, it was not possible to create testlets.
For the Symptoms and Mental health subscales, the reliabilities were acceptable (PSI ≥ 0.72; Table 4). For the other subscales, the reliabilities were below the recommended level and the power of fit was low. Extreme persons were identified in all subscales, although the magnitudes were lowest for the Eating desire, Symptoms and Fear of eating subscales. Again, there was no pattern of extremes across the three population groups. The overall fits to the model were demonstrated for all of the aDSWAL-QoL subscales with the exception of the Communication, Fear of eating and Social functioning subscales (Table 4). The overall person FR indicated a misfit for the Communication subscale (SD = 1.57), and further analysis revealed 22 misfit persons. Multidimensionality was present for the Symptom subscale; however, the other subscales could not be subjected to the test for multidimensionality because there were too few items. Disordered thresholds were found for 29 items across all subscales with the exception of the items in the General burden and Fatigue subscales (Table 3). At the individual item level, item 29 of the Fear of eating subscale did not fit the model. Differential item functioning was found in 3 items (items 3, 26, and 31) and LID was found between two item pairs from the Symptoms subscale (items 9-10 and 19-20) and between items 26 and 28 from the Fear of eating subscale. After rescoring all of the items with disordered thresholds (the DIFs for items 26 and 31 disappeared after rescaling), removing the misfit item 29 and splitting item 3 for DIF, ordered thresholds were found for all items and the item-trait interactions improved for the Fear of eating and the Social functioning subscales (Table 4). The item FR SD increased for the Communication subscale and the item-trait interaction became significant for the Eating desire subscale (p < 0.05). The person FR SDs increased for the Communication and Social functioning subscales.
The numbers of misfit persons were N = 21 and N = 9 for the Communication and Social functioning subscales, respectively, and misfit persons were not removed because of the relatively limited sample size. The number of extreme scores remained unchanged or increased. Improvement for reliability could not be identified for the subscales. The LID disappeared between items 26 and 28 from the Fear of eating subscale. For the Symptom subscale, the lack of unidimensionality remained and the LID persisted between items 9 and 10, disappeared between items 19 and 20 and appeared between items 14 and 18. Adjusting the LID (i.e., creating two testlets: item pair 9-10 and item pair 14-18) did not improve the overall fit statistics for the Symptom subscale (item FR (SD) = −0.10 (0.64); person FR (SD) = −0.32 (1.16); item-trait interaction: χ² (df) = 12.16 (12); p = 0.433); however, the reliability increased (PSI = 0.98).

Targeting of persons and items
After the adjustments, targeting was suboptimal for both the DSWAL-QoL and aDSWAL-QoL total scales and subscales (Table 5). For both the total scales and subscales, the item locations did not cover the same ranges of the logit scale as the person locations. At the positive and negative ends of the trait, no item thresholds were found at the person locations, which indicated that these persons exhibited higher or lower ability levels that could not be measured by the items of the scale. For the total aDSWAL-QoL scale and the aDSWAL-QoL Symptoms subscale, at the negative end of the trait, no persons were located at the item thresholds, which indicated that the average item difficulties of some of the items were too low.

Discussion
The analysis did not support the structural validity or objectivity of either the DSWAL-QoL or the aDSWAL-QoL total scales and subscales or the statistical sufficiency of the total scores and subscale scores. Misfit to the Rasch model, multidimensionality and/or the presence of DIF were found. Comparing the subscales of both versions, the Eating desire subscale of the aDSWAL-QoL exhibited an overall fit to the model in contrast to its corresponding subscale in the DSWAL-QoL, while the Communication and Social functioning subscales in the DSWAL-QoL did fit the model. For all other subscales, the results for the overall fits were similar. A large number of extreme scores were present in both versions of the scale, and this phenomenon was even greater for the DSWAL-QoL subscales. These extreme scores influenced the reliability and the power of fit. The presence of low levels of PSI and the high percentage of extreme scores reflected suboptimal targeting for the subscales of both versions. The suboptimal targeting for the total scales and subscales resulted in decreased estimation precisions of the item and person parameters [21]. The misfit items were most present when all of the items were treated as one total scale and were quite similar in both versions.
Abbreviations and symbols: aDSWAL-QoL adjusted DSWAL-QoL, IS initial scale, RS rescaled scale based on the suggestions from the Rasch methodology, DIF differential item functioning, FR fit residual, SD standard deviation, PSI Person separation index reported with extremes (+) and without extremes (÷), χ² (df) chi-square (degrees of freedom), LCI lower confidence interval, % ext percentage of extreme scores, NA not applicable (t-tests were not performed when there were too few thresholds in each subset or when the items were split for DIF when using the RUMM software). Bold indicates misfit to the Rasch model. Bold italic indicates multidimensionality.
Local item dependence was present between items within and across the subscales for both the DSWAL-QoL and aDSWAL-QoL scales. The main sources of misfit for both scale versions were disordered thresholds for the items in the total scale and the individual subscales, with the exception of the items of the Communication subscale in the DSWAL-QoL and the General burden and Fatigue subscales in the aDSWAL-QoL. We expected fewer disordered thresholds in the aDSWAL-QoL because the aDSWAL-QoL has been proven to be more feasible for use in groups with additional language and/or cognitive impairments [13]. Note that the content of the 5-point response category in the aDSWAL-QoL is similar to that of the DSWAL-QoL. Some patients might have interpreted the graphic support (i.e., the symbols that were intended to enhance the comprehension of the response categories) in a different manner than what was intended. Nonetheless, it was obvious that the original scoring structures for most of the items of both the total scales and subscales did not work as intended (Additional file 1). This may be because respondents were not able to discriminate between the response categories. Either the different categories were not well defined or the difference in meaning was too subtle (e.g., what is the difference between 'somewhat' and 'a little'?). Additionally, the incorrect assumption that the Likert scale is an interval scale is common, although the categories of a 5-point Likert format represent a qualitative variable that is actually only sequential and ordinal [41]. To obtain linear, equal-interval level results, testing of the ordering of the response categories against the Rasch measurement model and subsequent rescaling of the items with disordered thresholds are required.
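What a disordered threshold means in practice can be shown with a small sketch. It assumes the general partial credit parameterization of the polytomous Rasch model, in which the probability of scoring in category k is proportional to the exponential of the summed differences between the person location and the first k thresholds. The threshold values and function names are hypothetical and are not estimates from this study or RUMM2030 output.

```python
import math


def category_probabilities(theta, thresholds):
    """Return P(X = k) for k = 0..m under the partial credit model.

    theta is the person location and thresholds are tau_1..tau_m,
    all expressed in logits.
    """
    # Category 0 corresponds to an empty sum, i.e. 0.
    exponents = [0.0]
    for tau in thresholds:
        exponents.append(exponents[-1] + (theta - tau))
    weights = [math.exp(value) for value in exponents]
    total = sum(weights)
    return [weight / total for weight in weights]


def thresholds_ordered(thresholds):
    """True if every threshold is higher than the previous one."""
    return all(a < b for a, b in zip(thresholds, thresholds[1:]))


def never_modal_categories(thresholds, grid=None):
    """List categories that are never the most probable response
    at any person location on a grid from -6 to +6 logits."""
    grid = grid or [step / 10 for step in range(-60, 61)]
    modal = set()
    for theta in grid:
        probabilities = category_probabilities(theta, thresholds)
        modal.add(probabilities.index(max(probabilities)))
    return [k for k in range(len(thresholds) + 1) if k not in modal]
```

With the hypothetical ordered thresholds `[-2, -1, 1, 2]`, every one of the five categories is the most probable response somewhere along the trait. With the disordered set `[-2, 1, -1, 2]`, the middle category is never the most probable response, which is the kind of pattern that motivates collapsing adjacent categories during rescaling.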
The presence of DIF by group was found in both versions of the scale but was most prominent for the aDSWAL-QoL. Most of the items that displayed DIF were biased toward the cognitive group, which tended to obtain higher scores for these items. This finding indicates that this group overestimated their HRQoL. The latter was expected because this group did not suffer from oropharyngeal dysphagia. The DSWAL-QoL and aDSWAL-QoL are disease-specific scales developed for people with oropharyngeal dysphagia. The LC group did not meet this condition; thus, the appropriateness of including this group in the study could be questioned. The objectivity of the scale can only be established by the Rasch methodology if one important requirement is satisfied, i.e., if the items and the sample are within the specific frame of reference for which the scale was developed [17]. Nonetheless, including this patient group was important because it enabled the evaluation of whether the scale and subscales were influenced by DIF. It was also expected that this LC group would exhibit extreme scores (i.e., all of the maximum scores for HRQoL) because of the absence of oropharyngeal dysphagia. However, there was no pattern in the extreme scores across the three groups. This issue leads to the question of the extent to which the scales are completed in a 'reliable' manner (i.e., whether they truly capture the patient's perspective). Compared to the other two groups, more people in the cognitive group had an underlying etiology of dementia. The language and cognitive impairments were also greater in the LC group; thus, this group likely had more problems understanding the questions of both the DSWAL-QoL and aDSWAL-QoL. In addition to impaired language functions, dementia encompasses a large spectrum of behavioral and other cognitive impairments, such as changes in personality and behavior, impaired reasoning and handling of complex tasks, poor decision-making ability and poor judgment [42].
The finding that this LC group did not demonstrate extreme scores as expected indicates that caution should be exercised in the use of these scales with dysphagic people with dementia because the dementia-related factors might influence the responses. The use of the 5-point response category for the Social functioning subscale of the aDSWAL-QoL should be questioned because the middle response category of this subscale included 'I don't know.' The literature indicates that respondents do not interpret this type of middle response category as expected from the integer scoring (i.e., monotonically). Consequently, disordered thresholds can occur because these categories differ from other response options in their probabilities of being selected. With respect to the integer scoring [43], it would be appropriate to reformulate this response category.
After the adjustments, a potential scoring structure was suggested for all items of both the DSWAL-QoL and aDSWAL-QoL total scales and subscales (Additional file 1). The scoring structure was often different for some of the items when they were treated as one scale instead of being part of the subscales. In most of the items, the 5-point response category was rescaled to a 3- or 4-point response format. A disadvantage of using different response formats is that it might lead to confusion and cause erroneous responses [44]. The analysis suggested that, rather than including 44 items, the total DSWAL-QoL scale should be rescaled to a 38-item scale and the total aDSWAL-QoL should be rescaled to a 42-item scale in which five items are group specific. The proposed numbers of items for the subscales of both versions are displayed in Additional file 1 and in Tables 2 and 4. Note that a large-scale empirical study is needed to confirm the scoring structures of both the scales and subscales.
The overall fit improved for the total aDSWAL-QoL scale but not for the total DSWAL-QoL after the adjustments, and the person FR SDs remained high for both total scales. Misfits were also demonstrated for the Eating desire subscale of both scales after adjusting the items. The issue of fit is a relative matter in the Rasch methodology because it depends on the sample size [21]. The reliabilities were high for both total scales but low for most of the subscales of both questionnaires. When comparing studies that performed cross-cultural adaptations of the SWAL-QoL [7,8,13,45], we observed differences in reliability. These studies used Cronbach's α to evaluate the internal consistency. However, relying on Cronbach's α is only justified if the data are normally distributed. Multiple ceiling or floor effects were observed in those studies [7,8,13,45], raising questions about the accuracy of the internal consistencies of these scales. It is not possible to compare our study results (based on the PSI) with studies that have used Cronbach's α because α includes extreme scores, whereas the estimate of the PSI requires extrapolated values for extreme scores [46]. Specifically, the calculation of Cronbach's α assumes equal standard errors (SEs) in all of the scores. This assumption contrasts with the calculation of the PSI in which the SE increases as the scores become more extreme [46]. After adjusting for LID in the Symptoms subscale of the aDSWAL-QoL, the reliability increased, which indicated multidimensionality [39]. Whether LID is response or trait dependent may be difficult to distinguish in polytomous analysis [39]. For both the DSWAL-QoL and aDSWAL-QoL total scales, LID was present within and across subscales. Thus, due to an unclear pattern of the LID, it was not possible to create testlets that would resolve LID.
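The contrast between Cronbach's α and the PSI drawn above can be made concrete with a minimal sketch. It assumes the textbook formula for α based on raw item scores and a commonly used definition of the PSI as the share of the variance of the person estimates that is not error variance, where the error variance is the mean of the squared person SEs. The exact computation in RUMM2030, including its handling of extreme scores, may differ; the function names and all input values are hypothetical.

```python
from statistics import mean, variance


def cronbach_alpha(scores):
    """Cronbach's alpha from raw scores.

    scores is a list of persons, each a list of item scores
    (complete data, same item order for every person).
    """
    n_items = len(scores[0])
    item_variances = [
        variance([person[i] for person in scores]) for i in range(n_items)
    ]
    total_variance = variance([sum(person) for person in scores])
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


def person_separation_index(locations, standard_errors):
    """PSI from person location estimates (logits) and their SEs.

    Each person contributes an individual SE, so persons with more
    extreme locations (and larger SEs) lower the index.
    """
    observed_variance = variance(locations)
    error_variance = mean([se ** 2 for se in standard_errors])
    return (observed_variance - error_variance) / observed_variance
```

As a usage note, `person_separation_index([-1, 0, 1], [0.5, 0.5, 0.5])` evaluates to 0.75, and raising the SEs of the two outer persons lowers the index even though the locations are unchanged. Cronbach's α has no per-person error term that could respond in this way.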
Since LID might be caused by multidimensionality, it could be suggested to reconsider the dimensional structure of both scales using factor analytic approaches. Although Vanderwegen et al. [7] performed traditional (linear) PCA on the DSWAL-QoL, PCA only identifies the variables that show the strongest linear relationship with each other and tries to explain as much of the total variance in the data as possible [20]. Therefore, factor analysis for ordinal data [47] is required as it identifies the (number of) latent constructs and the possible underlying factor structure of a set of variables [20]. Furthermore, multidimensionality could not be resolved by the Rasch methodology for either of the total scales. This finding indicates that each item of both scales should be scored separately and should be considered as a single item [22]. The main strength of this study was that by using a modern test theory approach, both scales could be improved (e.g., ordered thresholds for the items). However, we could not establish the structural validity and objectivity of either the total scales or the subscales, and the total score and subscale scores remained statistically insufficient.

Limitations and future research
One major limitation of this study was the relatively limited sample size. A sample size of at least 64 to 144 persons is required to achieve 95% confidence that the item calibration is within ± 0.5 logits [40]. Our study sample of 108 patients was within the recommended sample size (i.e., sufficient for the total scales). Subjects with extreme scores were excluded from the analysis because they did not contain information for the estimation of the item and person threshold parameters [36]. Thus, for the subscales, the effective sample size in this study was smaller than the original sample size. The low PSI and the low power of fit had to be taken into account when interpreting the results for the subscales. To derive definitive conclusions about the items and the scales, well-targeted samples of ≥ 250 people are recommended [36,40]. Therefore, this study must be interpreted as a pilot study. Nonetheless, clinicians and researchers can no longer rely on the DSWAL-QoL and aDSWAL-QoL total scores and subscale scores as indicators of how a patient's HRQoL is affected by oropharyngeal dysphagia. Until the psychometric properties have been established in a larger sample, we suggest using the proposed scoring structure (Additional file 1) for each individual item, bearing in mind that only qualitative information should be derived from that item. Items suggested to be removed from the total scales and subscales should be interpreted with caution. The psychometric weaknesses of these scales indicate a need to reconsider the cross-cultural validation process [48] and to evaluate whether the translations and adaptations meet accepted standards of cross-cultural validation [4,22]. We recommend using IRT for further validation of the original SWAL-QoL [11] and all of its translations. Most of the subscales exhibited a lack of sufficient items to allow for the assessment of the unidimensionality of the scale.
After all, multiple items enable the improvement of the reliability because random errors of measurement can be averaged out. Multiple items increase the scope of a scale and are less open to variable interpretation [49]. However, scales that are too extensive do not function well in routine clinical practice [11] because the patient's burden in completing multiple items can be onerous and time consuming [50]. With 44 items, both of the SWAL-QoL versions are still long and extensive scales. Therefore, it would be beneficial to create and validate a shorter version. We analyzed the two versions of the SWAL-QoL separately. For future studies, it may be useful to merge the two datasets and perform DIF analysis using the version as a person factor. We used a residual correlation of r > 0.30 above the average of all the correlations for the detection of LID [38], although this criterion might be regarded as arbitrary [51]. If a stricter criterion of r > 0.20 had been used, we might have found more LID.
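The LID criterion mentioned above can be sketched as follows. The sketch assumes that standardized item residuals are available per item in the same person order, and that "above the average" refers to the mean of all pairwise residual correlations. The function names, the `margin` parameter and any input data are hypothetical and are not taken from RUMM2030.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(spread_x * spread_y)


def flag_lid(residuals, margin=0.30):
    """Flag item pairs whose residual correlation exceeds the average
    residual correlation by more than `margin`.

    residuals maps an item name to its list of standardized residuals.
    Returns the flagged pairs with their correlations, and the cutoff.
    """
    correlations = {
        (a, b): pearson(residuals[a], residuals[b])
        for a, b in combinations(residuals, 2)
    }
    average = sum(correlations.values()) / len(correlations)
    cutoff = average + margin
    flagged = {pair: r for pair, r in correlations.items() if r > cutoff}
    return flagged, cutoff
```

Calling `flag_lid(residuals, margin=0.20)` applies the stricter criterion. Because the cutoff is lower, every pair flagged at 0.30 is also flagged at 0.20, and additional pairs may appear.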

Conclusions
This is the first study to examine the structural validity and objectivity of both the DSWAL-QoL and aDSWAL-QoL total scales and subscales and the statistical sufficiency of the total scores and subscale scores using item analysis with the Rasch model. However, the analysis could not establish these psychometric properties because a misfit to the model, multidimensionality, disordered thresholds, DIF and/or LID were found. This analysis with the Rasch model identified areas that require further investigation. Our study highlighted the fact that the DSWAL-QoL and aDSWAL-QoL subscale scores and total scale scores should be used with caution to draw conclusions about a person's dysphagia-related HRQoL until the psychometric requirements have been established. The adjustments suggested by the Rasch model induced scale improvement, as the disordered thresholds were rescaled, the misfit items were removed and the DIF was resolved. Although we were not able to derive definitive conclusions about the items and the scales, this study illustrated the added value of the use of Rasch analysis in the detection of the psychometric strengths and weaknesses of these rating scales. Therefore, this study can be viewed as an essential step toward the further improvement of these scales.