Validation of a mobility item bank for older patients in primary care

Background To develop and validate an item bank to measure mobility in older people in primary care and to analyse differential item functioning (DIF) and differential bundle functioning (DBF) by sex. Methods A pool of 48 mobility items was administered by interview to 593 older people attending primary health care practices. The pool contained four domains based on the International Classification of Functioning: changing and maintaining body position, carrying, lifting and pushing, walking and going up and down stairs. Results The Late Life Mobility item bank consisted of 35 items, and measured with a reliability of 0.90 or more across the full spectrum of mobility, except at the higher end of better functioning. No evidence was found of non-uniform DIF but uniform DIF was observed, mainly for items in the changing and maintaining body position and carrying, lifting and pushing domains. The walking domain did not display DBF, but the other three domains did, principally the carrying, lifting and pushing items. Conclusions During the design and validation of an item bank to measure mobility in older people, we found that strength (carrying, lifting and pushing) items formed a secondary dimension that produced DBF. More research is needed to determine how best to include strength items in a mobility measure, or whether it would be more appropriate to design separate measures for each construct.


Background
Physical function is a central component of health status and quality of life [1]. It can be measured with fixed length scales such as the Health Assessment Questionnaire [2] or the subscale of physical functioning of the Medical Outcomes Study Short Form-36 (PF-10) [3], and also with item banks based on item response theory (IRT) models [4,5]. In some of these item banks, physical function is measured as a two-dimensional construct consisting of mobility and upper extremity function [6,7], although in others a unidimensional solution has been considered more appropriate [8,9]. However, the unidimensional solution is not sufficiently robust for certain health conditions [9]. The majority of these physical function measures are aimed at assessing health outcomes in patients with chronic diseases or in rehabilitation contexts [6][7][8][9][10]. There are, however, no specific measures to assess physical function in community-dwelling older people, with the exception of the Late-life Function and Disability Instrument [11,12].
Measuring physical function (mainly mobility rather than upper extremity function) in older people is doubly useful, as physical function is a strong predictor of disability, institutionalisation and death and is also a primary outcome, more proximal than disability, in longitudinal and clinical trials aimed at explaining or preventing disability [13,14]. Due to the scarcity and importance of late life mobility measures, the first of the two objectives of this paper is to present the development and validation of an item bank to measure mobility in community-dwelling older people, using IRT methods. Items in the item bank were based on International Classification of Functioning, Disability and Health (ICF) mobility indicators [15]. Consequently, neither upper extremity function items nor disability (in activities of daily living) items were included.
In addition, significant gender differences in mobility have been observed, with women presenting poorer function [16,17]. These differences are not uniform across the mobility domains, but are greater in the carrying, lifting and pushing domains than in the walking and moving domains [17][18][19][20][21][22]. However, psychometric studies analysing gender differential item functioning (DIF), that is, whether the probability of a given response to an item differs between the compared groups at the same construct level, have not yielded any relevant or systematic findings, except that most DIF effects cancel out at the level of the aggregate score [8,9,12,23,24]. For example, nine items in the physical function computerised adaptive testing version of the European Organisation for Research and Treatment of Cancer Quality of Life Questionnaire-Core 30 showed gender DIF, but DIF cancellation occurred because the DIF observed was in opposite directions: walking and moving items were more demanding for men whereas carrying, lifting and pushing items were more demanding for women [24].
However, although DIF cancellation can be secured in a fixed measure or even in an entire item bank, this is not the case in adaptive measures created from the latter [23,25]. In a standard DIF analysis, an internal criterion (total score or an estimate based on total score) is used as a conditioning variable and then each item is individually studied for DIF [23]. It is also possible to study a bundle of items simultaneously rather than separately. Analysing item bundles makes it possible to test the DIF amplification hypothesis, i.e., whether items depending on a common secondary dimension have DIF effects, significant or nonsignificant, which accumulate at the level of item domain or bundle (differential bundle functioning or DBF) [26,27]. Accordingly, the second objective of this study was to examine whether mobility domains form secondary dimensions containing items that present DBF. In summary, the two objectives of this paper are to present the development and validation of an item pool to measure mobility in older people and to analyse differential item and bundle functioning across gender.

Study population
The data presented in this article have been taken from the baseline of a longitudinal study on mobility measures as predictors of adverse health outcomes. Eligible participants were people over 69 years old attending five primary health care centres in the Autonomous Region of Valencia (Spain). Patients who produced more than three errors (four if they were illiterate) in the Short Portable Mental Status Questionnaire [28], had serious communication problems or were considered too weak to participate in physical performance tests were excluded. Sampling was consecutive: all eligible patients from one day of each week during the period November 2006 to October 2007 were selected. Of the 700 eligible patients, 593 gave informed consent and comprised the study sample. No statistically significant differences between participants and non-participants were observed for age or sex. The study was approved by the corresponding authorities of the health centres involved.

Late life mobility item bank (LLM-IB)
A pool of 104 mobility items was selected from the literature and a panel of experts (two physicians, four nurses and three psychologists) assessed their relevance and suitability for older people, and also classified them into four domains based on three ICF categories of mobility: changing and maintaining body position (BP), carrying, lifting and pushing (CLP), walking (Walking) and going up and down stairs (UDS). Walking and UDS were considered separately and items relating to moving around using transportation were not included. The relevance of the activities included was also evaluated by three focus groups of older people. As a result of the above, 48 items were selected and their ease of understanding was assessed in 17 cognitive interviews. No items were eliminated, but modifications were made to various item statements. The item stem posed the question in terms of ability, in the present tense and made no reference to health, with a rating scale of four response categories: no difficulty, some difficulty, much difficulty and unable to do. Scores were scaled measuring mobility limitation: the higher the score, the worse the function.

Other mobility measures
PF-10 and the Short Physical Performance Battery (SPPB) were used as external criteria for the mobility item bank. PF-10 is a 10-item self-report measure based mainly on lower extremity mobility [3,29]. The SPPB battery objectively assesses physical function of the lower extremities. It consists of three tests: balance, gait speed and chair stand. It has demonstrated excellent reliability, predictive validity and sensitivity to clinically important change and has been recommended for objectively measuring mobility limitations [14,30].

Biodemographic, clinical and disability measures
Biodemographic variables included body mass index (kg/m²), age, sex, education and living arrangements. Cognitive function was evaluated using the Short Portable Mental Status Questionnaire [28]. Symptoms of depression were evaluated with the Geriatric Depression Scale [31]. Morbidity was measured by the presence or absence of the following medical diagnoses: hypertension, rheumatoid arthritis, osteoarthritis, myocardial infarction, angina pectoris, congestive heart failure, diabetes, cancer, chronic pulmonary disease, stroke, hip fracture, Parkinson's disease, and claudication [32,33]. Finally, subjects were asked whether they needed the help of another person to complete any of the following activities: eating, toileting, bathing, dressing and transferring (ADL dependence).

Procedure
Measurements were collected at the primary health care centres, but not during the subject's medical appointment. The SPPB was administered by trained observers, who also recorded height and weight. Morbidity was reported by the doctors caring for the participating patients, and the other measures were completed in interviews conducted by the same observers. Reliability of the mobility item pool and the SPPB was assessed in a pilot study. Using an interval of 15 days and a sample size of n = 62, the intra-class correlation coefficient for intra-rater reliability was 0.90 for the entire item pool, with a range of 0.60-0.90 for the individual items. The intra-class correlation coefficient for the SPPB was 0.80 for intra-rater reliability (n = 62) and 0.88 for inter-rater reliability (n = 30).

Data analysis
The main analyses consisted of examining DIF and DBF and calibrating the item pool using the Rasch rating scale model (RSM) [34]. Prior to this, however, we performed a descriptive analysis of the items and examined the three assumptions common to IRT models: monotonicity, unidimensionality and local independence. Unidimensionality is also an assumption for standard DIF analysis. Since the unidimensionality of a measure in a population does not ensure its unidimensionality in subpopulations [35], this aspect was also analysed separately in the subsamples of women and men. DIF/DBF analysis was performed before calibrating the item pool to avoid confusing item DIF with item misfit.

IRT assumptions
TestGraf [36] was used to analyse whether the items had a monotonic relation with the construct and whether each response category had a maximum probability of being selected over a unique interval of the scale. TestGraf estimates and displays the characteristic response curves by means of the nonparametric regression method known as kernel smoothing. To examine the unidimensionality of the item pool, we tested confirmatory single-factor and bifactor models with factor analysis methods suitable for ordinal data, namely analysis of polychoric correlation matrices using a diagonally weighted least squares estimator [4,37,38]. We specified four group factors in the bifactor model, one for each mobility item pool domain. These analyses were performed for the entire sample and also for the male and female sub-samples. To measure goodness-of-fit of the models, we selected the Comparative Fit Index (CFI), the Tucker Lewis Index (TLI), the root-mean-square error of approximation (RMSEA) and the standardised root mean square residual (SRMR) indices [4]. The cut-off values were as follows: 0.95 for TLI and CFI, 0.08 for RMSEA and 0.06 for SRMR [4,39]. For the bifactor models, we also estimated the proportion of variance explained by group and general factors, together with differences between common factor loadings for the single and bifactor models [38]. Moreover, residual correlations were calculated for the single factor models and r > 0.2 was selected as the cut-off for determining the presence of local dependency [4]. LISREL was used for these analyses [37].
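As an illustration of how the cut-offs above might be applied, the following minimal sketch checks a set of fit indices and flags item pairs whose residual correlation exceeds 0.2. The function names, the direction of each comparison (higher is better for CFI and TLI, lower is better for RMSEA and SRMR) and all example values are assumptions made for illustration; they are not taken from Table 2 or from the LISREL output.

```python
# Cut-offs quoted in the text; the direction of each comparison is an assumption.
CUTOFFS = {"CFI": 0.95, "TLI": 0.95, "RMSEA": 0.08, "SRMR": 0.06}
HIGHER_IS_BETTER = {"CFI", "TLI"}


def check_fit(indices):
    """Return {index_name: bool} telling whether each index meets its cut-off."""
    result = {}
    for name, value in indices.items():
        if name in HIGHER_IS_BETTER:
            result[name] = value >= CUTOFFS[name]
        else:
            result[name] = value <= CUTOFFS[name]
    return result


def flag_local_dependency(residual_correlations, cutoff=0.2):
    """Return the item pairs whose residual correlation exceeds the cut-off."""
    return [pair for pair, r in residual_correlations.items() if r > cutoff]


# Hypothetical values for illustration only (not the study's results).
example_fit = {"CFI": 0.98, "TLI": 0.97, "RMSEA": 0.07, "SRMR": 0.05}
example_residuals = {("item_a", "item_b"): 0.21, ("item_a", "item_c"): 0.05}

fit_ok = check_fit(example_fit)
dependent_pairs = flag_local_dependency(example_residuals)
```

A flagged pair would then be inspected substantively rather than removed automatically.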

Differential item and differential bundle functioning analysis by sex
The simultaneous item bias test (SIBTEST) framework was used to assess DIF. SIBTEST is a nonparametric method which enables DIF to be tested both at item and item bundle levels [40]. An item bundle is a subset of substantively homogeneous or statistically dimensionally homogeneous items which measure a dimension secondary to the dominant dimension measured for the entire pool [40]. In this study, the bundles consisted of the four mobility item pool domains. SIBTEST permits formal statistical testing of item DIF and DBF, and a magnitude measure, β. The β scale is the probability scale for single item analysis and the expected score scale for bundle analysis. Bundle β is simply the sum of item β for each of the bundle items [41].
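The additive relation between item β and bundle β can be sketched as follows. The example contrasts amplification (small effects in the same direction accumulate) with cancellation (effects in opposite directions offset each other). The β values are hypothetical and chosen only for illustration; they are not SIBTEST output and are not taken from Table 3.

```python
def bundle_beta(item_betas):
    """Bundle-level beta computed as the sum of the item-level betas."""
    return sum(item_betas)


# Hypothetical item betas (illustrative only).
same_sign_bundle = [0.04, 0.03, 0.05, 0.02]     # small effects, same direction
mixed_sign_bundle = [0.04, -0.03, 0.05, -0.06]  # effects in opposite directions

amplified = bundle_beta(same_sign_bundle)   # effects accumulate
cancelled = bundle_beta(mixed_sign_bundle)  # effects offset each other
```

In the first case the bundle effect is larger than any single item effect, even though each item alone might not reach significance.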
Standard item DIF analysis uses an internal criterion, total score or a latent ability estimate, as a conditioning variable [35]. Since the conditioning variable should not have any items with significant DIF, a prior purification stage was implemented before the definitive item DIF analysis. The two types of DIF, uniform and nonuniform, were analysed: the Poly-SIBTEST (SIBTEST for ordinal data) was used to assess uniform DIF and the Crossing-SIBTEST for non-uniform DIF [42,43]. As only binary data can be analysed with the Crossing-SIBTEST, categories on the rating scale were combined as follows: no difficulty vs. the rest. Items were flagged for DIF if P < 0.05, using Bonferroni correction for multiple testing. We also conducted a sensitivity analysis of DIF: for uniform DIF we assessed differences between item locations produced in a Rasch RSM analysis, for each group, using t-tests; for non-uniform DIF, we used TestGraf to graphically examine the differences between the item response curves for each group.
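The flagging rule can be sketched as below, assuming the Bonferroni correction was applied by dividing the nominal level of 0.05 by the number of items tested (the text does not state whether the level or the P values were adjusted; the two are equivalent). The function name and the P values are hypothetical.

```python
def bonferroni_flags(p_values, alpha=0.05):
    """Flag each item whose P value is below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return {item: p < threshold for item, p in p_values.items()}


# Hypothetical P values for four items (illustrative only).
example_p = {"item_1": 0.0004, "item_2": 0.03, "item_3": 0.2, "item_4": 0.012}

# With four tests the per-item threshold is 0.05 / 4 = 0.0125.
flags = bonferroni_flags(example_p)
```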
To examine DBF, which is the cumulative effect of significant and nonsignificant item DIF across the item domain, we used two external criteria as conditioning variables, PF-10 and SPPB. Since PF-10 is a self-report measure, this criterion is the closest to the mobility item pool. However, SPPB, which is a mobility standard based on objective performance, can be useful for detecting pervasive DIF produced by self-report measures. Analysing DBF entails analysing item DIF, and therefore the results of the latter are also given.

IRT analysis: Rasch RSM
The item pool was calibrated using the Rasch RSM, the simplest Rasch model for polytomous items [44]. The RSM allows items to vary in their level of difficulty but assumes that all items are equally discriminating and share the same rating scale structure [44]. Due to its more restrictive nature, it is robust for small or medium sized samples and is likely to provide more generalisable results [45]. In the RSM, the K response categories are separated by K - 1 intersection parameters (thresholds), which are considered equal across items, and an item location is described by a single parameter that indicates the difficulty or ease of the item relative to the category thresholds [34]. The RSM enables estimates of item location, category thresholds and subject score to be placed on the same metric. The fit of data to the RSM was assessed with infit and outfit mean square error statistics, with values <0.6 or >1.4 used as the cut-offs for possible item deletion [9,44]. Item deletion was implemented sequentially and concluded once none of the remaining items showed misfit. To assess the accuracy of the final item bank, the test information function and the corresponding standard error function, the reciprocal of its square root [46], were calculated. The person reliability index (analogous to Cronbach's alpha, but excluding extreme scores [47]) was also calculated. To examine item bank coverage and suitability for the sample, item difficulties and person scores were plotted together, centering the scale on zero logits, the average difficulty of the items. Finally, the mobility item bank and the PF-10 items were grouped according to their response options and then co-calibrated onto one common construct (mobility). We used the same pivot anchor for both rating scales: the step from "no difficulty" (or "no limitation") to the next [48]. WINSTEPS was used for these analyses [49].
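The structure of the RSM described above can be sketched in a few lines. The code below is a minimal illustration and not the WINSTEPS implementation: it computes the category probabilities for one item from a person score, an item location and a shared set of thresholds, then derives item information (the variance of the item score under a Rasch model) and the standard error as the reciprocal of the square root of the test information. All function names and numerical values are hypothetical.

```python
import math


def rsm_category_probs(theta, delta, thresholds):
    """Rating scale model probabilities for categories 0..K-1.

    theta: person score; delta: item location; thresholds: the K - 1
    category thresholds shared by all items (all in logits).
    """
    # Cumulative sums of (theta - delta - tau_j); category 0 has an empty sum.
    exponents = [0.0]
    for tau in thresholds:
        exponents.append(exponents[-1] + (theta - delta - tau))
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(exponents)
    weights = [math.exp(e - top) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


def item_information(theta, delta, thresholds):
    """Item information: the variance of the item score at theta."""
    probs = rsm_category_probs(theta, delta, thresholds)
    mean = sum(k * p for k, p in enumerate(probs))
    return sum((k - mean) ** 2 * p for k, p in enumerate(probs))


def standard_error(theta, deltas, thresholds):
    """Standard error at theta: 1 / sqrt(sum of item information)."""
    info = sum(item_information(theta, d, thresholds) for d in deltas)
    return 1.0 / math.sqrt(info)


# Hypothetical three-category scale and item locations (illustrative only).
example_thresholds = [-1.0, 1.0]
example_probs = rsm_category_probs(0.0, 0.0, example_thresholds)
example_se = standard_error(0.0, [-1.0, 0.0, 1.0], example_thresholds)
```

Because the thresholds are shared, items differ only in location, which is what allows item locations, thresholds and person scores to be placed on the same logit metric.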

Missing data
All of the analyses except the RSM analysis were performed using imputed data obtained through matching, employing the PRELIS (LISREL) Impute Missing Value dialog box. For the RSM analysis, Joint Maximum Likelihood was implemented as the estimation method. This method does not require missing data to be imputed but considers such data ignorable.

Results
Table 1 presents the demographic and clinical characteristics of the subjects.

Descriptive analysis of the item pool
Three items had 90% or more of responses in the first response option ("no difficulty"). Item-test correlations ranged between 0.53 and 0.83. The percentage of missing responses per item was less than 5% in all cases except two, where it was slightly higher.

IRT assumptions
The item response curves had a monotonic relation with the construct for all the items; however, the slopes of three items were not steep enough (items previously identified with percentages > 90% in the first response option). As regards the characteristic response curves, for the majority of the items the intermediate option curves ("some difficulty", "much difficulty") lacked a maximum over a unique interval of the scale. Therefore, we examined two possibilities: combining both intermediate options or combining the last two options, i.e., no difficulty, some + much difficulty and unable to do, vs. no difficulty, some difficulty, much difficulty + unable to do. The first solution was clearly better since the curves for all the items would then have a maximum over a unique interval of the scale, whilst in the second solution, the curves for the intermediate option would lack a maximum for the majority of the items. Figure 1 shows examples of these curves for four items with each of the three rating scales. Consequently, we eliminated the three items which were flagged and recoded the rating scale for the successive analyses into three categories: no difficulty, some/much difficulty and unable to do.
Table 2 gives the confirmatory factor analysis results both for the entire sample and separately for men and women. Item loadings and fit indices of the single factor model supported a unidimensional interpretation of the item pool. Furthermore, the results for the bifactor model indicated that the influence of the domains (group factors) did not distort this interpretation: the differences between common factor loadings in the bifactor model and the single-factor model did not exceed 0.10, with a median of 0.01; the group factors explained only 9.29% of variance vs. 66.43% for the common factor, and no item had a higher loading for the group factor than for the common factor.
This pattern of results was repeated in analyses by sex, although the influence of the CLP group factor was higher in men. All the residual correlations for the single-factor model were lower than 0.2, except one which was 0.21 (the items "Sitting, bend over to pick something up" and "Standing, bend down to pick something up"); consequently, we considered that there were no local dependencies in the item pool.

DBF and DIF analysis
Standard DIF analysis with the purified conditioning variable flagged the same items with significant DIF as the DIF analysis with the unpurified conditioning variable. Table 3 gives a summary of DIF results. No item was flagged for non-uniform DIF, but there was evidence of uniform DIF: one item from the Walking domain (W11), three from the BP domain (BP10, BP14, BP15) and two from the CLP domain (CLP03, CLP05) were flagged for significant DIF. No item from the UDS domain was flagged for significant DIF. Furthermore, most of the Walking domain items presented negative (nonsignificant) DIF and all the CLP domain items showed positive (significant or nonsignificant) DIF.
DIF analysis with the two external criteria as conditioning variables produced very similar results: most of the CLP domain items showed significant item DIF, and the BP domain items which were flagged for significant item DIF were the same as those which had been flagged by the standard item DIF analysis. The results of DBF analysis also coincided with the two external criteria: three domains presented DBF (the Walking domain was the exception), but the magnitude was only substantial and consistent across the items in the CLP domain (Table 3).
We decided to delete the items that were consistently (by all three criteria) flagged for significant DIF, but retained one of them (BP14) because it measured at the highest level of the construct.

Rasch RSM analysis
Six items (one from the Walking domain, one from the UDS domain and four from the BP domain) were iteratively eliminated because of misfit. Table 4 shows the category thresholds, item locations and mean square error statistics for the remaining 35 items (15 Walking items, 10 UDS items, 7 BP items and 3 CLP items). Item pool coverage and accuracy were satisfactory throughout the entire continuum of mobility, with the exception of the upper level of capacity, which corresponds to more demanding activities than running 500 m without difficulty or performing vigorous activities (Figures 2 and 3). In total, 6.7% of people obtained the lowest score (greatest capacity or least mobility limitation) and no person received the maximum score. The person reliability index was 0.95. Figures 2 and 3 also show the results for co-calibration of LLM-IB and PF-10.
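The item counts reported across the analyses can be reconciled with a short arithmetic check. The figures are those given in the text (48 items in the initial pool, three removed as too easy, four removed for DIF and six removed for misfit); the variable names are only illustrative.

```python
# Item counts reported in the text.
initial_pool = 48
removed = {"too_easy": 3, "dif": 4, "misfit": 6}

# Items remaining after all three elimination steps.
remaining = initial_pool - sum(removed.values())

# Final composition of the item bank by domain.
by_domain = {"Walking": 15, "UDS": 10, "BP": 7, "CLP": 3}
domain_total = sum(by_domain.values())
```

Both routes give the same total, which matches the 35 items of the final item bank.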

Discussion
In this paper, we present the development and validation of a mobility item pool in a sample of 593 older people attending primary health care practices in Spain. Item content was based on ICF mobility indicators, and the item stems and response options concerned difficulty in performing an activity without external help. We examined IRT assumptions, analysed DIF/DBF by sex and calibrated the item pool with the Rasch RSM. No evidence was found of non-uniform DIF but we did observe uniform DIF and DBF. Although the confirmatory factor analysis results satisfied stringent criteria for unidimensionality, the DBF results called this conclusion into question, mainly because with the exception of the Walking domain, all other domains showed DBF, notably the CLP domain. Following the Rasch RSM analysis, 35 items remained in the pool and formed the Late Life Mobility item bank (LLM-IB), which measured with a reliability of 0.90 or higher across the entire spectrum of mobility, except at the extreme end of better function. Lastly, the 35 items were co-calibrated with the PF-10 items.
A noteworthy aspect of this study is that, to the best of our knowledge, this is the first time in the literature on patient reported outcomes that DBF has been analysed. To achieve this, in addition to examining DIF according to standard procedure, we also examined augmented DIF at domain level (DBF) using two external criteria as conditioning variables: the PF-10 scale and SPPB. Results of the DIF/DBF analysis with the two external criteria were very similar, suggesting no bias in self-report versus performance-based scales as a method to measure late life mobility: most of the CLP domain items and three BP domain items were flagged for significant DIF. Standard DIF results were less similar to those above, since fewer CLP domain items were identified as presenting significant DIF and there were more items with DIF, significant or nonsignificant, with opposite signs: most of the Walking domain items were negative and all of the CLP domain items were positive. This has also been observed recently during the development of the European Organisation for Research and Treatment of Cancer Physical Function item bank, and the most plausible explanation is that both bundles/domains measure different secondary dimensions [24,25,35]. Although conditioning with an internal criterion such as total score produces DIF values with a trade-off between positive and negative values, as DIF values are statistically dependent [26], it is interesting that the items which systematically presented opposite values were Walking and CLP items. However, when an external criterion is used as a conditioning variable, statistical dependence disappears [26]. Thus, DIF/DBF analysis using SPPB and PF-10 as conditioning variables revealed that CLP measured a secondary dimension that produced significant DIF and DBF, but Walking domain items produced neither DBF nor DIF, with the exception of one item according to SPPB but none according to PF-10.
Therefore, standard DIF analysis indicated that Walking items and CLP items measured different domains and DIF/DBF analysis revealed that Walking was the core dimension of the mobility construct and CLP was a secondary dimension that produced DBF. This interpretation, that CLP items measure a secondary dimension of the mobility construct, is also consistent with results from non-psychometric studies, which have reported that gender differences are greater in items in this domain than in other mobility domains [16][17][18][19][20][21] and that these differences do not disappear after adjustment for important covariates [18,19,51]. These results are also consistent with those found in the fields of geriatric frailty and sarcopenia, where these items are commonly referred to as indicators of strength: walking and strength constitute two separate sub-dimensions of the frailty construct [52,53], and strength is a predictor of mobility decline and is a stronger predictor in men than in women [54]. If a secondary dimension produces DIF, the DIF is benign if the dimension is considered part of the construct, but adverse if the secondary dimension is considered a nuisance [25,40]. Therefore, deciding whether the strength domain produces benign DIF or adverse DIF is a theoretical issue, but the data show that the inclusion of strength items increases gender differences in mobility. When validating the LLM-IB, we decided that the strength domain produces benign DBF and we excluded only those items that were consistently flagged for significant DIF.
We used the Rasch RSM to calibrate the item bank and eliminated six of the 41 items that still remained in the item pool, having previously eliminated three for being too easy and four due to DIF. Thus, 35 items remained and constituted the LLM-IB. Most of the Walking and UDS items were retained since they did not present any of the problems observed in the items in the other two domains. We believe that these results help to explain the predominance of walking and going up and down stairs items in the fixed and adaptive physical function measures. Indeed, in the PF-10 and Health Assessment Questionnaire II [55], most of the items are from the Walking or UDS domains. Among the new measures, short forms and computer adaptive test applications developed from item banks such as the Patient Reported Outcomes Measurement Information System Physical Function item bank [56] or the Activity Measure for Post Acute Care mobility item bank [6] also show a predominance of items from the Walking and UDS domains. This occurs even if a content balancing algorithm is introduced to select the first items from the computer adaptive test applications, since the greater wealth of information contained in the Walking and UDS items, calibrated with IRT models which included a discrimination parameter, means that in the end, these achieve greater representation.
The item pool originally contained four response options, but a graphical, non-parametric IRT analysis showed that the number of response options per item should be reduced. We examined two rating scale alternatives, one combining the two intermediate options ("some difficulty" and "much difficulty") whilst the other combined the two options reflecting greatest difficulty ("much difficulty" and "unable to do"). We chose the first because it was psychometrically better, and because it is common practice to distinguish between difficulty and incapacity in research on the disablement process. Our sample consisted of older people, generally with a poor educational level (reflecting the current cohort of the elderly population in Spain), which alone may explain why a rating scale with three options works better than a rating scale with more [57].
This study has various limitations. Firstly, in the DBF analysis, one of the bundles, the CLP domain, contained only five items. Consequently, the idiosyncrasies of these items, rather than the domain they represent, may constitute an alternative explanation for our findings, since our interpretation assumes that five items are a valid measure of the domain. However, the items included are among the most common in the literature. In addition, care was taken not to include items that were too demanding and which would thus have favoured men even more. Secondly, although the use of two conditioning variables which are widely accepted as standard physical function and mobility measures is one of the strengths of this analysis, the study lacked a similar standard for the CLP domain: an objective measure of strength would have enhanced the construct validity of the findings. Thirdly, because DIF by age has repeatedly been found for many items in measures of physical function, the extrapolation of our results beyond samples of older people is questionable. Finally, our findings are exclusively cross-sectional. We anticipate validating the item bank and several fixed forms with the longitudinal data collected after monitoring the same cohort for 18 months with outcome variables such as mortality, dependency and hospitalization.

Conclusions
We have designed an item bank in Spanish to measure mobility in older primary care patients which is free from item bias across gender and was calibrated using the Rasch RSM. Item bank accuracy and coverage were satisfactory throughout the entire continuum of mobility, with the exception of the upper level of capacity, suggesting the desirability of replenishing the item bank with items that measure at a high level of mobility function. Furthermore, our results indicate that the walking and going up and down stairs items form the core of the mobility construct whilst strength items form a secondary dimension that produces augmented DIF. These results highlight the desirability of stratifying by domain and weighting domain representation when selecting items to create fixed or adaptive mobility forms for older people, with strength items given only a marginal role. Further research is needed to determine how best to include strength items in a mobility measure, or whether it would be more appropriate to design separate measures for each construct.