Psychometric property | Traditional methods - test and criteria | Rasch methods - test and criteria |
---|---|---|
Acceptability and data quality - completeness of item- and scale-level data. | ● Score distributions (floor/ceiling effects and skew of scale scores) <br> ● % of item-level missing data (<10%) [30] <br> ● % of computable scale scores (>50% completed items) [31] <br> ● Items in scales rated 'not relevant' <35% | ● Even distribution of endorsement frequencies across response categories (>80%) <br> ● Low number of persons at the extreme (i.e. floor/ceiling) ends of the measurement continuum |
Scaling assumptions - legitimacy of summing a set of items (items should measure a common underlying construct). | ● Items have adequate corrected ITC (ITC ≥0.3) [34] <br> ● Items have similar ITCs [34] <br> ● Items do not measure at the same point on the scale | ● Positive residual r between items (<0.30) <br> ● High negative residual r (>0.60) suggests redundancy <br> ● Items sharing common variance suggests unidimensionality <br> ● Evenly spaced items spanning the whole measurement range |
Item response categories - categories in a logical hierarchy. | ● NA | ● Ordered set of response thresholds for each scale item |
Targeting - extent to which the range of the variable measured by the scale matches the range of that variable in the study sample. | ● Scale scores spanning the entire scale range <br> ● Floor and ceiling effects (proportion of the sample at the minimum and maximum scale score) should be low (<15%) [36] <br> ● Skewness statistics should range from −1 to +1 [37] <br> ● No published criteria for item-level targeting | ● Person-item threshold distribution: person locations should be covered by items and item locations covered by persons when both are calibrated on the same metric scale [35] <br> ● Good targeting demonstrated by the mean location of items and persons around zero |
Reliability | | |
Internal consistency - extent to which items comprising a scale measure the same construct (e.g. homogeneity of the scale). | ● Cronbach's alpha for summary scores (adequate scale internal consistency is ≥0.70) [22] <br> ● Item-total r between +0.4 and +0.6 indicates items are moderately correlated with scale scores; higher values indicate items well correlated with scale scores [22] | ● High person separation index (>0.7) [38], which quantifies how reliably person measurements are separated by the items <br> ● Power-of-tests indicate the power in detecting the extent to which the data do not fit the model [24] <br> ● Items with ordered thresholds |
*Test-retest reliability - stability of a measuring instrument. | ● Intra-class r coefficient >0.70 between test and retest scores [11] <br> ● Pearson r >0.7 indicates reliable scale stability | ● Statistical stability across time points (no uniform or non-uniform item DIF; p > 0.05 or Bonferroni-adjusted value) |
Validity - involves accumulating evidence from different forms | | |
Content validity - extent to which the content (items) of a scale is representative of the conceptual construct it is intended to measure. | ● Consideration of item sufficiency and the target population <br> ● Qualitative evidence from individuals for whom the measure is targeted, expert opinion and literature review (e.g. theoretical and/or conceptual definitions) [9] | ● Clearly defined construct <br> ● Validity comes from careful item construction and consideration of what each item is meant to measure, then testing against model expectations |
Construct validity | | |
i) Within-scale analyses - extent to which a distinct construct is being measured and items can be combined to form a scale score. | ● Cronbach's alpha for scale scores >0.70 <br> ● ITC >0.30 <br> ● Homogeneity coefficient (IIC mean and range >0.3) <br> ● Scaling success | ● Fit residuals (item-person interaction) within the range ±2.5 <br> ● Non-significant chi-square (item-trait interaction) values <br> ● No under- or over-discriminating ICCs <br> ● Mean fit residual close to 0.0; SD approaching 1.0 [39] <br> ● Person fit residuals within the range ±2.5 |
Measurement continuum - extent to which scale items mark out the construct as a continuum on which people can be measured. | ● NA | ● Individual scale items located across a continuum in the same way locations of people are spread across the continuum [26] <br> ● Items spread evenly over a reasonable measurement range [40, 41]; items with similar locations may indicate item redundancy |
Response dependency - response to one item determines the response to another. | ● NA | ● Response dependency is indicated by residual r >0.3 for pairs of items [40, 41] |
ii) Between-scale analyses | | |
Criterion validity - hypotheses based on a criterion or 'gold standard' measure. | ● There are no true gold standard HRQL [42], PU-specific or chronic wound-specific measures available [12] | ● NA |
*Convergent validity - scale correlates with other measures of the same or similar constructs. | ● Moderate to high r predicted for similar scales; criteria used as guides to the magnitude of r rather than as pass/fail benchmarks (high r >0.7; moderate r = 0.3-0.7; low r <0.3) [43] | ● NA |
*Discriminant validity - scale does not correlate with measures of different constructs. | ● Low r (<0.3) predicted between scale scores and measures of different constructs (e.g. age, gender) | ● NA |
*Known-groups differences - ability of a scale to differentiate between known groups. | ● ^Generate hypotheses (based on subgroups known to differ on the construct measured) and compare mean scores (e.g. predict a stepwise change in PU-QOL scale scores across 3 PU severity groups, with significantly different mean scores) <br> ● Statistically significant differences in mean scores (ANOVA) | ● Hypothesis testing (e.g. clinical questions are formulated and the empirical testing comes from whether or not the data fit the Rasch model) |
*Differential item functioning (item bias) - extent of any conditional relationships between item response and group membership. | ● NA | ● Persons with similar ability should respond in similar ways to individual items regardless of group membership (e.g. age) [44] <br> ● Uniform DIF: uniformity amongst differences between groups <br> ● Non-uniform DIF: non-uniformity amongst differences between groups; can be considered at 1% (Bonferroni adjusted) and 5% CIs |