# Table 1
**Psychometric tests and criteria used in the evaluation of the PU-QOL instrument**

| Psychometric property | Traditional methods - tests and criteria | Rasch methods - tests and criteria |
|---|---|---|
| Acceptability and data quality - completeness of item- and scale-level data. | ● Score distributions (floor/ceiling effects and skew of scale scores) | ● Even distribution of endorsement frequencies across response categories (>80%) |
| | ● % of item-level missing data (<10%) [30] | ● Low number of persons at the extreme (i.e. floor/ceiling) ends of the measurement continuum |
| | ● % of computable scale scores (>50% completed items) [31] | |
| | ● Items in scales rated ‘not relevant’ <35% | |
| Scaling assumptions - legitimacy of summing a set of items (items should measure a common underlying construct). | ● Similar item mean scores [32] and SDs [33] | ● Positive residual r between items (<0.30) |
| | ● Items have adequate corrected ITC (ITC ≥0.3) [34] | ● High negative residual r (>0.60) suggests redundancy |
| | ● Items have similar ITCs [34] | ● Items sharing common variance suggests uni-dimensionality |
| | ● Items do not measure at the same point on the scale | ● Evenly spaced items spanning the whole measurement range |
| Item response categories - categories in a logical hierarchy. | ● NA | ● Ordered set of response thresholds for each scale item |
| Targeting - extent to which the range of the variable measured by the scale matches the range of that variable in the study sample. | ● Scale scores spanning the entire scale range | ● Person-item threshold distribution: person locations should be covered by items and item locations covered by persons when both are calibrated on the same metric scale [35] |
| | ● Floor and ceiling effects (proportion of the sample at the minimum and maximum scale score) should be low (<15%) [36] | ● Good targeting demonstrated by the mean locations of items and persons being around zero |
| | ● Skewness statistics should range from −1 to +1 [37] | |
| | ● No published criteria for item-level targeting | |
| **Reliability** | | |
| Internal consistency - extent to which items comprising a scale measure the same construct (e.g. homogeneity of the scale). | ● Cronbach's alpha for summary scores (adequate scale internal consistency is ≥0.70) [22] | ● High person separation index (>0.7) [38]; quantifies how reliably person measurements are separated by items |
| | ● Item-total r between +0.4 and +0.6 indicates items are moderately correlated with scale scores; higher values indicate items well correlated with scale scores [22] | ● Power-of-tests indicate the power in detecting the extent to which the data do not fit the model [24] |
| | | ● Items with ordered thresholds |
| *Test-retest reliability - stability of a measuring instrument. | ● Intra-class correlation coefficient >0.70 between test and retest scores [11] | ● Statistical stability across time points (no uniform or non-uniform item DIF; p > 0.05 or Bonferroni-adjusted value) |
| | ● Pearson r >0.7 indicates reliable scale stability | |
| **Validity** - involves accumulating evidence from different forms | | |
| Content validity - extent to which the content (items) of a scale is representative of the conceptual construct it is intended to measure. | ● Consideration of item sufficiency and the target population | ● Clearly defined construct |
| | ● Qualitative evidence from individuals for whom the measure is targeted, expert opinion and literature review (e.g. theoretical and/or conceptual definitions) [9] | ● Validity comes from careful item construction and consideration of what each item is meant to measure, followed by testing against model expectations |
| **Construct validity** | | |
| i) Within-scale analyses - extent to which a distinct construct is being measured and items can be combined to form a scale score. | ● Cronbach's alpha for scale scores >0.70 | ● Fit residuals (item-person interaction) within the range ±2.5 |
| | ● ITC >0.30 | ● Non-significant chi-square (item-trait interaction) values |
| | ● Homogeneity coefficient (IIC mean and range >0.3) | ● No under- or over-discriminating ICCs |
| | ● Scaling success | ● Mean fit residual close to 0.0; SD approaching 1.0 [39] |
| | | ● Person fit residuals within the range ±2.5 |
| Measurement continuum - extent to which scale items mark out the construct as a continuum on which people can be measured. | ● NA | ● Individual scale items located across a continuum in the same way that the locations of people are spread across the continuum [26] |
| | | ● Items spread evenly over a reasonable measurement range [40, 41]; items with similar locations may indicate item redundancy |
| Response dependency - response to one item determines the response to another. | ● NA | ● Response dependency is indicated by residual r >0.3 for pairs of items [40, 41] |
| ii) Between-scale analyses | | |
| Criterion validity - hypotheses based on a criterion or ‘gold standard’ measure. | ● There are no true gold-standard HRQL [42], PU-specific or chronic wound-specific measures available [12] | ● NA |
| *Convergent validity - scale correlates with other measures of the same/similar constructs. | ● Moderate to high r predicted for similar scales; criteria used as guides to the magnitude of r rather than as pass/fail benchmarks (high r >0.7; moderate r = 0.3-0.7; low r <0.3) [43] | ● NA |
| *Discriminant validity - scale not correlated with measures of different constructs. | ● Low r (<0.3) predicted between scale scores and measures of different constructs (e.g. age, gender) | ● NA |
| *Known-groups differences - ability of a scale to differentiate between known groups. | ● ^Generate hypotheses (based on subgroups known to differ on the construct measured) and compare mean scores (e.g. predict a stepwise change in PU-QOL scale scores across three PU severity groups and that mean scores would differ significantly) | ● Hypothesis testing (e.g. clinical questions are formulated and the empirical testing comes from whether or not the data fit the Rasch model) |
| | ● Statistically significant differences in mean scores (ANOVA) | |
| *Differential item functioning (item bias) - extent of any conditional relationships between item response and group membership. | ● NA | ● Persons with similar ability should respond in similar ways to individual items regardless of group membership (e.g. age) [44] |
| | | ● Uniform DIF - uniformity amongst differences between groups |
| | | ● Non-uniform DIF - non-uniformity amongst differences between groups; can be considered at the 1% (Bonferroni-adjusted) and 5% significance levels |
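Two of the traditional criteria above, Cronbach's alpha (adequate at ≥0.70) and corrected item-total correlations (ITC ≥0.3), are straightforward to compute. The sketch below illustrates them on made-up response data; it is not taken from the PU-QOL study, and the item scores are purely hypothetical.

```python
# Illustrative sketch (not from the paper): Cronbach's alpha and
# corrected item-total correlations for a small hypothetical scale.
from statistics import mean, pvariance


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def cronbach_alpha(items):
    """items: one inner list of scores per item, respondents aligned."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # per-respondent totals
    item_var = sum(pvariance(scores) for scores in items)
    return (k / (k - 1)) * (1 - item_var / pvariance(totals))


def corrected_itc(items):
    """Correlate each item with the scale total *excluding* that item."""
    totals = [sum(scores) for scores in zip(*items)]
    return [
        pearson_r(scores, [t - s for t, s in zip(totals, scores)])
        for scores in items
    ]


# Hypothetical data: 4 items scored 0-3 by 6 respondents.
items = [
    [0, 1, 2, 2, 3, 3],
    [1, 1, 2, 3, 3, 3],
    [0, 2, 2, 2, 3, 3],
    [1, 2, 1, 3, 2, 3],
]
print(round(cronbach_alpha(items), 2))  # alpha ≈ 0.92, above the 0.70 criterion
print([round(r, 2) for r in corrected_itc(items)])  # each ITC above 0.3
```

Correlating each item against the total minus that item (the "corrected" ITC) avoids inflating the correlation by the item's own contribution to the total, which is why the table specifies corrected ITCs.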