Acceptability and data quality - Completeness of item- and scale-level data.
● Score distributions (floor/ceiling effects and skew of scale scores)
● Even distribution of endorsement frequencies across response categories (no single category endorsed by >80% of respondents)
● % of item-level missing data (<10%) 
● Low number of persons at extreme (i.e. floor/ceiling) ends of the measurement continuum
● % of computable scale scores (score computable when >50% of items completed)
● Items in scales rated ‘not relevant’ <35%
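The data-quality thresholds above can be computed directly from a raw persons × items response matrix. The sketch below is illustrative only (the helper name and the 0–4 scoring range are assumptions, not from the source), with NaN marking missing responses:

```python
import numpy as np

def acceptability_stats(item_scores, floor=0, ceiling=4):
    """Data-quality indicators for a persons x items matrix.

    item_scores: 2-D array-like with np.nan for missing responses.
    floor/ceiling: minimum and maximum possible item scores
    (assumed 0-4 here).
    """
    item_scores = np.asarray(item_scores, dtype=float)
    pct_missing = np.isnan(item_scores).mean() * 100          # criterion: <10%
    totals = np.nansum(item_scores, axis=1)                   # simple sum score
    n_items = item_scores.shape[1]
    pct_floor = np.mean(totals == floor * n_items) * 100      # criterion: <15%
    pct_ceiling = np.mean(totals == ceiling * n_items) * 100  # criterion: <15%
    return pct_missing, pct_floor, pct_ceiling
```

In practice the missing-data percentage would be reported per item as well as overall; this sketch only shows the aggregate checks.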
Scaling assumptions - Legitimacy of summing a set of items (items should measure a common underlying construct).
● Similar item mean scores and SDs 
● Low residual correlations between items (r <0.30)
● Items have adequate corrected ITC (ITC ≥0.3) 
● High residual correlation between a pair of items (r >0.60) suggests item redundancy
● Items have similar ITCs 
● Items sharing common variance suggest unidimensionality
● Items do not measure at the same point on the scale
● Evenly spaced items spanning whole measurement range
Item response categories - response categories are ordered in a logical hierarchy.
● Ordered set of response thresholds for each scale item
Targeting - extent to which the range of the variable measured by the scale matches the range of that variable in the study sample.
● Scale scores spanning entire scale range
● Person-item threshold distribution: person locations should be covered by items and item locations covered by persons when both calibrated on the same metric scale 
● Floor and ceiling (proportion sample at minimum and maximum scale score) effects should be low (<15%) 
● Skewness statistics should range from −1 to +1 
● Good targeting demonstrated by the mean location of items and persons around zero
● No published criteria for item level targeting
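The skewness criterion (−1 to +1) can be checked with the standard third-moment formula. A minimal sketch (hypothetical helper, using the population SD):

```python
import numpy as np

def scale_skewness(scale_scores):
    """Fisher skewness of total scale scores
    (criterion: between -1 and +1)."""
    x = np.asarray(scale_scores, dtype=float)
    m = x.mean()
    s = x.std()  # population SD
    return ((x - m) ** 3).mean() / s ** 3
```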
Internal consistency - extent to which items comprising a scale measure the same construct (e.g. homogeneity of the scale).
● Cronbach's alphas for summary scores (adequate scale internal consistency ≥0.70)
● High person separation index (>0.7); quantifies how reliably person measurements are separated by the scale items
● Item-total r between +0.4 and +0.6 indicates items moderately correlated with scale scores; higher values indicate items well correlated with scale scores 
● Power-of-test-of-fit statistics indicate the power to detect the extent to which the data do not fit the model 
● Items with ordered thresholds
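Cronbach's alpha follows directly from the item and total-score variances: alpha = k/(k−1) × (1 − Σ item variances / total-score variance). A minimal sketch (hypothetical helper, using sample variances):

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a persons x items matrix
    (criterion: >= 0.70)."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = scores.sum(axis=1).var(ddof=1)     # variance of total score
    return k / (k - 1) * (1 - item_vars / total_var)
```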
*Test-retest reliability - stability of a measuring instrument.
● Intra-class correlation coefficient (ICC) >0.70 between test and retest scores 
● Statistical stability across time points (no uniform or non-uniform item DIF by time point; p>0.05 or Bonferroni-adjusted value)
● Pearson r: >0.7 indicates reliable scale stability
● Involves accumulating evidence from different forms of reliability assessment
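The intra-class correlation for test-retest agreement is commonly estimated as ICC(2,1) from a two-way ANOVA decomposition. A minimal sketch (hypothetical helper; two-way random effects, absolute agreement, one test and one retest score per person):

```python
import numpy as np

def icc_agreement(test, retest):
    """ICC(2,1): two-way random effects, absolute agreement, for
    test-retest scores (criterion: > 0.70)."""
    x = np.column_stack([np.asarray(test, float), np.asarray(retest, float)])
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)
    col_means = x.mean(axis=0)
    msr = k * ((row_means - grand) ** 2).sum() / (n - 1)  # between persons
    msc = n * ((col_means - grand) ** 2).sum() / (k - 1)  # between occasions
    sse = ((x - row_means[:, None] - col_means[None, :] + grand) ** 2).sum()
    mse = sse / ((n - 1) * (k - 1))                        # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

Because it measures absolute agreement, a systematic shift between occasions lowers this ICC even when the Pearson r stays at 1.0.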
Content validity - extent to which the content (items) of a scale is representative of the conceptual construct it is intended to measure.
● Consideration of item sufficiency and the target population
● Clearly defined construct
● Qualitative evidence from individuals for whom the measure is targeted, expert opinion and literature review (e.g. theoretical and/or conceptual definitions).
● Validity comes from careful item construction and consideration of what each item is meant to measure, then testing against model expectations
i) Within-scale analyses - extent to which a distinct construct is being measured and that items can be combined to form a scale score.
● Cronbach alpha for scale scores >0.70
● Fit residuals (item-person interaction) within the range ±2.5
● ITC >0.30
● Homogeneity coefficient (mean and range of inter-item correlations >0.3)
● Non-significant chi-square (item-trait interaction) values
● Scaling success
● No under- or over-discriminating item characteristic curves (ICCs)
● Mean fit residual close to 0.0; SD approaching 1.0 
● Person fit residuals within the range ±2.5
Measurement continuum - extent to which scale items mark out the construct as a continuum on which people can be measured.
● Individual scale items located across a continuum in the same way locations of people are spread across the continuum 
● Items spread evenly over a reasonable measurement range [40, 41]. Items with similar locations may indicate item redundancy
Response dependency – response to one item determines the response to another.
● Response dependency is indicated by residual r >0.3 for pairs of items [40, 41]
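Given standardized residuals from a fitted Rasch model (the residuals themselves are assumed as input here, as produced by a Rasch analysis package), flagging dependent item pairs reduces to checking pairwise residual correlations against the 0.3 threshold. A sketch with a hypothetical helper name:

```python
import numpy as np

def dependent_pairs(residuals, threshold=0.3):
    """Flag item pairs whose residual correlation exceeds the threshold
    (residual r > 0.3 suggests response dependency).

    residuals: persons x items matrix of standardized Rasch model
    residuals (assumed supplied by the fitting software)."""
    r = np.corrcoef(np.asarray(residuals, float), rowvar=False)
    k = r.shape[0]
    return [(i, j, r[i, j])
            for i in range(k) for j in range(i + 1, k)
            if r[i, j] > threshold]
```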
ii) Between-scale analyses
Criterion validity - hypotheses based on a criterion or ‘gold standard’ measure.
● There are no true gold-standard HRQL, PU-specific or chronic wound-specific measures available 
*Convergent validity - scale correlated with other measures of the same/similar constructs.
● Moderate to high r predicted for similar scales; criteria used as guides to the magnitude of r, as opposed to pass/fail benchmarks (high r >0.7; moderate r = 0.3–0.7; low r <0.3) 
*Discriminant validity – scale not correlated with measures of different constructs
● Low r (<0.3) predicted between scale scores and measures of different constructs (e.g. age, gender)
*Known-groups differences - ability of a scale to differentiate between known groups
● ^Generate hypotheses (based on subgroups known to differ on construct measured) and compare mean scores (e.g. predict a stepwise change in PU-QOL scale scores across 3 PU severity groups and that mean scores would be significantly different)
● Hypothesis testing (e.g. clinical questions are formulated and the empirical testing comes from whether or not data fit the Rasch model)
● Statistically significant differences in mean scores (ANOVA)
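The known-groups comparison reduces to a one-way ANOVA on scale scores across the predefined groups. A minimal sketch computing the F statistic (hypothetical helper; the p-value would come from the F distribution with the stated degrees of freedom):

```python
import numpy as np

def one_way_anova_f(*groups):
    """One-way ANOVA F statistic for comparing mean scale scores
    across known groups (e.g. three PU severity groups)."""
    groups = [np.asarray(g, float) for g in groups]
    all_x = np.concatenate(groups)
    grand = all_x.mean()
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between = len(groups) - 1
    df_within = len(all_x) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)
```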
*Differential item functioning (item bias) - the extent of any conditional relationships between item response and group membership.
● Persons with similar ability should respond in similar ways to individual items regardless of group membership (e.g. age) 
● Uniform DIF - group differences in item response are constant across the measurement continuum
● Non-uniform DIF - group differences in item response vary across the measurement continuum; can be considered at the 1% (Bonferroni-adjusted) and 5% levels