Table 1 Psychometric tests and criteria used in the evaluation of the PU-QOL instrument

From: Development and validation of a new patient-reported outcome measure for patients with pressure ulcers: the PU-QOL instrument

| Psychometric property | Traditional methods - test and criteria | Rasch methods - test and criteria |
| --- | --- | --- |
| Acceptability and data quality - completeness of item- and scale-level data; score distributions (floor/ceiling effects and skew of scale scores) | Even distribution of endorsement frequencies across response categories (>80%); % of item-level missing data (<10%) [30]; % of computable scale scores (>50% completed items) [31]; items in scales rated 'not relevant' <35% | Low number of persons at the extreme (i.e. floor/ceiling) ends of the measurement continuum |
| Scaling assumptions - legitimacy of summing a set of items (items should measure a common underlying construct) | Similar item mean scores [32] and SDs [33]; items have adequate corrected ITCs (ITC ≥0.3) [34]; items have similar ITCs [34] | Positive residual r between items (<0.30); high negative residual r (>0.60) suggests redundancy; items sharing common variance suggests uni-dimensionality; items do not measure at the same point on the scale; evenly spaced items spanning the whole measurement range |
| Item response categories - categories in a logical hierarchy | NA | Ordered set of response thresholds for each scale item |
| Targeting - extent to which the range of the variable measured by the scale matches the range of that variable in the study sample | Scale scores spanning the entire scale range; floor and ceiling effects (proportion of the sample at the minimum and maximum scale scores) should be low (<15%) [36]; skewness statistics should range from −1 to +1 [37] | Person-item threshold distribution: person locations should be covered by items and item locations covered by persons when both are calibrated on the same metric scale [35]; good targeting is demonstrated by mean locations of items and persons around zero; no published criteria exist for item-level targeting |
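The traditional acceptability and targeting criteria above are simple descriptive statistics. As an illustration only, the sketch below computes item-level missing data, the proportion of computable scale scores, floor/ceiling effects and skewness for a simulated item matrix; the sample size, item count and 0-2 scoring range are assumptions made for the example, not the actual PU-QOL data.

```python
import numpy as np

# Hypothetical example: 200 patients x 5 items, each item scored 0-2,
# with np.nan marking missing answers (simulated, not PU-QOL data).
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 5)).astype(float)
X[rng.random(X.shape) < 0.04] = np.nan              # ~4% missing at random

# 1) Item-level missing data (criterion: <10% per item)
missing_pct = np.isnan(X).mean(axis=0) * 100

# 2) Computable scale scores (criterion: >50% of items completed)
computable_pct = np.mean((~np.isnan(X)).mean(axis=1) > 0.5) * 100

# 3) Floor/ceiling effects among complete responders
#    (criterion: <15% at the minimum or maximum possible score)
complete = X[~np.isnan(X).any(axis=1)]
scores = complete.sum(axis=1)
floor_pct = np.mean(scores == 0) * 100                  # minimum possible score
ceiling_pct = np.mean(scores == 2 * X.shape[1]) * 100   # maximum possible score

# 4) Skewness of scale scores (criterion: between -1 and +1)
z = (scores - scores.mean()) / scores.std()
skewness = np.mean(z ** 3)

print("missing per item (%):", np.round(missing_pct, 1))
print(f"computable scores: {computable_pct:.1f}%")
print(f"floor {floor_pct:.1f}%, ceiling {ceiling_pct:.1f}%, skewness {skewness:.2f}")
```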

 

Reliability

| Psychometric property | Traditional methods - test and criteria | Rasch methods - test and criteria |
| --- | --- | --- |
| Internal consistency - extent to which items comprising a scale measure the same construct (i.e. homogeneity of the scale) | Cronbach's alpha for summary scores (adequate scale internal consistency is ≥0.70) [22]; item-total r between +0.4 and +0.6 indicates items are moderately correlated with scale scores, with higher values indicating items well correlated with scale scores [22] | High person separation index (>0.7) [38], which quantifies how reliably person measurements are separated by the items; power-of-tests indicate the power in detecting the extent to which the data do not fit the model [24]; items with ordered thresholds |
| *Test-retest reliability - stability of a measuring instrument | Intra-class r coefficient >0.70 between test and retest scores [11]; Pearson r >0.7 indicates reliable scale stability | Statistical stability across time points (no uniform or non-uniform item DIF at p > 0.05 or a Bonferroni-adjusted value) |
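As a rough sketch of the traditional reliability criteria, the code below computes Cronbach's alpha from an item matrix, an intra-class correlation for test-retest stability, and the test-retest Pearson r on simulated data. The ICC variant shown is the two-way random-effects, absolute-agreement ICC(2,1), a common choice; the table does not specify which variant was used, so treat this as one reasonable reading.

```python
import numpy as np

def cronbach_alpha(X):
    """Cronbach's alpha for an item matrix (rows = persons, columns = items)."""
    k = X.shape[1]
    item_vars = X.var(axis=0, ddof=1)
    total_var = X.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

def icc_2_1(test, retest):
    """Two-way random-effects, absolute-agreement ICC(2,1) for two time points."""
    Y = np.column_stack([test, retest])
    n, k = Y.shape
    grand = Y.mean()
    msr = k * ((Y.mean(axis=1) - grand) ** 2).sum() / (n - 1)   # between persons
    msc = n * ((Y.mean(axis=0) - grand) ** 2).sum() / (k - 1)   # between time points
    sse = ((Y - Y.mean(axis=1, keepdims=True)
              - Y.mean(axis=0, keepdims=True) + grand) ** 2).sum()
    mse = sse / ((n - 1) * (k - 1))                             # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Simulated 5-item scale driven by one latent construct, plus a noisy retest
rng = np.random.default_rng(1)
latent = rng.normal(0, 1, 100)
X = latent[:, None] + rng.normal(0, 0.8, (100, 5))
test = X.sum(axis=1)
retest = test + rng.normal(0, 0.5, 100)

print(f"alpha = {cronbach_alpha(X):.2f} (criterion >= 0.70)")
print(f"ICC(2,1) = {icc_2_1(test, retest):.2f} (criterion > 0.70)")
print(f"Pearson r = {np.corrcoef(test, retest)[0, 1]:.2f} (criterion > 0.7)")
```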

Validity

Involves accumulating evidence of different forms.

| Psychometric property | Traditional methods - test and criteria | Rasch methods - test and criteria |
| --- | --- | --- |
| Content validity - extent to which the content (items) of a scale is representative of the conceptual construct it is intended to measure | Consideration of item sufficiency and the target population; qualitative evidence from individuals for whom the measure is targeted, expert opinion and literature review (e.g. theoretical and/or conceptual definitions) [9] | Clearly defined construct; validity comes from careful item construction and consideration of what each item is meant to measure, followed by testing against model expectations |

Construct validity

| Psychometric property | Traditional methods - test and criteria | Rasch methods - test and criteria |
| --- | --- | --- |
| i) Within-scale analyses - extent to which a distinct construct is being measured and items can be combined to form a scale score | Cronbach's alpha for scale scores >0.70; ITC >0.30; homogeneity coefficient (IIC mean and range >0.3); scaling success | Fit residuals (item-person interaction) within the given range of ±2.5; non-significant chi-square (item-trait interaction) values; no under- or over-discriminating ICCs; mean fit residual close to 0.0 with SD approaching 1.0 [39]; person fit residuals within the given range of ±2.5 |
| Measurement continuum - extent to which scale items mark out the construct as a continuum on which people can be measured | NA | Individual scale items located across a continuum in the same way that locations of people are spread across the continuum [26]; items spread evenly over a reasonable measurement range [40, 41]; items with similar locations may indicate item redundancy |
| Response dependency - response to one item determines the response to another | NA | Response dependency is indicated by residual r >0.3 for pairs of items [40, 41] |
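The traditional within-scale criteria (ITC and inter-item correlations) can be checked directly; the Rasch-side diagnostics (fit residuals, item-trait chi-square, item characteristic curves) normally come from dedicated Rasch software and are not reproduced here. A minimal sketch of the traditional checks, on simulated congeneric items:

```python
import numpy as np

def corrected_item_total(X):
    """Corrected ITC: correlate each item with the sum of the remaining items."""
    total = X.sum(axis=1)
    return np.array([np.corrcoef(X[:, j], total - X[:, j])[0, 1]
                     for j in range(X.shape[1])])

rng = np.random.default_rng(2)
latent = rng.normal(0, 1, 300)                        # shared construct
X = latent[:, None] + rng.normal(0, 1.0, (300, 5))    # five congeneric items

itc = corrected_item_total(X)                  # criterion: >0.30 and similar across items
iic = np.corrcoef(X, rowvar=False)             # inter-item correlation matrix
off_diag = iic[np.triu_indices_from(iic, k=1)] # criterion: mean and range >0.3

print("corrected ITCs:", np.round(itc, 2))
print(f"IIC mean {off_diag.mean():.2f}, "
      f"range {off_diag.min():.2f} to {off_diag.max():.2f}")
```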

ii) Between-scale analyses

| Psychometric property | Traditional methods - test and criteria | Rasch methods - test and criteria |
| --- | --- | --- |
| Criterion validity - hypotheses based on a criterion or 'gold standard' measure | There are no true gold-standard HRQL [42], PU-specific or chronic wound-specific measures available [12] | NA |
| *Convergent validity - scale correlates with other measures of the same or similar constructs | Moderate to high r predicted for similar scales; criteria used as guides to the magnitude of r rather than as pass/fail benchmarks (high r >0.7; moderate r = 0.3-0.7; low r <0.3) [43] | NA |
| *Discriminant validity - scale does not correlate with measures of different constructs | Low r (<0.3) predicted between scale scores and measures of different constructs (e.g. age, gender) | NA |
| *Known-groups differences - ability of a scale to differentiate known groups | ^Generate hypotheses (based on subgroups known to differ on the construct measured) and compare mean scores (e.g. predict a stepwise change in PU-QOL scale scores across three PU severity groups, with significantly different mean scores); statistically significant differences in mean scores (ANOVA) | Hypothesis testing (e.g. clinical questions are formulated and the empirical testing comes from whether or not the data fit the Rasch model) |
| *Differential item functioning (item bias) - extent of any conditional relationships between item response and group membership | NA | Persons with similar ability should respond in similar ways to individual items regardless of group membership (e.g. age) [44]; uniform DIF: uniformity amongst differences between groups; non-uniform DIF: non-uniformity amongst differences between groups; can be considered at 1% (Bonferroni-adjusted) and 5% CIs |

Table adapted from [35, 45]. *Additional tests performed for field test two. ^The PU HRQL literature is not well established and was therefore of limited use for identifying clinical parameters with which to formulate known groups. NA No test for the particular psychometric property; SD Standard deviation; ITC Item-total correlation; IIC Inter-item correlation; ICC Item characteristic curve; r Correlation; ANOVA Analysis of variance; DIF Differential item functioning; CI Confidence interval.
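As a final illustration of the known-groups approach, the sketch below compares mean scale scores across three hypothetical PU severity groups with a one-way ANOVA F statistic. The group means, SDs and sizes are invented for the example; a p-value would additionally require the F distribution (e.g. scipy.stats.f.sf), omitted here to keep the sketch dependency-light.

```python
import numpy as np

def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA over a list of 1-D score arrays."""
    scores = np.concatenate(groups)
    grand = scores.mean()
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between = len(groups) - 1
    df_within = scores.size - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical known-groups hypothesis: mean scale scores rise stepwise
# across three PU severity groups (all values simulated)
rng = np.random.default_rng(3)
groups = [rng.normal(mu, 1.5, 40) for mu in (3.0, 4.0, 5.0)]

print("group means:", [round(g.mean(), 2) for g in groups])  # stepwise change?
print(f"F = {one_way_anova_f(groups):.2f}")
```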