Psychometric property | Traditional methods - test and criteria | Rasch methods - test and criteria |
---|---|---|
Acceptability and data quality - completeness of item- and scale-level data. | ● Score distributions (floor/ceiling effects and skew of scale scores) <br> ● % of item-level missing data (<10%) [30] <br> ● % of computable scale scores (>50% completed items) [31] <br> ● Items in scales rated 'not relevant' <35% | ● Even distribution of endorsement frequencies across response categories (>80%) <br> ● Low number of persons at the extreme (i.e. floor/ceiling) ends of the measurement continuum |
Scaling assumptions - legitimacy of summing a set of items (items should measure a common underlying construct). | ● Items have adequate corrected ITC (ITC ≥0.3) [34] <br> ● Items have similar ITCs [34] <br> ● Items do not measure at the same point on the scale | ● Positive residual r between items (<0.30) <br> ● High negative residual r (>0.60) suggests redundancy <br> ● Items sharing common variance suggests unidimensionality <br> ● Evenly spaced items spanning the whole measurement range |
Item response categories - categories in a logical hierarchy. | ● NA | ● Ordered set of response thresholds for each scale item |
Targeting - extent to which the range of the variable measured by the scale matches the range of that variable in the study sample. | ● Scale scores spanning the entire scale range <br> ● Floor and ceiling effects (proportion of the sample at the minimum and maximum scale score) should be low (<15%) [36] <br> ● Skewness statistics should range from −1 to +1 [37] <br> ● No published criteria for item-level targeting | ● Person-item threshold distribution: person locations should be covered by items and item locations covered by persons when both are calibrated on the same metric scale [35] <br> ● Good targeting demonstrated by the mean location of items and persons around zero |
Reliability | | |
Internal consistency - extent to which items comprising a scale measure the same construct (e.g. homogeneity of the scale). | ● Cronbach's alpha for summary scores (adequate scale internal consistency is ≥0.70) [22] <br> ● Item-total r between +0.4 and +0.6 indicates items are moderately correlated with scale scores; higher values indicate items well correlated with scale scores [22] | ● High person separation index (>0.7) [38], which quantifies how reliably person measurements are separated by the items <br> ● Power-of-tests indicate the power in detecting the extent to which the data do not fit the model [24] <br> ● Items with ordered thresholds |
*Test-retest reliability - stability of a measuring instrument. | ● Intra-class r coefficient >0.70 between test and retest scores [11] <br> ● Pearson r >0.7 indicates reliable scale stability | ● Statistical stability across time points (no uniform or non-uniform item DIF; p > 0.05 or Bonferroni-adjusted value) |
Validity - involves accumulating evidence from different forms | | |
Content validity - extent to which the content (items) of a scale is representative of the conceptual construct it is intended to measure. | ● Consideration of item sufficiency and the target population <br> ● Qualitative evidence from individuals for whom the measure is targeted, expert opinion and literature review (e.g. theoretical and/or conceptual definitions) [9] | ● Clearly defined construct <br> ● Validity comes from careful item construction and consideration of what each item is meant to measure, then testing against model expectations |
Construct validity | | |
i) Within-scale analyses - extent to which a distinct construct is being measured and items can be combined to form a scale score. | ● Cronbach's alpha for scale scores >0.70 <br> ● ITC >0.30 <br> ● Homogeneity coefficient (IIC mean and range >0.3) <br> ● Scaling success | ● Fit residuals (item-person interaction) within the range ±2.5 <br> ● Non-significant chi-square (item-trait interaction) values <br> ● No under- or over-discriminating ICCs <br> ● Mean fit residual close to 0.0; SD approaching 1.0 [39] <br> ● Person fit residuals within the range ±2.5 |
Measurement continuum - extent to which scale items mark out the construct as a continuum on which people can be measured. | ● NA | ● Individual scale items located across a continuum in the same way locations of people are spread across the continuum [26] <br> ● Items spread evenly over a reasonable measurement range [40, 41]; items with similar locations may indicate item redundancy |
Response dependency - response to one item determines the response to another. | ● NA | ● Response dependency is indicated by residual r >0.3 for pairs of items [40, 41] |
ii) Between-scale analyses | | |
Criterion validity - hypotheses based on a criterion or 'gold standard' measure. | ● There are no true gold standard HRQL [42], PU-specific or chronic wound-specific measures available [12] | ● NA |
*Convergent validity - scale correlates with other measures of the same or similar constructs. | ● Moderate to high r predicted for similar scales; criteria used as guides to the magnitude of r rather than as pass/fail benchmarks (high r >0.7; moderate r = 0.3-0.7; low r <0.3) [43] | ● NA |
*Discriminant validity - scale does not correlate with measures of different constructs. | ● Low r (<0.3) predicted between scale scores and measures of different constructs (e.g. age, gender) | ● NA |
*Known-groups differences - ability of a scale to differentiate between known groups. | ● ^Generate hypotheses (based on subgroups known to differ on the construct measured) and compare mean scores (e.g. predict a stepwise change in PU-QOL scale scores across 3 PU severity groups, with significantly different mean scores) <br> ● Statistically significant differences in mean scores (ANOVA) | ● Hypothesis testing (e.g. clinical questions are formulated and the empirical testing comes from whether or not the data fit the Rasch model) |
*Differential item functioning (item bias) - extent of any conditional relationships between item response and group membership. | ● NA | ● Persons with similar ability should respond in similar ways to individual items regardless of group membership (e.g. age) [44] <br> ● Uniform DIF: uniformity amongst differences between groups <br> ● Non-uniform DIF: non-uniformity amongst differences between groups; can be considered at 1% (Bonferroni adjusted) and 5% CIs |