Table 2 Appraisal criteria for assessing the psychometric properties of patient-reported outcome measures

From: Patient-reported outcome measures in patients with peripheral arterial disease: a systematic review of psychometric properties

Domain

Criteria

Test-retest reliability

Reliability is the ability of a measure to reproduce the same value on two separate administrations when there has been no change in health.

The intra-class correlation coefficient or weighted kappa should be ≥ 0.70 for group comparisons and ≥ 0.90 if scores will be used to make decisions about individual patients [2].

The mean difference between time point 1 (T1) and time point 2 (T2), tested with a paired t-test or Wilcoxon signed-rank test, should also be reported together with its 95% CI.
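For illustration, a two-way random-effects, single-measure ICC (Shrout and Fleiss's ICC(2,1)) and the paired mean difference with its 95% CI could be computed as sketched below; the patient scores and the choice of ICC model are assumptions made for the example, not values from the review.

```python
import numpy as np
from scipy import stats

# Hypothetical test-retest data: rows = patients, columns = T1 and T2 scores.
scores = np.array([
    [52, 55], [40, 38], [61, 60], [45, 49], [70, 68],
    [33, 35], [58, 57], [49, 46], [64, 66], [37, 40],
], dtype=float)
n, k = scores.shape  # n patients, k administrations (here k = 2)

# Two-way ANOVA decomposition for ICC(2,1) (Shrout & Fleiss).
grand = scores.mean()
ss_rows = k * ((scores.mean(axis=1) - grand) ** 2).sum()
ss_cols = n * ((scores.mean(axis=0) - grand) ** 2).sum()
ss_total = ((scores - grand) ** 2).sum()
ms_rows = ss_rows / (n - 1)
ms_cols = ss_cols / (k - 1)
ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
icc_2_1 = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

# Paired mean difference between T1 and T2 with a 95% CI.
diff = scores[:, 1] - scores[:, 0]
t_stat, p_val = stats.ttest_rel(scores[:, 1], scores[:, 0])
ci = stats.t.interval(0.95, df=n - 1, loc=diff.mean(), scale=stats.sem(diff))

print(f"ICC(2,1) = {icc_2_1:.2f} (>= 0.70 for group comparisons, >= 0.90 for individual decisions)")
print(f"Mean difference T2 - T1 = {diff.mean():.2f}, 95% CI {ci[0]:.2f} to {ci[1]:.2f}, p = {p_val:.3f}")
```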

Internal consistency

Internal consistency assesses whether the items in a scale measure the same underlying construct.

A Cronbach’s alpha of ≥ 0.70 is considered good; for group comparisons it should not exceed 0.92, as higher values are taken to indicate that items in the scale may be redundant. Item-total correlations should be ≥ 0.20 [14].
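Cronbach's alpha and corrected item-total correlations follow directly from the item variances and the variance of the total score; a minimal sketch, assuming a hypothetical item-response matrix:

```python
import numpy as np

# Hypothetical item responses: rows = patients, columns = scale items.
items = np.array([
    [3, 4, 2, 3, 4],
    [1, 2, 1, 2, 1],
    [4, 5, 4, 4, 5],
    [2, 3, 2, 2, 3],
    [5, 4, 5, 5, 4],
    [3, 3, 2, 3, 3],
], dtype=float)
n_items = items.shape[1]
total = items.sum(axis=1)

# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of total score).
item_var_sum = items.var(axis=0, ddof=1).sum()
alpha = n_items / (n_items - 1) * (1 - item_var_sum / total.var(ddof=1))

# Corrected item-total correlations (each item vs. the total of the remaining items).
for j in range(n_items):
    rest = total - items[:, j]
    r = np.corrcoef(items[:, j], rest)[0, 1]
    print(f"Item {j + 1}: corrected item-total r = {r:.2f} (criterion >= 0.20)")

print(f"Cronbach's alpha = {alpha:.2f} (criterion: >= 0.70 and not above 0.92)")
```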

Content validity

Content validity is the extent to which the items clearly reflect the domains of interest.

To demonstrate good content validity, there must be evidence that the instrument was developed by consulting patients and experts, as well as by undertaking a literature review.

Patients should be involved in the development stage and item generation. The opinion of patient representatives should be sought on the constructed scale [2, 14, 16].

Construct validity

Construct validity assesses how well an instrument measures what it was intended to measure.

A correlation coefficient of ≥ 0.60 is considered strong evidence of construct validity. Authors should state specific directional hypotheses and estimate the expected strength of correlation before testing [2, 14, 15].

Criterion validity

Criterion validity assesses the degree of empirical association of the PROM with external criteria or other measures.

A good argument should be made as to why a comparator instrument can be considered a gold standard, and the correlation with that gold standard should be ≥ 0.70 [15].
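Both construct and criterion validity therefore come down to pre-specified correlation checks against the relevant thresholds. A minimal sketch, assuming simulated scores for the PROM, a comparator instrument, and a gold-standard measure (all hypothetical, not taken from the review):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical scores: the PROM, a comparator measuring a related construct,
# and a gold-standard measure for the same outcome.
prom = rng.normal(50, 10, size=40)
comparator = 0.7 * prom + rng.normal(0, 8, size=40)     # hypothesised positive association
gold_standard = 0.8 * prom + rng.normal(0, 6, size=40)  # hypothetical gold standard

# Construct validity: state the direction and expected strength beforehand,
# then test; r >= 0.60 in the hypothesised direction is taken as strong evidence.
rho, p_construct = stats.spearmanr(prom, comparator)
print(f"Construct validity: rho = {rho:.2f} (hypothesised positive, >= 0.60), p = {p_construct:.3f}")

# Criterion validity: correlation with the gold standard should be >= 0.70.
r, p_criterion = stats.pearsonr(prom, gold_standard)
print(f"Criterion validity: r = {r:.2f} (criterion >= 0.70), p = {p_criterion:.3f}")
```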

Responsiveness

Responsiveness assesses the ability of the PROM to detect changes when changes are expected.

Available methods for assessing responsiveness include paired t-tests, effect sizes, standardised response means (SRMs), and responsiveness statistics such as Guyatt's responsiveness index. Standardised effect sizes and SRMs of around 0.2 are considered small, 0.5 moderate, and 0.8 large [17].

There should be a statistically significant change in score of the expected magnitude [8].
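A minimal sketch of the effect size and SRM calculations, assuming hypothetical baseline and follow-up scores on an instrument where higher scores indicate better health:

```python
import numpy as np
from scipy import stats

# Hypothetical scores before and after an intervention expected to improve health.
baseline = np.array([42, 55, 38, 60, 47, 51, 44, 58, 40, 49], dtype=float)
followup = np.array([50, 60, 45, 66, 52, 58, 50, 63, 47, 55], dtype=float)
change = followup - baseline

# Standardised effect size: mean change / SD of baseline scores.
effect_size = change.mean() / baseline.std(ddof=1)

# Standardised response mean (SRM): mean change / SD of the change scores.
srm = change.mean() / change.std(ddof=1)

# Paired t-test for a statistically significant change of the expected magnitude.
t_stat, p_val = stats.ttest_rel(followup, baseline)

print(f"Effect size = {effect_size:.2f}, SRM = {srm:.2f} (0.2 small, 0.5 moderate, 0.8 large)")
print(f"Paired t-test: t = {t_stat:.2f}, p = {p_val:.4f}")
```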

Floor and ceiling effects

A floor or ceiling effect is considered present if 15% or more of respondents achieve the lowest or the highest possible score on the instrument, respectively [15].
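A minimal check for floor and ceiling effects, assuming hypothetical total scores and a 0-100 scoring range:

```python
import numpy as np

# Hypothetical total scores on an instrument scored from 0 to 100.
scores = np.array([0, 12, 35, 0, 58, 77, 100, 23, 0, 41, 66, 89, 5, 100, 30])
min_score, max_score = 0, 100  # theoretical range of the instrument

floor_pct = 100 * np.mean(scores == min_score)
ceiling_pct = 100 * np.mean(scores == max_score)

print(f"Floor effect: {floor_pct:.1f}% at the minimum score (flag if >= 15%)")
print(f"Ceiling effect: {ceiling_pct:.1f}% at the maximum score (flag if >= 15%)")
```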

Acceptability

Acceptability is reflected by the completeness of the data supplied; 80% or more of the data should be complete [16].
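A minimal completeness check, assuming hypothetical item responses with missing values coded as NaN:

```python
import numpy as np

# Hypothetical item responses: rows = patients, columns = items; NaN = missing.
responses = np.array([
    [3, 4, np.nan, 2],
    [1, 2, 1, 1],
    [4, np.nan, np.nan, 5],
    [2, 3, 2, 2],
    [5, 4, 5, np.nan],
])

complete_pct = 100 * np.mean(~np.isnan(responses))
print(f"Data completeness: {complete_pct:.1f}% (acceptable if >= 80%)")
```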