Domain | Criteria |
---|---|
Test–retest reliability | Reliability is the ability of a measure to reproduce the same value on two separate administrations when there has been no change in health. The intra-class correlation coefficient or weighted kappa should be ≥ 0.70 for group comparisons and ≥ 0.90 if scores are to be used for decisions about an individual patient [2]. The mean difference between time point 1 (T1) and time point 2 (T2) (paired t test or Wilcoxon signed-rank test) and its 95% CI should also be reported. |
Internal consistency | Internal consistency assesses whether the items are measuring the same thing. A Cronbach’s alpha of ≥ 0.70 is considered good; for group comparisons it should not exceed 0.92, as a higher value is taken to indicate that items in the scale could be redundant. Item-total correlations should be ≥ 0.20 [14]. |
Content validity | Content validity measures the extent to which the items clearly reflect the domains of interest. To achieve good content validity, there must be evidence that the instrument was developed by consulting patients and experts as well as by undertaking a literature review. Patients should be involved in the development stage and in item generation, and the opinion of patient representatives should be sought on the constructed scale [2, 14, 16]. |
Construct validity | Construct validity assesses how well an instrument measures what it was intended to measure. A correlation coefficient of ≥ 0.60 is considered strong evidence of construct validity. Authors should make specific directional hypotheses and estimate the strength of correlation before testing [2, 14, 15]. |
Criterion validity | Criterion validity assesses the degree of empirical association of the PROM with external criteria or other measures. A good argument should be made as to why an instrument is a gold standard and correlation with the gold standard should be ≥ 0.70 [15]. |
Responsiveness | Responsiveness assesses the ability of the PROM to detect change when change is expected. Available methods include t tests, effect sizes, standardised response means (SRMs), and Guyatt’s responsiveness index. Standardised effect sizes and SRMs of less than 0.2 are considered small, around 0.5 moderate, and 0.8 or above large [17]. There should be statistically significant changes in score of an expected magnitude [8]. |
Floor-ceiling effects | A floor or ceiling effect is considered present if 15% or more of respondents achieve the lowest or the highest possible score on the instrument, respectively [15]. |
Acceptability | Acceptability is reflected by the completeness of the data supplied. 80% or more of the data should be complete [16]. |
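Several of the criteria in the table above are simple summary statistics and can be computed directly from a matrix of item scores. The following sketch illustrates three of them — Cronbach's alpha (internal consistency), floor/ceiling percentages, and the standardised response mean (responsiveness) — on toy data invented for this example; the dataset, variable names, and score range are assumptions, not values from the source.

```python
import statistics

# Toy data (assumed for illustration): 6 respondents x 4 items, each item
# scored 0-4. t1 is the baseline administration; t2 is a follow-up at which
# improvement is expected, so change scores are meaningful for the SRM.
t1 = [
    [1, 2, 2, 1],
    [3, 3, 4, 3],
    [0, 1, 1, 0],
    [2, 2, 3, 2],
    [4, 4, 4, 4],
    [1, 1, 2, 1],
]
t2 = [
    [2, 2, 3, 2],
    [4, 3, 4, 4],
    [1, 1, 2, 1],
    [3, 3, 3, 3],
    [4, 4, 4, 4],
    [2, 2, 2, 2],
]

def cronbach_alpha(rows):
    """Internal consistency: k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = len(rows[0])
    item_vars = [statistics.variance(col) for col in zip(*rows)]
    total_var = statistics.variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def floor_ceiling(totals, lo, hi):
    """Percentage of respondents at the minimum / maximum possible total score."""
    n = len(totals)
    return (100 * sum(t == lo for t in totals) / n,
            100 * sum(t == hi for t in totals) / n)

def srm(before, after):
    """Standardised response mean: mean change score / SD of change scores."""
    diffs = [b - a for a, b in zip(before, after)]
    return statistics.mean(diffs) / statistics.stdev(diffs)

totals1 = [sum(r) for r in t1]
totals2 = [sum(r) for r in t2]

alpha = cronbach_alpha(t1)                               # good if 0.70 <= alpha <= 0.92
floor_pct, ceiling_pct = floor_ceiling(totals1, 0, 16)   # flag if either is 15% or more
responsiveness = srm(totals1, totals2)                   # ~0.2 small, ~0.5 moderate, >=0.8 large
```

Note that on real data the ICC for test–retest reliability would normally come from a variance-components (ANOVA) model rather than a simple correlation; the sketch deliberately sticks to the criteria that reduce to one-line formulas.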