A summary of the basic measure information is in 'Additional file 1' and a summary of the review is in 'Additional file 2'.
i) Theoretical framework
The clinician report measures stated what each measure was about, but none defined the construct it was supposed to assess. These measures also lacked an underlying theoretical framework. The American Knee Society Score (derived to measure knee and patient function), Harris Hip Score (pain and functional capacity), Hospital for Special Surgery Knee Score (disability), Lequesne Hip and Knee Indices (indices of severity of disease) and Merle d'Aubigne Hip Rating (function of the hip) are all measures which, while of value clinically, did not have a well-defined construct, nor were they derived from a strong theoretical framework.
Some self-report measures were based on conceptual frameworks proposed by the author(s) of the measure. The McGill Pain Questionnaire (MPQ) was based on Melzack's theory of pain [49]. This review focuses on the Pain Rating Index (PRI) and the present pain intensity (PPI) item of the McGill Pain Questionnaire. The Health Assessment Questionnaire (HAQ) was based on a hierarchical model of death, disability, discomfort, drug toxicity and dollar cost [39]. This review focuses on the most commonly used part of the HAQ, the Disability Index (HAQ-DI). Much consideration was given to the conceptual meaning of handicap in the process of developing the Disease Repercussion Profile, which measures individualised patient-perceived handicap in a broader manner than the WHO-defined dimensions of handicap [11]. Other measures were based on an existing defined construct. The SF-36 was derived to measure health status based on the identification and definition of five generic health concepts [22] plus two other concepts identified from empirical evidence [23]. The Arthritis Impact Measurement Scale was developed to reflect the WHO definition of health [50], and the WHOQOL was developed from the definition of quality of life devised by the WHOQOL group [30].
Other measures stated the construct measured but without explicit definition. The EuroQol was developed as a standardised non-disease specific measure for describing and valuing health-related quality of life [21]. The dimensions were selected primarily from existing health status measures. The WOMAC was based on the objective of defining the dimensionality of pain and disability, with five dimensions being initially identified [45]. The final version had three subscales of pain, stiffness, and physical function [46]. The underlying aim of the Oxford Hip and Knee Questionnaires was to measure "patients' perception of a single disease entity" [43].
Thus although three measures defined the construct of interest, no measure was based on both a defined construct and a theoretical framework.
ii) Methodological development
Scaling strategy
Six of the fourteen measures appeared to use standard psychometric scaling methods. The stated scaling methodology of the SF-36, WOMAC and WHOQOL was Likert scaling. The WOMAC could, alternatively, be implemented using a 0–100 mm visual analogue scale for each item, with descriptive anchors of none and extreme. A numeric rating scale version of the WOMAC has also been developed, with response categories between 0 (none) and 10 (extreme) [48]. While the authors of the Oxford Hip and Knee Questionnaires did not state that Likert scaling was used, the resultant questionnaire had the appearance of a Likert-type scale. Two scaling methods were used for the Arthritis Impact Measurement Scale: first, items were grouped into subscales and each subscale was examined using Guttman scaling procedures, and then Likert scaling was used to form an additive scale for each subscale. Thurstone's Categorical Judgement model [51] was used to obtain weightings of pain intensity for each descriptor of pain in the McGill Pain Questionnaire-PRI. This procedure results in an interval scale. The McGill Pain Questionnaire-PPI was a single item with five response categories that were considered sufficiently equally spaced to represent an interval scale.
An econometric scaling method was used for the development of the EuroQol. This method involved subjects rating health states (each formed by combining one level from each item) and resulted in a value being attached to each health state. The Disease Repercussion Profile used a combination of open questions and 10-point graphical rating scales to create a graphical profile score. The HAQ-DI did not appear to have been developed using a standard scaling technique.
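As an illustration of how health states arise from combining item levels, the following minimal sketch enumerates every state for a descriptive system assumed to have five items with three levels each. The item names, the digit-string encoding and the two entries in the valuation table are illustrative assumptions, not published EuroQol values.

```python
from itertools import product

# Assumed descriptive system: five items, each with three ordered levels.
ITEMS = ["mobility", "self_care", "usual_activities",
         "pain_discomfort", "anxiety_depression"]
LEVELS = (1, 2, 3)

# Each health state is one level per item, written here as a digit string.
health_states = ["".join(str(level) for level in combo)
                 for combo in product(LEVELS, repeat=len(ITEMS))]

# Hypothetical valuation table: subjects' ratings would attach a value
# to each state. The numbers below are placeholders only.
valuations = {"11111": 1.0, "33333": 0.0}

n_states = len(health_states)  # 3 ** 5 combinations
```

With these assumptions the system yields 3^5 = 243 states, which shows why a valuation exercise is needed to turn a descriptive profile into a single index.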
None of the clinician report measures appeared to have been developed using a standard scaling technique nor did they explain their scaling strategy.
Item generation technique
A range of techniques was used to generate the items within a measure. There was no information on the item selection techniques for the Harris Hip Score, Hospital for Special Surgery Knee Score, Lequesne Hip and Knee Indices and Merle d'Aubigne Hip Rating. The items for the American Knee Society Score were generated by consensus by members of the American Knee Society. Some measures were based on items from existing instruments (Arthritis Impact Measurement Scale, EuroQol, HAQ-DI, SF-36). Some items were selected from literature, e.g. McGill Pain Questionnaire. Others started by gathering items from patients, e.g. Oxford Hip and Knee Questionnaires, WOMAC and Disease Repercussion Profile. Some measures took a comprehensive approach and used all these techniques and additional ones (e.g. extensive focus groups and question writing panels were additionally used for the WHOQOL). In summary, the method of item generation for the patient self-report measures was generally comprehensive, with most measures using appropriate methods to generate a pool of items that cover the domain of interest. In contrast, there was little information about the choice of items in the clinician report measures.
Item reduction
The Arthritis Impact Measurement Scale, McGill Pain Questionnaire, WHOQOL and WOMAC used psychometric methods to reduce the number of items. The SF-36 used specific methods to construct short-form measures from the 'parent' longer Medical Outcomes Study measure [23, 52]. The method details were not found; however, if the methods were similar to those for the SF-20 [52], this would imply comprehensive testing in which item-scale correlations, reliability and validity were examined. Subsequently, the Likert scaling assumptions of the SF-36 were explored, with all scales passing tests for item-internal consistency, item-discrimination, and internal consistency of each scale score [24]. The main item reduction for the HAQ-DI was carried out by correlational analyses that identified redundant items [40]. The methods of item reduction for the Oxford Hip and Knee Questionnaires and EuroQol were not explained in detail in the published literature. The item reduction procedures were described in detail for the measures where a stated psychometric scaling strategy was followed, illustrating the advantage of using a psychometric scaling method with an explicit predefined methodology.
Response formats
The Disease Repercussion Profile used open questions for each domain, with severity being rated on a ten-point graphical rating scale. For the McGill Pain Questionnaire-PRI, respondents select, from each of the 20 categories, the individual descriptive words that best represent their pain. If none of the words in a category applies, the respondent leaves the category out. For the present pain intensity item, the respondent selects one of five response categories.
The other twelve measures all had ordered response categories, with the Arthritis Impact Measurement Scale and the EuroQol additionally including a visual analogue scale. Six of these twelve measures had items with different numbers of response categories (American Knee Society Score, Lequesne Hip and Knee Indices, Hospital for Special Surgery Knee Score, Harris Hip Score, SF-36 and the Arthritis Impact Measurement Scale, with between 1 and 6 response categories depending on the measure and item). However, the number of response categories was only discussed for the SF-36 and then only for some items [23]. The other six measures had the same number of response categories for all the items throughout the measure (EuroQol, HAQ-DI, Merle d'Aubigne Hip Rating, Oxford Hip and Knee Questionnaires, WHOQOL, WOMAC). Of these, only the WOMAC and HAQ-DI had the same response continuum (i.e. same wording) for all the items. The HAQ-DI response formats were based on the American Rheumatism Association (ARA) functional classes.
Therefore, most of the measures used ordinal (ordered) response formats, but there was little consistency of the response format and response continuum within measures. There is much discussion on the problems in performing arithmetic operations and statistical analysis on ordinal scales, mainly due to the unknown interval between categories [53, 54]. The PRI of the McGill Pain Questionnaire was the only measure on an interval scale and was therefore without these problems. Likert scales are ordinal, although there is much debate as to whether they can be assumed to be interval, i.e. with equal intervals between responses [2]. The Likert-type measures (SF-36, WOMAC, WHOQOL, Arthritis Impact Measurement Scale, Oxford Hip and Knee Questionnaires) did not use true Likert scales, as the response continuum was not 'agree' to 'disagree'. This may have an impact on the resultant scale, as any change in the response categories, e.g. changing the usual agree-disagree to favourable-unfavourable, may alter the intervals between the categories. In addition, all the items within a true Likert scale usually have either five or seven response categories, but the Arthritis Impact Measurement Scale and the SF-36 did not use a constant number of response categories, which again may affect the scale. However, it is not clear whether these departures from a traditional Likert scale have a significant impact, as there was empirical support for the scaling assumptions of traditional Likert scales in the SF-36 subscales [24].
Scoring method
The McGill Pain Questionnaire-PRI used three possible scoring methods for the list of pain descriptors: the number of items chosen (NWC), the mean scale values (PRI(S)), or the summed rank values of items chosen (PRI(R)). An alternative weighted-rank method of scoring was also developed [28]. The PPI score was simply the value selected from the 1–5 response scale. The Disease Repercussion Profile used profile scores, where the handicap rating for each domain was plotted on a bar chart to obtain a handicap profile for each patient.
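The three PRI scoring methods described above can be sketched as follows. This is a minimal illustration only: the three categories, their words and the scale values are hypothetical placeholders rather than the published descriptors or weights, and the rank value of a word is assumed to be its position within its category.

```python
from statistics import mean

# Hypothetical categories: (word, scale value) pairs in order of
# increasing intensity. Scale values are placeholders, not published weights.
CATEGORIES = {
    "temporal": [("flickering", 1.2), ("pulsing", 2.3), ("throbbing", 3.1)],
    "thermal": [("hot", 1.5), ("burning", 2.9), ("searing", 4.0)],
    "evaluative": [("annoying", 1.0), ("miserable", 2.6), ("unbearable", 4.4)],
}

def score_pri(selected):
    """Score a response given as {category: chosen word}.

    Categories the respondent left out are simply absent from `selected`.
    """
    ranks, scale_values = [], []
    for category, word in selected.items():
        words = [w for w, _ in CATEGORIES[category]]
        position = words.index(word)  # raises ValueError for unknown words
        ranks.append(position + 1)    # assumed rank value: 1-based position
        scale_values.append(CATEGORIES[category][position][1])
    return {
        "NWC": len(ranks),            # number of items chosen
        "PRI(R)": sum(ranks),         # summed rank values
        "PRI(S)": mean(scale_values) if scale_values else 0.0,
    }

# The respondent skipped the "evaluative" category.
result = score_pri({"temporal": "throbbing", "thermal": "hot"})
```

In this example two words are chosen, so NWC is 2, and the rank values 3 and 1 give a PRI(R) of 4.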
Two measures containing items with different numbers of response categories addressed this in their scoring. The Arthritis Impact Measurement Scale used a standardised additive scale. The SF-36 recalibrated the additive scores for linearity and transformed the scores. The American Knee Society Score, Harris Hip Score, Hospital for Special Surgery Knee Score, and Lequesne Hip and Knee Indices (all with varying numbers of response categories) used summated scale systems, with the Hospital for Special Surgery Knee Score and American Knee Society Score having items that result in deductions from the point score, e.g. the Hospital for Special Surgery Knee Score uses a one-point deduction for using a cane. It is unclear how this scoring method was derived and why responses to certain items were allocated their particular points, with some items carrying more weight than others.
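One generic way of placing an additive score on a common metric when items have different numbers of response categories is a linear transformation of the raw sum to a 0 to 100 range. The sketch below assumes items coded from 1 up to their number of categories; the function name and example data are hypothetical, and it is not presented as the published algorithm of either measure.

```python
def transform_0_100(item_scores, category_counts):
    """Sum items coded 1..k and rescale the sum linearly to 0-100.

    item_scores: the response chosen for each item.
    category_counts: the number of response categories (k) for each item.
    """
    if len(item_scores) != len(category_counts):
        raise ValueError("one category count is needed per item")
    lowest = len(item_scores)       # every item at its minimum of 1
    highest = sum(category_counts)  # every item at its maximum
    raw = sum(item_scores)
    return (raw - lowest) / (highest - lowest) * 100

# Three items with 3, 5 and 6 response categories respectively.
score = transform_0_100([3, 2, 5], [3, 5, 6])
```

Note that under this approach an item with more categories still contributes more to the total than an item with fewer, which is one reason the choice of response format needs justification.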
The scoring of the measures with constant numbers of response categories varied. An additive score was used for the Likert-type scales of the Oxford Hip and Knee Questionnaires and WHOQOL. An additive scale is also most commonly used for the WOMAC; however, other weighting and aggregation methods were proposed (i.e. normalisation, pooled index, weighting by relative importance, response criteria) [48]. In addition, the WOMAC can be scored using a signal method where patients are asked to select the most important item from each subscale. However, there are concerns about the stability of the signal method and it is not currently recommended [47]. The score for the HAQ-DI items was based on the highest score on any item within each of the eight subscales. The subscale scores were adjusted to take account of the use of aids. An overall disability score was calculated as the average of the subscale scores. The EuroQol could be scored as a profile or a weighted health index based on a table of values from general population samples. A table was used for the Merle d'Aubigne Hip Rating to allow classification of the functional grading of the hip, and an algorithm was provided to calculate improvement after surgery on the hip.
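The HAQ-DI scoring steps described above can be sketched as follows. The item scores are assumed to range from 0 to 3, and the adjustment for aids is assumed to raise a subscale score to at least 2 when an aid is reported; both points, along with the subscale labels and the example responses, are assumptions made for illustration and should be checked against the scoring instructions for the measure.

```python
def haq_di(item_scores, aids_used=frozenset()):
    """Return an overall disability score from subscale item scores.

    item_scores: {subscale name: list of item scores, each assumed 0-3}.
    aids_used: names of subscales for which an aid was reported.
    """
    subscale_scores = []
    for name, items in item_scores.items():
        score = max(items)         # highest score on any item in the subscale
        if name in aids_used:
            score = max(score, 2)  # assumed adjustment rule for use of aids
        subscale_scores.append(score)
    # Overall score is the average of the subscale scores.
    return sum(subscale_scores) / len(subscale_scores)

# Hypothetical responses for eight subscales.
responses = {
    "dressing": [0, 1],
    "arising": [1, 1],
    "eating": [0, 0, 2],
    "walking": [0, 1],
    "hygiene": [1, 0, 0],
    "reach": [2, 1],
    "grip": [0, 0, 0],
    "activities": [1, 3, 0],
}

unadjusted = haq_di(responses)
adjusted = haq_di(responses, aids_used={"walking"})
```

Here the subscale maxima sum to 11, giving 1.375 without adjustment; reporting an aid for walking raises that subscale from 1 to 2 and the overall score to 1.5. Only one item per subscale influences the result, so the remaining item responses do not affect the score.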
Three of the measures (Oxford Hip and Knee Questionnaires, Merle d'Aubigne Hip Rating and Lequesne Hip and Knee Indices) had only an overall score. All the others had subscale scores. Of these, the SF-36 and American Knee Society Score had subscale scores only, whereas the remaining measures had both subscale scores and an overall score.
In sum, the measures used a wide range of scoring procedures, from the complex weightings in the EuroQol to the simple method of the HAQ-DI (using the highest score within each subscale), which does not fully use all the information collected. Jenkinson (1991) [55] demonstrated that complex weighting methods gain little over a simple scoring system, and thus a simple additive method is generally recommended.