In this paper, new measures of I, A and P have been developed that were specifically derived to measure each ICF component without contamination from the other constructs in the model. These new measures can be used to improve assessment in both theory testing and the evaluation of interventions. For theory testing, the use of uncontaminated measures should reduce the over-inflation of observed relationships between constructs that may occur when measures are contaminated with other related constructs, and the under-inflation that may occur when measures are contaminated with unrelated constructs. For example, the new measures should allow more accurate evaluation of the relationships between the ICF components, as these measures should not be contaminated with other constructs in the model.
For evaluating an intervention, the new measures allow the assessment of the three distinct ICF components. Failure to measure each distinguishable outcome adequately might result in failure to detect benefit or harm from an intervention or treatment. For example, in the treatment of patients with severe arthritis, an analgesic might predominantly affect impairment; an exercise programme might influence activity limitations and participation restrictions but have little influence on impairment; whereas providing additional transport services might only alter participation restriction. If combined or contaminated measures are used, then positive or negative effects may be masked.
While the previous work on the selection of items identified some items relevant for any population, this paper develops measures specifically in the context of joint replacement surgery, mainly for osteoarthritis. Thus the measures are particularly relevant for that population, even though some of the items originated from generic measures. Further work would be necessary to confirm the value of the measures for different populations.
Two methods of item analysis were explored: the traditional CTT approach and the more recent IRT method. Each method has its strengths and weaknesses, and using both may yield more information than using either alone. Each method was explored individually, and the results were then compared and contrasted. CTT and IRT methods identified common items for removal from the item pool. Using the criteria outlined, each method also suggested some items for removal that were not indicated by the other. The CTT-MAP analysis indicated that three items were more highly correlated with a scale total other than that of their hypothesised construct; there were plausible explanations for the removal of all three items. IRT additionally highlighted items that had low information and so could be removed. This was preferable to the CTT approach to item reduction, in which factor analysis is used and may leave only small areas of a construct covered. This problem is even more likely if some items have similar wordings, as these would be the strongest indicators of the factor and would be retained ahead of other items. Using IRT can also result in items representing only a small area of the construct, but this is driven by a different theoretical rationale to CTT, based on items not discriminating well or conveying little information.
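The multitrait scaling check described above (flagging an item that correlates more highly with another scale's total than with its own) can be sketched as follows. This is an illustrative reimplementation, not the MAP software used in the study; the item names, scale assignments and data are hypothetical. The item's own scale total is "corrected" (the item itself removed) so that overlap does not inflate its own-scale correlation.

```python
import numpy as np

def item_scale_correlations(responses, scales):
    """Flag items that correlate more highly with another scale's total
    than with their own (corrected) scale total.

    responses: dict of item name -> array of respondent scores
    scales: dict of scale name -> list of item names
    Returns a list of (item, assigned_scale, best_scale) triples for
    items whose highest correlation is with a different scale.
    """
    flags = []
    for scale, items in scales.items():
        for item in items:
            x = responses[item]
            corrs = {}
            for other, other_items in scales.items():
                # Drop the item from its own scale's total (corrected
                # item-total correlation); it is absent from others anyway.
                members = [i for i in other_items if i != item]
                total = sum(responses[i] for i in members)
                corrs[other] = np.corrcoef(x, total)[0, 1]
            best = max(corrs, key=corrs.get)
            if best != scale:
                flags.append((item, scale, best))
    return flags
```

On synthetic data where one item assigned to an "Impairment" scale is actually driven by the same latent trait as a "Participation" scale, the function flags exactly that item.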
The decision to use a discrimination parameter of < 1.25 as a criterion for item removal was somewhat arbitrary. As described earlier, the decision was based on published suggestions, but as yet there is no consensus on what values of the discrimination parameter or information function are acceptable. Again, there were plausible reasons why these items had low information, and so they were also removed from the item pool.
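Under the two-parameter logistic (2PL) model used here, a discrimination cutoff translates directly into a cap on item information, which is why low-discrimination items convey little about a respondent's position on the construct. A minimal sketch, with hypothetical item parameters (the 1.25 cutoff is the one stated above):

```python
import math

def p_2pl(theta, a, b):
    """2PL endorsement probability: P(theta) = 1 / (1 + exp(-a(theta - b)))."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """2PL item information: I(theta) = a^2 * P * (1 - P).
    Information peaks at theta = b, where it equals a^2 / 4, so a low
    discrimination a caps how much an item can ever contribute."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

def flag_low_discrimination(items, cutoff=1.25):
    """Return names of items whose discrimination a falls below the cutoff.
    items: dict of item name -> (a, b) parameter pair (hypothetical)."""
    return [name for name, (a, b) in items.items() if a < cutoff]
```

For example, an item with a = 1.0 can never provide more than 0.25 information anywhere on the construct, whereas a = 2.0 provides up to 1.0 at its difficulty location.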
The IRT analysis indicated that the model fitted to the pool of candidate items for P_irt had poorer fit than the I_irt and A_irt models. However, as there is no consensus about how to assess model fit or how to deal with misfitting data, the effect of this is difficult to quantify, and it may have affected the results for Participation Restriction. The P_irt set had fewer items than the I_irt or A_irt sets. This reflects the observation that commonly used measures in OA tend to focus on I and A: our analysis of 342 items found only 44 pure P items. Nevertheless, the resultant measure of Participation Restriction appeared to have acceptable properties.
The item analysis resulted in the removal of 4 Impairment items, 9 Activity Limitation items and 11 Participation Restriction items, with 14 of these items identified by both CTT and IRT. The resultant measures consisted of 9 Impairment items (Ab-I), 17 Activity Limitation items (Ab-A) and 9 Participation Restriction items (Ab-P). The correlations of the resultant measures with the criterion variable of the SF-36 appeared to follow the expected pattern. The measures had acceptable Cronbach's alpha (all > 0.8). However, when this was explored in more detail using IRT, Ab-I was not reliable at very high levels of impairment, while Ab-P was not reliable at the low end of the construct. This suggests that new items should be written to cover these areas if the measures are to be used across all ability levels. So, for Ab-I, some 'easy' items should be written to discriminate at the high end of the construct, e.g. 'my joint is uncomfortable (never to always aches)'. For Ab-P, some new 'hard' items should be added to discriminate at this area of the construct, e.g. 'are you able to participate in sporting activities?' This illustrates an advantage of using IRT, as the lack of reliability at the extremes of the construct was not identified by the CTT analysis. It is possible that the lack of reliable items at the ends of the I and P constructs is due to the items having been selected from measures developed using CTT methods. For example, a high Cronbach's alpha can be achieved by selecting items that are strongly related to each other but cluster around a small area of the underlying construct. The total information was greatest for Ab-A, with information > 10 across most of the construct.
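The alpha criterion reported above (all > 0.8), and its weakness (items can correlate highly yet cluster on a narrow region of the construct), can be illustrated by computing Cronbach's alpha directly from item variances; the data below are synthetic.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an (n_respondents, k_items) score matrix:
    alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_var_sum = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_var_sum / total_var)
```

Near-duplicate items push alpha towards 1 while covering only one narrow area of the construct, which is exactly why the IRT information check adds value beyond a high alpha.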
The Graded Response Model fit was acceptable for the Ab-I, Ab-A and Ab-P models. The model fit was better than for the candidate item models for Impairment (I_irt) and Participation Restriction (P_irt), but a little worse than for Activity Limitation (A_irt). The score distributions appeared approximately normal on graphical examination, although Ab-P showed a slight but statistically significant skew.
A two-parameter IRT model was selected in order to estimate both a difficulty and a discrimination parameter. There is much debate over whether to use the single-parameter Rasch model (where item difficulty is estimated and equal item discrimination is assumed) or the more general two-parameter IRT model. Some favour the Rasch model as they believe it adheres to the fundamental measurement principle that all items behave in the same way (i.e. the data must fit the model). Others favour using the IRT model that best fits the data and suggest the Rasch model may be too restrictive, leading to useful items being discarded (see [48, 49]). In this study, we are interested in developing measures tailored to OA. We therefore chose an approach that allows us to select the items conveying the most information about our chosen population, rather than force particular properties on each item in our measure. In addition, with a limited set of items it is unlikely that sufficient items could be found that cover the construct well while all having the same discrimination. The formation of very large item banks for computer adaptive testing (CAT) may, in the future, allow the use of the Rasch model to develop tailored questionnaires. Until then, we take the pragmatic approach and select the two-parameter IRT model.
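The Rasch-versus-2PL distinction above can be made concrete: with a common discrimination, item characteristic curves never cross, so the ordering of items by endorsement probability is the same at every ability level; freeing the discrimination per item lets the curves cross. The item parameters below are illustrative only.

```python
import math

def icc(theta, a, b):
    """Item characteristic curve: P(endorse) = 1 / (1 + exp(-a(theta - b))).
    Fixing a to a common value for all items gives the Rasch (1PL) model;
    allowing a to vary per item gives the 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Rasch-like pair: equal discrimination, so the easier item (b = -0.5)
# is more likely to be endorsed than the harder one (b = 0.5) at every
# ability level.
rasch_items = [(1.0, -0.5), (1.0, 0.5)]

# 2PL pair: unequal discrimination, so the curves cross and the item
# ordering depends on where on the construct the respondent sits.
two_pl_items = [(0.6, -0.5), (2.0, 0.5)]

def endorsement_order(items, theta):
    """Indices of items sorted by decreasing endorsement probability."""
    probs = [icc(theta, a, b) for a, b in items]
    return sorted(range(len(items)), key=lambda i: -probs[i])
```

This is the trade-off the paragraph describes: the Rasch constraint buys invariant item ordering, while the 2PL buys fit to items with heterogeneous discriminations.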
The selected items could be explored further. If a shorter measure were required, stricter IRT criteria could be used for selecting items. Alternatively, a decision could be made on how many items the resultant measure should have. Using IRT methods, items could then be identified that provide information (precision) across the construct domain.
The response rate of 43% was quite low but reasonable given the length of the questionnaire (27 pages, 254 items). The sample appeared representative, as there were no differences between responders and non-responders in gender, age or disability. The question remains as to whether those who did not respond differed from the sample on other, unmeasured variables.
This study was based on a population with severe hip or knee problems, as patients were assessed prior to surgery. If a measure is required to assess patients post-operatively, or patients in the earlier stages of osteoarthritis, the same items should remain useful, as IRT is an invariant method (i.e. item parameter estimates should be similar even in a sample with different levels of 'ability'). However, the accuracy of the parameter estimates does depend on the limitation levels of the calibration sample. As the sample of patients about to undergo joint replacement has relatively low levels of 'ability', the parameter estimates would be most accurate for the easier items. Hence, it would be useful to repeat the analysis on patients after surgery, as these patients would have more 'ability' and should therefore provide more accurate parameter estimates for the harder items. This would also allow an empirical evaluation of the invariance property of IRT.
The resultant measures appeared to have acceptable properties to date. However, only a preliminary psychometric evaluation of reliability and validity was carried out. As reliability and validity can never be proved but rest on an accumulation of evidence, much further empirical testing needs to be carried out.
The resultant measures have been constructed to represent the theoretical constructs without contamination from the other constructs in the ICF model, to allow the ICF model to be tested. However, this representation was based on the DCV judgements of expert judges and may not reflect the discriminations made by respondents to the measures. It will be important to explore whether the measures are statistically independent using patients' responses to the items.