The clinical application of ICF codes to diverse populations remains an active topic of discussion [21–27], with little consensus on how each code and qualifier should be used for specific populations. Previous related studies have examined the concepts of the ICF model using different existing scales [28–30]. Some studies have addressed the reproducibility of assigning ICF categories to extant measures [3]. In geriatric care research, Jette et al. have identified distinct concepts shared by activity and participation [31].
To the best of the authors' knowledge, however, no study to date has shown the test-retest reproducibility of the ICF as a scale for evaluating functioning in a specific population.
The ICF is based on a universal model that can theoretically be applied regardless of culture, age group or care setting [7, 27, 32]. In practical terms, however, individual codes may have different implications in different care settings, and individual ICF items require validity and reliability studies when applied to diverse populations. Such efforts are already underway in the development of ICF core-sets for specific medical conditions [11]. Conceptual applications of the ICF to national surveys have also been undertaken [4, 33, 34].
This study differs from both of these approaches: it does not rely on experts' opinions to assure face and content validity, but applies the ICF directly as an instrument of geriatric assessment in order to select more adequate items, with the aim of developing new scales using the ICF taxonomy.
This approach requires that items show a certain level of test-retest reproducibility and measurability, and that inappropriate items be discarded, before new scales are created.
Because AP items related to communication showed an acceptable level of test-retest reliability, the authors are now developing an elderly communication performance scale based on the test-retest reliability statistics.
Items such as d320 and d340, which relate to communication using formal sign language, showed low measurability. These items are not always applicable in the general geriatric care setting, but they are pertinent for individuals with hearing loss. Scale developers can thus select ICF items with adequate reliability and measurability according to the scope of each scale.
The other rationale for testing such a wide range of ICF codes is that elderly persons have problems that span multiple disciplines. This contrasts with the ICF core-set project, which is relatively disease-focused.
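The item selection described above can be sketched as a simple filter over per-item statistics. This is a minimal illustration only: the class `ItemStats`, the function `select_items`, the item labels, the numeric values and both thresholds are hypothetical placeholders, not results or cut-offs from this study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemStats:
    """Per-item summary statistics (hypothetical structure)."""
    code: str                    # placeholder label, not a real ICF code
    weighted_kappa: float        # test-retest weighted kappa of the item
    immeasurability_rate: float  # share of ratings coded 8 or 9


def select_items(items, min_kappa, max_immeasurable):
    """Return the codes of items that are both reliable and measurable enough."""
    return [
        item.code
        for item in items
        if item.weighted_kappa >= min_kappa
        and item.immeasurability_rate <= max_immeasurable
    ]


# Placeholder values for illustration only.
candidates = [
    ItemStats("item_A", weighted_kappa=0.62, immeasurability_rate=0.03),
    ItemStats("item_B", weighted_kappa=0.21, immeasurability_rate=0.05),
    ItemStats("item_C", weighted_kappa=0.70, immeasurability_rate=0.90),
]

# Both thresholds are assumptions; a scale developer would set them
# according to the scope of the scale being built.
selected = select_items(candidates, min_kappa=0.4, max_immeasurable=0.2)
```

Here `item_B` would be dropped for low reliability and `item_C` for high immeasurability, while a scale aimed at a narrower population (for example, individuals with hearing loss) might relax the measurability threshold.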
Reliability of the ICF qualifiers
Our findings raise concerns about the low reliability of the ICF items using qualifiers.
Although the overall reliability of the ICF items was low, it improved considerably when the weighted kappa statistics were stratified by the work experience of the evaluators. As shown in Table 1, the weighted kappa of the TAI scales did not show such marked differences, in contrast to the ICF items. A previous study of the TAI scales also indicated that their reliability did not depend on the experience of the evaluators [15]. This indicates that the ICF items and their qualifiers may be too difficult to quantify in some cases.
Stratifying the results by care setting yielded better test-retest reproducibility in the institutional setting. This may be because more information, including medical records, is available in that setting.
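For readers who wish to reproduce this kind of stratified analysis, the sketch below computes Cohen's weighted kappa for paired test and retest ratings and repeats the calculation within strata such as evaluator experience or care setting. It is an illustration under stated assumptions rather than the procedure used in this study: linear disagreement weights are assumed, only qualifier levels 0 to 4 are treated as ordinal ratings (pairs coded 8 or 9 would need to be excluded beforehand), and the function names and the record layout are hypothetical.

```python
from collections import defaultdict


def weighted_kappa(test, retest, levels=(0, 1, 2, 3, 4)):
    """Cohen's weighted kappa with linear disagreement weights (assumed scheme)."""
    if len(test) != len(retest) or not test:
        raise ValueError("test and retest must be non-empty and of equal length")
    k = len(levels)
    index = {level: i for i, level in enumerate(levels)}
    n = len(test)

    # Observed joint proportions of (test, retest) ratings.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(test, retest):
        observed[index[a]][index[b]] += 1 / n

    # Marginal proportions give the agreement expected by chance.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)  # 0 on the diagonal, 1 at the extremes
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * row[i] * col[j]

    if weighted_expected == 0:
        return float("nan")  # undefined when every rating falls in one level
    return 1 - weighted_observed / weighted_expected


def stratified_kappa(records):
    """Weighted kappa per stratum; records are (stratum, test, retest) tuples."""
    groups = defaultdict(lambda: ([], []))
    for stratum, first, second in records:
        groups[stratum][0].append(first)
        groups[stratum][1].append(second)
    return {s: weighted_kappa(a, b) for s, (a, b) in groups.items()}
```

A call such as `stratified_kappa([("experienced", 2, 2), ("novice", 1, 3), ...])` would return one kappa value per stratum. Because kappa depends on the marginal distribution of ratings within each stratum, values from different strata should be compared with caution.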
Reliability differed by chapter. As shown in Figures 2 and 3, the low weighted kappa values of chapters 4, 5 and 8 of the BF domain and chapters 8 and 9 of the AP domain contributed to the overall low reliability of the ICF.
In the BF domain, chapters 4 ("Functions of the Cardiovascular, Haematological, Immunological and Respiratory Systems"), 5 ("Functions of the Digestive, Metabolic and Endocrine Systems") and 8 ("Functions of the Skin and Related Structures") are composed of items that can be described with specific medical examinations.
For example, "blood pressure functions" (b420) can be described much more easily by the blood pressure level measured with an arm cuff than by qualifier levels from 0 to 4.
Immeasurability of the ICF items
What we call immeasurable in this study includes level 8, "not specified" (available information does not suffice to quantify the severity of the problem), and level 9, "not applicable" (e.g., d760, "family relationships", is not applicable to an elderly person without family).
For example, in the case of "global psychosocial functions" (b122; immeasurability rate 2.6%), 38 evaluators could not quantify it because sufficient information was not available, and one evaluator rated it as not applicable, as shown in additional file 1. This indicates that items with a low immeasurability rate can be evaluated easily.
In contrast, 96% of the ratings for "sexual functions" (b640) were immeasurable, and most of these were rated as not applicable, as expected given the target sample of this study. Overall, most immeasurable ratings were level 8 (not specified), although some items, such as those in chapter 8 ("Major life areas") of the AP domain, showed more level 9 than level 8 ratings.
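The immeasurability index discussed here can be expressed as the share of an item's ratings coded 8 or 9, reported together with the two components so that "not specified" and "not applicable" remain distinguishable. The following is a minimal sketch; the function name, the returned keys and the toy ratings are hypothetical and are not data from this study.

```python
from collections import Counter

NOT_SPECIFIED = 8   # information insufficient to quantify the problem
NOT_APPLICABLE = 9  # item does not apply to the person assessed


def immeasurability(ratings):
    """Share of ratings coded 8 or 9, overall and by qualifier."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    counts = Counter(ratings)
    total = len(ratings)
    n8 = counts[NOT_SPECIFIED]
    n9 = counts[NOT_APPLICABLE]
    return {
        "rate": (n8 + n9) / total,
        "not_specified": n8 / total,
        "not_applicable": n9 / total,
    }


# Toy ratings for a single item (illustrative placeholders only).
example = immeasurability([0, 1, 8, 9, 9, 2, 3, 4, 8, 0])
```

Keeping the two components separate makes it possible to tell an item that is hard to assess (mostly level 8) from one that is out of scope for the population (mostly level 9).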
Chapter 8 ("Major life areas") comprises the categories "education" (d810-d839), "work and employment" (d840-d859), and "economic life" (d860-d879), while chapter 9 ("Community, social and civic life") includes "community life" (d910), "recreation and leisure" (d920), "religion and spirituality" (d930), "human rights" (d940) and "political life and citizenship" (d950). Accurately assigning scores in the sub-domains of education, work and employment, community life, and political life in a population of institutionalized elderly patients may be difficult, or even inappropriate. Thus, the large proportion of institutionalized geriatric patients in our study sample may have contributed to the high immeasurability rates in these two chapters. The measurement of "religion and spirituality" and "human rights" requires multidimensional and subjective assessment. Thus, it is difficult to assign either of them to a single code [35, 36].
The low reliability shown in this study indicates the difficulty of using the ICF as a measurement tool and is also attributable to the ambiguous nature of the qualifiers. For example, when an evaluator judges the performance level of school education, he or she may assess the subject as level 4 ("complete difficulty") because of the subject's inability to obtain further education or to attend an institution for learning. However, this item may also be regarded as "not applicable" or "not specified," especially in the context of institutionalized geriatric patients, for whom school attendance is not an expected component of daily life.
In contrast, items frequently assessed in the LTCI assessment appeared to have high reliability, presumably because items such as toileting and self-dressing form part of a standard self-care assessment already widely used by healthcare professionals [37]. This similarity may explain the high reproducibility of self-care item assessments between independent evaluators in our study.
Validity of the ICF Checklist
An additional purpose of this study was to evaluate the validity of the ICF Checklist in geriatric assessment. We also used the checklist as a training tool for evaluators, because it was the only material available for official ICF training at the commencement of this study. We found that the existing ICF Checklist lacks several items that scored high in reliability and low in immeasurability rate. These items include "global psychosocial functions" (b122); "temperament and personality functions" (b126); "calculation functions" (b172); "mental function of sequencing complex movements" (b176); "articulation functions" (b320) and "gait pattern functions" (b770) in the BF domain, and "focusing attention" (d160); "making decisions" (d177); "transferring oneself" (d420) and "moving around in different locations" (d460) in the AP domain.
The ICF Checklist also includes items with low reliability or high immeasurability, e.g. "blood pressure functions" (b420); "haematological system functions" (b430); "immunological system functions" (b435); "respiration functions" (b440); "digestive functions" (b515); "endocrine gland functions" (b555) and "sexual functions" (b640) in the BF domain, and "school education" (d820); "apprenticeship" (d840); "religion and spirituality" (d930) and "human rights" (d940) in the AP domain. Some of the body function items could be better described by chronic disease diagnoses, such as high blood pressure, anemia, and diabetes. Items not relevant to elderly care settings, such as school education and apprenticeship, might simply be omitted when applying the scheme to those settings.
The importance of participation in religion and spirituality might vary with the cultural setting. Also, human rights (d940) may play a pivotal role in understanding geriatric domestic violence.
These results should help in selecting more useful sets of ICF items that reflect evaluators' needs and the reliability of the items. Some modification of the ICF Checklist may also facilitate the use of the ICF.
Study Limitations
This study has a few limitations. The samples were selected from various service providers based on the stability of function during the test-retest period, and the kappa statistic depends on the sample. Therefore, these samples might not fully represent the target population, namely the elderly using long-term care services in Japan. Nevertheless, the large sample obtained from multiple centers indicates relatively low reliability of the ICF items measured with the qualifiers.
Other possible confounders, such as cultural setting and evaluators' professional backgrounds, may also influence the ICF measurement values. It is possible that some of the ICF items show differential item functioning (DIF) depending on these confounders. The Rasch measurement technique could be applied to answer this question, which remains to be studied [38]. The illustrations added by the authors to clarify the definition of each item could have biased the results. However, our intention in incorporating illustrations was to standardize evaluator assessments. Previous studies have shown that illustrations increase the reliability of assessment instruments [39].
Lastly, the authors used the sum of qualifiers 8 and 9 as a simple index of immeasurability. A high prevalence of level 8 for an item suggested that it was difficult for the evaluator to ask the question or to obtain the information from the medical chart. In contrast, assignment of qualifier 9, which was more prevalent in chapter 8 of the AP domain, suggested that the items were not applicable. These two qualifiers may therefore convey quite different information, and the study design made it difficult to compare the differences between them. In addition, it was difficult to analyze the inter-rater reliability of qualifiers 8 and 9 because of the skewed distribution of results between these qualifier levels. However, the prevalence of these qualifiers, as shown in Additional files, should help in selecting ICF items for future research.