Practical methods for dealing with 'not applicable' item responses in the AMC Linear Disability Score project
© Holman et al; licensee BioMed Central Ltd. 2004
Received: 23 April 2004
Accepted: 16 June 2004
Published: 16 June 2004
Whenever questionnaires are used to collect data on constructs, such as functional status or health related quality of life, it is unlikely that all respondents will respond to all items. This paper examines ways of dealing with responses in a 'not applicable' category to items included in the AMC Linear Disability Score (ALDS) project item bank.
The data examined in this paper come from the responses of 392 respondents to 32 items and form part of the calibration sample for the ALDS item bank. The data are analysed using the one-parameter logistic item response theory model. The four practical strategies for dealing with this type of response are: cold deck imputation; hot deck imputation; treating the missing responses as if these items had never been offered to those individual patients; and using a model which takes account of the 'tendency to respond to items'.
The item and respondent population parameter estimates were very similar for the strategies involving hot deck imputation; treating the missing responses as if these items had never been offered to those individual patients; and using a model which takes account of the 'tendency to respond to items'. The estimates obtained using the cold deck imputation method were substantially different.
The cold deck imputation method was not considered suitable for use in the ALDS item bank. The other three methods described can be usefully implemented in the ALDS item bank, depending on the purpose of the data analysis to be carried out. These three methods may be useful for other data sets examining similar constructs, when item response theory based methods are used.
When questionnaires consisting of a number of related items are used to measure constructs such as health related quality of life [1, 2], cognitive ability or functional status, it is likely that some patients will omit responses to a subset of items. A variety of ways of dealing with missing item responses in this type of questionnaire have been proposed. These range from imputation methods [6, 7], through algorithms that permit parameters to be estimated whilst ignoring missing data points, to frameworks in which it is possible to construct a joint model for the data and the pattern of missing data points. It is always essential to examine why some responses are missing and whether there is a pattern underlying the missing data for questionnaires [10–12], but particularly when an item bank is being calibrated. A calibrated item bank is a large collection of questions for which the measurement properties of the individual items, in the framework of item response theory, are known, and it should form a solid foundation for measuring the construct of interest. This foundation could be weakened if the treatment of missing item responses had not been properly examined.
The AMC Linear Disability Score (ALDS) item bank aims to measure functional status, as defined by the ability to perform activities of daily life [4, 13, 14]. Items for inclusion in the ALDS item bank were obtained from a systematic review of generic and disease specific instruments for measuring the ability to perform activities of daily life and supplemented by diaries of activities performed by healthy adults. The ALDS items were administered by specially trained nurses. Two response categories were used: 'I could carry out the activity' and 'I could not carry out the activity'. If patients had never had the opportunity to experience an activity, a 'not applicable' response was recorded. In the context of the ALDS item bank, it is not immediately clear how responses in the category 'not applicable' should be analysed. Some instruments, such as the CAMCOG neuropsychological test battery [3, 15] and the Sickness Impact Profile, treat such responses as a 'negative' category and others, such as the SF-36 [1, 2], impute a response based on those given to the other items. In this paper, responses to the 'not applicable' category in the ALDS project have been examined in the wider context of missing data.
In this paper, four practical, missing data based strategies for dealing with responses in the category 'not applicable' are examined in the context of item response theory. The four strategies are: cold deck imputation; hot deck imputation; treating the missing responses as if these items had never been offered to those individual patients; and using a model which takes account of the 'tendency to respond to items'. The results will be used to make recommendations about the choice of procedure in the ALDS project and other measures of functional status, which are analysed with item response theory.
Table 1. Item content and parameters.
Columns: estimates of the item parameters for each procedure: Hot deck 1st run; Items never offered; Including tendency to respond; Mean 5 runs hot deck.
Running for more than 15 minutes (++) (2)
Going for a walk in the woods (2)
Running for less than 5 minutes (3)
Walking up a hill or high bridge (++) (3)
Lifting up a toddler (3)
Moving a bed or table (4)
Playing with a child on the floor (5)
Tightening a screw (+) (5)
Going shopping for clothes (++) (6)
Change a light bulb in a ceiling lamp (7)
Mopping the floor (++) (11)
Putting the rubbish out (12)
Lifting a box weighing 10 kg (13)
Shopping for groceries for a week (13)
Painting a ceiling (14)
Cleaning a bathroom (17)
Carrying a heavy bag upstairs (17)
Painting a wall (18)
Cycling for 15 minutes (24)
Change sheets and duvet cover on bed (25)
Caring for potted plants on a balcony (25)
Vacuuming a flight of stairs (26)
Washing a window from the outside (27)
Cycling with a heavy load of shopping (30)
Pumping up a bicycle tyre (33)
Travelling by plane (38)
Mopping a flight of stairs (39)
Vacuuming the inside of a car (48)
Swimming for an hour (+) (54)
Washing a car (82)
Mowing the lawn (102)
Repairing a puncture in bicycle tyre (133)
Cronbach's alpha coefficient for scale
Dealing with 'not applicable' item responses
This section describes the four strategies for dealing with these responses: cold deck imputation; hot deck imputation; treating the missing responses as if these items had never been offered to those individual patients; and using a model which takes account of the 'tendency to respond to items'. These strategies were chosen because they are implemented in instruments measuring similar constructs and the authors regarded them as representing clinically plausible mechanisms. The strategies will be compared by examining the root mean squared difference, as defined in the Appendix, between estimates of the item parameters and by comparing estimates of the mean functional status in the group.
Cold deck imputation replaces each missing data point with a pre-determined constant. This may be the same for each data point or vary with factors internal or external to the data. For example, it has been recommended that missing item responses in the SF-36 be replaced by the mean of the responses to other items in the same sub-scale [1, 2]. Imputing the same value for all missing data points can be attractive because of its apparent simplicity or because researchers feel that they have a strong justification for the choice of constant in the context of the data. However, this method artificially reduces the amount of variability in the data, possibly leading to substantial bias in parameter estimates. In addition, statistical theory provides little support for this method. The cold deck imputation procedure used in this paper replaces all responses made in the category 'not applicable' with 'cannot'. This is consistent with some other questionnaires for measuring aspects of functional status, such as the Sickness Impact Profile, the Mini-mental state examination and the CAMCOG, in which items to which patients make no response are coded in a 'negative' category.
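As a minimal sketch of the cold deck rule described above, the following Python fragment recodes every 'not applicable' entry as 'cannot'. The coding (1 = 'can', 0 = 'cannot', NaN = 'not applicable'), the function name and the small example matrix are assumptions made for illustration; they are not taken from the ALDS data or software.

```python
import numpy as np

def cold_deck_impute(responses, fill_value=0):
    """Replace every missing entry (NaN) with one pre-determined constant."""
    data = np.asarray(responses, dtype=float)
    return np.where(np.isnan(data), fill_value, data)

# Hypothetical 3 patients x 4 items: 1 = 'can', 0 = 'cannot', NaN = 'not applicable'
example = np.array([[1, np.nan, 0, 1],
                    [np.nan, 1, 1, 0],
                    [0, 0, np.nan, np.nan]])

# With fill_value=0, all 'not applicable' responses are treated as 'cannot'
imputed = cold_deck_impute(example)
```

Because every imputed value is identical, the completed matrix contains no extra variability, which is the weakness of the method noted in the text.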
Hot deck imputation replaces each missing value with a value drawn from a plausible distribution incorporating theoretical or observed aspects of the data. Clinicians may feel that hot deck imputation procedures introduce an unnecessary random element into their data, and hence be wary of these methods. However, if the hot deck procedure is run a number of times and each data set is analysed in the same way, differences in the results can be used to make inferences about the effect of the imputation procedure. In this paper, the hot deck imputation procedure has been run five times, resulting in five complete data sets. The procedure is based on logistic regression and closely mirrors the one-parameter logistic IRT model used in the statistical analysis. It is constructed so that patients with a higher level of functional status have a higher probability of having responses in the category 'can carry out the activity' imputed than patients with a lower level of functional status. Similarly, responses imputed for more difficult items are more likely to be in the category 'cannot carry out the activity' than those for easier items. Technical details of the hot deck imputation procedure are given in the Appendix.
In some circumstances, it may be desirable to act as if the researchers had no intention of collecting the missing data points. This avoids any potential bias or reduction of variability introduced by an imputation procedure. Care should be taken that only the data points that are actually missing are 'ignored', rather than that the whole case, or unit, is removed from the analysis, as occurs in many standard procedures. When using IRT and marginal maximum likelihood estimation procedures [20, 21], it is possible to treat items to which no response was made as if they had never been offered to the respondent. This is equivalent to ignoring the missing responses and is essential in the application of computerised adaptive testing [23, 24]. This procedure is explained in more depth in the Appendix.
A number of models have been proposed which directly incorporate the pattern of 'missing' item responses into the model used to examine the data. These models rest on the assumption that two, perhaps related, processes are at work when an item is presented to a patient. The first process can be described as the tendency to judge items to be applicable to one's own situation or the tendency to respond to items. The second process reflects the patients' functional status. These two processes can be modelled jointly by using the one-parameter logistic IRT model for each process individually and assuming that the health status of a patient and the tendency to judge items to be applicable are correlated. This type of model is described in more depth elsewhere.
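One way to write down a joint model of this kind is sketched here, under stated assumptions. The symbols $d_{ik}$ (1 if patient $k$ gives a 'can' or 'cannot' response to item $i$, 0 for 'not applicable'), $x_{ik}$ (1 for 'can', 0 for 'cannot'), $\xi_k$ (tendency to respond), $\delta_i$ (item parameter of the response process) and $\rho$ are introduced purely for illustration and may differ from the notation and parameterisation used by the authors.

```latex
\begin{align}
P(d_{ik} = 1 \mid \xi_k) &= \frac{\exp(\xi_k - \delta_i)}{1 + \exp(\xi_k - \delta_i)}, \\
P(x_{ik} = 1 \mid d_{ik} = 1, \theta_k) &= \frac{\exp(\theta_k - \beta_i)}{1 + \exp(\theta_k - \beta_i)}, \\
(\theta_k, \xi_k) &\sim N_2(\boldsymbol{\mu}, \boldsymbol{\Sigma}), \qquad \operatorname{corr}(\theta_k, \xi_k) = \rho .
\end{align}
```

In this formulation each process follows a one-parameter logistic model, and the two are linked only through the correlation $\rho$ between the latent variables. If $\rho = 0$, the pattern of 'not applicable' responses carries no information about functional status.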
In this paper, the one-parameter logistic model, sometimes known as the Rasch model, is used as a tool to analyse the response patterns given by patients to a set of items. This model examines the probability $P_{ik}$ that patient $k$, with functional status equal to $\theta_k$, responds to item $i$ in the category 'can carry out', where

$$P_{ik} = \frac{\exp(\theta_k - \beta_i)}{1 + \exp(\theta_k - \beta_i)}$$

and $\beta_i$ describes the 'difficulty' of item $i$ in relation to the construct functional status. It is unlikely that this model would fit functional status data satisfactorily enough to be used as a final model for an instrument, but since the aim of this study is to compare the performance of a number of methods for dealing with missing data, this simpler model is acceptable. The extent to which all items represented a single construct was examined using Cronbach's alpha coefficient.
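The one-parameter logistic model in its usual form, P = exp(θ − β) / (1 + exp(θ − β)), can be illustrated with a short Python sketch. The function name and the example values of θ and β are hypothetical and chosen only to show how the probability of a 'can carry out' response moves with functional status and item difficulty.

```python
import numpy as np

def rasch_prob(theta, beta):
    """Probability of a 'can carry out' response under the one-parameter logistic model."""
    theta = np.asarray(theta, dtype=float)
    beta = np.asarray(beta, dtype=float)
    return 1.0 / (1.0 + np.exp(-(theta - beta)))

# Hypothetical values: functional status equal to, above and below the item difficulty
p_equal = rasch_prob(0.0, 0.0)    # status equals difficulty
p_easy = rasch_prob(1.0, -1.0)    # able patient, easy item
p_hard = rasch_prob(-1.0, 1.0)    # less able patient, difficult item
```

When functional status equals item difficulty the probability is one half; it rises for easier items or more able patients and falls in the opposite case.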
In this paper, a two stage procedure was used to estimate the parameters in the one-parameter logistic model. Firstly, the item parameters (β i ) were estimated. In this process it was assumed that the values of the functional status (θ k ) formed a Normal distribution, resulting in marginal maximum likelihood estimates. Secondly, estimates of the patients' functional status (θ k ) were obtained.
The fit of the model to the data was assessed using weighted residual based indices transformed to approximately standard Normal deviates [20, 29]. Values above 2.54 (1% level) were regarded as indicative of item misfit. Estimates of the item difficulty parameters (β i ) obtained using the different procedures for dealing with missing data were compared using the root mean squared difference, as described in the Appendix.
The best estimates of functional status for individual patients are usually obtained using maximum likelihood methods. However, clinical studies are often more concerned with inferences based on groups of patients. It has been shown that using maximum likelihood estimates of the functional status (θ k ) in standard statistical techniques can lead to substantial biases [30, 31]. To avoid this, plausible values for the functional status of each patient have been drawn from their own posterior distribution of θ. The item parameters and patients' functional status have been estimated in ConQuest. Other calculations were carried out in S-PLUS.
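The idea of drawing plausible values from a patient's posterior distribution of θ can be sketched as follows. This is an illustrative grid approximation assuming a Normal prior and known item difficulties; it is not the algorithm implemented in ConQuest, and the function name, grid and example values are hypothetical.

```python
import numpy as np

def draw_plausible_values(responses, beta, n_draws=5, prior_mean=0.0,
                          prior_sd=1.0, rng=None):
    """Draw values of theta from a grid approximation to one patient's posterior."""
    rng = np.random.default_rng() if rng is None else rng
    responses = np.asarray(responses, dtype=float)
    beta = np.asarray(beta, dtype=float)
    grid = np.linspace(prior_mean - 6 * prior_sd, prior_mean + 6 * prior_sd, 601)

    # Use only the items with a 'can' (1) or 'cannot' (0) response
    observed = ~np.isnan(responses)
    x = responses[observed]
    beta_obs = beta[observed]

    # Rasch probabilities for every grid point (rows) and observed item (columns)
    p = 1.0 / (1.0 + np.exp(-(grid[:, None] - beta_obs[None, :])))
    log_lik = (x * np.log(p) + (1 - x) * np.log1p(-p)).sum(axis=1)

    # Unnormalised log posterior: log likelihood plus log Normal prior
    log_post = log_lik - 0.5 * ((grid - prior_mean) / prior_sd) ** 2
    weights = np.exp(log_post - log_post.max())
    weights /= weights.sum()
    return rng.choice(grid, size=n_draws, p=weights)

# Hypothetical example: four items, one patient answering 'can' to all,
# another answering 'cannot' to all
rng = np.random.default_rng(0)
item_difficulties = [-1.0, 0.0, 0.0, 1.0]
high = draw_plausible_values([1, 1, 1, 1], item_difficulties, n_draws=200, rng=rng)
low = draw_plausible_values([0, 0, 0, 0], item_difficulties, n_draws=200, rng=rng)
```

Group-level summaries can then be computed from the drawn values rather than from point estimates of θ, which is the motivation given in the text.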
The estimates of the item parameters (β i ) and their standard errors are given in Table 1. Standard errors for the parameters in the 'tendency to respond' model are not currently available in the software. This is indicated by the symbol '-' in Table 1. Items denoted by (++) demonstrated item misfit across more than one method and items denoted by (+) demonstrated item misfit for one method. The values of Cronbach's alpha coefficient for each procedure are given in the bottom row of Table 1. All values are greater than 0.8, indicating that the items reflect a single construct.
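For readers who want to reproduce this kind of check, Cronbach's alpha for a complete data matrix can be computed as in the sketch below. It uses the standard formula based on item variances and the variance of the sum score; the function name and the toy data are assumptions for illustration and are unrelated to the ALDS responses.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a complete patients x items score matrix."""
    scores = np.asarray(scores, dtype=float)
    n_items = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1).sum()
    total_variance = scores.sum(axis=1).var(ddof=1)
    return n_items / (n_items - 1) * (1.0 - item_variances / total_variance)

# Hypothetical toy data: three items that always agree, so alpha equals 1
toy = np.array([[1, 1, 1],
                [0, 0, 0],
                [1, 1, 1],
                [0, 0, 0]])
alpha = cronbach_alpha(toy)
```

The function needs a complete matrix, so it would be applied to a data set after imputation or to the subset of patients with no 'not applicable' responses.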
The root mean squared differences.
Column labels: 1st run hot deck; 2nd run hot deck; Mean 5 runs hot deck; Items never offered.
Row labels: 1st run hot deck; 2nd run hot deck; Mean 5 runs hot deck; Items never offered; Tendency to respond.
The root mean squared differences.
Column labels: Procedure used to deal with NA responses; 95% Confidence interval for mean.
Row labels: Cold deck imputation; Hot deck imputation; Treating 'NA' as if the items had never been presented.
In the ALDS project, 'not applicable' item responses occur when patients have never had the opportunity to attempt to perform the activity described. This means that it is not possible to assess whether a respondent would be able to perform an activity if they had had an opportunity to do so. Hence, there is no theoretical evidence to support the use of the cold deck imputation procedure described in this article, even though comparable methods are used in some, broadly similar, questionnaires such as the Sickness Impact Profile.
The procedures for dealing with missing item responses, which use hot deck imputation or treat the missing responses as if these items had never been offered to those individual patients and are described in this article, could both be useful in the calibration phase of an item bank based on item response theory. The latter method can be implemented if marginal maximum likelihood or some Bayesian estimation methods are applied to avoid any bias caused by the imputation method. The hot deck imputation procedure may be valuable in situations where a complete data matrix is required. However, it should be noted that there are three reasons that the hot deck imputation procedure performs so well for the data in this paper. Firstly, the hot deck imputation procedure closely resembles the IRT model used. Secondly, the model fits the data fairly well. Finally, 32 items have been used. It is highly likely that a poor outcome for the hot deck imputation procedure would have resulted if these conditions had not pertained. However, it should be noted that it may be impractical to repeat exploratory analyses a number of times, reducing the attractiveness of true multiple hot deck imputation, although results obtained using a single run of a hot deck imputation procedure should be treated with care. Finally, if the aim of a study is to make inferences on the functional status of patients, the procedure, which takes account of the 'tendency to respond to items' may be a valuable tool. However, in a calibration study to estimate item difficulty parameters this model does not provide any more useful information than when hot deck imputation is implemented or the missing responses were treated as if these items had never been offered to those individual patients.
There were almost no true missing item responses in the data described in this paper. The nurse interviewers were instructed to ensure that they had a response on each item and the response forms were machine readable. These procedures eliminated two important causes of missing data. The 'not applicable' option was only selected after the nurse-interviewer had made extensive inquiries into the experiences of the respondent. Hence, it seems reasonable to assume that the 'not applicable' category was used for the reason described. However, qualitative research on the reasons why respondents used this category would be needed to be sure about this. Given the relatively low level of responses in the category 'not applicable', the authors feel unable to make recommendations about the use of these procedures in data sets with much higher proportions of missing data. All four methods are relatively practical and can be implemented fairly easily. However, the hot and cold deck imputation methods are more suitable if analysis using software requiring a complete data matrix is to be carried out.
The ALDS item bank is currently under development. This means that the dimensionality and measurement properties of the item bank are still being investigated. Preliminary results suggest that a selection of items reflects a single latent trait, although there is a large degree of differential item functioning between male and female and between younger and older respondents. It has been decided that items for which more than 10% of responses are in the category 'not applicable' are not suitable for inclusion in the item bank. Hot deck imputation and the procedure treating the items as if they had never been presented to the respondents have been implemented in different types of analysis of the ALDS data.
This article has examined four strategies to deal with responses in a 'not applicable' category in the context of missing data when item response theory is used to analyse the data resulting from multi-item questionnaires. These were cold and hot deck imputation, treating the missing responses as if these items had never been offered to those individual patients and using a model which takes account of the 'tendency to respond to items'. The four procedures were implemented on data from the AMC Linear Disability Score project. This project aims to develop an item bank to measure the functional status of chronically ill patients. In the first part of this study, estimates of the item parameters were obtained and compared using a numerical and a graphical method. The results show that the point and interval estimates obtained are very similar when the procedures based on hot deck imputation, treating the missing responses as if these items had never been offered to those individual patients and using a model which takes account of the 'tendency to respond to items' are used. The estimates obtained following the cold deck imputation procedure were substantially different to the estimates obtained using the other strategies.
In the second part of the study, the effects of the type of procedure on estimates of the functional status of patients were examined. It appears that cold deck imputation leads to significantly different estimates of the mean functional status in a group of patients than either hot deck imputation or treating the missing responses as if these items had never been offered. Differences between estimates obtained using the latter two methods were not significant. These results confirm that, in clinical studies, it is necessary to consider the method for dealing with responses in a 'not applicable' category in the context of the data.
Hot deck imputation
In the hot deck imputation procedure implemented in this paper, the functional status of patient $k$ is estimated by $t_k$,
where $m_{1k}$ and $m_{0k}$ are the number of questions patient $k$ responded to in the categories 'can' and 'cannot', respectively. Using the data from patients that had responded to item $i$, the probability $r_{ik}$ that patient $k$ responded in category 'can' was modelled using

$$r_{ik} = \frac{\exp(b_{0i} + b_{1i} t_k)}{1 + \exp(b_{0i} + b_{1i} t_k)}$$

where the parameters $b_{0i}$ and $b_{1i}$ describe the relationship between the functional status, estimated by $t_k$, and the probability of responding in category 'can' of item $i$. In turn, if patient $l$, $l \in (1, 2, \ldots, K)$, did not respond to item $i$, the values of $\hat{b}_{0i}$, $\hat{b}_{1i}$ and $t_l$ were used in $r_{ik}$ to obtain an estimate $\hat{r}_{il}$ of $r_{il}$. This probability is used to obtain an observation on a Binomial distribution, $B(1, \hat{r}_{il})$, which is imputed to replace the missing observation on item $i$ for patient $l$.
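The steps of this procedure can be sketched in Python as follows. Several details are assumptions made only so that the example runs: the proxy for functional status is taken to be a log-odds of the 'can' and 'cannot' counts with 0.5 added to each (the exact expression for t k is not reproduced here), the logistic regression is fitted by a simple Newton-Raphson loop with a small ridge penalty for numerical stability, and the data are simulated. All function names are hypothetical.

```python
import numpy as np

def ability_proxy(data):
    """ASSUMED proxy for t_k: log-odds of 'can' versus 'cannot' counts (+0.5 each)."""
    m1 = np.sum(data == 1, axis=1)
    m0 = np.sum(data == 0, axis=1)
    return np.log((m1 + 0.5) / (m0 + 0.5))

def fit_logistic(t, y, n_iter=25, ridge=0.1):
    """Newton-Raphson fit of P(y = 1) = expit(b0 + b1 * t); ridge term is an assumption."""
    X = np.column_stack([np.ones_like(t), t])
    b = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ b)))
        grad = X.T @ (y - p) - ridge * b
        hess = (X * (p * (1 - p))[:, None]).T @ X + ridge * np.eye(2)
        b = b + np.linalg.solve(hess, grad)
    return b

def hot_deck_impute(data, rng):
    """Impute each missing entry with a Bernoulli draw from the fitted item model."""
    data = np.asarray(data, dtype=float)
    t = ability_proxy(data)
    completed = data.copy()
    for i in range(data.shape[1]):
        missing = np.isnan(data[:, i])
        if not missing.any():
            continue
        # Fit the item-specific regression on patients who responded to item i
        b0, b1 = fit_logistic(t[~missing], data[~missing, i])
        r = 1.0 / (1.0 + np.exp(-(b0 + b1 * t[missing])))
        completed[missing, i] = rng.binomial(1, r)
    return completed

# Simulated (hypothetical) data: 200 patients, 8 items, about 5% 'not applicable'
rng = np.random.default_rng(1)
theta = rng.normal(size=200)
beta = np.linspace(-1.5, 1.5, 8)
prob = 1.0 / (1.0 + np.exp(-(theta[:, None] - beta[None, :])))
data = rng.binomial(1, prob).astype(float)
data[rng.random(data.shape) < 0.05] = np.nan

# Five runs give five 'complete' data sets, as in the paper
completed_sets = [hot_deck_impute(data, rng) for _ in range(5)]
```

Because each run draws new random values, the five completed data sets differ only in the imputed cells, and the spread of the resulting estimates reflects the uncertainty added by imputation.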
In this paper, this procedure was implemented five times, resulting in five 'complete' data sets. The mean of the five estimates of $\beta_i$ was taken to obtain $\bar{\beta}_i$. The standard error of $\bar{\beta}_i$ is defined as
where $j$ denotes the run of the hot deck imputation procedure, $\beta_{ij}$ the estimate of $\beta$ obtained for item $i$ in run $j$ of the imputation procedure and s.e.($\beta_{ij}$) the standard error of $\beta_{ij}$ obtained directly from the likelihood in the estimation process.
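A widely used way of combining estimates and standard errors across imputed data sets is Rubin's combining rule for multiple imputation, which uses the same ingredients as those listed above (the per-run estimates and their standard errors). The sketch below implements that rule as an illustration; it is an assumption that this corresponds to the expression used by the authors, and the function name and example numbers are hypothetical.

```python
import numpy as np

def pool_estimates(estimates, std_errors):
    """Combine per-run estimates (runs x items) using Rubin's combining rule."""
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    m = estimates.shape[0]                        # number of imputation runs
    pooled = estimates.mean(axis=0)               # mean estimate per item
    within = (std_errors ** 2).mean(axis=0)       # average within-run variance
    between = estimates.var(axis=0, ddof=1)       # between-run variance
    total = within + (1.0 + 1.0 / m) * between
    return pooled, np.sqrt(total)

# Hypothetical check: two runs, one item, estimates 0 and 2, each with s.e. 1
pooled, pooled_se = pool_estimates([[0.0], [2.0]], [[1.0], [1.0]])

# Identical runs: the pooled s.e. equals the common within-run s.e.
same, same_se = pool_estimates([[1.0]] * 5, [[0.2]] * 5)
```

The between-run term is what distinguishes this from simply averaging the standard errors: it grows when the imputed values change the estimates from run to run.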
Treating the missing responses as if those items were never offered to the individual patients
In order to examine the effect of treating responses to individual items in the category 'not applicable' as if those items were never offered to the individual patients, the item parameters, $\beta_i$, will be estimated using a marginal maximum likelihood estimation procedure. The likelihood, $L$, of a particular response pattern for the one parameter logistic IRT model can be written

$$L = \prod_{k} \prod_{i} \left[ P_{ik}^{I_{ik}} \left(1 - P_{ik}\right)^{1 - I_{ik}} \right]^{J_{ik}}$$

where $P_{ik}$ is as defined in the section on statistical analysis. In addition, $I_{ik}$ is an indicator variable taking the value 1 if patient $k$ responds to item $i$ in the category 'can carry out', the value 0 if patient $k$ responds to item $i$ in the category 'cannot carry out' and the value $c$ if patient $k$ responds to item $i$ in the category 'not applicable'. Furthermore, $J_{ik}$ is an indicator variable taking the value 0 if patient $k$ responds to item $i$ in the category 'not applicable' and the value 1 otherwise, so that the value of $c$ does not affect the likelihood. In order to estimate $\beta_i$ and $\theta_k$ a number of assumptions have to be made. Firstly, the item parameters have to be identified in relation to the latent trait. In this article, the mean of the distribution of $\theta$, $\mu_\theta$, will be assumed to be 0. An increase in the number of subjects from $k$ to $k + 1$ results in a corresponding increase in the number of parameters to be estimated, meaning that parameter estimates may not be consistent. It is common to assume that the values $\theta_k$ are observations on a particular, often Normal, distribution. This results in marginal maximum likelihood estimates of $\beta_i$.
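The effect of the indicator for 'not applicable' responses can be illustrated with a small sketch of the marginal log-likelihood. It assumes a standard Normal distribution for θ and approximates the integral by Gauss-Hermite quadrature; the function name, the number of quadrature nodes and the example values are assumptions for illustration and do not describe the implementation in ConQuest.

```python
import numpy as np

def marginal_log_likelihood(data, beta, n_nodes=21):
    """Marginal log-likelihood of the Rasch model, skipping 'not applicable' (NaN) cells."""
    data = np.asarray(data, dtype=float)
    beta = np.asarray(beta, dtype=float)

    # Probabilists' Gauss-Hermite rule integrates against exp(-x^2 / 2)
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_nodes)
    weights = weights / np.sqrt(2.0 * np.pi)    # rescale to the N(0, 1) density

    # Rasch probabilities at each node (rows) for each item (columns)
    p = 1.0 / (1.0 + np.exp(-(nodes[:, None] - beta[None, :])))

    J = ~np.isnan(data)              # False for 'not applicable', True otherwise
    I = np.where(J, data, 0.0)       # value used for NA cells is irrelevant

    # log of [p^I (1 - p)^(1 - I)]^J, arranged as patients x nodes x items
    log_term = I[:, None, :] * np.log(p)[None] + (1 - I[:, None, :]) * np.log1p(-p)[None]
    log_lik_nodes = (log_term * J[:, None, :]).sum(axis=2)

    # Integrate over theta for each patient, then sum the log contributions
    return np.log(np.exp(log_lik_nodes) @ weights).sum()

# Hypothetical check: a 'not applicable' response contributes nothing,
# exactly as if the item had never been offered to that patient
with_na = marginal_log_likelihood([[1, np.nan, 0]], [-1.0, 0.0, 1.0])
item_dropped = marginal_log_likelihood([[1, 0]], [-1.0, 1.0])
```

Maximising this function over the item difficulties would give marginal maximum likelihood estimates; the sketch only evaluates it.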
The root mean squared difference
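The root mean squared difference (RMSD) is defined as

$$\mathrm{RMSD} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \hat{\beta}_i^{(a)} - \hat{\beta}_i^{(b)} \right)^2}$$

where $\hat{\beta}_i^{(a)}$ and $\hat{\beta}_i^{(b)}$ denote the estimates of $\beta_i$ obtained using two of the procedures and $n$ is the number of items.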
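As a brief illustration, the root mean squared difference between two sets of item parameter estimates can be computed as in the sketch below. It assumes the usual definition (the square root of the mean of the squared differences over items); the function name and the numbers are hypothetical and are not estimates from Table 1.

```python
import numpy as np

def rmsd(estimates_a, estimates_b):
    """Root mean squared difference between two vectors of item parameter estimates."""
    a = np.asarray(estimates_a, dtype=float)
    b = np.asarray(estimates_b, dtype=float)
    return np.sqrt(np.mean((a - b) ** 2))

# Hypothetical estimates for four items under two procedures
procedure_a = [0.0, 0.0, 0.0, 0.0]
procedure_b = [1.0, -1.0, 1.0, -1.0]
difference = rmsd(procedure_a, procedure_b)
no_difference = rmsd(procedure_a, procedure_a)
```

A value close to zero indicates that two procedures lead to nearly identical item parameter estimates, which is how the strategies are compared in this paper.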
RH and RL were supported by a grant from the Anton Meelmeijer fonds, a charity supporting innovative research in the Academic Medical Center, Amsterdam, The Netherlands.
RH conceived the study, prepared the first draft and carried out the analyses. CAWG, RL, AHZ and RJdH critically reviewed the manuscript. RH prepared the final version.
IRT: Item response theory
ALDS: AMC Linear Disability Score
RMSD: Root mean squared difference
The authors would like to thank Janneke te Marvelde for her help in developing the Figures in this paper.
1. McHorney CA, Ware JE, Lu JF, Sherbourne CD: The MOS 36-item short-form health survey (SF-36): III. Tests of data quality, scaling assumptions, and reliability across diverse patient groups. Med Care 1994, 32: 40–66.
2. Rand Health Sciences Program: Rand 36-item Health Survey 1.0. Santa Monica, California: Rand Corporation 1992.
3. Roth M, Tym E, Mountjoy CO, Huppert FA, Hendrie H, Verma S, Goddard R: CAMDEX. A standardised instrument for the diagnosis of mental disorder in the elderly. British Journal of Psychiatry 1986, 149: 698–709.
4. Holman R, Lindeboom R, Vermeulen M, Glas CAW, de Haan RJ: The Amsterdam Linear Disability Score (ALDS) project. The calibration of an item bank to measure functional status using item response theory. Quality of Life Newsletter 2001, 27: 4–5. [http://www.mapi-research-inst.com/allissue.asp]
5. Fayers PM, Curran D, Machin D: Incomplete quality of life data in randomized trials: missing items. Stat Med 1998, 17: 679–696. 10.1002/(SICI)1097-0258(19980315/15)17:5/7<679::AID-SIM814>3.3.CO;2-O
6. Hunsberger S, Murray D, Davis CE, Fabsitz RR: Imputation strategies for missing data in a school-based multi-centre study: the pathways study. Stat Med 2001, 20: 305–16. 10.1002/1097-0258(20010130)20:2<305::AID-SIM645>3.0.CO;2-M
7. Faris PD, Ghali WA, Brant R, Norris CM, Galbraith PD, Knudtson ML: Multiple imputation versus data enhancement for dealing with missing data in observational health care outcome analyses. J Clin Epidemiol 2002, 55: 184–91. 10.1016/S0895-4356(01)00433-4
8. Schafer JL: Analysis of incomplete multivariate data. New York: Chapman and Hall 1997.
9. Heckman JJ: Sample selection bias as a specification error. Econometrica 1979, 47: 153–161.
10. Rubin DB: Inference and missing data. Biometrika 1976, 63: 581–92.
11. Rubin DB: Multiple Imputation for Nonresponse in Surveys. New York: Wiley 1987.
12. Little RJA, Rubin DB: Statistical analysis with missing data. New York: Wiley 1987.
13. Lindeboom R, Vermeulen M, Holman R, de Haan RJ: Activities of daily living instruments in clinical neurology, optimizing scales for neurologic assessments. Neurology 2003, 60: 738–742.
14. Holman R, Lindeboom R, de Haan RJ: Gender and age based differential item functioning in the AMC linear disability score project. Quality of Life Newsletter 2004, 32: 1–4. [http://www.mapi-research-inst.com/allissue.asp]
15. Fillenbaum GG, George LK, Blazer DG: Scoring nonresponse on the mini-mental state examination. Psychological Medicine 1988, 18: 1021–5.
16. Bergner M, Bobbitt RA, Carter WB, Gilson BS: The sickness impact profile: development and final revision of a health status measure. Med Care 1981, 19: 787–805.
17. Holman R, Glas CAW, Zwinderman AH, de Haan RJ: The treatment of not applicable responses in an item bank to measure functional status using item response theory. Poster presented at the 23rd meeting of the International Society for Biostatistics. Held in Dijon, France 11–13 September 2002.
18. Kolen MJ, Brennan RL: Test Equating. New York: Springer 1995.
19. Holman R, Lindeboom R, Glas CAW, Vermeulen M, de Haan RJ: Constructing an item bank using item response theory: the AMC linear disability score project. Health Services and Outcomes Research Methodology 2003, 4: 19–33. 10.1023/A:1025824810390
20. Wu ML, Adams RJ, Wilson MR: ACER ConQuest: Generalised Item Response Modelling Software. Melbourne: ACER Press 1998.
21. Thissen D: Marginal maximum likelihood estimation for the one parameter logistic model. Psychometrika 1982, 47: 175–186.
22. Lord FM: Maximum likelihood estimation of item response parameters when some responses are omitted. Psychometrika 1983, 48: 477–482.
23. van der Linden WJ, Glas CAW: Computerized Adaptive Testing. Theory and Practice. Dordrecht, the Netherlands: Kluwer Academic Publishers 2000.
24. Mislevy RJ, Chang H: Does adaptive testing violate local independence? Psychometrika 2000, 65: 149–156.
25. Andersen EB: Estimating latent correlations between repeated testings. Psychometrika 1985, 50: 3–16.
26. Holman R, Glas CAW: Modelling non-ignorable missing data mechanisms with item response theory models. British Journal of Mathematical and Statistical Psychology, in press.
27. Rasch G: On general laws and the meaning of measurement in psychology. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability 1961, 4: 321–34.
28. Cronbach LJ: Coefficient alpha and the internal structure of tests. Psychometrika 1951, 16: 297–334.
29. Wright BD, Masters GN: Rating scale analysis: Rasch measurement. Chicago, IL: MESA Press 1982.
30. May K, Nicewander WA: Measuring change conventionally and adaptively. Educational and Psychological Measurement 1998, 58: 882–897.
31. Little RJA, Rubin DB: On jointly estimating parameters and missing data by maximising the complete-data likelihood. American Statistician 1983, 37: 218–220.
32. Pinheiro JC, Bates DM: Mixed-Effects Models in S and S-PLUS. New York: Springer-Verlag 2000.
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.