
The COPD-SIB: a newly developed disease-specific item bank to measure health-related quality of life in patients with chronic obstructive pulmonary disease



Health-related quality of life (HRQoL) is widely used as an outcome measure in the evaluation of treatment interventions in patients with chronic obstructive pulmonary disease (COPD). In order to address challenges associated with existing fixed-length measures (e.g., too long to be used routinely, too short to ensure both content validity and reliability), a COPD-specific item bank (COPD-SIB) was developed.


Items were selected based on a literature review and interviews with Dutch COPD patients, with a strong focus on both content validity and item comprehension. The psychometric quality of the item bank was evaluated using Mokken Scale Analysis and parametric Item Response Theory, using data from 666 COPD patients.


The final item bank contains 46 items that form a strong scale, tapping into eight important themes that were identified based on literature review and patient interviews: Coping with disease/symptoms, adaptability; Autonomy; Anxiety about the course/end-state of the disease, hopelessness; Positive psychological functioning; Situations triggering or enhancing breathing problems; Symptoms; Activity; Impact.


The 46-item COPD-SIB has good psychometric properties and content validity. Items are available in Dutch and English. The COPD-SIB can be used as a stand-alone instrument, or to inform computerised adaptive testing.


In the last few decades, it has been recognised that it is imperative to include health-related quality of life (HRQoL) as an outcome measure in the evaluation of treatment interventions in patients with chronic obstructive pulmonary disease (COPD) [1, 2]. COPD is a chronic respiratory condition that cannot be cured; therefore, many COPD treatment programmes focus on the self-management of symptoms and their effect on the patient’s HRQoL [3].

Currently, HRQoL in patients with COPD is typically measured by means of standardised self-report questionnaires that were developed using Classical Test Theory (CTT) [4]. Although most HRQoL questionnaires have been extensively validated, their use is not without limitations; many of these limitations stem directly from the static nature of the current generation of questionnaires [5]. To facilitate the comparison of scores within and among patients, the same questions need to be administered to each patient at each time-point. This means that a single set of questions should be suitable to assess the entire underlying range of HRQoL (from very good to very poor) and should provide sufficient measurement precision at all levels in between. Consequently, a large number of questions are typically required to achieve both sufficient measurement width (content validity) and precision (reliability). This places a considerable burden on patients, who have to complete numerous items, many of which seem irrelevant or redundant to their specific situation. Ideally, each questionnaire would be tailored to the individual patient, so that each item (question) solicits valuable information, without sacrificing comparability across patients. This flexibility can be achieved using a modern technique: computerised adaptive testing (CAT). CAT [6] is a specific type of computer-based testing that uses an Item Response Theory (IRT) [7] measurement model for item selection during test taking. IRT and CAT were first used in the field of educational measurement; in the last few decades, both techniques have become increasingly popular in health research. Item selection in a CAT depends on a patient’s estimated score on one or more latent traits. The estimate of the patient’s score on the latent trait (here: HRQoL) is adjusted after each answered item, until a specific pre-defined criterion is reached [8].
This procedure permits a higher degree of precision with fewer items than a procedure using static scales [8]. CAT is scored in real-time; results can be displayed to the physician and/or patient almost instantly in written and graphic reports.
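The adaptive loop described above (estimate the latent trait, administer the most informative remaining item, stop once a pre-defined precision criterion is met) can be sketched in a few lines. The sketch below is illustrative only: it uses a simple dichotomous 2PL model rather than the GRM, a coarse grid-based trait estimate, and made-up item parameters.

```python
import math
import random

def p_endorse(theta, a, b):
    """2PL probability of endorsing an item (simplified stand-in for the GRM)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of one 2PL item at trait value theta."""
    p = p_endorse(theta, a, b)
    return a * a * p * (1.0 - p)

def estimate_theta(responses, items, grid=None):
    """Posterior-mean (EAP-style) estimate over a grid, standard-normal prior."""
    grid = grid or [g / 10.0 for g in range(-40, 41)]
    post = []
    for t in grid:
        like = math.exp(-0.5 * t * t)  # unnormalised N(0,1) prior density
        for (a, b), x in zip(items, responses):
            p = p_endorse(t, a, b)
            like *= p if x == 1 else (1.0 - p)
        post.append(like)
    total = sum(post)
    return sum(t * w for t, w in zip(grid, post)) / total

def run_cat(bank, answer_fn, se_stop=0.45, max_items=10):
    """Administer items until the standard error falls below se_stop."""
    administered, responses = [], []
    theta = 0.0
    while len(administered) < max_items:
        remaining = [it for it in bank if it not in administered]
        if not remaining:
            break
        # maximum-information item selection at the current theta estimate
        nxt = max(remaining, key=lambda it: item_information(theta, *it))
        administered.append(nxt)
        responses.append(answer_fn(nxt))
        theta = estimate_theta(responses, administered)
        info = sum(item_information(theta, *it) for it in administered)
        if info > 0 and (1.0 / math.sqrt(info)) < se_stop:
            break  # pre-defined precision criterion reached
    return theta, administered
```

A simulated patient with true θ = 1 can then be "tested" by passing an `answer_fn` that draws responses from the 2PL model; typically far fewer than the full bank of items is needed before the stopping rule fires.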

A CAT selects items from a pool of items: an item bank. An item bank ideally consists of a large number of items covering all relevant aspects of the construct under study. An item bank can be developed from scratch, or built on the foundations of previous work (e.g., using items from existing questionnaires as a starting point) [5, 8]. Item bank development usually includes both quantitative and qualitative methods: evaluating item performance using an IRT model, conducting focus groups to obtain an in-depth understanding of the way the construct is perceived by members of the target population, and conducting cognitive interviews to improve item formulation (see e.g., [9–15]). It is paramount that the items be of good quality, both in terms of content validity and psychometric properties: a CAT can only be as good as the item bank it is based on [8]. After the key concepts to be included in the bank have been identified, the formulation and presentation of the items have been found adequate, and the psychometric properties of the items have been found favourable (acceptable coverage of latent trait values, adequate measurement precision where it is needed), a final calibration of the item bank is performed. From this point onward the item parameters are considered “known” and can be used for item selection in CAT.

There is a need for flexible, accurate, and efficient assessment of quality of life in COPD; currently, there is no gold standard. The St George’s Respiratory Questionnaire (SGRQ) and its COPD-specific version (SGRQ-C) are two of the best-known legacy measures and have been shown to be of high quality; however, their length may make them unsuitable for use in (routine) practice. The purpose of this paper is to describe the development of the COPD-SIB: a COPD-specific HRQoL item bank that can be used to inform CAT, covering topics that are relevant to COPD patients. We report on both qualitative (item selection and generation) and quantitative (psychometric analysis using IRT) aspects of this process.


Item selection and development

A predefined structured item generation methodology was used to select and design items for the COPD-SIB. This procedure consisted of three steps (which are illustrated in Fig. 1). First, it was determined which topics should be covered. Topics were identified by conducting a literature review and by re-analysing interviews with patients conducted previously [9]. This task was performed by LL under the supervision of MP. Second, relevant items were selected from existing instruments based on the findings of step 1, and new items were written to fill gaps (defined as topics that were not sufficiently covered). This task was jointly performed by LL and MP, and reviewed by JP. Third, the items selected and developed in step two were evaluated for relevance and clarity in several sets of cognitive interviews (see Additional files 1 and 2); the results from these interviews were used to further improve the items and fill newly identified gaps (defined as topics that had not been identified in a previous step but emerged as highly relevant based on the interviews conducted in step 3). This task was primarily performed by MP, with contributions from LL and under the supervision of JP.

Fig. 1

Flowchart of the development process of the COPD-specific item bank (COPD-SIB)

The St. George Respiratory Questionnaire for COPD patients (SGRQ-C) was taken as a starting point, since it is widely used and contains many items of high quality [16, 17]. Items from other instruments were considered for inclusion if a) they pertained to themes considered important by COPD patients (importance was deduced from interviews and literature review), and b) they did not show too much overlap with SGRQ-C items. Permission from the developers of the questionnaire for use of these items was a requirement. We included items from five existing questionnaires in our initial item pool: the SGRQ-C, the Quality of Life for Respiratory Illness Questionnaire (QoL-RIQ), the COPD Assessment Test, the Maugeri Respiratory Failure Questionnaire Reduced Form (MRF26), and the VQ11 [18–22]. After items had been selected from existing instruments, the topics covered by these items were compared to the ones most frequently mentioned in the patient interviews. Gaps were identified, and new items were written using statements made by patients as a starting point.

For the SGRQ-C and the COPD Assessment Test, official Dutch translations were available. The items selected from the QoL-RIQ, MRF26, and VQ11 were translated into Dutch by an expert: a native Dutch speaker who holds a university degree in English Language and Culture and has ample experience in English-Dutch and Dutch-English translation. She also translated all newly developed Dutch items into English.

All items in the initial item pool were subjected to cognitive debriefing, using the Three Step Test Interview (TSTI) [23]. In this study, only the Dutch items underwent the process of cognitive debriefing and validation. We plan to repeat this process for the English items in a future study, in collaboration with colleagues from Canada [24]. See Additional file 1 for a detailed explanation of this procedure along with example probes.


Data from three Dutch COPD patient samples were used for the analyses (see Fig. 1). Purposive sampling was used for samples 1 and 2 (interview data); inclusion stopped when saturation was reached. The inclusion criteria were: a medical diagnosis of COPD; sufficient mastery of the Dutch language; being able to answer questions in a face-to-face interview (samples 1 and 2); being able to complete a questionnaire (samples 1-3). All patients in samples 1 and 2 were recruited through pulmonary clinics in the Netherlands. The patients in sample 3 (questionnaire data) were recruited through healthcare professionals in JP’s professional network. See Additional file 2 for detailed information about the samples.

Psychometric evaluation of the item bank

Test design

In addition to evaluating the psychometric properties of the COPD-SIB items, we wanted to establish the measurement properties of three generic HRQoL domains in a Dutch COPD sample. The results for these three domains will be presented in a separate paper. We did not want to create one long questionnaire including all four domains, since this would be very burdensome for patients; therefore we decided to divide the total number of items (see Footnote 1) over three so-called booklets (questionnaire versions), each containing around 100 items (see Footnote 2). The booklets contained between 23 and 32 COPD-SIB items each, of which 10 were anchor items. Anchor items are items that are present in every booklet and which are thought to have stable measurement properties. They can be used to link the items in the different booklets to form a common scale when using parametric IRT (this procedure is also known as equating) [25]. A widely used guideline for selecting anchor items is that this item set should be a mini-version of the whole item bank, implying that the anchor set should cover the same content (but with fewer items) as the total item bank [25]. The anchor item set used in this study was selected by a content expert (JP) to ensure it adequately reflected the original spread in topics. The other COPD-SIB items were divided randomly over the three booklets. See Fig. 2 for a visual impression of the booklet design, and Table 2 for more information regarding which item was included in which booklet.
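The booklet design described above (a fixed anchor set appearing in every booklet, remaining items divided randomly) can be sketched as follows. This is an illustrative reconstruction, not the authors' actual assignment procedure; the item numbers are placeholders.

```python
import random

def assign_booklets(anchor_items, other_items, n_booklets=3, seed=0):
    """Divide items over booklets: anchor items go into every booklet,
    the remaining items are split randomly across the booklets."""
    rng = random.Random(seed)
    shuffled = other_items[:]
    rng.shuffle(shuffled)
    # every booklet starts with the complete anchor set
    booklets = [list(anchor_items) for _ in range(n_booklets)]
    # deal the shuffled non-anchor items round-robin over the booklets
    for idx, item in enumerate(shuffled):
        booklets[idx % n_booklets].append(item)
    return booklets
```

With 10 anchor items and 53 non-anchor items this yields three overlapping booklets whose shared anchor set can later be used to link the booklet-specific items onto a common scale.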

Fig. 2

Visual representation of booklet design with number of items on the y-axis and booklet number on the x-axis. Note that the items are ordered according to their booklet assignment to illustrate the design

Assessing item quality and calibrating the item bank

The main purpose of the current study was to develop a unidimensional disease-specific item bank: the COPD-SIB. We wanted to retain only items of sufficient psychometric quality. The Graded Response Model (GRM; an IRT model suitable for Likert scale data) [26, 27] was estimated to obtain the item parameters needed for the CAT. Several item fit statistics are currently available for the GRM, such as the S-X²; however, these only have adequate power in very large samples [28]. Unsurprisingly, this statistic did not flag any item for misfit in our analysis. Rather than relying on these outcomes, we used two complementary procedures providing outcomes that are not dependent on the IRT model under evaluation: Mokken Scale Analysis (MSA) [29, 30] and nonparametric smoothed regression lines based on a generalised additive model (GAM) [31]. MSA was used to identify items that formed a strong unidimensional scale. Items that were flagged for removal by the MSA were further evaluated by visually inspecting the response curves estimated using GAM plots, to determine the nature of the misfit. A GAM is a generalised linear model extended with a set of smooth functions; the model does not require a detailed specification of parametric relationships, thus allowing for relatively flexible modelling of statistical relationships (typically involving regression splines) [32].

MSA was performed using the R [33] package mokken [34]. The model used was the monotone homogeneity model (MHM), which is a nonparametric IRT model. In recent years, MSA has been increasing in popularity in health research (e.g., [16, 35–42]). MSA is a scaling method that identifies scales that allow an ordering of individuals on an underlying one-dimensional scale using the unweighted sum of item scores. In order to establish which items co-vary and form a scale, scalability coefficients are calculated on three levels: item pairs (H_ij), items (H_i), and the scale as a whole (H). H is based on H_i and reflects the degree to which the scale can be used to reliably order persons on the latent trait using their sum score. Similar to the item-rest correlation, H_i expresses the degree to which an item is related to the other items in the scale. A scale is considered acceptable if 0.3 ≤ H < 0.4, good if 0.4 ≤ H < 0.5, and strong if H ≥ 0.5 [12, 13].
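As an illustration of how these scalability coefficients are computed, the sketch below implements Loevinger's H for the dichotomous case. The mokken package handles the polytomous items used in this study; the dichotomous formula is shown here only because it is the simplest instance of the same idea (observed covariance relative to the maximum covariance attainable given the item marginals).

```python
def scalability(data):
    """Loevinger's H for dichotomous items.

    data: list of respondent rows, each a list of 0/1 item scores.
    Returns (h_pair, h_scale): H_ij for every item pair, and overall H.
    H_ij = cov(X_i, X_j) / cov_max, where cov_max is the largest covariance
    attainable given the two items' marginal proportions.
    """
    n = len(data)
    k = len(data[0])
    means = [sum(row[i] for row in data) / n for i in range(k)]
    h_pair, cov_sum, covmax_sum = {}, 0.0, 0.0
    for i in range(k):
        for j in range(i + 1, k):
            joint = sum(row[i] * row[j] for row in data) / n
            cov = joint - means[i] * means[j]
            cov_max = min(means[i], means[j]) - means[i] * means[j]
            h_pair[(i, j)] = cov / cov_max
            cov_sum += cov
            covmax_sum += cov_max
    return h_pair, cov_sum / covmax_sum
```

For data forming a perfect Guttman pattern (no respondent endorses a "harder" item without also endorsing all "easier" ones), every H_ij equals 1; Guttman errors push the coefficients down, and negative H_ij values flag items that do not belong on the scale.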

The MSA analyses were performed for each booklet separately (since MSA cannot account for the type of test design we used). We first performed confirmatory analyses, using H ≥ 0.3 as a cut-point for an acceptable scale. Since the H-value for one of the booklets fell below the cut-point, the confirmatory analyses were followed by exploratory analyses, again using H ≥ 0.3 as a cut-off. In an exploratory MSA, scales are formed in an iterative manner: the selection algorithm starts with two good items, adding one item at a time using certain criteria (H_i ≥ a user-specified cut-off; the item under consideration shows a positive relationship in terms of H_ij with the other items in the scale). Two selection algorithms are currently available; we chose to use the newer one, the genetic algorithm [43].

The GRM was fitted and parameters were estimated using the R package mirt [44]. Metropolis-Hastings Robbins-Monro (MH-RM) estimation was used with a tolerance threshold of 0.001. The algorithm converged after 602 iterations. The GAM plots were also produced using the mirt package (function itemGAM).


Item selection and development

Domain definition

Eight important themes not covered by PROMIS domains were identified based on literature review and patient interviews:

  1. Coping with disease/symptoms, adaptability

  2. Autonomy

  3. Anxiety about the course/end-state of the disease, hopelessness

  4. Positive psychological functioning

  5. Situations triggering or enhancing breathing problems

  6. Symptoms

  7. Activity

  8. Impact
Items that pertained to these eight themes were selected/written to be included in the COPD-SIB item bank.

Item generation and revision

The items that were selected for psychometric evaluation are listed in Table 1 (English version). Note that the items were coded in such a way that a higher score on the latent trait is indicative of better quality of life. We decided not to include the COPD Assessment Test items, since patients were confused by the format (most patients only read/paid attention to the left half of the items). The SGRQ-C items, on the other hand, were generally very well-received by patients. We used the findings reported by Paap et al. [17] to inform item revision for the SGRQ-C items that were included in the initial item pool.

Table 1 Overview over items selected for psychometric evaluation

We followed an iterative procedure (three revision rounds) for the remaining items, since this subset of the item pool included newly written items. Patients clearly had trouble switching back and forth between different response formats, and strongly objected to dichotomous response options. Therefore, we decided to standardise the response format for all items in the item bank to 5-point Likert scales reflecting magnitude (“not at all” to “very much”), frequency (“never” to “always”), or agreement (“strongly disagree” to “strongly agree”), depending on the item. Composite items were split into separate ones, double negations were rephrased, and the expression “lung disease” was changed to “COPD”. See Table 1 for the original and revised item texts.

Preparing the data for psychometric analysis

A large number of items had low endorsement (n < 10) for at least one response option/category. This can cause problems in psychometric analyses; hence, for these items the problematic categories were merged with adjacent categories. Note that the fact that items have different numbers of response categories after merging does not constitute a problem for the GRM, nor does it hamper the comparison of item discrimination parameters (estimated with the GRM) among items. See Additional file 3 for the R code used to merge item categories. Three items were removed at this stage, due to a large number of missing values (>20 %) per booklet: items 6, 7 and 8.
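The merging step can be sketched as follows. This is a hypothetical Python equivalent, not the actual R code from Additional file 3; the threshold of 10 follows the n < 10 criterion above, and sparse categories are folded into the next higher category (the top category, if sparse, folds downward).

```python
from collections import Counter

def merge_sparse_categories(scores, min_count=10):
    """Recode one item's responses so every remaining category has at
    least min_count endorsements, merging sparse categories into an
    adjacent category. Returns scores renumbered 0..(k-1)."""
    cats = sorted(set(scores))
    counts = Counter(scores)
    mapping = {c: c for c in cats}

    def fold(src, dst):
        counts[dst] += counts[src]
        counts[src] = 0
        for orig, tgt in list(mapping.items()):
            if tgt == src:  # follow earlier folds so chains stay consistent
                mapping[orig] = dst

    # Fold sparse categories upward into their neighbour.
    for idx in range(len(cats) - 1):
        if 0 < counts[cats[idx]] < min_count:
            fold(cats[idx], cats[idx + 1])
    # If the top category is still sparse, fold it downward.
    survivors = [c for c in cats if counts[c] > 0]
    if len(survivors) > 1 and counts[survivors[-1]] < min_count:
        fold(survivors[-1], survivors[-2])
    # Renumber the surviving categories 0..(k-1).
    relabel = {c: i for i, c in enumerate(sorted({mapping[c] for c in cats}))}
    return [relabel[mapping[s]] for s in scores]
```

Because the GRM estimates a separate threshold for each observed category boundary, the recoded item can be calibrated alongside items that kept all five categories.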

Psychometric evaluation of the item bank

Assessing item quality: results of the MSA and visual inspection of GAM plots

MSA requires a complete data-set. Therefore, the MSA was performed for each booklet separately, and two-way imputation was used to create a complete data-set for each booklet (2-4 % missing values per booklet) [45, 46].

The confirmatory analyses resulted in acceptable H-values for booklets 1 (.30) and 3 (.31), but a low H-value for booklet 2 (.26). Taking the results of the three exploratory MSAs together, 20 items (see Table 2) were flagged as problematic (most of them had very low or even negative H_ij values and were not assigned to any scale). Had these items been excluded from the analyses, the H-values would have equalled .43, .40, and .43 for booklets 1, 2, and 3, respectively.

Table 2 Item properties

Visual inspection of the GAM plots (smoothed regression lines) for the items flagged for removal in the MSA revealed, for most items, substantial differences between one or more response curves as estimated under the GAM and their counterparts estimated under the GRM. In some cases, one or more of the response curves was hard to estimate (very erratic, with multiple peaks). For five items (10, 19, 24, 53, 54), a very striking type of misfit was identified: the GAM plots showed that one or more response curves were U-shaped, indicating that both patients with very high and patients with very low θ-scores were likely to endorse these response categories (see Fig. 3 for example plots).

Fig. 3

Option response curves as estimated using the GRM (on the right), and nonparametric smoothed regression lines based on a GAM (on the left), for an item with good fit to the GRM (item 27) and one with bad fit (item 10)

Calibrating the item bank: results of the parametric IRT analysis

Table 2 shows the estimated parameters based on the GRM for 63 out of 66 items (see Footnote 3). Up to five parameters are calculated in this model: the slope (denoted α) and the thresholds (denoted β_j). The slope of an item expresses its ability to discriminate among persons with low and high HRQoL; it is also indicative of how strongly this item is related to the latent trait (denoted θ). The threshold parameters indicate the point on the latent trait scale at which 50 % of the patients would choose the response category in question or a higher one. Since the probability of choosing the lowest category or higher is always 100 %, there is no threshold for the lowest category. Originally, all items were scored on a 5-point Likert scale ranging from 0 to 4; however, since we had to collapse some response categories due to data sparseness, not all items in Table 2 have four thresholds. For example, for item 21 (“Because of my COPD I’m afraid of being alone.”), the categories 0 (strongly agree) and 1 (agree) were merged. Thus, the probability of choosing neither agree nor disagree or a higher category is 50 % for patients with a θ-score of -2.79; the probability of choosing disagree or higher is 50 % for patients with a θ-score of -0.736; and the probability of choosing strongly disagree is 50 % for patients with a θ-score of 1.267.
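The threshold interpretation above can be checked numerically. The sketch below computes GRM category probabilities from a slope and a set of thresholds; the slope value in the test is hypothetical (the estimated values are in Table 2), the thresholds are the item-21 values quoted above.

```python
import math

def grm_probs(theta, a, betas):
    """Category probabilities under the Graded Response Model.

    Cumulative probabilities P(X >= k) = 1 / (1 + exp(-a * (theta - beta_k)));
    the probability of responding exactly in category k is the difference of
    adjacent cumulative probabilities.
    """
    cum = ([1.0]
           + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in betas]
           + [0.0])
    return [cum[k] - cum[k + 1] for k in range(len(betas) + 1)]
```

At θ equal to a threshold β_k, the probability of responding in category k or higher is exactly 50 %, which is the property the paragraph above describes.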

The metric of the threshold values is determined by the distribution of θ. A standard normal distribution (mean = 0, SD = 1) was assumed when estimating the model (this is done to identify the model, similar to confirmatory factor analysis; in Bayesian terms this can be considered as a prior distribution). The threshold values as well as θ-scores may be interpreted relative to this distribution. Bayesian expected a-posteriori (EAP) scoring was used to estimate the θ-scores. The EAP estimator uses prior information (in this case the estimated population distribution in the fitted model) in calculating θ-scores. When this method is used, extreme scores are pulled in toward more realistic values. This is especially useful in cases where patients endorse either the lowest or highest response category on all items, in which case the maximum likelihood estimate is undefined. Figure 4 depicts the distribution of estimated θ-scores as well as the estimated threshold parameters. Both distributions look reasonably normal, and the threshold parameters cover the entire range of relevant θ-values (see Fig. 4).
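A minimal sketch of EAP scoring under a standard-normal prior, using grid quadrature, is shown below (item parameters are hypothetical). It illustrates the pulling-in effect described above: a patient endorsing the most extreme category on every item still receives a finite estimate, whereas the maximum likelihood estimate would be undefined.

```python
import math

def grm_probs(theta, a, betas):
    """GRM category probabilities (differences of cumulative logistic curves)."""
    cum = ([1.0]
           + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in betas]
           + [0.0])
    return [cum[k] - cum[k + 1] for k in range(len(betas) + 1)]

def eap_score(responses, items, n_points=81, lo=-4.0, hi=4.0):
    """EAP estimate of theta under a standard-normal prior.

    responses: observed category per item; items: (slope, thresholds) per item.
    The posterior mean is approximated on an evenly spaced grid.
    """
    step = (hi - lo) / (n_points - 1)
    grid = [lo + i * step for i in range(n_points)]
    weights = []
    for t in grid:
        w = math.exp(-0.5 * t * t)  # unnormalised N(0,1) prior density
        for (a, betas), x in zip(items, responses):
            w *= grm_probs(t, a, betas)[x]
        weights.append(w)
    total = sum(weights)
    return sum(t * w for t, w in zip(grid, weights)) / total
```

For a respondent choosing the top category on every item, the likelihood alone keeps increasing with θ, but the prior pulls the posterior mean back to a finite, more realistic value.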

Fig. 4

Distribution of estimated theta-values (solid line) and of estimated beta parameters (dashed line); both estimated using the Graded Response Model

The information function (Fig. 5) shows that the item bank covers all relevant θ-values (>99 % of θ-values fall between -3 and +3). This figure depicts the measurement precision as a function of θ. An information value of 5 corresponds to a reliability of 0.8. The information function is the sum of the item information functions; each item gives most information close to its thresholds, and items with higher slopes give more information.
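These relations can be sketched as follows: GRM item information is Σ_k (dP_k/dθ)² / P_k (the derivative is taken numerically here), test information is the sum over items, and an information value I implies a reliability of 1 - 1/I. The item parameters in the test are hypothetical.

```python
import math

def grm_probs(theta, a, betas):
    """GRM category probabilities (differences of cumulative logistic curves)."""
    cum = ([1.0]
           + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in betas]
           + [0.0])
    return [cum[k] - cum[k + 1] for k in range(len(betas) + 1)]

def item_information(theta, a, betas, eps=1e-5):
    """Fisher information of one GRM item: sum_k (dP_k/dtheta)^2 / P_k,
    with the derivative approximated by central differences."""
    p = grm_probs(theta, a, betas)
    p_hi = grm_probs(theta + eps, a, betas)
    p_lo = grm_probs(theta - eps, a, betas)
    info = 0.0
    for k in range(len(p)):
        deriv = (p_hi[k] - p_lo[k]) / (2 * eps)
        info += deriv * deriv / p[k]
    return info

def reliability(information):
    """Reliability implied by a given information value: 1 - 1/I."""
    return 1.0 - 1.0 / information
```

Summing `item_information` over all items at each θ reproduces the test information function plotted in Fig. 5; substituting I = 5 into `reliability` gives the 0.8 quoted above.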

Fig. 5

Information Function for the full item bank (solid line) and the shortened item bank (dashed; problematic items removed)

Selecting items for the final item bank

As can be seen from Table 2, 17 out of 20 items flagged by the MSA had low (<1) or even negative α values. For three flagged items (39, 41, 48), no clear reason for misfit could be identified (acceptable item parameters, no obvious difference between GAM and GRM plots). These three items were therefore retained in the item bank. The GRM was estimated again after removal of the 17 problematic items. The resulting item parameters can be found in Additional file 4. This set of 46 items can be considered as the final item bank. Removing problematic items did not have a substantial effect on the information function (Fig. 5).


This paper describes the development of an item bank that measures disease-specific quality of life in patients with COPD: the COPD-SIB. We started out with 66 items (including SGRQ-C items) covering content described as highly relevant by patients, healthcare professionals, and the literature. These items were assessed using complementary psychometric techniques and data from 666 Dutch COPD patients. The final item bank contains 46 items that form a strong scale. This item bank can be used as a stand-alone instrument in its full-bank form or, better yet, as the basis for CAT.

Seven items stood out among the misfitting items: they had negative slope parameters and/or one or more U-shaped response curves. Negative slope parameters were found for four items (item 13: “Because of my COPD, I appreciate my social contacts (e.g., friends, partner, relatives) more”; item 16: “Since being diagnosed with COPD, I have lived more consciously”; item 53: “I could accept it, when I was not able to do something anymore, due to my COPD”; item 54: “I persevered until I had finished an activity, despite the fact that I couldn’t perform that activity well, due to my COPD”), while U-shaped response curves for one or more categories were found for four items (item 10: “I am confident I will be able to cope with my COPD, even if the complaints get worse”; item 19: “I am content with the things I can still do”; item 24: “I value my life just as much as I did before I was diagnosed with COPD”; and item 53). When comparing the content of these items to the other items in the bank, it is apparent that they are all worded in a positive way, whereas most items in the bank are not. Only one positively worded item showed good fit (item 57: “I got my breathing problems under control.”). The reason we included items with a more positive formulation was that several patients indicated that they felt it did not do their situation justice if the item bank consisted only of negative items. Patient quotes were used to inform the formulation of these items. Our results illustrate that it can be difficult to optimise content validity while simultaneously maintaining the same level of construct validity (under a given model); in this case, adding items to improve content validity resulted in multidimensionality. It has been previously suggested that including reverse-worded items in a questionnaire might affect reliability and aspects of validity [47, 48].
Patients may not notice that some items are formulated in a reversed way, or they might be confused by this reversal in meaning. As a result, measurement error may increase and/or an artificial second (method) factor, caused by response bias, may be found in dimensionality analyses [49]. To prevent response bias caused by inattention or confusion, it may be advisable to present positively and negatively worded items separately in a future study, as suggested by Roszkowski and Soven [50]. Another possibility would be to create separate item banks for positively and negatively worded items; PROMIS follows this strategy for a number of domains (e.g., [51, 52]). If these strategies do not solve the issue of U-shaped response curves, it may be worthwhile to re-analyse the data with a different IRT model that allows for peaked/dipped response curves (a so-called “unfolding model”) [53].

We developed 29 new items that were subjected to cognitive debriefing along with a selection of items from existing questionnaires. Initially, the answering categories provided for the newly developed items were dichotomous: agree/do not agree. A substantial number of patients indicated that they were unhappy with only having two options, and asked for Likert scales. We made adjustments accordingly, and decided to harmonise the answering categories of all items following PROMIS guidelines. Patients were happy with the 5-point Likert scales. Our findings illustrate, however, that this does not necessarily mean that patients will use the entire scale. The resulting data sparseness poses challenges when modelling the data. A widely used solution is to merge adjacent categories, which is what we did for a number of items. This solution is not popular with everyone; but since a very low cell count for certain item-category combinations leads to problematic parameter estimates (very high or low, with large standard errors), it is unavoidable in practice. In such cases, it may be advisable to use a model that is insensitive to differences in the number of categories per item after merging, such as the GRM used in this study. We suggest that this approach (providing patients with the response scale of their preference, subsequently merging categories prior to analysis, and finally using an appropriate model) is preferable to ignoring the tension between the patient perspective and psychometric considerations.


In the development of the COPD-SIB, the patient perspective has played a central role. The item bank contains items tapping into several topics described as highly relevant by patients and the literature. We used complementary psychometric techniques to evaluate the candidate items, and the final selection forms a strong unidimensional scale. The COPD-SIB is a promising candidate for measuring COPD-specific HRQoL in routine practice, especially when used to build a CAT (time efficient, while not compromising measurement precision). The COPD-SIB was developed using a large Dutch sample of COPD patients. The Dutch version of the item bank is ready for use, and available upon request (contact MP or JP). First steps toward cross-cultural validation are currently underway [9, 24].


  1. The total number of items equalled 211: 148 for the 3 generic domains and 63 for the COPD-SIB.

  2. An informal feasibility study indicated that patients felt that completing up to 100 items was acceptable (n = 4; data not shown).

  3. As mentioned previously, three items (items 6, 7 and 8) were excluded prior to analysis since they had >20 % missing values.


  1. Curtis JR, Deyo RA, Hudson LD. Health-related quality of life among patients with chronic obstructive pulmonary disease. Thorax. 1994;49:162–70.


  2. Global Initiative for Chronic Obstructive Lung Disease (GOLD). Global strategy for the diagnosis, management, and prevention of COPD. Retrieved from Accessed 24 May 2016.

  3. Zwerink M, Brusse-Keizer M, van der Valk PD, Zielhuis GA, Monninkhof EM, van der Palen J, Frith PA, Effing T. Self management for patients with chronic obstructive pulmonary disease. Cochrane Database Syst Rev. 2014;3:Cd002990.


  4. Weldam SWM, Schuurmans MJ, Liu R, Lammers J-WJ. Evaluation of Quality of Life instruments for use in COPD care and research: A systematic review. Int J Nurs Stud. 2013;50:688–707.


  5. McHorney CA. Generic health measurement: past accomplishments and a measurement paradigm for the 21st century. Ann Intern Med. 1997;127:743–50.


  6. Van der Linden WJ, Glas CAW. Computerized Adaptive Testing: Theory and Practice. Dordrecht: Kluwer Academic Publishers; 2000.


  7. Embretson SE, Reise S. Item response theory for psychologists. Mahwah, NJ: Erlbaum; 2000.


  8. Wainer H, Dorans NJ, Eignor D, Flaugher R, Green BF, Mislevy RJ, Steinberg L, Thissen D. Computerized adaptive testing: a primer. 2nd ed. Hillsdale, NJ: Lawrence Erlbaum Associates; 2000.


  9. Paap MCS, Bode C, Lenferink LIM, Groen LC, Terwee CB, Ahmed S, Eilayyan O, van der Palen J. Identifying key domains of health-related quality of life for patients with chronic obstructive pulmonary disease: the patient perspective. Health Qual Life Outcomes. 2014;12:106.


  10. Paap MCS, Bode C, Lange L, van der Palen J. Using the Three Step Test Interview to understand how the SGRQ-C is interpreted by COPD patients. Qual Life Res. 2014;23:170–170.

  11. Paap MCS, Bode C, Lenferink LIM, Terwee CB, van der Palen J. Identifying key domains of health-related quality of life for patients with chronic obstructive pulmonary disease: interviews with healthcare professionals. Qual Life Res. 2015;24:1351–67.

  12. Nikolaus S, Bode C, Taal E, van de Laar MA. Which dimensions of fatigue should be measured in patients with rheumatoid arthritis? A Delphi study. Musculoskeletal Care. 2012;10:13–7.

  13. Nikolaus S, Bode C, Taal E, Oostveen JC, Glas CA, van de Laar MA. Items and dimensions for the construction of a multidimensional computerized adaptive test to measure fatigue in patients with rheumatoid arthritis. J Clin Epidemiol. 2013;66:1175–83.

  14. Cella DF, Riley W, Stone A, Rothrock N, Reeve B, Yount S, Amtmann D, Bode R, Buysse D, Choi S, et al. The Patient-Reported Outcomes Measurement Information System (PROMIS) developed and tested its first wave of adult self-reported health outcome item banks: 2005-2008. J Clin Epidemiol. 2010;63:1179–94.

  15. Terwee CB, Roorda LD, de Vet HC, Dekker J, Westhovens R, van Leeuwen J, Cella D, Correia H, Arnold B, Perez B, Boers M. Dutch-Flemish translation of 17 item banks from the patient-reported outcomes measurement information system (PROMIS). Qual Life Res. 2014;23:1733–41.

  16. Paap MCS, Brouwer D, Glas CAW, Monninkhof EM, Forstreuter B, Pieterse ME, van der Palen J. The St George’s Respiratory Questionnaire revisited: a psychometric evaluation. Qual Life Res. 2015;24:67–79.

  17. Paap MCS, Lange L, van der Palen J, Bode C. Using the Three-Step Test Interview to understand how patients perceive the St. George’s Respiratory Questionnaire for COPD patients (SGRQ-C). Qual Life Res. 2015: advance online publication.

  18. Ninot G, Soyez F, Prefaut C. A short questionnaire for the assessment of quality of life in patients with chronic obstructive pulmonary disease: psychometric properties of VQ11. Health Qual Life Outcomes. 2013;11:179.

  19. Meguro M, Barley EA, Spencer S, Jones PW. Development and validation of an improved, COPD-specific version of the St. George Respiratory Questionnaire. Chest. 2007;132:456–63.

  20. Vidotto G, Carone M, Jones PW, Salini S, Bertolotti G. Maugeri Respiratory Failure questionnaire reduced form: a method for improving the questionnaire using the Rasch model. Disabil Rehabil. 2007;29:991–8.

  21. Maille AR, Koning CJ, Zwinderman AH, Willems LN, Dijkman JH, Kaptein AA. The development of the 'Quality-of-life for Respiratory Illness Questionnaire (QOL-RIQ)': a disease-specific quality-of-life questionnaire for patients with mild to moderate chronic non-specific lung disease. Respir Med. 1997;91:297–309.

  22. Jones PW, Harding G, Berry P, Wiklund I, Chen W-H, Kline Leidy N. Development and first validation of the COPD Assessment Test. Eur Respir J. 2009;34:648–54.

  23. Hak T, van der Veer K, Jansen H. The Three-Step Test-Interview (TSTI): An observational instrument for pretesting self-completion questionnaires. 2004.

  24. Paap MCS, Ahmed S, Eilayyan OJ, Terwee CB, Van der Palen J. Combining Disease-Relevant and Disease-Attributed measures to assess HRQOL in patients with COPD in the Netherlands and Canada. Qual Life Res. 2014;23:151–151.

  25. Kolen MJ, Brennan RL. Test equating, scaling, and linking: Methods and practices. 3rd ed. New York, NY: Springer; 2014.

  26. Samejima F. Estimation of Latent Ability Using a Response Pattern of Graded Scores (Psychometric Monograph No. 17). Richmond, VA: Psychometric Society; 1969.

  27. Samejima F. The graded response model. In: van der Linden WJ, Hambleton RK, editors. Handbook of modern item response theory. New York: Springer; 1996. p. 85–100.

  28. Kang T, Chen T. Performance of the generalized S-X2 item fit index for the graded response model. Asia Pacific Education Review. 2011;12:89–96.

  29. Mokken RJ. A theory and procedure of scale analysis. The Hague: Mouton; 1971.

  30. Sijtsma K, Molenaar IW. Introduction to Nonparametric Item Response Theory. Thousand Oaks: Sage Publications; 2002.

  31. Hastie T, Tibshirani R. Generalized Additive Models. Stat Sci. 1986;1:297–318.

  32. Wood SN. Generalized Additive Models: An introduction with R. Boca Raton, Florida: Chapman and Hall/CRC; 2006.

  33. R Development Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2012.

  34. van der Ark LA. Mokken scale analysis in R. J Stat Softw. 2007;20:1–19.

  35. Sijtsma K, Emons WH, Bouwmeester S, Nyklicek I, Roorda LD. Nonparametric IRT analysis of Quality-of-Life Scales and its application to the World Health Organization Quality-of-Life Scale (WHOQOL-Bref). Qual Life Res. 2008;17:275–90.

  36. Stochl J, Boomsma A, van Duijn M, Brozova H, Ruzicka E. Mokken scale analysis of the UPDRS: dimensionality of the Motor Section revisited. Neuro Endocrinol Lett. 2008;29:151–8.

  37. Emons WHM, Sijtsma K, Pedersen SS. Dimensionality of the Hospital Anxiety and Depression Scale (HADS) in Cardiac Patients: Comparison of Mokken Scale Analysis and Factor Analysis. Assessment. 2010.

  38. Sousa RM, Dewey ME, Acosta D, Jotheeswaran AT, Castro-Costa E, Ferri CP, Guerra M, Huang Y, Jacob KS, Rodriguez Pichardo JG, et al. Measuring disability across cultures--the psychometric properties of the WHODAS II in older people from seven low- and middle-income countries. The 10/66 Dementia Research Group population-based survey. Int J Methods Psychiatr Res. 2010;19:1–17.

  39. Roorda LD, Green JR, Houwink A, Bagley PJ, Smith J, Molenaar IW, Geurts AC. Item hierarchy-based analysis of the Rivermead Mobility Index resulted in improved interpretation and enabled faster scoring in patients undergoing rehabilitation after stroke. Arch Phys Med Rehabil. 2012;93:1091–6.

  40. Watson R, van der Ark LA, Lin LC, Fieo R, Deary IJ, Meijer RR. Item response theory: how Mokken scaling can be used in clinical practice. J Clin Nurs. 2012;21:2736–46.

  41. Sütterlin S, Paap MC, Babic S, Kubler A, Vogele C. Rumination and age: some things get better. J Aging Res. 2012;2012:267327.

  42. van den Berg SM, Paap MCS, Derks EM. Using multidimensional modeling to combine self-report symptoms with clinical judgment of schizotypy. Psychiatry Res. 2013;206:75–80.

  43. Straat JH, van der Ark LA, Sijtsma K. Comparing Optimization Algorithms for Item Selection in Mokken Scale Analysis. J Classif. 2013;30(1):75–99. doi:10.1007/s00357-013-9122-y.

  44. Chalmers RP. mirt: A Multidimensional Item Response Theory Package for the R Environment. J Stat Softw. 2012;48:1–29.

  45. Bernaards CA, Sijtsma K. Influence of simple imputation and EM methods on factor analysis when item nonresponse in questionnaire data is nonignorable. Multivar Behav Res. 2000;35:321–64.

  46. van Ginkel JR, van der Ark LA. SPSS syntax for missing value imputation in test and questionnaire data. Appl Psychol Meas. 2005;29:152–3.

  47. Schriesheim CA, Hill KD. Controlling acquiescence response bias by item reversals: The effect on questionnaire validity. Educ Psychol Meas. 1981;41:1101–14.

  48. Schriesheim CA, Eisenbach RJ, Hill KD. The Effect of Negation and Polar Opposite Item Reversals on Questionnaire Reliability and Validity: An Experimental Investigation. Educ Psychol Meas. 1991;51:67–78.

  49. van Sonderen E, Sanderman R, Coyne JC. Ineffectiveness of reverse wording of questionnaire items: let's learn from cows in the rain. PLoS One. 2013;8:e68967.

  50. Roszkowski MJ, Soven M. Shifting gears: consequences of including two negatively worded items in the middle of a positively worded questionnaire. Assessment Eval High Educ. 2009;35:113–30.

  51. Lai J-S, Garcia S, Salsman J, Rosenbloom S, Cella D. The psychosocial impact of cancer: evidence in support of independent general positive and negative components. Qual Life Res. 2012;21:195–207.

  52. Salsman JM, Garcia SF, Lai JS, Cella D. Have a little faith: measuring the impact of illness on positive and negative aspects of faith. Psychooncology. 2012;21:1357–61.

  53. Post WJ, van Duijn MAJ, van Baarsen B. Single-Peaked or Monotone Tracelines? On the Choice of an IRT Model for Scaling Data. In: Boomsma A, van Duijn MJ, Snijders TB, editors. Essays on Item Response Theory, vol. 157. New York: Springer; 2001. p. 391–414. Lecture Notes in Statistics.

Acknowledgements


We thank Professor Paul Jones, Professor Giulio Vidotto, Professor Grégory Ninot, and Rianne Maillé for granting us permission to include modified versions of items from the SGRQ-C, MRF-26, VQ11, and QoL-RIQ, respectively. We thank all patients who participated in this study for their valuable input. Finally, we thank Mitzi Paap, Bachelor of Arts in English Language and Culture, for translating the items from Dutch into English.


Funding

This study was supported by grant # from Lung Foundation Netherlands.

Authors’ contributions

MP and JP designed the study. LL and NH collected the data with the help of JP; MP, KK, NH and LL analysed and interpreted the data. JP critically reviewed all aspects of data collection, analysis and interpretation. MP wrote the manuscript with contributions from LL, NH, KK and JP. All authors approved the final version of the manuscript.

Competing interests

The authors declare that they have no competing interests.

Consent for publication

All patients gave informed consent.

Ethics approval and consent to participate

The ethical review board of the University of Twente approved the study. According to European regulations, this study did not require approval from the Medical Ethical Review Board. All procedures performed in studies involving human participants were in accordance with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.

Author information


Corresponding author

Correspondence to Muirne C. S. Paap.

Additional files

Additional file 1:

Detailed explanation of Three Step Test Interview (TSTI) along with example probes. (PDF 92 kb)

Additional file 2:

Detailed description of the samples used in this study. (PDF 121 kb)

Additional file 3:

R code used to collapse response categories. (PDF 54 kb)

Additional file 4:

Calibration results of the final version of the COPD-SIB item bank. (PDF 294 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Paap, M.C.S., Lenferink, L.I.M., Herzog, N. et al. The COPD-SIB: a newly developed disease-specific item bank to measure health-related quality of life in patients with chronic obstructive pulmonary disease. Health Qual Life Outcomes 14, 97 (2016).


Keywords

  • Item response theory
  • IRT
  • Patient perspective
  • Item bank
  • COPD
  • SGRQ-C
  • MRF-26
  • VQ11
  • QoL-RIQ