A scoping review to create a framework for the steps in developing condition-specific preference-based instruments de novo or from an existing non-preference-based instrument: use of item response theory or Rasch analysis

Background There is no widely accepted framework to guide the development of condition-specific preference-based instruments (CSPBIs) that includes both de novo development and development from existing non-preference-based instruments. The purpose of this study was to address this gap by reviewing the published literature on CSPBIs, with particular attention to the application of item response theory (IRT) and Rasch analysis in their development. Methods A scoping review of the literature covering the concepts of all phases of CSPBI development and evaluation was performed in MEDLINE, Embase, PsycInfo, CINAHL, and the Cochrane Library, from inception to December 30, 2022. Results The titles and abstracts of 1,967 unique references were reviewed. After retrieving and reviewing 154 full-text articles, data were extracted from 109 articles, representing 41 CSPBIs covering 21 diseases or conditions. The development of CSPBIs was conceptualized as a 15-step framework, covering four phases: 1) develop initial questionnaire items (when no suitable non-preference-based instrument exists); 2) establish the dimensional structure; 3) reduce items per dimension; 4) value and model health state utilities. Thirty-nine instruments used a type of Rasch model and two instruments used IRT models in phase 3. Conclusion We present an expanded framework that outlines the development of CSPBIs, both from existing non-preference-based instruments and de novo when no suitable non-preference-based instrument exists, using IRT and Rasch analysis. For items that fit the Rasch model, developers selected one item per dimension and explored item response level reduction. This framework will guide researchers who are developing or assessing CSPBIs. Supplementary Information The online version contains supplementary material available at 10.1186/s12955-024-02253-y.


Introduction
Condition-specific preference-based instruments (CSPBIs) measure health-related quality of life (HRQoL) relevant to patients with a specific condition or disease. In contrast, generic preference-based instruments, such as the EQ-5D family of questionnaires [1], are suitable for general use [1-3]. Preference-based instruments contain a classification system with items representing attributes, and levels within items, which, with a value set, produce a utility score anchored at zero (dead) and one (perfect health). Values are derived from patients or members of the general public who provide utilities for health states using direct methods, including time trade-off (TTO) [3,4] or discrete choice experiments (DCE) [3,4]. Utility is used to calculate the quality-adjusted life year (QALY), a key outcome in economic evaluations and clinical decision-making.
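The relationship between a TTO response, a utility weight, and a QALY can be sketched as follows. This is an illustrative example with hypothetical numbers, not data from any instrument discussed here; it uses the textbook formula for a chronic state considered better than dead.

```python
def tto_utility(years_full_health: float, years_in_state: float) -> float:
    """TTO utility for a chronic state better than dead: the ratio at the
    respondent's indifference point between x years in full health and
    t years in the health state."""
    if years_in_state <= 0:
        raise ValueError("time in health state must be positive")
    return years_full_health / years_in_state

def qalys(utility: float, years: float) -> float:
    """Quality-adjusted life years: utility weight multiplied by time lived."""
    return utility * years

# A respondent indifferent between 7 years in full health and 10 years in
# the health state implies a utility of 0.7, so 10 years in that state
# yields 7 QALYs.
u = tto_utility(7, 10)
print(u)
print(qalys(u, 10))
```

The same utility weight, applied to time spent in the state, is what enters the cost-effectiveness analyses mentioned above.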
Several systematic reviews have included aspects of CSPBI development, including one that found 51 different CSPBIs [5,10-13]. Brazier et al. [5] described six stages of preference-based instrument development starting from an existing condition-specific non-preference-based instrument, such as the Functional Assessment of Cancer Therapy - General (FACT-G) scale [14] or the European Organization for Research and Treatment of Cancer QLQ-C30 [15] in oncology. The stages are: I) establish dimensionality; II) select items for each dimension; III) test the number of levels; IV) validate the health state classification system; V) conduct a valuation survey; and VI) model the valuation data. When there is no established condition-specific non-preference-based instrument, the development of a CSPBI begins with creating a classification system of domains de novo [13,16].
Factor analysis (confirmatory or exploratory) is used to establish dimensions. Item response theory (IRT) or Rasch analysis can be used to eliminate items and select one or two items to represent each dimension [5]. IRT is a measurement approach that models the probabilistic relationship between items and a latent construct (e.g., HRQoL) [17]. The Rasch model is the simplest IRT model [18]. When items fit the Rasch model, the instrument has favourable properties: unidimensionality, interval-level scoring, additivity, and sample-free measurement [19]. Instruments developed with Rasch or IRT methods achieve high precision and efficiency by selecting the fewest items needed to cover the latent construct [19,20]. Health states are then sampled and modelled using a decomposed or composite approach [5]. While these stages provide a starting point for the development of novel CSPBIs, the methods described by Brazier et al. begin with an existing condition-specific non-preference-based instrument, and the stages lack sufficient detail for novice CSPBI developers to follow. Additionally, when there is no suitable condition-specific non-preference-based instrument, developing a novel CSPBI de novo is the best option. The initial steps of creating a non-preference-based instrument de novo have been described by Guyatt et al. [16], yet these steps were absent from the Brazier et al. stages [5].
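The dichotomous Rasch model underlying these properties can be written in a few lines. This is the standard textbook form, not code from any of the reviewed articles: the probability of endorsing an item depends only on the difference between person ability (theta) and item difficulty (b), which is the source of the "sample-free measurement" property noted above.

```python
import math

def rasch_probability(theta: float, b: float) -> float:
    """P(X = 1 | theta, b) = exp(theta - b) / (1 + exp(theta - b))."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# When ability equals difficulty, the endorsement probability is exactly 0.5,
# and it increases monotonically with ability.
print(rasch_probability(0.0, 0.0))
print(rasch_probability(2.0, 0.0) > rasch_probability(1.0, 0.0))
```

More general IRT models (e.g., the graded response model used later in this review) add discrimination and threshold parameters to this same logistic core.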
The aim of this scoping review is to address these gaps by operationalizing Brazier et al.'s stages based on the available literature, and by adding the initial steps needed to develop a preference-based instrument de novo when there is no existing HRQoL instrument. Our focus was the use of Rasch and IRT methods to establish dimensions, reduce items per dimension, and reduce item levels, because the resulting instruments have favourable properties. These steps underpin the creation of a multi-attribute health state classification system for a novel preference-based instrument. Our objectives were to:
1. Identify the steps in constructing CSPBIs, both de novo and from an existing non-preference-based instrument.
2. Describe the application of Rasch or IRT methods within these steps.
3. Develop an expanded framework to guide future development of CSPBIs.

Information sources
We followed the Joanna Briggs Institute (JBI) published guidance document [21,22] and the Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) reporting guideline (Supplementary Information, S1) [23]. Our scoping review protocol was not published. Searches were performed in Ovid MEDLINE, Ovid Embase, Ovid PsycInfo, EBSCO CINAHL, and the Cochrane Library from inception to December 2022 (Supplementary Information, S2). An experienced health sciences librarian (JB) and TT developed a search strategy (Supplementary Information, S2) using Medical Subject Headings (MeSH) and keywords covering:
1. Measurement of condition-specific HRQoL
2. Eliciting health state utility values to develop a preference-based instrument
3. Methods to develop instruments measuring HRQoL
4. IRT, including Rasch analysis

The search strategy was reviewed by a second librarian, following the Peer Review of Electronic Search Strategies (PRESS) standard [24].

Selection of articles
Search results were imported into Thomson Reuters EndNote X9.3.3 to remove duplicates.
A primary (TT) and secondary (ST) reviewer independently screened titles and abstracts, followed by full-text articles, using Covidence [25]. We excluded abstracts, commentaries, editorials, letters, and non-English articles. Articles were also excluded if they predicted utilities from only demographics or other non-disease factors, or validated non-English instruments, since these do not describe the development of the instruments.
We included articles that described either the development of a CSPBI using IRT or Rasch analysis, or the elicitation of utility weights for the instrument. The instruments described had the following measurement purposes: 1) to discriminate between known disease states, or 2) to measure responsiveness after treatment and over time.
We also hand-searched the reference list of Goodwin's systematic review [13] for the names of instruments. Additional searches were performed using individual instrument names in PubMed from inception to February 2024 (Supplementary Information, S3). We chose the review by Goodwin and Green because it included all steps of CSPBI development and was the most recent and most comprehensive of the review papers we found.
Inter-reviewer reliability was assessed using a kappa statistic, with cut-off scores of 0.40-0.59 for fair agreement, 0.60-0.74 for good agreement, and 0.75 and higher for excellent agreement [26]. Discrepancies in interpreting eligibility criteria were discussed, and the criteria were revised for clarity if inter-reviewer reliability was below good [26].
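Cohen's kappa for two screeners' include/exclude decisions can be sketched as follows. The ratings below are made up for illustration; they are not the review's actual screening data.

```python
def cohens_kappa(r1, r2):
    """Kappa = (p_observed - p_expected) / (1 - p_expected) for two raters
    rating the same items (coded 1 = include, 0 = exclude)."""
    n = len(r1)
    po = sum(a == b for a, b in zip(r1, r2)) / n
    # Chance agreement from each rater's marginal "include" rates.
    p1a, p1b = sum(r1) / n, sum(r2) / n
    pe = p1a * p1b + (1 - p1a) * (1 - p1b)
    return (po - pe) / (1 - pe)

rater_a = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
rater_b = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
# 8/10 observed agreement with 0.5 chance agreement gives kappa = 0.6,
# "good" by the cut-offs quoted above.
print(cohens_kappa(rater_a, rater_b))
```

Kappa corrects raw percent agreement for the agreement expected by chance, which is why it is preferred over simple agreement when marginal rates are unbalanced.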

Data extraction
The steps of instrument development were extracted from full-text articles. The data extraction form (Supplementary Information, S4) was pilot tested on 10 articles that covered all instrument development phases and was iteratively revised until it captured all essential steps. One reviewer (TT) extracted the data from all articles and a second reviewer (ST) checked the data against all articles. Discrepancies were resolved by discussion.

Constructing the framework
We started with Brazier et al.'s six stages outlining how to derive CSPBIs from existing psychometric instruments [5]. Next, we reviewed existing frameworks for the development of classification systems of domains for non-preference-based instruments [16,27], and for the use of factor analysis [28,29] and Rasch analysis [19]. Finally, we reviewed articles describing the development of CSPBIs to identify the key steps.

Study selection
Figure 1 shows the PRISMA diagram. After removal of duplicates, the titles and abstracts of 1,967 references were reviewed, and 71 additional references were identified from hand-searching Goodwin's systematic review [13]. One hundred and fifty-four full-text articles were retrieved and reviewed. Data were extracted from 109 articles representing 41 unique instruments and 21 diseases/conditions. Inter-rater agreement was fair (kappa = 0.57) during title and abstract screening, and good (kappa = 0.71) during full-text selection.

Phases and steps to developing CSPBIs
Figure 2 shows the framework of the four phases and 15 steps of CSPBI development.

Phase I (Steps 1-3): Conceptualize measurement construct and develop initial items
These three initial steps were conducted for the 7 instruments developed de novo (Table 2). These steps are only required when developing a CSPBI de novo and are therefore absent from Brazier's stages, which start with an existing non-preference-based instrument. The data to gather for phase I are the relevant literature on frameworks and existing items, and results from patient interviews or focus groups.
Step 1. Determine a priori conceptual framework. A conceptual framework defines the construct to be measured. The purpose of starting with a conceptual framework with defined core dimensions is to ensure that measurement of the construct is comprehensive and has established boundaries [132,133]. Three instruments were developed with condition-specific conceptual frameworks (DUI, WAITe, VisQoL) [78,108,118].
Step 2. Generate initial items. The purpose of generating an initial comprehensive pool of items is to cover the entire construct to be measured [20]. Items representing the conceptual framework of the descriptive system were generated using literature reviews (WAITe) [108], chart reviews, or other existing HRQoL instruments [31,78,103,117,125,134,135]. Patient and clinician experts were consulted in interviews (WAITe) [108] and focus groups (MHOM RA, VisQoL) [42,117,118] (Table 2), which incorporate patient perspectives [136].
Step 3. Initial item reduction. The purpose of initial item reduction is to ensure alignment of the items with an a priori framework of HRQoL [103] (Table 2) and to remove redundant items [20]. Developers field-tested the VisQoL in people with and without vision impairment [118,126]. Developers reduced items after consultation with patients, carers, and/or clinicians (MHOM RA, WAITe, VisQoL) [108,117,118], including through framework analysis (WAITe) [108]. Developers of the PBI-WRQL [111] and PB-HIV [131] removed correlated dimensions (r ≥ 0.3) and mapped initial items to an established framework to establish the instrument dimensions (Table 2).

Phase II (Steps 4-8): Establishing dimension structure
Factor and principal component analyses (PCA) are data aggregation techniques that explain the pattern of correlations between items and latent constructs, such as HRQoL dimensions [28] (Table 3). Phase II overlaps with Brazier's stage I (Fig. 2). The intent of establishing the dimensional structure is to assess structural independence, meaning there is low correlation between dimensions [137]. The data to collect for phase II are responses to the questionnaire.
Step 4. Assess factorability of items. The factorability of items indicates whether it is feasible to proceed with factor analysis [28]. Coefficients of 0.3 to 0.8 in a correlation matrix [31,108,112] or > 0.70 for Cronbach's alpha [81,118] are criteria for factorability. If performing PCA or exploratory factor analysis (EFA), developers also assessed the Bartlett test of sphericity and the Kaiser-Meyer-Olkin measure of sampling adequacy [31,66,87,108,116,141,142] (Table 3).
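The Bartlett test of sphericity can be sketched with its standard formula (this is the generic statistic, not code from any reviewed article): it compares the correlation matrix against the identity, so a chi-square of zero means the items share no variance and factoring is pointless. The correlation matrices below are invented for illustration.

```python
import math

def det3(m):
    """Determinant of a 3x3 matrix, expanded along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def bartlett_sphericity(corr, n):
    """Bartlett's chi-square = -(n - 1 - (2p + 5)/6) * ln|R|,
    with p(p - 1)/2 degrees of freedom, for sample size n."""
    p = len(corr)
    chi2 = -(n - 1 - (2 * p + 5) / 6) * math.log(det3(corr))
    df = p * (p - 1) // 2
    return chi2, df

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
correlated = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
print(bartlett_sphericity(identity, 100)[0] == 0.0)   # no shared variance
chi2, df = bartlett_sphericity(correlated, 100)
print(chi2 > 0)  # correlations present: factoring may proceed
```

A significant chi-square (against the chi-square distribution with df degrees of freedom) is the usual criterion for proceeding to PCA or EFA.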
Step 6. Select the number of factors to retain. If there was no hypothesized dimensional structure, developers had to decide on the number of factors to retain to best represent the underlying structure of the dataset [28]. In PCA and EFA, developers considered the amount of variance explained by the eigenvalues [33,45,80,87] or visualized in a scree plot [69,87,88,107,141], or they performed parallel analysis to interpret the scree plots more objectively [87,108,116,141] (Table 3).
Step 8. Evaluation of model fit. The purpose of evaluating model fit is to assess whether the model needs revision to fit the data. Developers evaluated global model fit using the root mean square error of approximation (RMSEA) and standardized root mean square residual (SRMR) (< 0.08 acceptable, < 0.05 good), and the comparative fit index (CFI) and Tucker-Lewis index (TLI) (> 0.9 acceptable, > 0.95 good) [49,64,75,94,118,134]. Developers evaluated factor loadings (> 0.3 or > 0.4) to ensure each item loaded sufficiently on its factor. In PCA and EFA, developers considered cross-loading differences (< 0.15 or < 0.2) to assign an item to the dimension with the higher loading (ABC-UI, AQL-5D, EORTC-8D, QLU-C10D, DUI) [31,33,45,78,141]. If model fit was inadequate using any data aggregation approach, developers re-inspected factor loadings and applied residual correlations to improve overall global fit (e.g., QLU-C10D, BUI) [49,64]. Developers of the DMD-QoL found poor initial fit using CFA, but fit improved in a 3-dimensional hierarchical model using EFA [134] (Table 3).
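The fit cut-offs quoted above can be encoded in a small helper, useful when screening many candidate models. The index values in the example call are invented for illustration.

```python
def classify_fit(rmsea, srmr, cfi, tli):
    """Classify each global fit index against the conventional cut-offs:
    RMSEA/SRMR < 0.08 acceptable, < 0.05 good; CFI/TLI > 0.90 acceptable,
    > 0.95 good."""
    def lower_better(x):
        return "good" if x < 0.05 else "acceptable" if x < 0.08 else "poor"
    def higher_better(x):
        return "good" if x > 0.95 else "acceptable" if x > 0.90 else "poor"
    return {"RMSEA": lower_better(rmsea), "SRMR": lower_better(srmr),
            "CFI": higher_better(cfi), "TLI": higher_better(tli)}

print(classify_fit(rmsea=0.045, srmr=0.06, cfi=0.96, tli=0.93))
# {'RMSEA': 'good', 'SRMR': 'acceptable', 'CFI': 'good', 'TLI': 'acceptable'}
```

A mixed verdict like this one (some indices good, some only acceptable) is the situation in which developers re-inspected loadings or added residual correlations, as described above.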

Phase III (Steps 9-12): Reducing items per dimension
Together with Phase II, the purpose of reducing items per dimension in Phase III is to create a preference-based instrument that is amenable to valuation [3]. The data required for Phase III are responses to the questionnaire, which can be the same dataset used in Phase II.
Step 9. Fit Rasch or IRT model. Rasch and IRT models have different purposes, originating from two diverging traditions. Rasch models belong to a model-based tradition: the model is selected first, and tests are designed to determine whether the data fit the model. Proponents of the Rasch model posit that item responses must fit the structure of the Rasch model before they can be used for measurement [143]. In the alternative data-based tradition, different models within the IRT family are explored to find the best-fitting model for the available data [144]. Thirty-nine of 41 instruments fit the data to a Rasch model: six used a Rasch rating scale model, nine used the Rasch partial credit model, and 24 used an unspecified polytomous model. Two instruments fit an IRT graded response model (GRM) (ReQoL-UI, NQU) [94,101] (Table 4).
Aligned with Brazier's stage III (explore item level reduction) [5] (Fig. 2), CSPBI developers who conducted Rasch analysis first evaluated item response ordering to collapse disordered categories, or removed items with disordered response options, and re-ran the model. Sometimes developers asked experts to review the language of merged categories for clarity and comprehensiveness (face validity) [84,88,116] (Table 4).
Developers who used Rasch analysis then evaluated model fit, item fit, and person fit. Global model fit was assessed with an item-trait interaction χ² (non-significant, with Bonferroni correction) and/or the person separation index, which is similar to Cronbach's alpha or reliability (> 0.7, or > 0.8) [27,33,61,70,84,88,103,108,116,117,125,145]. Many developers reported item and then person fit statistics [31,40,78]. Mean item fit residuals and mean person fit residuals, measures of divergence between expected and observed responses for items or persons, respectively, were evaluated; residuals > 2.5 or < -2.5 represent poor fit [27,33,61,70,84,88,103,108,116,125]. Additional chi-square statistics were used to compare observed vs expected responses for items with a severity level near the person's HRQoL level (infit) or for all items (outfit) [66,78,83,95,142], where a significant chi-square means an item misfits the model [19] (Table 4).
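The infit and outfit statistics can be sketched in their mean-square form for a dichotomous Rasch model. The abilities, difficulty, and response strings below are invented; the point of the example is that a response pattern contradicting person ability inflates the fit statistics.

```python
import math

def rasch_p(theta, b):
    """Dichotomous Rasch endorsement probability."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def fit_mean_squares(thetas, b, responses):
    """Outfit (unweighted) and infit (information-weighted) mean squares
    for one item across persons."""
    ps = [rasch_p(t, b) for t in thetas]
    sq = [(x - p) ** 2 for x, p in zip(responses, ps)]
    var = [p * (1 - p) for p in ps]
    outfit = sum(s / v for s, v in zip(sq, var)) / len(sq)
    infit = sum(sq) / sum(var)
    return outfit, infit

thetas = [-2.0, -1.0, 1.0, 2.0]
expected = [0, 0, 1, 1]    # responses in line with ability: good fit
reversed_ = [1, 1, 0, 0]   # maximally surprising responses: misfit
out_ok, _ = fit_mean_squares(thetas, 0.0, expected)
out_bad, _ = fit_mean_squares(thetas, 0.0, reversed_)
print(out_bad > out_ok)  # True: the misfitting pattern has larger outfit
```

Outfit is sensitive to surprising responses from persons far from the item's difficulty, while infit down-weights them, which is why both are usually reported together.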
Lastly, some developers tested unidimensionality of the instrument by performing PCA of the item residuals after fitting the Rasch model; the associations between item residuals should be random. The developers of the DUI assessed the percentage of variance attributable to the Rasch factor and to the first residual factor to assess unidimensionality [65,78]. Independent t-tests of person score residuals for items that loaded positively (> 0.30) or negatively (< -0.30) were sometimes performed. If the items in the instrument are strictly unidimensional, the percentage of significant tests should be < 5% (POS-E, P-PBMSI, DEMQOL-U, BUI) [65,70,103,116]. This can also be expressed as a confidence interval from a binomial test of proportions for the significant tests (CORE-6D) [88] (Table 4).
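The binomial confidence interval on the proportion of significant t-tests can be sketched with the usual normal approximation. The counts below are hypothetical, not from any reviewed instrument.

```python
import math

def proportion_ci(significant, total, z=1.96):
    """Normal-approximation 95% CI for the proportion of significant
    paired t-tests among person-residual comparisons."""
    p = significant / total
    half = z * math.sqrt(p * (1 - p) / total)
    return p - half, p + half

# 12 of 300 tests significant (4%): does the CI overlap the 5% criterion?
lo, hi = proportion_ci(significant=12, total=300)
print(lo < 0.05)  # True: consistent with strict unidimensionality
```

If the whole interval sat above 5%, the instrument would likely be multidimensional and the residual structure would need further inspection.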
Developers of the ReQoL and the NQU scoring system fit the GRM, an IRT model. The model fit of the ReQoL was evaluated with the sum-score-based item fit statistic (S-χ²) [145]. The item information function was calculated to identify the score range where each item provided the most information; the higher the discrimination parameter, the more information an item provides. Test information of the total item pool was calculated, along with the range where measurement precision was > 0.9 [101,145] (Table 4).
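The GRM's category probabilities can be sketched as differences of cumulative logistic curves. The discrimination and threshold parameters here are hypothetical, not estimates from the ReQoL or NQU.

```python
import math

def grm_category_probs(theta, a, thresholds):
    """Graded response model: cumulative P*(k) = logistic(a * (theta - b_k))
    for ordered thresholds b_1 < ... < b_m; category probabilities are
    successive differences. Returns P(response = k) for k = 0..m."""
    cum = [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    bounded = [1.0] + cum + [0.0]
    return [bounded[k] - bounded[k + 1] for k in range(len(bounded) - 1)]

probs = grm_category_probs(theta=0.5, a=1.5, thresholds=[-1.0, 0.0, 1.0])
print(abs(sum(probs) - 1.0) < 1e-12)  # True: probabilities sum to one
print(all(p >= 0 for p in probs))     # True when thresholds are ordered
```

A larger discrimination parameter a steepens every cumulative curve, concentrating the item's Fisher information near its thresholds, which is the property developers used to choose where each item measures best.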
Step 10. Select items per dimension. The purpose of selecting a small number of items per dimension is so that the health states from the eventual preference-based instrument are amenable to valuation [136]. This step overlaps with Brazier's stage II [5]. Developers used clinimetric and psychometric criteria to select items, whether fitting a Rasch or IRT model. If items fit the Rasch model, most developers selected one item per dimension based on Rasch analysis criteria, conventional psychometrics, and item importance. Developers of the DMD-QoL-8D selected two items for each underlying factor [83]. Representative items for a dimension spanned a range of condition severity (AQL-5D, MSIS-8D, DMD-QoL-8D) [27,33,83] (Table 4). Developers retained items with a high correlation between the item and its dimension score (AQL-5D, DMD-QoL-8D) [33,83], that could adequately discriminate (e.g., QLU-C10D and FACT-8D: early vs late stage cancer) [49,62], or that had high responsiveness (e.g., OAB-5D and FACT-8D: standardized response mean between baseline and a specific time on treatment) [42,62]. Conventional psychometric criteria were applied to exclude items with a high proportion of missing data (VFQ-UI, ABC-UI, DMD-QoL-8D) [31,83,126] or high floor and ceiling effects (VFQ-UI, CARIES-QC-U, DMD-QoL-8D) [83,112,126]. Some developers included item importance and impact ratings from experts to guide item selection (ABC-UI, QLU-C10D) [31,49], or combinations of patient and health care provider perspectives (Table 4).
For the two instruments that used a graded response IRT model for item selection, developers chose items maximizing coverage of the construct, or selected two items per dimension from their item bank (Neuro-QoL) [101] (Table 4). Items with high Fisher information contribute to higher measurement precision (ReQoL) [94].
Step 11. Model validation. The purpose of model validation is to evaluate whether the fitted model measures what it is intended to measure [20]. Aligned with Brazier's stage IV [5] (Fig. 2), some developers validated the factor analysis or Rasch analysis using another dataset or a split half of the original dataset [27,33,43,78,84,88,116,118]. Developers incorporated the perspectives of patients, clinicians, or researchers (e.g., importance ratings, interviews) to validate the meaningfulness of the resulting factors [45,49,62,75,78,80,84,112]. Other developers checked that the resulting classification system had a dimensional structure aligned with the parent psychometric instrument [45,49,97] (Table 4).
Step 12. Evaluate measurement properties and interpretability. Measurement properties (reliability, validity, and responsiveness) of a novel instrument are assessed before it is used to ensure that it consistently measures what it is intended to measure, including changes in health [146]. Interpretability is the ability to assign qualitative meaning to quantitative scores [146].
The minimal important difference (MID) was assessed for the OAB-5D and compared with that of the EQ-5D-5L [9]. Both anchor- and distribution-based methods were used to determine the MID of the DEMQOL-U [73] (Table 5).
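Two common distribution-based MID benchmarks can be sketched as follows. These are the generic half-SD and standard-error-of-measurement formulas, not the DEMQOL-U analysis itself, and the inputs are hypothetical.

```python
import math

def half_sd_mid(sd):
    """Half a standard deviation of baseline scores: a common
    distribution-based MID benchmark."""
    return 0.5 * sd

def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability); one SEM
    is another common distribution-based MID benchmark."""
    return sd * math.sqrt(1.0 - reliability)

# For a scale with baseline SD = 10 and reliability 0.84:
print(half_sd_mid(10.0))          # 5.0
print(round(sem(10.0, 0.84), 6))  # ~4.0
```

Anchor-based methods instead tie the MID to an external criterion (e.g., a patient global rating of change); developers typically triangulate across both approaches.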

Phase IV (Steps 13-15): Valuation and modeling of health state utilities
Table 6 outlines these steps, which are aligned with Brazier's stages V and VI [5]. The data required for phase IV are utility weights.
Step 13. Elicit health state utility values. The purpose of eliciting utility values is to develop a set of utility weights to assign to the health states derived from the instrument [149,150]. Individuals providing utility weights were patients, members of the general public, or carers. Twenty-five CSPBIs elicited utilities from the general public, the most common group, whereas 13 CSPBIs elicited utilities from patients. Patients produced significantly higher utility values than the general public when assessed for the same instrument (e.g., cognition in MS) (MSIS-8D) [98]. Health states must be selected for valuation; the most common method was an orthogonal design, in which each dimension level has an equal chance of being combined with all other dimension levels in the instrument (15 instruments) (Table 6). Direct utilities were elicited using cardinal (e.g., TTO, SG) or ordinal (e.g., DCE) methods, most frequently TTO (21 instruments) (Table 6).
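A quick enumeration shows why sampling designs such as orthogonal arrays are needed rather than valuing every state. The 5-dimension, 5-level shape below is hypothetical (it mirrors the EQ-5D-5L structure mentioned earlier, not any specific CSPBI in this review).

```python
from itertools import product

# A classification system with 5 dimensions of 5 levels each defines
# 5**5 distinct health states -- far too many to value directly, so a
# small balanced subset is selected for TTO or DCE tasks.
dimensions, levels = 5, 5
all_states = list(product(range(1, levels + 1), repeat=dimensions))
print(len(all_states))   # 3125 health states
print(all_states[0])     # (1, 1, 1, 1, 1): best level on every dimension
print(all_states[-1])    # (5, 5, 5, 5, 5): worst level on every dimension
```

An orthogonal design samples from this full factorial so that each level of each dimension appears equally often in combination with the levels of the others, which is the balance property described above.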
Step 15. Evaluate utility function. Developers used various criteria to evaluate the utility function used to score the CSPBI. In our scoping review, the utility function was evaluated based on regression model coefficients, for statistical significance [31,43,45,80,83,89] and for consistency with the descriptive system [31,45,80,83]. For example, individuals with poor health were expected to have lower utility values than people with good health.
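The consistency check can be sketched as a monotonicity test over the estimated level decrements. The coefficients below are invented, not from any published value set: in an additive utility function, the utility loss within a dimension should not shrink as severity worsens.

```python
def is_consistent(decrements_by_level):
    """True if decrements (utility losses relative to level 1) are
    non-decreasing as the level worsens."""
    return all(b >= a for a, b in zip(decrements_by_level,
                                      decrements_by_level[1:]))

mobility = [0.05, 0.11, 0.20]   # levels 2-4: monotone, so consistent
pain = [0.07, 0.05, 0.18]       # level 3 decrement smaller than level 2
print(is_consistent(mobility))  # True
print(is_consistent(pain))      # False: an inconsistency to investigate
```

An inconsistency like the second example would prompt developers to re-examine the regression model (e.g., merge adjacent levels or change the functional form) before publishing the scoring algorithm.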
Figure 2 shows our 15-step framework with Brazier's corresponding stages.

Discussion
This scoping review produced a framework with 15 key steps that outline the phases of developing CSPBIs, from developing a conceptual framework to evaluating the utility function. This framework overlaps with the steps or stages of existing frameworks from psychometrics [16] and factor analysis [29,151], and augments Brazier's six stages of CSPBI development [5]. Brazier's stages begin at our step 4, with establishing the dimensionality of a pre-existing non-preference-based instrument. We added steps 1-3, required when developing any instrument de novo, coinciding with psychometric item development.
Our framework is novel in connecting the initial stages of psychometric item development (phase I), established by Guyatt et al.'s (1986) seven stages of questionnaire design [16], with the steps of preference-based instrument development. Our framework steps, excluding step 1 (a priori framework), are found within Guyatt's stages [16], but in a different order, owing to their emphasis on judgemental approaches to creating a psychometric questionnaire versus our focus on quantitative approaches to developing a preference-based questionnaire. We have noted that some steps in Phase II are desirable when performing EFA or PCA, but they are not required. In circumstances where data availability is limited, model validation using a novel dataset may not be possible.
Comparing our approach with those of O'Brien [28] and Norman and Streiner [29], our framework generalizes the elements of those authors' approaches that are common to factor analysis with and without an a priori factor structure.
Surprisingly, few developers used CFA, even though most instruments were developed from existing psychometric instruments for which an a priori dimensional structure could have been tested. When evaluating factor loadings, developers did not explicitly state the need to have 2 or 3 items per factor, or that a key objective in EFA is to fit the most parsimonious factor structure [29].
While 39 of 41 instruments used Rasch analysis, fewer than half of CSPBI developers explicitly described using psychometric and Rasch criteria in item selection (step 10), a critical step in this framework.
Peasgood et al. [136] described additional item selection criteria which are being applied in developing the novel generic preference-based instrument, the EQ-HWB (health and well-being) [156].Some of these criteria overlap with the concept of sensibility [157] and coverage of the full range of the domain in our item selection step.Peasgood et al. also highlight ensuring measurement of current HRQoL so that items can be used in comparisons between and within people, and ensuring that the items are suitable for valuation [136].
The utility elicitation method, respondent type (general public vs patients), and the functional form likely affected the derived utility values, but these influences were frequently not acknowledged and could be further studied [101,158].
Limitations of this scoping review are that we did not critically appraise the included articles and that we included only CSPBIs whose development used Rasch or IRT analysis.

Conclusions
This study fills a gap in the methodological literature by providing a comprehensive framework to guide the development of preference-based instruments de novo, adding to quality assessment criteria for patient-reported outcomes such as the Consensus-based Standards for the selection of health Measurement Instruments (COSMIN) [146,159]. Rasch and IRT methods improved item selection and the overall robustness of the resulting instruments, with potential for item banking and computerized adaptive testing [101,158]. This study will help guide the rigorous development of CSPBIs, to better measure patient preferences for clinical decision-making and cost-effectiveness analyses.


Table 1
Published condition-specific preference-based instruments
Columns: Dimensions; Number of health states; Method of eliciting utilities; Groups for utility elicitation; Scaling anchors
Abbreviations: TTO Time trade-off, DCE-TTO Discrete choice experiment-time trade-off, LT-TTO Lead-time time trade-off, BWS Best-worst scaling, RS Rating scale, VAS Visual analogue scale, UK United Kingdom, USA United States of America

Table 2
Phase I (Steps 1-3) Conceptualize measurement construct and develop initial items. Abbreviations: P Patients, G General public, C Carers, HCP Health care providers, R Researchers

Table 3
Phase II (Steps 4 to 8) Establish the dimension structure

Table 4
Phase III (Steps 9 to 11) Reducing items per dimension

Table 5
Phase III (Step 12) Evaluate measurement properties and interpretability

Table 6
Phase IV (Steps 13-15) Value, model, and evaluate health state utilities