To ensure that scores from the new RF item bank are comparable with data collected with the existing (static) version of the QLQ-C30, the item bank needs to cover the same aspects as the QLQ-C30 RF scale. It should also extend the measurement continuum, i.e. allow for the assessment of a broader range of severity of impairment, and increase measurement precision. In addition, items should fit a unidimensional model in order to be included in the final item bank. The WHO ICF differentiates between limitations in activity and restrictions in participation. To reflect the RF construct as defined within the QLQ-C30, we decided to focus on limitations in activity and considered aspects of participation to be assessed by the social functioning domain of the QLQ-C30.
Phase I - Literature search
Phase I involved a literature search to collate existing items measuring RF. Searches were run in the following databases: PubMed, the EORTC Item Bank (http://groups.eortc.be/qol/item-bank), ProQolid (https://eprovide.mapi-trust.org/), Psyndex and PsyndexPlus. The search was conducted in September 2008, applying combinations of the following free-text and MeSH terms: neoplasm*, cancer, role, social, daily, function*, well-being, limitation.
Phase II - Operationalization
The item list compiled in phase I was refined according to pre-defined selection steps. In each selection step, two independent reviewers performed the ratings, which were then compared; in case of disagreement, a third reviewer was involved, the ratings were discussed, and a majority decision was made. Reviewers had expertise in HRQOL, CAT and/or clinical oncology. First, items that were redundant, not compatible with the QLQ-C30 item style, or that assessed issues outside the scope of the conceptual framework were eliminated (step 1). Based on the remaining items, new items in the style of the QLQ-C30 (i.e. a question with a one-week recall period, assessing severity of impairment on a 4-point Likert scale from 1 'not at all' to 4 'very much') were developed (step 2). Step 3 comprised another redundancy rating and a rating of item relevance to the RF construct. In step 4, the remaining items were rated for difficulty (i.e., the level of RF being assessed). Subsequently, they were subjected to QLG-internal expert reviews (step 5) before being sent out for international expert reviews (step 6) on the items’ relevance for the assessment of RF, redundancy, clarity, and appropriateness.
Phase III - Pre-testing
To ensure content validity and the appropriateness of the items for the target population, the preliminary item list was pre-tested in an international sample of cancer patients. Inclusion criteria were a cancer diagnosis, age ≥18 years, sufficient command of the respective national language, no overt cognitive impairment, and informed consent. Translations were done according to published guidelines by the Translation Office of the EORTC Quality of Life Department [17]. Based on patient feedback, the content and wording of the items were refined and a preliminary item list to be used in field testing was created.
Phase IV - Field testing and calibration of the item bank
Sample and procedure
The preliminary item list was field-tested in an international sample of cancer patients. Inclusion criteria were the same as in phase III. We aimed for a heterogeneous sample of at least 1,000 patients, which is sufficiently large for the purposes of item calibration [18, 19]. Patients were approached in different oncology treatment settings (e.g., inpatient and outpatient; curative and palliative treatment) in order to cover a broad range of socio-demographic and clinical characteristics as well as different levels of RF impairment. In addition to the preliminary item list, patients completed the QLQ-C30 and answered questions on item relevance, clarity, and appropriateness, all administered in paper-and-pencil format.
Evaluation of dimensionality and local dependence
The items were evaluated to determine whether they met the requirements of unidimensionality and local independence, using exploratory and confirmatory factor analyses. We were also interested in the potential overlap between the RF construct and physical functioning (PF). As all patients had completed the QLQ-C30 during phase IV data collection, we were able to investigate the joint factor structure of the new RF items and the PF items of the QLQ-C30.
Eigenvalues, root mean square error of approximation (RMSEA) <0.10, the Tucker-Lewis Index (TLI) >0.90 and the Comparative Fit Index (CFI) >0.90 were used as criteria in the evaluation of factor structure and model fit [20, 21]. Residual correlations >0.20 served as indicators of local dependence (LD) [22].
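The eigenvalue inspection and the residual-correlation screen for local dependence can be illustrated with a minimal numpy sketch; this is not the factor analyses actually performed in the study (RMSEA, TLI and CFI come from the fitted confirmatory models themselves), and the one-factor loadings below are a crude principal-component approximation. The array name responses is hypothetical.

```python
import numpy as np

def dimensionality_screen(responses, ld_threshold=0.20):
    """responses: hypothetical (n_patients x n_items) array of item scores."""
    r = np.corrcoef(responses, rowvar=False)            # observed inter-item correlations
    eigvals, eigvecs = np.linalg.eigh(r)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort eigenvalues descending
    loadings = eigvecs[:, 0] * np.sqrt(eigvals[0])       # crude one-factor loadings (first component)
    residuals = r - np.outer(loadings, loadings)         # residual correlations after the single factor
    np.fill_diagonal(residuals, 0.0)
    ld_pairs = np.argwhere(np.triu(np.abs(residuals) > ld_threshold, k=1))
    return eigvals, ld_pairs                             # dominant 1st eigenvalue suggests unidimensionality;
                                                         # flagged pairs suggest possible local dependence
```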
Item bank calibration and evaluation of item fit
Items were checked for monotonicity, i.e. whether the cumulative probability of choosing a given response category or a higher one is non-decreasing with increasing IRT scores, so that the better a patient's RF, the more likely a response reflecting higher RF should be. This was done by comparing the average item scores with the rest scores (the sum of scores on the remaining items). Items were then calibrated to a generalized partial credit model (GPCM) [23], a model which estimates a discrimination (slope) parameter for each item (i.e. the item's ability to discriminate between people) and a set of threshold parameters (i.e. the locations on the continuum where the item's response options are most likely to be endorsed). To assess item fit, S-X² fit statistics [24, 25], the difference between expected and observed responses (bias), and infit and outfit mean-squares (MnSq) were used [26]. Bias is indicated by a root mean square error (RMSE) ≥1, which corresponds to a difference of one response category. Concerning MnSq values, primarily large infit and outfit values (>1.3) were regarded as problematic, as they indicate poor agreement between observed and expected responses [27]. In addition, to make infit and outfit values less dependent on sample size and response variation, they can be t-transformed to an approximately standard normal distribution. Values outside ±2 (1.96; 95 % level) may be regarded as possibly problematic, values outside ±2.6 (99 % level) as problematic, and values outside ±3.3 (99.9 % level) as clearly problematic.
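As an illustration of the GPCM described above, the following sketch computes the category response probabilities of a single item from a slope and a set of threshold parameters; it is not the Parscale calibration itself, and the parameter values in the example are hypothetical.

```python
import numpy as np

def gpcm_probs(theta, a, thresholds):
    """Return P(X = k | theta) for categories k = 0..K under the GPCM."""
    thresholds = np.asarray(thresholds, dtype=float)
    # cumulative exponents a*(theta - b_j); category 0 has an exponent of 0 by convention
    steps = np.concatenate(([0.0], np.cumsum(a * (theta - thresholds))))
    num = np.exp(steps - steps.max())                    # subtract the maximum for numerical stability
    return num / num.sum()

# hypothetical 4-category item (QLQ-C30 scores 1-4 recoded to 0-3)
print(gpcm_probs(theta=0.5, a=1.8, thresholds=[-1.0, 0.2, 1.4]))
```

The slope a stretches or flattens the category curves (discrimination), while the thresholds locate the points on the RF continuum where adjacent response options become equally likely.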
Differential item functioning
The items were then tested for differential item functioning (DIF), i.e. whether items perform differently in certain sociodemographic and clinical subgroups. This was done using ordinal logistic regression [28–30]. Group variables were age, gender, country, cancer site, cancer stage, current treatment, living with a partner/alone, level of education, and occupational status (working/retired/other). Subsequently, for items showing DIF, we tested whether the DIF affected parameter estimates. This method compares the RF scores obtained with a model which does not account for DIF against those obtained with a model which does. If the RF scores differ substantially, defined as a difference larger than the median standard error of the RF estimates, this indicates practically problematic DIF, also termed "salient scale-level differential functioning" [15, 28, 31].
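The ordinal logistic regression approach can be sketched as follows: a model containing only the RF score (theta) is compared with one that adds the group indicator and its interaction with theta, and a likelihood-ratio test with 2 degrees of freedom flags overall (uniform plus non-uniform) DIF. This is one common formulation rather than the exact procedure of references [28–30]; it assumes statsmodels' OrderedModel, and the inputs item, theta and group are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2
from statsmodels.miscmodels.ordinal_model import OrderedModel

def dif_lr_test(item, theta, group):
    """Overall DIF test for one item: theta-only model vs. theta + group + interaction."""
    y = np.asarray(item)                                 # ordinal item scores
    x_base = pd.DataFrame({"theta": theta})
    x_full = x_base.assign(group=group,
                           theta_x_group=np.asarray(theta) * np.asarray(group))
    base = OrderedModel(y, x_base, distr="logit").fit(method="bfgs", disp=False)
    full = OrderedModel(y, x_full, distr="logit").fit(method="bfgs", disp=False)
    lr = 2 * (full.llf - base.llf)                       # likelihood-ratio statistic
    return lr, chi2.sf(lr, df=2)                         # small p-value flags possible DIF
```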
Evaluation of measurement properties
Finally, the item bank’s performance for CAT measurement was assessed using real and simulated data. CAT simulations to evaluate measurement precision were done using Firestar and were based on the collected data (N = 1023). We simulated CATs administering an increasing number of items, starting with one and ending with nine. We estimated the RF score based on each of these CATs and compared these scores with the score based on all 10 items. As the starting item we used the QLQ-C30 RF item with the highest average information. The Expected A Posteriori (EAP) method was applied for latent trait (theta/θ) estimation.
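A minimal sketch of EAP scoring under the GPCM is given below, using a quadrature grid and a standard normal prior; it reuses the hypothetical gpcm_probs() from the calibration sketch above and is not the Firestar implementation. The dictionaries responses and item_params are hypothetical.

```python
import numpy as np

def eap_estimate(responses, item_params, nodes=np.linspace(-4, 4, 81)):
    """responses: dict item_id -> observed category (0..K);
    item_params: dict item_id -> (slope, thresholds)."""
    prior = np.exp(-0.5 * nodes ** 2)                    # standard normal prior (unnormalised)
    likelihood = np.ones_like(nodes)
    for item_id, k in responses.items():
        a, thresholds = item_params[item_id]
        likelihood *= np.array([gpcm_probs(t, a, thresholds)[k] for t in nodes])
    posterior = prior * likelihood
    posterior /= posterior.sum()
    theta_hat = np.sum(nodes * posterior)                # posterior mean = EAP estimate
    se = np.sqrt(np.sum((nodes - theta_hat) ** 2 * posterior))
    return theta_hat, se                                 # the SE shrinks as more items are asked
```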
To evaluate possible savings in sample size, the relative validity (RV) of the CATs compared to the QLQ-C30 RF scale in detecting expected group differences was calculated [32]. The RV is the ratio of two test statistics for comparing two (known) groups. We used the t-test statistic for each of the CATs as the numerator and the t-test statistic for the QLQ-C30 RF scale as the denominator; hence an RV >1 indicates that the CAT has greater discriminating power than the QLQ-C30 scale. Known-group variables (age, sex, stage, work status, therapy, education) were tested for significance for either the CAT or the QLQ-C30 measures; if significant, they were used for calculating RVs. This was done based on the collected data.
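The RV computation described above reduces to a ratio of two t statistics, as in the sketch below; the arrays cat_scores, rf_scale_scores and the known-group indicator are hypothetical, and scipy's ttest_ind supplies the test statistics.

```python
import numpy as np
from scipy.stats import ttest_ind

def relative_validity(cat_scores, rf_scale_scores, group):
    """group: boolean array splitting patients into the two known groups."""
    group = np.asarray(group, dtype=bool)
    t_cat, _ = ttest_ind(np.asarray(cat_scores)[group], np.asarray(cat_scores)[~group])
    t_rf, _ = ttest_ind(np.asarray(rf_scale_scores)[group], np.asarray(rf_scale_scores)[~group])
    return abs(t_cat) / abs(t_rf)                        # RV > 1: the CAT discriminates better
```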
RV was also assessed on the basis of simulated data. We simulated responses to the items based on RF scores sampled from normal distributions with different means, comparing groups of different sizes and different true effect sizes. For each of the possible settings we ran 2,000 simulations. For more details on the methods please refer to Petersen et al. 2011 [15] and Petersen et al. 2012 [16]. The statistical packages used were SAS, Parscale [33] and Mplus [34].
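The response simulation can be sketched as follows: RF scores are drawn from normal distributions with different means and a GPCM response is sampled for each item, reusing the hypothetical gpcm_probs() above. This is an illustrative sketch with made-up item parameters, not the study's simulation code.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_responses(thetas, item_params):
    """Sample one GPCM response per item for each simulated theta."""
    data = []
    for theta in thetas:
        row = []
        for a, thresholds in item_params:
            p = gpcm_probs(theta, a, thresholds)         # category probabilities at this theta
            row.append(rng.choice(len(p), p=p))          # draw a response category
        data.append(row)
    return np.asarray(data)

# e.g. two groups of 50 patients with a true mean difference of 0.5 SD on the RF metric
params = [(1.8, [-1.0, 0.2, 1.4]), (1.2, [-0.5, 0.6, 1.8])]   # hypothetical item parameters
group_a = simulate_responses(rng.normal(0.0, 1.0, 50), params)
group_b = simulate_responses(rng.normal(0.5, 1.0, 50), params)
```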