Validation and reliability of the Dutch version of the EORTC QLQ-BLM30 module for assessing the health-related quality of life of patients with muscle invasive bladder cancer

Background: Quality of Life (QoL) of bladder cancer patients has been largely neglected. This is partly due to the lack of well-validated QoL questionnaires. The aim of this study is to examine the structural validity, reliability (i.e., internal consistency and test-retest reliability), construct validity (i.e., divergent validity and known group validity) and responsiveness of the Dutch version of the European Organisation for Research and Treatment of Cancer QoL questionnaire for muscle invasive bladder cancer (EORTC-QLQ-BLM30).
Methods: Patients with newly diagnosed muscle invasive bladder cancer (MIBC) participating in the population-based ‘Blaaskankerzorg In Beeld’ (BlaZIB) study who completed the EORTC-QLQ-BLM30 at baseline were included. BlaZIB is a Dutch nationwide population-based prospective cohort study collecting clinical data and QoL data of bladder cancer patients. QoL is assessed with a self-administered questionnaire at four points in time: 6 weeks (baseline), 6 months, 12 months and 24 months after diagnosis. Confirmatory factor analysis and multitrait scaling analysis were used to investigate and adapt the scale structure. Reliability, construct validity and responsiveness of the revised scales were evaluated.
Results: Of the 1542 patients invited to participate, 650 patients (42.2%) completed the QLQ-BLM30 at baseline. The questionnaire’s scale structure was revised into seven scales and eight single items. Internal consistency and test-retest reliability were adequate for most scales (Cronbach’s α ≥ 0.70 and intraclass correlation coefficient ≥ 0.70, respectively), with the exception of the revised urostomy problem scale and the abdominal bloating and flatulence scale. The questionnaire exhibited little overlap with the EORTC-QLQ-C30: all correlations were < 0.40, except for the correlation between emotional function (QLQ-C30) and future worries (QLQ-BLM30). The questionnaire was able to distinguish between patient subgroups formed on the basis of physical function but, as hypothesized, not between subgroups formed on the basis of stage. Changes in health due to treatment were captured by the questionnaire, indicating that the questionnaire is responsive to change.
Conclusions: This study shows that the adapted scale structure of the EORTC-QLQ-BLM30 generally exhibits good measurement properties in Dutch patients, but needs to be validated in other languages and settings.
Trial registration: BlaZIB, NL8106, www.trialregister.nl
Supplementary Information: The online version contains supplementary material available at 10.1186/s12955-022-02064-z.


Introduction
Bladder cancer (BC) is one of the ten most common types of cancer worldwide [1]. About a quarter of patients present with Muscle Invasive Bladder Cancer (MIBC), and of the three quarters of patients diagnosed with Non-Muscle Invasive Bladder Cancer (NMIBC), 10-15% will subsequently progress to MIBC.
In the past decades the main focus of BC research has been on optimizing oncological outcomes, with relatively little attention being paid to the impact of the disease and its treatment on the functional health, symptom burden and health-related quality of life (HRQoL) of patients [2]. Several reports have indicated that the HRQoL outcomes of patients with BC appear to be worse than those of patients with other cancer types [3]. HRQoL outcomes are typically assessed with patient-reported outcome measures (PROMs). There are currently a number of PROMs designed to assess the HRQoL of patients with BC, including: the Bladder Cancer Index (BCI) [4]; the Functional Assessment of Cancer Therapy questionnaire for bladder cancer patients in general (FACT-Bl) and for those who undergo a cystectomy (FACT-VCI) [5,6]; and the European Organisation for Research and Treatment of Cancer (EORTC) questionnaires for NMIBC (EORTC-QLQ-NMIBC24) [7,8] and MIBC (EORTC-QLQ-BLM30) [9]. A major limitation of many of these PROMs is the lack of validation studies demonstrating that they accurately measure what they intend to measure [10]. This is especially true for the EORTC-QLQ-BLM30. To date, only one study has investigated the internal consistency of four of the seven scales of the QLQ-BLM30 [11], and all other psychometric properties, with the exception of known-group validity [10], have not yet been assessed.
The aim of the current study was to investigate the structural validity, reliability (i.e., internal consistency and test-retest reliability), construct validity (i.e., divergent validity and known group validity), and responsiveness of the Dutch-language version of the EORTC-QLQ-BLM30 in patients with MIBC.

Study design
The study included patients diagnosed with non-metastatic MIBC (≥cT2,cN0-2,cM0) between November 1st 2017 and November 1st 2019, who participated in the HRQoL component of the BlaZIB study (Blaaskankerzorg In Beeld, EN: Insight into bladder cancer care). BlaZIB is a Dutch population-based prospective cohort study, embedded in the Netherlands Cancer Registry (NCR), evaluating the quality of bladder cancer care in the Netherlands. BlaZIB collects comprehensive clinical data and HRQoL data of patients. More information about the BlaZIB study can be found elsewhere [12]. The Committee on Research involving Human Subjects (CMO) of Arnhem-Nijmegen deemed the BlaZIB study exempt from ethical review under the Medical Research Involving Human Subjects Act (WMO). The BlaZIB study was approved by the privacy review board of the NCR. Informed consent, either written or digital, was obtained from all patients participating in the HRQoL component of the BlaZIB study.

Questionnaires
All patients diagnosed in a hospital participating in the HRQoL component of the BlaZIB study received an invitation to complete the baseline questionnaire shortly (about 6 weeks) after histological confirmation of the bladder tumour (T6wk). Patients who completed the baseline questionnaire and were still alive at follow-up received a follow-up questionnaire at 6 months (T6mo), 12 months (T12mo) and 24 months (T24mo) after diagnosis. The questionnaires were provided digitally and in paper-and-pencil format and included questions on demographics, work, lifestyle and HRQoL. HRQoL was assessed with the Dutch versions of the EQ-5D-5L, the EORTC-QLQ-C30, the EORTC-QLQ-NMIBC24 and the EORTC-QLQ-BLM30. The Bladder Cancer Index (BCI) was included as an additional non-obligatory questionnaire.

Questionnaire scoring
The EORTC-QLQ-C30 consists of 30 items assessing global health status, five functional health domains (physical, role, emotional, cognitive and social functioning) and nine symptoms (fatigue, nausea and vomiting, pain, dyspnoea, insomnia, appetite loss, constipation, diarrhoea and financial difficulties) [13]. The EORTC-QLQ-BLM30 consists of 30 items that were originally hypothesized to form seven scales (urinary symptom (US), urostomy problem, single catheter use problem (CAT), future worries (FW), abdominal bloating and flatulence (BAF), body image (BI) and sexual functioning) [14]. All items, except for global health status (seven-point scale), are scored on a four-point Likert scale ranging from 1 (not at all) to 4 (very much). Because patients who completed the online questionnaire were required to answer all questions, the response category 'not applicable / not willing to share' was added to the items of the sexual functioning scale (items 53 to 60) in both the online and the paper-and-pencil questionnaire. This extra response category was handled as missing in the calculation of the scores. In accordance with the EORTC guidelines, all responses were linearly transformed to a 0 to 100 scale, and missing data were imputed by averaging the scores of the completed items of the scale if more than 50% of the items of the scale were completed [15].
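The scoring rule described above can be illustrated with a minimal Python sketch. It assumes the standard EORTC linear transformation (raw score = mean of the completed items, rescaled to 0 to 100, and reversed for functional scales) and applies the "more than 50% completed" rule as worded in this section. The function name `eortc_scale_score` and its parameters are hypothetical and are not taken from the study's own scoring code.

```python
from typing import Optional, Sequence


def eortc_scale_score(
    items: Sequence[Optional[int]],
    item_range: int = 3,
    functional: bool = False,
) -> Optional[float]:
    """Score one scale on a 0-100 metric.

    items      -- responses for one patient; None marks a missing response
                  (including 'not applicable / not willing to share')
    item_range -- highest minus lowest response value (3 for 1-4 items,
                  6 for the seven-point global health items)
    functional -- True if a higher score should mean better functioning
    """
    answered = [x for x in items if x is not None]
    # Assumption taken from the text: score only if MORE than 50% of the
    # items were completed; otherwise the scale score is missing.
    if not items or 2 * len(answered) <= len(items):
        return None
    raw = sum(answered) / len(answered)   # mean of completed items
    score = (raw - 1) / item_range * 100  # linear transformation to 0-100
    return 100 - score if functional else score


# Example: a three-item symptom scale with one missing response.
print(eortc_scale_score([2, None, 3]))
```

Averaging only the completed items is equivalent to imputing each missing item with the mean of the completed items of the same scale.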

Additional measures
In order to assess the test-retest reliability of the EORTC-QLQ-BLM30, 81 patients diagnosed with MIBC were asked to complete the EORTC-QLQ-BLM30 2 weeks after the T12mo assessment (T12mo + 2wk; response rate: 84.4%). For practical reasons, this latter questionnaire was only administered in a paper-and-pencil format, but included the same instructions given at T12mo. The T12mo + 2wk questionnaire contained four questions to assess whether patients had less, equal or more complaints in general and on three subscales (urinary, bowel and sexual) compared to the previous questionnaire (T12mo). Only those patients who indicated that they were stable over time on the relevant subscales (i.e. equal complaints) were included in the test-retest analysis.
To assess the divergent validity, the content of the EORTC-QLQ-BLM30 was compared with that of the core questionnaire, the EORTC-QLQ-C30.

Structural validity
Confirmatory factor analysis (CFA) was performed to evaluate the hypothesized scale structure of the QLQ-BLM30. Because the US and urostomy problem scales are mutually exclusive, the CFA was run twice: once with the urostomy problem scale and without the US scale, and once vice versa. Maximum likelihood (ML) was used as the estimator in the CFA, and missing item responses were handled using full information maximum likelihood (FIML). Model-data fit of the CFA was assessed with the model chi-square, the Comparative Fit Index (CFI), the Root Mean Square Error of Approximation (RMSEA) and the Standardized Root Mean Square Residual (SRMR). A model chi-square p-value > 0.05, CFI ≥ 0.95, RMSEA < 0.05 and SRMR < 0.05 indicate a good fit, and CFI > 0.90, 0.05 < RMSEA < 0.08 and 0.05 < SRMR < 0.08 indicate an acceptable fit [16].
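As an illustration of how these cut-offs can be applied, the following Python sketch labels a fitted model from its CFI, RMSEA and SRMR. It is not the study's code: the function name `classify_fit` is hypothetical, the chi-square test is left out, and requiring all three indices to meet a cut-off at the same time is an assumption made here for simplicity.

```python
def classify_fit(cfi: float, rmsea: float, srmr: float) -> str:
    """Label model-data fit using the cut-offs quoted in the text.

    Assumption: a label is assigned only when all three indices meet
    the corresponding cut-off; the model chi-square is not considered.
    """
    if cfi >= 0.95 and rmsea < 0.05 and srmr < 0.05:
        return "good"
    if cfi > 0.90 and rmsea < 0.08 and srmr < 0.08:
        return "acceptable"
    return "poor"


# Example with illustrative (not study) values.
print(classify_fit(cfi=0.92, rmsea=0.06, srmr=0.07))
```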
Multitrait scaling analysis was performed for each assessment point to evaluate the unidimensionality of the scales (i.e., an assumption in classical test theory) and to examine whether the individual items could be grouped in the hypothesized scales. A correlation of ≥0.40 between an item and its own scale was regarded as adequate statistical evidence for convergent validity. Statistical evidence of discriminant validity was defined as a correlation of < 0.40 between an item and other scales in the questionnaire [17]. Items that had poor convergent and/or discriminant validity were discussed and reassigned to another or new scale if necessary. Further psychometric analyses were performed after finalizing the scale structure.
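The item-scale correlations underlying a multitrait scaling analysis can be sketched as follows. This is a simplified illustration under stated assumptions: complete cases only, unweighted sum scores, Pearson correlations, every scale containing at least two items, and the item's own response removed from its own scale total (correction for overlap). The names `pearson`, `item_scale_correlations` and `flag_items` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def item_scale_correlations(data, scales):
    """Correlate every item with every scale total.

    data   -- dict: item name -> list of responses (same patients, same order)
    scales -- dict: scale name -> list of item names (at least two per scale)
    For an item's own scale, the item is left out of the total.
    """
    n = len(next(iter(data.values())))
    result = {}
    for items in scales.values():
        for item in items:
            result[item] = {}
            for scale, members in scales.items():
                others = [m for m in members if m != item]
                total = [sum(data[m][i] for m in others) for i in range(n)]
                result[item][scale] = pearson(data[item], total)
    return result


def flag_items(corrs, scales, cutoff=0.40):
    """Return item -> (convergent_ok, discriminant_ok) using the 0.40 cut-off."""
    flags = {}
    for own, items in scales.items():
        for item in items:
            convergent = corrs[item][own] >= cutoff
            discriminant = all(
                r < cutoff for s, r in corrs[item].items() if s != own
            )
            flags[item] = (convergent, discriminant)
    return flags
```

Items flagged as failing either criterion would then be candidates for discussion and possible reassignment, as described above.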
Floor and ceiling effects were examined for each scale. A scale was considered to have a floor or ceiling effect if more than 15% of the patients achieved the lowest or highest possible score, respectively.
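The 15% criterion can be checked with a few lines of Python. The sketch below assumes scale scores on the 0 to 100 metric, with `None` marking a missing score; the function name `floor_ceiling` is hypothetical.

```python
def floor_ceiling(scores, lowest=0.0, highest=100.0, threshold=0.15):
    """Proportion of patients at the lowest and highest possible score.

    A floor or ceiling effect is flagged when the proportion exceeds
    the threshold (more than 15% of patients, as stated in the text).
    """
    valid = [s for s in scores if s is not None]
    if not valid:
        raise ValueError("no non-missing scores")
    floor = sum(s == lowest for s in valid) / len(valid)
    ceiling = sum(s == highest for s in valid) / len(valid)
    return {
        "floor_proportion": floor,
        "ceiling_proportion": ceiling,
        "floor_effect": floor > threshold,
        "ceiling_effect": ceiling > threshold,
    }
```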

Measurement error and reliability
Cronbach's coefficient α was calculated for each scale to assess internal consistency. A Cronbach's α of 0.70 or higher was considered acceptable for group comparisons. Test-retest reliability was assessed based on the questionnaires administered at T12mo and T12mo + 2wk using the intraclass correlation coefficient (ICC) for absolute agreement (two-way mixed model, single measure) [18]. An ICC value of 0.70 or higher was considered acceptable.
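Both reliability coefficients can be written down compactly. The Python sketch below is an illustration rather than the software used in the study: it assumes complete cases, computes Cronbach's α from sample variances, and computes a single-measure, absolute-agreement ICC from the two-way ANOVA mean squares (the ICC(A,1) form). The function names are hypothetical.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha.

    items -- list of per-item response lists (same patients, same order).
    """
    k = len(items)
    totals = [sum(values) for values in zip(*items)]  # sum score per patient
    item_variance = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_variance / variance(totals))


def icc_absolute_single(ratings):
    """Single-measure ICC for absolute agreement (two-way model).

    ratings -- list of rows, one per patient, each holding the k repeated
               scores (e.g. [score at T12mo, score at T12mo + 2wk]).
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)                 # between-patient mean square
    msc = ss_cols / (k - 1)                 # between-occasion mean square
    mse = ss_error / ((n - 1) * (k - 1))    # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

Because absolute agreement is required, a systematic shift between the two assessments lowers the ICC even when patients keep the same rank order.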

Hypothesis testing for construct validity
Divergent validity of the QLQ-BLM30 was assessed by calculating the Spearman correlation coefficients between the scales of the QLQ-C30 and QLQ-BLM30. It was hypothesized that the symptom scales of the QLQ-BLM30 would have low to moderately negative correlations with the functioning scales of the QLQ-C30. A strong correlation was expected between scales that were conceptually related, i.e. constipation and diarrhoea (QLQ-C30) vs abdominal bloating and flatulence (QLQ-BLM30).
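For readers who want to reproduce this type of analysis, a Spearman correlation can be computed as a Pearson correlation on average ranks. The sketch below uses only the Python standard library and assumes complete pairs of scale scores; the helper names are hypothetical, and the 0.40 cut-off in `low_overlap` mirrors the threshold used when reporting the results rather than a rule stated in this section.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / sqrt(sxx * syy)


def low_overlap(rho, cutoff=0.40):
    """True if the absolute correlation is below the cut-off."""
    return abs(rho) < cutoff
```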
Known group validity was assessed by comparing patients with stage II (T2,N0,M0) versus stage III (T3-4a,N0,M0 or T1-4a,N1-2,M0) disease (UICC TNM 2018 [19]), and patients with a physical function (PF) score < 90 versus ≥ 90. It was expected that the HRQoL of patients with stage II and III disease would be comparable, as these patients are treated similarly [20]. We hypothesized that patients with high PF (≥ 90) would report better functioning and fewer symptoms on all scales than patients with low PF (< 90). Effect sizes (ESs) were calculated using Cohen's d statistic (mean difference divided by pooled standard deviation). These provide a distribution-based estimate of the magnitude of mean differences/changes, where an ES of 0.2 is considered small, 0.5 moderate, and 0.8 large [21].
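The effect size calculation can be illustrated with a short Python sketch. It assumes two independent groups and the usual degrees-of-freedom weighted pooled standard deviation; the exact pooling formula used in the study is not stated, so this choice is an assumption, as is the "negligible" label for values below 0.2. The function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Mean difference (a minus b) divided by the pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_sd = sqrt(
        ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b))
        / (na + nb - 2)
    )
    return (mean(group_a) - mean(group_b)) / pooled_sd


def effect_size_label(d):
    """Label the magnitude using the 0.2 / 0.5 / 0.8 conventions."""
    size = abs(d)
    if size < 0.2:
        return "negligible"  # assumption: label for values below 'small'
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "moderate"
    return "large"
```

The sign of d depends on the order of the groups and on whether a higher score on the scale indicates better functioning or more symptoms.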

Responsiveness
Responsiveness to change was assessed in patients who underwent treatment with curative intent (i.e., radical cystectomy (RC) or (chemo)radiotherapy ((C)RT) [20]) after completion of the baseline questionnaire, showed no disease recurrence or progression and completed the EORTC-QLQ-BLM30 questionnaire at all time points. It was hypothesized that patients would report increased urinary, bowel and sexual problems after removal of the bladder compared to baseline [22,23]. Additionally, we hypothesized that patients who were treated with (C)RT would report better sexual function and body image than patients treated with RC [11].
The CFA was conducted in R using the "lavaan" package [24]. ICCs were calculated in Stata version 16.0 (StataCorp LLC, College Station, Texas, USA). All other statistical analyses were performed using SAS version 9.4 (SAS Institute, Cary, North Carolina, USA).

Patient characteristics, completion rates and missing data
Of the 1542 patients invited to participate in the HRQoL measures, 650 patients (42.2%) completed the baseline questionnaire (T6wk). Respondents were more often male, had a slightly better comorbidity profile, had a higher socioeconomic status (SES), had a more favourable stage distribution and were more likely to undergo a RC (see Additional file 2 for a comparison of the patient and tumour characteristics of the respondents and non-respondents). The follow-up questionnaires had higher completion rates: 396 (62.7% of the invited patients) at T6mo, 357 (70.3%) at T12mo and 277 (76.5%) at T24mo (Fig. 1). Patient characteristics are shown in Table 1. The percentage of missing responses, including not applicable, on the items single catheter use (item 44) and female sexual function (item 60) was high (> 85%) (see Additional file 2). The percentage of missing responses was low (< 3%) for items 45 to 52 (future worries, bloating and flatulence and body image scales) and varied per assessment point for the items belonging to the original urinary symptom and urostomy problems scales (items 30 to 43), as these scales are mutually exclusive. The percentage of missing responses for the items belonging to the original sexual function scale (items 53 to 60) was < 48%, except for female sexual function, if limited to the patients reporting at least some sexual activity (item 48).

Structural validity
Items 44 and 60 were excluded from the CFA because of the high number of missing responses (> 85%). The hypothesized scale structure of the QLQ-BLM30 did not fit the data well, with a CFI of 0.80-0.86, RMSEA of 0.07-0.08 and SRMR of 0.11-0.13 (see Additional file 2). Multitrait scaling analysis showed that the sexual functioning and urostomy problems scales were particularly problematic (see Additional file 2). For this reason, we decided to revise the sexual functioning scale in the same way as was done for the EORTC-QLQ-NMIBC24 [7] (see Table 3), even though items 57 and 58 showed a moderate correlation (0.53-0.59) and the model fit remained largely the same (±0.002 change) after combining these items into one scale (Additional file 2). Based on the data and the item content, the urostomy problems scale was reduced to a three-item scale (items 38, 39 and 43). The remaining items in the originally hypothesized scale about urostomy irritation (item 40), urostomy embarrassment (item 41) and urostomy support (item 42) were kept as single items. Although the bloating and flatulence scale showed low convergent validity (< 0.40) at all time points (see Table 2), it was considered to be a unidimensional scale in patients who underwent a RC (convergent: 0.43; discriminant: −0.07 to 0.31). This led to the decision to keep the bloating and flatulence scale intact. The revised scale structure exhibited adequate to good fit at all time points (see Additional file 2). The hypothesized and revised scale structures of the EORTC-QLQ-BLM30 are shown in Table 3.

Measurement error and reliability
The internal consistency of the revised scales at all time points was good (α > 0.70), with the exception of urostomy problems (0.55-0.65) and bloating and flatulence (0.48-0.54; Table 2). Test-retest reliability was acceptable for three scales (ICC > 0.70); nearly acceptable for two scales (ICC 0.68-0.69) and fair to moderate for the urostomy problems scale (0.61) and bloating and flatulence (0.47; Table 4).

Hypothesis testing for construct validity
The correlations between the scales of the core questionnaire (QLQ-C30) and QLQ-BLM30 questionnaire were low (< 0.40; Table 5), with the exception of the correlation between emotional function (QLQ-C30) and future worries (QLQ-BLM30). This indicates that the module's content does not, for the most part, overlap with the content of the core questionnaire.
The scores of patients with stage II and stage III bladder cancer were, as expected, quite similar (ES < 0.30). Patients with a score of 90 or higher on the physical function scale of the QLQ-C30 had higher scores on the functional scales and lower scores on the symptom scales of the QLQ-C30 and QLQ-BLM30 compared to patients with physical function scores < 90 (Table 6).

Responsiveness
Future worries decreased after baseline in patients who underwent treatment with curative intent (ES = 0.67 to 1.39; Table 7). Body image (ES = -0.77 to -0.62) and male sexual problems (ES = -0.78 to -0.67) deteriorated in patients who underwent a RC, while body image (ES = 0.23 to 0.33) and urinary function (ES = 0.16 to 0.59) improved in patients undergoing (C)RT.

Discussion
The aim of this study was to examine the structural validity, reliability, construct validity and responsiveness of the Dutch version of the EORTC-QLQ-BLM30. The original hypothesized scale structure of the QLQ-BLM30 could not be substantiated using data of 650 Dutch patients with MIBC, and therefore, the scale structure was revised into seven scales (urinary symptoms, urostomy problems, future worries, abdominal bloating and flatulence, body image, sexual functioning and male sexual problems) and eight single items. The revised scale structure, in general, exhibited good reliability, construct validity and responsiveness. Only reliability (i.e. internal consistency and test-retest reliability) of the new urostomy problems scale and abdominal bloating and flatulence scale was below the acceptable cut-off point. Based on the data and the items' content, we revised the six-item urostomy problems scale into a three-item scale and three single items. The new urostomy problem scale performs better (i.e., has higher internal consistency and better CFA results) than the originally hypothesized scale, but still performs below the acceptable cut-off point for both internal consistency and test-retest reliability. We would note that the original hypothesized scale showed better internal consistency (α = 0.71) in the study of Mak et al. [11]. This indicates that more research will be needed to examine the coherence of the items 38 to 43, as the possibility exists that the originally hypothesized urostomy problem scale may be sufficient in other populations than the current study population.
The two items of the abdominal bloating and flatulence scale (i.e., items 48 and 49) were only moderately correlated and thus did not appear to measure one unidimensional construct in the general MIBC population. Furthermore, the test-retest analysis indicated that the scores on this scale varied more than would be desirable in generally stable patients, and no large differences were observed for this scale in patients who experienced changes in health over time. Similar results have been observed for this scale in patients diagnosed with NMIBC [7,8]. However, the correlation between the items of the bloating and flatulence scale was higher for patients treated with RC than for the total MIBC population. It may be that the bloating and flatulence scale is not relevant for the entire MIBC population, but only for certain subgroups such as patients treated with RC or (C)RT. More research is needed to explore this scale further in other populations and patient subgroups.
The response rates for the items addressing sexual function, especially female sexual function (item 60), were generally lower than for the items of other scales, after correcting for the missing responses related to having or not having a urostomy (i.e., the urinary symptom and urostomy problem scales). This finding is in line with other studies [7,22,25], but nevertheless resulted in the exclusion of female sexual function and the item single catheter use from the CFA. The new grouping of the items addressing sexual function is the same as proposed and confirmed for the QLQ-NMIBC24 [7,8,25], and therefore, we expect that this new grouping of the sexual items will be sustained in future studies investigating the measurement properties of the QLQ-BLM30.
Although this study evaluated the measurement properties of the QLQ-BLM30 in a large population-based group of patients with MIBC, it has some limitations. The primary limitation is its setting; this study was performed within a single study in a single country. Further research will be needed to examine the measurement properties of the QLQ-BLM30 in other countries and settings. Furthermore, the completion rate of the baseline questionnaire (T6wk) was rather low, which may affect the generalizability of the scores to the entire Dutch population with MIBC. We do, however, believe that this would have a negligible effect on the observed measurement properties of the QLQ-BLM30, as it is unlikely that potential selection bias would significantly affect the measurement properties of the questionnaire. Finally, we would note that we relied on classical psychometrics to evaluate the properties of the QLQ-BLM30. Clinimetric evaluation of the questionnaire might result in a somewhat different scale structure than reported here, because that approach does not

Table 7 Responsiveness to change over time in patients who underwent a potentially curative therapy, completed the EORTC-QLQ-BLM30 at all time points and had no disease recurrence or progression
BAF Bloating and flatulence, BI Body image, (C)RT (Chemo)radiotherapy, FW Future worries, mo Month, RC Radical cystectomy, SD Standard deviation, SX Sexual functioning, SXmen Sexual problems in men, UP Urostomy problems, US Urinary symptoms, wk Week
a A higher score on this scale indicates more symptoms, problems or worries
b A higher score on this scale indicates better functioning
c Effect size is mean difference (i.e. between T6wk and time point unless otherwise indicated) divided by the pooled standard deviation
d Effect size is mean difference (i.e. between T6mo and time point) divided by the pooled standard deviation