Interpreting small treatment differences from quality of life data in cancer trials: an alternative measure of treatment benefit and effect size for the EORTC-QLQ-C30

Background The EORTC-QLQ-C30 is a widely used health related quality of life (HRQoL) questionnaire in lung cancer patients. Small HRQoL treatment effects are often reported as mean differences (MDs) between treatments, which are rarely justified or understood by patients and clinicians. An alternative approach using odds ratios (OR) for reporting effects is proposed. This may offer advantages including facilitating alignment between patient and clinician understanding of HRQoL effects. Methods Data from six CRUK sponsored randomized controlled lung cancer trials (2 in small cell and 4 in non-small cell, in 2909 patients) were used to estimate HRQoL effects. Results from Beta-Binomial (BB) and standard mixed effects models were compared. Preferences for ORs vs MDs were determined and Time to Deterioration (TD) was also compared. Results HRQoL effects using ORs offered coherent interpretations: MDs >0 resulted in ORs >1 and vice versa; effect sizes were classified as ‘Trivial’ if the OR was between 1 ± 0.05 (i.e. 0.95 to 1.05); ‘Small’: for 1 ± 0.1; ‘Medium’: 1 ± 0.2 and ‘Large’: OR <0.8 or >1.20. Small HRQoL effects on the MD scale may translate to important treatment differences on the OR scale: for example, a worsening in symptoms (MD) by 2.6 points (p = 0.1314) would be a 17 % deterioration (p < 0.0001) with an OR. Hence important differences may be missed with MDs; conversely, small ORs are unlikely to yield large MDs because methods based on ORs model skewed data well. Initial evidence also suggests oncologists prefer ORs over MDs since interpretation is similar to hazard ratios. Conclusion Reporting HRQoL benefits as MDs can be misleading. Estimates of HRQoL treatment effects in terms of ORs are preferred over MDs. Future analysis of QLQ-C30 and other HRQoL measures should consider reporting HRQoL treatment effects as ORs.
Electronic supplementary material The online version of this article (doi:10.1186/s12955-015-0374-6) contains supplementary material, which is available to authorized users.


Background
Health related quality of life (HRQoL) is an important endpoint in cancer trials for several reasons. First, where effect sizes are small, HRQoL can 'add value' to expensive cancer treatments. Secondly, considerable time is spent completing instruments for the purpose of estimating the impact of treatments on HRQoL. Therefore, such efforts should result in HRQoL effects that are meaningful and interpretable, especially where HRQoL is a primary or co-primary endpoint [1]. Thirdly, some anti-cancer treatments exhibit serious side-effects, despite improvements in overall survival (OS); HRQoL is also reported to be a predictor of survival in lung cancer patients [2], the leading cause of death among cancers [3]. It would be important to understand for example, how survival differs between patients with 'poor' baseline HRQoL, compared to those with 'Good' HRQoL. Finally, HRQoL outcomes are often required for cost-effectiveness analyses and drug reimbursement [4,5]. Therefore, understanding and interpreting HRQoL data is crucial in evaluating cancer treatments.
The EORTC-QLQ-C30 (QLQ-C30) is a widely used cancer specific instrument [6]. The instrument has 30 questions from which 15 domains (sub scales) are determined, consisting of 5 'function' scales, 8 'symptom' scales, a global quality of life (QL) scale and a finance scale (FI). For QL and function domains, high scores indicate better HRQoL. For symptom domains (and FI), low scores indicate better HRQoL.
Treatment effects from the QLQ-C30 are often reported as mean differences (MDs) [7], despite scores having heavily skewed distributions with ceiling effects (many patients with scores of 0 or 100) and censored data due to progressive disease, death or failure to complete questionnaires. The interpretation of HRQoL MDs can be more complicated than survival endpoints. Consequently, alternative measures of treatment effect have been proposed.
Maringwa suggests a minimally important 'difference over time' as a measure of effect [8]. The area under the curve (AUC) can be difficult to interpret, although it is useful for reducing multiple observations to a single value [9]. However, if HRQoL is measured at only a few time points (e.g. baseline and month 12), the AUC will have limited value. Moreover, the interpretation of the effect can become tricky (e.g. for HRQoL scores of 100 at each of 0, 1 and 2 months, the AUC is no longer on the original HRQoL scale of 0 to 100).
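The scale mismatch for the AUC can be illustrated with a short calculation. The sketch below assumes the AUC is obtained with the trapezoidal rule and time measured in months; the text does not state which rule is used, and the function name is illustrative only.

```python
def auc_trapezoid(times, scores):
    """Area under the score-time curve by the trapezoidal rule (assumed rule)."""
    if len(times) != len(scores) or len(times) < 2:
        raise ValueError("need at least two (time, score) pairs of equal length")
    area = 0.0
    for i in range(1, len(times)):
        width = times[i] - times[i - 1]
        # Each interval contributes its width times the average of its two end scores.
        area += width * (scores[i] + scores[i - 1]) / 2.0
    return area


# HRQoL score of 100 at each of 0, 1 and 2 months.
auc = auc_trapezoid([0, 1, 2], [100, 100, 100])
print(auc)  # 200.0, which lies outside the 0 to 100 range of the instrument
```

Dividing the result by the length of follow-up (here 2 months) would return a time-averaged score of 100, which is one possible way to bring the summary back to the original scale.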
Categorizing scores, e.g. improvements in symptoms from 'moderate' or 'severe' (67-100 points at baseline) to 'none' or 'little' (0 to 33 points), was proposed by Langendjik [10]. Reck and Norman [11,12] suggested that 'noted' changes in HRQoL occur when a 'shift' of greater than half of the baseline standard deviation is observed. Time to HRQoL deterioration (TD) has also been suggested (Anota) [13]. However, different definitions of 'deterioration' lead to different conclusions, the median TD may not be estimable (e.g. few events) and interpretation is further complicated by non-proportional hazards (PH). Interpretation of effects with TD using HRs is, however, similar to that of ORs. Reporting a 'Trend' is also a way of describing HRQoL over time (Schaake) [14], although it is difficult to interpret (e.g. how much 'more trend' is there for experimental vs. control?).
The above measures of HRQoL effects can be difficult to interpret for patients and clinicians. The mean is often the statistic of choice to define treatment effect sizes for HRQoL endpoints in most of these measures.
One commonly reported clinically relevant effect size proposed by Osoba and King [6,15,16] is ≥10 points MD (on any domain), a value used as a benchmark by researchers to determine whether HRQoL benefits exist [7]. Some researchers interpret a 10 point improvement as a difference between treatments, while others as a 10 point change (improvement) from baseline (Hirsh) [6,17], which is not always possible. For example, if a patient scores 8 points (or 92 points) at baseline, a reduction (or increase) of 10 points is not possible. Moreover, 'important' treatment differences need not be the same for symptom as functional scales. A worsening of 5 points in a symptom scale may be more important than a 10 point improvement in a functional scale.
For HRQoL endpoints, the magnitude of an effect size is often considered to be clinically relevant if a difference of 10 points is observed, regardless of whether HRQoL is a primary or secondary outcome. Such requirements are not expected of other secondary clinical endpoints in cancer trials (e.g. time to progression (TTP)). One reason may be that secondary endpoints are not powered or there is a clinical rationale that the secondary outcome cannot be expected to yield effects similar to primary endpoints. In a similar vein, effect sizes should not be expected to be uniform across HRQoL domains for demonstrating treatment benefit because some smaller effect sizes (e.g. < 10 points) may be important. In this research we attempt to show that some small effect sizes on an MD scale might be dismissed as clinically irrelevant but remain important on a relative scale.
Little attention has been given to smaller HRQoL effects (MDs), which are often glossed over unless a 'statistically significant' p-value is reported alongside. Small MDs tend to be perceived as offering limited HRQoL benefit but can mask important improvements, particularly when data are analysed using an alternative scale (e.g. OR scale). This presents a challenge for setting thresholds for defining clinically relevant HRQoL effect sizes. Moreover, ORs can facilitate an interpretation of effects similar to hazard ratios (HR), familiar to many oncologists. Therefore, in this article, after presenting baseline characteristics, we offer effect size categories based on the OR and describe example situations of the relationship between ORs and MDs. We discuss aspects of statistical significance of small effects in the context of ORs and MDs and compare preferences between ORs and MDs from several clinicians. Finally, we compare ORs and MDs with a time to deterioration (TD) approach (TD ≥5 points) following Anota [13].

Data
HRQoL data from six randomized controlled trials (RCT) conducted by the CRUK & UCL CTC were analysed [9,18-22]. These were selected because they comprised all patient level QLQ-C30 data available in the CTC database from published RCTs in lung cancer.

Assessments
Data were collected during clinic visits and questionnaires returned by patients during follow up; QLQ-C30 was assessed at several time points including baseline, pre and post chemotherapy and at monthly intervals for at least 24 months or until disease progression.

Statistical analysis
Patient level HRQoL scores for each of the 15 domains were analysed using a repeated measures analysis [21,22] for reporting MDs and a more novel Beta Binomial (BB) model in a mixed model framework [23] for reporting ORs. For the BB model, responses were transformed to a (0,1) scale using the transformation [23] (Y − a)/(b − a), where a and b are the minimum and maximum possible scores and Y is the observed response. For example, a score of 80 is transformed as (80 − 0)/(100 − 0) = 80/100 = 0.8. Dichotomization is not required for a BB model to generate ORs. The BB model has been used in a variety of applications [23][24][25]. Its advantages over standard (linear) models in terms of statistical properties are widely reported [25,26]. The BB is also flexible because it models scores at the extreme ends of the scale (e.g. many patients scoring 0 or 100), a common feature of QLQ-C30 scores, using a zero-one inflated model [25,26]. MDs were classified similarly to those described by Cocks [7]: 'Trivial' (0-3 points), 'Small' (3-10 points), 'Modest'/'Medium' (10-15 points) and 'Large' (>15 points). Similarly, ORs were classified as 1 ± 0.05 ('Trivial'), 1 ± 0.1 ('Small'), 1 ± 0.2 ('Medium') and <0.8 or >1.2 ('Large').
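The transformation and the two classification schemes described above can be sketched as follows. This is a minimal illustration and not the analysis code used for the trials: the function names are hypothetical, and the handling of values that fall exactly on a boundary (assigned here to the lower category) is an assumption, since the categories as stated share their end points.

```python
EPS = 1e-9  # tolerance so that floating point rounding does not push e.g. OR = 1.05 over a boundary


def to_unit_scale(y, a=0.0, b=100.0):
    """Rescale an observed score y from the range [a, b] via (y - a) / (b - a)."""
    if b <= a:
        raise ValueError("b must be greater than a")
    if not a <= y <= b:
        raise ValueError("score lies outside the possible range")
    return (y - a) / (b - a)


def classify_md(md):
    """Effect size category for a mean difference in points (sign ignored)."""
    size = abs(md)
    if size <= 3 + EPS:
        return "Trivial"
    if size <= 10 + EPS:
        return "Small"
    if size <= 15 + EPS:
        return "Medium"
    return "Large"


def classify_or(odds_ratio):
    """Effect size category for an odds ratio, based on its distance from 1."""
    distance = abs(odds_ratio - 1.0)
    if distance <= 0.05 + EPS:
        return "Trivial"
    if distance <= 0.10 + EPS:
        return "Small"
    if distance <= 0.20 + EPS:
        return "Medium"
    return "Large"


print(to_unit_scale(80))                      # 0.8
print(classify_md(2.6), classify_or(1.17))    # Trivial Medium
print(classify_md(15.1), classify_or(1.12))   # Large Medium
```

The two printed pairs use MD and OR values quoted in the text for TOPICAL; they show how the same effect can land in different categories on the two scales.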
Time to Deterioration (TD) was determined using the first time where scores reduced/increased by ≥ 5 points. Patients without deterioration were censored. A Kaplan-Meier and Cox proportional hazards (PH) analysis was carried out.
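The TD definition above can be sketched for a single patient as follows. The function and argument names are illustrative, and censoring at the last available assessment is an assumption (the text states only that patients without deterioration were censored). The resulting (time, event) pairs would then be passed to a Kaplan-Meier or Cox PH routine.

```python
def time_to_deterioration(baseline, follow_up, higher_is_better=True, threshold=5.0):
    """Return (time, event) for one patient.

    follow_up is a list of (time, score) pairs sorted by time.
    event is True at the first visit where the score has worsened from
    baseline by at least `threshold` points; otherwise the patient is
    censored (event False) at the last visit (assumed censoring rule).
    """
    if not follow_up:
        return 0.0, False  # no post-baseline data: censored at baseline
    for time, score in follow_up:
        change = score - baseline
        # For function and QL scales a fall is a worsening;
        # for symptom scales (and FI) a rise is a worsening.
        worsening = -change if higher_is_better else change
        if worsening >= threshold:
            return time, True
    return follow_up[-1][0], False


# Function scale (e.g. CF): deterioration first seen at month 2.
print(time_to_deterioration(83.3, [(1, 83.3), (2, 66.7), (3, 50.0)]))        # (2, True)
# Symptom scale: an increase counts as deterioration.
print(time_to_deterioration(0.0, [(1, 0.0), (2, 33.3)], higher_is_better=False))  # (2, True)
# No deterioration: censored at the last visit.
print(time_to_deterioration(50.0, [(1, 50.0), (2, 66.7)]))                   # (2, False)
```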
A pilot survey was carried out to obtain preliminary evidence of whether clinicians and/or patients preferred ORs or MDs for expressing treatment effects. Three items, physical function (PF), Pain (PA) and cognitive function (CF), were randomly selected from the 15 domains and presented to each of five clinicians and their patients (where possible). Patients/clinicians were asked to state preferences for ORs or MDs (Additional file 1). Lower scores express a preference for ORs and higher scores a preference for MDs; scores close to 5 express indifference.

Distribution of QLQ-C30
Most (>85 %) QLQ-C30 responses were very skewed (Fig. 1 & Additional file 2: Figure S1). For TOPICAL, 14/15 (93 %) of scores had alpha or beta values (special values associated with a BB distribution relating to the mean and variance) <1; Kolmogorov-Smirnov tests rejected normality (p-value <0.001). Therefore, the mean, and consequently the MD, is not considered a suitable metric for reporting HRQoL benefit from these scores. Statistical analysis should be conducted according to the underlying (true) distribution of the data. Across the six trials, QLQ-C30 scores were not normally distributed in most (≥85 %) cases. Four examples are provided to understand the relationship between ORs and MDs.
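To illustrate how alpha and beta relate to the mean and variance, the sketch below uses the standard method of moments estimates for a beta distribution applied to scores rescaled to the unit interval. This is an assumption made for illustration: the text does not state how alpha and beta were estimated, and the data shown are invented. Values of alpha and beta below 1 correspond to a distribution piled up at the ends of the scale.

```python
from statistics import mean, variance


def beta_moments(scores01):
    """Method of moments estimates (alpha, beta) for scores on the unit interval."""
    m = mean(scores01)
    v = variance(scores01)  # sample variance
    if v <= 0 or v >= m * (1 - m):
        raise ValueError("moment estimates are undefined for these data")
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common


# Hypothetical rescaled scores with many responses at the ends of the scale.
clustered = [0, 0, 0, 0, 0, 1 / 3, 1 / 3, 2 / 3, 1, 1]
alpha_c, beta_c = beta_moments(clustered)
print(alpha_c < 1 and beta_c < 1)  # True: both shape values are below 1

# Hypothetical mound-shaped scores centred on the middle of the scale.
centred = [0.40, 0.45, 0.50, 0.50, 0.50, 0.55, 0.60]
alpha_m, beta_m = beta_moments(centred)
print(alpha_m > 1 and beta_m > 1)  # True: both shape values are above 1
```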

Example 1: when MDs are small but ORs are large
In the TOPICAL trial, the MD for constipation (CO) symptoms was 2.6 points (p = 0.1314), while the corresponding OR was 1.17 (p < 0.0001). The choice of interpretation is 'a worsening in CO by a mean difference of 2.6 points with erlotinib compared to placebo' vs 'patients are 17 % more likely to have worsening CO symptoms with erlotinib compared to placebo'. The MD scale gives the impression that CO symptoms worsen by a 'Trivial' amount of 2.6 points (Table 2). This tends to occur when responses are skewed (Fig. 1 and Additional file 2: Figures S1, S2 and S3). In the presence of heavily skewed data, the OR is a suitable choice for presenting HRQoL effects from the QLQ-C30.
In the TOPICAL trial, patients had worse diarrhoea (DI) with erlotinib: an MD of 15.1 points ('Large' effect; p < 0.001) with a corresponding OR of 1.12 (p = 0.0505). The DI scores were considerably skewed (Fig. 1), which might explain why the larger MD corresponded with only 12 % ('Medium' effect) higher odds of diarrhoea with erlotinib compared to placebo (OR = 1.12). The OR appears to have modified the 'Large' (statistically significant) effect size to a smaller effect size of borderline significance.
In study 10, role function (RF) improved by an MD of about 13 points (Table 2) with the experimental treatment, a 'Medium' effect. Using an OR, this was an improvement in RF of almost 30 % (OR = 1.29). On examination of Additional file 2: Figure S1, responses fell into only three distinct categories at 0, 50 and 100 and scores were not normally distributed, making use of the MD questionable. The OR approach has moved a 'Medium' effect to a 'Large' effect.

Example 4: when MDs and ORs agree on the direction of effects
In the TOPICAL trial, the MDs for PF and CF (3.2 and 3.6 points; p-values of 0.0017 and 0.0007 respectively) had corresponding ORs of 1.10 and 1.14 (p-values of 0.0168 and 0.0107). Both MDs and ORs are in agreement that PF and CF are improving with the experimental treatment. Hence, on average, patients had 10 % and 14 % higher odds of improved PF and CF respectively on erlotinib compared with placebo (Table 2).
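The step from an OR to a statement such as '10 % higher odds' is simple arithmetic, sketched below. The second function is an additional illustration that assumes the OR acts on the odds of the mean rescaled score (a logit link); the control-arm value of 0.50 is hypothetical and not taken from the trials.

```python
def percent_change_in_odds(odds_ratio):
    """Percentage change in odds implied by an odds ratio."""
    return (odds_ratio - 1.0) * 100.0


def apply_odds_ratio(p_control, odds_ratio):
    """Value in the experimental arm implied by a control-arm value on the
    unit interval and an OR (assumes the OR acts on the odds p / (1 - p))."""
    if not 0 < p_control < 1:
        raise ValueError("p_control must lie strictly between 0 and 1")
    odds = p_control / (1 - p_control) * odds_ratio
    return odds / (1 + odds)


# ORs reported for PF and CF in TOPICAL.
for name, odds_ratio in [("PF", 1.10), ("CF", 1.14)]:
    print(name, round(percent_change_in_odds(odds_ratio), 1))

# Hypothetical control-arm mean of 0.50 on the rescaled score with OR = 1.10.
print(round(apply_odds_ratio(0.50, 1.10), 3))
```

Under these assumptions, a 10 % increase in odds from a control value of 0.50 corresponds to roughly 0.52 on the rescaled score, so the change in odds and the change on the score scale are not the same quantity.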
The above are a limited number of examples reflecting the challenges associated with defining thresholds of HRQoL differences with the MD. Another issue that can complicate interpretation is when small effects become difficult to interpret and justification is made through statistical significance. The statistical significance of small HRQoL effects is often reported, but the clinical relevance is not always discussed. Table 3 shows that 28/90 (31 %) of 'Small' or 'Trivial' effects based on MDs were statistically significant compared with 7/90 (8 %) for ORs.

Effect size classification for ORs and MDs
Estimates for OR effect size categories similar to those described earlier [7] were determined using cumulative frequency plots of MDs and ORs (Fig. 2 and Additional file 2: Tables S2, S3). Table S4 shows that 12/59 (20 %) of 'Trivial' effects based on MDs might be clinically important because on an OR scale these were 'Medium' or 'Large'. Consequently, some clinically important effects may be missed using MDs. Figure 2 shows that the median HRQoL effect size is 2.5 points (half of effect sizes are ≤2.5), roughly equivalent to a 7 % change in HRQoL on the OR scale; similarly, for the lower and upper quartiles, 25 % of effect sizes are ≤1 point (about a 4 % change on the OR scale) and 75 % of effect sizes are ≤3.6 points (ORs of about 1.10).
Secondly, for effect sizes of 1, 3, 5, 10 and >15 points, the equivalent ORs are about 1.02, 1.07, 1.13, 1.25 and 1.37 respectively. The threshold for a large effect size of >15 points is challenging: patients would be expected to improve/worsen by almost 40 %. This may be a difficult target for some cancer drugs to achieve when compared with each other.
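As a rough reading aid, the MD and OR pairs quoted above can be joined by straight lines to obtain an approximate OR for an intermediate MD. This is only an illustrative sketch: linear interpolation is an assumption, the last pair is reported for effects of more than 15 points but is treated here as sitting at exactly 15, and the function name is hypothetical.

```python
# (MD in points, approximate equivalent OR) pairs quoted in the text.
ANCHORS = [(1.0, 1.02), (3.0, 1.07), (5.0, 1.13), (10.0, 1.25), (15.0, 1.37)]


def approx_or_from_md(md_points):
    """Approximate OR for an MD by linear interpolation between the quoted pairs."""
    size = abs(md_points)
    if size < ANCHORS[0][0] or size > ANCHORS[-1][0]:
        raise ValueError("MD lies outside the range covered by the quoted pairs")
    for (x0, y0), (x1, y1) in zip(ANCHORS, ANCHORS[1:]):
        if x0 <= size <= x1:
            return y0 + (y1 - y0) * (size - x0) / (x1 - x0)


print(round(approx_or_from_md(4), 2))    # 1.1, midway between the pairs for 3 and 5 points
print(round(approx_or_from_md(7.5), 2))  # 1.19, midway between the pairs for 5 and 10 points
```

Individual domains can deviate markedly from such an average relationship, as the CO and DI results from TOPICAL show, so the interpolation should not be used to convert a reported MD into an OR.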

Summary of preference scores from survey
Five lung cancer clinicians completed a pilot (Additional file 1) survey (London UCH, Liverpool, Leeds, Chester and Imperial College London). At this time no patient responses were available. Hence a total of 15 scores from 5 clinicians who expressed preferences for either ORs or MDs for each of PF, Pain and CF were analysed. Stronger preferences were expressed for ORs over MDs: mean scores of 2.4, 3.1 and 2.8 for PF, Pain and CF respectively. Hence, initial evidence suggests clinician preference was greater for ORs than MDs. The results would need to be confirmed in a larger sample.

Comparison with time to deterioration
The time taken for a patient to deteriorate from baseline by ≥5 points could not be determined for about 13 % of HRQoL domain scores because of too few events (i.e. patients did not show a deterioration of ≥5 points). Moreover, a TD of ≥5 points was not always possible because scores were clustered at values such as 16.7, 33.3 and 66.7 (e.g. as in CF scores for TOPICAL, Fig. 1). No patient experienced (or could experience) a TD of exactly 5, 10 or 15 points (the possible values of the QLQ-C30 for CF were only 0, 16.7, 33.3, 50.0, 66.7, 83.3 and 100). The median TD (Additional file 2: Table S5) was not calculable for some symptom   [28] uses a propensity score (logistic regression) approach to report odds of HRQoL deterioration; Kurita et al. (2015) [29] use ORs with the QLQ-C30 in renally impaired patients. In these analyses scores were dichotomized in order to generate the ORs. In our analysis, no such dichotomization (and consequent loss of information) was required due to the flexibility of the Beta-Binomial regression approach.
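The clustering of CF scores noted above follows from how a scale score is built from a small number of items. The sketch below enumerates the possible scores for a function scale, assuming the usual QLQ-C30 linear transformation (score = (1 − (RS − 1)/range) × 100, where RS is the mean of the item responses) and assuming CF consists of two items scored 1 to 4; these details are taken as assumptions here rather than from the text.

```python
from fractions import Fraction
from itertools import product


def possible_scale_scores(n_items, levels=(1, 2, 3, 4)):
    """All attainable 0-100 scores for a function scale with n_items items."""
    item_range = max(levels) - min(levels)
    scores = set()
    for responses in product(levels, repeat=n_items):
        raw = Fraction(sum(responses), n_items)  # mean item response (RS)
        scores.add((1 - (raw - 1) / item_range) * 100)
    return sorted(scores)


cf_scores = possible_scale_scores(2)
print([round(float(s), 1) for s in cf_scores])
# [0.0, 16.7, 33.3, 50.0, 66.7, 83.3, 100.0]

# Every change a patient could show between two assessments.
possible_changes = {b - a for a in cf_scores for b in cf_scores if b > a}
smallest_step = min(possible_changes)
print(round(float(smallest_step), 1))  # 16.7
print(any(c in possible_changes for c in (5, 10, 15)))  # False
```

Under these assumptions the smallest change a patient can show on CF is about 16.7 points, so a deterioration of ≥5 points is in practice a deterioration of at least one 16.7 point step.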
Patient and clinician understanding of MDs has not previously been shown to be concordant [7] and this may in part be due to how HRQoL benefits are expressed to patients. Clinicians and patients may find it easier to agree on relative quantities than absolute differences. The pilot survey results may support the use of relative quantities. The choice between interpretations such as: "your diarrhoea will be worse with the new treatment by 15 points, on average" and: "the likelihood of diarrhoea with the new treatment is significantly higher by about 11 % compared to placebo", is a matter of preference, but the latter may be appealing for some. Aligning understanding of smaller effect sizes is increasingly important with the emergence of novel treatments for lung cancer being compared with each other (and not just placebo).
There are several advantages and disadvantages of both MDs and ORs. First, ORs evaluate relative (instead of absolute) treatment effects. For objective endpoints, absolute differences (e.g. 4 vs 3 months survival) may provide easier interpretations of treatment benefits (although the effects are median and not mean differences in cancer trials). However, HRQoL endpoints are self-reported and even the most experienced clinician has difficulty interpreting them. For such endpoints, a relative scale may be more useful. If treatment effects from primary endpoints are judged by relative quantities (e.g. hazard ratios), there are no reasons why treatment effects from HRQoL endpoints should not also be assessed this way. Both survival time and HRQoL share some similar distributional properties (e.g. skewed or censored). There is some concern that effects near the boundaries (floor/ceiling) will be overvalued with ORs compared to effects around the middle. However, such concerns can be addressed through the use of zero-one inflated models [25] which model the over/under dispersion.
Secondly, the OR model assumes a fixed odds ratio over time (i.e. the effect is constant over time), which may not hold in a longitudinal QoL setting. Reliable interpretation of MDs also depends on an absence of treatment by time interactions (i.e. ORs and MDs are not dependent on specific time points). Thirdly, statistical models for MDs will provide predicted patient level HRQoL responses. For example, a patient taking experimental treatment with a certain demographic profile might yield a predicted PF score (e.g. 5 points). Similarly, a model for estimating ORs can be used to predict the probability of achieving a specific PF score for a given patient (or group of patients) on the experimental treatment (response curves are advocated by the FDA for patient reported outcomes) [30].
The suggested effect size of >10 units on the QLQ-C30 was proposed almost two decades ago when fewer treatment comparators were available [15]. Few (about 2 %) MDs were >10 points and this research confirms earlier conclusions that small changes in HRQoL can be important (Cella, 2002) [7,31]. Importantly, the implications of skewed distributions were not factored in when the magnitudes of effect sizes were defined in earlier research.
There are several strengths and limitations of this analysis. First, a large sample size is used from clinical trials in similar groups of patients. Secondly, established criteria for classifying effect sizes were used for MDs [7]. Third, the BB model is a robust approach to analysing skewed data with ceiling effects, without arbitrary dichotomisation of responses. Finally, interpreting ORs is similar to that of HRs which many oncologists are familiar with.
Although the BB approach offers an alternative way to analyse and interpret HRQoL effects, it is more complex. The complexity is outweighed by the benefits of reliable and potentially easier to interpret estimates of effect. A further limitation is that the analysis has been restricted to lung cancer patients, although the approach can be applied to other tumour types and disease areas. The classifications suggested for ORs in this analysis are arbitrary (even if based on the observed data) and different results can occur with alternative categories. Definition of effect sizes may require some threshold to be set, which may necessarily be subjective. However, a starting point in our view is that the most appropriate metric is used to present HRQoL effects in cancer patients, an area for further research. The initial survey results should also be confirmed in a larger sample.
Treatment effects for HRQoL from the QLQ-C30 should be reported using relative quantities such as ORs, which appear to be clinically intuitive and easier to interpret, and which are obtained from analyses that model the skewed distribution of responses.