
A Rasch and factor analysis of the Functional Assessment of Cancer Therapy-General (FACT-G)



Although the Functional Assessment of Cancer Therapy – General questionnaire (FACT-G) has been validated, few studies have explored the factor structure of the instrument, in particular using non-sample-dependent measurement techniques such as Rasch models. Furthermore, few studies have explored the relationship between item fit to the Rasch model and clinical utility. The aim of this study was to investigate the dimensionality and measurement properties of the FACT-G with Rasch models and factor analysis.


A factor analysis and a Rasch analysis (Partial Credit Model) were carried out on the FACT-G completed by a heterogeneous sample of cancer patients (n = 465). For the Rasch analysis, item fit (infit mean squares ≥ 1.30), dimensionality and item invariance were assessed. The impact of removing misfitting items on the clinical utility of the subscales and the FACT-G total scale was also assessed.


The factor analysis demonstrated a four-factor structure of the FACT-G which broadly corresponded to the four subscales of the instrument. Internal consistency for these four scales was very good (Cronbach's alpha 0.72 – 0.85). The Rasch analysis demonstrated that each of the subscales and the FACT-G total scale had misfitting items (infit mean squares ≥ 1.30). All of these scales, with the exception of the Social & Family Well-being scale (SFWB), were unidimensional. When misfitting items were removed, the effect sizes and the clinical utility of the instrument were maintained for the subscales and the total FACT-G score.


The results of the traditional factor analysis and Rasch analysis of the FACT-G broadly agreed. Caution should be exercised when utilising the Social & Family Well-being scale and further work is required to determine whether this scale is best represented by two factors. Additionally, removing misfitting items from scales should be performed alongside an assessment of the impact on clinical utility.


Quality of life assessment is increasingly being used in routine clinical practice in oncology [1, 2]. Furthermore, the assessment process itself has now been shown to improve the clinical consultation and patient well-being [3].

The Functional Assessment of Cancer Therapy – General (FACT-G) is a widely used quality of life instrument for cancer patients. The questionnaire was originally developed using semi-structured interviews of patients and oncology professionals to generate instrument items [4]. A factor analysis of the original 28-item version of the instrument revealed five factors corresponding to: Physical Well-being, Social & Family Well-being, Emotional Well-being, Functional Well-being and relationship with doctor [4]. Psychometric analyses of the instrument demonstrated that Cronbach's alpha was high for the total scale (0.89), indicating high reliability. Similarly, test-retest reliability coefficients ranged from 0.82 (Emotional Well-being and Relationship with doctor) to 0.88 (Physical Well-being).

There has been extensive development and validation of a number of site- and disease-specific modules for the FACT, including, for instance, modules for anaemia and fatigue, and for colorectal, breast and lung cancer. Furthermore, some additional validation has also been carried out on the FACT-G itself, including, for instance, evaluations of minimally important differences [5] and identification of differential item functioning [6]. Winstead-Fry and Schultz [7] conducted a validation study of the FACT-G (version 2) on a sample of 344 cancer patients living in rural (i.e. non-metropolitan) areas of the USA. The factor analysis of the scores – transformed to log-odds or logits – revealed the same five subscales. Furthermore, Cronbach's alpha levels were within the same range as reported by Cella et al. [4]. Kemmler et al. [8] investigated the structure of the FACT-G (version 2) using multidimensional scaling. This analysis revealed that most subscales, but particularly Physical and Social & Family Well-being, as well as the Relationship with doctors scale, demonstrated high levels of consistency, with items from each subscale clustering together. Items from the Functional Well-being scale showed a higher degree of scatter, and there was some overlap between Emotional and Functional Well-being.

More recently, measurement models such as Rasch models [9] have been used to explore the factor structure of the FACT-G. Rasch models allow estimates of item location ("difficulty") and person measures ("ability") to be made along postulated latent traits, such as pain, or physical and mental health. The strength of these models is that the parameter estimates are independent of the sample and questionnaires used. A recent study by Dapueto et al. [10] applied Rasch analyses to version 4 of the FACT-G using responses from a Spanish-speaking (Uruguayan) cancer patient population. It demonstrated that, with the exception of one item from the Social & Family Well-being scale ("I am satisfied with my sex life") and one item from the Emotional Well-being scale ("I am satisfied with how I am coping with my illness"), there were four unidimensional structures corresponding to the domains of the FACT-G (the relationship with doctor scale has been removed from more recent versions of the instrument). However, no Rasch analysis of the overall FACT-G total scale was performed, and in particular, the authors did not use Rasch models to explore the dimensionality of the FACT-G.

The concept of dimensionality and the related issue of item fit (i.e. whether items fit the unidimensional Rasch construct) are important considerations in Rasch models, since only scales whose items fit the model give rise to interval-based measures. This in turn allows meaningful interpretation of changes in scores [11, 12]. Removal of misfitting items is advocated and often undertaken to improve the measurement properties of instruments [13]. However, little is known about the relationship between (mis)fit and clinical utility. Studies have suggested that misfitting items can be removed without affecting the measurement properties of instruments [11, 14, 15], but few studies have explored the impact on clinical utility. The few studies that have done so found that, where misfitting items were removed and the original study analysis repeated with the reduced questionnaire, the overall results and conclusions of the studies did not change (i.e. no significant impact on clinical utility was found) [16].

The aim of this study was to investigate the dimensionality or factor structure and measurement properties of the FACT-G using both traditional psychometric analyses, such as Factor Analysis, and item and sample independent models, such as Rasch Models. Furthermore, a Rasch analysis of the FACT-G total was also performed to assess whether the entire scale formed a unidimensional construct. The relationship between item (mis)fit and clinical utility was explored by assessing the impact of the removal of misfitting items from the instrument on the ability of the scales to detect differences in scores between different patient groups in a clinical trial.



The patient data used for the factor and Rasch analyses of the FACT-G were collated from two studies, one published [3] and one unpublished, carried out by the Cancer Research UK Psychosocial and Clinical Practice Research Group (St. James's University Hospital, Leeds). In the first study, 265 patients completed the paper version of the FACT-G as an outcome measure in a randomised trial investigating the effects of using regular QOL measurement in oncology practice. Patients completed the FACT-G on four occasions: at baseline, after three outpatient consultations, and at 4 and 6 months.

In the second study, one group of patients completed an electronic version of the FACT-G on a standalone computer with a touchscreen monitor (n = 200). The aim of the study was a comparison of a number of quality of life instruments.

The studies received ethical approval from the local ethics committee of the Leeds Teaching Hospitals NHS Trust (UK).


The FACT-G version 4 consists of four subscales, Physical Well-being (PWB), Social & Family Well-being (SFWB), Emotional Well-being (EWB), and Functional Well-being (FWB). These are rated on a five-point Likert scale (i.e. "Not at all", "A little bit", "Somewhat", "Quite a bit", "Very much"). The scale scores are derived by summing the raw scores, which range from 0 to 28 (or 0 to 24 for Emotional Well-Being). Scores from the Physical and Emotional Well-being scales (with the exception of one item) are reversed. A total score is derived by summing the scale scores from all four subscales (range 0 – 108). Higher subscale scores indicate better health, functioning, or well-being. The timescale for the FACT-G is the past 7 days. Missing items were treated according to the guidelines of the questionnaire developers, which involves prorating scores, i.e. calculating the mean for completed items for each subscale containing missing data (where there is a 50% or greater response) and substituting this for the missing data.
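The prorating rule described above can be sketched as a small function (a hypothetical helper written for illustration, not part of any official FACT-G scoring software): when at least half of a subscale's items are answered, the sum of the answered items is scaled up to the full item count, which is equivalent to substituting the mean of the answered items for each missing one.

```python
def prorate_subscale(item_scores, n_items):
    """Prorate a subscale score when some items are missing.

    item_scores: list of the answered item scores (0-4), missing items omitted.
    n_items: number of items the subscale contains in full.
    If at least 50% of the items are answered, the sum of the answered items
    is scaled up to the full item count (equivalent to substituting the mean
    of the answered items for each missing one); otherwise returns None.
    """
    n_answered = len(item_scores)
    if n_answered * 2 < n_items:
        return None  # too much missing data to prorate
    return sum(item_scores) * n_items / n_answered
```

For example, a seven-item subscale with six answered items summing to 18 prorates to 18 × 7 / 6 = 21.0.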

Statistical methodology

Traditional psychometrics

Reliability and factor analysis

In addition to means and standard deviations of the scale scores, the internal consistency of each domain was assessed using Cronbach's alpha. A principal components analysis was performed on the raw scores, and the factor structure was rotated using orthogonal rotations (varimax). Only factor loadings above 0.50 were considered indicative of an item loading onto a factor.
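As an illustration of the internal-consistency statistic used here, Cronbach's alpha can be computed directly from an item-score matrix. This is a minimal sketch using the standard formula; the function and variable names are our own.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an (n_persons, n_items) matrix of item scores.

    alpha = k/(k-1) * (1 - sum of item variances / variance of total score),
    where k is the number of items. Sample variances (ddof=1) are used.
    """
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()   # sum of per-item variances
    total_var = scores.sum(axis=1).var(ddof=1)     # variance of summed scores
    return k / (k - 1) * (1 - item_vars / total_var)
```

Perfectly covarying items yield alpha = 1.0; alpha falls as the items share less variance.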

Rasch analysis

Rasch models [9] are latent trait models which model a probabilistic relationship between the level of latent trait (commonly referred to as person "ability" or "measure") and the items used for measurement (item "difficulty" or "location"). Both person ability and item location (estimated in terms of log-odds or "logits") are located along the same continuum. The estimation procedure provides person ability estimates which are independent of the items employed in the assessment, and conversely estimates item locations independently from the sample of test users (or patients) employed.
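To make the model concrete, the category response probabilities of the Partial Credit Model used below can be written in a few lines. This is a sketch of the standard PCM formula (Masters); the threshold values in the usage example are illustrative, not FACT-G estimates.

```python
import math

def pcm_category_probs(theta, thresholds):
    """Category probabilities for one polytomous item under the Partial Credit Model.

    theta: person measure in logits.
    thresholds: step difficulties delta_1..delta_m for the item, in logits.
    Returns P(category 0), ..., P(category m): each numerator is
    exp(sum_{j<=k} (theta - delta_j)), with the empty sum (= 0) for category 0,
    normalised so the probabilities sum to 1.
    """
    numerators = [1.0]  # exp(0) for category 0
    cum = 0.0
    for delta in thresholds:
        cum += theta - delta
        numerators.append(math.exp(cum))
    total = sum(numerators)
    return [n / total for n in numerators]
```

For a person located exactly at an item's single step difficulty, the two categories are equally likely (0.5 each), which is the defining property of a step threshold.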

The data were analysed with Winsteps software [17] using the Partial Credit Model for polytomous data [18]. Item locations and person measures were derived for each of the four FACT-G scales.

Three important criteria for Rasch models were investigated, namely, unidimensionality, item fit and item invariance.


Unidimensionality concerns whether the data form a single factor [19] and can be used to assess whether a single latent trait explains all the variance in the data. Unidimensionality of each scale was evaluated with principal components analyses (PCA) of the residuals once the initial latent trait (i.e. the "Rasch" factor) had been extracted [17]. The following criteria were used to determine whether additional factors were present in the residuals: 1) a cutoff of 60% of the variance explained by the Rasch factor; and 2) eigenvalues smaller than 3 and a percentage of variance explained by the first contrast of less than 5% [17]. However, recent studies have demonstrated that these measures might not be sufficient to determine multidimensionality [20].

Therefore, in addition to these criteria, a method recommended by Smith [21] was employed to identify any potential multidimensionality. Item parameters for misfitting items were estimated with the entire scale, as well as independently for just those misfitting items. These two estimates for each misfitting item were then subtracted from each other and an average, or shift constant [21], calculated. Person measures were calculated for the entire scale (including misfitting items), as well as from the misfitting items alone. The latter were then weighted using the shift constant (added to the person measures estimated from the misfitting items alone) and independent t-tests performed for each pair of person measures. The percentage of tests falling outside the 95% confidence interval, ± 1.96, may then be evaluated. Since within the Rasch model person measures should agree, within a certain degree of error, irrespective of the subset of items used in the estimation process, any significant number of tests outside this interval indicates the presence of multidimensionality.
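The per-person comparison at the heart of this check can be sketched as follows. This is an illustrative implementation of the t-test step described above; the function and variable names are our own, and the person measures, standard errors and shift constant would come from the Rasch software.

```python
import math

def proportion_outside(theta_full, se_full, theta_sub, se_sub, shift, z_crit=1.96):
    """Proportion of persons whose two measure estimates disagree significantly.

    theta_full/se_full: person measures and standard errors from the full scale.
    theta_sub/se_sub: person measures from the suspect (misfitting) item subset;
    the shift constant is added to place them on the same frame of reference.
    Each standardised difference is compared against +/- z_crit; a proportion
    well above 5% suggests multidimensionality.
    """
    n_outside = 0
    for tf, sf, ts, ss in zip(theta_full, se_full, theta_sub, se_sub):
        t = (tf - (ts + shift)) / math.sqrt(sf ** 2 + ss ** 2)
        if abs(t) > z_crit:
            n_outside += 1
    return n_outside / len(theta_full)
```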

Item fit and location

Item fit to the Rasch model is commonly measured by mean-square residual fit statistics [19]. Two commonly employed statistics are the weighted mean square, or infit statistic, and the unweighted mean square, or outfit statistic. The outfit statistic is sensitive to anomalous outliers for either person or item parameters, whereas the infit statistic is sensitive to residuals from persons located close to the item. Fit statistics for items have an expected value of 1.0 and can range from 0 to infinity. Deviations in excess of the expected value can be interpreted as 'noise' or lack of fit between the items and the model, whereas values significantly lower than the expected value can be interpreted as item redundancy or overlap.
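Both statistics are simple functions of the residuals between observed and model-expected scores. A minimal sketch is given below, assuming the expected scores and model variances for each person on the item are supplied (in practice these come from the Rasch software); the function name is our own.

```python
import numpy as np

def fit_mean_squares(observed, expected, variance):
    """Infit and outfit mean squares for one item across persons.

    observed: observed item scores; expected: model-expected scores;
    variance: model variance of the score for each person on this item.
    Outfit (unweighted) is the mean squared standardised residual; infit
    (information-weighted) weights each squared residual by its variance,
    damping the influence of off-target (outlying) persons.
    """
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    variance = np.asarray(variance, dtype=float)
    residual = observed - expected
    z_sq = residual ** 2 / variance          # squared standardised residuals
    outfit = z_sq.mean()
    infit = (residual ** 2).sum() / variance.sum()
    return infit, outfit
```

When responses match the model's expectations exactly, both statistics are 0; when every squared residual equals its model variance, both equal the expected value of 1.0.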

Item fit was assessed for the four subscales (Physical, Social & Family, Emotional, and Functional Well-being), as well as the FACT-G total. Fit was evaluated against a range of 0.70 – 1.30 for infit (weighted) mean squares [22], and against outfit (unweighted) mean squares greater than 1.40. Any misfitting items (fit > 1.30) were removed from the individual scales and the Rasch analysis re-run. This iterative process was continued until no further misfit was observed. Item locations were determined in the final iteration for those items falling within the fit range (< 1.30) once misfitting items had been removed. The fit statistics and item locations were also recorded for the misfitting items.

Item invariance (Differential Item Functioning)

Item invariance refers to the requirement that the estimated item location parameters should not depend on the sample used to derive the estimates. Rasch models require the item estimation to be independent of the subgroups of individuals completing the questionnaires. In other words, item parameters should be invariant across populations [23]. Items not demonstrating invariance are commonly referred to as exhibiting differential item functioning (DIF) or item bias. Identification of DIF allows evaluation of whether items function equivalently across important categories, such as diagnosis or extent of disease. Item invariance can be assessed by producing independent estimates of item location using subgroups of individuals (e.g. groups defined by gender, age group, diagnosis etc.).

As two different samples were used for the Rasch and factor analyses, and as the data were derived through different modes of administration, differential item functioning analysis was used to determine whether item invariance held between the item parameters estimated from the two samples. Item invariance was assessed by holding item location parameters constant while person measures were estimated separately for each sample [17], and then evaluated using a paired t-test. Item invariance was evaluated using a contrast between item difficulties of 0.5 logits or greater, and a Bonferroni adjustment was applied to control for any effects due to multiple testing [22]. Therefore, contrasts between parameters were evaluated at a level of significance (α) of 0.01 (t > 2.56).
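The joint criterion above (a contrast of at least 0.5 logits that is also significant at the Bonferroni-adjusted level) can be sketched as a small helper. This is our own illustrative function; the item difficulties and standard errors would come from the two separate calibrations.

```python
def dif_flag(d1, se1, d2, se2, min_contrast=0.5, t_crit=2.56):
    """Flag an item for differential item functioning between two groups.

    d1, d2: item difficulty (logits) estimated separately in each group;
    se1, se2: their standard errors. The item is flagged only if the
    absolute contrast reaches min_contrast logits AND its t statistic
    exceeds the Bonferroni-adjusted critical value (t > 2.56, alpha = 0.01).
    """
    contrast = abs(d1 - d2)
    t = contrast / (se1 ** 2 + se2 ** 2) ** 0.5
    return contrast >= min_contrast and t > t_crit
```

Requiring both conditions guards against flagging contrasts that are statistically significant but too small (under 0.5 logits) to matter substantively.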

As the outcome of this analysis would determine whether the data from the two samples could be pooled for the Factor and Rasch analyses, the differential item functioning analysis was carried out first.

Analysis of clinical significance of removal of misfitting items

The data used for this analysis was derived from a randomised control trial exploring the impact of measuring and using health-related quality of life (HRQoL) on doctor-patient communication and patient well-being [3]. Patients were randomly allocated to one of three arms depending on whether they regularly completed the intervention HRQoL questionnaire on a touchscreen computer prior to each clinic visit, whether these results were fed back to their physicians or receiving standard care [3]. Patients completed the FACT-G as an outcome measure at home at four time points.

The data used in this analysis are derived from the first two time points: baseline and after 3 outpatient consultations (approximately 2–3 months after baseline completion). The 3 study arms were compared in terms of changes in FACT-G over time (scores at time 1 minus baseline scores) using univariate analyses of variance and regression analysis. The dependent variable was the change in FACT-G domains and total scores. Study arm, performance status, gender and diagnosis were entered as fixed factors, and baseline FACT-G score per domain, age and time on study as covariates.

New scale scores were derived for each FACT-G subscale (PWB, SFWB, EWB & FWB), as well as the total FACT-G score by removing any misfitting items identified in the Rasch analysis and each subscale was rescored. The above analyses were carried out for the original and rescored FACT-G subscale scores and total score.

In addition to this analysis, the impact of removing misfitting items was also assessed through its influence on effect size [24]. Effect sizes were calculated by subtracting scores at time 1 from the baseline score and dividing this by the standard deviation of the baseline score for each subscale and rescored subscale, as well as the FACT-G total.



A total of 465 patients completed the FACT-G; however, demographic details were available for only 461 patients: 323 females (mean age 55.7 years, s.d. 12.4) and 138 males (mean age 60.8 years, s.d. 13.0). Fewer than 5% of patients attending the outpatient clinics where the samples were collected were from ethnic minority communities. Table 1 gives a breakdown of diagnoses and scores from the FACT-G.

Table 1 Diagnosis by gender and age and mean scores (standard deviations) of the FACT-G domains

Item invariance (Differential Item Functioning)

No significant item invariance or bias was found for the items from the two samples for the subscales and FACT-G total scale. All contrasts for the subscales and FACT-G total score fell below the 0.5 logit criterion (p > 0.01). Therefore, there were no significant differences in the FACT-G item parameters derived from the two forms of administration, and the combined data from the touchscreen administration and baseline measurement in the randomised trial were used for the Factor and Rasch analysis.

Reliability and factor analysis

Internal reliability of the four subscales was very good, with Cronbach's alpha values of 0.81 and 0.85 for the Physical and Functional Well-being scales, respectively, and 0.78 and 0.72 for the Social & Family and Emotional Well-being scales, respectively.

The rotated component matrix is shown in Table 2. A total of four factors with eigenvalues greater than 1.2 were extracted. These collectively represented almost 55% of the variance, with 27%, 14%, 9% and 5% of the variance explained by each factor, respectively. The factors identified corresponded largely to the FACT-G subscales. Factor 1 corresponded to the Physical Well-being (PWB) scale, with four items from the Functional Well-being (FWB) scale also loading (not significantly) onto this factor ("ability to work", "sleeping well", "enjoy things usually done for fun" and "contentment"). Factor 2 corresponded to most of the Functional Well-being scale (except for item FWB5, "Sleeping well"), as well as one item from the Social & Family Well-being (SFWB) scale, concerning family acceptance of the illness, and one item from the Emotional Well-being (EWB) scale (EWB2, "I am satisfied with how I am coping with my illness"). Factor 3 corresponded to four of the seven items from the Social & Family Well-being scale, namely items SFWB1, SFWB2 and SFWB3, concerning support from family and friends, and SFWB5, concerning family communication. Neither the item regarding closeness to a partner nor the item regarding satisfaction with sex life loaded onto this or any other factor. Factor 4 corresponded to most of the Emotional Well-being scale, except for item 2 ("I am satisfied with how I am coping with my illness"), which loaded onto Factor 2.

Table 2 Rotated Factor Loadings for the FACT-G

Rasch analysis


The results of the principal components analysis (PCA) of the residuals suggested that no additional structures were present in any of the FACT-G scales.

For the Physical Well-being scale 81.5% of the variance was explained by the measures, whereas only 4.7% of the unexplained variance was accounted for by the first contrast (eigenvalue of 1.8). For Social & Family Well-being, Emotional Well-being and Functional Well-being respectively the variance explained by the Rasch model amounted to 78.9%, 76.1%, and 72.3%. The unexplained variance amounted to 6.2%, 7.1% and 8.0% (2.0, 1.8 and 2.0 eigenvalues), respectively.

The additional analysis revealed no significant number of person-measure pairs falling outside the 95% confidence interval for the Emotional Well-being scale (<1%, 3/451 pairs) or the Functional Well-being scale (5.3%, 24/451), demonstrating that these scales did not exhibit multidimensionality. However, a significant number of pairs for the Social & Family Well-being scale did exceed the 5% criterion (37.25%, 168/451), indicating that multidimensionality is present in this scale. This analysis could not be carried out on the Physical Well-being scale, as only one item demonstrated misfit, which did not allow item and person parameters to be estimated.

Item fit and location

The final location measures and fit statistics for all the FACT-G scales are provided in Table 3. One item was identified as misfitting from the analysis of the Physical Well-being scale, namely PWB1 ("I have a lack of energy"). In addition, one item from this scale, PWB6 ("I feel ill"), displayed redundancy. For the Social & Family Well-being scale, two items demonstrated misfit, namely items SFWB6 ("I feel close to my partner (or the person who is my main support)") and SFWB7 ("I am satisfied with my sex life"). One item from this scale demonstrated some redundancy, but only by the outfit statistic (SFWB2, "I get emotional support from my family"), which was not used to identify misfit. Two items from the Emotional Well-being scale demonstrated misfit: EWB2 ("I am satisfied with how I am coping with my illness") and EWB6 ("I worry that my condition will get worse"). No items demonstrated redundancy. Two items from the Functional Well-being scale, FWB4 ("I have accepted my illness") and FWB5 ("I am sleeping well"), also demonstrated misfit. All items from the scales demonstrated good fit once the misfitting items had been removed (Table 3). In addition, the recalibrated item parameters fell within ± 0.5 logits of the initial item location for each item.

Table 3 Item location and fit for FACT-G

The range of locations for items from the Physical Well-being scale was narrow (-0.31 to 0.13); for the Emotional Well-being and Social & Family Well-being scales this range was moderately greater (-0.34 to 0.23 and -0.30 to 0.66, respectively). The Functional Well-being scale covered the greatest item range (-1.04 to 0.60).

Rasch Analysis of the FACT-G total

For the FACT-G total scale, an additional factor with an eigenvalue of 7.0 was extracted, accounting for 8% of the unexplained variance, with the Rasch factor accounting for 69.5%. However, no significant number of person measures fell outside the 95% confidence interval (4.21%, 19/451), demonstrating that the scale was unidimensional.

The FACT-G total (results not shown) showed two misfitting items (PWB1 and SFWB7). Infit mean squares for all remaining items fell below the 1.30 criterion. Finally, the range of item locations for the FACT-G total scale was fairly limited (-0.91 to 0.50).

Analysis of clinical significance of removal of misfitting items

There were significant differences between study arms for three of the FACT-G subscales and the FACT-G total for the original, full scales (Table 4).

Table 4 Univariate analysis of variance, comparing randomised trial results with original FACT-G total scores and scales scores and abbreviated scales following removal of misfitting items

The results from the analysis were similar for the scales and total scores following removal of misfitting items. It should be noted that the values of the F statistics were lower for the abbreviated scales, suggesting some loss of statistical power to detect differences. The one exception was the FWB scale, where the removal of misfitting items led to an increased F statistic. There were no significant differences for either the original SFWB or the revised SFWB (F < 1).

The effect sizes for each subscale and the FACT-G total score did not differ when misfitting items were removed, with the exception of the Social & Family Well-being scale, where the effect size improved when the misfitting items were discarded (Table 4). Furthermore, only the effect sizes observed for both EWB scales and the rescored SFWB scale could be considered minimally important differences [24].


This study described a traditional factor analysis and Rasch analysis, which were carried out on the FACT-G [4]. The results from the initial (rotated) factor analysis (principal components analysis) demonstrated a four-factor structure, which largely corresponded to the FACT-G subscales with high levels of internal consistency.

The results of the subsequent Rasch analysis of each subscale showed that all subscales contained misfitting items, although the majority did not contain any redundant items. The subscales were unidimensional; the only exception was the Social & Family Well-being scale, where 37% of the paired person measures fell outside the 95% confidence interval, demonstrating multidimensionality in the scale. This suggests that the Social & Family Well-being scale is perhaps a two-factor scale, with one factor corresponding to family concerns, in particular emotional support from family, family communication and acceptance of the illness (items SFWB1 to SFWB5), and another factor relating primarily to close personal relationships (SFWB6 and SFWB7).

Comparisons between the results of the factor analysis and the Rasch analysis should be carried out with caution, as the fundamental aims of each method differ: principal components factor analysis aims to identify factors within a correlation matrix, whereas Rasch analysis aims to determine whether multidimensionality exists in the residuals once the unidimensional structure has been removed [17]. Nevertheless, in general, the item misfit observed in the Rasch analysis, in particular for the Social & Family Well-being and Emotional Well-being scales, corresponded broadly to the results of the factor analysis.

Removing items which do not fit the Rasch model from questionnaires (scales) may improve the measurement properties of the instrument. It is also important to investigate whether these theoretical improvements change the ability of the instrument to detect group differences, or changes over time, in clinical situations. Our results from applying this investigation to the analysis of a randomised trial in which the FACT-G was an outcome measure suggest that removing the misfitting items had no impact on item locations, nor on the ability of the revised instruments to detect significant differences in scores between patient groups. Furthermore, removing misfitting items did not affect the effect sizes for the majority of subscales or the FACT-G total, although an improved effect size was observed for Social & Family Well-being.

These findings need to be replicated in other studies. If confirmed, they may have important implications for how misfitting items should be treated. Clearly, a balance needs to be struck between the clinical utility and the measurement properties of the instrument. Items in scales may well exhibit misfit (while still retaining "face validity"), and removal of misfitting items may have no impact on the measurement properties of the instrument [14, 15]. However, the results from this study suggest that removal of misfitting items does not affect the clinical utility of the scales, although the statistical power to detect differences may be reduced for the revised instruments or scales, with the notable exception of the FWB scale, where the power appeared to be increased.

Caution should perhaps be exercised when identifying misfitting items. Some concerns have been expressed in the Rasch literature about the ability of a single residual-based fit statistic to correctly identify misfit across a range of assessment scenarios and instruments [25]. Therefore, the removal of these items should be assessed against the impact on clinical utility. This could entail evaluating measurement properties not only against fit statistics, but also against external clinical criteria. For instance, in our earlier work we evaluated the impact of removing misfitting items from the Hospital Anxiety & Depression Scale (HADS) in a cancer population [16]. That study found that the sensitivity and specificity of the instrument, as measured against a clinical interview, were not affected by the elimination of misfitting items.

Although both this study and Dapueto et al. [10] identified the same misfitting items (fit > 1.3) from the Social & Family Well-being and Emotional Well-being scales (SFWB7: "I am satisfied with my sex life"; EWB2: "I am satisfied with how I'm coping with my illness"), results from this study indicated an additional misfitting item in these two scales, as well as misfit in the other two scales. This was not simply due to any differences in the range of fit statistics employed, since the fit statistics for the remaining items from the Dapueto study fell within the 0.7 – 1.3 range employed in this study. Differences may have arisen from differences between the English and Spanish translations of the FACT-G. Differential item functioning has, in particular, been observed between different language versions of other HRQoL questionnaires (e.g. the EORTC QLQ-C30 [26]). In addition, a significant proportion of the patients (68%) in the Dapueto study required assistance when completing the questionnaire. Although those researchers concluded that little relevant difference was found in the internal reliability of the scales between questionnaires which were self-administered and those which were read to patients, the problem remains that different forms of administration may have affected the results and could explain the discrepancies between the two studies.

In a separate Rasch analysis it was also demonstrated that the FACT-G total scale was unidimensional once two misfitting items had been removed, suggesting that the scale may be used as a summary index, indicating an overall level of quality of life or well-being. This may facilitate the interpretation of well-being scores within clinical practice, as an adjunct to the scores derived for each subscale, and may also potentially facilitate the use and interpretation of FACT-G scores when used as an outcome measure in clinical trials.

The potential limitations of this study include the fact that, although the diagnoses of patients were heterogeneous, just over 50% of the patients had either breast or genitourinary cancer, reflecting the greater proportion of women who participated in the studies. Furthermore, no additional clinical data were available on stage or extent of disease to evaluate whether the analysis held across disparate clinical subgroups. The analysis should perhaps be replicated with a larger sample size, although studies have demonstrated that Rasch models are able to produce robust estimates of item locations and fit statistics with sample sizes of 100 [27].

In summary, the Rasch analysis demonstrated that three of the four subscales and the FACT-G total scale were unidimensional, although all subscales and the total scale contained misfitting items. Some caution should perhaps be exercised in interpreting results from the Social & Family Well-being scale, particularly when a single score is employed as an index of a clinically meaningful difference [5]: if a subscale does not represent a single underlying construct, it becomes difficult to draw valid conclusions from a change in scores [12].

Future work should therefore employ larger samples to determine whether the item misfit holds for all scales, and whether it is observed in clinical subgroups (e.g. different diagnoses and stages of disease). More generally, the relationship between item fit and clinical utility should be explored in greater detail, in particular the impact of misfit on the clinical utility of the instrument. Finally, the multidimensionality of the Social & Family Well-being scale, and the possible existence of two subscales ("Family" and "Close personal" relationships), also needs to be investigated.


Both the factor and Rasch analyses demonstrated that all the FACT-G subscales and the total scale were unidimensional, with the exception of the Social & Family Well-being scale. The Rasch analysis revealed misfitting items in each subscale. Removal of the misfitting items did not affect the clinical utility of the scales.


1. Detmar SB, Muller MJ, Wever LD, Schornagel JH, Aaronson NK: The patient-physician relationship. Patient-physician communication during outpatient palliative treatment visits: an observational study. JAMA 2001, 285: 1351–1357. 10.1001/jama.285.10.1351

2. Velikova G, Brown JM, Smith AB, Selby PJ: Computer-based quality of life questionnaires may contribute to doctor-patient interactions in oncology. Br J Cancer 2002, 86: 51–59. 10.1038/sj.bjc.6600001

3. Velikova G, Booth L, Smith AB, Brown PM, Lynch P, Brown JM, Selby PJ: Measuring quality of life in routine oncology practice improves communication and patient well-being: a randomized controlled trial. J Clin Oncol 2004, 22: 714–724. 10.1200/JCO.2004.06.078

4. Cella DF, Tulsky DS, Gray G, Sarafian B, Linn E, Bonomi A, Silberman M, Yellen SB, Winicour P, Brannon J, Eckberg K, Lloyd S, Purl S, Blendowski C, Goodman M, Barnicle M, Stewart I, McHale M, Bonomi P, Kaplan E, Taylor S, Thomas CR, Harris J: The Functional Assessment of Cancer Therapy scale: development and validation of the general measure. J Clin Oncol 1993, 11: 570–579.

5. Cella D, Hahn EA, Dineen DK: Meaningful change in cancer-specific quality of life scores: differences between improvement and worsening. Qual Life Res 2002, 11: 207–221. 10.1023/A:1015276414526

6. Crane PK, Gibbons LE, Narasimhalu K, Lai JS, Cella D: Rapid detection of differential item functioning in assessments of health-related quality of life: The Functional Assessment of Cancer Therapy. Qual Life Res 2007, 16: 101–114. 10.1007/s11136-006-0035-7

7. Winstead-Fry P, Schultz A: Psychometric analysis of the Functional Assessment of Cancer Therapy-General (FACT-G) scale in a rural sample. Cancer 1997, 79: 2446–2452. 10.1002/(SICI)1097-0142(19970615)79:12<2446::AID-CNCR23>3.0.CO;2-Q

8. Kemmler G, Holzner B, Kopp M, Dunser M, Greil R, Hahn E, Sperner-Unterweger B: Multidimensional scaling as a tool for analysing quality of life data. Qual Life Res 2002, 11: 223–233. 10.1023/A:1015207400490

9. Rasch G: Probabilistic models for some intelligence and attainment tests. Chicago: University of Chicago Press; 1980.

10. Dapueto JJ, Francolino C, Servente L, Chang CH, Gotta I, Levin R, del Carmen Abreu M: Evaluation of the Functional Assessment of Cancer Therapy-General (FACT-G) Spanish Version 4 in South America: classic psychometric and Item Response Theory analyses. Health Qual Life Outcomes 2003, 1: 32. 10.1186/1477-7525-1-32

11. Cook KF, Rabeneck L, Campbell CJM, Wray NP: Evaluation of a multidimensional measure of dyspepsia-related health for use in a randomized clinical trial. J Clin Epidemiol 1999, 52: 381–392. 10.1016/S0895-4356(99)00018-9

12. Stucki G, Daltroy L, Katz JN, Johannesson M, Liang MH: Interpretation of change scores in ordinal clinical scales and health status measures: the whole may not equal the sum of the parts. J Clin Epidemiol 1996, 49: 711–717. 10.1016/0895-4356(96)00016-9

13. Tennant A, Miller RL, Pallant JF: Evaluation of the Edinburgh Post Natal Depression Scale using Rasch analysis. BMC Psychiatry 2006, 12: 28.

14. Bjorner JB, Petersen MA, Groenvold M, Aaronson N, Ahlner-Elmqvist M, Arraras JI, Bredart A, Fayers P, Jordhoy M, Sprangers M, Watson M, Young T: Use of item response theory to develop a shortened version of the EORTC QLQ-C30 emotional functioning scale. Qual Life Res 2004, 13: 1683–1697. 10.1007/s11136-004-7866-x

15. Petersen MA, Groenvold M, Aaronson N, Blazeby J, Brandberg Y, de Graeff A, Fayers P, Hammerlid E, Sprangers M, Velikova G, Bjorner JB: Item response theory was used to shorten EORTC QLQ-C30 scales for use in palliative care. J Clin Epidemiol 2006, 59: 36–44. 10.1016/j.jclinepi.2005.04.010

16. Smith AB, Wright EP, Rush R, Stark D, Velikova G, Selby PJ: Rasch analysis of the dimensional structure of the Hospital Anxiety & Depression Scale. Psycho Oncol 2006, 15: 817–827. 10.1002/pon.1015

17. Linacre JM: User's guide to Winsteps. Chicago: Mesa Press; 2005.

18. Masters GN: A Rasch model for partial credit scoring. Psychometrika 1982, 47: 149–174. 10.1007/BF02296272

19. Bond TG, Fox CM: Applying the Rasch model: fundamental measurement in the human sciences. London: Lawrence Erlbaum Associates; 2001.

20. Tennant A, Pallant J: Unidimensionality matters. Rasch Measurement Transactions 2006, 20: 1048–1051.

21. Smith EV: Detecting and evaluating the impact of multidimensionality using item fit statistics and principal component analysis of residuals. J Appl Meas 2002, 3: 205–231.

22. Lai J-S, Cella D, Chang C-H, Bode RK, Heinemann AW: Item banking to improve, shorten and computerize self-reported fatigue: an illustration of steps to create a core item bank from the FACIT-Fatigue Scale. Qual Life Res 2003, 12: 485–501. 10.1023/A:1025014509626

23. Smith RM, Suh KK: Rasch fit statistics as a test of the invariance of item parameter estimates. J Appl Meas 2003, 4: 153–163.

24. Yost KJ, Cella D, Chawla A, Holmgren E, Eton DT, Ayanian JZ, West DW: Minimally important differences were estimated for the Functional Assessment of Cancer Therapy-Colorectal (FACT-C) instrument using a combination of distribution- and anchor-based approaches. J Clin Epidemiol 2005, 58: 1241–1251. 10.1016/j.jclinepi.2005.07.008

25. Karabatsos G: A critique of Rasch residual fit statistics. J Appl Meas 2000, 1: 152–176.

26. Petersen MA, Groenvold M, Bjorner JB, Aaronson N, Conroy T, Cull A, Fayers P, Hjermstad M, Sprangers M, Sullivan M, European Organisation for Research and Treatment of Cancer Quality of Life Group: Use of differential item functioning analysis to assess the equivalence of translations of a questionnaire. Qual Life Res 2003, 12: 373–385. 10.1023/A:1023488915557

27. Stone M, Yumoto F: The effect of sample size for estimating Rasch/IRT parameters with dichotomous items. J Appl Meas 2004, 5: 48–61.



The authors are grateful to the patients who completed the questionnaires, and to Prof. David Cella for reading and providing comments on an earlier draft of the manuscript.

Author information



Corresponding author

Correspondence to Adam B Smith.

Additional information

Competing interests

The author(s) declare that they have no competing interests.

Authors' contributions

ABS undertook the analysis of the data and writing of the manuscript. GV contributed to the data collection, analysis of clinical utility, writing and critical reviewing of the manuscript. EPW and PJS contributed to the preparation and writing of the manuscript.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article

Smith, A.B., Wright, P., Selby, P.J. et al. A Rasch and factor analysis of the Functional Assessment of Cancer Therapy-General (FACT-G). Health Qual Life Outcomes 5, 19 (2007).



  • Differential Item Functioning
  • Latent Trait
  • Item Parameter
  • Person Measure
  • Item Invariance