Reliability analysis of the Chinese version of the Functional Assessment of Cancer Therapy – Leukemia (FACT-Leu) scale based on multivariate generalizability theory
Health and Quality of Life Outcomes volume 15, Article number: 93 (2017)
Abstract
Background
The Functional Assessment of Cancer Therapy–Leukemia (FACT-Leu) scale, a leukemia-specific instrument for measuring health-related quality of life (HRQOL) in patients with leukemia, has been developed and validated, but there have been no reports on the development of a simplified Chinese version of this scale. Analyzing the reliability of an HRQOL measurement with multivariate generalizability theory (MGT) is also a new exploration. This study aimed to develop a Chinese version of the FACT-Leu scale and evaluate its reliability using MGT to provide evidence to support the revision and improvement of this scale.
Methods
The Chinese version of the FACT-Leu scale was developed in four steps: forward translation, backward translation, cultural adaptation and pilot testing. The HRQOL of eligible inpatients with leukemia was measured using this scale to provide data. A single-facet multivariate Generalizability Study (G-study) design was used to estimate the variance–covariance components, and then several Decision Studies (D-studies) with varying numbers of items were analyzed to obtain reliability coefficients and to understand how much the measurement reliability could vary as the number of items changes.
Results
One hundred and one eligible inpatients diagnosed with leukemia were recruited and completed the HRQOL measurement at the time of admission to the hospital. In the G-study, the variance component of the patient-item interaction was the largest and the variance component of the item was the smallest for four of the five domains; the exception was the leukemia-specific (LEUS) domain. In the D-study, at the domain level, the generalizability coefficients (G) and the indexes of dependability (Ф) for four of the five domains were approximately equal to or greater than 0.80; the exception was the Emotional Well-being (EWB) domain (>0.70 but <0.80). For the overall scale, the composite G and composite Ф coefficients were greater than 0.90. Based on the G and Ф coefficients, two decision options for revising the number of items in this scale were obtained: a 37-item version and a 45-item version.
Conclusion
The Chinese version of the FACT-Leu scale has good reliability as a whole based on the results of MGT, and the implementation of MGT could lead to more informed decisions in complex questionnaire design and improvement.
Background
Leukemia, the leading cause of cancer-related death for children and adults under 35 years old, is an increasingly common health care problem worldwide. It can be divided into two categories, acute leukemia and chronic leukemia, based on the natural course and the degree of cellular differentiation. In China, there were an estimated 75,300 new cases of leukemia and 53,400 deaths in 2015 [1]. Treatment can mitigate the clinical symptoms and improve the 1- and 5-year overall survival rates but also introduces toxicity that can offset the clinical benefit. For patients with cancer, the primary concerns include the disease symptoms, treatment toxicity, increased risk of second malignancy, long-term and late effects of treatment (e.g., fatigue), depressive mood and anxiety, reduced work efficiency and family dysfunction. These concerns can be reflected by a comprehensive and multidimensional index: the health-related quality of life (HRQOL). There is increasing recognition of the need to assess the HRQOL when doctors administer certain types of clinical treatment [2]. However, measuring the HRQOL of patients with leukemia requires specific, reliable and effective instruments.
Consequently, several leukemia-specific instruments have been developed and used, including the Life Ingredient Profile for hematologic malignancies [3], the Medical Research Council/EORTC Quality of Life Questionnaire-Leukaemia Module (MRC/EORTC QLQ-LEU [4]), an HRQOL scale for individuals with leukemia in long-term remission, the EORTC for Chronic Myeloid Leukaemia (EORTC QLQ-CML24), which contains an additional EORTC leukemia-specific module [5], and the Functional Assessment of Cancer Therapy-Leukemia (FACT-Leu) scale [6]. Among these instruments, the FACT-Leu scale was created by combining the Functional Assessment of Cancer Therapy-General module (FACT-G) [7] and a subscale made up of 17 leukemia-specific items. The original English version was developed by Northwestern University in the USA, which has been working with us. The use of the FACT scale system has been officially sanctioned. To the best of our knowledge, there have been no reports of the development of a simplified Chinese version of the FACT-Leu scale. Therefore, the major objective of this study was to develop a sound simplified Chinese version of the FACT-Leu scale.
Although the original English FACT-Leu scale has been tested and determined to be a valid, reliable, and efficient instrument for evaluating the leukemia-specific HRQOL in both acute and chronic disease [6], it is necessary to test the reliability and validity of the simplified Chinese version before its expanded use in the clinic, in consideration of the possible culture-sensitive characteristics of the HRQOL. Usually, researchers evaluate the reliability and validity of HRQOL scales using Classical Test Theory (CTT). A reliability and validity analysis of the simplified Chinese version of the FACT-Leu scale was performed using CTT. The Cronbach’s α coefficients of the overall scale were greater than 0.9, and the Cronbach’s α coefficients were greater than 0.80 at the domain level. The intraclass correlation coefficients (ICC) of all domains and the overall scale were greater than 0.90, which indicates good to excellent reliability. Good convergent and discriminant validity and good construct validity were confirmed by a correlation analysis and a factor analysis. The criterion-related validity was determined to be good when using the Quality of Life Instrument for Cancer Patients-Leukemia (QLICP-LE), a scale developed entirely by us, as a criterion; more detailed results will be reported elsewhere. CTT posits that an observed score is the sum of the true score and error, and the reliability is then the ratio of the true score variance to the observed score variance. In CTT, the reliability is based on a single source of error, ignoring all other potential sources of error variance. When measurement error stems from multiple sources, CTT may be inadequate. Generalizability theory (GT) provides a comprehensive and unifying framework that goes beyond the CTT model of a single error term by allowing for the simultaneous analysis of main-effect and interaction-effect sources of error variance [8].
The conventional reliability approaches in CTT are typically post hoc, that is, the measurement reliability is computed after the fact. GT, however, can be used proactively in planning better measurement protocols. This flexibility and forecasting capability is not generally provided by conventional reliability approaches such as CTT. GT subsumes other forms of reliability approaches (e.g., internal consistency reliability, inter-rater reliability, and intraclass correlation) and provides a comprehensive and unifying framework for assessing the measurement reliability, especially for complex measurement situations. This theory was pioneered in the educational field and has been used in medicine [9,10,11,12,13,14,15,16,17,18,19,20,21]. The application of GT includes the univariate generalizability theory (UGT) method and the multivariate generalizability theory (MGT) method. MGT was initially proposed by Cronbach based on the multivariate analysis of variance (MANOVA), and it is appropriate for multidimensional and complicated measurement situations [22]. The analysis and estimation process of MGT considers not only the variances (variance components), but also the covariance structure of the domains. In MGT, the reliabilities of all domains are estimated simultaneously, rather than each domain in isolation [23]. The HRQOL assessment contains different domains, usually with a different set of items within each domain. Given these multidimensional characteristics of HRQOL, MGT is a natural choice. Although some studies [24,25,26,27,28,29,30] used GT in assessing the reliability of HRQOL, none considered MGT. Therefore, the secondary objective of the current study was to evaluate the reliability of the Chinese version of the FACT-Leu scale by using MGT. The current study is expected to answer the following questions:

(1) What is the reliability of the Chinese version of the FACT-Leu scale?
(2) What impact does the test length (number of items) have on the reliability of the Chinese version of the FACT-Leu scale?
(3) How should the number of items for every domain in the Chinese version of the FACT-Leu scale be changed to obtain better reliability in the future?
Methods
Translation of the simplified Chinese version of the FACT-Leu scale
The FACT-Leu scale was translated from the original English version into simplified Chinese. The translation procedure comprised four steps: forward translation, backward translation, cultural adaptation, and pilot testing. In the first step, the original English version was independently translated into simplified Chinese by two translators (one an epidemiologist and the other an oncologist who specializes in leukemia) whose native language is Chinese but who are proficient in English. The two forward-translated versions were then compared by a research coordinator to find any differences. When differences were identified, the research coordinator discussed them with the two translators until they all agreed on a reconciled Chinese version. In the second step, the reconciled Chinese version was translated into English by two back-translators who had never read the original English version. The two back-translators are both oncologists who are proficient in English. Then, the backward translation and the original English version were compared, and the reconciled Chinese version was further modified accordingly by the same coordinator. The process was repeated until the backward translation was identical or nearly identical to the original English version. In the third step, the items that were likely to lead to confusion or misunderstanding due to cultural differences were modified based on the results of an in-depth interview of oncologists and a focus group discussion. In the final step, a pilot test was carried out among 15 patients who were diagnosed with leukemia and satisfied the eligibility criteria. They were asked to independently complete the scale obtained in the third step and were then interviewed using a questionnaire. During the interview, patients were asked to mark any item that was upsetting or difficult to understand and to offer suggestions for it.
Based on the suggestions from the patients, necessary modifications were made and a final Chinese version for use in the formal measurement of the HRQOL was obtained.
Structure of the Chinese version of the FACT-Leu scale
The structure of the Chinese version of the FACT-Leu scale is the same as that of the original English version; it is composed of two subscales (the 27-item FACT-G and the 17-item leukemia-specific subscale). The FACT-G consists of four domains: the 7-item Physical Well-being (PWB) domain, the 7-item Social/Family Well-being (SWB) domain, the 6-item Emotional Well-being (EWB) domain and the 7-item Functional Well-being (FWB) domain. For ease of description, the leukemia-specific subscale is called the LEUS domain in this article. Each item of the FACT-Leu scale is rated on a five-level scoring system, namely, not at all, a little bit, somewhat, quite a bit, and very much. The positive items receive scores from 0 to 4 points, while the negative items are scored in the reverse order.
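The scoring rule can be illustrated with a minimal sketch. The function name and the example responses are hypothetical; which items count as negatively worded is defined by the FACT scoring guidelines and is not listed in this article, so it is passed in as a flag.

```python
# Response options of the five-level rating, ordered by raw value 0..4.
LEVELS = ["not at all", "a little bit", "somewhat", "quite a bit", "very much"]


def score_item(response: str, negative: bool) -> int:
    """Score one item: raw value 0..4 for positive items, reversed for negative items."""
    raw = LEVELS.index(response)
    return 4 - raw if negative else raw


# Hypothetical responses: (response label, is the item negatively worded?)
answers = [("not at all", True), ("very much", True), ("somewhat", False)]
scores = [score_item(resp, neg) for resp, neg in answers]
```

With this rule, a higher item score always points toward better HRQOL, whatever the wording direction of the item.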
Study subjects
This study recruited inpatients diagnosed with leukemia at the First Affiliated Hospital of Kunming Medical University and the First People’s Hospital of Yunnan Province from October 2013 to March 2014. Eligible participants were aged 18 years or older, were at least 2 months post diagnosis of any type or stage of leukemia, had a life expectancy of more than 3 months, and were able to read and understand the questionnaires. Exclusion criteria were a diagnosis of psychosis or dementia, and illiteracy.
Measurement process
The investigators (doctors, nurses and medical postgraduate students) obtained informed consent from the patients who agreed to participate in the study and met the inclusion criteria. The Chinese version of the FACT-Leu scale was used to measure the HRQOL of the included leukemia patients after the investigators explained the study and the scale. The HRQOL is a self-report measurement, so each patient was asked to answer the scale by himself or herself. The answers were immediately checked by the investigators to verify completeness. If missing items were found, the scale was returned to the patient immediately to fill in the missing items. The patients completed the 44-item FACT-Leu scale at the time of admission to the hospital.
A multivariate generalizability theory approach
GT contains two stages: the Generalizability Study (G-study) and the Decision Study (D-study). The G-study serves as a “pilot” study that decomposes the variance and covariance components related to various error sources, using analysis of variance (ANOVA) or multivariate analysis of variance (MANOVA) on the data collected, to help confirm the relationship between the measurement goal and the measurement facets. In the D-study, the information from the G-study is used to plan an “optimal” measurement protocol so that the best possible reliability can be achieved while balancing other factors.
In generalizability theory, a latent trait is defined by sets of conditions under which it may be observed; each set of conditions is called a facet. Usually, the latent trait is the object of measurement [31]. In this study, the HRQOL of patients is defined as the measurement goal and may be measured with different items and at different times; the items and times may be conceived as two different facets. Two or more facets can be combined in various ways, along with the definition of the objects of measurement (the persons), to define a set of observations, called the universe of assessable observations (UAO) for a given latent trait [31]. In a D-study, a universe of generalization (UG) is specified; it contains those facets and conditions over which the investigator is willing to generalize with a particular measurement procedure. Along with the specification of the UG, universe scores for the objects of measurement are defined. A universe score is defined as the expected value of the observed scores for a measurement goal over all conditions in the UG [31]. Usually, the UG and the UAO are defined to have the same structure.
Multivariate G-study design
As the first step, the G-study includes design, data collection, and estimation of the relevant variance-covariance components under the design conditions. In the current study, the measurement goal is the HRQOL of patients with leukemia, abbreviated as “p” in the design, and the items facet (abbreviated as “i”) is the only facet of measurement because the HRQOL is a self-report measurement and there is no rater. The items facet is a random facet because a different set of items can be involved in each replication. The HRQOL data are unbalanced because there are unequal numbers of items within each domain. The domains facet (abbreviated as “h”) is treated as a fixed facet because every replication of the measurement of the HRQOL involves the same domains. Facet i is nested within facet h. In addition, facet p is completely crossed with facet i because every patient answers every item in the FACT-Leu scale. In the univariate G-study sense, this design is expressed as \( p \times (i:h) \), where p, i, and h represent the patients, items, and domains facets, respectively. The fixed domains facet and the random-effects variance component design associated with each level of the fixed domains facet yield a multivariate G-study design, \( p^{\bullet} \times i^{\circ} \), with the number of levels for the fixed domains facet being \( n_h \). The solid circle, •, indicates that the patients facet is crossed with the fixed multivariate variable (i.e., domain), whereas the empty circle, ∘, indicates that the items facet is nested within the fixed multivariate variable. In other words, there is a random-effects \( p^{\bullet} \times i^{\circ} \) design within each of the five fixed domains and any single item is only associated with a single domain [32]. The mathematical model of the \( p^{\bullet} \times i^{\circ} \) design can be defined as
\[ X_{pih} = \mu_h + \mu^{\sim}_{ph} + \mu^{\sim}_{i:h} + \mu^{\sim}_{pi:h,e} \tag{1} \]
or
\[ X_{pih} = \mu_h + (\mu_{ph} - \mu_h) + (\mu_{i:h} - \mu_h) + (X_{pih} - \mu_{ph} - \mu_{i:h} + \mu_h) \tag{2} \]
In Equation 1, \( X_{pih} \) is the observed score that results from a single observation in the \( h \)-th domain, and this score may be decomposed into an overall mean and several effects according to the analysis of variance model. On the right-hand side of Equation 1, \( \mu_h \) is the overall mean in the \( h \)-th domain; \( \mu^{\sim}_{ph} = \mu_{ph} - \mu_h \) is the patient effect in the \( h \)-th domain, where \( \mu_{ph} \) is the average score over all items for patient p in the \( h \)-th domain; \( \mu^{\sim}_{i:h} = \mu_{i:h} - \mu_h \) is the item effect on the observed score in the \( h \)-th domain, where \( \mu_{i:h} \) is the average score over all patients for item i in the \( h \)-th domain; and \( \mu^{\sim}_{pi:h,e} \) is the residual effect in the \( h \)-th domain, including the patient-item interaction effect and other effects. The FACT-Leu scale has five domains (\( n_h = 5 \)), and the model equations for the different domains can be presented as follows:
\[ \begin{aligned} X_{pi1} &= \mu_1 + \mu^{\sim}_{p1} + \mu^{\sim}_{i:1} + \mu^{\sim}_{pi:1,e} \\ X_{pi2} &= \mu_2 + \mu^{\sim}_{p2} + \mu^{\sim}_{i:2} + \mu^{\sim}_{pi:2,e} \\ X_{pi3} &= \mu_3 + \mu^{\sim}_{p3} + \mu^{\sim}_{i:3} + \mu^{\sim}_{pi:3,e} \\ X_{pi4} &= \mu_4 + \mu^{\sim}_{p4} + \mu^{\sim}_{i:4} + \mu^{\sim}_{pi:4,e} \\ X_{pi5} &= \mu_5 + \mu^{\sim}_{p5} + \mu^{\sim}_{i:5} + \mu^{\sim}_{pi:5,e} \end{aligned} \tag{3} \]
It follows that the variance and covariance components for the population and the UAO can be grouped into three symmetric matrices: \( \Sigma_p \), \( \Sigma_i \) and \( \Sigma_{pi} \). Note that \( \Sigma_p \) is the variance–covariance component matrix among patients, \( \Sigma_i \) is the variance–covariance component matrix for items within a domain, and \( \Sigma_{pi} \) is the variance–covariance component matrix for the patient-item interaction within a domain. As all patients contribute data to all levels of the domains facet but the items are nested in different levels of the domains facet, \( \Sigma_p \) is a full matrix while \( \Sigma_i \) and \( \Sigma_{pi} \) are diagonal [32].
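How such components can be estimated is easier to see in code. The following is a minimal sketch, not the mGENOVA implementation: it assumes complete data (every patient answers every item), uses the standard expected-mean-squares estimators for a crossed patients-by-items random design within one domain, and estimates a between-domain patient covariance as the sample covariance of the patients' domain mean scores. The function names and the toy data are hypothetical.

```python
def g_study_domain(scores):
    """Estimate variance components for one domain.

    `scores` is a list of rows, one row per patient, one column per item.
    Returns a dict with the components for patients (p), items (i) and
    the patient-item interaction confounded with residual error (pi).
    """
    n_p, n_i = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_i)
    p_means = [sum(row) / n_i for row in scores]
    i_means = [sum(row[j] for row in scores) / n_p for j in range(n_i)]

    # Mean squares of the two-way layout with one observation per cell.
    ms_p = n_i * sum((m - grand) ** 2 for m in p_means) / (n_p - 1)
    ms_i = n_p * sum((m - grand) ** 2 for m in i_means) / (n_i - 1)
    ss_pi = sum(
        (scores[p][j] - p_means[p] - i_means[j] + grand) ** 2
        for p in range(n_p)
        for j in range(n_i)
    )
    ms_pi = ss_pi / ((n_p - 1) * (n_i - 1))

    # Solve the expected-mean-square equations for the components.
    return {"p": (ms_p - ms_pi) / n_i, "i": (ms_i - ms_pi) / n_p, "pi": ms_pi}


def person_covariance(scores_a, scores_b):
    """Covariance of patients' mean scores between two domains (same patients, same order)."""
    means_a = [sum(row) / len(row) for row in scores_a]
    means_b = [sum(row) / len(row) for row in scores_b]
    n_p = len(means_a)
    grand_a, grand_b = sum(means_a) / n_p, sum(means_b) / n_p
    return sum((a - grand_a) * (b - grand_b) for a, b in zip(means_a, means_b)) / (n_p - 1)


# Toy data (hypothetical): 4 patients, a 3-item domain and a 2-item domain.
domain_a = [[3, 4, 2], [1, 2, 1], [4, 4, 3], [2, 3, 1]]
domain_b = [[3, 3], [1, 2], [4, 3], [2, 2]]
components_a = g_study_domain(domain_a)
cov_ab = person_covariance(domain_a, domain_b)
```

The three estimates per domain fill the diagonals of the three matrices, and the patient covariances fill the off-diagonal cells of the patient matrix, which is why only that matrix is full.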
Multivariate D-study designs
Once the variance and covariance matrices from the G-study are available, they can be used in the D-study to estimate the variance components of the universe score and of the corresponding errors, and then to calculate the two reliability coefficients: the generalizability coefficient (G) and the index of dependability (Ф) for every domain and for the overall scale. Several D-studies under the original measurement protocol and under new measurement protocols, modified by changing the number of items, were analyzed to understand how the measurement reliability could vary with a changing number of items.
Original measurement protocol
Following the method in a univariate \( P \times I \) design (by convention, letters are capitalized in the D-study) and the definition of the variance for the composite, we could define \( \Sigma_P = \Sigma_p \), \( \Sigma_I = \Sigma_i / n'_{ih} \) and \( \Sigma_{PI} = \Sigma_{pi} / \Sigma_{n_{ih}} \) in the multivariate D-study. Note that \( \Sigma_i / n'_{ih} \) means the \( h \)-th diagonal element in \( \Sigma_i \) divided by the corresponding number of items (\( n'_{ih} \)), and \( \Sigma_{pi} / \Sigma_{n_{ih}} \) means the \( h \)-th diagonal element in \( \Sigma_{pi} \) divided by \( n'_{ih} \), because \( \Sigma_{n_{ih}} \) was designed to be a diagonal matrix containing the numbers of items within the levels of h (e.g., \( \Sigma_{n_{ih}} = \mathrm{diag}(7, 7, 6, 7, 17) \) in this study).
In the \( p^{\bullet} \times i^{\circ} \) design, the patient-item interaction within domain (pi:h) component alone constitutes the variance of the relative error, while the variance of the absolute error contains both the patient-item interaction within domain (pi:h) component and the item within domain (i:h) component. Therefore, the variance component of the relative error for every domain (\( \sigma^2_{\delta h} \)) and the variance component of the absolute error for every domain (\( \sigma^2_{\Delta h} \)) can be calculated using Equation 4 and Equation 5.
\[ \sigma^2_{\delta h} = \sigma^2_h(PI) = \frac{\sigma^2_h(pi)}{n'_{ih}} \tag{4} \]
\[ \sigma^2_{\Delta h} = \sigma^2_h(I) + \sigma^2_h(PI) = \frac{\sigma^2_h(i)}{n'_{ih}} + \frac{\sigma^2_h(pi)}{n'_{ih}} \tag{5} \]
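In other words, in the \( p^{\bullet} \times i^{\circ} \) design, \( \sigma^2_{\delta h} \) is the \( h \)-th diagonal element of \( \Sigma_{PI} \) and \( \sigma^2_{\Delta h} \) is the \( h \)-th diagonal element of the sum of the diagonal matrices \( \Sigma_{PI} \) and \( \Sigma_I \). The G coefficient and the Ф coefficient are calculated using Equation 6 and Equation 7, respectively. In Equations 6 and 7, \( \sigma^2_h(P) \) is the variance component of the patients in the D-study, which is the \( h \)-th diagonal element of the matrix \( \Sigma_P \) (\( \Sigma_P = \Sigma_p \)).
\[ G_h = \frac{\sigma^2_h(P)}{\sigma^2_h(P) + \sigma^2_{\delta h}} \tag{6} \]
\[ \Phi_h = \frac{\sigma^2_h(P)}{\sigma^2_h(P) + \sigma^2_{\Delta h}} \tag{7} \]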
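The domain-level calculation can be written out as a short sketch. The function name is hypothetical, and the item and interaction components in the usage line are made-up values chosen only for illustration; the patient component 0.679 and the 7 items are the values reported for the FWB domain.

```python
def d_study_domain(var_p, var_i, var_pi, n_items):
    """Return (G, Phi) for one domain of the patients-by-items design.

    var_p, var_i, var_pi are G-study variance components for patients,
    items and the patient-item interaction; n_items is the D-study
    number of items in the domain.
    """
    rel_error = var_pi / n_items                      # relative error variance
    abs_error = var_i / n_items + var_pi / n_items    # absolute error variance
    g = var_p / (var_p + rel_error)                   # generalizability coefficient
    phi = var_p / (var_p + abs_error)                 # index of dependability
    return g, phi


# var_p and n_items as reported for FWB; var_i and var_pi are hypothetical.
g_fwb, phi_fwb = d_study_domain(var_p=0.679, var_i=0.05, var_pi=0.90, n_items=7)
```

Because the absolute error adds the item component to the relative error, Ф can never exceed G for the same domain and number of items.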
In the MGT framework, not only these indexes for every domain of the scale but also the corresponding composite values for the overall scale, obtained by defining a weight coefficient, are important. The weight coefficient (\( w_h \)) reflects the proportion of the total number of items in the measurement procedure that is associated with each h in the overall scale. It is commonly defined as \( w_h = n_{ih} / n_{i\cdot} \), where \( n_{ih} \) is the number of items in the \( h \)-th domain and \( n_{i\cdot} \) designates the total number of items in all domains of the overall scale, that is, \( n_{i\cdot} = n_{ih(1)} + n_{ih(2)} + n_{ih(3)} + n_{ih(4)} + n_{ih(5)} \). The weight coefficient is applied to the corresponding variance (\( \sigma^2_h \)) or covariance (\( \sigma_{hh'} \)) to calculate the composite values of those indexes. For example, the composite universe score variance for the overall scale (\( \sigma^2_c(P) \)) can be calculated by Equation 8, while the composite variances of the relative error (\( \sigma^2_{\delta c} \)) and the absolute error (\( \sigma^2_{\Delta c} \)) for the overall scale can be calculated by Equations 9 and 10, respectively.
\[ \sigma^2_c(P) = \sum_{h} w_h^2 \, \sigma^2_h(P) + \sum_{h} \sum_{h' \neq h} w_h w_{h'} \, \sigma_{hh'}(P) \tag{8} \]
\[ \sigma^2_{\delta c} = \sum_{h} w_h^2 \, \sigma^2_{\delta h} \tag{9} \]
\[ \sigma^2_{\Delta c} = \sum_{h} w_h^2 \, \sigma^2_{\Delta h} \tag{10} \]
After \( \sigma^2_c(P) \), \( \sigma^2_{\delta c} \) and \( \sigma^2_{\Delta c} \) are obtained by Equations 8–10, the composite G coefficient and the composite Ф coefficient for the overall scale can be calculated using the same principles as in Equations 6 and 7. The estimated variance-covariance component matrices and all indexes used to reflect the measurement error and reliability in MGT can be calculated using specialized software, mGENOVA. The software and the manual for mGENOVA [33, 34] can be downloaded from the website http://www.uiowa.edu/~itp or http://www.uiowa.edu/~casma. mGENOVA can process an almost unlimited number of observations very rapidly.
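As a minimal sketch of the composite calculation (not mGENOVA's implementation), the code below weights each domain by its share of the items, sums the weighted patient variances and covariances into the composite universe score variance, and sums the weighted error variances, which carry no covariance terms because the item and interaction matrices are diagonal. The function name and all numbers in the usage example are hypothetical.

```python
def composite_coefficients(cov_p, var_i, var_pi, n_items):
    """Return (composite G, composite Phi) for the overall scale.

    cov_p   : full variance-covariance matrix of patients (list of lists)
    var_i   : item variance components, one per domain
    var_pi  : patient-item interaction components, one per domain
    n_items : D-study numbers of items, one per domain
    """
    k = len(n_items)
    total = sum(n_items)
    w = [n / total for n in n_items]  # weights: share of items per domain

    # Composite universe score variance: weighted variances and covariances.
    universe = sum(w[a] * w[b] * cov_p[a][b] for a in range(k) for b in range(k))
    # Composite error variances: weighted domain-level error variances only.
    rel_error = sum(w[h] ** 2 * var_pi[h] / n_items[h] for h in range(k))
    abs_error = sum(w[h] ** 2 * (var_i[h] + var_pi[h]) / n_items[h] for h in range(k))

    return universe / (universe + rel_error), universe / (universe + abs_error)


# Hypothetical two-domain example with 7 and 6 items.
cov_p = [[0.60, 0.40], [0.40, 0.50]]
g_c, phi_c = composite_coefficients(cov_p, [0.05, 0.08], [0.90, 1.10], [7, 6])
```

Positive covariances between domains raise the composite universe score variance without adding error, which is one reason the composite coefficients can exceed every domain-level coefficient.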
In this research, running mGENOVA involved three steps. Step 1: A database was established and saved as, or converted into, a text file that follows the data structure required by mGENOVA and is located in the same folder as the mGENOVA application (see Additional file 1: Data Structure and a Screenshot). Step 2: A code file that contains a set of control cards and the necessary parameters (such as the filename of the database, the filename of the analysis results, the name of each domain, the number of domains, the number of subjects, and the number of items in every domain) was created and saved as a text file in the same folder as the mGENOVA application. In this code file, all parameters were separated from each other by any number of spaces and the order in which the parameters were provided was fixed; they must occur in column 10 or beyond. An example of a code file is shown in Additional file 2: Example Code of mGENOVA. Step 3: mGENOVA.exe was double-clicked and mGENOVA then prompted us for the filename of the code file. After the filename of the code file (for example, FACTLEU.txt) and a return were typed, mGENOVA completed its execution.
New measurement protocols
To understand how the numbers of items would affect the measurement reliability of the Chinese version of the FACT-Leu scale and to provide suggestions for the modification of the scale, several multivariate D-studies were conducted using a three-step process.
In the first step, the test length (the number of items) was varied proportionally to give three test lengths: the original test length, half the original length, and double the original length. For example, the “half” test length had a total of 24 items, where \( n'_{ih(\mathrm{PWB})} = n'_{ih(\mathrm{SWB})} = n'_{ih(\mathrm{FWB})} = 4 \), \( n'_{ih(\mathrm{EWB})} = 3 \) and \( n'_{ih(\mathrm{LEUS})} = 9 \) because the numbers of items were rounded. That is, it was assumed that there were 4 randomly parallel items in each of the PWB, SWB, and FWB domains, 3 items in the EWB domain and 9 items in the LEUS domain. Based on the results of these D-studies and the criterion that an optimal reliability coefficient is usually defined as greater than 0.80, we could determine for which domains the number of items should be increased and for which it should be decreased. Then, the range over which the number of items of every domain should be varied in the next D-study analysis was defined.
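The proportional test lengths can be reproduced with a few lines. This is a sketch that assumes halves are rounded up (7/2 becomes 4, 17/2 becomes 9), which matches the item counts reported for the “half” length; the function name is hypothetical.

```python
import math

# Numbers of items per domain in the original 44-item scale.
ORIGINAL = {"PWB": 7, "SWB": 7, "EWB": 6, "FWB": 7, "LEUS": 17}


def scale_lengths(lengths, factor):
    """Scale every domain length by `factor`, rounding halves up."""
    return {domain: math.floor(n * factor + 0.5) for domain, n in lengths.items()}


half = scale_lengths(ORIGINAL, 0.5)    # {'PWB': 4, 'SWB': 4, 'EWB': 3, 'FWB': 4, 'LEUS': 9}
double = scale_lengths(ORIGINAL, 2)    # every domain doubled
half_total = sum(half.values())        # 24 items
double_total = sum(double.values())    # 88 items
```

Python's built-in `round` is avoided here because it rounds halves to the nearest even integer, which would turn 8.5 into 8 rather than 9.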
In the second step, five scenarios were designed. In each scenario, only the number of items in one domain was changed, while the other four domains kept their original test lengths. In the first scenario, scenario A, the number of items in the PWB domain was changed within the range defined in the first step, while the other four domains kept their original test lengths; the same steps were followed for scenarios B, C, D and E by changing the number of items in the SWB, EWB, FWB and LEUS domains, respectively. Several multivariate D-studies were conducted for each scenario to find the appropriate number of items for each domain, deemed to be the number that made either the G coefficient or the Ф coefficient just above 0.8 while the composite G and composite Ф coefficients remained large (>0.90). To describe the results of every scenario more concisely, the “appropriate number of items” that made the G coefficient just larger than 0.8 was defined as \( n_i^{G} \), and the “appropriate number of items” that made the Ф coefficient just larger than 0.8 was defined as \( n_i^{\phi} \).
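The search for an “appropriate number of items” within one scenario can be sketched as a simple scan. The sketch checks only the domain-level coefficient; the additional requirement on the composite coefficients is not modelled. The function name and the variance components in the example are hypothetical.

```python
def smallest_n(var_p, single_item_error, threshold=0.8, candidates=range(1, 31)):
    """Return the smallest number of items whose coefficient exceeds `threshold`.

    single_item_error is the error variance for a one-item test:
    var_pi for the G coefficient, var_i + var_pi for the Phi coefficient.
    Returns None if no candidate length reaches the threshold.
    """
    for n in candidates:
        coefficient = var_p / (var_p + single_item_error / n)
        if coefficient > threshold:
            return n
    return None


# Hypothetical components for one domain.
var_p, var_i, var_pi = 0.50, 0.10, 1.00
n_g = smallest_n(var_p, var_pi)             # items needed for G > 0.8
n_phi = smallest_n(var_p, var_i + var_pi)   # items needed for Phi > 0.8
```

Because the absolute error is never smaller than the relative error, the number of items needed for Ф is never smaller than the number needed for G, which matches the pattern of the two quantities reported for each scenario.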
In the third step, the numbers of items for the five domains were reallocated based on the \( n_i^{G} \) and the \( n_i^{\phi} \) found for each domain in the second step to provide two decision options for the further modification of the scale, and two multivariate D-studies were then conducted for the two decision options.
Results
General characteristics of the patients
After excluding patients who were unwilling or unable to participate, 101 eligible inpatients diagnosed with leukemia were included. The patients were aged 18–80 years (mean age 40.46 ± 15.12 years); 55.4% were male and most (80.2%) were of Han ethnicity. Regarding education, 60.4% had completed high school. Overall, 76.2% were married and most (92.1%) had public insurance. Only 10.9% considered their economic status to be well-off. Sixty-nine patients (68.3%) were diagnosed with acute leukemia, and 32 (31.7%) were diagnosed with chronic leukemia. The characteristics of the patients are provided in Additional file 3: Table S1.
Reliability based on MGT
G study results
As discussed previously, the MGT applications include the G-study and the D-study. Table 1 presents the G-study results for all variance components (the diagonal elements) and covariance components (covariation among the domains) of the five domains of the Chinese version of the FACT-Leu scale based on the current design (7 items in the PWB domain, 7 items in the SWB domain, 6 items in the EWB domain, 7 items in the FWB domain, and 17 items in the LEUS domain; 44 items in total).
Each of the variance components of the patients (\( \sigma^2_p \)) represents the estimated “true score” variance across the patients on the specific domain of the scale. Based on the results, this variance component was the largest (0.679) for the FWB domain, followed by the PWB domain (0.642) and the EWB domain (0.491). The variance component of the LEUS domain was the lowest (0.327). This suggests that, relatively speaking, the HRQOL of the patients with leukemia differed most on the FWB domain and least on the LEUS domain. The correlation coefficients between the five domains were relatively large, especially those between the PWB domain and EWB domain, the PWB domain and LEUS domain, and the EWB domain and LEUS domain, with values of 0.912, 0.949 and 0.976, respectively. These results further corroborate that MGT methods are well suited to evaluating the reliability of the FACT-Leu scale.
The sources of variation for every domain were grouped into three parts: patient, item, and patient-item interaction. For the PWB domain, among the three sources, the variance component of the patient-item interaction (\( \sigma^2_{pi} \)) was the largest, the variance component of the patient was the second largest, and only a small amount of variation was due to the item. Similar results were observed for three of the other four domains (SWB, EWB, and FWB); the exception was the LEUS domain, in which the variance component of the item ranked second. Given that the largest source of variation in a domain score was the patient-item interaction, we can assume that different subjects might react to the same item in different ways, despite having the same total score on the scale.
D study results
(a) D-study results under the original measurement protocol
The D-study results under the original measurement protocol are presented in Table 2. With the original test length, the G and Ф coefficients for four of the five domains were approximately equal to or greater than 0.80; the exception was the EWB domain, for which both reliability coefficients were greater than 0.70 but smaller than 0.80. The variance components of error when estimating the universe score by using the sample mean were smaller than 0.05 for all domains. Additionally, the G coefficient was larger than Ф for every domain.
Based on the weight coefficients (listed in Table 3), the variance components of the universe score and of the corresponding errors for the five domains were integrated, and then the composite G and the composite Ф were computed for the overall scale, with the results presented in Table 2. The last column in Table 2 shows that the composite G and composite Ф coefficients were both greater than 0.90 and the composite variance component of error when estimating the universe score using the sample mean was 0.009 based on the original test length.
Further analyses (in Table 3) indicate that the contribution rate for the composite universe score (CRCUS) approached the proportion of the domain score (PDS) of the original design for most domains. The greatest difference between the CRCUS and the PDS was seen in the SWB domain (the absolute and relative differences were −7.64% and −48.02%, respectively), while the difference between these two indexes for the LEUS domain was the smallest (absolute and relative differences of 0.75% and 1.94%, respectively).
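The relationship between the absolute and relative differences can be checked with a small sketch. It assumes that the PDS of a domain is its share of the 44 items and that the relative difference is the absolute difference divided by the PDS; this interpretation is not spelled out in the text, but it reproduces the reported figures. The function name is hypothetical.

```python
# Numbers of items per domain in the original 44-item scale.
N_ITEMS = {"PWB": 7, "SWB": 7, "EWB": 6, "FWB": 7, "LEUS": 17}


def relative_difference(domain, absolute_difference_pct):
    """Relative difference (%) = absolute difference (%) / PDS (%), with
    PDS assumed to be the domain's share of all items (an assumption)."""
    pds_pct = 100 * N_ITEMS[domain] / sum(N_ITEMS.values())
    return 100 * absolute_difference_pct / pds_pct


swb_relative = round(relative_difference("SWB", -7.64), 2)   # -48.02
leus_relative = round(relative_difference("LEUS", 0.75), 2)  # 1.94
```

Under this reading, the SWB domain contributes roughly half as much to the composite universe score as its share of the items would suggest.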
(b)Dstudy results under New measurement protocols
As mentioned in the methodology section, to provide guidance in the modification of the scale, several multivariate Dstudies were conducted using a threestep process.
Firstly, Fig. 1 provides a graphical representation of the effect on the two reliability coefficients (G and Ф) when the test length was changed. Figure 1 implies that a longer test length led to a larger reliability coefficient in every domain and in the overall scale. The increment speed of the two reliability coefficients from “half” to “original” was faster than that from “original” to “double”. Even though the number of items was reduced by half, the composite G coefficient was still greater than 0.90 and the composite Ф coefficient was still greater than 0.85.
As described in Table 4, at the original test length, both the G and Ф coefficients were largest for the PWB domain and smallest for the EWB domain. After the test length was halved, the G and Ф coefficients for the EWB domain were both smaller than 0.70. Given that an optimal reliability coefficient is usually defined as one that is greater than 0.80, the number of items in the EWB domain needs to be increased from the original test length, and the range of item numbers examined in the next D-study analysis was set at 6 to 12. In the same way, the numbers of items in the PWB, SWB, FWB and LEUS domains could be slightly decreased from the original test length; the range examined was 4 to 7 for three of these domains (PWB, SWB and FWB) and 9 to 17 for the LEUS domain. Information on the reallocation of the numbers of items in the five scenarios is provided in Additional file 3: Table S2.
Several multivariate D-studies were conducted for each scenario designed in the second step, and the results are presented in Fig. 2. It is clear from the five panels of Fig. 2 that both the G and Ф coefficients increased with the number of items. As shown in the first panel (Scenario A) of Fig. 2, $n_i^{G}$ and $n_i^{\Phi}$ for the PWB domain were both 5, and the composite G and Ф coefficients were both greater than 0.90. Similar results were seen in the second panel (Scenario B), except that $n_i^{G}$ and $n_i^{\Phi}$ for the SWB domain were both 7. Likewise, $n_i^{G}$ for the EWB domain was 8 and $n_i^{\Phi}$ was 10 in Scenario C, $n_i^{G}$ for the FWB domain was 5 and $n_i^{\Phi}$ was 6 in Scenario D, and $n_i^{G}$ for the LEUS domain was 12 and $n_i^{\Phi}$ was 17 in Scenario E.
The results of the two multivariate D-studies for the two decision options, which were designed from the above $n_i^{G}$ and $n_i^{\Phi}$ values for each domain, are shown in Table 5. In decision option A, where the numbers of items for the five domains were reallocated based on $n_i^{G}$, the total number of items for the overall scale could be decreased from 44 to 37. In decision option B, where the numbers of items for the five domains were reallocated based on $n_i^{\Phi}$, the total would be slightly increased from 44 to 45. The composite G and Ф coefficients were both greater than 0.90 for both decision options. In decision option A, the G coefficients for all domains were greater than or approximately equal to 0.80, but only the Ф coefficients for the PWB and SWB domains were greater than or equal to 0.80. In decision option B, the G and Ф coefficients for all domains were greater than or approximately equal to 0.80.
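The two totals can be verified by summing the per-domain item counts reported for Scenarios A to E. The short sketch below does only that tally; the dictionary names are illustrative, and the counts are those stated in the text for the G-based and Ф-based item numbers.

```python
ORIGINAL_TOTAL = 44  # total number of items in the original scale

# Per-domain item counts reported for each decision option.
option_a = {"PWB": 5, "SWB": 7, "EWB": 8, "FWB": 5, "LEUS": 12}   # based on G
option_b = {"PWB": 5, "SWB": 7, "EWB": 10, "FWB": 6, "LEUS": 17}  # based on Phi

total_a = sum(option_a.values())
total_b = sum(option_b.values())

print(f"Option A: {total_a} items ({total_a - ORIGINAL_TOTAL:+d} vs. original)")
print(f"Option B: {total_b} items ({total_b - ORIGINAL_TOTAL:+d} vs. original)")
```

The tally gives 37 items for option A and 45 items for option B, matching the totals reported in Table 5.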
Discussion
Reliability of the Chinese version of the FACT-Leu
Reliability reflects the extent to which variation in the measured results is attributable to random error. In other words, reliability refers to the dependability, reproducibility, stability, and consistency of the measurement. In GT, the G and Ф coefficients are two important reliability coefficients that describe the reliability of “relative decisions” and “absolute decisions”, respectively. Within the GT framework, a relative decision rests on a norm-referenced score interpretation, which considers the consistency of the relative standings of individuals rather than the consistency of their actual scores, whereas an absolute decision rests on a criterion-referenced score interpretation, which considers both the consistency of the relative standings and the consistency of the actual scores [16]. The choice of coefficient depends on the researcher’s interest. In HRQOL research, if the interest lies in a norm-referenced measurement that compares the HRQOL of different patients (relative decision), the G coefficient should be used to quantify the dependability of the scores. If the goal is a criterion-referenced test of patients’ HRQOL (absolute decision), the Ф coefficient should be used to indicate how dependable a score is. The Ф coefficient is typically lower than the G coefficient for every domain because the variance component of item within domain (i:h) enters the absolute error variance, and hence the Ф coefficient, but not the relative error variance or the G coefficient.
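The ordering of the two coefficients can be made explicit with the standard GT expressions for a single domain. The notation below is generic and assumed here for illustration: $\sigma^2_p$ is the universe-score (person) variance, $\sigma^2_i$ the item variance, $\sigma^2_{pi}$ the person-by-item interaction variance (confounded with residual error), and $n_i$ the number of items in the domain.

```latex
\begin{aligned}
\sigma^2(\delta) &= \frac{\sigma^2_{pi}}{n_i}, &
\sigma^2(\Delta) &= \frac{\sigma^2_{i} + \sigma^2_{pi}}{n_i}, \\[4pt]
G &= \frac{\sigma^2_{p}}{\sigma^2_{p} + \sigma^2(\delta)}, &
\Phi &= \frac{\sigma^2_{p}}{\sigma^2_{p} + \sigma^2(\Delta)}.
\end{aligned}
```

Because $\sigma^2_i \ge 0$, it follows that $\sigma^2(\Delta) \ge \sigma^2(\delta)$ and therefore $\Phi \le G$, with equality only when the item variance component is zero.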
Some researchers [31, 35, 36] have suggested that, in GT, the reliability of an instrument is generally good when the reliability coefficients (G or Ф) are greater than 0.8. At the original test length, the composite G and Ф coefficients for the Chinese version of the FACT-Leu scale were greater than 0.90, and both coefficients were approximately equal to or greater than 0.80 for four of the five domains; the exception was the EWB domain (greater than 0.70 but smaller than 0.80). Even when the number of items was halved, the composite G coefficient was still greater than 0.90 and the composite Ф coefficient was still greater than 0.85. These results indicate a very high level of measurement reliability for the Chinese version of the FACT-Leu scale as a whole, and good reliability at the domain level. Additionally, the finding that the contribution rate for the composite universe score approached the proportion of the domain score in the original design for most domains (except for the SWB domain) suggests that the allocation of the scale items was reasonable in general. However, the quality of the items in the SWB domain deserves attention in the future, because the relative difference between these two indexes was greatest in the SWB domain and its value was negative.
Influence of the number of items and suggestions for revision
The G and Ф coefficients increased with the number of items and were both greater than 0.8 in the double test length design. However, doubling the test length might not be realistic in practice, because including too many items could conversely decrease reliability owing to the excessive time required. An interesting finding in Fig. 1 was that the rate of increase in the two reliability coefficients slowed from “original” to “double”. Reliability can be expected to increase without increasing testing time if the items are reallocated among the domains. Therefore, we provided two decision options for allocating the number of items across the overall scale. As discussed above, the choice between the two decision options for further modifying the Chinese version of the FACT-Leu depends on the researcher’s interest. If the interest lies in ranking (relative decision), decision option A would be a good choice: the number of items in the EWB domain would increase from 6 to 8, the number of items in the SWB domain would remain the same, the numbers of items in the other three domains would be slightly reduced, and the total number of items for the scale would be 37. If the interest lies in absolute standings (absolute decision), decision option B would be the better choice: the number of items in the EWB domain would increase from 6 to 10, the numbers of items in the SWB and LEUS domains would remain unchanged, the numbers of items in the other two domains would be slightly reduced, and the total number of items for the scale would be 45.
Limitations
It is worth noting that the sample size of the study was not very large, which may affect the findings to some extent. In addition, the study subjects were drawn from hospital inpatients, which may affect the generalizability of the findings. Therefore, larger studies that extend to other populations, such as outpatients, are needed to validate the scale. Using MGT analysis, we have put forward suggestions about varying the number of items in each domain from a macroscopic point of view. However, MGT analysis cannot indicate which specific items should be removed. In a follow-up study, we should screen and select the items using item response theory (IRT) methods.
Conclusion
To sum up, the Chinese version of the FACT-Leu scale has good reliability as a whole based on the results of MGT, and the implementation of MGT could lead to more informed decisions in complex questionnaire design and improvement. If the Chinese version of the FACT-Leu scale is modified in the future, our analytical results offer two choices depending on the decision needed (relative decision or absolute decision): a 37-item version and a 45-item version.
Abbreviations
ALL: Acute lymphoblastic leukemia
AML: Acute myeloid leukemia
ANOVA: Analysis of variance
CLL: Chronic lymphocytic leukemia
CML: Chronic myelogenous leukemia
CRCUS: Contribution rate for composite universe score
CTT: Classical test theory
D-study: Decision study
EORTC QLQ-CML24: EORTC for Chronic Myeloid Leukaemia
EWB: Emotional Well-being (FACT-Leu)
FACT-G: Functional Assessment of Cancer Therapy-General module
FACT-Leu: Functional Assessment of Cancer Therapy–Leukemia
FWB: Functional Well-being (FACT-Leu)
G: Generalizability coefficients
G-study: Generalizability study
GT: Generalizability theory
HRQOL: Health-related quality of life
ICC: Intraclass correlation coefficients
IRT: Item response theory
LEUS: Leukemia-specific subscale (FACT-Leu)
MANOVA: Multivariate analysis of variance
MGT: Multivariate generalizability theory
MRC/EORTC QLQ-LEU: Medical Research Council/EORTC Quality of Life Questionnaire-Leukaemia Module
PDS: Proportion of domain score
PWB: Physical Well-being (FACT-Leu)
QLICP-LE: Quality of Life Instrument for Cancer Patients-Leukemia
SWB: Social/Family Well-being (FACT-Leu)
UAO: Universe of assessable observations
UG: Universe of generalization
UGT: Univariate generalizability theory
Ф: Indexes of dependability
References
1. Chen W, Zheng R, Baade P, Zhang S, Zeng H, Bray F, Jemal A, Yu X, He J. Cancer Statistics in China, 2015. CA Cancer J Clin. 2016;66:115–32.
2. Redaelli A, Stephens JM, Laskin BL, Pashos CL, Botteman MF. The burden and outcomes associated with four leukemias: AML, ALL, CLL and CML. Expert Rev Anticancer Ther. 2003;3:311–29.
3. Stalfelt AM, Wadman B. Assessing quality of life in leukemia: presentation of an instrument for assessing quality of life in patients with blood malignancies. Qual Assur Health Care. 1993;5:201–11.
4. Watson M, Zittoun R, Hall E, Solbu G, Wheatley K. A modular questionnaire for the assessment of long-term quality of life in leukaemia patients: the MRC/EORTC QLQ-LEU. Qual Life Res. 1996;5:15–9.
5. Efficace F, Baccarani M, Breccia M, Saussele S, Abel G, Caocci G, Guilhot F, Cocks K, Naeem A, Sprangers M, et al. International development of an EORTC questionnaire for assessing health-related quality of life in chronic myeloid leukemia patients: the EORTC QLQ-CML24. Qual Life Res. 2014;23:825–36.
6. Cella D, Jensen SE, Webster K, Hongyan D, Lai JS, Rosen S, Tallman MS, Yount S. Measuring health-related quality of life in leukemia: the Functional Assessment of Cancer Therapy-Leukemia (FACT-Leu) questionnaire. Value Health. 2012;15:1051–8.
7. Cella DF, Tulsky DS, Gray G, Sarafian B, Linn E, Bonomi A, Silberman M, Yellen SB, Winicour P, Brannon J. The functional assessment of cancer therapy scale: development and validation of the general measure. J Clin Oncol. 1993;11:570–9.
8. Gage NA, Debra P, Hirn R. Increasing reliability of direct observation measurement approaches in emotional and/or behavioral disorders research using generalizability theory. Behav Disord. 2014;39:228–44.
9. Berggraf L, Ulvenes PG, Wampold BE, Hoffart A, McCullough L. Properties of the Achievement of Therapeutic Objectives Scale (ATOS): a generalizability theory study. Psychother Res. 2012;22:327–47.
10. Hall C. Comment: Generalizability theory and assessment in medical training. Neurology. 2015;85:1628.
11. Iramaneerat C, Yudkowsky R, Myford CM, Downing SM. Quality control of an OSCE using generalizability theory and many-faceted Rasch measurement. Adv Health Sci Educ Theory Pract. 2008;13:479–93.
12. Kang M, Kim Y, Rowe D. Reliability of peak stepping cadences using generalizability theory. Med Sci Sports Exerc. 2013;45:327.
13. Preuss R. Using generalizability theory to develop clinical assessment protocols. Phys Ther. 2013;93:562–9.
14. Prion S, Gilbert G, Haerling K. Generalizability theory: an introduction with application to simulation evaluation. Clin Simul Nurs. 2016;12:546–54.
15. Ragan BG, Kang M, Marquez T, Bell GW, Zhu W. Graphic pain rating scale reliability using generalizability theory. Med Sci Sports Exerc. 2004;36:S295.
16. Schünemann HJ, Norman G, Puhan MA, Ståhl E, Griffith L, Heels-Ansdell D, Montori VM, Wiklund I, Goldstein R, Mador MJ, Guyatt GH. Application of generalizability theory confirmed lower reliability of the standard gamble than the feeling thermometer. J Clin Epidemiol. 2007;60:1256–62.
17. Stegner A, Cook D. Evaluating reliability and response bias in computer-administered pain rating scales: a generalizability theory analysis. J Pain. 2013;14:S22.
18. Vangeneugden T, Laenen A, Geys H, Renard D, Molenberghs G. Applying concepts of generalizability theory on clinical trial data to investigate sources of variation and their impact on reliability. Biometrics. 2005;61:295–304.
19. Volpe RJ, Briesch AM, Gadow KD. The efficiency of behavior rating scales to assess inattentive-overactive and oppositional defiant behaviors: Applying generalizability theory to streamline assessment. J Sch Psychol. 2011;49:131–55.
20. Wasserman RH, Levy KN, Loken E. Generalizability theory in psychotherapy research: the impact of multiple sources of variance on the dependability of psychotherapy process ratings. Psychother Res. 2009;19:397–408.
21. Paul CC, Jennifer J, Robert G, Mart Beth CG, Connolly G, Sarah RK, Jessica LH, Xin T. A generalizability theory analysis of group process ratings in the treatment of cocaine dependence. Psychother Res. 2011;21:252–66.
22. Yang ZM, Zhang L. Generalizability theory and its applications. Beijing: Educational Science Publishing House; 2003. p. 2–53.
23. Chen D, Hu BY, Fan X, Li K. Measurement quality of the Chinese early childhood program rating scale: an investigation using multivariate generalizability theory. J Psychoeduc Assess. 2014;32:236–48.
24. Chavez LM, Garcia P, Ortiz N, Shrout PE. Applying generalizability theory methods to assess continuity and change on the Adolescent Quality of Life-Mental Health Scale (AQOL-MHS). Qual Life Res. 2016;25:3191–6.
25. Wan C, Li H, Fan X, Yang R, Pan J, Chen W, Zhao R. Development and validation of the coronary heart disease scale under the system of quality of life instruments for chronic diseases QLICD-CHD: combinations of classical test theory and generalizability theory. Health Qual Life Outcomes. 2014;12:82.
26. Lei P, Lei G, Tian J, Zhou Z, Zhao M, Wan C. Development and validation of the irritable bowel syndrome scale under the system of quality of life instruments for chronic diseases QLICD-IBS: combinations of classical test theory and generalizability theory. Int J Color Dis. 2014;29:1245–55.
27. Wu J, Hu L, Zhang G, Liang Q, Meng Q, Wan C. Development and validation of the nasopharyngeal cancer scale among the system of quality of life instruments for cancer patients (QLICP-NA V2.0): combined classical test theory and generalizability theory. Qual Life Res. 2016;25:2087–100.
28. Meng Q, Yang Z, Wan C, Luo J, Dai Y, Cun Y. Reliability analysis of quality of life instruments for cancer patients-gastric cancer (QLICP-GA) based on generalizability theory. Tumor. 2013;33:428–33.
29. Li W, Luo J, Wan C, Li G, Lu Y, Yang H, Meng Q. Evaluation on reliability of the Chinese version of functional assessment of cancer therapy - ovary cancer by classical test theory and generalizability theory. Chin Gen Pract. 2013;16:749–54.
30. Bravo G, Sene M, Arcand M. Reliability of health-related quality-of-life assessments made by older adults and significant others for health states of increasing cognitive impairment. Health Qual Life Outcomes. 2017;15:4.
31. Nussbaum A. Multivariate generalizability theory in educational measurement: an empirical study. Appl Psychol Meas. 1984;8:219–30.
32. Wu YF, Tzou H. A multivariate generalizability theory approach to standard setting. Appl Psychol Meas. 2015;39:507–24.
33. Brennan RL. Generalizability theory. New York: Springer-Verlag New York Inc; 2001.
34. Brennan RL. Manual for mGENOVA (Version 2.1). Iowa City: Iowa Testing Programs; 2001. p. 50.
35. Winterstein BP, Willse JT, Kwapil TR, Silvia PJ. Assessment of Score Dependability of the Wisconsin Schizotypy Scales Using Generalizability Analysis. J Psychopathol Behav Assess. 2010;32:575–85.
36. Keller LA, Clauser BE, Swanson DB. Using multivariate generalizability theory to assess the effect of content stratification on the reliability of a performance assessment. Adv Health Sci Educ. 2010;15:717–33.
Acknowledgements
While carrying out this research project, we received substantial assistance from the staff of the First People’s Hospital of Yunnan Province and the First Affiliated Hospital of Kunming Medical University. We sincerely acknowledge this support.
Funding
Supported by the National Natural Science Foundation of China (81273185, 81302510).
Availability of data and materials
Please contact the authors for data requests.
Authors’ contributions
MQ and YZ analyzed and interpreted the data and drafted the manuscript, which was critically revised by all others. WC and LX contributed to the design of the study and supervised statistical analysis. WY and XG were involved in the data collection. XG, as a major clinical expert, validated the clinical aspects of the article. YX was a major contributor in revising the manuscript. ZM carried out the data analysis and literature review. All authors have read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
No individual’s personal data has been included.
Ethics approval and consent to participate
The study protocol was approved by the Institutional Review Board (IRB) of the investigators’ institutions and the hospital. The respondents were anonymous, voluntary and provided consent for participation.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional files
Additional file 1:
Data Structure and a Screenshot. (DOC 365 kb)
Additional file 2:
Example Code of mGENOVA. (DOC 23 kb)
Additional file 3:
Supplementary Results. Table S1. General characteristics of the patients included. Table S2. Allocation of the numbers of items per domain for the five scenarios. (DOC 53 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Keywords
 Acute leukemia
 Chronic leukemia
 Quality of life
 Evaluation studies
 Multivariate generalizability theory
 Reliability