Development and validation of the coronary heart disease scale under the system of quality of life instruments for chronic diseases QLICD-CHD: combinations of classical test theory and Generalizability theory

Background: Quality of life (QOL) in patients with coronary heart disease (CHD) is receiving attention worldwide, yet disease-specific instruments are scarce and none has been developed by the modular approach.
Objectives: This paper aims to develop the CHD scale of the system of Quality of Life Instruments for Chronic Diseases (QLICD-CHD) by the modular approach and to validate it using both classical test theory and Generalizability Theory.
Methods: The QLICD-CHD was developed through programmed decision procedures involving multiple nominal group and focus group discussions, in-depth interviews, pre-testing and quantitative statistical procedures. QOL was measured three times, before and after treatment, in 146 inpatients with CHD. The psychometric properties of the scale were evaluated with respect to validity, reliability and responsiveness using correlation analysis, factor analysis, multi-trait scaling analysis, t-tests, and the G studies and D studies of Generalizability Theory.
Results: Multi-trait scaling analysis, correlation analysis and factor analysis confirmed good construct validity, and criterion-related validity was supported using the SF-36 as the criterion. The internal consistency α and the test-retest reliability coefficients (Pearson r and intra-class correlation, ICC) for the overall instrument and all domains were higher than 0.70 and 0.80, respectively. The overall score and all domains except the social domain changed significantly after treatment, with moderate effect sizes (standardized response mean, SRM) ranging from 0.32 to 0.67. G-coefficients and indices of dependability (Φ coefficients) further confirmed the reliability of the scale with more exact variance components.
Conclusions: The QLICD-CHD has good validity and reliability, moderate responsiveness and some distinctive features, and can be used as a quality of life instrument for patients with CHD. However, to obtain better reliability, either the number of items in the social domain should be increased or the quality of the items, rather than their quantity, should be improved.


Background
Coronary heart disease (CHD) is the leading cause of morbidity and mortality in adults worldwide [1,2]. In Germany, the prevalence of CHD in the general population ranges from 6.5% (women) to 9.1% (men) [3]. In the United States, CHD is the number one cause of death among men and women, causes 1 of every 5 deaths, and accounted for an estimated $177 billion in direct and indirect costs in 2010 [4]. Based on data from the National Health and Nutrition Examination Survey 2005 to 2008, an estimated 16,300,000 American adults have CHD, with a prevalence among those older than 20 years of 7.0% overall, 8.3% in men and 6.1% in women [5]. In China, CHD is the second leading cause of cardiovascular death, accounting for 22% of cardiovascular deaths in urban areas and 13% in rural areas [6]. In 2004, the age-adjusted CHD mortality among the population aged >35 years, using the new world standard population, was 128.0 per 100,000 per year for urban men, 97.8 for urban women, 79.7 for rural men and 57.3 for rural women [6]. An epidemiological study showed that about 1,300,000 new cases of CHD are diagnosed in China each year [7], and the incidence of CHD in China is steadily increasing [8]. It was estimated that three-fourths of global deaths and 82% of the total disability-adjusted life years lost due to CHD occurred in middle-income countries [9].
The measurement of quality of life as an indicator of health outcome in patients with CHD has grown rapidly and substantially, given the long duration of the disease and its many symptoms and treatment side effects. The World Health Organization (WHO) defines quality of life (QOL) as individuals' perceptions of their position in life in the context of the culture and value systems in which they live and in relation to their goals, expectations, standards and concerns. It includes aspects of health such as physical functioning, social and role functioning, mental health, and general health perceptions that people experience directly. QOL is therefore an increasingly important outcome in the study of diseases and a suitable endpoint in cardiac populations, including in terms of long-term prognosis. In the clinical course of CHD, patients' quality of life may be affected in many ways, including symptoms of angina and heart failure, limited exercise capacity resulting from these symptoms, the physical debility they cause, and the psychological stress associated with chronic disease. Many studies have demonstrated that assessing changes in QOL can be a useful complement to the clinical management of CHD by assisting in monitoring disease severity and progression [10][11][12]. Although generic instruments such as the SF-36 and EuroQol EQ-5D are widely used for evaluating QOL in CHD, they do not capture the symptoms and side effects specific to CHD [13].
Thus, several disease-specific QOL instruments for CHD have been developed [13][14][15][16][17][18][19][20], including the Seattle Angina Questionnaire (SAQ) [14], Quality of Life after Myocardial Infarction (QLMI) [15], the MacNew Heart Disease Quality of Life instrument [16], the Minnesota Living with Heart Failure questionnaire (MLHF) [17], the Angina Pectoris Quality of Life Questionnaire (APQLQ) [18], the Myocardial Infarction Dimensional Assessment Scale (MIDAS) [19] and the Cardiovascular Limitations and Symptoms Profile (CLASP) [20]. However, these instruments are appropriate for either angina pectoris or myocardial infarction only, and none was developed by the popular modular approach, that is, a general/core module plus specific modules. The modular approach allows instruments to be developed quickly and easily, and the resulting scale has a well-characterized structure in which the general module captures the properties shared by a group of related diseases, while the disease-specific module characterizes the distinctive features of each disease [21][22][23]. Moreover, the existing instruments reflect Chinese culture only to a limited extent, since they were originally designed for English-speaking patients. For example, Taoism and traditional medicine emphasize good temper and high spirits, and good appetite and sleep are highly regarded in daily life, with food culture being very important. This kind of cultural dependence is not reflected in most QOL instruments developed in other languages.
In response to this need, we have developed a system of Quality of Life Instruments for Chronic Diseases (QLICD, V1.0) by combining a general QOL module with disease-specific modules, guided by classical test theory (CTT), Generalizability Theory (GT) and Item Response Theory (IRT) [23,24]. The general module, called QLICD-GM, can be used with all types of chronic disease patients, while each specific module addresses the lack of specificity in the general module by capturing the aspects of QOL unique to the disease in question [23,24]. For example, the hypertension instrument QLICD-HY is constructed by combining QLICD-GM with the specific module for hypertension [24]. Similarly, the coronary heart disease instrument (QLICD-CHD) is constructed by combining QLICD-GM with the specific module for this disease. In this paper, we describe the development of the QLICD-CHD and study its validation.

Establishment of the general module (QLICD-GM)
Following the WHO definition of QOL [25], a nominal group of 16 individuals and a focus group of 10 experts, including physicians, nurses and medical researchers, were formed to propose the conceptual framework and select items using the programmed decision method [23,24]. Item selection was based not only on qualitative analysis (nominal group and focus group discussions and in-depth interviews) but also on four quantitative statistical procedures: the variation procedure, the correlation procedure, the factor analysis procedure and the cluster analysis procedure. The entire process of developing the QLICD-GM has been described in detail elsewhere [23], and its main steps are summarized in a schematic diagram. The final QLICD-GM included 30 items (selected from a 73-item pool), which were classified into 3 domains and 10 facets: the physical domain with 8 items (coded PH1-PH8), the psychological domain with 11 items (coded PS1-PS11) and the social domain with 11 items (coded SO1-SO11) (see Table 1 for details). This scale was shown to have good validity and reliability, and better responsiveness than the SF-36, based on data from 620 inpatients with seven chronic diseases: hypertension, coronary heart disease, chronic gastritis, peptic ulcer, COPD, chronic obstructive lung disease, and chronic pulmonary heart disease [23].

Establishment of the specific module
After the QLICD-GM had been developed, twenty-five items reflecting the symptoms, side effects and special mental health aspects of CHD were selected to form the item pool of the specific module. A developmental process similar to the one described above for the general module was used to obtain the final module, which consists of 16 items, coded CHD1-CHD16, classified into 3 facets (see Table 1).

Validation of the QLICD-CHD

Data collection and scoring
The formal QLICD-CHD (the general module QLICD-GM plus the specific module) described above was administered to patients with CHD in a field survey in order to study its psychometric properties (validity, reliability and responsiveness). The study population was limited to CHD inpatients, at any disease stage and under any treatment, who were able to read and understand the questionnaires. The participating investigators (doctors and medical post-graduate students) explained the trial and the scale to the patients and obtained informed consent from those who agreed to participate. Each patient (n = 146) was asked to complete the questionnaires unaided at the time of admission to the hospital. A random subsample of 50 patients also completed a second assessment on the day after admission so that test-retest reliability could be calculated. All patients available at the third scheduled assessment time-point (111 cases) completed the measures at discharge (after approximately 1 week of treatment) so that the responsiveness of the questionnaire could be evaluated. The investigators checked the answers immediately each time to ensure completeness. If missing values were found, the questionnaire was returned to the patient to fill in the missing items.
The Chinese version of the SF-36 [26], which has eight domains, namely Physical Function (PF), Role-Physical (RP), Bodily Pain (BP), General Health (GH), Vitality (VT), Social Function (SF), Role-Emotional (RE) and Mental Health (MH), was also administered to provide data for assessing the criterion-related validity of the QLICD-CHD, as well as its convergent and discriminant validity.
Based on the data collected, the raw scores of the items, domains and overall scale were calculated. Each item of the QLICD-CHD is rated on a five-level Likert scale: not at all, a little bit, somewhat, quite a bit, and very much. The positively stated items are scored directly from 1 to 5, while the negatively stated items are reverse-scored. For comparison purposes, all domain scores were linearly converted to a 0-100 scale using the formula SS = (RS - Min) × 100 / R, where SS, RS, Min and R represent the standardized score, raw score, minimum score, and range of scores, respectively.
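The linear conversion above can be illustrated with a minimal Python sketch. It assumes that a negatively stated item is reversed as 6 minus the response and that a domain raw score is the sum of its item scores; neither detail is stated explicitly in the text, and the function names are hypothetical.

```python
def reverse_score(item_score, low=1, high=5):
    """Reverse a negatively stated item (assumed rule: low + high - score)."""
    return low + high - item_score


def standardized_score(raw_score, n_items, low=1, high=5):
    """Apply SS = (RS - Min) * 100 / R to a domain raw score.

    Assumes the raw score is the sum of n_items item scores,
    each ranging from `low` to `high`.
    """
    minimum = n_items * low                # lowest possible raw score
    score_range = n_items * (high - low)   # R: max raw score minus min raw score
    return (raw_score - minimum) * 100 / score_range


# Example: an 8-item domain where every item is answered "somewhat" (3)
print(standardized_score(raw_score=24, n_items=8))
```

Under these assumptions the lowest possible raw score maps to 0 and the highest to 100, so domains with different numbers of items become directly comparable.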

Statistical analysis for psychometrics
The validity, reliability, and responsiveness of the QLICD-CHD were analyzed. Validity is the degree to which the instrument measures what it is supposed to measure, and several types of validity are distinguished [27,28]. Construct validity was evaluated by Pearson's correlation coefficient r (item-domain/facet correlations) as well as by factor analysis with Varimax rotation. Multi-trait scaling analysis [29] was employed to test item convergent and discriminant validity using two criteria: (1) convergent validity is supported when an item-domain correlation is 0.40 or greater; (2) discriminant validity is supported when an item's correlation with its own domain is higher than its correlations with other domains. Because there is no agreed-upon gold standard, criterion-related validity was evaluated by correlating corresponding domains of the QLICD-CHD and the SF-36. Relatively high correlations among conceptually related domains and relatively low correlations among conceptually distinct domains would suggest high criterion-related validity. These correlations can also demonstrate convergent and discriminant validity, because both involve comparing logically related measures to see whether they are correlated more strongly (convergent) or more weakly (discriminant).
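The two multi-trait scaling criteria can be sketched as follows. This is only an illustration: the function names and toy data are hypothetical, and the sketch correlates the item with the domain scores as supplied, without any correction for item overlap, since the text does not say whether such a correction was applied.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def scaling_check(item, own_domain, other_domains, threshold=0.40):
    """Check the convergent and discriminant criteria for one item."""
    r_own = pearson(item, own_domain)
    r_others = [pearson(item, d) for d in other_domains]
    return {
        "convergent": r_own >= threshold,                    # criterion (1)
        "discriminant": all(r_own > r for r in r_others),    # criterion (2)
    }


# Toy data for five respondents (illustrative values only)
item = [1, 2, 3, 4, 5]
own = [10, 12, 15, 17, 20]
others = [[3, 1, 4, 1, 5]]
print(scaling_check(item, own, others))
```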
Reliability is the degree to which an instrument is free from random error, and it is most often evaluated in terms of internal consistency and reproducibility. Internal consistency, which refers to the homogeneity of the items of the scale, was assessed by Cronbach's alpha coefficient for each domain/facet. A high internal consistency suggests that the scale is measuring a single construct. Reproducibility (test-retest reliability) establishes the stability of an instrument over time in a stable population [28]. It was evaluated by Pearson's correlation coefficient between the first and second assessments, and by the intra-class correlation (ICC) using the absolute-agreement definition for single measures under the two-way mixed model [30,31]. Patients were considered stable if they had not received treatment by the day after admission.
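For reference, Cronbach's alpha for a domain with k items is k/(k-1) × (1 − Σ item variances / variance of the total score). A minimal Python sketch of this standard formula is given below; the function name and toy data are illustrative and are not taken from the study.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a respondents-by-items score matrix.

    rows: list of respondents, each a list of k item scores.
    Uses sample variances (n - 1 denominator) throughout.
    """
    k = len(rows[0])
    item_vars = [variance([r[j] for r in rows]) for j in range(k)]
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Toy example: four respondents answering two items
print(round(cronbach_alpha([[1, 2], [2, 1], [3, 3], [4, 4]]), 4))
```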
Responsiveness is the instrument's ability to detect clinically important change over time. It was measured by comparing the mean scores of the two assessments before and after treatment using paired t-tests, as well as by the effect size SRM (standardized response mean) [32,33].
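The SRM is conventionally the mean of the paired score changes divided by the standard deviation of those changes, and the paired t statistic equals the SRM multiplied by the square root of the number of pairs. The sketch below illustrates this relationship with hypothetical function names and made-up scores.

```python
from math import sqrt
from statistics import mean, stdev


def srm(before, after):
    """Standardized response mean: mean change / SD of change."""
    changes = [a - b for b, a in zip(before, after)]
    return mean(changes) / stdev(changes)


def paired_t(before, after):
    """Paired t statistic, which equals SRM * sqrt(n)."""
    return srm(before, after) * sqrt(len(before))


# Toy standardized scores at admission and at discharge (illustrative only)
before = [50, 60, 55, 65]
after = [55, 70, 60, 75]
print(srm(before, after), paired_t(before, after))
```

Because t grows with the square root of n while the SRM does not, a small change can reach statistical significance in a large sample and still have a small SRM, which is why both quantities are reported.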

Generalizability theory analysis
Besides classical test theory analysis above, we also applied Generalizability Theory (G theory) to investigate the score dependability of the QLICD-CHD. G theory has been presented as a way to refine the designs of measurement procedures in an attempt to yield reliable data [34][35][36][37]. Serving as an alternative to the more familiar classical measurement theory, which yields the less useful intra-class correlation coefficients, G theory addresses the dependability of measurements and allows for the simultaneous estimation of multiple sources of variance, including interactions. Thus, a distinction is made between 2 types of studies: G studies and D studies. A G study quantifies the amount of variance associated with the different facets (factors) that are being examined. A D study provides information about which protocols are optimal for a particular measurement situation by generating Generalizability (G) coefficients that can be interpreted as reliability coefficients across various facets of the study.
In our research, G studies and D studies were performed to estimate the variance components and dependability coefficients in a one-facet person-by-item design (p × i design). We defined the quality of life of patients as the object of measurement and items as the single facet of measurement error. Because every person is asked to respond to all items, the design is a one-facet crossed design [34][35][36][37]. For the G study, a universe of admissible observations, which consists of the object of measurement and the measurement error facets, is defined and the variance components are estimated. For the D study, a universe of admissible generalizations, which represents the measurement conditions (the object of measurement and the facets) that a researcher is willing to generalize over, is defined and the variance components associated with it are estimated.
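A minimal sketch of how a one-facet crossed p × i analysis can be carried out is shown below. It uses the textbook ANOVA (expected mean squares) estimators and is not necessarily the software or estimation method used in this study; the function names are hypothetical, and truncating negative estimates to zero is a common convention adopted here as an assumption.

```python
def g_study(scores):
    """Estimate variance components for a crossed p x i design.

    scores: list of persons, each a list of item scores (no missing data).
    Returns components for persons ("p"), items ("i") and the
    person-by-item interaction confounded with residual error ("pi,e").
    """
    n_p, n_i = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n_p * n_i)
    p_means = [sum(row) / n_i for row in scores]
    i_means = [sum(row[j] for row in scores) / n_p for j in range(n_i)]
    ms_p = n_i * sum((m - grand) ** 2 for m in p_means) / (n_p - 1)
    ms_i = n_p * sum((m - grand) ** 2 for m in i_means) / (n_i - 1)
    ss_res = sum((scores[p][j] - p_means[p] - i_means[j] + grand) ** 2
                 for p in range(n_p) for j in range(n_i))
    ms_pi = ss_res / ((n_p - 1) * (n_i - 1))
    return {
        "p": max((ms_p - ms_pi) / n_i, 0.0),   # negative estimates set to 0
        "i": max((ms_i - ms_pi) / n_p, 0.0),
        "pi,e": ms_pi,
    }


def d_study(vc, n_items):
    """Return (G coefficient, Phi) for a design with n_items items."""
    rel_err = vc["pi,e"] / n_items                 # relative error variance
    abs_err = (vc["i"] + vc["pi,e"]) / n_items     # absolute error variance
    return vc["p"] / (vc["p"] + rel_err), vc["p"] / (vc["p"] + abs_err)


# Toy data: three persons by three items (illustrative only)
vc = g_study([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
print(vc, d_study(vc, n_items=3))
```

In this formulation the item main effect enters only the absolute error, so Φ can never exceed the G coefficient for the same number of items.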

Construct validity
Correlation analyses showed strong associations between items and their own domains/facets (most correlation coefficients were higher than 0.5) but weak associations between items and other domains/facets (see Table 1). For example, the correlation coefficients between PHD and items PH1-PH8 (in bold) are higher than those between PHD and other items. In particular, the correlation coefficients between items and their own facets were much larger than those between items and other facets. Factor analysis extracted 8 principal components (initial eigenvalues >1) from the 30 items of the general module (QLICD-GM), accounting for 68.1% of the cumulative variance. By using the Varimax rotation method, it can be seen that the 8 principal components ... From the results above, the theoretical construct was generally confirmed by the data analysis, showing good construct validity.

Criterion-related validity
Correlation coefficients among the domain scores of the QLICD-CHD and the SF-36 are presented in Table 2, showing that the correlations between the same or similar domains are generally higher than those between different or dissimilar domains. For example, the coefficient between the physical domain of the QLICD-CHD and the physical function domain of the SF-36 was 0.61, higher than any other coefficient in that row. Similarly, the coefficient between the psychological domain of the QLICD-CHD and the mental health domain of the SF-36 was 0.49, higher than any other coefficient in that row (PCS and MCS are not mutually exclusive domains of the SF-36).
These confirmed the criterion-related validity to a reasonable degree and also demonstrated the convergent and divergent validity to some extent.

Reliability
The reliability of the scale was evaluated by three procedures: internal consistency, test-retest correlation and ICC (see Table 3 for details). Cronbach's α for all domains and facets was computed from the measurements at admission because of the larger sample size. As can be seen in Table 3, Cronbach's α was higher than 0.70 for all four domains and for most facets. The test-retest correlation coefficients (r) for the 4 domains and 13 facets of the QLICD-CHD ranged from 0.61 to 0.98, with r = 0.90 for the overall scale and a minimum of r = 0.80, for SPD, among the four mutually exclusive domains. Paired t-tests showed no statistically significant differences between the first and second assessments (P > 0.05) for all domains and for most facets, the exceptions being the Independence and Symptom facets. The ICC results were very similar to the Pearson correlation coefficients (r).

Reliability from generalizability theory
The estimated G-study results are provided in Table 4 for the current design, in which 146 patients completed the 46-item quality of life instrument QLICD-CHD. For the physical domain, 67.22% of the variance was attributable to the person-by-item interaction and 28.30% to persons, while only a small share (4.48%) was due to items. Because the largest source of variation in this domain score is the person-by-item interaction, different people may understand and react to the same item in different ways even when they have the same total score on the scale. Similarly, the person-by-item interaction was the largest source of variation in the other domains, with the person variance ranking second (except in the social domain, where the item variance ranked second).
The D studies were performed to estimate G-coefficients and Φ coefficients for the current design and for alternative designs with varied numbers of items for the four domains of the QLICD-CHD, with the results presented in Table 5. For the current design, the reliability coefficients were acceptable (G and Φ coefficients >0.70) for three of the four domains, the exception being the social domain. In addition, Table 5 shows the effect of the number of items (from 6 to 22) on reliability, with G ranging from 0.59 to 0.92 and Φ ranging from 0.55 to 0.91.

Responsiveness
Table 6 shows that significant changes occurred in the physical, psychological and specific domains, as well as in the sub-total (QLICD-GM) and the overall scale (P < 0.01), with effect sizes (SRM) ranging from 0.32 to 0.67. At the facet level, five of the thirteen facets changed significantly, with effect sizes (SRM) ranging from 0.20 to 0.88.

On development approach and advantages
Table 3 Reliability of the quality of life instrument QLICD-CHD (n = 146 for α, n = 50 for r and ICC)

Since diseases of the same class, such as cancers, share many features, an approach widely adopted in recent years for developing QOL instruments for diseases within a common class is to combine a general module for the entire class with specific modules for individual diseases, capturing both the features common to the class and the differences among its members. This approach can substantially reduce the time and effort needed to develop new instruments. Both the QLQs from EORTC and the FACTs from CORE for QOL assessment of cancer patients have been developed on this modular principle [21,38]. Unlike these two systems of QOL instruments, we employed the modular approach to develop, systematically and more efficiently, a system of new instruments for chronic diseases directly, with QLICD-GM forming the general module and QLICD-CHD being the specific scale for coronary heart disease. This modular approach unifies all disease-specific instruments of the QLICD system, which share the same general module with similar constructs. To our knowledge, although a number of instruments have been widely used to study the impact of CHD on patients' QOL, none was developed directly by the modular approach. The QLICD-CHD therefore has several advantages over existing instruments [23,24]. First, it allows HRQOL to be compared across diseases through the general module while capturing symptoms and side effects through the specific module, thus combining generic and specific properties. Second, it consists of a moderate number of items with a clear hierarchical structure (items → facets → domains → overall), so that mean scores can be computed not only at the domain level (four domains) and the overall level but also at the facet level (13 facets) to detect changes in greater detail. Users can select either one or both levels for the study at hand.
Third, and perhaps most important, is the strong Chinese cultural background underlying the QLICD-CHD. For example, Chinese culture places great emphasis on family relationships and kinship, diet, temperament and high spirits, which are probed by QLICD-CHD items covering appetite, sleep, energy and family support. Specifically, items such as PH6 'Have you had a good appetite?', PH7 'Were you satisfied with your sleep?', PH2 'Have you felt fatigue easily?', SO4 'Have you had good relations with your families?', and SO5 'Could you acquire material and emotional help and support from your family when you need?' reflect these aspects in detail.
σ²(δ) is the variance component of relative error. σ²(Δ) is the variance component of absolute error. σ²(X̄_PI) is the error variance when estimating the universe score by the sample mean. Eρ² is the generalizability coefficient. Φ is the index of dependability.

On psychometrics
Generally, a practical QOL instrument must be validated with respect to at least three aspects: validity, reliability and responsiveness. Instrument validity is the extent to which an instrument captures what it purports to measure. Following the WHO definition of QOL and the programmed decision procedures, we developed the QLICD-CHD using focus group discussions, in-depth interviews and pre-testing to reduce the number of items in the final version to 30 from an initial pool of 73 items for the general module, and to 16 from an initial pool of 25 items for the specific module, ensuring good content validity and a sound conceptual structure. Correlation analyses showed strong associations between items and their own domains/facets but weak correlations between items and other domains/facets. Factor analysis revealed that the components extracted from the data broadly coincide with the theoretical construct of the instrument. These results confirmed good construct validity. Correlation coefficients between the domain scores of the QLICD-CHD and the SF-36 supported criterion-related validity to a reasonable degree and convergent and divergent validity to some extent.
Reliability refers to the reproducibility or consistency of item scores from one assessment to another. Test-retest reliability (r), ICC and internal consistency (Cronbach's α) are the most frequently used indicators and were tested in the current study. It is well recognized that internal consistency (α) should be at least 0.70 and that test-retest reliability (r) should be above 0.80 [32]. Our results in Table 3 therefore show that this instrument has good reliability, since all α values were higher than 0.70 and all r (ICC) values were greater than 0.80 at the domain level.
Methods for assessing responsiveness can generally be divided into two categories: internal and external [32,33]. In this paper we focused on internal responsiveness, on the hypothesis that a sensitive instrument should detect changes when they occur after treatment. The SRM is a good indicator of effect size, with values of 0.20, 0.50 and 0.80 representing small, moderate and large responsiveness, respectively [32,33]. As seen from Table 6, QOL scores changed significantly after treatment for three of the four domains as well as for the overall score (P < 0.05), with SRM equal to 0.52, 0.28, 0.62 and 0.56. Given that it is reasonable to expect no statistically significant post-treatment change in the social domain and in some facets pertaining to stable traits, the QLICD-CHD seems to have good responsiveness.

On analysis of generalizability theory
Traditionally, a scale is assessed by classical test theory analysis; in this research, Generalizability Theory was also applied through both a G study and a D study. Which coefficient is selected depends on the researcher's interests. If the interest lies in ranking people (relative decisions), the G-coefficient indicates how dependable a score is. If the interest lies in a person's absolute standing relative to a criterion (absolute decisions), the index of dependability Φ reflects the score dependability. The index of dependability is typically lower than the G-coefficient because it takes into account the main effects of the error facets in addition to the interaction effects used for the G-coefficient. This research presented both G-coefficients and Φ, and also how they change as the number of items is varied. For the social domain, we estimated a G-coefficient of 0.64 and an index of dependability of 0.60 for the current design, both slightly below the acceptable level of 0.70. Hence, the items of this domain need improvement. For an alternative design with 17 items, the G-coefficient was estimated to be 0.74 and the index of dependability 0.70. It would therefore be better to increase the number of items in the social domain from 11 to 17 in order to reach acceptable dependability. For the other domains, the G-coefficients and indices of dependability were all greater than 0.70 for the current design and changed only slightly as the number of items varied. The current items can therefore be considered reasonable and acceptable for these domains.
To sum up, the Generalizability Theory analysis further confirmed the reliability of the scale. However, the number of items in the social domain should be increased in order to obtain better reliability.
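The effect of changing the number of items can be approximated from the reported coefficients alone. In a one-facet p × i design the error variance is divided by the number of items, so a coefficient c obtained with n items projects to kc / (1 + (k − 1)c) for n' = kn items, the same form as the Spearman-Brown formula. The sketch below applies this to the rounded social-domain values quoted above; the function name is hypothetical, and the calculation assumes that the variance components themselves stay unchanged when items are added.

```python
def project_coefficient(coef, n_current, n_new):
    """Project a G or Phi coefficient from n_current to n_new items.

    Assumes a one-facet p x i design in which the error variance
    scales as 1 / (number of items).
    """
    k = n_new / n_current
    return k * coef / (1 + (k - 1) * coef)


# Social domain: reported G = 0.64 and Phi = 0.60 with 11 items
print(round(project_coefficient(0.64, 11, 17), 2))
print(round(project_coefficient(0.60, 11, 17), 2))
```

With these rounded inputs the projection gives about 0.73 for G and 0.70 for Φ, close to the reported 0.74 and 0.70; the small difference in G is consistent with rounding of the published coefficients.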

Study limitations
It is worth noting that the sample size of the study is not very large, which may affect the findings, especially those from the factor analysis (146 cases vs 30 variables for the general module). Although correlation analysis was conducted in parallel to examine the construct, which mitigates this limitation to some extent, additional larger studies are needed to validate the instrument further. Moreover, the subjects in this study were selected from hospital inpatients. Additional studies are needed to assess the generalizability of the instrument to other settings and populations, such as outpatients at local clinics.
In summary, the QLICD-CHD, which has good psychometric properties and some distinctive features, can be used as a useful instrument for measuring and assessing quality of life in patients with coronary heart disease who speak Chinese (the largest population in the world).