CHF-PROM: validation of a patient-reported outcome measure for patients with chronic heart failure

Background
Due to a lack of an appropriate disease-specific patient-reported outcome (PRO) instrument for chronic heart failure including its social support and treatment aspects in China, this study was performed to develop a patient-reported outcome measure (PROM) for patients with chronic heart failure and evaluate its reliability, validity, and feasibility.

Methods
According to the standard PROM guidelines established by the Food and Drug Administration, an item pool was formed by reviewing a large amount of relevant literature and interviewing patients with chronic heart failure about their main symptoms. Thus, the primary scale was created after adjusting the items and language with the help of patients and experts in the field. Next, 155 patients from 8 hospitals in different districts were recruited for a pilot survey using questionnaires containing these items. The patients’ responses were analyzed using the classical test theory and item response theory to select high-quality items and determine the subdomains of the scale. This was followed by a formal investigation in the same eight hospitals. In total, 360 patients and 100 healthy subjects were included to evaluate the reliability, validity, and feasibility of the items. Through this process, the final scale was established.

Results
The final scale comprised 12 subdomains with 57 items related to physical, psychological, social, and therapeutic areas. The data analysis results of the formal investigation showed that the PROM for chronic heart failure had good reliability, validity, and feasibility. Reliability was verified by Cronbach’s alpha coefficient, which was 0.913 for the total scale, 0.903 for the physical domain, 0.941 for the psychological domain, 0.827 for the social domain, and 0.839 for the therapeutic domain. The construct validity results met the relative criteria of confirmatory factor analysis. Discriminant validity was represented by score comparisons of nine subdomains. The response rate and the effective rate of return of the CHF-PROM were 98.94% and 98.92%, respectively.

Conclusions
The final scale coincides with the theoretical framework and better reflects the overall quality of life of patients with chronic heart failure. This scale can be used as a valid instrument to evaluate clinical treatment and clinical trials of chronic heart failure.

Electronic supplementary material
The online version of this article (10.1186/s12955-018-0874-2) contains supplementary material, which is available to authorized users.


Background
Heart failure (HF) is a syndrome caused by a functional heart disorder in which the heart is unable to meet the needs of the body at normal pressure [1]. As a complex clinical syndrome, HF is the terminal phase of all heart diseases, whatever their cause. More than 26 million individuals have HF, and this number is increasing. By 2050, an estimated 20% of people aged > 65 years will have developed HF [2]. HF has become a serious threat to human health and social development. Based on the severity of disease, HF can be divided into acute HF (AHF) and chronic HF (CHF) [3].
CHF is the final stage of heart disease. It is a complex clinical syndrome characterized by dyspnea, edema, and fatigue [4]. Its treatment includes medical therapy, mechanical circulatory assistance, and cardiac transplantation [5]. Individual therapeutic strategies based on patients' reported outcomes, which can reflect patients' individual situations, have been proven effective for relieving the symptoms of CHF and improving patients' quality of life (QoL). Compared with many other chronic diseases, CHF affects QoL more profoundly. QoL has become a major concern in modern medicine in recent years. However, clinical management and research have not taken CHF into consideration to a satisfactory degree [6]. Depression and social function disability have been shown to have a significant impact on QoL in patients with CHF [7]. Other factors affecting QoL include treatment compliance, satisfaction with treatment, and adverse effects of related treatments [8]. Additionally, decisions regarding therapy can change over time depending on the feelings of the patients and their families.
Patient-reported outcomes (PROs) are based on health-related quality of life (HRQoL). HRQoL reflects patients' overall feelings regarding their disease and the corresponding therapy. As a central part of PROs, HRQoL is essential for evaluating patients' health status [9]. PROs are not summaries provided by medical professionals but are instead patient-centered self-reports of patients' feelings regarding their health state, functional status, and therapeutics. Thus, PROs are helpful in diagnosis and therapy and are of significant importance in clinical practice [10][11][12][13][14]. Widely accepted by medical professionals, PROs make use of patients' feedback and view patient self-evaluation as an important aspect of the end-point in clinical trials. In 2006, the United States Food and Drug Administration circulated a publication entitled "Guidance for Industry: Patient-Reported Outcome Measures: Use in Medical Product Development to Support Labeling" [9], which further standardized the development and validation of PROs both clinically and academically [15][16][17].
Health-related quality of life instruments include generic measures and disease-specific measures. All of these can reflect the quality of life of patients. Generic measures used for patients with chronic HF include the Nottingham Health Profile, Simple SF-36 Health Survey Questionnaire, and World Health Organization Quality of Life Scale-Brief Version [18]. These generic measures are not specific for CHF; therefore, they cannot specifically and completely represent the situation of patients with CHF. In contrast, disease-specific measures quantify more clinically relevant domains than generic health status measures and are often more sensitive to clinical change. As the terminal phase of all organic heart diseases, CHF has specific clinical features and treatments; therefore, development of disease-specific measures for HF is necessary. Disease-specific measures used in the clinical setting include the Minnesota Living with Heart Failure Questionnaire (MLHFQ), Chronic Heart Failure Questionnaire, Kansas City Cardiomyopathy Questionnaire (KCCQ), and Quality of Life Index-Cardiac Version [18][19][20][21]. Among these, the MLHFQ and KCCQ are more popular than the others. The MLHFQ was the first questionnaire used in HF and has been translated and culturally adapted into at least 34 languages. It contains 21 items, most of which focus on physical and emotional domains; only one focuses on therapy [19,20]. The Chronic Heart Failure Questionnaire evaluates fatigue, dyspnea, and emotion [20]. The KCCQ reports an overall summary score and five subdomain scores: physical limitations, symptoms, self-efficacy, social interference, and HRQoL. It focuses more on physical limitations, symptoms, and HRQoL and gives little attention to self-efficacy and social interference [18]. The Quality of Life Index-Cardiac Version was established in Europe and can be used for all types of heart disease [20].
Notably, doctors change treatment plans based on their patients' social support and therapy status. For example, if the patient's compliance decreases during the treatment period, the doctor can identify the specific cause by calculating the score of the related items in the scale. This may provide doctors with a relatively objective basis for improving patients' compliance. Additionally, the score for the social support dimension of the scale can reflect the patient's family situation and social environment. This could guide community doctors in helping patients or their family members to solve the corresponding problems and in providing better community medical services. However, existing questionnaires rarely assess such factors [18,20,21]. Therefore, developing a Chinese questionnaire, specifically one that is culturally relevant to mainland China, is necessary because the management of CHF strongly depends on the societal value systems, medical provision priorities, and economic environments of this country. We herein propose a measure based on PROs for patients with chronic HF to improve on the current questionnaires for cardiovascular disease and guide clinical treatment.

Conceptual framework construction
A conceptual framework for the CHF-PROM was constructed by considering the principles for developing PRO scales established by the Food and Drug Administration [22], previous life-quality questionnaires for patients with HF, and the relevant theories of CHF. The CHF-PROM should include four domains: the physical domain (PHD), psychological domain (PSD), social domain (SOD), and therapeutic domain (TRD).

Item generation
We consulted a large number of relevant studies and related questionnaires [9,[18][19][20][21][22]. The patients' major disease symptoms, psychological and social conditions, and satisfaction towards medical services or side effects of treatment were also collected. The item pool was generated according to all of this information.

Formation of preliminary scale
Face-to-face interviews regarding the above-mentioned items were required. Patients' subjective opinions were taken into consideration. The item pool was applied to 10 patients with CHF in hospitals or communities (5 males, 5 females; average age, 65 years). During this process, the patients were asked to point out words they could not understand, and items were added or deleted as necessary. The items were revised by three cardiovascular disease experts, a psychologist, and a sociologist, who were invited to make suggestions regarding all four domains. Based on the patients' and experts' opinions, the CHF-PROM was further modified to form a preliminary scale. The scores of the items were calculated using a 5-point Likert scale.
Patients were enrolled from eight different hospitals in Shanxi Province, China. The inclusion criteria for this study were an age of > 18 years, a principal diagnosis of chronic heart failure according to the 2013 ACC/AHA guideline on HF [2], and consent to fill out the questionnaire. We excluded patients with combined psychiatric disorders and those who were incapable of understanding or completing the questionnaire because of language barriers or intellectual disabilities.
Healthy subjects were defined as people who had not been diagnosed with any diseases by physicians. Healthy subjects who matched the basic characteristics of patients with CHF were recruited from communities of Shanxi Province. Before recruiting healthy subjects, the investigators contacted related departments of target communities to obtain support from community workers. At the same time, full preparations for publicity were made by creating posters to display in the communities. Documents that introduced the survey were also distributed. Healthy subjects who were willing to participate in the questionnaire survey provided written informed consent. The participants filled out the questionnaire by following the same survey process followed by patients with CHF.
In cases of missing data, we corrected and supplemented the data in a timely manner. For factor analysis, Nunnally [23] suggested that the number of subjects should be at least 10 times the number of study variables. Some scholars have suggested that the actual sample size should be 5 to 10 times greater than the number of observed variables to obtain accurate parameter estimates and reliable results [24].
The purpose of our study was thoroughly explained to all participants. Written informed consent was obtained from all participants. These questionnaires were made available on the first day of hospitalization. During hospitalization, the patients independently completed the questionnaires according to their own physical conditions by following the instructions provided by the investigators. For the elderly patients who were unable to complete the questionnaires, the investigators read the content of the questionnaires and/or filled in the answers according to the patients' selections without any suggestions. Data entry and its verification are important in the process of data management in clinical studies [25]. Double data entry was adopted to control data quality using EpiData3.1 software. In total, 105 patients and 50 healthy subjects were enrolled in the pilot study. Various statistical analyses were conducted to select high-quality items and develop the preliminary scale, such as the classical test theory [e.g., discrete trend, factor analysis, correlation coefficient, Cronbach's α if item deleted (CAID) and corrected item-total correlation (CITC)] and item response theory. A further larger-scale survey involving 365 patients with CHF and 100 healthy subjects was conducted by using the preliminary scale.

Scale scoring
Patients responded to each item on a 5-point Likert scale to reflect how often they had experienced each issue during the past 2 weeks. An initial value ranging from 0 to 4 was assigned for each category (0 = never, 1 = occasionally, 2 = about half of the time, 3 = often, and 4 = almost every day). To ensure a consistent relationship between the responses to all items and the PROM, all responses were transformed in the following way: positively scored items were recorded as the original score plus 1, while negatively scored items were recorded as 5 minus the original score. This resulted in a score ranging from 1 to 5 for each item, with a higher score associated with a more positive PROM.
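The recoding rule described above can be sketched in a few lines of Python. This is only an illustration of the stated arithmetic; the function name `transform_score` and the `positive` flag are hypothetical and do not come from the study's software.

```python
def transform_score(raw, positive):
    """Map a raw 0-4 Likert response to the 1-5 item score.

    raw      -- original response (0 = never ... 4 = almost every day)
    positive -- True for positively scored items, False for negatively scored ones
    """
    if raw not in range(5):
        raise ValueError("raw response must be an integer from 0 to 4")
    # Positive items: original score plus 1. Negative items: 5 minus original score.
    return raw + 1 if positive else 5 - raw


if __name__ == "__main__":
    for raw in range(5):
        print(raw, transform_score(raw, True), transform_score(raw, False))
```

Under this rule, both item types end up on the same 1 to 5 range, with 5 always being the most favorable response.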

Discrete trend
A low discrete degree indicated that the subjects were inclined to select the same answer. In other words, the items were not useful for indicating differences. The scores generally exhibited a normal distribution; thus, the standard deviation was calculated for every item. Items with a low standard deviation (< 1.0) were deleted. Generally, a value of > 1.0 indicates that the participants may select different answers for an item [26].
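As a rough illustration of this screening step, the sketch below flags items whose standard deviation falls below 1.0. It assumes the sample standard deviation (n − 1 denominator), which the text does not specify; the function name and the example data are hypothetical.

```python
from statistics import stdev


def low_dispersion_items(responses, threshold=1.0):
    """Return the ids of items whose score standard deviation is below threshold.

    responses -- dict mapping an item id to the list of scores it received
    threshold -- cut-off from the text (items with SD < 1.0 are candidates for deletion)
    """
    # stdev() is the sample standard deviation; this choice is an assumption.
    return [item for item, scores in responses.items() if stdev(scores) < threshold]


if __name__ == "__main__":
    example = {
        "ITEM_A": [1, 1, 2, 1, 1],  # nearly everyone gives the same answer
        "ITEM_B": [1, 5, 3, 2, 4],  # answers are spread out
    }
    print(low_dispersion_items(example))
```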

Exploratory factor analysis
Considering the small sample size, an exploratory factor analysis was performed and the solution was rotated separately in each field (physical, psychological, social, and therapeutic). We determined the number of factors according to the eigenvalue and variance contribution ratio. The eigenvalue should be > 1.0, and the maximum cumulative variance contribution rate was 70%. Items with low factor loading (< 0.4) were removed. Generally, it was considered that the measurable variable (e.g., item) was mainly affected by this potential factor (e.g., subdomain) if factor loading was ≥0.4 [27].

CAID
The CAID and CITC were used to evaluate the internal consistency among the items. If an item had a negative effect on the internal consistency of its own dimension, Cronbach's α coefficient increased greatly when the item was deleted. A CITC of < 0.4 indicated that an item was poorly correlated to the scale. In this circumstance, the item should be deleted [28].
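The two indices can be illustrated with a short, self-contained Python sketch. It uses the standard formula for Cronbach's α and computes, for each item, the correlation with the sum of the remaining items (CITC) and the α of the remaining items (CAID). The function names and the toy data are assumptions for illustration; the study itself reports using SPSS. At least three items are needed so that α is still defined after one is removed.

```python
from statistics import mean, variance


def cronbach_alpha(items):
    """items: list of per-item score lists, one score per respondent in each list."""
    k = len(items)
    totals = [sum(row) for row in zip(*items)]
    item_var = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_var / variance(totals))


def pearson(x, y):
    """Pearson correlation coefficient of two equally long score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def item_diagnostics(items):
    """Return a (CITC, CAID) pair for every item."""
    results = []
    for i, col in enumerate(items):
        rest = items[:i] + items[i + 1:]
        rest_totals = [sum(row) for row in zip(*rest)]
        results.append((pearson(col, rest_totals), cronbach_alpha(rest)))
    return results


if __name__ == "__main__":
    # Three consistent items plus one that runs against them (toy data).
    scores = [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5], [1, 2, 3, 4, 5], [5, 4, 3, 2, 1]]
    print("alpha:", cronbach_alpha(scores))
    for citc, caid in item_diagnostics(scores):
        print("CITC = %.2f, alpha if deleted = %.2f" % (citc, caid))
```

In the toy data the fourth item has a negative CITC, and α rises sharply once it is removed, which is the pattern the text describes for an item that harms internal consistency.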

Correlation coefficient
The representativeness of an item was measured by its correlation coefficient with its own subdomain. An item with a correlation coefficient of < 0.6 was generally considered to be poorly correlated with the corresponding subdomain [29]. Such an item was removed.

IRT
IRT is part of modern measurement theory and was proposed to overcome the defects of CTT [30]. It is also called latent trait theory and has advantages in terms of item selection and test construction. It holds that the relationship between subjects' abilities and their responses to an item can be described as a function. The basic task is to define this relationship. In brief, IRT can be viewed as a probabilistic method for describing the relationship between subjects' latent traits and their responses to items.
If we set θ as a subject's ability, then p(θ) is the probability that the subject will respond to an item correctly. The functional relationship can be reflected by a curve called the item characteristic curve. We selected two important parameters on the curve: a reflects the degree of discrimination, and b indicates the item difficulty. A graded response model appropriate for hierarchical and continuous data was constructed considering the 5-point Likert scale used in this study, extending a unidimensional model to a multidimensional one [31]. Five parameters were estimated in our study, namely a, b1, b2, b3, and b4, where b1 is the difficulty level parameter between Answers 1 and 2, and so on. Here, a must have a value of > 0.60, and b ranges from −3 to 3. Items supported by at least three methods were retained in the final CHF-PROM.
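To make the role of the five parameters concrete, the sketch below evaluates a unidimensional graded response model in its common logistic (Samejima) form, in which the probability of answering in category j or higher is 1 / (1 + exp(−a(θ − b))) for the corresponding threshold b. This form, the function names, and the example parameter values are assumptions for illustration; the text does not give the exact parameterization used in Multilog, and the multidimensional extension is not shown. Treating the range from −3 to 3 as inclusive is also an assumption.

```python
import math


def grm_category_probs(theta, a, thresholds):
    """Category probabilities for one item under a logistic graded response model.

    theta      -- latent trait level of the subject
    a          -- discrimination parameter
    thresholds -- ordered difficulty parameters (b1..b4 for a 5-point item)
    """
    # Cumulative probabilities P(X >= category), padded with 1 and 0 at the ends.
    cum = [1.0] + [1 / (1 + math.exp(-a * (theta - b))) for b in thresholds] + [0.0]
    return [cum[j] - cum[j + 1] for j in range(len(thresholds) + 1)]


def item_passes(a, thresholds, a_min=0.60, b_range=(-3.0, 3.0)):
    """Apply the retention criteria from the text: a > 0.60 and every b within [-3, 3]."""
    return a > a_min and all(b_range[0] <= b <= b_range[1] for b in thresholds)


if __name__ == "__main__":
    a, bs = 1.2, [-2.0, -1.0, 1.0, 2.0]  # hypothetical item parameters
    print([round(p, 3) for p in grm_category_probs(0.0, a, bs)])
    print(item_passes(a, bs))
```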

Validation of the final scale

Reliability
We calculated Cronbach's alpha coefficients for four fields and the total scale to measure the internal consistency of the CHF-PROM. Generally, a value of > 0.70 indicates that individual items provide an adequate contribution to the overall scale [32].

Content validity
The patients' opinions were typically consulted to validate the content with respect to how well the items met the empirical indexes of interest [33].

Construct validity
We subjected the factor structure of the scale to confirmatory factor analysis (CFA). The model was assessed with respect to the following relative goodness-of-fit statistics: root mean square error approximation (values of < 0.08 indicated adequate fit and values of < 0.05 indicated close fit of the data to the model) [34], normed fit index (values of ≥0.90), non-normed fit index (values of ≥0.90), incremental fit index (values of ≥0.90), comparative fit index (values of ≥0.90), and root mean square residual (values of < 0.09) [33]. We used LISREL 8.70 to assess the construct validity with CFA.
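The cut-offs listed above can be collected into a small lookup for checking a fitted model. The sketch below only encodes the thresholds stated in the text; the function name and the example index values are hypothetical and are not results from this study.

```python
# Cut-offs as stated in the text. RMSEA uses the "adequate fit" bound of 0.08.
FIT_CRITERIA = {
    "RMSEA": lambda v: v < 0.08,
    "NFI": lambda v: v >= 0.90,
    "NNFI": lambda v: v >= 0.90,
    "IFI": lambda v: v >= 0.90,
    "CFI": lambda v: v >= 0.90,
    "RMR": lambda v: v < 0.09,
}


def check_fit(indices):
    """Return, for each supplied fit index, whether it meets its cut-off."""
    return {name: FIT_CRITERIA[name](value) for name, value in indices.items()}


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    print(check_fit({"RMSEA": 0.06, "CFI": 0.95, "RMR": 0.10}))
```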

Discriminant validity
We determined the discriminant validity by comparing the mean scores for every subdomain of the CHF-PROM among the healthy people and patients with CHF. We compared the differences using a t-test, with the significance level set at P < 0.05 [35].
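A group comparison of this kind can be sketched as follows. The text does not state which variant of the t-test was used, so this example assumes Welch's unequal-variance form. It also approximates the two-sided P value with the standard normal distribution, which is reasonable only for large samples such as those in this study; a statistics package would use the exact t distribution. The function name and the data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic, its degrees of freedom, and an approximate two-sided P value."""
    na, nb = len(group_a), len(group_b)
    va, vb = variance(group_a) / na, variance(group_b) / nb
    t = (mean(group_a) - mean(group_b)) / sqrt(va + vb)
    # Welch-Satterthwaite degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    # Normal approximation to the t distribution (an assumption; fine for large df).
    p_approx = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, df, p_approx


if __name__ == "__main__":
    patients = [2.1, 2.8, 3.0, 2.5, 3.2, 2.7, 2.9, 2.4]  # hypothetical subdomain scores
    healthy = [3.9, 4.2, 4.0, 4.5, 3.8, 4.1, 4.4, 4.3]
    t, df, p = welch_t(patients, healthy)
    print("t = %.2f, df = %.1f, significant at 0.05: %s" % (t, df, p < 0.05))
```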

Feasibility
We evaluated the feasibility of the CHF-PROM by examining the response rate, completion rate, time to completion, percentage of missing data, and score distribution. We considered response and return rates of < 85% to be inadequate and a completion time within 30 min to be acceptable. SPSS 16.0, Multilog 7.03, EpiData3.1, and LISREL 8.70 were used to conduct the data analysis. The entire study flow diagram is presented in Fig. 1.

Generation of item pool
After consulting relevant literature and interviewing patients with CHF, we established four domains as described in the Methods section: physical domain, psychological domain, social domain, and therapeutic domain. These 4 domains were then divided into 12 subdomains and a pool of 67 items (see Additional file 1). The conceptual framework of the instrument is shown in Fig. 2.

Formation of preliminary scale
Establishment of the CHF-PROM was based on published literature and related questionnaires. Consultants were also needed to improve the validity of the questionnaire [3,7,8,[12][13][14][15]. According to the advice provided by patients and experts in this field, six items were removed ("PHD1. Do you feel that your limb is weak?", "PHD15. Do you have constipation?," "PSD13. Do you often check things over and over again?," "PSD14. Do you often wash your hands or count over and over again?," "PSD22. Do you feel that people do not judge your achievements properly?," and "TRD6. Did you think the examinations are necessary?"), three items were added ("Do you feel that your illness is a burden to your family?," "Do you know the side effects of the drugs?," and "Are you worried about the side effects of the drugs?"), and one item was divided into two items ("PSD4. Do you feel less concentrated and forget things easily?"). As a result, we generated 65 items for the CHF-PROM.

Item selection

Participant characteristics
The screening phase involved 105 patients and 50 healthy subjects. The patients with CHF had an average age of 69.16 ± 11.24 years. The normal subjects had an average age of 56.96 ± 14.96 years. The basic characteristics of the patients with CHF and healthy subjects are shown in Table 1. The demographic data were compared using the chi-square test for categorical variables.

First item-selection phase
Five statistical methods within the CTT and IRT were used to select the items. Items PHD3, PHD7, PSD12, and SOD9 were deleted according to the above-

Validation of the scale
The scale was validated in a large-scale sample. The target sample size was determined based on Nunnally's rule, and the achieved sample size was only slightly below this target. Patients were enrolled from different departments of eight different hospitals in Shanxi Province, China. Some patients were not willing to complete the questionnaire because of their physical condition at that time, fear of disclosing private information, and other factors. In these target hospitals, several departments of cardiology were participating in investigations using other psychological questionnaires and were therefore unwilling to take part in the survey. Bias may be introduced into the study results if inpatients with CHF complete two questionnaires simultaneously. In total, 470 questionnaires were sent out and 467 were collected (98.50%). There were 460 valid questionnaires (patients with CHF, 360; healthy people, 100). The patients with CHF had an average age of 69.87 ± 10.60 years, and the healthy subjects had an average age of 57.06 ± 14.67 years. The participants' baseline data are shown in Table 4. The demographic data were compared using the chi-square test for categorical variables.

Reliability
Cronbach's alpha coefficients for the four domains and overall scale are shown in Table 5. In general, this questionnaire showed great reliability.

Construct validity
The results of the CFA were as follows: PHD measurement model: 16 items corresponding to 3 latent variables; PSD measurement model: 21 items corresponding to 4 latent variables; SOD measurement model: 8 items corresponding to 2 latent variables; and TRD measurement model: 12 items corresponding to 3 latent variables. Table 6 shows the goodness-of-fit for the CFA. The results showed that the model correlated well with the reference standard. The parameter estimation results of the CFA are presented in Table 7.

Discriminant validity
In this survey, the scores of each subdomain in addition to the therapeutic domain and total score of the scale between the patients and healthy subjects showed significant differences (see Table 8). These differences indicated that the scale was able to distinguish people in different groups.

Feasibility
In the large-scale clinical investigation, the response rate of the CHF-PROM was 98.94%, and the effective rate of return was 98.92%. The average completion time of CHF-PROM was 15 min. The score distribution of each item was analyzed. No major floor or ceiling effects were found. Only 0.06% of the responses to the psychological domain were missing. These findings suggest that the CHF-PROM is feasible for use in clinical practice.

Discussion
As a chronic disorder, CHF requires special management from patients and their families, including adjustment of daily habits, fluid management, and heart rate management. Based on detailed PROs, medical professionals can provide individual instructions to patients to improve their quality of life and reduce re-hospitalization and mortality rates [36]. We established the present CHF-PROM because of the shortcomings of previous HF questionnaires, which were translated directly from abroad and focused little on social support and therapy status. We applied four domains (physical domain, psychological domain, social domain, and therapeutic domain) and performed a large-scale survey of healthy subjects and patients with CHF in 8 hospitals to generate this CHF-PROM, which can more fully reflect the health status of patients with CHF.
We developed the present CHF-PROM in compliance with the development principles and processes of international scales. The CHF-PROM was developed in three stages: generation of the item pool, a pilot survey to form the preliminary scale, and use of large-scale clinical trials to form the final scale. To ensure that each selected item was sensitive, representative, and independent, we adopted different statistical methods in the process of generating the scale. The average time spent performing PRO data collection was about 15 min. This is thought to have been an acceptable time for the inpatients. During this time period, the inpatients could complete the questionnaires and provide accurate responses. The time of data collection should be controlled when performing a questionnaire-based study. The timing of data collection might have influenced the responses.
The methods employed to develop related scales are still limited to CTT [37]. Our study is innovative in that IRT was applied in addition to CTT. IRT has some advantages over CTT. Using IRT, estimation of parameters is independent of the number of measured subjects. It is also possible to indicate the accuracy of the test capability [38][39][40]. Besides statistical methods, clinical professional knowledge was also required during the process of item selection. The item "PHD18: Do you often feel nauseous?" met the requirements of statistical methods for item selection, but it did not describe a typical symptom of CHF; therefore, we deleted this item. Ten items were removed based on joint consideration of the CTT results, IRT results, and clinical knowledge. The final scale contained 57 items, 12 subdomains, and 4 domains.
We also evaluated the reliability, validity, and feasibility of the scale for 360 patients and 100 healthy subjects. The results showed that this novel scale is a reliable instrument. The CHF-PROM was generated to overcome the deficiencies in the existing HF scales. However, this study had some limitations. First, some problems exist in the personal basic information section of the scale. Economic income and consumption levels vary among different provinces and cities, making adoption of a single evaluating system of QoL inappropriate for patients with CHF. Previous studies have reported that patients' incomes, living conditions, life events, and education levels are the main factors influencing mental health, and among them, income most strongly affects living conditions and life events. Thus, income and educational level are included in the basic information of the scale [41]. Second, some problems exist in selection of the items. We removed four items with poor sensitivity, independence, representativeness, and discrimination in the preliminary experiment. Our results suggest that every aspect should be considered in the future design of relevant scales. Finally, although our large-scale survey has indicated that the CHF-PROM is a valid instrument, our samples were collected in only a limited area and are not completely representative of all patients with CHF. To further revise and improve the scale, more efforts are needed to recruit larger numbers of patients from different provinces and regions, and even from different countries for cross-language scale adjustment, to develop a CHF-PROM with wider applicability [42]. These adjusted versions will also need to be validated separately from this Chinese version.

Conclusion
In this study, we developed a CHF-PROM that showed better reliability, validity, and feasibility than previously established scales. The CHF-PROM provided the patients a greater chance to participate in treatment decisions, suggesting that PROs can be used in more clinical trials and diagnostic settings in the future. This will allow doctors to obtain more comprehensive medical information, and PROs will become an important indicator of the end-point in curative effects.