LC-PROM: Validation of a patient reported outcomes measure for liver cirrhosis patients

Background The aim of the study is to develop a specific patient-reported scale of liver cirrhosis according to the Patient Reported Outcome guidelines of the Food and Drug Administration (FDA), and to examine its capacity to fill gaps in this field.
Methods A conceptual framework was developed and a preliminary item pool was generated through literature review and interviews with 10 patients with liver cirrhosis. With the preliminary items, we performed a pilot survey that included a cognitive test with patients and interviews with experts; the focus was on the content and language of the scale. In the item selection stage, seven statistical methods (discrete trends method, discrimination analysis, exploratory factor analysis, Cronbach's α coefficient, correlation coefficient, test-retest reliability, and Item Response Theory) were applied to survey data from 200 subjects (150 liver cirrhosis patients and 50 controls). This produced the preliminary Liver Cirrhosis Patient-reported Outcome Measure (LC-PROM). In the next stage, we conducted the survey with 620 subjects (500 patients and 120 controls) to validate the reliability, validity and acceptability of this scale.
Results The 55 items and 13 dimensions addressed four domains: physical, psychological, social, and therapeutic. Cronbach's α coefficient was 0.921 for the total scale; the confirmatory factor analysis, t-tests and ANOVA supported scale validity; the model fit indices, namely Root Mean Square Error of Approximation (RMSEA), Root Mean Square Residual (RMR), Normed Fit Index (NFI), Non-Normed Fit Index (NNFI), Comparative Fit Index (CFI) and Incremental Fit Index (IFI), generally met the criteria. The acceptance ratio and response rate indicated good feasibility.
Conclusions This study developed an accurate and stable patient-reported outcome scale for liver cirrhosis, which is able to evaluate clinical effects effectively, helps patients recognize their health condition, and contributes to clinical decision making for both patients and physicians. Additionally, the LC-PROM can serve as a final assessment of medical and health care effects and can inform clinical trials of new drugs for liver cirrhosis.


Background
Liver cirrhosis (LC) is a potential consequence of the progression of any of various kinds of liver disease, and the high incidence of hepatitis will lead to a large number of patients suffering from liver cirrhosis. LC is characterized by fatigue, digestive disorders, bleeding and anemia, endocrine disorder, hypoproteinemia, portal hypertension and other serious symptoms that cause patients great physical pain and disrupt their daily social life. As an irreversible, chronic, progressive disease, LC cannot be cured completely at the present stage. Particularly for weak patients, the common treatments used in clinical practice can cause secondary damage in addition to the harm caused by the disease itself.
At present, patients' health status and treatment effects are evaluated by hepatic function tests and serological markers, or reflected by hospital stays and symptom improvement over time. However, with the continued development of a biopsychosocial medical model, the use of scales to assess patients' fitness has been widely accepted and applied internationally; that is, patients' personally reported data, dubbed patient-reported outcomes (PROs), are used to measure clinical results. One of the arguments for using questionnaires to ask patients to judge their own health-related quality of life (HRQoL) is that physicians have been shown to be generally unable to make accurate judgments of patients' HRQoL. Physicians' judgments not only deviate from those of patients, they also differ from one another. This latter variability makes it particularly difficult to obtain 'objective' judgments of HRQoL [1].
The PRO Harmonization Group, which consists of the Food and Drug Administration (FDA), the International Society for Pharmacoeconomics and Outcomes Research (ISPOR), the European Regulatory Issues on Quality of Life Assessment Group (ERIQA), and the International Society for Quality of Life Studies (ISQOL), proposes that evaluation of clinical curative effects should contain data from physicians' reports, physiological measures, caregivers' reports, and PROs, which come solely from the patient. In the course of a disease, there are some symptoms that can only be experienced by patients; i.e., these symptoms cannot be reflected by physical measures. In this case, the normal reference values of medicine do not equal true health; additionally, physician-reported data are always filtered through the physician's subjective judgment and may only include content related to the physician's concerns. Moreover, such a report is limited by the physician's knowledge and experience. Therefore, PROs play an important role in clinical practice, and this method is now generally accepted by experts and patients alike. Since the publication of the draft guide for new drug development and curative effect evaluation in February 2006 [2], PROs have become more important in the assessment of treatment outcomes and in new drug registration.
A PRO instrument specific to LC could provide several benefits: it could help improve the evidence base through research assessing effectiveness of LC therapies; facilitate clinician-patient communication and shared decision making; help prioritize patient problems and preferences; monitor changes or outcomes of treatment; measure the performance of healthcare providers and services; and be incorporated in clinical audits [3][4][5].
In short, the aim of this study is to develop such a PRO scale that meets the following criteria: (I) specific to liver cirrhosis; (II) addresses all physical symptoms, psychological feelings, daily activities, and therapeutic status related to LC; (III) comprises items that are founded on the patients' own perspective; (IV) has good internal consistency, a reasonable theoretical framework and can distinguish different severities of the disease; and (V) is of appropriate length and has strong feasibility.

Methods
The Medical Ethics Committee of Shanxi Medical University provided ethics approval, and all participants signed informed consent to participate.
Step 1 item generation

Literature review
We conducted literature searches on databases and network resources for PRO instruments. From the searches, we formed the conceptual framework of the new instrument, called the Liver Cirrhosis Patient Reported Outcome Measure (LC-PROM).

Patient interviews
We conducted semi-structured interviews with ten liver cirrhosis patients (five males and five females; average age 53 years). In the interview, patients were encouraged to talk about their main disease symptoms, physical feelings and symptoms that they most desired to improve, psychological conditions after diagnosis and participation in social activities since diagnosis, adherence to therapy and satisfaction with their status. In addition, patients could speak freely on other relevant topics. Throughout the process, researchers wrote down the interviewees' original words as far as possible, and audio recordings were made. After the interview, all information was sorted and then an initial list of items was developed.

Cognitive debriefing and discussion with experts
Another ten patients (five males and five females, average age 52 years) were selected to undertake cognitive debriefing. These patients were asked to flag items that were ambiguously worded or difficult to understand, and to suggest items that needed to be added or deleted.
Seven experienced experts including three chief physicians of gastroenterology, one infectious diseases physician, one psychologist, one sociologist, and one ethics expert were invited to discuss whether the initial structural framework was reasonable and whether the items covered all areas of disease evaluation. The correlation of items with their respective dimensions and linguistic issues were considered. We modified the item pool according to the experts' advice, and the preliminary scales were formed.
Step 2 item selection

Sampling survey
Two hundred subjects were sampled from inpatients of eight different hospitals and from communities in Shanxi Province. There were 150 LC patients and 50 healthy controls.
Patients who were diagnosed with definite LC, who were between 18 and 72 years old, and who were fully able and willing to participate in this study as volunteers were included.
Patients were excluded if they had an uncertain diagnosis, suffered mental illness or disorders of consciousness, were unable to understand questions because of dysgnosia, or were unable to complete the test.
Healthy controls were volunteers from communities who had not been diagnosed with any disease by a physician and had an age distribution similar to that of LC patients. Healthy controls also provided informed consent and received some rewards.
The survey was administered by trained investigators. Before beginning, subjects were informed of the survey objective and signed the informed consent form. Next, the participants independently completed the preliminary scale. During the survey, investigators were present to respond to questions. If participants were elderly or had a low education level, investigators read the items to them and wrote down their answers. After the survey, any incomplete scales were filled in by the subjects under the guidance of the investigators.

Scale scoring
Scores were calculated using a five-point Likert scale to reflect frequency of occurrence over the past 2 weeks of the issue presented in each item. The responses were 0 = never, 1 = occasionally, 2 = about half of the time, 3 = often, and 4 = almost every day. The positively-toned items were scored as the original score plus one, and the negatively-toned items were scored as 5 minus the original score. Thus every item score ranged from 1 to 5, with higher scores denoting more positive outcomes.
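The recoding rule above can be sketched in a few lines of Python. This is only an illustration of the stated arithmetic; the function name `score_item` and the sample responses are hypothetical and are not taken from the study.

```python
def score_item(raw: int, positively_toned: bool) -> int:
    """Convert a raw 0-4 Likert response into the 1-5 item score."""
    if raw not in range(5):
        raise ValueError("raw response must be an integer from 0 to 4")
    # Positively toned items: raw + 1; negatively toned items: 5 - raw.
    return raw + 1 if positively_toned else 5 - raw


# Hypothetical responses: (raw response, is the item positively toned?)
responses = [(0, True), (4, True), (0, False), (4, False)]
scores = [score_item(raw, positive) for raw, positive in responses]
print(scores)  # [1, 5, 5, 1]
```

Both branches map onto the same 1 to 5 range, so a higher score always denotes a more positive outcome regardless of how the item is worded.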

Statistical methods for item selection
Item reduction was based on both Classical Test Theory (CTT) and Item Response Theory (IRT). This study employed six methods of CTT followed by IRT.

Discrete trend
A low discrete degree means subjects were inclined to select the same answer; that is, the items had a low capacity to test for differences. In general, scores obey a normal distribution, so the standard deviation for every item was calculated. The items with a low standard deviation (<1.0) were deleted.
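A minimal sketch of this screening step follows. It assumes the sample standard deviation (n − 1 denominator), which the text does not specify; the item names and scores are hypothetical.

```python
from statistics import stdev


def low_variability_items(item_scores, threshold=1.0):
    """Return the names of items whose sample SD falls below the threshold."""
    return [name for name, scores in item_scores.items()
            if stdev(scores) < threshold]


# Hypothetical 1-5 item scores from seven respondents.
item_scores = {
    "item_A": [1, 2, 3, 4, 5, 1, 5],  # responses spread across the scale
    "item_B": [3, 3, 3, 4, 3, 3, 3],  # nearly everyone gives the same answer
}
flagged = low_variability_items(item_scores)
print(flagged)  # ['item_B']
```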

Discrimination analysis
Items that do not reflect different characteristics of subjects should not remain in the scale. We compared the scores on every item with a two-independent-sample t-test (α = 0.05), and the items that showed no statistically significant difference were deleted.

Exploratory factor analysis (EFA)
Taking the small sample size into consideration, we did EFA in each domain (physical, psychological, social, and therapeutic) separately, then rotated the solution. According to the eigenvalue and the variance contribution ratio, the number of factors was determined. Items with low factor loading (<0.4) and cross-loading on two or more dimensions were removed.

Cronbach's α if item deleted (CAID)
Internal consistency was evaluated with CAID and the Corrected Item Total Correlation (CITC). If the α coefficient increased greatly when an item was deleted, the item was reducing the internal consistency of its own dimension. CITC < 0.4 indicates an item poorly contributing to the construct of the scale; therefore such items were deleted.
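The two indices can be computed directly from an item-by-respondent score matrix. The sketch below is an assumption-laden illustration rather than the study's code: it uses the standard formula for Cronbach's α, takes CITC as the Pearson correlation between an item and the sum of the remaining items, and runs on hypothetical data.

```python
from statistics import mean, variance


def cronbach_alpha(items):
    """items: list of per-item score lists, one entry per respondent."""
    k = len(items)
    totals = [sum(resp) for resp in zip(*items)]
    item_var = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_var / variance(totals))


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def item_diagnostics(items):
    """Return (alpha if item deleted, corrected item-total correlation) per item."""
    results = []
    for i, item in enumerate(items):
        rest = items[:i] + items[i + 1:]          # needs at least 3 items overall
        rest_totals = [sum(resp) for resp in zip(*rest)]
        results.append((cronbach_alpha(rest), pearson(item, rest_totals)))
    return results


# Hypothetical dimension with four items answered by six respondents.
items = [
    [1, 2, 3, 4, 5, 4],
    [2, 2, 3, 5, 5, 4],
    [1, 3, 3, 4, 4, 5],
    [5, 1, 4, 2, 3, 2],   # deliberately inconsistent with the other items
]
diagnostics = item_diagnostics(items)
low_citc = [i for i, (_, citc) in enumerate(diagnostics) if citc < 0.4]
print(low_citc)  # indices of items with CITC below 0.4
```

An item is a candidate for deletion when its CITC falls below 0.4 or when α computed without it is clearly higher than α for the full dimension.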

Correlation coefficient
The representativeness of an item was measured by its correlation coefficient with the dimension to which the item belonged. When the value was less than 0.6, the item was not retained.

Retest reliability
This method considered item stability. Thirty subjects were selected from the sample to take a retest 2 weeks after the first test. Among these, the 20 cases whose data were error-free in both tests were used to calculate the retest correlation coefficient. The criterion for reliability was 0.7.

Item response theory (IRT)
IRT is part of modern measurement theory and was put forward to overcome defects of CTT [6]. It is also called latent trait theory, and has advantages for item selection and test construction. It claims that there is a functional relationship between subjects' abilities and their responses to an item. How to define this relationship is the basic idea and the starting point. In brief, IRT can be viewed as a probabilistic method for discussing the relationship between subjects' potential traits and their responses to items.
If θ represents a subject's ability, P(θ) is the probability of the subject's responding to an item correctly; their functional relationship can be reflected by a curve called the item characteristic curve (ICC). Two important parameters on the curve are used in this study: a reflects discriminant degree and b shows item difficulty. On the ICC whose X,Y axes are θ and P(θ), b is the value of θ corresponding to P(θ) = 0.50; this value ranges from −3 to 3. a is the function of the tangent line's slope at point b; its value ranges between 0.3 and 2, with larger values representing higher degrees of discrimination.
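As a worked illustration of these two parameters, assume the two-parameter logistic form of the ICC with scaling constant D = 1.7 (the same form the Graded Response model uses for its cumulative curves; the dichotomous ICC is not written out explicitly in the text). Then the roles of b and a can be read off directly:

```latex
P(\theta) = \frac{1}{1 + \exp[-D a (\theta - b)]}, \qquad D = 1.7

P(b) = \frac{1}{1 + \exp(0)} = \frac{1}{2}

\left.\frac{dP}{d\theta}\right|_{\theta = b}
  = D a \, P(b)\,[1 - P(b)]
  = \frac{D a}{4}
```

Under this assumption, b is the ability level at which the response probability is 0.50, and the slope of the tangent at that point is proportional to a, so a larger a gives a steeper curve and sharper discrimination.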
Because a five-point Likert scale was used, a Graded Response model was constructed, which is appropriate for hierarchical and continuous data, extending a unidimensional model to a multidimensional one [7]. The basic idea of the model [8] is as follows. Assume the full score of item $j$ is $f_j$, so that item $j$ has $f_j + 1$ possible scores, namely $0, 1, 2, \ldots, f_j$. Let $P^*_{ajt}$ be the probability that the score on item $j$ is $t$ or higher when the ability value is $\theta_a$; then $P^*_{aj0} = 1$ and $P^*_{aj,f_j+1} = 0$. If $P_{ajt}$ is the probability that the score on item $j$ is exactly $t$ [9], then

$$P_{ajt} = P^*_{ajt} - P^*_{aj,t+1}, \qquad t = 0, 1, 2, \ldots, f_j,$$

where

$$P^*_{ajt} = \frac{1}{1 + \exp[-D a_j (\theta_a - b_{jt})]}, \qquad D = 1.7.$$

Here $a_j$ is the discriminant degree of item $j$, $b_{jt}$ is the difficulty when the score of item $j$ is $t$, and the difficulty levels of item $j$ increase monotonically; that is, $b_{j1} < b_{j2} < \cdots < b_{j,f_j}$. $P^*_{ajt}$, which corresponds to an ICC, is called the item-type characteristic function in the Graded Response model.
Five parameters can be estimated in our study, namely $a, b_1, b_2, b_3, b_4$, where $b_1$ is the difficulty parameter between answer 1 and answer 2, and so on. Here $a$ must be > 0.60, and $b$ must range from −3 to 3.
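The category probabilities of the Graded Response model described above can be sketched as follows. The function name and the parameter values are hypothetical and chosen only to satisfy the stated constraints (a > 0.60, b between −3 and 3, increasing thresholds); estimating the parameters from data is a separate step that this sketch does not attempt.

```python
import math


def grm_category_probs(theta, a, thresholds, D=1.7):
    """Probabilities of scoring t = 0..f on one item under the Graded Response model."""
    if any(b1 >= b2 for b1, b2 in zip(thresholds, thresholds[1:])):
        raise ValueError("thresholds must be strictly increasing")
    # Cumulative probabilities of scoring t or higher: 1 for t = 0, 0 for t = f + 1.
    cumulative = [1.0]
    cumulative += [1.0 / (1.0 + math.exp(-D * a * (theta - b))) for b in thresholds]
    cumulative.append(0.0)
    # Probability of exactly t is the difference of adjacent cumulative values.
    return [cumulative[t] - cumulative[t + 1] for t in range(len(thresholds) + 1)]


# Hypothetical item: a = 1.2 and four thresholds for a five-category response.
probs = grm_category_probs(theta=0.0, a=1.2, thresholds=[-1.5, -0.5, 0.5, 1.5])
print([round(p, 3) for p in probs])
```

Because the cumulative curves are ordered, every category probability is non-negative and the five values sum to 1 for any ability level.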
Items supported by at least five methods were retained in the final LC-PROM.
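The retention rule amounts to a simple vote count across the seven methods. The sketch below illustrates it with hypothetical item names and hypothetical method outcomes; it does not reproduce the study's screening results.

```python
METHODS = [
    "discrete trend", "discrimination", "EFA", "CAID/CITC",
    "correlation", "retest", "IRT",
]


def retained_items(support, min_methods=5):
    """support maps an item name to the set of methods that favored keeping it."""
    return [item for item, methods in support.items() if len(methods) >= min_methods]


# Hypothetical screening outcomes for three items.
support = {
    "item_A": set(METHODS),                                          # 7 of 7
    "item_B": {"discrete trend", "EFA", "correlation", "retest"},    # 4 of 7
    "item_C": {"discrete trend", "discrimination", "EFA",
               "correlation", "retest"},                             # 5 of 7
}
kept = retained_items(support)
print(kept)  # ['item_A', 'item_C']
```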
Step 3 validation of the scale

Second sampling survey
Six hundred twenty subjects were selected in the second survey, of which 120 were controls. Inclusion and exclusion criteria did not change, nor did the survey process.

Reliability analysis
Reliability reflects the stability and consistency of a scale. In our study, Cronbach's α coefficients for the total scale and for each domain were calculated to evaluate the average consistency of the items. The higher the value, the better the reliability; but if α is too high, it suggests that the items are not simply related but overlap considerably. In the extreme case where α = 1, we should consider whether some items are redundant and could be eliminated. Here we chose 0.80 as the critical value; i.e., the measured results can be considered stable when α exceeds 0.80.

Validity analysis
Validity, also called accuracy, is the other arm of validation of a scale, and reflects the extent to which a scale measures what it sets out to measure. Validity includes subtypes of content validity, criterion validity, construct validity, and discriminant validity. In this article, we chose to measure the latter two.

Construct validity
This index shows whether the scale constructs match those in the initial framework. A scale with good construct validity is able to target true potential traits for measurement. Factor analysis is a major method for construct validity analysis and includes Exploratory Factor Analysis (EFA) and Confirmatory Factor Analysis (CFA). When an item collection is not based on theoretical guidance, EFA has the ability to explore the fields and dimensions belonging to a scale. However, before this study, we had reviewed the literature to formulate a scale framework, and EFA had been applied during the process of item selection, so at this stage CFA was suitable. Factor loading for every item and fit index for every domain were calculated.

Discriminant validity
This is an index of a scale's ability to discriminate between populations with different traits by comparing the test results of selected subjects. The statistical method was a two-independent-samples t-test. The total scores on the LC-PROM and on each domain were compared between cases and controls to judge whether the LC-PROM could distinguish these two groups. In addition, we stratified the time that patients had been sick as less than 1 year, 1 to 3 years, 3 to 5 years, and more than 5 years. ANOVA was then applied to infer the relationship between disease course and scale score. The scale was considered to have good discriminant validity when p ≤ 0.05.
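To make the ANOVA step concrete, the sketch below computes the one-way F statistic for scores grouped by disease course. The scores are hypothetical, and the p-value, which requires the F distribution, is left to a statistical package.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within), df_between, df_within


# Hypothetical total scores for the four disease-course strata
# (<1 year, 1 to 3 years, 3 to 5 years, >5 years).
strata = [
    [210, 205, 220, 215],
    [200, 198, 207, 203],
    [190, 195, 188, 192],
    [180, 176, 185, 182],
]
f_stat, df_b, df_w = one_way_anova_f(strata)
print(round(f_stat, 2), df_b, df_w)
```

A large F relative to the critical value for (df_between, df_within) indicates that mean scores differ across the disease-course strata.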

Feasibility analysis
When a scale can be understood and completed by subjects easily, the scale is said to have strong feasibility. This property is assessed with reference to acceptance ratio, response rate, and completion time.
The entire study flow diagram is presented in Fig. 1.

Results
The LC-PROM focused on 4 domains: Physical (PHD), Psychological (PSD), Social (SOD), and Therapeutic (TRD). This idea is based on the definition of PRO and all the specific scales for liver disease. Meanwhile, taking the Social Avoidance and Distress Scale (SAD) and the Beck Hopelessness Scale (BHS) into consideration, the LC-PROM was divided into a further 13 dimensions, and the initial item pool included 72 items (see Appendix 1). The instrument's conceptual framework is shown in Table 1.

Cognitive debriefing and expert discussion
The LC-PROM was regarded as clear and concise, easy to understand and easy for the patients in the cognitive debriefing to complete. Completion time was 10 min on average. Considering patients' suggestions, we made some modifications to the instrument. Six items in PHD that described atypical symptoms and overlapped with each other were deleted. Symptoms in the deleted items included, for example, oliguria, dry eyes, and pale skin and mucosa. We also replaced the words "hepatic region" with "right upper abdomen" to make the text easier to understand. Similarly, two items were removed from PSD, one item was removed from TRD, and one item was added to SOD.
Experts agreed that the LC-PROM was reasonable in its construction framework and item attributions, and that it was comprehensive in its content. However, because this was a self-rating scale, it was determined that the items should be expressed in the first person, so the research group made a full revision accordingly. This second draft of the preliminary LC-PROM included 64 items, 13 dimensions and four domains (see Appendix 2).

Item reduction

Participant characteristics
We sampled 200 participants in this survey; 189 responded, for an acceptance rate of 94.50 %. There were 179 subjects, including 132 patients and 47 controls, whose data were available, for a final response rate of 94.71 %. Baseline data of participants are shown in Table 2. The average length of time since liver cirrhosis diagnosis was approximately 3.02 years.
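The two rates can be checked with simple arithmetic. The sketch assumes, as the reported figures imply, that the acceptance rate is respondents over subjects sampled and the response rate is subjects with usable data over respondents.

```python
def rate(numerator, denominator):
    """Percentage rounded to two decimal places."""
    return round(100 * numerator / denominator, 2)


sampled, responded, valid = 200, 189, 179
patients, controls = 132, 47

acceptance_rate = rate(responded, sampled)  # 189 / 200
response_rate = rate(valid, responded)      # 179 / 189
print(acceptance_rate, response_rate)       # 94.5 94.71
```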

Item selection based on CTT and IRT
When CAID was used, we calculated the initial Cronbach's α coefficient with all 64 items retained; this did not result in the deletion of any items, and the detailed results are not shown here.
In IRT a number of items were suggested for deletion: fourteen in PHD, four in PSD, and seven in TRD; and only one item was retained in SOD according to parameters a and b. Fig. 2 shows the ICC matrix. Fifteen items were to be deleted based on statistical results, but considering the value of disease-specific symptom information and the contributions of certain items to each dimension, six items were maintained in the final version of the LC-PROM.
The final version comprised 55 items within 13 dimensions belonging to 4 domains (see Appendix 2). The detailed screening process is presented in Table 3, and the final construction frame can be seen in Table 4.

Validation of LC-PROM

Demographic characteristics
Another 620 subjects (500 cases and 120 controls) were sampled for the validation. Of the 598 who responded, 576 produced valid data for analysis (464 cases and 112 controls). Participant characteristics are presented in Table 5. As Table 5 shows, males were more numerous than females; subjects' average age was 50-55 years. There were no statistically significant differences in the distributions of gender, age, or height between the two groups. LC patients had a higher proportion of smoking and drinking, and lower weight. These characteristics are consistent with risk factors for LC. Among the patients, 269 had been sick for 1 to 5 years, 97 for less than 1 year, and 98 for more than 5 years; the average length of time was 3.70 years.

Reliability analysis
Cronbach's α coefficient is one of the indicators for evaluating reliability, with a generally acceptable value of greater than 0.70. Our LC-PROM met this standard, except in the TRD domain (see Table 6).
Note: "CC" is short for correlation coefficient; boldface marks items that a given method suggested deleting; "a" marks items that measure across dimensions; "√" means retain; "×" means delete.

Validity analysis
a. Construct validity: As the tables show, the standardized factor loading of each item was above 0.50, except for SOD3; therefore, the goodness of fit of the LC-PROM is satisfactory.
b. Discriminant validity: Discriminant validity analysis was conducted by comparing average scores across different domains as well as total scale scores between patients with various disease courses and the healthy controls.
In Table 9, the scores of patients are lower than those of controls, suggesting that LC severely affected patients' quality of life. With SOD as the exception, scores were significantly different, as seen in Table 10, and longer clinical courses were associated with lower scores. Perhaps because LC is the final stage of liver disease progression, by the time patients have received a definite diagnosis, they may already have lost the ability to engage in social activity; therefore scores in this domain did not differ. Of course, measurement error cannot be excluded as an explanation, but it had little effect on discriminant validity.
In summary, the LC-PROM was able to differentiate healthy controls from LC patients and to distinguish among LC patients with different clinical courses.

Feasibility analysis
The acceptance rate and response rate for the LC-PROM tool were 96.45 % and 92.32 %, respectively. Its average completion time was 10 min.

Discussion
LC is a chronic disease characterized by progressive liver injury, which imposes a heavy burden on medical and health services. Bajaj J. S. et al. reported that patients had significant impairment on all domains apart from anger and anxiety compared with caregivers and US norms. Decompensated patients had significantly worse sleep, pain, social and physical function scores compared with compensated ones [14]. Therefore, objective evaluation of clinical effects and patients' health conditions is critically important.
We performed reviews of the literature, then collected symptoms of greatest concern and with greatest likelihood of improvement, along with psychological conditions and life states from the patients' perspective. From these, we formed the preliminary item pool for the LC-PROM instrument. Cognitive debriefing and discussions with experts were employed to ensure the reasonableness of the conceptual and structural framework. Next we applied this scale to two samples (n1 = 200, n2 = 620) that represented different populations. We considered seven statistical methods and clinical relevance when selecting final items for this tool. In the current study, the final version of the LC-PROM comprised 55 items in 4 domains (18 items in PHD, 16 items in PSD, 12 items in SOD, 9 items in TRD) that represent 13 dimensions. Validation of reliability, validity, and feasibility indicated that the LC-PROM was accurate, reliable and easy to use, showing great potential for clinical application.
To our knowledge, based on our literature search, the LC-PROM instrument is the first scale specific to LC. The existing PROs for liver diseases are adapted from quality of life measurement scales, which are classified as universal QOL scales or specific HRQOL scales. For example, the WHO Quality of Life-BREF (WHOQOL-BREF), Short Form 36 (SF-36), Nottingham Health Profile (NHP), and Sickness Impact Profile (SIP) are universal scales, and the Chronic Liver Disease Questionnaire (CLDQ), Hepatitis Quality Of Life Questionnaire (HQLQ), and Liver Disease Quality Of Life (LDQOL) are specific HRQOL scales. All the scales mentioned above have defects of varying degree and in any case are not applicable to LC patients. Some studies have indicated that the WHO-QOL is widely used by researchers to study the QOL of liver transplant recipients, while the NHP focuses on more severe levels of disability and has thus been known to be less sensitive to changes in conditions where effects are relatively mild [15,16].
The SIP, in contrast, has a broad coverage of topics, but is therefore very long [17]. The SF-36 is applicable to a broader range of conditions, but has the common disadvantage of generic instruments; namely, they are not designed to identify disease-specific domains that may be important for establishing clinical changes [18]. The HQLQ consists of the widely validated generic SF-36 with five added disease-specific subscales, but it excludes patients with a chronic liver disease other than HCV. The CLDQ is a short and therefore feasible questionnaire, but is unable to discriminate between more advanced stages of liver disease. The LDQOL addresses a variety of domains, but is therefore very long (101 items) [10]. The LDSI 2.0 developed by Van der Plas et al. is short and straightforward (only 18 items) and focuses on symptom severity and symptom hindrance, evaluating how patients experience these specific symptoms during daily activities [19]. In this study, however, we intended to measure other aspects in addition to symptoms. The translated CLDQ is also used to measure the quality of life of Hepatitis B patients [20], and although its reliability and validity have been evaluated, the cultural gap is difficult to bridge. In addition, the instrument has some inherent defects that make it inapplicable to LC patients.
The above-mentioned instruments are designed for chronic liver disease, but not for LC specifically, and the two disease types differ. Another point worth noting is that Japanese research has found no statistically significant differences among different severity levels of liver disease [13]. However, the LC-PROM tool differs from the scale these researchers used, which was translated directly from English. The LC-PROM is designed specifically for LC, and its item pool took shape through in-depth interviewing and cognitive testing of patients. Therefore, our instrument may be accepted by respondents more easily and may perform better in measuring patients' health status.
At present, liver disease questionnaires mainly focus on "physical", "psychological" and "limitation" dimensions. The CLDQ also includes just six subdomains: abdominal symptoms, fatigue, systemic symptoms, activity, emotional function, and worry [21]. The LC-PROM contains a vital addition: a therapeutic domain to obtain information about treatment satisfaction, compliance and drug side effects. Satisfaction with treatment is the major outcome index in clinical trials of new drugs; this additional domain provides information about the effects that the trial drug has on targeted patients' health (such as appetite symptoms, cognitive ability, independence, anxiety and depression, and confidence) and points out the compliance characteristics of the new drug among patients. These are valuable data for clinical therapeutic drug development.
Additionally, optimal therapy can be selected according to these measurement data. In the social domain, the family relationship was emphasized, reminding readers of the important role of family support during patient recovery.
During the item selection process, in addition to using subjective methods like cognitive tests and expert discussions, we combined seven kinds of statistical methods to refine the item pool to ensure that items retained were maximally accurate, objective and reliable. Methods employed to develop related scales are still limited to CTT. The innovation of our study is to put IRT into use in addition to CTT. IRT is able to make up for some disadvantages of CTT, allowing acquisition of items that reflect potential traits of the population more accurately.
The instrument demonstrated excellent discriminant ability among LC patients with varying courses of disease. At a basic level, physicians can judge different stages of disease according to the results of the LC-PROM. This will save time relative to the method of full reliance on laboratory indicators.
In short, the LC-PROM instrument we developed fills a gap in patient-reported clinical outcomes of LC, and lacks the deficiencies seen with existing liver disease PRO tools. It also has the capacity to discriminate disease course, and to evaluate clinical effects and HRQOL accurately; therefore, it will provide valuable data for new drug development for LC.
However, this study still has several limitations that will be addressed in further research. To begin, Cronbach's α coefficient for the therapeutic domain in the LC-PROM was less than 0.70, which suggests that the internal coherence of this domain needs to be improved further. As seen in the CFA results, the factor loading for item SOD3 ("I have told my worries to my family") is only 0.35, but in consideration of its special meaning, namely support from family during illness, we kept it in the final scale. In fact, in the item selection phase, SOD3 was already suggested for deletion along with SOD1 ("Friends and relatives take care of my disease"), but we maintained this item for the same reason. In addition, there are no items about sexual function in the scale. The participants expressed that these types of questions were a little sensitive and difficult to respond to. We were concerned about a low response rate and poor overall reliability and validity; therefore we did not include this information in the scale. In order to expand the scope of use, a scale containing such an item will be generated in a revised version.
A second limitation relates to criterion validity. The LC-PROM instrument was designed for LC patients, and although participants at different stages of the clinical course were sampled, LC is the final stage of liver disease progression, and patients are often too weak to complete a lengthy scale. Introducing too many tests leads to test fatigue and noncompliance, which increases both survey cost and patients' exhaustion levels; both influence survey results negatively. Therefore, we did not conduct criterion validity analysis in this study. Last but not least, because of limited resources, our samples were recruited from restricted regions and therefore may not be representative of all patients with LC.

Conclusions
Our study provides strong evidence for excellent reliability and validity of a PRO instrument for LC. We do not suggest that the LC-PROM can replace other related questionnaires on liver disease, but it can obtain valuable information on patients' health conditions, evaluate clinical effects, inform therapeutic method selection and new drug development, as well as health service deployment and clinical research.