Sample and design of source trials
Data from Japanese patients with chronic HF were drawn from three phase II trials: SOCRATES-REDUCED, SOCRATES-PRESERVED, and ARTS-HF Japan.
The SOCRATES-REDUCED and SOCRATES-PRESERVED studies were both multicenter, international, randomized, double-blind, placebo-controlled, dose-finding, phase II trials of vericiguat in patients with chronic HF. Details of the study methods have been described previously [23, 24]. In brief, patients with worsening chronic HF who had either reduced ejection fraction (EF < 45%; HFrEF) in SOCRATES-REDUCED or preserved EF (EF ≥ 45%; HFpEF) in SOCRATES-PRESERVED were randomized to one of five treatment arms (four vericiguat and one placebo) and received treatment for 12 weeks.
In the present study, data from the following assessments of patients’ symptoms, functional status, or health state were analyzed: New York Heart Association (NYHA) class recorded at baseline, and the KCCQ and the EuroQol five-dimension, three-level questionnaire (EQ-5D-3L) scores assessed at baseline and at weeks 4, 8, and 12. The present study did not use other clinical data such as biomarkers (e.g., B-type natriuretic peptide [BNP], NT-proBNP), which correlate poorly with patients’ perception of their own health status [28, 29].
ARTS-HF Japan was a randomized, double-blind, active-comparator-controlled, dose-finding phase IIb trial of finerenone in Japanese patients with worsening chronic HF with reduced EF (< 40%) and type 2 diabetes mellitus and/or chronic kidney disease. Patients were randomized to one of six treatment arms (five finerenone and one eplerenone) and received treatment for 90 days. More detailed study methods, including inclusion/exclusion criteria, have been described previously. Data from the following assessments were used in the present analysis: NYHA class at baseline and the KCCQ and EQ-5D-3L scores at baseline, days 30 and 90, and 30 days after the last day of treatment (follow-up visit).
Clinical and health state measures
NYHA classification is a system for categorizing the extent of physical limitations in patients with HF. Physicians classify patients into one of four classes based on their functional limitations and symptom severity: I (no limitations of physical activity); II (slight limitation); III (marked limitation); and IV (unable to carry on any physical activity without discomfort).
The KCCQ is a 23-item (15 questions), self-administered questionnaire quantifying the following clinically relevant domains: physical limitations, symptom frequency, symptom severity, symptom stability, self-efficacy, social limitation, and QoL. The questions refer to the patient’s heart failure symptoms over the past 2 weeks, and each item is scored on a 5- to 7-point Likert scale. A missing value is assigned the average score of the scored items within the domain, and all item scores are summed within each domain. Each domain score is transformed to a 0 to 100 scale, with a higher score indicating a better state. Three summary scores are calculated as follows: 1) the total symptom score (TSS), the average of the symptom frequency and symptom severity domain scores; 2) the clinical summary score (CSS), the average of the physical limitations domain score and the TSS; and 3) the overall summary score (OSS), the average of the CSS and the QoL and social limitations domain scores. The symptom stability and self-efficacy domains are not incorporated into any of the KCCQ summary scores. The Japanese version of the KCCQ was translated and linguistically validated by the Mapi Research Institute (Lyon, France).
The EQ-5D-3L is a generic HRQoL measure, consisting of a five-dimension descriptive system and a visual analogue scale (VAS). In the descriptive system, mobility, self-care, usual activities, pain/discomfort, and anxiety/depression are each rated on a 3-point scale (1 = no problems, 2 = some problems, 3 = extreme problems). A patient’s responses to these five dimensions are then converted, using the Japanese value set, into an index value describing the patient’s overall health state, which ranges from −0.111 to 1.000 (a higher value indicates a better health state). The EQ-5D VAS records the patient’s health state on a scale of 0 (worst imaginable) to 100 (best imaginable).
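As an illustration of the summary-score arithmetic described above, the following minimal Python sketch derives the TSS and CSS from domain scores that are already on the 0 to 100 scale. The function name and dictionary keys are hypothetical and are not taken from the KCCQ scoring materials; the sketch covers only the TSS and CSS.

```python
from statistics import mean


def kccq_tss_css(domains):
    """Return the TSS and CSS from 0-100 KCCQ domain scores.

    `domains` is a dict with the (hypothetical) keys
    'physical_limitations', 'symptom_frequency' and 'symptom_severity'.
    """
    # TSS: average of the symptom frequency and symptom severity domains
    tss = mean([domains["symptom_frequency"], domains["symptom_severity"]])
    # CSS: average of the physical limitations domain and the TSS
    css = mean([domains["physical_limitations"], tss])
    return {"TSS": tss, "CSS": css}


example = kccq_tss_css(
    {"physical_limitations": 50, "symptom_frequency": 60, "symptom_severity": 80}
)
```

With the example values, the TSS is the midpoint of 60 and 80, and the CSS is the midpoint of that value and 50.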
Pooled data of Japanese patients with chronic HF from the above-described three trials were analyzed to evaluate the validity and reliability of the Japanese version of the KCCQ. Since symptoms and physical function are more proximal to the patient experience of the disease, our particular focus was on the CSS, a summary scale of symptoms and physical function, and its component domains (i.e., physical limitations, symptom frequency, symptom severity, and TSS). However, every domain and the OSS were also evaluated in this study. Analyses were performed using SAS Release 9.4 (SAS Institute Inc., Cary, NC, USA).
Construct validity was assessed by known-groups analysis, which tested whether the KCCQ scores could distinguish groups of patients with different levels of disease severity, as represented by NYHA class. The baseline KCCQ scores were summarized for each NYHA class at baseline. To test for an increasing or decreasing trend in scores across NYHA classes, the Jonckheere-Terpstra test was performed.
To further evaluate whether the KCCQ scores measured the constructs of interest, correlations between the baseline scores of the KCCQ and a related but different measure, the EQ-5D-3L, were analyzed using Pearson’s correlation for the EQ-5D VAS and the Spearman rank correlation for the five EQ-5D dimensions. The physical limitations domain score and CSS were both expected to have a moderate correlation with the three EQ-5D dimensions (i.e., mobility, self-care, and usual activities) that are considered to be related to functional domains. The symptom stability domain assesses the change in symptoms over the past 2 weeks, and the self-efficacy domain assesses patients’ knowledge or understanding of how to manage their symptoms. As these two domains assess concepts distinctly different from those evaluated by the EQ-5D dimensions, no meaningful correlation was expected between these domains and the EQ-5D-3L.
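For readers unfamiliar with the trend test, a minimal Python sketch of the Jonckheere-Terpstra statistic is given below. It is an illustration only, not the SAS procedure used in this study: it uses the large-sample normal approximation with the variance formula that assumes no ties, whereas real KCCQ data contain ties and would need a tie-corrected variance. The function name and the example scores are hypothetical.

```python
import math


def jonckheere_terpstra(groups):
    """Jonckheere-Terpstra trend test (normal approximation, no tie correction).

    `groups` is a list of samples ordered along the hypothesised trend,
    e.g. baseline scores for NYHA class II, III, IV.
    Returns (J, z, two-sided p-value).
    """
    # J counts, over every ordered pair of groups, how often a value in the
    # earlier group is smaller than a value in the later group (ties count 0.5).
    j_stat = 0.0
    for i in range(len(groups) - 1):
        for later in groups[i + 1:]:
            for x in groups[i]:
                for y in later:
                    if x < y:
                        j_stat += 1.0
                    elif x == y:
                        j_stat += 0.5

    sizes = [len(g) for g in groups]
    n = sum(sizes)
    # Null mean and variance of J (variance formula valid without ties)
    mean_j = (n * n - sum(s * s for s in sizes)) / 4.0
    var_j = (n * n * (2 * n + 3) - sum(s * s * (2 * s + 3) for s in sizes)) / 72.0
    z = (j_stat - mean_j) / math.sqrt(var_j)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return j_stat, z, p_two_sided


# Made-up scores that fall steadily from the first to the last group
j, z, p = jonckheere_terpstra([[90, 85, 80], [70, 65, 60], [50, 45, 40]])
```

A negative z indicates that scores decrease across the ordered groups, which is the direction expected if higher NYHA classes correspond to lower KCCQ scores.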
To assess whether items designed to measure the same construct actually do so, the internal consistency of each KCCQ domain/summary score, except for the symptom stability domain, which is a single-item domain, was assessed using Cronbach’s standardized α. An α of ≥ 0.7 is considered to indicate good interrelatedness among the items within the domain or summary score.
Test-retest reliability, or reproducibility, was assessed by analyzing whether the scores were stable when the patients’ conditions did not change. The test-retest analysis included patients in a stable condition, which was defined as no change in EQ-5D-3L scores between two timepoints: between week 8 and week 12 for the SOCRATES studies and between the last day of treatment and 30 days after the last treatment for the ARTS-HF Japan study. The concordance of the scores at these two timepoints was evaluated using the intraclass correlation coefficient (ICC). An ICC of ≥ 0.7 is considered to indicate good agreement, i.e., good reproducibility of the scale.
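The standardized form of Cronbach’s α depends only on the number of items k and the mean pairwise inter-item correlation r̄, as α = k·r̄ / (1 + (k − 1)·r̄). The Python sketch below illustrates this calculation; the function names and the toy responses are hypothetical, and the sketch assumes complete data with no constant items.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def standardized_alpha(items):
    """Cronbach's standardized alpha.

    `items` is a list of k lists, one per questionnaire item, each holding
    one response per patient (same patient order in every list).
    """
    k = len(items)
    pair_r = [pearson(a, b) for a, b in combinations(items, 2)]
    r_bar = sum(pair_r) / len(pair_r)  # mean inter-item correlation
    return k * r_bar / (1 + (k - 1) * r_bar)


# Two toy items whose correlation is 0.5
alpha_two_items = standardized_alpha([[1, 2, 3], [1, 3, 2]])
```

For two items with a correlation of 0.5, the formula gives 2 × 0.5 / 1.5, i.e. about 0.67, which would fall just short of the 0.7 threshold.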
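The text does not state which form of the ICC was computed. Purely as an illustration, the Python sketch below assumes a one-way random-effects, single-measurement ICC for two timepoints; other forms (e.g., two-way models) give different values. The function name and example scores are hypothetical.

```python
def icc_oneway(pairs):
    """One-way random-effects, single-measurement ICC for test-retest data.

    `pairs` is a list of (score_at_time1, score_at_time2) tuples,
    one tuple per stable patient. The ICC form is an assumption.
    """
    n = len(pairs)
    k = 2  # two measurements per patient
    grand_mean = sum(a + b for a, b in pairs) / (n * k)
    subject_means = [(a + b) / k for a, b in pairs]

    # Between-patient and within-patient mean squares
    ms_between = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    ms_within = sum(
        (a - m) ** 2 + (b - m) ** 2 for (a, b), m in zip(pairs, subject_means)
    ) / (n * (k - 1))

    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


icc_identical = icc_oneway([(50, 50), (70, 70), (90, 90)])
icc_shifted = icc_oneway([(10, 20), (30, 40)])
```

Identical scores at both timepoints give an ICC of 1, while any within-patient variation lowers the value.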
Responsiveness to patients’ clinical change was evaluated by analyzing whether the KCCQ scores improved when the patients’ health states improved. Patients with improved health states were defined as those with improvement in at least one EQ-5D dimension by ≥ 1 point without worsening in any EQ-5D dimension. We used the EQ-5D to define those who improved because it was shown to be responsive to clinical changes in patients with HF. Among the patients whose health states were expected to show improvement, changes in the KCCQ scores from baseline to 1 month (more precisely, week 4 for the SOCRATES studies and day 30 for the ARTS-HF Japan study) were analyzed by calculating the mean change in scores between the two timepoints and the effect size (mean change in score divided by standard deviation [SD] at baseline). An effect size of 0.2 is interpreted as small, 0.5 as medium, and 0.8 as large. Changes in scores between the two timepoints were also tested using a paired t-test with equal variances assumed.
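The improvement rule and the effect size can be sketched in Python as follows. This is an illustration under stated assumptions rather than the analysis code: the function names and data are hypothetical, EQ-5D levels are coded 1 to 3 so that a lower level is better, and the baseline SD is taken as the sample SD among the patients analyzed.

```python
from statistics import mean, stdev


def is_improved(eq5d_before, eq5d_after):
    """True if at least one EQ-5D dimension improved by >= 1 level
    and no dimension worsened (levels 1-3, lower is better)."""
    diffs = [after - before for before, after in zip(eq5d_before, eq5d_after)]
    return any(d <= -1 for d in diffs) and all(d <= 0 for d in diffs)


def effect_size(baseline_scores, followup_scores):
    """Mean change in score divided by the SD of the baseline scores."""
    changes = [f - b for b, f in zip(baseline_scores, followup_scores)]
    return mean(changes) / stdev(baseline_scores)


# One dimension improves, none worsen
improved = is_improved([2, 1, 2, 2, 1], [1, 1, 2, 2, 1])
# One dimension improves but another worsens
mixed = is_improved([2, 1, 2, 2, 1], [1, 2, 2, 2, 1])
# Every patient gains 10 points; baseline SD is 10
es = effect_size([40, 50, 60], [50, 60, 70])
```

In the toy data, a uniform 10-point gain against a baseline SD of 10 corresponds to an effect size of 1.0, which would be read as large under the thresholds above.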