Psychometric properties of the itch numeric rating scale, skin pain numeric rating scale, and atopic dermatitis sleep scale in adult patients with moderate-to-severe atopic dermatitis

Background The Itch Numeric Rating Scale (NRS), Skin Pain NRS, and Atopic Dermatitis Sleep Scale (ADSS) are self-administered patient-reported outcome (PRO) instruments developed to assess symptoms in patients with atopic dermatitis (AD). The objective of this study was to evaluate the psychometric properties (reliability, validity, and responsiveness) and interpretability thresholds of these PROs using data from three pivotal Phase 3 studies in adults.
Methods BREEZE-AD1, BREEZE-AD2, and BREEZE-AD5 evaluated the safety and efficacy of baricitinib in adults with moderate-to-severe AD. Clinician-reported outcomes and other PROs commonly assessed in patients with AD were used to evaluate test–retest reliability, convergent and divergent validity, known-groups validity, and responsiveness, and to estimate meaningful change thresholds (MCTs) for the Itch NRS, Skin Pain NRS, and ADSS.
Results The test–retest reliability of the Itch NRS, Skin Pain NRS, and ADSS was evidenced by generally large intraclass correlation coefficients (> 0.7) in stable groups of patients between baseline and Week 1 and between Weeks 4 and 8. Moderate-to-large correlations (r > 0.4) at baseline and Week 16 were generally observed between each measure and other PROs measuring the same concept, supporting convergent validity. Small-to-moderate correlations with clinician-reported outcomes demonstrated divergent validity. Each instrument was able to distinguish between known groups of disease severity as assessed using other indicators of AD severity. The responsiveness of the Itch NRS, Skin Pain NRS, and ADSS was demonstrated through significant differences in their change scores from baseline to Week 16 between categories of change in another PRO over the same period. Thresholds for interpreting meaningful change were estimated as −4.0 for the 0–10 Itch NRS and Skin Pain NRS items, −1.25 for the 0–4 ADSS Items 1 and 3, and −1.50 for the 0–29 ADSS Item 2; these values correspond to a moderate degree of change.
Conclusions Results of this study demonstrate that the psychometric properties of the Itch NRS, Skin Pain NRS, and ADSS are good to excellent. These findings support the use of these instruments in daily assessment of AD symptoms in adults with moderate-to-severe AD.
Trial registration ClinicalTrials.gov numbers: NCT03334396, NCT03334422, and NCT03435081.

Background
Patients with moderate-to-severe atopic dermatitis (AD) experience a heavy disease burden that substantially impacts both physical and mental functioning. Intense itch, skin pain, and related sleep disturbance are highly prevalent symptoms that patients with AD report as significantly affecting their quality of life (QoL) [1,2]. The most commonly used instruments to assess the severity of AD include the Investigator Global Assessment (IGA) and the Eczema Area and Severity Index (EASI) [3][4][5]. These instruments are based on a physician's visual assessment of clinical signs, and thus fail to capture the patient-experienced symptoms of itch and skin pain and their impact on sleep. Although itch, skin pain, and sleep disturbance are important to patients with AD, measurement of these burdensome symptoms in clinical trials has so far been limited. Specific patient-reported outcome (PRO) measures may help to better characterize the burden of these symptoms.
The Itch Numeric Rating Scale (NRS), Skin Pain NRS, and Atopic Dermatitis Sleep Scale (ADSS) are PROs designed specifically to measure the severity of a patient's itch, the severity of skin pain, and the impact of itch on sleep, respectively. These tools were developed according to the Food and Drug Administration (FDA) PRO guidelines [6] as simple, self-administered assessments for the daily electronic diaries used in AD clinical trials. Previous studies found that the Itch NRS, Skin Pain NRS [7], and ADSS had good content validity, i.e., they represent aspects of disease that are meaningful to patients. However, the psychometric properties of each measure were not assessed. An instrument can capture clinically relevant information yet lack sufficient validity, reliability, or interpretability to be used in clinical trials or practice. Evidence of these psychometric properties is therefore needed to support the use of these measures in clinical trials. The objective of this study was to determine the reliability, validity, responsiveness, and meaningful change of the Itch NRS, Skin Pain NRS, and ADSS in patients with moderate-to-severe AD using data from three Phase 3 clinical trials.

Study population
BREEZE-AD1 (AD1), BREEZE-AD2 (AD2), and BREEZE-AD5 (AD5) were three multicenter, randomized, double-blind, placebo-controlled, parallel-group Phase 3 clinical trials that evaluated the safety and efficacy of once-daily oral baricitinib 1 mg, 2 mg, and 4 mg (4 mg in AD1 and AD2 only) versus placebo in adult patients with moderate-to-severe AD. In each trial, patients were ≥ 18 years old and were intolerant of, or inadequate responders to, topical therapy. At screening and baseline, patients were required to have an EASI score ≥ 16, a validated Investigator Global Assessment for Atopic Dermatitis (vIGA-AD™) score ≥ 3, and a body surface area (BSA) involvement ≥ 10%. Full details of each study, including the primary efficacy and safety outcomes, have been reported previously [8,9]. Each study was conducted with informed consent, under institutional review board approval, and in accordance with the Declaration of Helsinki (ClinicalTrials.gov numbers: NCT03334396 (AD1), NCT03334422 (AD2), and NCT03435081 (AD5)).

Instruments used in the psychometric analyses
Itch NRS, Skin Pain NRS, ADSS
The Itch NRS is a single item designed to capture information on self-reported severity of worst itching each day. Patients were asked to rate itching severity based on the worst level of itching in the past 24 h using an 11-point scale from 0 ("no itch") to 10 ("worst itch imaginable"). The single-item Skin Pain NRS assesses self-reported severity of worst skin pain each day. For this, patients were asked to select a number from 0 ("no pain") to 10 ("worst pain imaginable") that best described the worst level of skin pain in the past 24 h. The three-item ADSS captures the self-reported impact of itch on sleep each day, covering difficulty falling asleep (Item 1), number of night-time awakenings (Item 2), and difficulty falling back asleep after waking (Item 3) during the previous night. Each ADSS item was scored individually. For Items 1 and 3, patients were asked to select a score ranging from 0 ("not at all") to 4 ("very difficult"). For Item 2, patients selected the number of times they woke up each night, ranging from 0 to 29 times. Patients only answered Item 3 if their answer to Item 2 was greater than 0. These three PROs were self-assessed using a daily electronic diary from screening through Week 16. Information was entered into the electronic diary at the end of each patient's day. For each measure, weekly mean scores using the previous 7 days were calculated if at least 4 of the 7 diary values were non-missing. Weekly averages were calculated at baseline (Week 0) and Weeks 1, 2, 4, 8, 12, and 16.

Other scales
The PROs used to evaluate the psychometric properties of the Itch NRS, Skin Pain NRS, and ADSS included: (1) the Dermatology Life Quality Index (DLQI) [10], a self-reported measure of the impact of AD on QoL; (2) the Patient Oriented Eczema Measure (POEM) [11], a self-assessed disease severity score; and (3) the Patient Global Impression of Severity-Atopic Dermatitis (PGI-S-AD). More specifically, the PGI-S-AD is a single item asking patients to rate their overall AD symptoms over the last 24 h, ranging from "no symptoms" to "severe." The PGI-S-AD measure was collected in the daily diary along with the Itch NRS, Skin Pain NRS, and ADSS items; the other PROs (DLQI and POEM) were assessed during clinic visits. In addition, the clinician-completed EASI, an evaluation of disease extent and clinical signs, was used in the psychometric validation.

Statistical analyses
The psychometric evaluation methods used in this study follow the published FDA guidance for assessing the measurement properties of PROs [6] and recent psychometric consensus discussions and presentations [12]. Unless otherwise stated, all analyses were conducted on eligible patients from the intent-to-treat (ITT) population who had weekly mean scores for the Itch NRS, Skin Pain NRS, or ADSS items at baseline. Analyses at visits following baseline included all patients who had data at baseline and at the respective follow-up days or visits. All analyses were conducted using SAS Version 9.3 or higher (SAS Institute Inc., Cary, NC; 2013).

Test-retest reliability
Test-retest reliability, which measures whether instrument scores are reproducible over time, was assessed in a stable patient population during the interval between Week 0 and Week 1 as well as between Weeks 4 and 8. Stable patients were defined as those in the ITT population whose weekly mean PGI-S-AD score changed by between −0.50 and +0.50 during each time interval. Intra-class correlation coefficients (ICCs) were calculated between the initial and retest periods. An ICC of ≥ 0.70 was considered acceptable agreement [13][14][15].
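As an illustration of the test-retest calculation, the following minimal Python sketch computes an ICC for paired weekly means in stable patients. The ICC form is not specified in this section, so the two-way random-effects, absolute-agreement, single-measurement form, ICC(2,1), is an assumption here, as are the function names and the inclusive ±0.50 stability bounds.

```python
from typing import List


def is_stable(pgis_test: float, pgis_retest: float, limit: float = 0.5) -> bool:
    """Stable patient: PGI-S-AD weekly mean changed by no more than `limit`.

    Inclusive bounds are an assumption; the text only says "between".
    """
    return abs(pgis_retest - pgis_test) <= limit


def icc_2_1(test: List[float], retest: List[float]) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single measure.

    Assumed form; `test` and `retest` hold one weekly mean per patient.
    """
    n, k = len(test), 2
    rows = list(zip(test, retest))
    grand = (sum(test) + sum(retest)) / (n * k)
    row_means = [(a + b) / k for a, b in rows]
    col_means = [sum(test) / n, sum(retest) / n]

    # Two-way ANOVA sums of squares (patients = rows, occasions = columns).
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in rows for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
```

Under this form, a systematic shift between test and retest lowers the ICC even when patients keep the same rank order, which is why an absolute-agreement ICC is a stricter check than a simple correlation.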

Construct validity (convergent and divergent validity)
Construct validity refers to the degree to which scores from one measure are theoretically consistent with those of another measure. Convergent and divergent validity were assessed using Spearman's correlations between each of the Itch NRS, Skin Pain NRS, and ADSS items, and the scores of the PGI-S-AD, DLQI, POEM, and EASI.
It was hypothesized that convergent validity, evidenced by moderate or large correlations, would be demonstrated at Weeks 0 and 16 between each of the Itch NRS, Skin Pain NRS, and ADSS items with the other PROs related to AD symptoms (POEM, DLQI, and PGI-S-AD), and that divergent validity, evidenced by small-to-moderate correlations, would be demonstrated between each of the instruments of interest with the more distally related clinician-completed assessment (EASI).
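For readers who want to reproduce this kind of correlation analysis, the sketch below computes Spearman's correlation as the Pearson correlation of average ranks, which handles tied scores. It is an illustrative pure-Python implementation with hypothetical function names, not the SAS procedure used in the study.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because the calculation uses ranks only, it is insensitive to the different score ranges of the instruments being compared (for example, a 0–10 NRS against a 0–28 POEM total).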

Known-groups validity (discriminant validity)
Known-groups validity was assessed by exploring the ability of each instrument to discriminate between subgroups of patients with different underlying disease severity. Based on the evaluation of construct validity, measures correlating with the Itch NRS, Skin Pain NRS, or ADSS above the 0.35 criterion for acceptable correlations [18,19] were considered in the analyses of known-groups validity.
Patients were stratified into severity groups based on baseline scores of PGI-S-AD (weekly mean score of < 3 "no symptoms to mild symptoms" and ≥ 3 "moderate-to-severe symptoms") and POEM (scores 0-7 "clear to mild," scores 8-16 "moderate," and scores 17-28 "severe to very severe") [11]. The weekly average scores on the Itch NRS, Skin Pain NRS, and ADSS items were compared between these groups using independent samples t-tests (2 groups) or analysis of covariance (ANCOVA) controlling for the effects of age, race, and gender (> 2 groups). When ANCOVA was used, post hoc t-tests compared the mean weekly score between consecutive severity groups. Any severity group with < 20 patients was omitted from the analysis to ensure sufficient data for interpretation.

Responsiveness
Responsiveness, the ability of the measure to detect change when change in the construct of relevance has occurred, was evaluated using ANCOVAs and post-hoc paired t-tests to assess significant differences in mean changes in the Itch NRS, Skin Pain NRS, and ADSS items from Week 0 to Week 4 and Week 0 to Week 16 between groups of patients with different degrees of change in the construct of relevance. The standardized response mean (SRM) [19] was used to interpret the magnitude of responsiveness of each measure; based on Cohen's recommendations [19], SRMs of 0.20, 0.50, and 0.80 represent small, moderate, and large changes, respectively [20].
Mean changes were assessed within 4 change categories of the POEM: (1) "much improved" patients who moved more than one health category to a better health category (> 1 category improvement); (2) "improved" patients who moved by one health category to a better health category (1 category improvement); (3) "stable" patients who remained in the same health category (no category change); and (4) "declined" patients who moved to a worse health category (≥ 1 category worsening). These categories were based on changes from baseline to the respective time point in the POEM severity category (scores 0-7 "clear to mild," scores 8-16 "moderate," and scores 17-28 "severe to very severe") [11]. It was hypothesized that statistically significant differences in the Itch NRS, Skin Pain NRS, and ADSS items would be observed between POEM change categories [11]. Differences in change scores between groups were tested using ANCOVA, controlling for age, gender, and race [21]. Post hoc t-tests and SRMs between consecutive change groups were also conducted.
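The POEM change categories and the SRM can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, and the SRM is taken in its usual form, the mean change score divided by the standard deviation of the change scores, computed here within a single group.

```python
from statistics import mean, stdev
from typing import Sequence


def poem_band(score: int) -> int:
    """Map a POEM total (0-28) to a severity band index.

    0 = clear to mild (0-7), 1 = moderate (8-16), 2 = severe to very severe (17-28).
    """
    if not 0 <= score <= 28:
        raise ValueError("POEM total score must be between 0 and 28")
    if score <= 7:
        return 0
    if score <= 16:
        return 1
    return 2


def poem_change_category(baseline: int, follow_up: int) -> str:
    """Classify a patient by the shift in POEM severity band since baseline."""
    shift = poem_band(baseline) - poem_band(follow_up)  # positive = improvement
    if shift > 1:
        return "much improved"
    if shift == 1:
        return "improved"
    if shift == 0:
        return "stable"
    return "declined"


def standardized_response_mean(changes: Sequence[float]) -> float:
    """SRM: mean change divided by the sample SD of the change scores."""
    return mean(changes) / stdev(changes)
```

With only three severity bands, "much improved" can occur only for patients who start in the highest band and end in the lowest, so that group is expected to be smaller than the others.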

Meaningful change estimation
Meaningful change refers to the individual-patient level of differences in scores in the domain of relevance which patients perceive as meaningful [6].
Anchor-based assessment
An anchor-based analysis, with weekly mean PGI-S-AD serving as the anchor variable, was the primary method used to derive clinical interpretations of the Itch NRS, Skin Pain NRS, and ADSS items. Spearman's correlations were evaluated between the PGI-S-AD weekly average score and each measure at baseline, Week 4, and Week 16. Spearman's correlations were also used to compare the change in the PGI-S-AD weekly average with the change in each measure's weekly average from baseline to Week 4 and Week 16.
To determine within-patient meaningful change thresholds (MCTs), patients were classified into response groups based on their level of change in the PGI-S-AD between baseline and Weeks 4 and 16. These groups included "very marked improvement" (≤ −2.5 weekly average score change), "marked improvement" (> −2.5 and ≤ −1.5), "minimal improvement" (> −1.5 and ≤ −0.5), "no change" (> −0.5 and < 0.5), "minimal worsening" (≥ 0.5 and < 1.5), and "marked worsening" (≥ 1.5). MCTs on the Itch NRS, Skin Pain NRS, and ADSS items were based on change from baseline to Week 16 (primary analysis) and baseline to Week 4 (sensitivity analysis) within these PGI-S-AD response groups. A range of MCT estimates (minimal, moderate, and large) was computed for changes in each measure based on observed changes in the minimal, marked, and very marked PGI-S-AD improvement groups. The final MCT estimate for each measure was taken as the MCT equivalent to a moderate degree of change.
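The response-group cut-offs above translate directly into code. The sketch below classifies a patient by PGI-S-AD change and then summarizes the change in a target measure within each group; summarizing by the group mean is an assumption (the text refers only to "observed changes"), and the function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def pgis_response_group(change: float) -> str:
    """Classify a change in weekly mean PGI-S-AD (follow-up minus baseline)."""
    if change <= -2.5:
        return "very marked improvement"
    if change <= -1.5:
        return "marked improvement"
    if change <= -0.5:
        return "minimal improvement"
    if change < 0.5:
        return "no change"
    if change < 1.5:
        return "minimal worsening"
    return "marked worsening"


def anchor_based_estimates(
    pairs: Iterable[Tuple[float, float]],
) -> Dict[str, float]:
    """Mean change in the target measure within each PGI-S-AD response group.

    `pairs` holds (PGI-S-AD change, target measure change) per patient.
    Using the mean as the summary statistic is an assumption of this sketch.
    """
    groups = defaultdict(list)
    for anchor_change, measure_change in pairs:
        groups[pgis_response_group(anchor_change)].append(measure_change)
    return {group: mean(values) for group, values in groups.items()}
```

In this framing, the estimate for the "marked improvement" group would correspond to the moderate MCT, with the "minimal" and "very marked" groups giving the minimal and large estimates.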
Distribution-based methods
Meaningful change analyses were also supported by distribution-based methods, which identify the raw score change on a measure that will produce a prespecified effect size and identify a change that is beyond measurement error [22]. Distribution-based estimates were derived using weekly averages of the Itch NRS, Skin Pain NRS, and ADSS items at baseline. MCT estimates equivalent to 0.2, 0.5, and 0.8 pooled SDs were calculated. The Standard Error of Measurement (SEM) was calculated using the ICC from the test-retest analysis.
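A minimal sketch of these distribution-based quantities follows. It assumes the conventional formula SEM = SD × √(1 − ICC) and, for simplicity, uses the sample SD of a single set of baseline scores rather than a pooled SD; both choices and the function names are assumptions of this illustration.

```python
import math
from statistics import stdev
from typing import Dict, Sequence


def sd_based_thresholds(baseline_scores: Sequence[float]) -> Dict[float, float]:
    """Raw score changes equivalent to 0.2, 0.5, and 0.8 baseline SDs."""
    sd = stdev(baseline_scores)
    return {fraction: fraction * sd for fraction in (0.2, 0.5, 0.8)}


def standard_error_of_measurement(
    baseline_scores: Sequence[float], icc: float
) -> float:
    """SEM = SD * sqrt(1 - ICC), with the ICC taken from test-retest analysis."""
    return stdev(baseline_scores) * math.sqrt(1.0 - icc)
```

An anchor-based threshold that exceeds the SEM can be read as a change larger than what measurement error alone would be expected to produce.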

Handling of missing data
For Weeks 1, 2, 4, 8, and 12, weekly mean scores for Itch NRS, Skin Pain NRS, and ADSS items were set to missing if there were fewer than 4 non-missing values in the 7-day period before the respective clinic visit. For Week 0 and Week 16 analyses, if there were fewer than 4 non-missing assessments during the week prior to the visit, the 7-day window was extended by 1 day at a time (up to a maximum of 7 additional days) until there were at least 4 non-missing values.
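The missing-data rule can be expressed as a small function. In this sketch the diary is a list indexed by study day with `None` for a missing entry, and the window is extended backward in time; the indexing scheme, the direction of extension, and the function name are assumptions made for illustration.

```python
from statistics import mean
from typing import Optional, Sequence


def weekly_mean(
    diary: Sequence[Optional[float]],
    visit_day: int,
    min_valid: int = 4,
    max_extension: int = 0,
) -> Optional[float]:
    """Weekly mean of the diary entries in the 7 days before `visit_day`.

    Returns None (missing) if fewer than `min_valid` entries are available.
    `max_extension` is the number of extra days the window may grow by,
    one day at a time: 0 for Weeks 1-12, 7 for Week 0 and Week 16.
    """
    for extra_days in range(max_extension + 1):
        start = max(0, visit_day - 7 - extra_days)
        window = [v for v in diary[start:visit_day] if v is not None]
        if len(window) >= min_valid:
            return mean(window)
    return None
```

For example, a diary with only three entries in the 7 days before the visit yields a missing weekly mean under the Week 1 to 12 rule, but can still yield a value under the Week 0 and Week 16 rule if a fourth entry exists within the extended window.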

Results
A total of 624 patients in AD1, 615 patients in AD2, and 440 patients in AD5 were included. Patients' baseline demographics and scores for the instruments of interest and other assessments are listed in Table 1.

Test-retest reliability
The results of the test-retest analysis for each instrument in each study are provided in Table 2. Across all studies, the ICCs ranged from 0.770 to 0.875 for the weekly average Itch NRS and from 0.753 to 0.845 for the weekly average Skin Pain NRS; this indicated acceptable agreement among stable patients using both 1-week and 4-week intervals. For ADSS Items 1, 2, and 3, the ICCs for the weekly average score ranged from 0.754 to 0.843, 0.585 to 0.921, and 0.671 to 0.784, respectively, indicating generally acceptable agreement using both 1- and 4-week assessment intervals. These high levels of agreement indicated that all measures had good test-retest reliability.

Construct validity (convergent and divergent validity)
Results supporting convergent and divergent validity of the Itch NRS, Skin Pain NRS, and ADSS items are shown in Table 3. Moderate-to-large correlations between the reference PRO assessments of AD symptoms and the Itch NRS, Skin Pain NRS, and ADSS items were generally observed at baseline and Week 16, supporting convergent validity, whereas correlations with the clinician-reported EASI were small to moderate, supporting divergent validity.

Known-groups validity (discriminant validity)
Table 4 reports the findings of the known-groups validity analysis of each instrument using PGI-S-AD and POEM subgroups to define AD severity. At baseline, in all 3 studies, compared with patients in the moderate categories, patients in the severe categories of the PGI-S-AD and POEM had significantly more itching (p < 0.0001), skin pain (p < 0.0001), sleep disturbance (p < 0.0001), night-time awakenings (p < 0.01), and difficulty falling back asleep after waking (p < 0.0001), as demonstrated by higher mean scores on the Itch NRS, Skin Pain NRS, and ADSS Items 1, 2, and 3, respectively. These findings suggest that the Itch NRS, Skin Pain NRS, and ADSS items are able to distinguish between known groups based on disease severity.

Responsiveness
The responsiveness of the Itch NRS, Skin Pain NRS, and ADSS items between Weeks 0 and 16 and between Weeks 0 and 4 is shown in Tables 5 and 6, respectively. In all three studies, the magnitude of improvement in each instrument increased with greater improvement in the POEM, supporting the ability of each measure to detect change in the construct of relevance where change has occurred. For the Itch NRS and Skin Pain NRS, in each study at Weeks 4 and 16, the "much improved" group differed significantly from the "improved" group (p < 0.001 for Itch NRS, p < 0.05 for Skin Pain NRS), and the "improved" group differed significantly from the "stable" group (p < 0.0001 for both). In each study, at Week 16, the scores of each ADSS item increased with each improvement category; however, not all comparisons between consecutive improvement categories were statistically significant (Table 5).

Anchor-based
Anchor-based estimates of the MCTs (minimal, moderate, and large) for each measure are listed in Table 7.

Distribution-based
Distribution-based MCTs are listed in Table 8. Compared with anchor-based thresholds, SD and SEM estimates were smaller for all measures but the ADSS Item 2; this indicated that the anchor-based estimates are generally above measurement error and thus that improvements in these measures reflect a true improvement in condition severity. The larger distribution-based estimates for ADSS Item 2 reflected the large variability and skewness of this measure at baseline.

Discussion
This study evaluated the psychometric properties of the Itch NRS, Skin Pain NRS, and ADSS using data from three clinical trials of patients with moderate-to-severe AD. For each measure, assessment of test-retest reliability found high levels of agreement in stable groups of patients across all three studies for both 1-week and 4-week comparisons, indicating reliability of each instrument when no change would be expected. As hypothesized, the construct validity of each measure was also demonstrated, with moderate-to-large correlations with other PROs (POEM, DLQI, and PGI-S-AD) supporting convergent validity and smaller correlations with the more distally-related provider assessment (EASI) supporting divergent validity. These findings suggest that the Itch NRS, Skin Pain NRS and ADSS measure the underlying concept of AD symptomatology and, moreover, encapsulate unique information regarding disease symptoms, which can complement clinician-reported assessments in clinical trials. In addition, comparisons of the Itch NRS, Skin Pain NRS, and each ADSS item between PGI-S-AD and POEM severity categories demonstrated each measure's ability to distinguish between known groups based on disease severity. Responsiveness was established through the ability of each instrument to discriminate significantly between subgroups of patients based on four change categories of the POEM ("much improved, " "improved, " "stable" and "declined"). Overall, the Itch NRS, Skin Pain NRS, and ADSS were determined to be highly reliable, valid, and responsive, supporting the use of these PRO instruments in daily assessment of AD symptoms in adults with moderate-to-severe AD.
Using anchor- and distribution-based analyses, thresholds for interpreting change of each measure were derived as criteria to assess treatment benefits in patients [23,24]. Changes of 1.25 points in ADSS Items 1 and 3 and 1.5 points in ADSS Item 2 were found to optimally demonstrate clinically meaningful improvements in sleep disturbance. These findings further confirm previous psychometric validation data for the Itch NRS in AD and psoriasis [23,24]. The potential importance of these measures in clinical practice is indicated by the fact that patients with AD have identified itch, skin pain, and sleep disturbance as bothersome and distressing symptoms of their disease [25], but these symptoms are difficult or impossible for clinicians to assess using conventional tools. There is thus an unmet need for measures that can assess these patient-perceived symptoms. For example, EASI or BSA instruments assess important signs of disease, but they do not capture the impacts of itch, skin pain, and sleep disturbance from AD as perceived by patients. Existing PROs for AD, such as the POEM and the Scoring Atopic Dermatitis (SCORAD) index, include sleep items, but these items are included as part of a total score and do not assess the full impact of itch on sleep disturbance [11].
[Table footnotes: between-group comparisons used LS means and SEs derived from an ANCOVA adjusting for age, sex, and race, with p values for pairwise comparisons between consecutive severity groups; any severity group with < 20 patients was omitted and the analysis was conducted on the remaining severity groups.]
Although this study demonstrated strong evidence for the reliability, validity, and responsiveness of the Itch NRS, Skin Pain NRS, and ADSS, the data used in this psychometric validation are from clinical trials and hence may not be generalizable to clinical practice. In addition, the inclusion and exclusion criteria of the three underlying studies limit this validation to adult patients with moderate-to-severe AD. Only a few patients were available in the mild group for assessing known-groups validity of each instrument using PGI-S-AD and POEM subgroups to define AD severity. The results of this study are also limited to a subset of patients who fluently spoke a language into which the assessment tool had been translated. The FDA recommends daily assessment of symptoms by patients, as a shorter recall period allows for more reliable interpretation of symptom data [6]. However, while averaging scores over a 7-day period accounts for day-to-day variation in this analysis, this reduced variability may artificially increase the correlations with other measures [24]. Additionally, a similar study of itch severity measurement suggested that a 7-day recall may be more clinically relevant [27]. Finally, future studies are warranted to assess correlations between the Itch NRS, Skin Pain NRS, and ADSS, which may further support the use of the three separate instruments in clinical practice.

Conclusions
The results of this study demonstrate that the Itch NRS, Skin Pain NRS, and ADSS are highly reliable, valid, and responsive measures of symptoms that are important to patients with AD. In addition, each PRO is able to measure clinically important symptom changes in these patients. These findings support the use of these PRO instruments in clinical trials of patients with moderate-to-severe AD.