### Main data collection

The psychometric and scaling properties of the HADS were assessed among 298 patients recruited from five regional MND care centres in the United Kingdom: The Walton Centre for Neurology and Neurosurgery in Liverpool, Preston Royal Hospital, Oxford John Radcliffe Hospital, Salford Hope Hospital, and Sheffield Royal Hallamshire Hospital. All participants had a diagnosis of MND from a neurologist with expertise in MND. Patients were not selected on the basis of age, sex, disease presentation, or disability status. Over a period of twelve months, questionnaires were either handed out during a routine clinic appointment or sent to the patients' homes, along with a newsletter describing the research activities of their local care centre. Where patients were unable to complete the questionnaires by themselves, a nurse or caregiver was allowed to act as a scribe. Informed consent was given by each participant.

Ethical permission for this study was granted by the relevant hospital committees in the U.K. (Hammersmith 05/Q0401/7 and Tayside 07/S1402/64), and by local research governance committees at all participating sites.

### Rasch Analysis

To evaluate the scaling properties and construct validity of the HADS, the Rasch measurement model was used [8]. Rasch analysis is a probabilistic mathematical modelling technique used to assess the properties of outcome measures. Where data are shown to accord with model expectations, the internal construct validity of the scale is supported, and a transformation of ordinal data to interval scaling is possible [12].

For Rasch analysis, sample size requirements are influenced by scale targeting. For a scale that is well targeted (*i.e*. 40-60% endorsement rates for dichotomous items), a sample size of 108 will give accurate estimates of person and item locations (99% confidence of locations being within 0.5 logits). A sample size of 243 will provide accurate estimates of item and person locations irrespective of scale targeting [13].

Analyses used to assess whether the scale conformed to Rasch model expectations are briefly explained below. A comprehensive review with a more detailed explanation of the Rasch analytical process may be found elsewhere [10].

Rasch Unidimensional Measurement Model 2020 (RUMM2020) software (Version 4.1, Build 194) was used for the Rasch analyses presented in this study [14].

#### 1) Fit to the Rasch model

Rasch model fit is primarily indicated by non-significant fit statistics, showing that the scale does not deviate from model expectations. For example, both summary and individual item chi-square statistics should be non-significant after adjusting for multiple testing. In addition, both person and item fit are assessed through their residuals, which capture the differences between the observed data and what the model expects for each person and each item. At the summary level, perfect fit is represented by a mean residual of zero and an SD of 1, while at the individual level for persons and items, a residual value within ± 2.5 is appropriate.

#### 2) Item difficulty and person ability

Estimates of a location on a common metric are provided for both persons (ability) and items (difficulty). In the context of the health sciences, 'ability' may be understood to represent the amount the person has of a given symptom, trait or feeling, and 'difficulty' may be understood to represent the magnitude of the symptom, trait or feeling represented by the item. For example, an item that reflected the sentiment that life was no longer worth living would be expected to represent a high level of depression when affirmed.

When data from a patient-reported outcome scale are analysed through the Rasch model, both the items and persons are calibrated on the same metric, which is measured in logits, or log-odds units. This allows for a comparison of the match between patients and items, showing whether or not the scale is well targeted. In the case of dichotomous items measuring depression, a patient with a logit value of zero on the depression scale would have a 50% chance of affirming an item whose level of depression (difficulty) was also at zero logits. A person with a level of depression at +2 logits (high depression) would have an 88% chance of affirming the item located at zero logits, whereas a person at -2 logits (low depression) would only have a 12% chance of affirming that item.
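The percentages quoted above follow directly from the logistic form of the dichotomous Rasch model. The short sketch below reproduces them; the function name `rasch_probability` is illustrative only and is not part of RUMM2020.

```python
import math


def rasch_probability(person_logit: float, item_logit: float) -> float:
    """Probability of affirming a dichotomous item under the Rasch model.

    Both arguments are locations on the same logit scale; the probability
    depends only on their difference.
    """
    return 1.0 / (1.0 + math.exp(-(person_logit - item_logit)))


# Item located at zero logits, persons at -2, 0 and +2 logits.
for person in (-2.0, 0.0, 2.0):
    chance = rasch_probability(person, 0.0)
    print(f"person at {person:+.0f} logits: {chance:.0%} chance of affirming")
```

Because only the difference between the two locations matters, shifting the person and the item by the same amount leaves the probability unchanged.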

#### 3) Item category thresholds

The Rasch model allows for the analysis of the way in which response categories are understood by respondents. For example, in the case of a Likert-style response as used in the HADS, some respondents may have difficulty differentiating between "Never" and "Very Rarely". In instances where there is too little discrimination between two response categories on an item, collapsing the categories into one response option can improve scale fit to the Rasch model.

Furthermore, where the same rating scale structure across items is not supported (*i.e*. where the distances between category thresholds vary across items), the unrestricted 'partial credit' Rasch polytomous model is used with conditional pair-wise parameter estimation [15].
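As an illustration of how item-specific thresholds enter the partial credit model, the sketch below computes category probabilities for a single item using the standard textbook form of the model. It does not perform the conditional pair-wise estimation referred to above, and the function name and threshold values are hypothetical.

```python
import math


def pcm_category_probabilities(person_logit, thresholds):
    """Category probabilities for one item under the partial credit model.

    `thresholds` holds the item's own threshold locations in logits;
    an item with m thresholds has m + 1 response categories (0..m).
    """
    # Category 0 has an empty sum; category k adds (person - threshold_j)
    # for every threshold j up to k.
    cumulative = [0.0]
    for tau in thresholds:
        cumulative.append(cumulative[-1] + (person_logit - tau))
    weights = [math.exp(c) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical item with four categories (three thresholds).
probs = pcm_category_probabilities(0.0, [-1.0, 0.0, 1.0])
print([round(p, 3) for p in probs])
```

With a single threshold the expression reduces to the dichotomous Rasch model, so the second category of `pcm_category_probabilities(2.0, [0.0])` is about 0.88.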

#### 4) Local dependency

An assumption of the Rasch model is the local independence of items. A good example of a violation of this assumption is where two stair climbing items are included in the same scale. A person who can climb several flights of stairs unaided must also be able to climb one flight of stairs. Such items are said to be locally dependent, and do not provide the same information as two independent items. This has the effect of spuriously inflating reliability, as well as affecting the parameter estimates of the Rasch model. Local dependency can be identified through the magnitude of residual item correlations, where items with residual correlations above 0.3 are considered to be locally dependent. The problem can be accommodated through the use of testlets, where the locally dependent items are simply added together into one 'super' item [16].
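The screening rule and the testlet remedy described above can be sketched as follows. This is a minimal illustration, assuming item residuals have already been exported from the Rasch software as one list per item; the item names, data values and helper functions are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def locally_dependent_pairs(residuals, cutoff=0.3):
    """Return item pairs whose residual correlation exceeds the cutoff.

    `residuals` maps an item name to its residuals, one value per person.
    """
    flagged = []
    for item_a, item_b in combinations(residuals, 2):
        r = pearson(residuals[item_a], residuals[item_b])
        if r > cutoff:
            flagged.append((item_a, item_b, r))
    return flagged


def make_testlet(responses, items):
    """Add the named items together into one 'super' item score per person."""
    return [sum(scores) for scores in zip(*(responses[i] for i in items))]


# Made-up residuals for three items and five persons.
residuals = {
    "item_a": [1.0, -0.5, 0.8, -1.2, 0.1],
    "item_b": [0.9, -0.4, 0.7, -1.0, 0.2],
    "item_c": [-0.3, 0.6, -0.9, 0.4, 1.1],
}
for item_a, item_b, r in locally_dependent_pairs(residuals):
    print(f"{item_a} and {item_b}: residual correlation {r:.2f}")
```

Pairs flagged in this way would then be combined with `make_testlet` and the scale re-analysed with the testlet in place of the original items.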

#### 5) Differential item functioning (DIF) [17]

Differential item functioning (DIF) occurs when different demographic or other contextual groups within the sample (*e.g*. males and females) respond in a different way to a certain question *when they have the same level of the underlying attribute*. Two types of DIF can be identified: uniform and non-uniform. Uniform DIF would occur, for example, when males respond consistently higher than females on an item, given the same level of depression. Non-uniform DIF would occur, for example, if females selected a higher response option to an item at lower levels of depression, compared to males, but a lower option at higher levels of depression. Differential item functioning is detected using analysis of variance (ANOVA, 5% alpha).

DIF was assessed for three contextual factors (called person factors within the Rasch analysis): Location (Liverpool/Salford/Oxford/Sheffield/Preston), Age (quartile split between participants, grouped < 55, 55-62, 63-70, > 71) and Gender.

#### 6) Person separation index

The Person separation index (PSI) reflects the extent to which items can distinguish between distinct levels of functioning (where 0.7 is considered a minimal value for research use; 0.85 for clinical use) [18]. Where the distribution is normal, the PSI is equivalent to Cronbach's alpha.
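A commonly cited formulation of the PSI is the proportion of variance in person estimates that is not attributable to measurement error. The sketch below uses that formulation as an assumption; the source does not state the exact computation performed by RUMM2020, and the function name and data are hypothetical.

```python
from statistics import variance


def person_separation_index(estimates, standard_errors):
    """Share of observed person variance that is not measurement error.

    `estimates` are person locations in logits; `standard_errors` are the
    corresponding standard errors of those locations.
    """
    observed_variance = variance(estimates)
    error_variance = sum(se ** 2 for se in standard_errors) / len(standard_errors)
    return (observed_variance - error_variance) / observed_variance


# Made-up person locations with a common standard error of 0.5 logits.
psi = person_separation_index([-2.0, -1.0, 0.0, 1.0, 2.0], [0.5] * 5)
print(f"PSI = {psi:.2f}")
```

Larger standard errors relative to the spread of person locations pull the index down, which is why poorly targeted scales tend to show lower values.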

#### 7) Unidimensionality

Finally, independent t-tests are employed to assess the final scale for unidimensionality. Two subsets of items are identified by a principal component analysis of the residuals, and the latent estimate of each person (and its standard error) is calculated independently from each subset. The two estimates for each person are then compared, and the proportion of t-tests falling outside the ± 1.96 range indicates whether the scale is unidimensional or not. Generally, the scale is considered unidimensional where fewer than 5% of the t-tests are significant, or where the lower bound of the binomial confidence interval for that proportion overlaps 5% [19].
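The comparison of the two sets of person estimates can be sketched as below. Two assumptions are made that the text does not specify: each person's t-value is taken as the difference between the two estimates divided by the square root of the sum of their squared standard errors, and the binomial confidence interval uses the normal approximation. All names and data are hypothetical.

```python
from math import sqrt


def unidimensionality_check(est_a, se_a, est_b, se_b, critical=1.96):
    """Compare person estimates from two item subsets.

    Returns the proportion of significant t-tests, the lower bound of its
    95% confidence interval, and whether that bound is at or below 5%.
    """
    n = len(est_a)
    significant = 0
    for a, sa, b, sb in zip(est_a, se_a, est_b, se_b):
        t = (a - b) / sqrt(sa ** 2 + sb ** 2)
        if abs(t) > critical:
            significant += 1
    proportion = significant / n
    # Normal approximation to the binomial interval (an assumption).
    half_width = critical * sqrt(proportion * (1 - proportion) / n)
    lower_bound = max(0.0, proportion - half_width)
    return proportion, lower_bound, lower_bound <= 0.05


# Made-up estimates for four persons from two item subsets.
result = unidimensionality_check(
    [0.0, 2.0, 0.5, -1.0], [0.5] * 4,
    [0.1, 0.0, 0.4, -1.2], [0.5] * 4,
)
print(result)
```

With so few persons the confidence interval is very wide, so the example illustrates the mechanics only and is not a realistic test.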