This appears to be the first controlled study on the interpretation of patient-reported rating scale response categories in the clinical neurosciences. As such, it provides an initial evidence base and guidance for selecting response categories when developing new, or modifying available, patient-reported rating scales for PD. This is highly relevant because clarity, distinctiveness and equality of response category intervals are fundamental assumptions underpinning traditional rating scale construction [1, 32] and are recognized by, e.g., the FDA when judging the appropriateness of rating scales as clinical trial endpoints. Although the study focused on PD, the lack of systematic differences between people with PD and age-matched controls, as well as between subgroups defined by other health-related respondent characteristics, suggests that our findings are relevant beyond this context.
The identified best categories for three-, four-, five- and six-category response scales were not optimal, as they failed to fulfill the assumption of equal inter-category distances even when their 95% CIs were taken into account. For example, the distance between Some of the time and A good bit of the time is clearly different from that between A good bit of the time and Most of the time. Extrapolating data from this study to response categories in commonly used scales reveals similar problems. For example, the three non-extreme response options in the original PDQ-39 (Occasionally - Sometimes - Often) correspond to mean VAS locations of 30.8, 45.9 and 74.7, respectively. That is, the estimated distance between the latter two categories (28.8) is about twice as large as that between the former two (15.1). Similar or more extreme situations are evident with scales such as the PFS-16, FACIT-F, SF-36, PDQL, and PDQUALIF.
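The arithmetic behind the PDQ-39 example can be checked with a minimal sketch such as the one below. It uses only the three mean VAS locations quoted above; the function name `adjacent_distances` and the variable names are illustrative assumptions and are not part of the study's analysis.

```python
# Mean VAS locations quoted in the text for the three non-extreme
# response options of the original PDQ-39.
mean_vas = {"Occasionally": 30.8, "Sometimes": 45.9, "Often": 74.7}


def adjacent_distances(locations):
    """Return the distance between each pair of adjacent categories.

    Categories are taken in the order in which they appear in `locations`.
    (Hypothetical helper, introduced here for illustration only.)
    """
    labels = list(locations)
    return {
        (a, b): round(locations[b] - locations[a], 1)
        for a, b in zip(labels, labels[1:])
    }


distances = adjacent_distances(mean_vas)
lower = distances[("Occasionally", "Sometimes")]
upper = distances[("Sometimes", "Often")]
ratio = upper / lower

print(distances)
print(f"upper/lower distance ratio = {ratio:.2f}")
```

Under the equal-interval assumption the ratio would be 1; with these means the upper interval is nearly twice the lower one.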
Conceivably, this has at least two consequences. First, it may contribute to respondent difficulties in using the response options. Second, it is unknown what a given difference in raw rating scale scores represents, or by how much more someone has changed compared to people with smaller change scores. This illustrates the ordinal nature of raw rating scale data and argues against the legitimacy of analyzing and interpreting summed integral numerals from item responses as linear measures [3, 33, 34]. This latter aspect points to something that is perhaps partly overlooked when developing rating scales, namely the profound step taken when words (qualitative descriptors) are transformed into numbers (quantities) that are typically treated as linear measures.
A number of aspects need to be considered when interpreting the results presented here. First, the appropriateness of using a VAS to evaluate participants' interpretation of response categories may be questioned, since there is evidence against the linearity of VAS data. However, there is also evidence supporting the linearity of VAS ratings [36, 37], and the approach has been found useful in previous studies of rating scale category interpretations [9–11]. Second, our observations refer to the Swedish versions of the studied response categories, and the equivalence between language versions depends on cultural and semantic aspects, as well as on the quality of the translation. It has, for example, been shown that interpretations of the same response category can differ between languages as well as between cultures within the same language. However, the VAS values found here are in general agreement with those reported in previous studies using the same methodology and response categories [9, 10]. This suggests that our observations are not necessarily limited to a Swedish context. Third, we limited the types of response categories to frequency, intensity and agreement, and not all response categories of these types were covered here. Furthermore, the anchor categories were assumed to have fixed values at 0 and 100 mm, whereas their interpretations may actually differ between people. For example, studies investigating the perceived absolute frequency or probability of occurrence associated with frequency descriptors have found variations in the interpretation of Always as well as Never [38, 39].
The samples studied here were not randomly selected, which may limit the generalizability of the results. Furthermore, the sample sizes were somewhat limited, which reduces the precision of the observations and therefore renders the reported 95% CIs wider than would otherwise have been the case. However, given that the data failed to support the assumption of equal inter-category distances even when the observed CIs were taken into account, increasing the number of observations would presumably have yielded even stronger evidence against legitimate raw score summation of the response categories studied here. Similarly, the lack of differences between people with PD and control subjects, as well as between other subgroups, also needs to be interpreted in view of the sample size. That is, with increasing numbers of observations, statistically significant differences are increasingly likely to be detected. However, statistical significance says nothing about the practical significance of differences, which is not known for the current type of data.
The variability in interpretations of response categories was wide between individuals (as illustrated by the ranges of VAS values). This does not appear to be limited to patient-reported data, as studies of physicians' interpretation of various probability-related expressions (including some of the response categories studied here) have shown similar variability. This variability further complicates score interpretation at the individual patient level. An important aspect in this respect is the extent to which interpretations are stable within individuals over time. This needs to be assessed in further studies designed for this purpose. Such studies would also allow for direct evaluation of the error variation in VAS ratings, which is important for the interpretability of data but was not considered in this study.
Our observations concern the interpretation of response categories without reference to a particular context. This differs from the use of response categories in rating scales, where items articulate the context within which responses are requested. Studies have shown that the meaning of descriptors of, e.g., frequency differs according to context as well as respondents' experiences within that context [32, 40]. While this hampers valid comparisons of raw rating scale data between people and between scales tapping different variables, the magnitude of these effects for various health outcome variables is uncertain and will need to be addressed in future studies.
A large proportion of respondents expressed difficulties with the response category Don't know. This observation is in accordance with previous studies of neutral middle categories (e.g., Undecided, ?, and Not sure) in Likert-type response scales [19, 41, 42]. These studies have shown that there may be a variety of reasons why respondents select this type of response category and that, in practice, it does not operate as a middle category. It has therefore been recommended that it should not be presented as an integral part of a continuum of levels of agreement but, if used at all, be presented separately from categories expressing agreement levels. The observations reported here provide further qualitative evidence in support of this notion.
The ordinal nature of rating scale response categories challenges the legitimacy of summing individual item scores into total scores, as well as the interpretability of such scores [3, 4, 34]. However, there are means to empirically determine how the response categories used with a particular set of items function when administered to a particular group of people, and to avoid relying on the assumption of equal intervals when constructing total scores. Specifically, the polytomous Rasch measurement model for ordered response categories does not assume equal intervals between response categories, tests whether thresholds between adjacent categories are ordered in the expected manner, and provides a means of exploring the effect of collapsing adjacent categories [19, 41, 43, 44]. Additionally, the Rasch model defines, mathematically, the requirements that data need to meet in order to produce measurements, and when these requirements are met, scores can be expressed as invariant measures instead of ordinal numbers [33, 45–47]. This study argues for a wider application of this methodology, including appropriate appreciation of response category functioning, whenever rating scale data are used for measurement. For purposes of assessment (in contrast to measurement [33, 46, 48]), an alternative to summed total scores that takes the ordinal nature of rating scale response categories into consideration would be, e.g., the approaches proposed by Svensson.
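For orientation, the polytomous Rasch model referred to above can be written, in one common parameterization (the rating scale formulation), as sketched below. The symbols are illustrative assumptions and do not come from this study: \beta_n denotes the location of person n, \delta_i the location of item i, \tau_k the k-th threshold between adjacent categories, and m the highest category score.

```latex
% Probability that person n responds in category x of item i.
% Notation is illustrative and not taken from the source.
\[
P(X_{ni} = x) =
\frac{\exp\left(\sum_{k=1}^{x} (\beta_n - \delta_i - \tau_k)\right)}
     {\sum_{j=0}^{m} \exp\left(\sum_{k=1}^{j} (\beta_n - \delta_i - \tau_k)\right)},
\qquad x = 0, 1, \dots, m,
\]
% The empty sum (x = 0 or j = 0) is taken to be zero.
% Thresholds are ordered as expected when \tau_1 < \tau_2 < \dots < \tau_m.
```

In this formulation the distances between successive thresholds are estimated from the data rather than fixed in advance, which is why the model does not rely on equal intervals between response categories.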