This is the first study of which we are aware to evaluate the responsiveness of a mental well-being measure at both the group and individual level. The study was possible because of the popularity and uptake of WEMWBS in evaluating interventions designed to improve mental well-being. Although WEMWBS was developed to measure mental well-being at the group or population level, there has also been demand from investigators to use the scale at the individual level. It is therefore important that we found WEMWBS to be responsive at both levels. Responsiveness was independent of the type of intervention and sample size, and whilst we cannot know with certainty whether the interventions delivered in the different studies were effective, our results suggest that WEMWBS is responsive in relatively small samples. WEMWBS is likely to be responsive because it evaluates mental well-being across both the eudaimonic and hedonic dimensions, and is therefore better able to detect changes in individuals' mental well-being.
When evaluating responsiveness at the group level through distribution-based approaches, the SRM is considered the most appropriate statistic. However, a variety of statistics have been used in the literature to evaluate responsiveness at the group level, including the paired t-test and Cohen's effect size (mean change in score divided by the standard deviation of the baseline score). In contrast to the paired t-test, the SRM is a sample-size-free statistic and therefore allowed us to compare responsiveness in studies of different sample sizes. Cohen's effect size is dependent on between-subject variability, whilst the SRM is dependent on within-subject variability. As our objective was to evaluate the responsiveness of WEMWBS in detecting within-subject change, we chose the SRM. Interestingly, the similarity of the standard deviations of the baseline and change WEMWBS scores in the studies evaluated means that Cohen's effect size will be comparable to the SRM for each study.
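The distinction between the two statistics can be sketched as follows. This is an illustrative calculation only, using hypothetical scores (not data from the studies evaluated); the function names are ours:

```python
import statistics

def srm(change_scores):
    """Standardized response mean: mean change divided by the SD of the
    change scores (driven by within-subject variability)."""
    return statistics.mean(change_scores) / statistics.stdev(change_scores)

def cohens_effect_size(change_scores, baseline_scores):
    """Cohen's effect size: mean change divided by the SD of the
    baseline scores (driven by between-subject variability)."""
    return statistics.mean(change_scores) / statistics.stdev(baseline_scores)

# Hypothetical data for illustration. Because the baseline and change
# SDs are similar here, the two statistics give similar values -- the
# situation observed across the studies evaluated.
baseline = [40, 44, 38, 46, 41, 35, 43, 39]
change = [8, -2, 5, 0, 7, -3, 4, 1]

print(round(srm(change), 2))                         # 0.61
print(round(cohens_effect_size(change, baseline), 2))  # 0.71
```

Note that the SRM does not depend on sample size, which is what permits comparison across studies of different sizes.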
In the majority of studies the SRM was greater than 0.5. This compares favourably with other mental illness and life satisfaction scales [38–40], generic health-related quality of life scales, and disease-specific scales. We found that WEMWBS demonstrated only minor floor or ceiling effects (<5%), considerably less than the 15% threshold that has been proposed, and that it was responsive in studies undertaken in those with and without underlying mental health problems. This contrasts with mental illness scales, which tend to be more responsive in populations with mental health problems. Although in the majority of studies the mean baseline score was below previously reported population norms, these findings suggest that WEMWBS has the capacity to detect change in populations with both good and poor mental health, and to detect subtle improvements.
In evaluating the significance of the SRM we determined the probability of change statistic P^, as it provides an intuitive interpretation of responsiveness at the group level. A value of 0.5 suggests that, if a change has occurred, the instrument is equally likely to have detected the change as not to have detected it, and is therefore not responsive. For the majority of the studies we found the probability of change statistic P^ to be above 0.7, suggesting that WEMWBS is responsive at the group level. We assumed that in all the studies the interventions were effective at improving mental well-being. The fact that WEMWBS was not responsive at the group level in all studies could be because the interventions were not effective, or because WEMWBS is not responsive to change in those populations. Only one study was undertaken in an adolescent population (Mindfulness in Schools), and in this study WEMWBS was not found to be responsive. The test-retest reliability coefficient for WEMWBS in the validation study undertaken in an adolescent population (0.66) was lower than the corresponding coefficient in the validation study undertaken in an adult population (0.83). It is possible that WEMWBS may not be as responsive to change in adolescents; further research is needed to investigate this, and whether participant characteristics impact on the responsiveness of WEMWBS.
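The interpretation of P^ above can be illustrated numerically. One common distribution-based formulation, which we assume here for illustration (it may differ from the exact computation used in this study), is that if change scores are normally distributed, the probability that a randomly chosen individual changes in the direction of improvement is the normal CDF evaluated at the SRM:

```python
from statistics import NormalDist

def prob_change(srm):
    """Probability-of-change statistic under an assumed normal model:
    P^ = Phi(SRM), the probability that an individual's change score
    lies in the direction of improvement."""
    return NormalDist().cdf(srm)

print(round(prob_change(0.0), 2))   # SRM = 0 gives P^ = 0.5: not responsive
print(round(prob_change(0.52), 2))  # SRM just above 0.5 gives P^ of about 0.7
```

Under this mapping, the finding that the SRM exceeded 0.5 in most studies is consistent with P^ exceeding 0.7 in most studies.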
In the five studies where item-level data was available, we found WEMWBS performed reasonably well in detecting change at the individual level. There is as yet no consensus on what constitutes an important change at the individual level, with some suggesting that a change score greater than 1 SEM is important, whilst others suggest 2.77 SEM. In all five studies we found that, at the higher threshold of 2.77 SEM, the lower limit of the 95% confidence interval for the proportion of individuals classified as improved was greater than the 2.5% expected if WEMWBS were not responsive at the individual level. An important finding from the individual-level analysis was the relatively stable Cronbach's alpha in adult populations. In the four studies undertaken in adult populations, Cronbach's alpha was consistently high and comparable to that in the validation study undertaken in an adult population. In the one study undertaken in an adolescent population, WEMWBS demonstrated satisfactory internal consistency; however, Cronbach's alpha was lower than that found in the validation study undertaken in an adolescent population.
It has been suggested that the SEM of a measure is independent of the sample. It was noteworthy that, across the five studies for which item-level data was available, the SEM was relatively comparable, suggesting that a single WEMWBS change-score threshold could be applied to classify individuals as improved. Previous literature suggests that an improvement of 0.5 units on each item of a Likert scale would equate to an improvement deemed important by individuals. This makes intuitive sense and, across the 14 items of WEMWBS, equates to an overall change score of 7. In the studies evaluated we found that a change score of 8 or more exceeded the higher threshold of 2.77 SEM. However, a change of 3 or more units (1 SEM) in an individual's WEMWBS score was greater than the measurement error in the majority of the studies, and thus could also be interpreted as important. Further research comparing change scores with self-reported global ratings of change (GRC) is warranted.
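The arithmetic behind these thresholds can be sketched as follows. The SEM is the baseline SD multiplied by the square root of one minus the reliability coefficient, and 2.77 is 1.96 × √2, the reliable-change multiplier at the 95% level. The SD and Cronbach's alpha below are hypothetical round figures chosen only to illustrate the calculation, not values taken from the studies:

```python
import math

def sem(sd_baseline, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability),
    with reliability typically taken as Cronbach's alpha."""
    return sd_baseline * math.sqrt(1 - reliability)

# Hypothetical illustrative values (not the studies' actual figures).
sd, alpha = 8.0, 0.90
s = sem(sd, alpha)

print(round(s, 2))          # about 2.5 points
print(math.ceil(1 * s))     # smallest whole-point change exceeding 1 SEM: 3
print(math.ceil(2.77 * s))  # smallest whole-point change exceeding 2.77 SEM: 8
```

With an SEM of roughly 2.5 points, the 1 SEM and 2.77 SEM criteria translate into whole-scale change scores of 3 and 8 respectively, matching the thresholds discussed above.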
Our conclusions are potentially limited, mainly as a consequence of the data used. We used data from registered users who had replied to our request to use their data, assumed that data were missing at random, and only examined change in those who had undergone an intervention. It is possible, in contravention of copyright, that there are users of WEMWBS who had not registered their use on our database. It is also possible that registered users had not replied to our request because their study had not produced a positive finding. Missing post-intervention WEMWBS scores may not have been missing at random, which could lead to biased estimates of treatment effect; however, our objective was to determine whether WEMWBS could detect changes in individuals' mental well-being had they occurred, and in evaluating responsiveness it is this within-person change that is considered relevant.

We used distribution-based approaches to evaluate responsiveness, to the exclusion of the anchor-based approaches that are increasingly favoured. Anchor-based evaluation requires a GRC, and GRCs have been associated with limitations including recall bias, questionable validity, and possible insensitivity to prospectively evaluated change [27, 29]. Importantly, GRCs have been used in evaluating instruments measuring physical health; whether they have construct validity in denoting improvement in mental well-being is not yet known, and anchor-based approaches may therefore not be appropriate. It is also widely acknowledged that where GRCs are not available, statistical approaches to evaluating responsiveness are valid [21, 22, 27, 28, 37]. As with any population measure, some of the changes observed may represent regression to the mean. The fact that change was observed in population groups with average as well as low baseline scores suggests that not all change can be attributed to this phenomenon.