Test-retest reliability and measurement error of the Danish WHO-5 Well-being Index in outpatients with epilepsy
Health and Quality of Life Outcomes volume 16, Article number: 175 (2018)
The generic questionnaire WHO-5 Well-being Index (WHO-5), which measures the construct of mental well-being, has been widely used in several populations across countries. The questionnaire has demonstrated sufficient psychometric properties; however, the test-retest reliability of the WHO-5 scale has yet to be determined. The aim of this study was to evaluate the test-retest reliability and measurement error of the Danish WHO-5 Well-being Index for outpatients with epilepsy. A further aim was to evaluate whether the method of administration (web, paper, or a mixture of the two modalities) influenced the results.
Epilepsy outpatients aged ≥15 years from three outpatient clinics in Central Denmark Region were included from August 2016 to April 2017. The participants were randomly divided into four test-retest groups: web-web, paper-paper, web-paper, and paper-web. Test-retest reliability was assessed by intraclass correlation coefficients (ICC) and measurement error by calculating minimal detectable change (MDC) on the basis of the standard error of the measurement.
A total of 554 patients completed the questionnaire at two time points. The median duration between test-retest was 22 days. The pooled test-retest reliability estimate was ICC 0.81 (95% CI 0.78; 0.84). The estimated MDC was 23.60 points (95% CI 22.27; 25.10). These estimates showed little variation across administration methods.
WHO-5 showed acceptable test-retest reliability in a Danish epilepsy outpatient population across different methods of administration; however, the relatively large measurement error should be taken into account when evaluating changes in WHO-5 scores over time. Further research should be done to explore these findings.
Several considerations are important when selecting patient-reported outcome (PRO) measures for use in clinical practice. A PRO measure should be relevant to patients and clinicians and be supported by an adequate level of psychometric evidence in the target population. In Central Denmark Region, PRO measures have been used as the basis for follow-up in three epilepsy outpatient clinics since 2012 [2, 3]. Patients complete a web-based or paper-based questionnaire at home instead of having pre-scheduled appointments. Clinical resources can then be directed towards patients with actual need, and clinicians can use patients’ self-reported information to identify otherwise undetected problems. As depression is common in patients with epilepsy, valid and reliable measurement tools are necessary to identify relevant symptoms. For this purpose, the WHO-5 Well-being Index (WHO-5) was selected and has been used since 2012 for outpatients with epilepsy in Central Denmark Region.
WHO-5 is a generic unidimensional questionnaire reflecting the construct of mental well-being during the last 2 weeks. The scale was developed in 1998 and has been widely used. WHO-5 includes five positively worded statements rated on a 6-point ordinal scale ranging from 5 “all of the time” to 0 “at no time”. Raw scores, which range from 0 to 25, are multiplied by 4 to obtain a percentage score ranging from 0 (worst) to 100 (best). A percentage score below 50 indicates poor mental well-being and a risk of depression. The WHO-5 has demonstrated sufficient psychometric properties in terms of construct validity, predictive validity, and internal consistency reliability in several patient populations including epilepsy [6,7,8,9,10,11,12,13,14]; however, the test-retest reliability of the WHO-5 scale has yet to be determined. Furthermore, few studies have explored the impact on consistency of using different methods of administration [15, 16].
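The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration and not code from the study; the function names `who5_percentage_score` and `indicates_poor_wellbeing` are hypothetical.

```python
def who5_percentage_score(item_scores):
    """Convert five WHO-5 item responses (each 0-5) to the 0-100 percentage score."""
    if len(item_scores) != 5:
        raise ValueError("WHO-5 has exactly five items")
    for score in item_scores:
        if not isinstance(score, int) or not 0 <= score <= 5:
            raise ValueError("each item must be an integer from 0 to 5")
    raw_score = sum(item_scores)  # raw score ranges from 0 to 25
    return raw_score * 4          # percentage score ranges from 0 to 100


def indicates_poor_wellbeing(percentage_score):
    """Apply the cut-off stated in the text: a score below 50 flags poor well-being."""
    return percentage_score < 50
```

For example, item responses of 3, 2, 3, 2, and 2 give a raw score of 12 and a percentage score of 48, which falls below the cut-off of 50.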
The study aim was to evaluate the test-retest reliability and measurement error of the Danish WHO-5 Well-being Index for outpatients with epilepsy. A further aim was to evaluate whether the method of administration (web, paper, or a mixture of the two modalities) influenced the results.
Study population and setting
Patients with epilepsy aged ≥15 years from three outpatient clinics in Central Denmark Region were included from August 2016 to April 2017. The patients completed the questionnaire at two time points. First, they completed a questionnaire from the outpatient clinic based on their preferred web or paper administration method (test 1). Subsequently, approximately 2 weeks later, a letter was sent to the patients asking them to complete the same questionnaire again (test 2). The patients were randomly divided into four test-retest groups based on the method of administration at test 1 and test 2: web-web, paper-paper, web-paper, and paper-web. Three reminders were sent in test 1, but no reminders were sent to non-responders in test 2. The WHO-5 Well-being Index was included in the questionnaire in test 1. In addition, the questionnaire included other items, regarding, for example, seizures, symptoms, and general health. The general health construct was measured by using two items from the Danish version of the Short Form 36 Health Survey [17, 18]. A long interval between test administrations increases the risk of change in patients’ health status in a test-retest study, whereas a short interval increases the risk of recall bias. The questionnaire in test 1 was sent to the patients as part of routine outpatient follow-up. Patients’ mental health was assumed to be stable during the time period from test 1 to test 2, since the health status of epilepsy patients is not likely to change over a period of 2 weeks. The patients were not asked in test 2 whether their mental health had changed within the time period.
Descriptive statistics were generated for patient characteristics and for each item to determine the extent of missing values and floor or ceiling effects, which were considered present if more than 15% of patients had a score at the lower or upper end of the scale. Cronbach’s alpha was used to assess internal consistency. The 95% confidence interval (CI) of the Cronbach’s alpha values was estimated by using the bootstrap method (1000 replications). The time interval between test 1 and test 2 was calculated as the difference in number of days between the response dates. Test-retest reliability of the scale was assessed by intraclass correlation coefficients (ICC), agreement model 2.1, with 95% CI; for single items, kappa with squared weights and 95% CI was used. An ICC value of 0.70 is considered acceptable for group-level analysis, but when evaluating individual patients, an ICC of 0.90 is recommended. The kappa values were interpreted as follows: < 0.2 (slight), 0.21–0.4 (fair), 0.41–0.6 (moderate), 0.61–0.8 (substantial), and 0.81–1.0 (almost perfect). Measurement error was assessed with Bland–Altman plots, in which the differences between test 1 and test 2 were plotted against the means of the two measurements, with 95% CI and 95% limits of agreement (LOA). LOA equal the mean change in scores between test 1 and test 2 ± 1.96 × the standard deviation of the changes and give an indication of how much two scores can vary in stable patients. LOA are expressed in the units of the measurement instrument and give a direct indication of the size of the measurement error. The measurement errors reflect the within-individual variation and were estimated as the standard error of the measurement (SEM). SEM equals the square root of the error variance. The interpretation of a SEM estimate is not straightforward; therefore, the SEM was converted into the minimal detectable change (MDC).
MDC95 equals 1.96 × √2 × SEM and indicates the smallest within-person change that can be interpreted as a “real” individual change above the measurement error. Thus, a change in scores within the LOA or smaller than MDC95 can be attributed to measurement error. Patients with missing item values were excluded from the analyses. Two sensitivity analyses were performed to investigate whether the length of the time interval between test 1 and test 2 affected our results. In the first analysis, patients were excluded if the time period between test 1 and test 2 was above 30 days, and in the second analysis all patients with a time interval above 14 days were excluded. STATA 15 software (Stata Corp, College Station) was used for all statistical analyses.
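As a rough illustration of the quantities defined above, the following Python sketch computes a single-measure agreement ICC, the SEM, the MDC95, and the 95% LOA from paired scores. It is not the authors' Stata code. It assumes the common two-way ANOVA formulation of the agreement ICC for single measurements, and it assumes that the error variance underlying the SEM includes both the between-occasion and the residual variance components. All function names are hypothetical.

```python
import math
from statistics import mean, stdev


def anova_mean_squares(scores):
    """Two-way ANOVA mean squares for an n-patients x k-occasions table of scores."""
    n, k = len(scores), len(scores[0])
    grand = mean(v for row in scores for v in row)
    row_means = [mean(row) for row in scores]
    col_means = [mean(row[j] for row in scores) for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in scores for v in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)                  # between patients
    ms_cols = ss_cols / (k - 1)                  # between occasions
    ms_error = ss_error / ((n - 1) * (k - 1))    # residual
    return ms_rows, ms_cols, ms_error


def icc_agreement(scores):
    """Single-measure, absolute-agreement ICC (assumed two-way random effects model)."""
    n, k = len(scores), len(scores[0])
    ms_rows, ms_cols, ms_error = anova_mean_squares(scores)
    denominator = ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    return (ms_rows - ms_error) / denominator


def sem_agreement(scores):
    """SEM as the square root of the occasion plus residual variance (assumption)."""
    n = len(scores)
    _, ms_cols, ms_error = anova_mean_squares(scores)
    var_occasion = max((ms_cols - ms_error) / n, 0.0)
    return math.sqrt(var_occasion + max(ms_error, 0.0))


def mdc95(sem):
    """Minimal detectable change at the 95% level: 1.96 x sqrt(2) x SEM."""
    return 1.96 * math.sqrt(2) * sem


def limits_of_agreement(test1, test2):
    """95% LOA: mean change +/- 1.96 x standard deviation of the changes."""
    changes = [b - a for a, b in zip(test1, test2)]
    centre, spread = mean(changes), stdev(changes)
    return centre - 1.96 * spread, centre + 1.96 * spread
```

Applied to the SEM of 8.51 points reported in the Results, `mdc95(8.51)` gives approximately 23.59, which agrees with the reported MDC95 of 23.60 points up to rounding of the SEM.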
Patient and item characteristics
A total of 554/1640 (34%) patients responded to the questionnaire twice. The median age was 57.3 years (Table 1). The response rates in the four test-retest groups were 48% (web-paper and paper-paper), 34% (web-web), and 9% (paper-web). Non-responders were more likely to be younger and to be paper responders, and they had lower self-reported general health in test 1 (data not shown). The median response time between test and retest was 22 days (interquartile range 10 days). A total of 14 patients had missing values for WHO-5 in test 1 or 2 and were excluded from the analyses. Percentages of missing values ranged from 0.2 to 1.1%, and there was a tendency towards ceiling effects in all items (Table 2). Cronbach’s alpha was 0.89 (95% CI 0.87; 0.90) in test 1 and 0.89 (95% CI 0.87; 0.91) in test 2.
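The floor and ceiling criterion from the statistical analysis (more than 15% of responses at the lowest or highest category) can be checked per item as in the sketch below. The function names and the example responses are hypothetical and are not study data.

```python
from collections import Counter


def floor_ceiling_shares(responses, lowest=0, highest=5):
    """Return the share of responses at the lowest and highest item category."""
    counts = Counter(responses)
    total = len(responses)
    return counts[lowest] / total, counts[highest] / total


def effect_present(share, threshold=0.15):
    """An effect is considered present if more than 15% score at the extreme."""
    return share > threshold


# Hypothetical responses to one WHO-5 item (0 = at no time, 5 = all of the time).
example_item = [5, 5, 4, 3, 5, 2, 4, 4, 1, 3]
floor_share, ceiling_share = floor_ceiling_shares(example_item)
```

In this made-up example, 3 of 10 responses are in the top category, so the ceiling share exceeds the 15% threshold, while no response is in the bottom category.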
Test-retest reliability and measurement error of WHO-5
Kappa values for the five single items were substantial (Table 2). The ICC of the pooled WHO-5 score was 0.81 (95% CI 0.78; 0.84) (Table 3). Differences between test 1 and test 2 plotted against the mean of the two tests with upper and lower LOAs are shown in Fig. 1. The estimated SEM was 8.51 points (95% CI 8.03; 9.05), which resulted in an MDC95 of 23.60 points (95% CI 22.27; 25.10). The analysis was repeated in the four test-retest groups (Table 3 and Fig. 2). Administration methods did not noticeably alter the estimates. The overall results did not change when the analyses were repeated with restricted intervals between test 1 and 2.
Test-retest reliability of the Danish WHO-5 Well-being Index was found to be acceptable in an epilepsy outpatient population, but a relatively large measurement error was observed. The estimated MDC95 was 23.60 points, indicating that changes in the WHO-5 instrument must be substantial to ensure that a ‘real’ change is not due to measurement error. Methods of administration did not markedly influence the results.
This study follows the COSMIN framework [23, 24] and supplements earlier established psychometric properties of the WHO-5. Since we were unable to identify other test-retest studies of the scale, we believe this is the first study to determine the test-retest reliability of the WHO-5. Several studies have explored another aspect of reliability: internal consistency [8,9,10,11,12,13,14]. The Cronbach’s alpha of the WHO-5 in these studies ranged from 0.82 to 0.95, which is consistent with the findings in this study. However, this aspect determines the correlation between items within a scale and not the degree of agreement for repeated measurements over time [22, 24]. The unidimensionality of the WHO-5 scale has been confirmed by using Rasch item response theory analyses in both a younger and elderly population [14, 25].
Test-retest reliability should be assessed in a stable population with an appropriate time interval between measurements. We assumed that the epilepsy outpatient population was stable and allowed a longer time interval. Sensitivity analyses were used to assess potential change in health status; however, excluding participants with longer intervals between test 1 and 2 did not substantially alter the estimates. Still, we cannot rule out that a change in patients’ health status had occurred and that this might have affected the ICC and measurement error estimates of the WHO-5 scale, as we did not collect information on the change in patients’ mental health status from test 1 to test 2.
The WHO-5 scale ranges from 0 to 100, and an MDC of 23.6 points observed in this study may indicate that longitudinal differences of at least 24 points are needed to detect a “true” within-person change. The relatively large measurement error observed in this study may be taken into consideration by researchers planning future clinical trials and clinicians who use the scale on the individual level in clinical practice to evaluate change over time. Furthermore, the tendency towards ceiling effect may produce difficulties in measuring longitudinal changes. Web, paper, or a mixture of the two modalities showed nearly the same test-retest reliability, which is consistent with other test-retest studies [15, 16].
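Because the percentage score is the raw score multiplied by 4, observed changes can only take values in steps of 4 points, which is why an MDC of 23.6 translates into a difference of at least 24 points. The sketch below illustrates this reasoning; the function names are hypothetical, and the default MDC is the pooled estimate reported in this study.

```python
import math


def smallest_attainable_change(mdc, step=4):
    """Smallest multiple of the scoring step that is at least as large as the MDC."""
    return math.ceil(mdc / step) * step


def is_detectable_change(score_1, score_2, mdc=23.60):
    """True if the absolute change is not smaller than the MDC95."""
    return abs(score_2 - score_1) >= mdc
```

With the pooled MDC95 of 23.60, `smallest_attainable_change(23.60)` returns 24. A patient moving from 40 to 60 points (a change of 20) would therefore stay within measurement error, whereas a move from 40 to 64 points would not.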
One important limitation of this study is the possibility of selection bias. A very low response rate was observed, especially in the paper-web group (9%). This may be due to the pragmatic design, which allowed patients to choose the administration method for their response to test 1. In the Danish general population, a mean WHO-5 score of 70 points has been reported [26, 27]. This is comparable with the result in this study; however, the responders tended to be a healthier group of patients compared to non-responders in test 2, who had lower self-reported general health and mental well-being in test 1. The reliability estimates indicate how well patients can be distinguished from each other despite the presence of measurement error; for example, a lower ICC value tends to occur in a homogeneous study sample. Thus, in this study, the ICC estimates may have been underestimated due to a homogeneous and healthy study population, whereas the measurement error estimates were probably less affected.
The WHO-5 Well-being Index showed acceptable test-retest reliability in a Danish epilepsy outpatient population, but the measurement error of the scale was relatively large. Different methods of administration did not influence the results. Further studies are required to provide insight into the test-retest reliability and measurement error in different language versions of the WHO-5 Well-being Index and in different patient populations.
Abbreviations
ICC: Intraclass correlation coefficients
IQR: Interquartile range
LOA: Limits of agreement
MDC: Minimal detectable change
SEM: Standard error of the measurement
WHO-5: WHO-5 Well-being Index
Snyder CF, Aaronson NK, Choucair AK, Elliott TE, Greenhalgh J, Halyard MY, Hess R, Miller DM, Reeve BB, Santana M. Implementing patient-reported outcomes assessment in clinical practice: a review of the options and considerations. Qual Life Res. 2012;21(8):1305–14.
Schougaard LM, Larsen LP, Jessen A, Sidenius P, Dorflinger L, de Thurah A, Hjollund NH. AmbuFlex: tele-patient-reported outcomes (telePRO) as the basis for follow-up in chronic and malignant diseases. Qual Life Res. 2016;25(3):525–34.
Hjollund NH, Larsen LP, Biering K, Johnsen SP, Riiskjaer E, Schougaard LM. Use of patient-reported outcome (PRO) measures at group and patient levels: experiences from the generic integrated PRO system WestChronic. Interact J Med Res. 2014;3(1):e5.
Fiest KM, Dykeman J, Patten SB, Wiebe S, Kaplan GG, Maxwell CJ, Bulloch AG, Jette N. Depression in epilepsy: a systematic review and meta-analysis. Neurology. 2013;80(6):590–9.
WHO Regional Office for Europe: Wellbeing measures in primary health care / the DepCare project; in World Health Organization. World Health Organization. 1998. http://www.euro.who.int/__data/assets/pdf_file/0016/130750/E60246.pdf. Accessed 18 Dec 2017.
Topp CW, Ostergaard SD, Sondergaard S, Bech P. The WHO-5 well-being index: a systematic review of the literature. Psychother Psychosom. 2015;84(3):167–76.
Hansen CP, Amiri M. Combined detection of depression and anxiety in epilepsy patients using the neurological disorders depression inventory for epilepsy and the World Health Organization well-being index. Seizure. 2015;33:41–5.
Halliday JA, Hendrieckx C, Busija L, Browne JL, Nefs G, Pouwer F, Speight J. Validation of the WHO-5 as a first-step screening instrument for depression in adults with diabetes: results from diabetes MILES - Australia. Diabetes Res Clin Pract. 2017;132:27–35.
Newnham EA, Hooke GR, Page AC. Monitoring treatment response and outcomes using the World Health Organization's wellbeing index in psychiatric care. J Affect Disord. 2010;122(1–2):133–8.
Guethmundsdottir HB, Olason DP, Guethmundsdottir DG, Sigurethsson JF. A psychometric evaluation of the Icelandic version of the WHO-5. Scand J Psychol. 2014;55(6):567–72.
Krieger T, Zimmermann J, Huffziger S, Ubl B, Diener C, Kuehner C, Grosse Holtforth M. Measuring depression with a well-being index: further evidence for the validity of the WHO well-being index (WHO-5) as a measure of the severity of depression. J Affect Disord. 2014;156:240–4.
Hajos TR, Pouwer F, Skovlund SE, Den Oudsten BL, Geelhoed-Duijvestijn PH, Tack CJ, Snoek FJ. Psychometric and screening properties of the WHO-5 well-being index in adult outpatients with type 1 or type 2 diabetes mellitus. Diabet Med. 2013;30(2):e63–9.
de Wit M, Pouwer F, Gemke RJ, Delemarre-van de Waal HA, Snoek FJ. Validation of the WHO-5 well-being index in adolescents with type 1 diabetes. Diabetes Care. 2007;30(8):2003–6.
Lucas-Carrasco R. Reliability and validity of the Spanish version of the World Health Organization-five well-being index in elderly. Psychiatry Clin Neurosci. 2012;66(6):508–13.
Egger MJ, Lukacz ES, Newhouse M, Wang J, Nygaard I. Web versus paper-based completion of the epidemiology of prolapse and incontinence questionnaire. Female Pelvic Med Reconstr Surg. 2013;19(1):17–22.
Sjostrom M, Stenlund H, Johansson S, Umefjord G, Samuelsson E. Stress urinary incontinence and quality of life: a reliability study of a condition-specific instrument in paper and web-based versions. Neurourol Urodyn. 2012;31(8):1242–6.
Bjorner JB, Thunedborg K, Kristensen TS, Modvig J, Bech P. The Danish SF-36 health survey: translation and preliminary validity studies. J Clin Epidemiol. 1998;51(11):991–9.
Bjorner JB, Damsgaard MT, Watt T, Groenvold M. Tests of data quality, scaling assumptions, and reliability of the Danish SF-36. J Clin Epidemiol. 1998;51(11):1001–11.
de Vet HC, Terwee CB, Mokkink LB, Knol DL. Measurement in Medicine: a practical guide. 1st ed. United Kingdom: Cambridge University Press; 2011.
Koo TK, Li MY. A guideline of selecting and reporting Intraclass correlation coefficients for reliability research. J Chiropr Med. 2016;15(2):155–63.
Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33(1):159–74.
Terwee CB, Bot SD, de Boer MR, van der Windt DA, Knol DL, Dekker J, Bouter LM, de Vet HC. Quality criteria were proposed for measurement properties of health status questionnaires. J Clin Epidemiol. 2007;60(1):34–42.
Mokkink LB, de Vet HCW, Prinsen CAC, Patrick DL, Alonso J, Bouter LM, Terwee CB. COSMIN risk of Bias checklist for systematic reviews of patient-reported outcome measures. Qual Life Res. 2018;27(5):1171–9.
Mokkink LB, Terwee CB, Patrick DL, Alonso J, Stratford PW, Knol DL, Bouter LM, de Vet HC. The COSMIN study reached international consensus on taxonomy, terminology, and definitions of measurement properties for health-related patient-reported outcomes. J Clin Epidemiol. 2010;63(7):737–45.
Blom EH, Bech P, Hogberg G, Larsson JO, Serlachius E. Screening for depressed mood in an adolescent psychiatric context by brief self-assessment scales: testing psychometric validity of WHO-5 and BDI-6 indices by latent trait analyses. Health Qual Life Outcomes. 2012;10:149.
Bech P, Olsen LR, Kjoller M, Rasmussen NK. Measuring well-being rather than the absence of distress symptoms: a comparison of the SF-36 mental health subscale and the WHO-five well-being scale. Int J Methods Psychiatr Res. 2003;12(2):85–91.
Ellervik C, Kvetny J, Christensen KS, Vestergaard M, Bech P. Prevalence of depression, quality of life and antidepressant treatment in the Danish general suburban population study. Nord J Psychiatry. 2014;68(7):507–12.
This study is funded by Aarhus University, the Health Research Fund of Central Denmark Region, and the Danish foundation TrygFonden. The funding sources had no role in the design of the study, data collection, analysis, interpretation of data, and writing the manuscript.
Availability of data and materials
An anonymous version of the datasets used and analysed during the current study is available from the corresponding author on reasonable request.
Ethics approval and consent to participate
The study was approved by the Danish Data Protection Agency (j.no: 1–16–02-691-14). All procedures performed were in accordance with the ethical standards of the national research committee and with the 1964 Declaration of Helsinki. According to Danish law, approval by the ethics committee and written informed consent were not required. The eligible patients were provided with information about the study and its purpose, including that participation was voluntary.
Consent for publication
Competing interests
The authors declare that they have no competing interests.
Cite this article
Schougaard, L.M.V., de Thurah, A., Bech, P. et al. Test-retest reliability and measurement error of the Danish WHO-5 Well-being Index in outpatients with epilepsy. Health Qual Life Outcomes 16, 175 (2018). https://doi.org/10.1186/s12955-018-1001-0