
Enhancing rigour in the validation of patient reported outcome measures (PROMs): bridging linguistic and psychometric testing



A strong consensus exists for a systematic approach to linguistic validation of patient reported outcome measures (PROMs) and discrete methods for assessing their psychometric properties. Despite the need for robust evidence of the appropriateness of measures, transition from linguistic to psychometric validation is poorly documented or evidenced. This paper demonstrates the importance of linking linguistic and psychometric testing through a purposeful stage which bridges the gap between translation and large-scale validation.


Evidence is drawn from a study to develop a Welsh language version of the Beck Depression Inventory-II (BDI-II) and investigate its psychometric properties. The BDI-II was translated into Welsh and then administered to Welsh-speaking university students (n = 115) and patients with depression (n = 37) concurrently with the English BDI-II, and alongside other established depression and quality of life measures. A Welsh version of the BDI-II was produced that, on administration, showed conceptual equivalence with the original measure; high internal consistency reliability (Cronbach’s alpha = 0.90 in students; 0.96 in patients); item homogeneity; adequate correlation with the English BDI-II (r = 0.94 in students; 0.96 in patients) and additional measures; and a two-factor structure with one overriding dimension. Nevertheless, in the student sample, the Welsh version showed a significantly lower overall mean than the English (p = 0.002) and significant differences in six mean item scores. This prompted a review and refinement of the translated measure.


Exploring potential sources of bias in translated measures represents a critical step in the translation-validation process, which until now has been largely underutilised. This paper offers important findings that inform advanced methods of cross-cultural validation of PROMs.


Patient reported outcome measures (PROMs) are used increasingly in clinical practice and research, where they must be fit for purpose and sensitive to patients’ cultural and linguistic needs[1]. PROMs are therefore required in a range of languages, and maintaining their reliability and validity is paramount[2]. Whilst a rigorous multi-step approach to translation is endorsed[3, 4], there are no clear recommendations about the early assessment of reliability and validity of translated measures before large-scale testing. We demonstrate the value of undertaking early checks to refine measures. Our case in point is the translation and validation of the Beck Depression Inventory II (BDI-II)[5] for the Welsh language. The measure is widely used, both clinically and in research, for measuring the severity of depression and response to psychological and medical interventions, and it is one of the PROMs recommended by the Welsh and UK Governments for screening for depression in high-risk populations in primary care.

The BDI has been translated into numerous languages and is psychometrically robust for use in countries across the world[6–8]. There is, however, no Welsh language version currently available. Here, we report the linguistic and psychometric validation of the Welsh BDI-II and highlight the value of embedding early stage validation within the instrument development phase.


Linguistic validation

Under licence of the publisher and adopting the International Society for Pharmacoeconomics and Outcomes Research (ISPOR) guidelines[3], two independent translators produced a Welsh BDI-II. Reconciliation of these translations into a merged document was undertaken through consensus. This version was then translated back into English by a third independent translator for quality assurance. Comparison between the back translation and original measure highlighted any discrepancies which were revised through discussion and consensus. Eight Welsh-speaking lay respondents (Table 1) were invited to complete the Welsh BDI-II and check their comprehension and interpretation of the draft measure. Remaining discrepancies were identified by comparing these interpretations with the original measure. A final Welsh translation was agreed and subjected to an early exploratory stage of psychometric testing. In line with previous validation of the BDI-II[5], two test groups were identified: (i) a student sample, and (ii) a clinical sample of patients with depression (Table 1).

Table 1 Characteristics of the study samples

Psychometric testing

In keeping with theoretical propositions[1], the Welsh BDI-II was expected to have (a) a two-factor structure similar to the original model presented, and (b) adequate correlations with other accepted depression scales, and negative correlations with quality of life scales. These hypotheses were tested by (a) performing a confirmatory factor analysis on the student sample data and (b) examining Pearson correlation coefficients between the Welsh BDI-II and other pre-specified measures, including the English BDI-II, for both the clinical and student samples. Further exploratory item level analysis was undertaken to identify potential sources of bias.
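Hypothesis (b) can be illustrated with a minimal sketch, assuming total scores are held as plain Python lists. The function name `pearson_r` and all scores below are invented for illustration; none of them are study data. The point is only the expected sign pattern: a positive correlation with another depression scale and a negative one with a quality of life scale.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dev_x = [v - mean_x for v in x]
    dev_y = [v - mean_y for v in y]
    cross = sum(a * b for a, b in zip(dev_x, dev_y))   # sum of cross-products
    ss_x = sum(a * a for a in dev_x)                   # sum of squares, x
    ss_y = sum(b * b for b in dev_y)                   # sum of squares, y
    return cross / math.sqrt(ss_x * ss_y)

# Made-up totals for six hypothetical respondents (NOT study data).
welsh_bdi = [2, 5, 9, 14, 21, 30]
hads_depression = [1, 3, 4, 8, 10, 15]
sf12_mental = [58, 52, 49, 41, 35, 27]

r_convergent = pearson_r(welsh_bdi, hads_depression)    # expected to be positive
r_discriminant = pearson_r(welsh_bdi, sf12_mental)      # expected to be negative
print(round(r_convergent, 2), round(r_discriminant, 2))
```

In practice the same calculation would be run separately for the student and clinical samples, once per comparison measure.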

Student sample

Out of 144 bilingual (Welsh/English) university students approached, 115 (80%) consented to participate in the study. Data collection was undertaken during 2009 in a classroom setting, outside teaching hours, where participants were asked to complete the following measures in the order listed:

  (a) BDI-II (English) [5]

  (b) European Quality of Life-5 Dimensions (EQ-5D) (Welsh) [9]

  (c) Hospital Anxiety and Depression Scale (HADS) (English) [10]

  (d) Short-Form 12-item Health Survey version 2 (SF-12 v2) (English) [11]

  (e) BDI-II (Welsh) [5]

Clinical sample

A sample of Welsh-speaking patients with depression was recruited to participate in this validation study between 2009 and 2010 through the Folate Augmentation of Treatment - Evaluation for Depression (FolATED) trial[12]. Thirty-seven of 81 (46%) bilingual speakers consented to participate. Consistent with the trial protocol, the following English measures were completed at randomisation (followed by other trial measures):

  (a) BDI-II [5]

  (b) Researcher-rated Montgomery-Asberg Depression Rating Scale [13]

  (c) SF-12 v2 [11]

  (d) EQ-5D [9]

For the validation study, participants were also invited to complete the Welsh BDI-II.

Bangor University School of Healthcare Sciences Ethics Committee approved the student study whilst the Multi-centre Research Ethics Committee for Wales approved the patient study through the FolATED trial processes[12]. All data were anonymised and analysed using PASW[14] and AMOS[15] for Windows (version 18.0). All statistical tests were two-sided, and P-values of ≤0.05 were considered statistically significant.


The Welsh BDI-II showed a high level of internal consistency for both student (α = 0.90) and clinical (α = 0.96) samples, similar to that found for the English BDI-II (α = 0.87 student sample; α = 0.92 clinical sample) and that reported by Beck and colleagues (α = 0.93)[5]. The Welsh measure demonstrated a high degree of concurrent and discriminant validity, with a positive correlation with HADS (student sample: depression component r = 0.71; anxiety component r = 0.66) and negative correlations with the mental component of SF-12 v2 (student sample: r = −0.74; clinical sample: r = −0.71) and EQ-5D (student sample: r = −0.66; clinical sample: r = −0.55). Factor analysis revealed a two-factor structure emerging from both samples for each language version, with one overriding depression-related dimension. However, confirmatory factor analysis of the student data revealed that the three fit indices did not meet the criteria for good fit (GFI = 0.54, AGFI = 0.47, RMR = 0.06).
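The internal consistency figures above are Cronbach's alpha values. As a minimal sketch of how such a coefficient is obtained, assuming item responses are stored as one list per respondent, the standard formula α = k/(k − 1) × (1 − Σ item variances / variance of totals) can be written as follows. The function name and the tiny four-item data set are hypothetical and are not drawn from the study.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(responses[0])
    if k < 2 or any(len(row) != k for row in responses):
        raise ValueError("every respondent needs the same number (>= 2) of items")
    # Sample variance of each item across respondents.
    item_variances = [variance([row[i] for row in responses]) for i in range(k)]
    # Sample variance of the total score across respondents.
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)

# Made-up responses: five hypothetical respondents, four items scored 0 to 3.
toy_responses = [
    [0, 0, 1, 0],
    [1, 1, 1, 0],
    [1, 2, 1, 1],
    [2, 2, 3, 2],
    [3, 3, 2, 3],
]
alpha = cronbach_alpha(toy_responses)
print(round(alpha, 2))
```

For the BDI-II itself, each row would hold a respondent's 21 item scores, and alpha would be computed separately for each sample and language version.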

The student Welsh BDI-II depression score was highly correlated to the English (r = 0.94), but the overall mean was significantly lower (Welsh M = 5.09, SD = 5.85; English M = 5.70, SD = 5.5), t(110) = 3.217, p = 0.002. The Bland Altman graph[16] (Figure 1) revealed a small but significant bias towards the English BDI-II, showing a slightly higher score than its Welsh comparator; the mean difference (MD) in scores being just over half a point (MD = 0.61, 95% limits of agreement 0.23 to 1.00). The depression score on the Welsh BDI-II was also highly correlated to the English (r = 0.96) within the clinical sample but no statistically significant differences were noted between the mean scores.

Figure 1 Bland Altman plot for Welsh and English BDI-II (student sample).
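The paired comparison reported above can be sketched in a few lines. This is an illustration only, assuming paired total scores in two lists; the function name `paired_comparison`, the critical value 1.98 for 110 degrees of freedom and the use of 1.96 for the limits of agreement are assumptions made here, not details taken from the study. The second half works backwards from the published summary statistics (means, t and degrees of freedom) rather than from raw data.

```python
import math

def paired_comparison(welsh, english, t_crit):
    """Summarise paired total scores as English minus Welsh.

    t_crit is the two-sided critical t value for len(welsh) - 1 degrees of
    freedom; it is passed in because the standard library has no t distribution.
    Assumes the differences are not all identical (non-zero SD).
    """
    diffs = [e - w for w, e in zip(welsh, english)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    sd_diff = math.sqrt(sum((d - mean_diff) ** 2 for d in diffs) / (n - 1))
    se = sd_diff / math.sqrt(n)
    return {
        "mean_diff": mean_diff,
        "t": mean_diff / se,
        "ci95": (mean_diff - t_crit * se, mean_diff + t_crit * se),
        "limits_of_agreement": (mean_diff - 1.96 * sd_diff,
                                mean_diff + 1.96 * sd_diff),
    }

# Back-of-envelope check using only the summary statistics quoted in the text.
reported_md = 5.70 - 5.09            # English mean minus Welsh mean
reported_t, df = 3.217, 110
t_crit_110 = 1.98                    # assumed approximate critical value, df = 110
se = reported_md / reported_t        # implied standard error of the mean difference
ci_low = reported_md - t_crit_110 * se
ci_high = reported_md + t_crit_110 * se
sd_diff = se * math.sqrt(df + 1)     # implied SD of the paired differences
loa = (reported_md - 1.96 * sd_diff, reported_md + 1.96 * sd_diff)
print(round(ci_low, 2), round(ci_high, 2), round(loa[0], 1), round(loa[1], 1))
```

Under these assumptions the implied 95% confidence interval for the mean difference is close to the interval quoted in the text, whereas limits of agreement computed as the mean difference ± 1.96 SD would be considerably wider (roughly −3.3 to 4.5). How the published interval was actually derived has not been verified here.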

Given the evidence of a seemingly biased measure and a poorly fitting confirmatory factor analysis for the student sample, further item-level exploration was performed. No differences were found within the clinical sample between mean scores of the Welsh and English BDI-II for the individual items, and there were no indications of asymmetry. However, within the student sample, six items showed statistically significant differences on a paired t-test comparing mean scores between the Welsh and English BDI-II. Three of these items also indicated significant asymmetry (Table 2). Close inspection of the three items demonstrating bias revealed potential interpretations that may have led to an underscoring of the item in the Welsh BDI-II (Table 3).

Table 2 Item level analysis of the BDI-II (student sample)
Table 3 Summary of items and potential interpretations which caused bias in the student sample
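The text does not state which test was used to assess asymmetry at item level. One plausible choice, assumed here purely for illustration, is an exact sign test on the paired item responses: it asks whether respondents scored an item lower in Welsh than in English more often than the reverse. The function name and the responses below are hypothetical.

```python
from math import comb

def sign_test(welsh_item, english_item):
    """Exact two-sided sign test on paired item scores (ties are dropped).

    Returns (count Welsh < English, count Welsh > English, p-value).
    """
    lower = sum(w < e for w, e in zip(welsh_item, english_item))
    higher = sum(w > e for w, e in zip(welsh_item, english_item))
    n = lower + higher
    if n == 0:
        return lower, higher, 1.0    # all pairs tied: no evidence of asymmetry
    k = min(lower, higher)
    # Probability of k or fewer discordant pairs in the rarer direction
    # when both directions are equally likely.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return lower, higher, min(1.0, 2 * tail)

# Made-up responses to one item from eight hypothetical bilingual respondents.
welsh_item = [0, 0, 1, 0, 1, 0, 0, 1]
english_item = [1, 1, 1, 1, 2, 1, 0, 2]
n_lower, n_higher, p_value = sign_test(welsh_item, english_item)
print(n_lower, n_higher, p_value)
```

Applied to each of the 21 items in turn, alongside a paired t-test on the item means, a check of this kind would flag items that are systematically scored lower in one language version.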


We have demonstrated how a thorough and rigorous approach to early validation can inform the refinement of translated outcome measures. Here, we examine the juxtaposition of these two processes (often reported independently in the literature); and discuss the wider implications for a revision of the guidelines and methods of cross-cultural validation of PROMs.

Our results support previous findings on the psychometric properties of the BDI-II, particularly in relation to the two-factor structure[5, 7, 8, 17]; and concurrent validity with other depression and quality of life measures[18–20]. This indicates that the translation and early validation process was relatively successful. Despite the high correlation between the two language versions, the observed poor fit (indicating poor construct validity) and bias led us to explore potential sources of bias and items of concern. This prompted further scrutiny of the translated items to rule out any inaccuracies or misinterpretations, thus providing the opportunity to amend any problematic items. Whilst this step is acknowledged in the literature[4, 21], it attracts little attention within current translation and validation guidelines[3, 22].

In light of our evidence, it is possible that ambiguities in translation at the lower end of the scale biased responses to some items. This interpretation is strengthened by the fact that we detected no other subtle dissonances when the remaining items were similarly scrutinised. Moreover, the student scores clustered at the lower end of the scale, whereas the majority of the clinical sample reported symptoms of moderate to severe depression, which may explain why the bias was not observed in the clinical sample. Thus, whilst we acknowledge that our samples were small, our results are suggestive of a potential bias at the lower end of the scale. A stronger study design involving a qualitative exploration of the students’ interpretations of the discrepant items might have corroborated this finding.

Whilst this finding led to the refinement of the Welsh BDI-II, it also has several wider implications for instrument translators and developers. Firstly, it draws attention to the need for careful scrutiny in the translation of everyday vocabulary. Secondly, it demonstrates the importance of ensuring that the translated version of a measure is scaled equivalently to the original version. Thirdly, and more importantly, it confirms the value of investigating item discrepancies through early exploratory psychometric evaluation of translated measures prior to large-scale psychometric testing.


On the basis of our findings, we propose an additional final step (early psychometric testing) to the ISPOR guidelines[3]. This offers a novel, cost-effective approach towards bridging the linguistic and psychometric testing of PROMs that plugs a gap in the current literature and brings the rigour associated with clinical research development to the translation and validation platform.

Authors’ information

GR is director of LLAIS, the Language Awareness Infrastructure Support Service of the National Institute for Social Care and Health Research (NISCHR) Clinical Research Centre in Wales, UK. LLAIS is committed to developing and validating Welsh language versions of PROMs for the bilingual context of Wales, and to establishing the evidence base for best practice in the translation and validation of outcome measures.



AGFI: Adjusted Goodness of Fit Index

AMOS: Analysis of Moment Structures for Windows

BDI-II: Beck Depression Inventory-II

EQ-5D: European Quality of Life-5 Dimensions

FolATED: Folate Augmentation of Treatment - Evaluation for Depression

GFI: Goodness of Fit Index

HADS: Hospital Anxiety and Depression Scale

ISPOR: International Society for Pharmacoeconomics and Outcomes Research

PROMs: Patient reported outcome measures

PASW: Predictive Analytic Software

RMR: Root Mean Square Residual

SF-12 v2: Short-Form 12-item Health Survey version 2


  1. Streiner D, Norman G: Health Measurement Scales: a practical guide to their development and use. 4th edition. Oxford: Oxford University Press; 2008.

  2. Frost MH, Reeve BB, Liepa AM, Stauffer JW, Hays RD: What is sufficient evidence for the reliability and validity of patient reported outcome measures? Value Health 2007, 10(Suppl 2):S94-S105.

  3. Wild D, Grove A, Martin M, Eremenco S, McElroy S, Verjee-Lorenz A, Erikson P: Principles of good practice for the translation and cultural adaptation process for patient-reported outcomes (PRO) measures: report of the ISPOR task force for translation and cultural adaptation. Value Health 2005, 8: 94–104. 10.1111/j.1524-4733.2005.04054.x

  4. Acquadro C, Conway K, Hareendran A, Aaronson N: Literature review of methods to translate health-related quality of life questionnaires for use in multinational clinical trials. Value Health 2008, 11: 509–521.

  5. Beck AT, Steer RA, Brown GK: Manual for the Beck Depression Inventory-II. San Antonio TX: Psychological Corporation; 1996.

  6. Bonicatto S, Dew AM, Soria JJ: Analysis of the psychometric properties of the Spanish version of the Beck Depression Inventory in Argentina. Psychiatry Res 1998, 79: 227–285. 10.1016/S0165-1781(98)00042-0

  7. Suarez-Mendoza AA, Cardiel MH, Caballero-Uribe C, Ortega-Soto HA, Márquez-Marin M: Measurement of depression in Mexican patients with rheumatoid arthritis: validity of the Beck Depression Inventory. Arthr Care Res 1997, 10: 194–199. 10.1002/art.1790100307

  8. Kojima M, Furukawa TA, Takahashi H, Kawai M, Nagaya T, Tokudome S: Cross cultural validation of the Beck Depression Inventory-II in Japan. Psychiatry Res 2002, 110: 291–299. 10.1016/S0165-1781(02)00106-3

  9. EuroQol Group, EQ-5D™: The EuroQol: a new facility for the measurement of health-related quality of life. Health Policy 1990, 6: 199–208.

  10. Zigmond AS, Snaith RP: The hospital anxiety and depression scale. Acta Psychiat Scand 1983, 67: 361–370.

  11. Ware JE, Kosinski M, Keller SK: SF-36® Physical and Mental Health Summary Scales: A User's Manual. Boston: The Health Institute; 1994.

  12. Roberts SH, Bedson E, Hughes D, Lloyd K, Menkes DB, Moat S, Pirmohamed M, Slegg G, Thome J, Tranter R, Whitaker R, Wilkinson C, Russell I: Folate augmentation of treatment - evaluation for depression (FolATED): a protocol of a randomised controlled trial. BMC Psychiatry 2007, 7: 65. 10.1186/1471-244X-7-65

  13. Montgomery SA, Asberg M: A new depression scale designed to be sensitive to change. Br J Psychiat 1979, 135: 382–389.

  14. IBM: PASW Statistics 18. Chicago, IL: SPSS, Inc; 2010.

  15. IBM: AMOS 18. Chicago IL: Smallwaters Corporation; 2010.

  16. Bland JM, Altman DG: Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986, 1(8476):307–310.

  17. Wang YP, Andrade LH, Gorenstein C: Validation of the Beck Depression Inventory for a Portuguese-speaking Chinese community in Brazil. Braz J Med Biol Res 2005, 38: 399–408. 10.1590/S0100-879X2005000300011

  18. Bjelland I, Dahl AA, Haug TT, Neckelmann D: The validity of the Hospital Anxiety and Depression Scale: an updated literature review. J Psychosom Res 2002, 52: 69–77. 10.1016/S0022-3999(01)00296-3

  19. Arnarson PÖ, Ólason DP, Smári J, Sigurdsson JF: The Beck Depression Inventory Second Edition (BDI-II): psychometric properties in Icelandic student and patient populations. Nordic J Psychiat 2008, 62: 360–365. 10.1080/08039480801962681

  20. Kapci EG, Uslu R, Turkcapar H, Karaoglan A: Beck Depression Inventory II: evaluation of the psychometric properties and cut-off points in a Turkish adult population. Depress Anxiety 2008, 25: E104-E110. 10.1002/da.20371

  21. McKenna SP, Doward LC: The translation and cultural adaptation of patient-reported outcome measures. Value Health 2005, 8: 89–91. 10.1111/j.1524-4733.2005.08203.x

  22. Mapi Research Institute: Linguistic Validation of a Patient Reported Outcome Measure. Lyon: Mapi Research Institute; 2005.


We are grateful to the translators, Dr Sylvia Prys, Dawi Griffiths and Gruffydd Prys; to the FolATED team; and to the students and service users for their valuable contribution to this study. The validation study was funded by NISCHR; and the FolATED trial is funded by the National Institute for Health Research Health Technology Assessment programme.

Author information



Corresponding author

Correspondence to Gwerfyl Roberts.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

GR conceptualised and designed the study, acquired and interpreted the data and drafted the manuscript. SR and RT conceptualised and designed the study, acquired and interpreted the data and revised the manuscript. RW supervised the data analysis, interpreted the data and revised the manuscript. EB acquired and interpreted the data and revised the manuscript. ST, DP and HO acquired the data and YS analysed the data. All authors read and approved the final manuscript.


Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article

Roberts, G., Roberts, S., Tranter, R. et al. Enhancing rigour in the validation of patient reported outcome measures (PROMs): bridging linguistic and psychometric testing. Health Qual Life Outcomes 10, 64 (2012).
