Enhancing rigour in the validation of patient reported outcome measures (PROMs): bridging linguistic and psychometric testing

Background: A strong consensus exists for a systematic approach to linguistic validation of patient reported outcome measures (PROMs) and discrete methods for assessing their psychometric properties. Despite the need for robust evidence of the appropriateness of measures, the transition from linguistic to psychometric validation is poorly documented or evidenced. This paper demonstrates the importance of linking linguistic and psychometric testing through a purposeful stage which bridges the gap between translation and large-scale validation.

Findings: Evidence is drawn from a study to develop a Welsh language version of the Beck Depression Inventory-II (BDI-II) and investigate its psychometric properties. The BDI-II was translated into Welsh, then administered to Welsh-speaking university students (n = 115) and patients with depression (n = 37) concurrently with the English BDI-II, and alongside other established depression and quality of life measures. A Welsh version of the BDI-II was produced that, on administration, showed conceptual equivalence with the original measure; high internal consistency reliability (Cronbach’s alpha = 0.90 in students; 0.96 in patients); item homogeneity; adequate correlation with the English BDI-II (r = 0.94 in students; 0.96 in patients) and additional measures; and a two-factor structure with one overriding dimension. Nevertheless, in the student sample, the Welsh version showed a significantly lower overall mean than the English (p = 0.002), and significant differences in six mean item scores. This prompted a review and refinement of the translated measure.

Conclusions: Exploring potential sources of bias in translated measures represents a critical step in the translation-validation process, which until now has been largely underutilised. This paper offers important findings that inform advanced methods of cross-cultural validation of PROMs.


Background
Patient reported outcome measures (PROMs) are used increasingly in clinical practice and research, where they must be fit for purpose and sensitive to patients' cultural and linguistic needs [1]. PROMs are therefore required in a range of different languages, and maintaining the reliability and validity of translated measures is paramount [2]. Whilst a rigorous multi-step approach to translation is endorsed [3,4], there are no clear recommendations about the early assessment of reliability and validity of translated measures before large-scale testing. We demonstrate the value of undertaking early checks to refine measures. Our case in point is the translation and validation of the Beck Depression Inventory II (BDI-II) [5] for the Welsh language. The measure is widely used both clinically and in research for measuring the severity of depression and response to psychological and medical interventions, and it is one of the PROMs recommended by the Welsh and UK Governments for screening depression in high-risk populations in primary care.
The BDI has been translated into numerous languages and is psychometrically robust for use in countries across the world [6][7][8]. There is, however, no Welsh language version currently available. Here, we report the linguistic and psychometric validation of the Welsh BDI-II and highlight the value of embedding early stage validation within the instrument development phase.

Linguistic validation
Under licence of the publisher and adopting the International Society for Pharmacoeconomics and Outcomes Research (ISPOR) guidelines [3], two independent translators produced a Welsh BDI-II. Reconciliation of these translations into a merged document was undertaken through consensus. This version was then translated back into English by a third independent translator for quality assurance. Comparison between the back translation and original measure highlighted any discrepancies which were revised through discussion and consensus. Eight Welsh-speaking lay respondents (Table 1) were invited to complete the Welsh BDI-II and check their comprehension and interpretation of the draft measure. Remaining discrepancies were identified by comparing these interpretations with the original measure. A final Welsh translation was agreed and subjected to an early exploratory stage of psychometric testing. In line with previous validation of the BDI-II [5], two test groups were identified: (i) a student sample, and (ii) a clinical sample of patients with depression (Table 1).

Psychometric testing
In keeping with theoretical propositions [1], the Welsh BDI-II was expected to have (a) a two-factor structure similar to the original model presented, and (b) adequate correlations with other accepted depression scales, and negative correlations with quality of life scales. These hypotheses were tested by (a) performing a confirmatory factor analysis on the student sample data and (b) examining Pearson correlation coefficients between the Welsh BDI-II and other pre-specified measures, including the English BDI-II, for both the clinical and student samples. Further exploratory item level analysis was undertaken to identify potential sources of bias.
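Hypothesis (b) above rests on the Pearson product-moment correlation. The following is a minimal sketch of that calculation in plain Python, not the authors' PASW procedure; the function name `pearson_r` and all scores are hypothetical and purely illustrative.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length (n >= 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares about the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical total scores for six respondents (illustrative, not study data).
welsh_bdi = [2, 5, 9, 14, 20, 31]
english_bdi = [3, 5, 10, 15, 19, 33]
sf12_mental = [58, 55, 47, 41, 36, 25]

# Two depression measures should correlate positively;
# a depression measure and a quality of life measure should correlate negatively.
r_depression = pearson_r(welsh_bdi, english_bdi)
r_quality_of_life = pearson_r(welsh_bdi, sf12_mental)
```

In this illustration a strongly positive `r_depression` and a strongly negative `r_quality_of_life` would be read as consistent with the expected pattern of correlations.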

Student sample
Out of 144 bilingual (Welsh/English) university students approached, 115 (80%) consented to participate in the study. Data collection was undertaken during 2009 in a classroom setting, outside teaching hours, where participants were asked to complete the following measures in the order listed:
(a) BDI-II (English) [5]
(b) European Quality of Life-5 Dimensions (EQ-5D) (Welsh) [9]
(c) Hospital Anxiety and Depression Scale (HADS) (English) [10]
(d) Short-Form 12-item Health Survey version 2 (SF-12 v2) (English) [11]
(e) BDI-II (Welsh) [5]

Clinical sample
A sample of Welsh-speaking patients with depression was recruited to participate in this validation study between 2009 and 2010 through the Folate Augmentation of Treatment - Evaluation for Depression (FolATED) trial [12]. Thirty-seven of 81 (46%) bilingual speakers consented to participate. Consistent with the trial protocol, the following English measures were completed at randomisation (followed by other trial measures):
(a) BDI-II [5]
(b) Researcher-rated Montgomery-Asberg Depression Rating Scale [13]
(c) SF-12 v2 [11]
For the validation study, participants were also invited to complete the Welsh BDI-II.
Bangor University School of Healthcare Sciences Ethics Committee approved the student study whilst the Multicentre Research Ethics Committee for Wales approved the patient study through the FolATED trial processes [12]. All data were anonymised and analysed using PASW [14] and AMOS [15] for Windows (version 18.0). All statistical tests were two-sided, and P-values of ≤0.05 were considered statistically significant.

Results
The Welsh BDI-II showed a high level of internal consistency for both student (α = 0.90) and clinical (α = 0.96) samples, similar to that reported for the English BDI-II (α = 0.87 student sample; α = 0.92 clinical sample) and by Beck and colleagues (α = 0.93) [5]. The Welsh measure demonstrated a high degree of concurrent and discriminant validity, with a positive correlation with HADS (student sample: depression component r = 0.71; anxiety component r = 0.66) and negative correlations with the mental component of SF-12 v2 (student sample: r = −0.74; clinical sample: r = −0.71) and EQ-5D (student sample: r = −0.66; clinical sample: r = −0.55). Factor analysis revealed a two-factor structure emerging from both samples for each language version, with one overriding depression-related dimension. However, confirmatory factor analysis of the student data revealed that the three fit indices examined, the goodness-of-fit index (GFI), adjusted goodness-of-fit index (AGFI) and root mean square residual (RMR), did not meet the criteria for good fit (GFI = 0.54, AGFI = 0.47, RMR = 0.06).
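The internal consistency coefficients reported above are Cronbach's alpha values. As a hedged illustration of how such a coefficient is obtained, the sketch below applies the standard formula α = k/(k − 1) × (1 − Σ item variances / variance of total scores). The function name `cronbach_alpha` and the response matrix are hypothetical; the BDI-II itself has 21 items, each scored 0 to 3, whereas this toy example uses four items for brevity.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(responses[0])
    if k < 2 or len(responses) < 2 or any(len(row) != k for row in responses):
        raise ValueError("need >= 2 respondents, each with the same k >= 2 items")
    # Sample variance of each item across respondents.
    item_variances = [variance([row[i] for row in responses]) for i in range(k)]
    # Sample variance of the respondents' total scores.
    total_variance = variance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("alpha is undefined when all total scores are equal")
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses: five respondents, four items scored 0-3
# (illustrative only, not study data).
example_responses = [
    [0, 0, 1, 0],
    [1, 1, 1, 0],
    [1, 2, 1, 1],
    [2, 2, 3, 2],
    [3, 3, 2, 3],
]
alpha = cronbach_alpha(example_responses)
```

Alpha approaches 1 as items covary more strongly relative to their individual variances, which is why values such as those reported for the Welsh BDI-II are read as high internal consistency.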
The student Welsh BDI-II depression score was highly correlated with the English (r = 0.94), but the overall mean was significantly lower (Welsh M = 5.09, SD = 5.85; English M = 5.70, SD = 5.5), t(110) = 3.217, p = 0.002. The Bland-Altman graph [16] (Figure 1) revealed a small but significant bias towards the English BDI-II, which showed a slightly higher score than its Welsh comparator; the mean difference (MD) in scores was just over half a point (MD = 0.61, 95% limits of agreement 0.23 to 1.00). The depression score on the Welsh BDI-II was also highly correlated with the English (r = 0.96) within the clinical sample, but no statistically significant difference was noted between the mean scores.
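The paired comparison above can be sketched as follows. This is an illustrative outline under stated assumptions rather than the authors' analysis: the function name `paired_summary` and the scores are hypothetical, and the limits of agreement use the conventional Bland-Altman definition of the mean difference ± 1.96 standard deviations of the paired differences. The p-value is omitted because the Python standard library has no t distribution; with SciPy available, `scipy.stats.ttest_rel` returns both the statistic and the p-value.

```python
import math
from statistics import mean, stdev


def paired_summary(first, second):
    """Paired t statistic and Bland-Altman summary for two paired score lists."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two paired sequences of equal length (n >= 2)")
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    mean_diff = mean(diffs)   # the bias between the two versions
    sd_diff = stdev(diffs)    # sample SD of the paired differences
    if sd_diff == 0:
        raise ValueError("t is undefined when all differences are identical")
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    return {
        "mean_diff": mean_diff,
        "t": t_stat,
        "df": n - 1,
        # Conventional 95% limits of agreement: mean difference +/- 1.96 SD.
        "loa": (mean_diff - 1.96 * sd_diff, mean_diff + 1.96 * sd_diff),
    }


# Hypothetical paired totals for five respondents (illustrative, not study data).
english_scores = [5, 7, 3, 10, 6]
welsh_scores = [4, 7, 2, 9, 6]
summary = paired_summary(english_scores, welsh_scores)
```

A positive `mean_diff` here means the first list (the English scores in this example) runs higher on average, which is the direction of bias described in the text.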
Given the evidence of a seemingly biased measure and poorly fitting confirmatory factor analysis for the student sample, further item-level exploration was performed. No differences were found within the clinical sample between mean scores of the Welsh and English BDI-II for the individual items, and there were no indications of asymmetry. However, within the student sample, six items showed statistically significant differences on a paired t-test comparing mean scores between the Welsh and English BDI-II. Three of these items also indicated significant asymmetry (Table 2). Close inspection of the three items demonstrating bias revealed potential interpretations that may have led to an underscoring of the item in the Welsh BDI-II (Table 3).

Discussion
We have demonstrated how a thorough and rigorous approach to early validation can inform the refinement of translated outcome measures. Here, we examine the juxtaposition of these two processes (often reported independently in the literature); and discuss the wider implications for a revision of the guidelines and methods of cross-cultural validation of PROMs.
Our results support previous findings on the psychometric properties of the BDI-II, particularly in relation to the two-factor structure [5,7,8,17]; and concurrent validity with other depression and quality of life measures [18][19][20]. This indicates that the translation and early validation process was relatively successful. Despite the high correlation between the two language versions, the observed poor fit (indicating poor construct validity) and bias led us to explore potential sources of bias and items of concern. This prompted further scrutiny of the translated items to rule out any inaccuracies or misinterpretations, thus providing the opportunity to amend any problematic items. Whilst this step is acknowledged in the literature [4,21], it attracts little attention within current translation and validation guidelines [3,22].
In light of our evidence, it is possible that ambiguities in translation at the lower end of the scale biased responses to some items. This interpretation is strengthened by the fact that we detected no other subtle dissonances when the remaining items were similarly scrutinised. Moreover, the student data clustered at the lower end of the scale, whereas the majority of the clinical sample reported symptoms of moderate to severe depression, which would explain why this bias was not observed amongst the patients. Thus, whilst we acknowledge that our samples were small, our results are suggestive of a potential bias at the lower end of the scale. A stronger study design involving a qualitative exploration of the students' interpretations of the discrepant items may well have corroborated this finding.
Whilst this finding led to the refinement of the Welsh BDI-II, it also has several wider implications for instrument translators and developers. Firstly, it draws attention to the need for careful scrutiny in the translation of everyday vocabulary. Secondly, it demonstrates the importance of ensuring that the translated version of a measure is scaled equivalently to the original version. Thirdly, and more importantly, this finding confirms the value of investigating item discrepancies through early exploratory psychometric evaluation of translated measures prior to large-scale psychometric testing.

Recommendations
On the basis of our findings, we propose an additional final step (early psychometric testing) to the ISPOR guidelines [3]. This offers a novel, cost-effective approach towards bridging the linguistic and psychometric testing of PROMs that plugs a gap in the current literature and brings the rigour associated with clinical research development to the translation and validation platform.