Translation and adaptation of the German version of the Veterans Rand—36/12 Item Health Survey

Background The translated and culturally adapted German version of the Veterans Rand 36 Items Health Survey (VR-36), and its short form, the VR-12 counterpart, were validated in a German sample of orthopedic (n = 399) and psychosomatic (n = 292) inpatient rehabilitation patients. Methods The instruments were analyzed regarding their acceptance, distributional properties, validity, responsiveness and ability to discriminate between groups by age, sex and clinically specific groups. Eligible study participants completed the VR-36 (n = 169) and the VR-12 (n = 177). They also completed validated patient-reported outcome measures (PROs) including the Euroqol-5 Dimensions 5 Level (EQ-5D-5L); Depression, Anxiety and Stress Scale (DASS); Hannover Functional Abilities Questionnaire (HFAQ); and CDC Healthy Days. The VR-12 and the VR-36 were compared to the reference instruments MOS Short Form-12 Items Health Survey (SF-12) version 1.0 and MOS Short Form-36 Items Health Survey (SF-36) version 1.0, using percent of completed items, distributional properties, correlation patterns, distribution measures of known groups validity, and effect size measures. Results Item non-response varied between 1.8%/1.1% (SFVR-36/RESF-36) and 6.5%/8.6% (GHVR-36/GHSF-36). PCS was normally distributed (Kolmogorov–Smirnov tests: p > 0.05) with means, standard deviations and ranges very similar between SF-36 (37.5 ± 11.7 [13.8–66.1]) and VR-36 (38.5 ± 10.1 [11.7–67.8]), SF-12 (36.9 ± 10.9 [15.5–61.6]) and VR-12 (36.2 ± 11.5 [12.7–59.3]). MCS was not normally distributed with slightly differing means and ranges between the instruments (MCSVR-36: 36.2 ± 14.2 [12.9–66.6], MCSSF-36: 39.0 ± 15.6 [2.0–73.2], MCSVR-12: 37.2 ± 13.8 [8.4–70.2], MCSSF-12: 39.0 ± 12.3 [17.6–65.4]). Construct validity was established by comparing correlation patterns of the MCSVR and PCSVR with measures of physical and mental health. 
For both PCSVR and MCSVR there were moderate (≥ 0.3) to high (≥ 0.5) correlations with convergent (PCSVR: 0.55–0.76, MCSVR: 0.60–0.78) and small correlations (< 0.1) with divergent (PCSVR: < 0.12, MCSVR: < 0.16) self-report measures. Known-groups validity was demonstrated for both VR-12 and VR-36 (MCS and PCS) via comparisons of distribution parameters, with significantly higher mean PCS and MCS scores in both VR instruments found in younger patients with fewer sick days in the last year and a shorter duration of rehabilitation. Conclusions The psychometric analysis confirmed that the German VR is a valid and reliable instrument for use in orthopedic and psychosomatic rehabilitation. Yet further research is needed to evaluate its usefulness in other populations. Supplementary Information The online version contains supplementary material available at 10.1186/s12955-021-01722-y.


Background
Health-related quality of life (HRQoL) is a crucial outcome metric used in settings from clinical trials [1,2] to population health surveillance [3][4][5][6][7]. The Veterans Rand questionnaire (VR) is a multi-attribute generic instrument measuring patient-reported HRQoL. The instrument has a long form (VR-36) and a short form (VR-12), both measuring a physical component summary (PCS VR) and a mental component summary (MCS VR). The VR-36 also comprises eight scales, which correspond closely to those of the Medical Outcome Study (MOS) Short Form 36 version 1.0 (SF-36, [8][9][10]).
The VR instruments were created to address the veteran population in the United States (US) [11]. The Veterans Health Administration (VHA) is a national health care system, which serves over nine million military veterans in the US. It is one of the largest integrated health care systems in the US. This patient population has special medical needs, is older, poorer, sicker (with more diseases than veterans nationally) and has a higher percentage of men than the general adult population [12][13][14]. The creation of the VR instruments has been previously documented [13][14][15][16], and the instruments have been shown to be valid for the VA population [13,17][18][19][20][21][22][23][24][25][26][27] as well as other general US populations [28][29][30][31][32][33][34][35]. The English-language VR instruments have become an integral part of registries [36] and studies of national US health programs [18,37,38] including the evaluation of the Medicare Advantage Program by the Centers for Medicare and Medicaid Services (CMS). Advantages of the VR instruments include their validity in older and sicker populations, their availability (all instruments are in the public domain) and their strong psychometric properties across different and wide-ranging socio-demographic and clinical groups.
In this study, we translated and culturally adapted the VR-36 into the German language (Germany) and validated the VR-36 and VR-12 in a population of German patients undergoing inpatient rehabilitation. The German VR-36 and VR-12 were comprehensively validated and compared to the SF-36 and SF-12 in inpatient populations of orthopedic and psychosomatic rehabilitation patients (the two largest clinical indications of German inpatient rehabilitation patients).
The SF-36 and the SF-12 are considered gold standards of self-assessed generic health instruments and they have been extensively distributed and used across a wide range of countries, populations and purposes. They are recommended for measuring patient outcomes in the medical rehabilitation setting in Germany [39][40][41][42]. Since the field of medical rehabilitation has been one of the most common applications of the SF-36 in the German-speaking countries, it was important to compare the measurement properties of the VR instruments to the SF-instruments in this setting.

Methods
The study was conducted in two phases: phase (A) translating and culturally adapting the original English VR-36 into the German language (Germany); and phase (B) validating the VR-36 and its short version, the VR-12, in a randomized prospective study of inpatient rehabilitation patients with orthopedic and psychosomatic conditions.

Phase (A) translation and cultural adaptation of the German VR
The translation methodology followed a rigorous iterative forward-backward format to maintain the conceptual, functional, linguistic and cultural equivalence between the original (English) and the adapted (German) questionnaire. The translation procedure is summarized in Fig. 1. First, a German translation of the VR-36 was produced from the English original version by an experienced translator (DB). Because the VR-36 is analogous to the SF-36, the official German translation of the SF-36 items, which has already undergone rigorous translation and adaptation, served as a second translation to which we compared the forward translated VR items (German SF-36 Version 1 [8][9][10] and Version 2 [43]). A reconciled German VR-36 was produced after discussion of agreements and disagreements between the forward translation, SF-36 Version 1 and SF-36 Version 2, and translated back into the source language (English) by an experienced translator who is a native speaker of English and fluent in German. The backward translation was compared to the original English VR-36. Any discrepancies between the back translation and the English VR-36 were addressed with the back translator to determine the origins of discrepancies in the first reconciled German VR-36. After this stage, a pre-final version was produced, which was tested in a cognitive debriefing process with 26 patients and finalized afterwards.

Phase (B) validation study
Patient recruitment
Keywords: Quality of life, Veterans Rand 36 Items Health Survey, SF-36, SF-12, Health assessment, Health-related quality of life
Study participants were rehabilitation patients undergoing a three- or six-week inpatient rehabilitation due to an orthopedic or a psychosomatic indication. Recruitment took place in five rehabilitation clinics between October 2015 and November 2017. Patients who did not have cognitive or linguistic impairments were consecutively included in the study if they provided written informed consent. Participants completed questionnaires at the beginning (t1, baseline) and at the end (t2, three- to six-week follow-up) of their course of rehabilitation. Based on sample size calculations, which included drop-out assumptions of 20%, a study sample of n = 800 patients at t1 (n = 400/clinical indication and n = 200/instrument version) and n = 640 patients at t2 (n = 320/clinical indication and n = 160/instrument version) was targeted. Because the SF-36, the VR-36, the SF-12 and the VR-12 questionnaires are very similar, participants were randomly assigned to one of four groups (block randomization) to complete only one of these instruments (Fig. 2). Block randomization allowed an indirect comparison between the long and the short forms of the VR and the SF. The study was approved by the ethics committee of the University Medicine Greifswald, Germany, and was conducted according to the Declaration of Helsinki.
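The allocation to the four questionnaire arms can be illustrated with a minimal sketch of permuted-block randomization. The block size (one slot per arm, i.e. blocks of four), the function name and the seed are assumptions made for illustration; the study does not report these details.

```python
import random

# The four study arms; each participant completes exactly one instrument.
ARMS = ["SF-36", "VR-36", "SF-12", "VR-12"]


def block_randomize(n_participants, arms=ARMS, seed=None):
    """Allocate participants to arms using permuted blocks.

    Assumption: each block contains every arm exactly once, so the arms
    stay balanced after every complete block.
    """
    rng = random.Random(seed)
    allocation = []
    while len(allocation) < n_participants:
        block = list(arms)
        rng.shuffle(block)  # random order within the block
        allocation.extend(block)
    return allocation[:n_participants]


if __name__ == "__main__":
    alloc = block_randomize(800, seed=2015)
    print({arm: alloc.count(arm) for arm in ARMS})
```

With the targeted n = 800 at t1, this scheme yields the planned n = 200 per instrument version.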

Measures
In addition to the VR and SF instruments, the patient questionnaires contained several other self-report measures. These measures were chosen to correspond to the eight scales and the summary scores of the VR instruments in order to validate the VR instruments.
The EQ-5D-5L questionnaire is an internationally widely used preference-based measure of self-assessed health [44][45][46]. The questionnaire measures impairments in five dimensions of health using five items, each with five levels of impairments, and a thermometer-like visual analogue scale (EQ VAS). The values of the five items can be converted into a preference-based single utility index. In the present study, index values were calculated using the German tariff [47].
The Centers for Disease Control and Prevention (CDC) "Healthy Days" is a generic HRQoL questionnaire containing four items measuring self-rated health and the number of disability days (out of the last 30) due to physical and mental health or limitations in activities [48,49]. The instrument is valid and reliable [48].
The Hannover Functional Abilities Questionnaire (HFAQ) is a 12-item generic measure of (physical) functional ability of daily activities [50][51][52]. Each item has three levels of functioning. All items can be combined to an additive summary score.
The Depression, Anxiety and Stress Scale (DASS) is an extensively validated measure of mental health [53,54]. In this study, the short form (21-item, DASS-21) instrument was used.
The Graded Chronic Pain Scale (GCPS) is an internationally established instrument developed by Von Korff et al. [55,56]. The GCPS comprises seven items: self-rated pain intensity and pain disability are measured on 0 to 10 numeric rating scales, and one item asks for the number of disability days (in the past three months) due to pain. Summation of GCPS items produces scores describing pain intensity and pain disability.
The Index for the Assessment of Health Impairments, IMET [57,58], measures participation as defined by the WHO International Classification of Functioning, Disability and Health, ICF. The 9-item questionnaire was applied and tested in several samples from rehabilitation patients of different clinical indications. It is suitable as a screening method to assess the risk of a failure in the professional reintegration of rehabilitation patients. The instrument is demonstrated to be an economic, highly practicable, valid and reliable operationalization of "activities and participation" according to the concept of the ICF. Norm values for the IMET were assessed in a random sample of Lübeck inhabitants comprising subjects between 19 and 79 years of age, and enable classification of limitations in participation for people undergoing rehabilitation or suffering from chronic diseases.
The vitality subscale of the Indicators of the REhabilitation Status (IRES-VE) was included to examine the construct validity of the VR items on vitality [59]. In Germany, the IRES is recommended (in addition to the SF-36) for rehabilitation research and practice [42].

Statistical analysis
The VR-36 and the VR-12 were analyzed regarding the completeness of data on the scale level, distributional properties, construct validity, known-groups validity, internal consistency (as one aspect of reliability), and responsiveness to change. This was done on the summary scores of the VR-36 and the VR-12 (physical component score (PCS VR) and mental component score (MCS VR)) as well as the eight VR-36 scales: physical functioning (PF VR-36), role functioning/physical (RP VR-36), role functioning/emotional (RE VR-36), vitality (VT VR-36), mental health (MH VR-36), social functioning (SF VR-36), pain (BP VR-36), and general health (GH VR-36). The VR instruments have not previously been used in German populations and normed scores have not yet been developed. Therefore, summary scores and scales were scored according to the VR-36 and VR-12 algorithms, using a t-score transformation with a mean of 50 and a standard deviation of 10 and normed to a general sample of the US population for the summary scales (PCS and MCS) [23,[60][61][62]]. The scoring algorithms for the VR-36 and the VR-12 impute for missing data: the VR-12 extrapolates scores based on the missing-data pattern, and the VR-36 conducts mean imputation at the subscale level if fewer than 50% of the subscale items are missing. In all analyses, all available data were used (available case analysis). Because the SF-36 and the SF-12 instruments are well validated across a range of populations, they were used as the comparator to the VR instruments for all analyses.
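The two generic scoring steps named above, subscale-level mean imputation under the 50% rule and the t-score transformation, can be sketched as follows. This is not the official VR scoring algorithm: the function names are hypothetical, and the reference mean and standard deviation must come from the US norm sample, which is not reproduced here.

```python
def scale_mean_with_imputation(items):
    """Mean of a subscale's items, with missing items coded as None.

    Averaging the answered items is equivalent to replacing each missing
    item by the respondent's own mean on that subscale. The score is only
    computed if fewer than 50% of the items are missing.
    """
    answered = [x for x in items if x is not None]
    n_missing = len(items) - len(answered)
    if not items or n_missing >= len(items) / 2:
        return None  # too many missing items: no scale score
    return sum(answered) / len(answered)


def t_score(raw, ref_mean, ref_sd):
    """Standardize against a reference population, then rescale to mean 50, SD 10."""
    return 50.0 + 10.0 * (raw - ref_mean) / ref_sd


if __name__ == "__main__":
    print(scale_mean_with_imputation([1, 2, None, 3]))     # one of four items missing
    print(scale_mean_with_imputation([1, None, None, 3]))  # half missing -> None
    print(t_score(60.0, ref_mean=50.0, ref_sd=20.0))       # hypothetical reference values
```

A t-score of 50 therefore corresponds to the mean of the reference population, and each 10 points correspond to one reference standard deviation.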
Completeness of data is an indicator of data quality and acceptance of the questionnaire by respondents. The percentage of non-missing responses was calculated for the eight VR-36 scales, stratified by respondent characteristics (e.g. clinical indication, age, sex, education). No imputation was carried out to deal with missing data for statistical analyses.
Distributional properties (such as means, standard deviations and range) for the VR instruments were analyzed on the scale and summary score levels. To compare the distributional properties of the PCS and MCS for both the VR-12 and SF-12 as well as the VR-36 and the SF-36, classical statistical indices of distribution such as mean, standard deviation, minimum, maximum, skewness (to assess and compare the type and strength of asymmetry) and kurtosis (as a measure of the steepness or flatness of the frequency distribution) were assessed. The Kolmogorov–Smirnov test was used to compare the distributions of the two summary scores of the VR and the SF, i.e. PCS VR and PCS SF as well as MCS VR and MCS SF. Kernel density plots using the Epanechnikov function were used to visually examine the distribution of summary scores and scales.
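For readers unfamiliar with kernel density estimation, the following minimal sketch evaluates a density estimate with the Epanechnikov kernel at a single point. The bandwidth and the example scores are arbitrary assumptions; statistical packages choose the bandwidth by their own default rules, which are not replicated here.

```python
def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def kernel_density(x, data, bandwidth):
    """Kernel density estimate at point x for a list of observations."""
    total = sum(epanechnikov((x - xi) / bandwidth) for xi in data)
    return total / (len(data) * bandwidth)


if __name__ == "__main__":
    scores = [31.2, 36.5, 38.0, 41.7, 47.3]  # made-up summary scores
    for x in range(25, 56, 5):
        print(x, round(kernel_density(x, scores, bandwidth=5.0), 4))
```

Evaluating the estimate over a grid of score values and plotting the result gives the kind of smoothed curve shown in kernel density plots.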
Construct validity refers to the degree of accuracy with which a measurement instrument captures the construct it claims to measure. To examine construct validity, Pearson correlation coefficients (r p) between VR summary scores (PCS VR and MCS VR) and other self-completed health measures were assessed. We compared these to the correlations between the PCS SF and MCS SF with other self-completed health measures. Correlation coefficients were compared using significance tests for correlations for independent samples [63]. The correlations between PCS VR and other self-reported physical health measures (e.g. HFAQ, CDC Physical unhealthy days, GCPS Disability) were expected to be higher (convergent validity) than with self-report measures of mental health (divergent validity). Similarly, MCS VR was expected to be more strongly correlated with self-reported mental health measures (e.g. DASS-Anxiety, DASS-Stress, DASS-Depression, CDC Mental unhealthy days) than with physical measures. Both PCS and MCS were expected to be similarly correlated with generic self-report measures (e.g. EQ VAS, IMET) and GCPS-Pain. Correlations were interpreted as follows: r p < 0.1 small, 0.3 ≤ r p < 0.5 moderate, r p ≥ 0.5 high/strong [64].
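A widely used test for comparing two correlation coefficients from independent samples is based on Fisher's z-transformation. The sketch below illustrates that approach; that it matches the procedure in [63] exactly is an assumption, and the example values are invented.

```python
import math


def compare_independent_correlations(r1, n1, r2, n2):
    """Two-sided Fisher z-test for the difference of two independent correlations.

    Returns the z statistic and the two-sided p-value under the standard
    normal distribution. Requires n1 > 3 and n2 > 3.
    """
    z1 = math.atanh(r1)  # Fisher z-transform of r1
    z2 = math.atanh(r2)
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p


if __name__ == "__main__":
    # Invented example: r = 0.70 in one arm versus r = 0.64 in another.
    z, p = compare_independent_correlations(0.70, 169, 0.64, 177)
    print(round(z, 3), round(p, 3))
```

Because the two instruments were completed by different patients, the samples are independent and no adjustment for overlapping data is required in this setting.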
Known-groups validity is a criterion-based technique to investigate the ability of a measure to discriminate between groups known to differ in the construct of interest. For this study, known groups were defined by clinical indication (psychosomatic, orthopedic), treatment program ("curative therapy" typically for chronically ill patients, "medical follow-up treatment" generally after joint replacement, only for orthopedic patients), age (< 45 years, 45-65 years, > 65 years), duration of rehabilitation (median), sick days in the past 12 months, and self-rated health (SRH, "excellent/very good/good" vs. "fair/poor"). We examined whether mean PCS VR and mean MCS VR scores were significantly different between those pre-defined groups using t-tests for two groups or ANOVA for more than two groups.
Internal consistency (IC) is a measure of reliability. A scale is considered reliable if its items are homogeneous, i.e., highly correlated because they measure the same underlying construct [65]. In this study, Cronbach's alpha was used as a measure of IC with α ≥ 0.7 interpreted as acceptable, α ≥ 0.8 as good, and α ≥ 0.9 as excellent.
Responsiveness refers to a self-assessed health instrument's ability to capture changes in health over time [66]. The raw differences of SF and VR summary scores from t1 to t2 were divided by the pooled standard deviation of change to produce standardized response means (SRM), or divided by the baseline standard deviation to produce standardized effect sizes (SES). As we assessed patients before and after an intensive treatment, analyses were restricted to respondents who reported stable (t1 = t2) or improved (t1 < t2) health on a single SRH item (n = 133) to assess responsiveness to health improvements. We further checked improvement (from t1 to t2) for all PCS and MCS scores of all four instruments using paired t-tests. The magnitude of changes in scores (expressed as SRM and SES) was interpreted as follows: values of < 0.3 were considered small, values between 0.3 and 0.59 medium, and values ≥ 0.6 large [67]. Since there are different methods to estimate the magnitude of change within groups, and consensus is lacking on their interpretation [68], we calculated both SES and SRM for comparison purposes. Due to the repeated measurement design the measurements are correlated, which was shown to affect the magnitude of SRM [69]; to account for this, we additionally correlated both measurements (Pearson correlation coefficient, r t1/t2).
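The two effect size measures differ only in their denominator, as the following sketch shows. It uses the plain standard deviation of the individual change scores for the SRM (pooling across groups is not attempted here) and treats values below 0.6 that are not small as medium; the function names and the example data are invented.

```python
from statistics import mean, stdev


def standardized_effect_size(t1, t2):
    """SES: mean change from t1 to t2 divided by the SD of the baseline scores."""
    changes = [b - a for a, b in zip(t1, t2)]
    return mean(changes) / stdev(t1)


def standardized_response_mean(t1, t2):
    """SRM: mean change from t1 to t2 divided by the SD of the change scores."""
    changes = [b - a for a, b in zip(t1, t2)]
    return mean(changes) / stdev(changes)


def interpret_magnitude(value):
    """Thresholds from the text: < 0.3 small, 0.3 to 0.59 medium, >= 0.6 large."""
    size = abs(value)
    if size < 0.3:
        return "small"
    if size < 0.6:
        return "medium"
    return "large"


if __name__ == "__main__":
    baseline = [30.0, 40.0, 50.0]   # invented PCS scores at t1
    follow_up = [35.0, 44.0, 57.0]  # invented PCS scores at t2
    ses = standardized_effect_size(baseline, follow_up)
    srm = standardized_response_mean(baseline, follow_up)
    print(round(ses, 3), interpret_magnitude(ses))
    print(round(srm, 3), interpret_magnitude(srm))
```

In the invented example the change scores vary much less than the baseline scores, so the SRM is considerably larger than the SES; this is the dependence on the t1/t2 correlation mentioned above.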
Data were analyzed using IBM SPSS Statistics 24 and STATA SE 13. Wherever applicable, analyses were stratified by clinical indication (orthopedic or psychosomatic rehabilitation).

Results
(A) Translation and cultural adaptation of the German VR
There were no major problems found in the forward-backward translations. Reconciliation of the items did not lead to problems. The field test showed that most of the questions of the VR-36 (except for the RE and RP instructions, response scales and questions) are clear and simple to both rehabilitation patients (n = 15, 4 male, 11 female, 30 to 80 years (mean 55.3 years)) and patients from general practice (n = 11, 25 to 77 years (mean 57.4 years)) of all ages. Additional file 1 shows the key issues that were discussed during the translation process (forward-backward translation, reconciliation and cognitive debriefing) and how the items were reconciled. Besides the already described adaptation needs identified during the cognitive debriefings, adaptations to the cultural context were needed. The German SF-36 was used as a guide in these decisions. For example, playing golf (used as an example in one item) is a less popular activity in Germany than in the USA. In the search for a culturally appropriate counterpart, hiking and walking were found to be appropriate but not practicable. We therefore removed the example, as was also done for the German SF-36. In two items (BP2, SF1), for purposes of international equivalency, the right-most response category "extremely" was translated into German as "sehr" (English: "very much"), which is also used by the German version of the SF-36.
During the translation process, some double negatives were introduced as a result of combining the questions with their response choices (e.g. "[…] nicht so lange […]" (part of the question) "nein, nie" (response option)). As these double negatives also exist in the English version of the instruments, they were left in the German translation. However, nearly every third field-test participant had problems with the double negatives. Therefore, "yes" and "no" were omitted for these response categories to clarify the language. From a linguistic point of view, these revised response categories resemble the English SF-36 Version 2 and the German SF-36 (versions 1 and 2).
The final German VR-36 is conceptually identical to the English original.

Completeness of data
Missing values were acceptable (< 7%) for the VR-36 and comparable to missing data patterns of the SF-36 (Table 2). The scale GH had the lowest percentage of completion for both the SF-36 (93.1%) and VR-36 (93.5%). As expected, there is a tendency of missing values to increase with increasing age and lower education.
For the long and the short form versions of the VR and the SF, the PCS has normal distributions (p = 0.057 to 0.097) while the MCS does not (p < 0.05, Table 3). The findings do not substantively change when stratified by study arm and clinical indication (results not shown).
The VR-36 scales distribute toward slightly lower scores than the SF-36 on the MCS, but not for the PCS. Kernel density plots show that the four instruments were more similar in PCS for orthopedic and MCS for psychosomatic patients. The distributions were more similar between the SF-12 and the SF-36 than between the SF and the VR instruments in PCS for psychosomatic and MCS for orthopedic patients (Fig. 3a). Differences were observed after stratifying by clinical indication. For the scales of the instruments, kernel plots of the VR-36 and the SF-36 are comparable for PF and BP, RP and RE, while kernel plots of SF VR-36, VT VR-36 and MH VR-36 are slightly more left-skewed compared to the SF-36 (Fig. 3b).
Table 4
Table 5 illustrates the PCS VR-36 and MCS VR-36 scores in sub-samples of known groups. Lower mean PCS VR-36 was found for orthopedic patients while lower mean MCS VR-36 was found for psychosomatic patients. In line with our hypothesis, higher mean PCS VR-36 and MCS VR-36 scores were found in younger patients with fewer sick days in the last year and a shorter duration of rehabilitation. As expected, at baseline, orthopedic patients reported better mental health compared to psychosomatic patients, and the reverse held for physical health, which is reflected by higher mean MCS VR-36 scores in orthopedic and higher mean PCS VR-36 scores in psychosomatic patients. Results were similar for VR-12 and VR-36, suggesting that both instruments perform similarly with respect to known-groups validity (Table 6): all MCS and PCS scales differentiated groups based on clinical indication, duration of rehabilitation and self-rated health, and PCS VR-12 additionally for sick days. Both PCS scales additionally differentiated by type of therapy, which applies only to orthopedic patients.

Internal consistency (IC)
Except for GH (acceptable), IC was good to excellent for both VR and SF scales and with one exception (MH) always higher for the VR scales (Table 7).

Responsiveness
Responsiveness to change analysis included the n = 50 to n = 88 cases with no deterioration in SRH from t1 to t2, stratified as necessary by study arm (Table 8). For PCS, SES varied from 0.102 (VR-36 psychosomatic) to 0.398 (SF-12 orthopedic) and SRM varied from 0.127 (VR-36 psychosomatic) to 0.695 (VR-12 orthopedic) with better responsiveness across all instruments for orthopedic patients. Effect sizes of the short versions (VR-12, SF-12) were larger than those of the long versions (VR-36, SF-36). In psychosomatic patients, responsiveness to change of MCS was at least twice as large as responsiveness of PCS, while in orthopedic patients there were less obvious differences in responsiveness to change between PCS and MCS. Responsiveness of the PCS VR-36 for psychosomatic patients was smaller than that of the other instruments. Score improvements for all four instruments were statistically significant at p < 0.001 (paired t-tests).

Discussion
This research project (1) translated and culturally adapted the English VR-36 to the German language (Germany) and (2) validated the adapted VR-36 and VR-12 in German orthopedic and psychosomatic inpatient rehabilitation patients. This article provides details of the translation and cultural adaptation process of the German VR and the main findings of the validation study.
The German translation of the VR was prepared according to "state of the art" criteria for cultural adaptation of self-assessed health questionnaires using forward and backward translations. The study produced a self-report questionnaire that is conceptually and semantically equivalent to the English language VR-36. The only difficulty during translation concerned the role physical (RP) and role emotional (RE) items, which produced double negatives when the question stems and responses were taken together. This was resolved by a slight change in response category wording.
The German VR-36 is the third cultural adaptation and translation of the VR after the Spanish and the Chinese version. Three more language versions (Japanese, Russian, Polish) are being planned.
The validation phase of this study found the VR instruments to be acceptable, valid and moderately to strongly responsive to improvements in health. We indirectly compared the German VR-36 and VR-12 to the well-established SF-36 and SF-12, and found the instruments to be comparable in their distribution properties, validity, and responsiveness. Data quality indicators, such as the extent of item non-response, show the VR to be acceptable instruments in a German rehabilitation population, and were similar to those of the SF instruments. PCS score distributions were similar for VR and SF instruments. However, the MCS VR was distributed more in the lower range of the scale than the MCS SF. The VR scales and summary scores were moderately to strongly correlated with expected external measures such as self-reported pain, physical functioning, mental functioning and disability. Both the long and the short form of the VR could distinguish between patient type (orthopedic and psychosomatic), duration of rehabilitation and self-rated health, while both PCS VR-12 and PCS VR-36 could also distinguish between type of therapy, and PCS VR-12 could distinguish whether the patient had over 100 sick days in the last year. The short version (VR-12) was as responsive as the VR-36 and SF-36. Thus, the VR was established as a valid and responsive measure of quality of life in orthopedic and psychosomatic samples of German inpatient rehabilitation patients. The number of studies using one of the instruments of the VR family is increasing every year with well over 400 publications [70]. The developers of the VR family provided the original psychometric evidence for the VR-36 and VR-12 [13,15,16,23].
Item level missing values were low and comparable to other studies suggesting high acceptability. While in this study 1.8% to 6.5% were missing per question for the baseline VR-36, Kronzer et al. [71] reported missing values in adult patients undergoing elective surgery on the baseline VR-12 from 1.5 to 3.7% per question and from 3.3 to 8.9% on the follow-up VR-12 (median 56 days).
Descriptive statistics indicated acceptable distributional characteristics. Summary scale means and SD of the PCS VR-36 are comparable with the results of the  [17]. The differences in MCS may be a function of the populations sampled; while the means were different the SD are quite similar.
The validity results are comparable with other studies investigating physically impaired patients: a study with patients undergoing knee arthroplasty [31] found a moderate correlation between the PCS VR-12 and a disease-specific measure (KOOS-pain score: 0.57). Since only few studies investigated the factor structure of the VR-36, e.g. [60], this needs further investigation.
Oak et al. [31] found the PCS VR-12 to capture statistically significant improvements in n = 45 pre- and postoperatively tracked patients who underwent knee arthroplasty. They found no statistical differences in internal or external responsiveness to change among the EQ-5D, VR-12 and PROMIS 10 physical instruments with SRMs of the PCS VR-12 of 0.681 and for the MCS VR-12 of 0.103 (SRM EQ-5D: 0.704, PROMIS 10 physical: 0.721, PROMIS 10 mental: 0.083). SRM of VR-12 scores at baseline and at the end of therapy (0.549) can be calculated from results […] [74] study of an orthotic and rehabilitation program found statistically significant improvements only in the PCS but not in the MCS. For orthopedic patients, we found PCS to be less sensitive to changes in both SF and VR than the MCS, with the VR-12 similar or more sensitive to improvements than the SF instruments. However, the VR-36 was found to be slightly less sensitive to improvements than the SF-36 for psychosomatic patients.
Although the VR-36 and VR-12 are based on version 1 of the SF-36 and SF-12, the VR instruments use the five-level response format of the role functioning and role emotional scales whereas the SF version 1 instruments use the two-level format. The SF version 2 uses five-level response scales for those scales, but has slightly different wording and is in general a different instrument than version 1. This difference is likely the source of differences in distribution and responsiveness in our comparison of the VR to SF version 1 instruments. The floor was raised and ceiling lowered with the 5-point set of response choices for the role physical and role emotional scales compared with the dichotomized choices for the SF version 1 instruments [16]. Previous findings suggest that this could also be a possible explanation for the differences in responsiveness [16]. Gornet et al. [35] investigated the conversion of the SF-36 to PCS VR-12 and MCS VR-12 in 1968 patients who underwent lumbar (n = 1559) and cervical (n = 409) surgery between 1998 and 2013. They found the SF-36 and converted VR-12 mean scores, the mean (pre to post) change scores for PCS and MCS, and the minimum detectable change (MDC) to be extremely similar. However, as their study only collected SF-36 data, they could not compare how a 2-level and 5-level response category in the two scales might differ.
The primary limitation of this study is the indirect comparison of the instruments: the VR-36, VR-12, SF-36 and SF-12 were completed by different patients. The design choice was to minimize respondent burden and frustration as the four instruments are very similar. Although patients were randomized to the study arms, there could be underlying differences across the groups not captured by demographic or patient characteristics. Thus, it is possible that the detected distribution and responsiveness differences may in part be due to differences in the sample characteristics and perhaps unmeasured variables and not due to the instruments themselves.
Due to the magnitude of this time interval (of four to six weeks) and the intervention, it was not feasible to investigate test-retest reliability. Even after a week, which is the usual lag time between test and retest, we would expect patients to change as they are undergoing intense rehabilitation treatment. This is why we investigated internal consistency as a measure of internal reliability. However, test-retest reliability is still to be investigated for the German version of the VR.
Furthermore, the German VR was validated in an inpatient rehabilitation setting, and the results may not be generalizable to other populations nor to outpatient rehabilitation settings. Future research applying the German VR in other settings is necessary. The instruments were also administered only as a paper-and-pencil survey. As self-assessment questionnaires are increasingly being used in electronic formats, the comparison between the classical paper-pencil and other new computer platform applications should be studied.
Since this is the first study of this new German instrument, which aimed to adapt and test it in the German population, German norms have not yet been developed. This will be one of the next steps of instrument development. Therefore, for the evaluation in this study, we relied on the US norms.

Conclusions
The VR is a credible measure in the public domain that can be applied in the German rehabilitation context. The VR measure may be appropriate for use in clinical research and clinical practice, but further research is needed to evaluate its usefulness in other populations in Germany. Due to the high demand for the German VR during the study period, it can be assumed that in the foreseeable future more data from different clinical settings and administrative modes will be available. The scoring algorithms have also been developed by the project working group for common statistical programs (e.g. SPSS, Stata, R) and are, like the questionnaires, freely available to the research community.