This study has important implications for studies using the RV to compare the validity of PRO measures. First, this work demonstrates the importance of calculating the confidence interval and determining the statistical significance of the RV when comparing the validity of PRO measures. Second, our findings suggest that RVs of equal size but calculated under different comparison conditions have distinct statistical implications and should be interpreted differently. A review of about 40 articles published in three relevant journals (*Journal of Clinical Epidemiology*, *Medical Care*, and *Quality of Life Research*) between 1990 and 2012 revealed that the circumstances under which the RV was computed varied widely. The sample size per ANOVA group ranged broadly from 42 to nearly 4000 [32, 33], the F-statistic of the reference measure ranged widely from less than 4 to over 400 [12, 24], and the correlation between the comparator and reference measures was rarely reported. We suspect that most studies, lacking a confidence interval for the RV estimate, over-interpreted observed differences in RVs with small denominator F-statistics, ignoring the possibility of falsely rejecting the null hypothesis of no difference when only chance was in operation. Conversely, "small" but possibly meaningful and statistically significant differences may have been overlooked.

This work also has important implications for designing future studies using the RV. When planning power calculations for such studies, we suggest that researchers begin with reasonable estimates of the correlation between the comparator and reference measures, along with the ANOVA group means and standard deviations. Armed with these estimates, investigators will better understand how to control the sample size to achieve a denominator F-statistic large enough for sufficient power. The effect of the correlation between measures on the RV is important given the increasing interest in developing more "efficient" forms from the same item bank [34]. It is therefore realistic to assume that PRO measures with the same questions but varying in length are very highly correlated (r > 0.9) for the same group of respondents, as with the alternative forms of QDIS-CKD (CAT-5, Static-6, and Static-34) presented in our current study. Furthermore, it seems reasonable to assume at least moderate correlations (r > 0.5) for measures assessing similar concepts with different questions, such as the different CKD-specific measures, or the CKD-specific and generic measures with common domains. Our findings also suggest lower correlations (r < 0.5) for measures of distinct domains, such as physical and mental health.
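As a minimal sketch of such a planning exercise, one could simulate comparator scores correlated with the reference measure at an assumed level and inspect the resulting denominator F-statistic and RV. All inputs below (group means, standard deviation, correlation, and per-group sample size) are hypothetical planning values, not estimates from our study:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(12345)

# Hypothetical planning inputs (assumptions, not study values):
# three severity groups, group means for the reference measure,
# a comparator correlated with it at r = 0.8, and n per group.
group_means = [45.0, 50.0, 55.0]
sd = 10.0
r = 0.8
n_per_group = 200

ref_groups, comp_groups = [], []
for mu in group_means:
    ref = rng.normal(mu, sd, n_per_group)
    # Comparator: correlated at r with the reference within each group.
    comp = mu + r * (ref - mu) + np.sqrt(1 - r**2) * rng.normal(0, sd, n_per_group)
    ref_groups.append(ref)
    comp_groups.append(comp)

F_ref = f_oneway(*ref_groups).statistic    # denominator F-statistic
F_comp = f_oneway(*comp_groups).statistic  # numerator F-statistic
rv = F_comp / F_ref                        # relative validity estimate
print(f"F_ref={F_ref:.1f}, F_comp={F_comp:.1f}, RV={rv:.2f}")
```

Varying `n_per_group` in such a simulation shows how the sample size drives the magnitude of the denominator F-statistic, which in turn governs the power available to detect a given RV.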

All confidence intervals in this study were based on the bias-corrected and accelerated (BCa) bootstrap method. There is wide consensus that this method is preferred over other methods [27]. However, there are a few caveats. First, if the acceleration parameter is small (< 0.025), some simulations suggest that the coverage of the BCa interval may be erratic. Second, if there is no bias, meaning that the bootstrap distribution is not skewed and its center is very close to the center of the observed distribution, bias correction may decrease precision and unnecessarily widen the BCa interval [35]. Therefore, under circumstances of no bias and minimal acceleration, the percentile-based confidence interval may offer some advantage. However, we would urge caution because these "ideal" circumstances are not likely to be found in real studies. In fact, we found important bias and substantial acceleration factors in our bootstrap simulations.
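As an illustration of the approach, a BCa interval for the RV can be obtained with `scipy.stats.bootstrap` by resampling subjects jointly (paired resampling) so that the comparator and reference F-statistics are recomputed from the same bootstrap sample. The data below are simulated placeholders, not our study data, and the code is a sketch rather than our exact implementation:

```python
import numpy as np
from scipy.stats import bootstrap, f_oneway

rng = np.random.default_rng(0)

# Hypothetical paired data: each subject has a severity-group label and
# scores on a comparator and a reference measure (assumed values).
groups = np.repeat([0, 1, 2], 100)
ref = rng.normal(50 + 5 * groups, 10)
comp = 0.8 * ref + rng.normal(0, 6, ref.size)

def rv_stat(comp, ref, groups):
    """RV = F(comparator) / F(reference) from one-way ANOVA."""
    f_c = f_oneway(*(comp[groups == g] for g in np.unique(groups))).statistic
    f_r = f_oneway(*(ref[groups == g] for g in np.unique(groups))).statistic
    return f_c / f_r

res = bootstrap((comp, ref, groups), rv_stat, paired=True,
                vectorized=False, method='BCa', n_resamples=2000,
                random_state=rng)
print(res.confidence_interval)
```

Because BCa estimates the acceleration via a jackknife over subjects, the interval accounts for both the skewness and the bias of the bootstrap distribution of the RV.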

This study has specific limitations worth considering. First, the simulations were based on one data set of PRO measures administered to CKD patients. Nevertheless, we expect the findings to generalize to PRO measures in other conditions; that said, validation using different samples and conditions is desirable. In addition, we limited the number of simulation replications to 100 for most simulation conditions. Selected comparisons were repeated with a much larger number of replications, and similar results were found. Finally, we selected the reference measure with the largest F-statistic, which limited the RV values to below the null value of 1. Nevertheless, the statistical significance of the RV should not be affected by the choice of the reference measure, and we would like to investigate this further in the future. Our next step is a more comprehensive study with additional conditions and data sets. It is hoped that such a comprehensive simulation study will provide practical guidance in the form of a look-up table suggesting the minimum denominator F-statistics required for sufficient power to detect a range of RVs (both below and above 1) under varying circumstances (e.g., measures with different degrees of correlation).

It is noteworthy that the RV methodology proposed in this study is appropriate only when the assumptions of ANOVA are met. These assumptions are independent observations, a normally distributed dependent variable within groups, and homogeneity of variances across groups. That said, it is also well recognized that ANOVA is quite robust to deviations from normality and violations of homogeneous variance [36, 37]. To implement this methodology, it would be ideal to have all respondents complete all measures being compared, as in our current study. However, in longer surveys this could greatly increase respondent burden. One potential approach is therefore to randomize respondents to complete only selected measures. Under such randomization, however, the ANOVA group sample sizes should be approximately equal across measures (if the measures have equal total sample sizes) or proportionally the same (if unequal), so that their F-statistics are comparable.
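The assumption checks described above can be run with standard tests: Levene's test for homogeneity of variances across groups and the Shapiro-Wilk test for within-group normality. The sketch below uses simulated placeholder scores for three ANOVA groups, not our study data:

```python
import numpy as np
from scipy.stats import levene, shapiro, f_oneway

rng = np.random.default_rng(1)

# Hypothetical scores for three ANOVA severity groups (assumed values).
groups = [rng.normal(m, 10, 80) for m in (45.0, 50.0, 55.0)]

# Homogeneity of variances across groups (Levene's test).
lev = levene(*groups)

# Within-group normality (Shapiro-Wilk test, one p-value per group).
norm_p = [shapiro(g).pvalue for g in groups]

F = f_oneway(*groups).statistic
print(f"Levene p={lev.pvalue:.3f}, "
      f"Shapiro p per group={np.round(norm_p, 3)}, F={F:.1f}")
```

Given ANOVA's robustness noted above, these tests are best treated as descriptive diagnostics rather than strict gatekeepers, especially with large groups where trivial deviations become "significant".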

Finally, when evaluating the statistical significance of the RV, it is important to recognize that low power increases the risk of failing to detect clinically important differences, and that very high power can confer statistical significance on clinically trivial differences. Therefore, differences between measures should always be interpreted clinically, for example, by accounting for the proportions of patients misclassified by the different PRO measures.

### Consent

Written informed consent was obtained from the patient for publication of this report.