Reproducibility: reliability and agreement parameters of the Revised Short McGill Pain Questionnaire Version-2 for use in patients with musculoskeletal shoulder pain

Background: The Revised Short McGill Pain Questionnaire Version-2 (SF-MPQ-2) is a multidimensional outcome measure designed to capture, evaluate and discriminate pain from neuropathic and non-neuropathic sources. A recent systematic review found insufficient psychometric data with respect to musculoskeletal (MSK) health conditions. This study aimed to describe the reproducibility (test–retest reliability and agreement) and internal consistency of the SF-MPQ-2 for use among patients with musculoskeletal shoulder pain.
Methods: Eligible patients with shoulder pain from MSK sources completed the SF-MPQ-2 at baseline (n = 195), and a subset did so again after 3–7 days (n = 48) if their response to the Global Rating of Change (GROC) scale remained unchanged. Cronbach alpha (α) and the intraclass correlation coefficient (ICC2,1), with their related 95% CI, were calculated. Standard error of measurement (SEM), group and individual minimal detectable change (MDC90), and Bland–Altman (BA) plots were used to assess agreement.
Results: Cronbach α ranged from 0.83 to 0.95, suggesting very satisfactory internal consistency across the SF-MPQ-2 domains. Excellent ICC2,1 scores were found in support of the total scale (0.95) and continuous subscale (0.92) scores; the remaining subscales displayed good ICC2,1 scores (0.78–0.88). Bland–Altman analysis revealed no systematic bias between the test and retest scores (mean difference = 0.13–0.19). While the best agreement coefficients were seen on the total scale (SEM = 0.5; MDC90individual = 1.2 and MDC90group = 0.3), they were acceptable for the SF-MPQ-2 subscales (SEM: range 0.7–1; MDC90individual: range 1.7–2.3; MDC90group: range 0.4–0.5).
Conclusion: Good reproducibility supports the SF-MPQ-2 domains for augmented or independent use in MSK-related shoulder pain assessment, with the total scale displaying the best reproducibility coefficients. Additional research on the validity and responsiveness of the SF-MPQ-2 is still required in this population.

Pain assessment in clinical practice and research often places emphasis on monitoring pain intensity, even though pain is known to be multidimensional and experienced uniquely by individuals [9]. Patients perceive pain across six diverse dimensions: physiologic, sensory, affective, cognitive, behavioral and socio-cultural [9,10]. The comprehensive assessment and monitoring of these dimensions should improve patient care [11]. A multidimensional pain assessment tool that provides a holistic assessment of pain has been recommended by experts [12][13][14] for use in upper extremity conditions, including shoulder disorders.
The Revised Short McGill Pain Questionnaire Version-2 (SF-MPQ-2) is an example of a general-use multidimensional pain tool that comprehensively examines the sensory and affective dimensions of pain. Dworkin et al. [15] created the SF-MPQ-2 by adding seven new items that explicitly examine neuropathic and non-neuropathic pain characteristics to the original 15-item Short McGill Pain Questionnaire (SF-MPQ). They also replaced the previous 4-point descriptive rating scale with a 0-10 numerical rating scale to enhance its responsiveness [15]. Since then, multiple studies have used the SF-MPQ-2 as a primary outcome for pain assessment in clinical trials; its measurement properties have been examined in different populations including cancer pain [16], surgical pain [17], visceral pain [18], and neuropathic pain [19]. Among MSK conditions, studies have reported measurement evidence for patients with complex regional pain syndrome [20], back pain [21], knee osteoarthritis (OA) [22], and mixed MSK populations [23,24]. Although the SF-MPQ-2 is becoming increasingly popular, our recent review [25][26][27] found that the available evidence had design flaws, including inadequate description of intraclass correlation coefficient (ICC) models, insufficient justification of the retest interval, and a lack of attention to absolute reliability parameters.
Given these gaps in the evidence, the primary purpose of this study was to investigate the reproducibility (test-retest reliability and agreement) and internal consistency of the SF-MPQ-2 among persons with MSK-related shoulder disorders.

Methods
This was a cross-sectional study of internal consistency and test-retest reliability. The SF-MPQ-2 questionnaire was administered to examine reproducibility (test-retest reliability and agreement) and internal consistency at two time points: at baseline and after 3-7 days (when patients would, for the most part, be stable) [28,29]. The participants were recruited from the Roth-McFarlane Hand and Upper Limb Centre (HULC), London, Ontario, Canada over a period of 6 months (June to November 2018). Ethics approval for the clinical database of routine outcome measures from which these data were extracted was granted by the University of Western Ontario Research Ethics Board (REB# 4986).

Patients
Adults above 18 years of age who were proficient in English and experienced pain from one or more shoulder conditions of known MSK source (for example: rotator cuff tear or tendinopathy, adhesive capsulitis, glenohumeral anterior instability, and superior labral anterior-posterior (SLAP) lesions) were included. Potential participants were excluded if they: (1) had an unstable cardiorespiratory condition; (2) had any history of problems relating to the central nervous system, e.g. hemiplegia; (3) had pain resulting from neoplastic, infectious or vascular disorders or referred from internal organs; (4) had any neuropathic pain symptoms resulting from thoracic outlet syndrome, carpal tunnel syndrome or any peripheral nerve entrapment; or (5) did not provide consent.

Procedure
Assessors (SJ and HULC research assistants) identified eligible participants one day in advance by reviewing the outpatient appointment list of patients scheduled for a clinical visit with two shoulder surgeons (KF and GA). Potential participants were then contacted on the day of their clinical appointment and screened to ensure all criteria were satisfied; they were provided with an explanation of the objectives of the study before a questionnaire booklet containing the SF-MPQ-2 and Global Rating of Change (GROC) scale was administered. Each participant was verbally instructed to carefully read and circle the response that described their pain experience. In cases where participants had difficulty selecting an answer, they were told to choose the answer that came closest to describing their pain symptoms. If help was needed with understanding any words or phrases, or with marking their responses, the assessors assisted. The participants were instructed to complete all items in the questionnaire. Participants were permitted to withdraw from the study for any reason at any time. For the second test occasion, a subset of the participants (102 in total) who verbally confirmed being in unchanged/stable pain in the past 7 days were selected by convenience sampling to self-complete the SF-MPQ-2 and GROC at home within 3-7 days, if their pain remained unchanged (i.e. if they could confirm that the threshold of their perceived pain for their shoulder disorder had not changed in the past week). The GROC scale was administered, intentionally, on both test occasions solely to serve as an objective means of comparing participants' test and retest responses, thus ensuring that only participants in stable/unchanged pain conditions were included in our analysis of reproducibility (test-retest reliability and agreement). Demographic information including age, hand dominance, primary cause of shoulder pain and sex was recorded.

Outcome measure
The Revised Short McGill Pain Questionnaire Version-2 (SF-MPQ-2) contains 22 items (pain descriptors) and 4 subscales (domains) that examine pain intensity and quality as follows: (1) continuous pain (throbbing, cramping, gnawing, aching, heavy, and tender pain); (2) intermittent pain (shooting, stabbing, sharp pain, splitting pain, electric-shock, and piercing pain); (3) neuropathic pain (hot-burning, cold-freezing, pain caused by light touch, itching, tingling or pins and needles, and numbness pain); and (4) affective pain (tiring-exhausting, sickening, fearful, and punishing-cruel). All the items are rated on a numerical rating scale from zero (none) to 10 (worst possible). The mean of the 22 items yields the SF-MPQ-2 total score, while the mean of the items that comprise each of the four subscales yields the summary score for that subscale [15,21]. Higher subscale or total scores suggest greater pain symptoms/experience, and more than 2 missing values render a patient's response to the questionnaire invalid [21]. The SF-MPQ-2 uses a recall period of 7 days, instructing the person to base their rating on their symptoms in the past week [15].
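The scoring rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the descriptor keys, the function name `score_sf_mpq2`, and the choice to average over the answered items when up to two values are missing are hypothetical and are not taken from the questionnaire manual.

```python
from statistics import mean
from typing import Dict, Optional

# Descriptor groupings follow the subscale description in the text.
# The key spellings are illustrative labels, not official item names.
SUBSCALES = {
    "continuous": ["throbbing", "cramping", "gnawing", "aching", "heavy", "tender"],
    "intermittent": ["shooting", "stabbing", "sharp", "splitting",
                     "electric-shock", "piercing"],
    "neuropathic": ["hot-burning", "cold-freezing", "light-touch", "itching",
                    "tingling", "numbness"],
    "affective": ["tiring-exhausting", "sickening", "fearful", "punishing-cruel"],
}
MAX_MISSING = 2  # more than 2 missing values invalidates the response


def score_sf_mpq2(responses: Dict[str, Optional[float]]) -> Optional[Dict[str, float]]:
    """Return subscale and total means, or None if the response is invalid."""
    all_items = [d for items in SUBSCALES.values() for d in items]
    answered = {d: responses[d] for d in all_items if responses.get(d) is not None}
    if len(all_items) - len(answered) > MAX_MISSING:
        return None
    if any(not 0 <= v <= 10 for v in answered.values()):
        raise ValueError("item ratings must lie between 0 and 10")
    # ASSUMPTION: missing items are skipped and means use the answered items.
    scores = {
        name: mean([answered[d] for d in items if d in answered])
        for name, items in SUBSCALES.items()
    }
    scores["total"] = mean(list(answered.values()))
    return scores
```

Because at most two items may be missing, every subscale (the shortest has four items) always retains at least two answered items, so each mean is defined.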

Statistical analyses
The SF-MPQ-2 total and subscale scores were treated as interval variables. Data quality screening, including the percentage of missing data, outliers, and the presence of floor/ceiling effects, was performed. Respondents with two or more missing items were excluded, in line with the developers' instructions [21]. Continuous variables were descriptively summarized using means and standard deviations, while percentages were used to report categorical variables. The data were then examined for normality with histograms and the Shapiro-Wilk test. All statistical analyses were completed with Microsoft Excel

Floor/ceiling effects
Floor/ceiling effects for the SF-MPQ-2 were assessed by identifying the number of participants with the absolute lowest (0-points = floor) and highest (10-points = ceiling) scores on the total and subscales. Floor/ceiling effects occurring at the magnitude of 15% were considered substantial [30].
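As a rough illustration of this check, the sketch below counts the share of participants at the lowest and highest possible score and compares it with the 15% criterion. The function name and the use of "15% or more" as the cut-off are assumptions made for the example; the text does not state whether the boundary itself counts as substantial.

```python
from typing import Dict, Sequence


def floor_ceiling(scores: Sequence[float], low: float = 0.0, high: float = 10.0,
                  threshold_pct: float = 15.0) -> Dict[str, object]:
    """Percentage of respondents at the scale minimum and maximum."""
    if not scores:
        raise ValueError("scores must not be empty")
    n = len(scores)
    floor_pct = 100 * sum(s == low for s in scores) / n
    ceiling_pct = 100 * sum(s == high for s in scores) / n
    return {
        "floor_pct": floor_pct,
        "ceiling_pct": ceiling_pct,
        # ASSUMPTION: an effect at or above the threshold is flagged.
        "floor_substantial": floor_pct >= threshold_pct,
        "ceiling_substantial": ceiling_pct >= threshold_pct,
    }
```

The same function can be applied to the total score and to each subscale score in turn.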

Hypothesis:
We expected substantial floor effects on the neuropathic and affective subscales of the SF-MPQ-2 because they evaluate pain dimensions that are relatively uncommon in orthopaedic shoulder disorders.

Hypothesis:
We expected the SF-MPQ-2 to be internally consistent with Cronbach α at 0.8 or above for its subscale scores, and 0.9 or above for its total scores as previously reported in the literature [22,24].
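For readers who want to see how such a coefficient is obtained, a minimal sketch of the standard Cronbach α formula follows: α = k/(k − 1) × (1 − Σ item variances / variance of the summed score). The function name and data layout (one row per participant, one column per item) are assumptions for the example, and complete data are assumed.

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha(rows: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for complete data: rows are participants, columns are items."""
    k = len(rows[0])
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least two items and two participants")
    # Sample variance of each item across participants.
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    # Sample variance of each participant's summed score.
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)
```

Applied per subscale (using only that subscale's items) and to all 22 items, this gives the values that are compared against the 0.8 and 0.9 thresholds stated in the hypothesis.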

Relative reliability (test-retest reliability)
The intraclass correlation coefficient (ICC2,1) was used to assess the retest reliability of the SF-MPQ-2 total and subscales [34]. ICC2,1 with 95% confidence intervals (CI) was computed using the two-way mixed, absolute agreement model, which assumes that the patients were randomly selected but the occasions were fixed choices [35]. We chose an ICC2,1 absolute agreement model over a consistency model because it captures elements of systematic bias and is preferred for computing an absolute reliability indicator. ICC2,1 values for the SF-MPQ-2 total and subscale scores were considered Negative ≤ 0.49, Doubtful 0.50-0.69, Good 0.70-0.89, and Excellent 0.90-1.00 [36].
Hypothesis: We expected good ICC2,1 scores for group level analysis at ≥ 0.80 for the total scale and ≥ 0.70 for the subscale scores as previously reported in the literature [22,24].
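To make the absolute-agreement idea concrete, the sketch below computes a single-measure, absolute-agreement ICC from the two-way ANOVA mean squares, using the usual formula (MSR − MSE) / (MSR + (k − 1)MSE + k(MSC − MSE)/n). It is an illustration only: the function name is hypothetical, confidence intervals are not computed, and it is not the authors' spreadsheet procedure.

```python
from typing import Sequence


def icc_absolute_agreement(data: Sequence[Sequence[float]]) -> float:
    """Single-measure, absolute-agreement ICC; rows are participants, columns occasions."""
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    row_means = [sum(row) / k for row in data]
    col_means = [sum(row[j] for row in data) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between participants
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between occasions
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_error / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

In a toy dataset where every retest score is exactly one point higher than the test score, a consistency ICC would be 1, whereas this absolute-agreement form drops to about 0.67 because the occasion mean square enters the denominator. That is the sense in which the absolute-agreement model captures systematic bias.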

Agreement properties (standard error of measurement [SEM] and minimal detectable change [MDC])
Standard error of measurement (SEM) is defined as the standard deviation of errors of measurement associated with particular test takers' scores [37]. Table 1 explains the five equations used for agreement analysis. To define SEM agreement for the SF-MPQ-2 total and subscale scores, the pooled standard deviation, calculated from participants' mean responses to the SF-MPQ-2 domains on both test and retest using Eq. 1 [37,38], and the respective non-transformed ICC2,1 for the SF-MPQ-2 domain under evaluation were entered into Eq. 2 [37][38][39] (Table 1). Further, the proportion of the resulting SEM per domain to the total score of the scale was calculated to yield the SEM percentage (SEM%), as previously used [39][40][41], and interpreted as follows: ≤ 5% = very good; > 5% to ≤ 10% = good; > 10% to < 20% = doubtful; and values above 20% = negative [39].
The minimal detectable change (MDC), or repeatability coefficient, describes the minimum amount of change that must occur in a score to be confident that true/real change (which may or may not be clinically significant) has occurred beyond measurement error between two repeated measures within the test-retest period [42]. For this study, the minimal detectable change was estimated at a 90% confidence level (MDC90). Like the SEM, it is expressed in the unit of the measure and may be computed at an individual level (MDC90individual) or for a group (MDC90group) [29]. We estimated MDC90individual for the total and subscale scores of the SF-MPQ-2 by entering each scale's SEM agreement into Eq. 3 (Table 1), assuming the data were normally distributed and free of systematic error. The MDC90individual confidence interval was then computed from the mean differences (d) of each subscale using Eq. 4 (Table 1) [29,40,43].
To determine the group level minimal detectable change (MDC90group), which is useful for determining whether changes have occurred in an entire population, Eq. 5 (Table 1), the formula proposed by de Vet et al. [30,44], was employed. The proportion of the resulting MDC coefficient per SF-MPQ-2 domain to the total score of the scale was computed to yield the MDC percent score (MDC%) and interpreted as follows: ≤ 5% = very good; > 5% to ≤ 10% = good; > 10% to < 20% = doubtful; and values above 20% = negative [39,40].
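Because Table 1 is not reproduced in this text, the sketch below uses the conventional textbook forms of these quantities, which are assumed (not confirmed) to correspond to Eqs. 1, 2, 3 and 5: a pooled SD from the test and retest SDs, SEM = SD × √(1 − ICC), MDC90individual = 1.645 × √2 × SEM, and MDC90group = MDC90individual / √n. All function names are hypothetical.

```python
from math import sqrt

Z_90 = 1.645  # z-value for a two-sided 90% confidence level


def pooled_sd(sd_test: float, sd_retest: float) -> float:
    """Pooled SD of two equally sized occasions (assumed form of Eq. 1)."""
    return sqrt((sd_test ** 2 + sd_retest ** 2) / 2)


def sem(sd_pooled: float, icc: float) -> float:
    """Standard error of measurement (assumed form of Eq. 2)."""
    return sd_pooled * sqrt(1 - icc)


def mdc90_individual(sem_value: float) -> float:
    """Individual-level MDC at 90% confidence (assumed form of Eq. 3)."""
    return Z_90 * sqrt(2) * sem_value


def mdc90_group(mdc_individual: float, n: int) -> float:
    """Group-level MDC (assumed form of Eq. 5, after de Vet et al.)."""
    return mdc_individual / sqrt(n)


def percent_of_scale(value: float, scale_max: float = 10.0) -> float:
    """Express an SEM or MDC as a percentage of the maximum scale score."""
    return 100 * value / scale_max
```

As a plausibility check on the individual-level formula, entering the total-scale SEM of 0.51 reported in the results gives an MDC90individual of about 1.19, which rounds to the 1.2 points stated in the abstract, and an SEM% of about 5%. The group-level form is shown only as the commonly cited version; it is not claimed here to reproduce the group values reported in this study.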

Bland-Altman Plots (BA Plots)
The Bland-Altman (BA) method was used to visually examine the agreement between the test and retest scores [45,46]. Scatter plots were created to display the differences between the total and subscale scores obtained at time one and time two of the test-retest interval against their mean score for the two time points [45][46][47][48]. We then calculated the mean difference between the two measurement occasions (the 'bias') and the 95% limits of agreement (LoA) using: LoA = mean difference (d) ± 1.96 SD of the differences. The BA plots were used to visually judge the 95% limits of agreement to determine how well the scores from repeated measurements agreed: narrower LoAs suggested better agreement at the individual level [29,47,49]. Agreement at the group level was determined by how close the bias (mean difference) was to zero. Also, the distribution of scatter points on the BA plots was visually scrutinized for evidence of variability or heteroscedasticity, where the absence of a linear relationship between test-retest differences and their mean scores, per subscale, suggests the absence of systematic bias [44][45][46][47][48]50]. Linear regression models were used to explore the presence of systematic bias. For each domain of the SF-MPQ-2, mean scores and differences in scores were modelled as the independent and dependent variables, respectively. The potential for systematic bias was appraised by checking whether the prediction of the differences from the mean scores was statistically significant [47,51]. Finally, outliers that fell beyond the upper and lower boundaries of the LoA were noted and explored [29,52].
Both the graphical and statistical tests of normality revealed that the dataset was skewed. To address the assumption of normality for further analysis, a square root transformation was applied to the data. A closer look at the reliability coefficients obtained using the transformed and untransformed data revealed only a small difference in scores (see Table 3 for results). Parametric statistics were used in our analysis because the sample size was greater than 30 participants (based on the central limit theorem); even so, we examined differences in reproducibility coefficients obtained using the transformed and non-transformed ICC scores.
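The Bland-Altman quantities described in the methods (bias, 95% limits of agreement, the slope of differences regressed on means, and points outside the limits) can be sketched as follows. The function name, the returned keys, and the direction of the difference (test minus retest) are assumptions for the example. The significance test of the slope is omitted because it needs a t-distribution, which the Python standard library does not provide.

```python
from statistics import mean, stdev
from typing import Dict, Sequence


def bland_altman(test: Sequence[float], retest: Sequence[float]) -> Dict[str, object]:
    """Bias, 95% limits of agreement, regression slope, and outlier indices."""
    # ASSUMPTION: difference is defined as test minus retest.
    diffs = [a - b for a, b in zip(test, retest)]
    means = [(a + b) / 2 for a, b in zip(test, retest)]

    bias = mean(diffs)
    sd_diff = stdev(diffs)
    lower, upper = bias - 1.96 * sd_diff, bias + 1.96 * sd_diff

    # Ordinary least squares slope of differences (dependent) on means (independent).
    mean_x = mean(means)
    sxx = sum((m - mean_x) ** 2 for m in means)
    sxy = sum((m - mean_x) * (d - bias) for m, d in zip(means, diffs))
    slope = sxy / sxx

    outliers = [i for i, d in enumerate(diffs) if d < lower or d > upper]
    return {"bias": bias, "loa": (lower, upper), "slope": slope, "outliers": outliers}
```

A slope near zero is consistent with no proportional bias; in practice the slope would be tested for significance with a statistics package before drawing that conclusion.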

Floor and ceiling effects
The presence of a floor/ceiling effect may suggest an outcome measure is not responsive to detecting improvement (ceiling effect) even though a decline in status can be captured, and vice versa for floor effects [21]. The numbers of patients who obtained the absolute maximum (10) and minimum (0) scores on the SF-MPQ-2 total and subscales are summarized in Table 3. The greatest level of floor effect was observed on the affective subscale at both occasions of the test-retest. Substantial floor effects were also noted on the neuropathic and intermittent subscales. None of the SF-MPQ-2 indices had remarkable ceiling effects.
Table 4 summarizes the results obtained for cross-sectional reliability. The SF-MPQ-2 displayed excellent internal consistency, with robust α coefficients within a range that suggests the absence of redundancy: the α coefficient for the total scale peaked at 0.95 as posited, while those for the subscales ranged between 0.83 and 0.86. Inter-item correlations were satisfactory, ranging from 0.23 to 0.53 across the scales.
Table 5 summarizes the agreement parameters supporting the SF-MPQ-2 domains. The total scale SEM agreement was very low (0.51 points) and approximately 5% of the total score of the scale, which is 'very good' according to our interpretation criteria (Table 5).

Relative test-retest reliability
The test-retest reliability of the SF-MPQ-2 domains was rated "Good" to "Excellent" (Table 6). Our results for ICC2,1 were based on an analysis conducted with the non-transformed data, as they did not differ from those obtained with transformed data. ICC2,1 scores were highest on the continuous subscale and the total scale and were rated excellent according to our criteria. The neuropathic, affective and intermittent subscales displayed good ICC2,1 coefficients (Table 6) in support of relative reliability.

Bland-Altman (BA) analysis/plots
The results of our Bland-Altman analysis are presented in Table 6. The Bland-Altman plots, superimposed with the LoA and mean difference (bias) scores for each domain of the SF-MPQ-2, are graphically illustrated (Fig. 2a-e). All of the SF-MPQ-2 domains displayed acceptable LoA at a 95% confidence level, with the widest spanning 5 points (intermittent subscale). The total scale score displayed the narrowest LoA (range = 3 points), with the remaining subscales within satisfactory limits. Mean difference scores (bias) were very acceptable for all the SF-MPQ-2 domains (0.15-0.19 points). Visual inspection of scatter points on the BA plots for each domain of the SF-MPQ-2 revealed that the differences plotted against the mean scores were uniformly distributed around the zero point, and most scatter points were within the 95% LoA with the exception of a few outliers. This supports the absence of systematic bias and suggests a good level of agreement among test-retest scores. Furthermore, for each of the SF-MPQ-2 domains, our regression analysis showed no evidence that the mean scores predicted the test-retest differences. These findings suggest that systematic bias is unlikely and confirm a good level of agreement between the test-retest scores (Table 6).
Fig. 2a-e The difference between test-retest scores is plotted against the mean of test and retest scores for the respective SF-MPQ-2 total and subscales depicted. On each plot, the central blue line represents the mean of intra-individual differences (d); the upper and lower horizontal broken lines represent the 95% LoA. The 95% LoA shows that 95% of the intra-individual differences are within ± 1.96 SD of the mean difference (d). The outliers noted in each BA plot are numbered according to participant #RS I.D. and presented in accordance with the SF-MPQ-2 subscale or total scores in which they were noted.
The few outliers noted were explored. First, we determined whether they were data entry errors by rechecking the hard copies; they were not, and were therefore treated as 'interesting' outliers [53] and labelled according to their #RS on each BA plot. The greatest number of interesting outliers presented on the intermittent (n = 6, 12%) and neuropathic (n = 4, 10%) subscales. The fewest outliers were seen on the affective subscale (n = 2, 4%). In general, however, the presence of these outliers did not indicate the presence or absence of bias [53].

Discussion
This study provides reproducibility evidence that supports the use of the SF-MPQ-2 in multidimensional pain assessment of people with MSK shoulder pain. The SF-MPQ-2 displayed good to excellent coefficients in support of its relative reliability and absolute reliability properties. The limits of agreement for the subscales and total scores were very satisfactory.
The substantial floor effect observed on the neuropathic, intermittent and affective subscales can be attributed to the robust discriminative properties of the SF-MPQ-2 subscales and to the lower prevalence of these problems in our study population. Conceptually, the SF-MPQ-2 was expanded to provide a single tool that can classify pain between neuropathic and non-neuropathic sources [15,21]. As outcome measures can be evaluative or discriminative, combining both purposes within an outcome measure is likely to result in these types of statistical issues. For instance, participants with pain emerging from neuropathic sources will be more inclined to endorse items on the neuropathic subscale, thereby reducing the likelihood of floor effects. This has been observed with the use of the SF-MPQ-2 among complex regional pain syndrome (CRPS) patients [20]. This implies that floor effects on the SF-MPQ-2 domains may not always represent redundancy but, rather, may suggest that an item does not describe the patient's pain experience [25].
In the present study, ICC2,1 coefficients were good to excellent for all the SF-MPQ-2 domain scores (total, 0.93; subscales, 0.78-0.91), suggesting that they can adequately discriminate among patients at the individual level (total and continuous scale) and at the group level (all of the SF-MPQ-2 domains) [29,54]. These results are comparable to or better than previous estimates among knee OA [22] (total scale, 0.90; subscales, 0.73-0.90) and mixed MSK patients [24,55] (total scale, 0.90-0.94; subscales, 0.73-0.90). Although acceptable, the lower performance of the neuropathic subscale (0.78), whose confidence interval (0.64-0.87) overlapped the 'moderate' range, suggests greater variability on this subscale, which makes it more difficult to achieve a high ICC2,1 score.
Absolute reliability estimates allow clinicians to distinguish true change in a patient from change that might be expected from measurement error [30,44]. Currently, no previous studies have examined absolute reliability indices for the SF-MPQ-2 scores in any population. This makes direct interpretation and comparison difficult; however, our use of the Ostelo et al. [39] definition of SEM and MDC by percentages allows comparison across the domains of the SF-MPQ-2, and with its former version (SF-MPQ). The SEM for the total score (≤ 5% of total scale score) was 'very good' and comparable to that reported for the former version (SF-MPQ) among OA patients (≤ 3.64%) [56], and better than that seen among mixed MSK patients assessed with the Norwegian version of the SF-MPQ (≤ 10%) [41]. Although not as favorable as estimates noted on the total scale, the affective and intermittent/continuous subscales had 'good' SEM coefficients (< 10%), which were comparable to findings reported with the sensory subscale of the former SF-MPQ version among OA patients (< 10%) [56], and superior to that reported in a mixed MSK population (< 14%) [41]. Overall, SEM estimates for all the SF-MPQ-2 subscales were satisfactory and suggest an adequate evaluative capacity that can yield scores less prone to error when used by researchers/clinicians for MSK shoulder pain assessment over time.
The MDC scores represent the minimal change in scores after repeated administration that clinicians/researchers can interpret as not due to chance variation for an individual or group in a population [42]. The MDC90individual scores obtained for the SF-MPQ-2 domains imply that a change equal to or greater than 1.8 (neuropathic), 1.7 (affective), 1.8 (continuous), 2.3 (intermittent), or 1.2 (total) points represents genuine change beyond chance with 90% confidence. The MDC scores for the total scale (≤ 12% of the total score of the scale) were comparable to previous studies with the former version (SF-MPQ) among OA patients (≤ 11.5%) and better than the results seen among mixed MSK patients (≤ 26.4% of total score). For the MDC90group scores, the results obtained for the SF-MPQ-2 domains imply that a change of at least 0.4 (affective), 0.5 (intermittent), 0.3 (total), 0.4 (neuropathic), or 0.4 (continuous) points must be observed in a group to be 90% confident that the change was beyond random or systematic error. In general, minimal detectable change scores are useful when interventions are administered; to be sure the intervention is effective, it must demonstrate change beyond the MDC score reported for the scale. Also, MDC90group indices can be used for sample size estimation in a randomized controlled trial, as they help determine the number of participants needed to detect a change in the measure beyond error for a group, if the Minimal Clinically Important Difference (MCID) score for the population is unknown.
The Bland-Altman plots revealed satisfactory limits of agreement in support of the SF-MPQ-2 subscales. However, the interpretation of how far apart two measurements can be before they are no longer considered interchangeable depends on the contextual application [47]. The limits of agreement between the test-retest of the SF-MPQ-2 domains were reasonably smaller than those seen in previous studies of its former version (SF-MPQ) [41,56], suggesting there is less variation between the test and the retest of the SF-MPQ-2 [50]. Furthermore, no bias was found in the measurements between the test and retest, as the inter-occasion mean difference was minimal. This suggests that learning or test accommodation is not an issue with using the SF-MPQ-2; moreover, our compliance with recommended time intervals (3-7 days) [28,29,57] may have favored the agreement outcomes. The intermittent subscale had the greatest number of outliers of all the Bland-Altman plots (12%), which may be due to the volatile nature of the pain descriptors comprising the scale.
The SF-MPQ-2 total scores displayed the best reproducibility across the relative reliability, absolute reliability and agreement parameters. This could be due to the number of items contained in the scale. For instance, better ICC scores can be expected when variability is low. Variability decreases when a greater number of descriptors comprise a scale, in comparison to those with fewer descriptors [29]. As all 22 items of the SF-MPQ-2 contribute to the summary total scale scores, it is possible this favors reproducibility.

Study limitations
While the present study findings provide preliminary evidence supporting the reproducibility of the SF-MPQ-2 for use in patients with shoulder disorders, the study has several limitations. First, the study sample size (48 participants) was just under the 50 participants suggested as a benchmark by COSMIN [58,59]. However, our sample size calculation indicated that at least 46 patients were required (see Appendix 1), which suggests our study was adequately powered. Second, the patient population was from a single tertiary referral practice and our findings may not be generalizable to a different context. Third, since participants completed the retest (Time 2) at home, we were unable to clarify instructions. However, independent completion is a requirement for routine administration. Further, the high level of agreement between the test and retest scores and the absence of systematic bias suggest this was not a problem. Fourth, the sample mean age was 62 (± 17) years, so the findings may not adequately reflect reliability in younger populations, although shoulder pathology prevalence increases with age. Finally, we did not determine the minimal clinically important difference.

Conclusion
We conclude that the SF-MPQ-2 is satisfactorily internally consistent and provides good to excellent reproducibility coefficients (test-retest reliability and agreement) for multidimensional pain assessment among patients with musculoskeletal shoulder pain conditions. The total scale displays the best reproducibility coefficients. Additional research on the validity and responsiveness of the SF-MPQ-2 is still required in this population.